Infrastructure Guidance
Migrating from the API
Increase Rate Limit
Most customers transition gradually from the API to their local copy of the Shodan database. As part of the Enterprise license, an organization has at least 100 Shodan accounts it can use, each with its own independent rate limit. This means that if you need to increase throughput and aren’t yet ready to set up an on-premise Shodan database, you can continue using the API: rotate through the 100 API keys to drastically increase the number of requests per second you can send to the Shodan API.
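The key rotation described above can be sketched as a simple round-robin. This is a minimal illustration, not an official client: the function names and the placeholder keys are hypothetical, and the actual HTTP request is left out.

```python
import itertools
from typing import Iterable, Iterator


def rotating_keys(api_keys: Iterable[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the given API keys."""
    keys = list(api_keys)
    if not keys:
        raise ValueError("at least one API key is required")
    return itertools.cycle(keys)


def assign_keys(targets: list[str], api_keys: list[str]) -> list[tuple[str, str]]:
    """Pair each lookup target with the next key in the rotation."""
    rotation = rotating_keys(api_keys)
    return [(target, next(rotation)) for target in targets]


if __name__ == "__main__":
    # Placeholder keys; in practice these would be the organization's real keys.
    for target, key in assign_keys(["1.1.1.1", "8.8.8.8", "9.9.9.9"], ["KEY_A", "KEY_B"]):
        print(target, key)
```

Because each account has its own rate limit, spreading requests evenly across the keys keeps every individual key under its limit while the combined throughput grows with the number of keys.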
By the Numbers
- Shodan scans the Internet at least once a week. This means you need at least 1 week’s worth of daily data files to generate a snapshot of the Internet.
- The Shodan websites use 30 days of data. Shodan settled on a sliding window of 30 days for the search engine and IP lookups, so if you want to replicate the Shodan API you would want to store 30 days of data.
- Assets on Shodan Monitor are rescanned at least once a day. If you have assets that should get rescanned more frequently then you can add them to an asset group on Shodan Monitor. The data will show up in the bulk data files/firehose the same as the data from the regular crawling.
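As a small worked example of the figures above, the sketch below lists the calendar dates whose daily data files fall inside a 7-day snapshot or a 30-day search window. The helper name and constants are hypothetical; how dates map to actual file names is not covered here.

```python
from datetime import date, timedelta

SNAPSHOT_DAYS = 7   # at least one full scan cycle of the Internet
SEARCH_DAYS = 30    # sliding window used by the search engine and IP lookups


def window_dates(latest: date, days: int) -> list[date]:
    """Return the `days` most recent calendar dates ending at `latest`, oldest first."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return [latest - timedelta(days=offset) for offset in range(days - 1, -1, -1)]


if __name__ == "__main__":
    today = date.today()
    print("snapshot needs files for:", window_dates(today, SNAPSHOT_DAYS))
    print("search window starts on:", window_dates(today, SEARCH_DAYS)[0])
```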
Production Usage
Some of the public APIs, such as the Geonet and InternetDB APIs, are offered for convenience. If you plan on using them within a product or otherwise expect to generate a large number of requests, please let us know so we can make sure they are scaled properly to meet demand.
Architecture of the Shodan API
The technical details of how the bulk data fits into an organization’s infrastructure vary a lot. If you’re not sure about the hardware/software/network requirements, reach out to Shodan support. Below is a high-level view of how the Shodan REST API is built on top of the bulk data provided by Shodan Enterprise.
The Shodan API is deployed behind Kubernetes and relies on Cassandra as well as Elastic to power the /shodan/host methods:
- Cassandra: Apache Cassandra is an open-source NoSQL database that operates in a cluster to provide high throughput with no single point of failure, and it uses a partitioned row store as its data model. It’s used as the primary data store for all banners. DataStax, the commercial company behind Cassandra, provides an official K8s operator for Cassandra.
- Elastic: Elastic is the de facto standard for creating search capabilities and most of our customers already use it in some capacity. The Shodan API uses it to make the data searchable but doesn’t use it to store the banner data. Elastic provides an official K8s operator.
The general idea is to use a proven, performant NoSQL database such as Cassandra to store the full Shodan Firehose information in real time and then use Elastic as an interface for advanced queries about the data. The general workflow of the system is to ingest the Shodan Firehose, store each banner in Cassandra and then index the banner in Elastic. The Shodan API expires data after 30 days using TTLs in both Cassandra and Elastic.
Data Model
Cassandra acts as a key/value store where information is stored using a schema such as:
CREATE TABLE IF NOT EXISTS shodan.host (
    ip inet,
    port int,
    transport tinyint,
    banner text,
    PRIMARY KEY (ip, port, transport)
) WITH CLUSTERING ORDER BY (port ASC)
    AND default_time_to_live = 2592000
    AND compression = {'class': 'ZstdCompressor', 'compression_level': 10};
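To illustrate how a banner could be mapped onto the columns of the schema above, here is a minimal sketch. It assumes the banner carries the ip_str, port and transport properties, and it assumes the tinyint transport column holds the IANA protocol number (6 for TCP, 17 for UDP); that encoding is an illustrative choice, not something the schema prescribes. The function name is hypothetical.

```python
import ipaddress
import json

# Assumption: the tinyint `transport` column stores the IANA protocol number.
TRANSPORT_CODES = {"tcp": 6, "udp": 17}


def banner_to_row(banner: dict) -> tuple[str, int, int, str]:
    """Map a banner dict to (ip, port, transport, banner) values for shodan.host."""
    # ip_address() validates the string and returns it in canonical form.
    ip = str(ipaddress.ip_address(banner["ip_str"]))
    port = int(banner["port"])
    transport = TRANSPORT_CODES[banner["transport"]]
    # The full banner is kept as serialized JSON in the text column.
    return ip, port, transport, json.dumps(banner, sort_keys=True)


if __name__ == "__main__":
    example = {"ip_str": "8.8.8.8", "port": 53, "transport": "udp", "data": "..."}
    print(banner_to_row(example))
```

The default_time_to_live of 2592000 seconds in the schema corresponds to 30 days (30 × 24 × 3600), so rows written this way expire without a separate cleanup job.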
In terms of Elastic mappings, we recommend starting off by simply indexing the entire JSON banner combined with the Cassandra primary key and making adjustments afterwards based on your use cases. Elastic has become much better at providing good defaults when it comes to indexing data so give it a try before creating an elaborate mapping. If you decide to disable the _source field then make sure you store the Cassandra primary key in Elastic; it is the only information that needs to be stored. When you perform a search, Elastic will return a list of these primary key values that reference the banners which matched the search query. At that point, simply look up each banner in your Cassandra instance.
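The two-step pattern described above (search returns primary keys, then the banners are fetched from the primary store) can be sketched with in-memory stand-ins. BannerStore and SearchIndex below are hypothetical placeholders for Cassandra and Elastic, not real client APIs, and the exact-match search is a deliberate simplification.

```python
import json

PrimaryKey = tuple[str, int, int]  # (ip, port, transport), as in the schema


class BannerStore:
    """In-memory stand-in for the Cassandra table (hypothetical helper)."""

    def __init__(self) -> None:
        self._rows: dict[PrimaryKey, str] = {}

    def put(self, key: PrimaryKey, banner: dict) -> None:
        self._rows[key] = json.dumps(banner)

    def get(self, key: PrimaryKey) -> dict:
        return json.loads(self._rows[key])


class SearchIndex:
    """In-memory stand-in for the Elastic index (hypothetical helper)."""

    def __init__(self) -> None:
        self._docs: list[tuple[PrimaryKey, dict]] = []

    def index(self, key: PrimaryKey, banner: dict) -> None:
        self._docs.append((key, banner))

    def search(self, field: str, value: object) -> list[PrimaryKey]:
        # Only primary keys are returned, never the banner itself.
        return [key for key, doc in self._docs if doc.get(field) == value]


def ingest(store: BannerStore, index: SearchIndex, key: PrimaryKey, banner: dict) -> None:
    """Store the banner first, then make it searchable."""
    store.put(key, banner)
    index.index(key, banner)


def search_banners(store: BannerStore, index: SearchIndex, field: str, value: object) -> list[dict]:
    """Resolve each matching primary key to its full banner."""
    return [store.get(key) for key in index.search(field, value)]
```

Keeping only the primary key on the search side means the index stays small and the primary store remains the single source of truth for banner contents.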
Data Files vs Firehose
The data files for raw-daily, banners-daily and banners-hourly are generated by subscribing to the firehose so the data between them is the same. That being said, we recommend using banners-daily to bootstrap the on-premise copy of Shodan and then banners-hourly for ongoing updates. The firehose requires a stable, high-throughput connection that is harder to manage and generally requires more engineering overhead. We only recommend the firehose if real-time updates from the Shodan crawlers are required. For all other situations it is significantly easier to work with the hourly data files from banners-hourly.
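Working with the data files can be as simple as streaming banners one at a time. The sketch below assumes a file has already been downloaded locally and is a gzip-compressed file with one JSON banner per line; the function name and the file path in the usage example are hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_banners(path: str) -> Iterator[dict]:
    """Yield one banner dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical local path to a downloaded hourly file.
    for banner in iter_banners("banners-hourly.json.gz"):
        print(banner.get("ip_str"), banner.get("port"))
```

Streaming line by line keeps memory usage flat regardless of file size, which matters when bootstrapping from many daily files.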
Working with Denormalized Data
Each banner stores information about a service as well as general metadata of the IP (geolocation, hostnames, etc.). There are a few things to consider when modeling your data storage:
- Check that a property exists before using it: many properties on the banner only exist if a value was collected. For example, if an Apache web server doesn’t show its version in the HTTP headers then the banner won’t have the version property set.
- Normalized vs Denormalized Data Models: each banner stores the full IP metadata to help you build a history of the IP’s behavior. When storing the data, you should consider whether you want to normalize the banners (separate tables for service information vs geolocation/hostnames/etc.) or whether you want to keep the banners as-is. At Shodan, we store the banners denormalized in Cassandra and build specific data stores based on subsets of that data which are optimized for our queries.
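Both considerations above can be sketched in a few lines. The first helper reads an optional property safely; the second splits a banner into host-level and service-level parts for a normalized model. Which fields count as host metadata is an assumption made for illustration (HOST_FIELDS is a hypothetical selection), and the function names are not part of any Shodan library.

```python
from typing import Optional

# Assumption: these top-level fields describe the IP rather than the service.
HOST_FIELDS = ("hostnames", "domains", "location", "org", "isp", "asn")


def get_version(banner: dict) -> Optional[str]:
    """Return the service version, or None if it was not collected."""
    return banner.get("version")


def split_banner(banner: dict) -> tuple[dict, dict]:
    """Split a banner into (host metadata, service information)."""
    host = {name: banner[name] for name in HOST_FIELDS if name in banner}
    # Keep the IP on both sides so the two records can be joined later.
    host["ip_str"] = banner["ip_str"]
    service = {name: value for name, value in banner.items() if name not in HOST_FIELDS}
    return host, service


if __name__ == "__main__":
    example = {
        "ip_str": "1.2.3.4",
        "port": 80,
        "transport": "tcp",
        "product": "Apache httpd",
        "hostnames": ["a.example"],
        "location": {"country_code": "US"},
    }
    print(get_version(example))
    print(split_banner(example))
```

Note that a normalized host record only reflects the metadata at the time of the latest banner; keeping banners denormalized preserves the metadata as it was when each banner was collected.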