Crawler Algorithm

At a high level, the banner data collection has the following characteristics:

Frequency: the crawlers work 24/7 and update the database in real-time. Whenever you query the Shodan website, you’re getting the latest picture of the Internet.
Globally Distributed: data is collected from around the world to prevent geographic bias. For example, many system administrators in the USA block entire Chinese IP ranges. Distributing Shodan crawlers around the world ensures that any sort of country-wide blocking won’t affect data gathering. The _shodan.region property lets you know the region of the crawler that created the banner.
Randomized: the basic algorithm for the crawlers is:
1. Generate a random IPv4 address
2. Generate a random port to test from the list of ports that Shodan understands
3. Check the random IPv4 address on the random port and grab a banner
4. Goto 1
This means that the crawlers don’t scan incremental network ranges. The crawling is performed completely at random to ensure uniform coverage of the Internet and to prevent bias in the data at any given time. The algorithm is designed to randomly crawl the Internet once a week.
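
The four steps above can be sketched in a few lines of Python. This is only an illustration of the randomized loop, not Shodan's actual implementation: `KNOWN_PORTS`, `random_target`, `crawl` and the `grab_banner` callback are hypothetical names, and the port list is a made-up subset. The sketch also does not filter out private or reserved addresses.

```python
import ipaddress
import random

# Hypothetical subset of ports; the real list is whatever the crawlers understand.
KNOWN_PORTS = [21, 22, 23, 80, 443, 6881, 8080]


def random_target(rng):
    """Steps 1 and 2: pick a random IPv4 address and a random known port."""
    ip = ipaddress.IPv4Address(rng.getrandbits(32))
    port = rng.choice(KNOWN_PORTS)
    return str(ip), port


def crawl(grab_banner, rounds, rng=None):
    """Steps 3 and 4: grab a banner for each random target, then repeat.

    grab_banner is a caller-supplied function (ip, port) -> banner or None.
    A real crawler would loop forever; rounds keeps the sketch finite.
    """
    if rng is None:
        rng = random.Random()
    results = []
    for _ in range(rounds):
        ip, port = random_target(rng)
        results.append((ip, port, grab_banner(ip, port)))
    return results
```

Because every target is drawn independently, consecutive requests land in unrelated networks, which is what keeps the crawl from walking through address ranges in order.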

Additionally, other algorithms rescan certain networks and devices more frequently. For example, any assets that are configured on Shodan Monitor are rescanned at least once a day, using a specialized algorithm to increase data collection.
Protocol Detection: the crawlers automatically detect if the network service is running a different protocol than expected. For example, the standard port for SSH is 22, but some people run it on port 80, the standard web port. If a crawler connects to port 80 and sees something that looks like SSH, it automatically uses the SSH banner grabber. As of August 2024, around 8,000 SSH services run on port 80.
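
As a rough illustration of the idea, the sketch below picks a banner grabber based on the first bytes a service sends rather than on the port alone. It relies on the fact that an SSH server identifies itself with a string beginning with `SSH-`; everything else (`DEFAULT_BY_PORT`, `detect_protocol`, and the grabber names) is hypothetical and says nothing about how Shodan's detection is actually built.

```python
# Hypothetical mapping from port to the banner grabber normally used for it.
DEFAULT_BY_PORT = {22: "ssh", 80: "http", 443: "https"}


def detect_protocol(first_bytes, port):
    """Return the name of the banner grabber to use for this service.

    first_bytes is whatever the service sent right after the TCP connection
    was opened (possibly empty if the service waits for the client to speak).
    """
    # An SSH identification string starts with "SSH-", e.g. "SSH-2.0-OpenSSH_9.6".
    if first_bytes.startswith(b"SSH-"):
        return "ssh"
    # Nothing recognizable was sent: fall back to the port's usual protocol.
    return DEFAULT_BY_PORT.get(port, "unknown")
```

For example, `detect_protocol(b"SSH-2.0-OpenSSH_9.6\r\n", 80)` selects the SSH grabber even though the connection was made to the web port.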
Hostname-Aware: some websites only respond if the HTTP request includes a valid hostname, which is why Shodan supports hostnames in its crawling infrastructure. Additionally, at the start of every month Shodan launches a hostname-based scan of all websites it knows about (a few hundred million). A hostname-based scan means that the HTTP Host header and the TLS SNI (Server Name Indication) field are set to a hostname instead of the IP address of the server.
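
To make the difference concrete, here is a minimal sketch of a hostname-based request using Python's standard `socket` and `ssl` modules. The function names `build_request` and `grab_https` are hypothetical, and disabling certificate verification is an assumption made for this example (a banner grab records what the server returns rather than deciding whether to trust it); it is not a statement about how Shodan configures its crawlers.

```python
import socket
import ssl


def build_request(hostname):
    """Build a minimal HTTP request whose Host header carries the hostname."""
    return (
        f"GET / HTTP/1.1\r\n"
        f"Host: {hostname}\r\n"
        f"Connection: close\r\n\r\n"
    ).encode("ascii")


def grab_https(ip, hostname, port=443, timeout=5.0):
    """Connect to the IP address but present the hostname via SNI and Host."""
    context = ssl.create_default_context()
    # Assumption for this sketch: accept any certificate.
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((ip, port), timeout=timeout) as raw:
        # server_hostname sets the SNI value sent during the TLS handshake.
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            tls.sendall(build_request(hostname))
            return tls.recv(4096)
```

The TCP connection still goes to the IP address; only the SNI value and the Host header change, which is enough for a server hosting many sites to select the right one.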
Cascading: if a banner returns information about peers or otherwise has information about another IP address that runs a service, then the crawlers try to perform a banner grab on that IP/service. For example, the default port for the mainline DHT (used by BitTorrent) is 6881. The banner for such a DHT node looks as follows:
```
DHT Nodes
97.94.250.250     58431
150.77.37.22      34149
113.181.97.227    63579
252.246.184.180   36408
83.145.107.53     52158
77.232.167.126    52716
25.89.240.146     27179
147.23.120.228    50074
85.58.200.213     27422
180.214.174.82    36937
241.241.187.233   60339
166.219.60.135    3297
149.56.67.21      13735
107.55.196.179    8748
```
With cascading enabled for the DHT banner grabber, the crawler now launches new banner grabbing requests for all of the peers. In the above example, the crawler would launch a scan for IP 97.94.250.250 on port 58431 using the dht banner grabber, IP 150.77.37.22 on port 34149 and so on. In other words, a single scan for an IP can cause a cascade of scans if the initial scan data contains information about other potential hosts.
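
A minimal sketch of that cascade, assuming the banner text has the layout shown in the listing above (a `DHT Nodes` header followed by one `IP port` pair per line). The names `parse_dht_peers` and `cascade`, and the shape of the queued request dictionaries, are hypothetical and only serve the illustration.

```python
import ipaddress
from collections import deque


def parse_dht_peers(banner_text):
    """Extract (ip, port) pairs from a DHT banner, skipping anything else."""
    peers = []
    for line in banner_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        ip, port = parts
        try:
            ipaddress.IPv4Address(ip)  # raises ValueError for non-addresses
            port_num = int(port)
        except ValueError:
            continue  # e.g. the "DHT Nodes" header line
        if 0 < port_num < 65536:
            peers.append((ip, port_num))
    return peers


def cascade(banner_text, queue, module="dht"):
    """Queue one new banner grabbing request per peer found in the banner."""
    for ip, port in parse_dht_peers(banner_text):
        queue.append({"ip": ip, "port": port, "module": module})


queue = deque()
cascade("DHT Nodes\n97.94.250.250     58431\n150.77.37.22      34149\n", queue)
```

After the call, `queue` holds two pending requests, one for each peer listed in the sample banner text.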

To keep track of the relationship between the initial scan request and any child (cascading) requests we’ve introduced two new properties:
- _shodan.id: A unique ID for the banner. This property is guaranteed to exist if a cascading request could get launched from the service, though it doesn’t necessarily mean that any cascading requests succeeded.
- _shodan.options.referrer: Provides the unique ID of the banner that triggered the creation of the current banner. I.e. the referrer is the parent of the current banner.
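
These two properties are enough to reconstruct which banner led to which. The sketch below assumes banners are available as Python dictionaries with the nested layout implied by the property names (`banner["_shodan"]["id"]` and `banner["_shodan"]["options"]["referrer"]`); the helper names `index_by_id` and `ancestry` and the sample IDs are hypothetical.

```python
def index_by_id(banners):
    """Map each banner's _shodan.id to the banner itself."""
    return {
        b["_shodan"]["id"]: b
        for b in banners
        if "id" in b.get("_shodan", {})
    }


def ancestry(banner, by_id):
    """Follow _shodan.options.referrer upwards and return the parent IDs.

    The list starts with the direct parent and ends at the oldest ancestor
    that is present in by_id.
    """
    chain = []
    seen = set()
    current = banner
    while True:
        referrer = current.get("_shodan", {}).get("options", {}).get("referrer")
        # Stop at a root banner, a missing parent, or a repeated ID.
        if referrer is None or referrer in seen or referrer not in by_id:
            break
        seen.add(referrer)
        chain.append(referrer)
        current = by_id[referrer]
    return chain


# Made-up banners: "a" is the initial scan, "b" and "c" are cascading results.
banners = [
    {"_shodan": {"id": "a", "options": {}}},
    {"_shodan": {"id": "b", "options": {"referrer": "a"}}},
    {"_shodan": {"id": "c", "options": {"referrer": "b"}}},
]
by_id = index_by_id(banners)
parents_of_c = ancestry(banners[2], by_id)
```

Here `parents_of_c` lists banner `b` first and then banner `a`, the initial scan that started the cascade.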