Nikita the Spider

A Robots.txt Survey – The Good, The Bad and The Ugly

Nikita is a spider with good manners; she checks the robots.txt file of every site she accesses. By itself, a single robots.txt file is not very interesting. But a survey of a large number of them can inform us about how this file is used. This survey examines how many sites use a robots.txt file, whether or not they're used correctly, and how often one is likely to encounter some “quirky” features of robots.txt.

Can one go so far as to call the assembled results interesting? I'll leave that for you to decide. My survey of robots.txt files from about 25,000 different hosts (collected in the Spring of 2006) is below.

In December of 2006, I added statistics about a survey of over 150,000 robots.txt files and in January of 2008, I added a survey covering all of 2007.

About Specifications

Robots.txt is kind of an odd duck. It is not an official standard blessed by the Internet Powers That Be, but it is very widely used and there has been good agreement on the format (if only shaky understanding of it) since it was adopted in 1994. Two documents assume the role of standards; Martijn Koster is the author of both. The first, which I refer to as MK1994, has become the de facto standard. The second (MK1996) is a clearer and richer document IMHO, but it doesn't carry as much weight with the Internet community as MK1994.

Expires Header

Of the sites I surveyed, 2.7% specified an Expires header along with their robots.txt file. Of those, 52.6% (1.4% of the total) specified immediate expiration of the file. The remainder specified expiration times ranging from 1 second to 1024 days.

It surprises me that so few sites take advantage of the Expires header. In the discussion leading up to the first robots.txt standard (which is no longer online, unfortunately), someone suggested that there should be some way to indicate the expiration date from within the file itself. Martijn Koster reminded him that HTTP's Expires header already served exactly that purpose, and someone else said, “I agree with Martijn that we should use the (painfully obvious) existing mechanism”. In other words, it seems likely that use of the Expires header wasn't mentioned in MK1994 because it was deemed self-evident. MK1996 states explicitly that robots should respect the Expires header and defines a default expiration of seven days if it is absent.
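To make that concrete, here is a minimal sketch of how a robot might honor the Expires header, with MK1996's seven-day default as the fallback. It uses Python's standard urllib and email modules; the URL is just a placeholder, and a real crawler would add caching and error handling.

# Sketch: fetch robots.txt and decide how long to cache it, honoring the
# Expires header and falling back to MK1996's seven-day default.
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

DEFAULT_TTL = timedelta(days=7)   # MK1996 default when Expires is absent

def fetch_robots_txt(url="http://example.com/robots.txt"):
    with urlopen(url) as response:
        body = response.read()
        expires_header = response.headers.get("Expires")
    now = datetime.now(timezone.utc)
    expiry = now + DEFAULT_TTL
    if expires_header:
        try:
            expiry = parsedate_to_datetime(expires_header)
            if expiry.tzinfo is None:
                expiry = expiry.replace(tzinfo=timezone.utc)
        except (TypeError, ValueError):
            pass                   # unparseable date: keep the default
    return body, max(expiry, now)  # a past date just means "already expired"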

Good Intentions, Bad Responses

A distressing 7.7% of the sites responded with content labelled as text/html. Since I couldn't examine all these files individually, I assumed that any file that contained <title or <html in the content was a Web page. By these criteria, at least 91% of the responses labelled text/html really were HTML – presumably some sort of “Ooops, that file is missing” Web page. (Some spot checks added strength to this assertion.) The Webmasters of these sites need a gentle but firm whack with a clue stick. Requests for a resource that's not found should return response code 404, especially for a file like robots.txt where the response code is a meaningful part of the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.) For the record, none of the sites that returned text/html media gave a 404 response code.

Of the remainder labelled as text/html, most were ordinary robots.txt files mislabelled as HTML content. (Don't put away that clue stick yet!) The others were varying bits of mess: empty files, non-ASCII garbage, etc.
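For the curious, the classification heuristic described above boils down to something like the following Python sketch; the function name and the decoding choice are mine, not part of any standard.

# Rough sketch of the survey's heuristic: a response served as text/html that
# contains "<title" or "<html" is treated as a Web page (probably a custom
# "file not found" page) rather than a genuine robots.txt file.
def looks_like_html_page(content_type, body_bytes):
    if not content_type.lower().startswith("text/html"):
        return False
    text = body_bytes.decode("iso-8859-1", errors="replace").lower()
    return "<title" in text or "<html" in text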

404s

Over half of the sites -- 51.7% -- didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. If we assume that the Web pages returned above should really have been 404s, then the number of sites without a robots.txt file jumps to almost 60%.

401s and 403s

Just 0.4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. And my guess is that some of these sites simply respond with a 401 or 403 for every file when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
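Pulling the response-code rules together, a robot's decision logic looks roughly like the sketch below. This is only my own summary of the MK1994 behaviour described above, not a complete implementation; what to do with the oddball codes in the next section is left to the crawler's judgment.

# Sketch of MK1994's response-code semantics as discussed above:
#   404        -> no robots.txt; all robots are welcome
#   401 or 403 -> the whole site is off limits to all robots
#   200        -> fetch succeeded; parse the returned rules
def access_policy(status_code):
    if status_code == 404:
        return "allow all"
    if status_code in (401, 403):
        return "disallow all"
    if status_code == 200:
        return "parse rules"
    return "undefined"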

Incomprehensible Responses

Five sites (all of them at aol.com) return a 202 (accepted) response code, which makes no sense at all in this context. A few others return 302 redirects that eventually redirect back to themselves in an infinite loop. One returns a 300 (multiple choices) which is meaningless to a robot, and another returns a 550 response code for which I can't even find a definition. While these sites would be interesting for an article entitled “101 Ways to Misconfigure Your Web Server”, they're a vanishingly small portion of this sample and don't merit further attention here.

Non-ASCII Characters

Robots.txt files containing non-ASCII are of particular interest to me because Nikita ran into a problem where Python’s robot exclusion rules parser can crash when confronted with non-ASCII under some circumstances. I wrote a new robots.txt parser to handle a wider range of robots.txt files.

Introducing the subject of non-ASCII also introduces the subject of encodings, on which MK1994 and MK1996 are silent. Fortunately, HTTP once again comes to the rescue. The HTTP specification says that text media has a default encoding of ISO-8859-1 (a superset of US-ASCII), so robots.txt files can legally contain ISO-8859-1 characters even if no encoding is specified via HTTP.
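In practice, then, a parser can decode the raw bytes using the charset named in the Content-Type header if one is given, and fall back to ISO-8859-1 otherwise. A minimal Python sketch (the header parsing here is deliberately simplified):

# Sketch: decode robots.txt bytes using the charset from the Content-Type
# header when one is given, falling back to HTTP's ISO-8859-1 default.
def decode_robots_txt(body_bytes, content_type="text/plain"):
    charset = "iso-8859-1"                     # HTTP default for text media
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            charset = value.strip().strip('"')
    try:
        return body_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):  # unknown or wrong charset
        return body_bytes.decode("iso-8859-1", errors="replace")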

All but a tiny handful of the robots.txt files in the sample contain pure ASCII (handful being a scientific term defined as 0.2%). The fifty-five that don't can be divided into three categories. The first category consists of files that contain non-ASCII only in the comments (e.g. “Det er ikke tillat med roboter, spidere og fremmede script på disse områdene”, Norwegian for “Robots, spiders and foreign scripts are not permitted in these areas”). Since a properly programmed spider ignores the comments, this category isn't too interesting.

The second category has non-ASCII in meaningful robots.txt fields. Oddly enough, almost all of these are a consequence of a robot called Hämähäkki (the Finnish word for spider), which appeared in a list of active robots in the mid-90s. The spider itself is gone, but a decade later the name Hämähäkki lives on in robots.txt files. Apparently one or more automated tools built robots.txt files that listed all "known" spiders based on an outdated list at robotstxt.org. In thirty-two (over half) of the files in my sample containing non-ASCII, the name Hämähäkki was the only non-ASCII present. Robots that aren't prepared to handle non-ASCII might have trouble with these. Posthumous kudos to Hämähäkki for keeping us on our toes.

Byte My BOM

The third category of non-ASCII robots.txt files are those that contain a BOM -- I found just twelve of these. This is another area where Python's robots.txt parser can get confused, and I suspect that it is not the only code library to have this weakness. The problem is that parsers that fail to account for the BOM see it as part of the first line of text. If that line is a comment (which it often is) then the BOM won't cause any problems. But if, for instance, the file consists of a UTF-8 BOM followed by a simple disallow-all rule, then some parsers might see this (the otherwise-invisible BOM is shown here as its escaped byte sequence):
\xef\xbb\xbfUser-agent: *
Disallow: /
To a parser, the user-agent line might just look like garbage, and as a result the Disallow line after it would be ignored. So the parser would see an “empty” robots.txt file and permit access to the entire site, which is exactly the opposite of what the author intended.

At minimum, robots.txt parsers should not allow BOMs to interfere with proper parsing of the file. Ideally, the parser would use the BOM as it was intended: to indicate the encoding of the file. Given that non-ASCII robots.txt files are so rare, I expect that support for them among code libraries is weak, and support for BOMs is probably weaker still. Any Webmaster who codes a robots.txt file that contains non-ASCII and relies on proper interpretation of the BOM to decode it is asking for trouble. (Not to mention the fact that doing so violates the HTTP specification, which says in section 3.7.1, “Data in character sets other than ‘ISO-8859-1’ or its subsets MUST be labeled with an appropriate charset value.”)
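One simple defensive measure, in Python at least, is to strip a leading BOM before handing the text to the parser. Here is a sketch using the standard codecs module and Python's standard robots.txt parser (urllib.robotparser in current Python); the helper name and the placeholder URL are mine.

# Sketch: remove a leading BOM before feeding robots.txt text to Python's
# standard parser, so the first User-agent line isn't mistaken for garbage.
import codecs
from urllib.robotparser import RobotFileParser

def parse_robots_txt(raw_bytes, url="http://example.com/robots.txt"):
    if raw_bytes.startswith(codecs.BOM_UTF8):
        text = raw_bytes.decode("utf-8-sig")   # decodes and drops the BOM
    elif raw_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw_bytes.decode("utf-16")      # the BOM gives the byte order
    else:
        text = raw_bytes.decode("iso-8859-1")  # HTTP's default for text
    parser = RobotFileParser(url)
    parser.parse(text.splitlines())
    return parser

With the BOM-plus-disallow-all example above, parser.can_fetch("*", url) then correctly returns False whether or not the BOM was present.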

It is worth noting that under Windows 2000 (and probably Windows XP), Notepad adds a BOM if you save a text file as UTF-8 or Unicode. You can see this for yourself with a hex viewer on Windows or with hexdump under Unix.
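If neither tool is handy, one line of Python does the same job; a UTF-8 file saved by Notepad starts with the three BOM bytes.

# Show the first three bytes of a file; b'\xef\xbb\xbf' is the UTF-8 BOM.
print(open("robots.txt", "rb").read(3))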

Extensions to MK1994/1996

Wildcard Characters

Google's robots.txt parser supports wildcards in pathnames. It seems likely that other bots support this too, but Google is the only one for which I have a reference.
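Google's documentation describes * as matching any sequence of characters and $ as anchoring a rule to the end of the URL, so a file might contain rules along these lines (the paths are invented for illustration):

User-agent: Googlebot
Disallow: /*.pdf$
Disallow: /private*/

A robot that follows only MK1994 would treat these lines as literal path prefixes, so Webmasters shouldn't count on every crawler interpreting them the same way.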

Crawl-delay

A number of robots (among them Yahoo Slurp, Inktomi and MSNBot) support a Crawl-Delay: n specification, where n is the number of seconds that a bot should wait between requesting pages. In my sample, 1.4% of robots.txt files contained a Crawl-Delay specification.
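In a robots.txt file the field sits inside a robot-specific record, for example (the ten-second value here is arbitrary):

User-agent: Slurp
Crawl-delay: 10

A robot that honors the field would then wait at least ten seconds between successive requests to that site.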

I was curious about what crawl delays Webmasters choose, and I found that the numbers vary widely. In the 360 files that contained crawl delays, there were 514 delays specified. The minimum delay was 1 second, the maximum 172800 (which is 48 hours), the mean 890.34, the median 10 and the mode 1. Since this is such a broad range of data with some big numbers that skew the average, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 62% of all of the crawl delays were ≤ 15 seconds.

Delay (seconds)    % of delays ≤ this value
1                  25%
2                  30%
3                  32%
5                  39%
10                 59%
15                 62%
20                 70%
30                 80%
60                 87%
120                95%
900                99%

Summary of Figures

The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 25,060 different hosts. “Different”, in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL. For example, news.example.com was considered different from www.example.com.

Feature                    Occurrences    Percentage
Expires header present     689            2.7%
Return text/html           1,937          7.7%
Return 404                 12,958         51.7%
Return 401 or 403          116            0.4%
Contain non-ASCII          55             0.2%
Contain a BOM              12             < 0.1%
Specify Crawl-Delay        360            1.4%

Methodology and Conclusions

I sampled these robots.txt files as part of pre-alpha testing of Nikita the Spider. The sample includes the sites I spidered and the sites that they link to (because Nikita fetches robots.txt before checking a link). Whether or not there was a bias in my sample, I cannot say. Actually, I can't think of a way of building a sample that doesn't contain at least some bias. I hope that 25,000 files is a sample large enough to smooth out the inherent bias and to support some conclusions.

Speaking of conclusions, I have two contradictory ones. First, an observation: robots.txt hasn't changed much from its 1994 origins. 99.8% of the robots.txt files on the Net today are pure ASCII, few extensions have been made to the original format, and some elements of the original format (such as the use of 401 and 403 response codes) are barely used. Based on this, on the fact that 50–60% of sites don't even bother with a robots.txt file, and on the number of bungled robots.txt files out there, one could conclude that the original specification wasn't very good. But I prefer the opposite conclusion: the original specification was a good one and still gets the job done. And with a little more respect for the existing “features” implied by HTTP (the Expires header and encoding specification) and the widespread acceptance of Crawl-delay (which seems quite useful), the format might survive another ten years without further alteration.

Thanks for reading! Comments are welcome.