Understanding robots.txt and its Implications
A discussion of the robots.txt file, its legal implications, and what else needs to be taken into account when scraping
If you’ve been scraping for a while, I know you’ve been dealing with robots.txt files since forever. The reality, though, is that today this topic has become somewhat misunderstood. What began as a convention for managing web crawlers has evolved into a central piece of the conversation around data rights, AI training, and the definition of “public” data.
In this article, I’ll take you deep into the robots.txt file, from its technical structure to the ethical and legal questions it raises for developers today. We’ll also go a step further, discussing robots meta tags, a website’s terms of service, and how they relate to the robots.txt file.
Let’s dive into it!
Before proceeding, let me thank NetNut, the platinum partner of the month. They have prepared a juicy offer for you: up to 1 TB of web unblocker for free.
What is the robots.txt File?
At its core, the robots.txt file is a text file that lives at the root of a website (for instance, https://www.example.com/robots.txt) and sets guidelines for bots. In practical terms, the robots.txt file is how webmasters implement the Robots Exclusion Protocol (REP) on a website they own.
In the article “Best Practices for Ethical Web Scraping”, I wrote that the robots.txt is the website owner’s way of saying:
Welcome, automated visitor. Here are the house rules.
So, it’s nothing more than a code of conduct. Think of the code of conduct of the gym you regularly go to: It tells you how you should behave to respect the people and the equipment you interact with. But it is not something that has the power to enforce the listed rules. The same happens with the robots.txt file: It sets the guidance for scrapers and crawlers, but not all bots will follow the instructions.
The robots.txt File and Bots’ Protocols
In networking, a protocol is a set of established rules and standards that dictate how devices communicate and exchange data over a network. It acts as a common language for hardware and software to understand each other. It ensures data is formatted, transmitted, and received correctly, reliably, and securely.
For managing and defining bots’ navigation rules, website owners use two different protocols:
The Robots Exclusion Protocol (REP): This is a way to tell bots which web pages and resources to avoid. What’s important to understand is the fundamental principle of the REP: It’s a voluntary, advisory protocol, not a security mechanism or an enforcement tool. As the REP is implemented through the robots.txt file, you can say that a robots.txt file is a set of instructions for cooperative bots, like search engine crawlers (Googlebot, Bingbot) or respectable scraping tools. Malicious bots or scrapers that do not care about a site’s policies can (and often do) ignore it completely.
The Sitemaps Protocol: Consider it a robots’ inclusion protocol. It shows a web crawler which pages it can crawl. The sitemap itself is an .xml file at the root of a website (for instance, https://www.example.com/sitemap.xml), and it helps ensure that a crawler doesn’t miss any important pages. The sitemap’s URL is referenced in the robots.txt file via the Sitemap directive, creating a synergy that tells bots what they are supposed to crawl and what they are not.
Below is a simple example of a robots.txt file:
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search
User-agent: Googlebot
Allow: /search
Sitemap: https://www.example.com/sitemap.xml
As you can see:
For each user-agent, it defines which URLs that agent is allowed or not allowed (disallowed) to crawl.
It reports the website Sitemap.
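If you want to check these rules programmatically before scraping, Python’s standard library ships with urllib.robotparser. Below is a minimal sketch that parses the example file above and checks a few URLs; the user-agent names and URLs are just illustrative, and in a real scraper you would fetch the live file with set_url() and read() instead of embedding it as a string:

from urllib import robotparser

# The example robots.txt from above, embedded as a string for a self-contained demo.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /search

User-agent: Googlebot
Allow: /search

Sitemap: https://www.example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A generic bot falls under the "*" group, so /private/ and /search are off-limits.
print(rp.can_fetch("MyScraper", "https://www.example.com/private/report.html"))  # False
print(rp.can_fetch("MyScraper", "https://www.example.com/blog/post-1"))          # True

# Googlebot has its own group that explicitly allows /search.
print(rp.can_fetch("Googlebot", "https://www.example.com/search"))               # True

# The declared sitemaps are also exposed (Python 3.8+).
print(rp.site_maps())  # ['https://www.example.com/sitemap.xml']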
Sending many requests from the same IP address in a short period of time could get your scraper blocked. For this reason, we’re using a proxy provider like our partner Ping Proxies, which is sharing this offer with TWSC readers.
💰 - Use TWSC for 15% OFF | $1.75/GB Residential Bandwidth | ISP Proxies in 15+ Countries
Advanced robots.txt Syntax and an Unwritten Rule
In robots.txt files, the Allow/Disallow structure is generally only the beginning. More advanced robots.txt files often contain more complex directives that a scraping professional must understand, like:
The Crawl-delay directive: It asks bots to wait a specific number of seconds between requests. For example, Crawl-delay: 10 tells bots to wait 10 seconds between each request. This directive is non-standard, but it’s widely respected because it signals the server’s capacity to bots (and their developers). However, note that major crawlers like Googlebot ignore it.
Wildcard (*) and end-of-URL ($) matching: The robots.txt file supports pattern matching. Specifically, the * acts as a wildcard, and the $ signifies the end of the URL. This allows administrators to define specific rules. For example, Disallow: /*.pdf$ would block crawlers from accessing any URL that ends with .pdf. This could prevent downloading PDF files, while still allowing access to pages that merely have “pdf“ in the middle of the URL.
Directive precedence: What happens when the Allow and Disallow rules conflict? Major crawlers follow a simple rule: the most specific directive wins. For example, if your robots.txt has Disallow: /media/ and Allow: /media/images/, crawlers will be allowed to access the /media/images/ directory because the Allow rule is the more specific one (a small sketch after the example file below shows this logic in code).
Below is an example of a robots.txt file that combines these rules and directives:
# -------------------------------------------------------------------
# Default Rules for All Bots
# -------------------------------------------------------------------
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /search?query=*
Disallow: /filter?*
Disallow: /*.pdf$
Disallow: /*.zip$
Disallow: /*.docx$
Crawl-delay: 5
# -------------------------------------------------------------------
# Specific Rules for Google's Main Crawler
# -------------------------------------------------------------------
User-agent: Googlebot
Allow: /
Disallow: /internal-notes/
# -------------------------------------------------------------------
# Sitemap Reference
# -------------------------------------------------------------------
Sitemap: https://www.example.com/sitemap.xml
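Note that Python’s built-in urllib.robotparser does not fully implement Google-style wildcard matching or the longest-match precedence rule, so if you need that behavior you either rely on a dedicated library or roll your own matcher. Below is a simplified, illustrative sketch of that logic (Python 3.9+); it approximates specificity by pattern length and is not a complete REP implementation:

import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex.
    '*' matches any sequence of characters; a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of (directive, pattern) pairs for one user-agent group.
    The longest matching pattern wins; on a tie, Allow beats Disallow."""
    best = None  # (pattern_length, is_allow)
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [
    ("Disallow", "/media/"),
    ("Allow", "/media/images/"),
    ("Disallow", "/*.pdf$"),
]
print(is_allowed("/media/images/logo.png", rules))  # True  (the Allow rule is more specific)
print(is_allowed("/media/reports/q3.pdf", rules))   # False (blocked by /media/ and /*.pdf$)
print(is_allowed("/blog/post-1", rules))            # True  (no rule matches)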
The robots.txt File, Bot Management, and SEO
As mentioned before, respecting the robots.txt file is purely voluntary. So, for website administrators, the primary purpose of this file lies in crawl budget management. What I mean is that search engines like Google allocate a finite amount of crawling resources to any given site. In this scenario, you basically want to tell these bots how to crawl your website so that you can reach your desired SEO results.
As with any budget, it’s finite. By disallowing unimportant pages in the robots.txt file, you can guide Googlebot to spend its (limited) time crawling and indexing the pages that actually matter for SEO. This ensures that the content you consider valuable is discovered and ranked on the search engines.
In other words, a well-defined robots.txt file keeps a website optimized for SEO and keeps well-behaved bot activity under control. It will not do very much against malicious bot traffic, since respecting it is a voluntary act.
In Scrapy, you can also easily integrate the proxies from our partner Rayobyte.
💰 - Rayobyte is offering an exclusive 55% discount with the code WSC55 on all of their static datacenter & ISP proxies, only to web scraping club visitors.
You can also claim a 30% discount on residential proxies by emailing sales@rayobyte.com.
On the robots.txt and Scraping for Training LLMs
In recent years, the advisory nature of the robots.txt has been thrown into the spotlight with the explosion of large-scale data scraping for training LLMs. What once was a matter of “house rules” is now at the center of a high-stakes debate over copyright, fair use, and the future of AI. And it has been thrown into courts all around the world.
But let’s be honest for a moment: this should not surprise anyone. LLMs arrived in our world as a revolutionary technology, and as with every revolution, the regulations always come later.
For website owners, but also for website contributors (like writers, journalists, video makers, etc.), it is natural to cling to something that protects their work. On the other hand, LLM providers said something like: “That content was publicly available on the Internet, so we just used it”. This is where the robots.txt file entered the picture:
Website owners tried to use it as something legally binding.
Tech giants tried to say they didn’t do anything unlawful.
But where did these two “parties” say that? Well, on their websites and…in courts! In fact, in recent months, OpenAI, Anthropic, and other LLM providers have been taken to court in several jurisdictions over their scraping activities. What has happened so far is that judges have not always treated the robots.txt file as a strong legal basis for blocking scraping, even when the scraping was done to train LLMs. For this reason, some website owners began adding directives to block known AI training bots from scraping their websites. Below is what you may find in some robots.txt files today:
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
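If you want to check whether a site publishes these kinds of blocks before pointing an AI-related crawler at it, a quick check with Python’s standard library might look like the sketch below; the URL is just a placeholder, so adapt it to your target:

from urllib import robotparser

# Placeholder URL; swap in the robots.txt of the site you actually care about.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the live file

for agent in ("GPTBot", "Google-Extended", "CCBot"):
    allowed = rp.can_fetch(agent, "https://www.example.com/")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'} on the homepage")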
However, respecting the robots.txt is still up to the bots (or their developers!). So, at the time of this writing, respecting it remains a matter of “well-behaved” bots, whether the scraping is for training LLMs or not.
Beyond robots.txt: Page-Level Directives and Terms of Service
The robots.txt file is not the only way websites communicate with bots. That’s because the robots.txt acts as the gatekeeper of your website: it tells bots which areas are open or closed, but it lacks precision. In particular, it cannot give instructions for a single, specific page without affecting other pages in the same directory. To solve this, webmasters use two other communication methods: robots meta tags and the site’s terms of service.
Let’s discuss them both.
The Robots Meta Tags for Page-Level Precision
The robots meta tag is an HTML tag in a webpage’s <head> section. It gives specific instructions to search engine crawlers about indexing and following links on that page. It controls the visibility in search results and prevents issues like duplicate content by using directives like noindex or nofollow. It offers page-specific control, completing the broader rules of the robots.txt file.
Let’s look at the HTML for a hypothetical “Login Success” page that contains meta tags:
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Login Successful!</title>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    <h1>Welcome back, user!</h1>
    <p>You have successfully logged in.</p>
    <a href="/my-account">Go to Your Account</a>
  </body>
</html>
In a scenario like the one above, the robots.txt file might allow full access to the entire site. However, the HTML <meta name="robots" content="noindex, nofollow"> line provides a strong signal: “This specific page is not for you”. The content attribute of the robots meta tag can carry the following directives:
noindex: The most common directive. It explicitly tells a search engine not to include the page in its public index.
nofollow: Prevents the crawler from following any hyperlinks on the page to discover new URLs.
noarchive: Prevents search engines from storing a “cached” copy of the page. A webmaster might use this on a page with frequently updated data to ensure users always see the live version.
nosnippet: Prevents a search engine from showing a text snippet or video preview from the page in the search results.
noimageindex: Tells the crawler not to index any of the images on this specific page.
For a scraper, the robots meta tag is a signal of intent. It is not a technical barrier (your script can still download and parse the page), but ignoring a noindex or nofollow tag means you are knowingly disregarding the site owner’s explicit page-specific instructions. Ethically speaking, this is a step beyond ignoring a robots.txt rule. It shows you are collecting data from a page that the owner has actively marked as “not for public consumption” via search engines.
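As a practical illustration, here is a small sketch of how a scraper could inspect a page’s robots meta tag before deciding how to handle its content. It assumes you have beautifulsoup4 installed and parses the “Login Success” HTML from above; in a real pipeline you would feed it the HTML you just downloaded:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Login Successful!</title>
    <meta name="robots" content="noindex, nofollow">
  </head>
  <body>
    <h1>Welcome back, user!</h1>
  </body>
</html>
"""

def robots_meta_directives(page_html: str) -> set[str]:
    """Return the set of directives declared in the page's robots meta tag."""
    soup = BeautifulSoup(page_html, "html.parser")
    tag = soup.find("meta", attrs={"name": "robots"})
    if tag is None or not tag.get("content"):
        return set()
    return {d.strip().lower() for d in tag["content"].split(",")}

directives = robots_meta_directives(html)
print(sorted(directives))  # ['nofollow', 'noindex']

if "noindex" in directives:
    print("The owner marked this page as not for indexing; handle it accordingly.")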
The Rulebook of the Web: Understanding a Website’s Terms of Service
At its core, a Terms of Service is a legally binding contract between the entity that owns and operates the website (the “Service Provider”) and the end-user (you, or your automated script). This contract governs your use of the website or service.
This happens because a website is not a public park. It is a piece of private property that the owner has opened to the public. While you are welcome to enter and use the space, there are rules you must follow. In this scenario, the ToS (also known as Terms of Use or Terms and Conditions) is the digital equivalent of a rulebook.
The purpose of a ToS is to protect the website owner. It’s a legal shield that sets clear expectations and limits the owner’s liability. It achieves this through the following functions:
Establishes rules of conduct: It defines what is and isn’t acceptable behavior. This is where the owner can prohibit activities like spamming, harassing other users, uploading malicious content, or attempting to gain unauthorized access to the system.
Defines intellectual property rights: The ToS explicitly states who owns the content on the site. For example, it declares that the website’s logo, design, text, graphics, and underlying code are the copyrighted property of the owner.
Limits liability (the “disclaimer”): The ToS almost always includes a “Limitation of Liability” or “Disclaimer of Warranties” clause. This says that the service is provided “as-is” and the owner is not responsible for any damages that may arise from its use. For example, if the website provides financial data that turns out to be inaccurate and you lose money, this clause aims to protect the owner from being sued.
Outlines permitted and prohibited uses: This is the most critical function in the context of web scraping. The ToS is where the owner defines how their service is meant to be used. It often contains a clause that explicitly prohibits any form of automated data collection, scraping, or data mining without prior written consent. This is the owner drawing a clear line, stating that their website is for human interaction via a browser, not for automated harvesting by bots.
Specifies jurisdiction and dispute resolution: It dictates the legal framework for any disputes. It specifies which country or state’s laws govern the agreement and where any potential lawsuits must be filed. This prevents the owner from being sued in a random court anywhere in the world.
Note that there are cases when a website’s robots.txt is permissive, but the ToS forbids scraping. In such cases, the ToS is the document that carries legal weight in a breach of contract dispute. By ignoring that clause, you are violating a term of the legal agreement you entered into by using the site.
So, here’s the thing: The robots.txt file is a polite suggestion written for machines. A Terms of Service is a legal contract written for humans and enforced by courts. Yet, this should not scare scraping professionals. Many website owners, in fact, provide the ToS as a browsewrap agreement. This is a type of online contract where the terms of service are accepted simply by using a website or app, without an explicit “I agree”. Agreements presented this way are difficult to enforce legally because users may not even notice the terms (unlike clickwrap agreements, which require active consent).
Conclusion
The robots.txt file has gained a central role in web scraping, recently mainly because of the lawsuits over scraping websites to train LLMs. Still, as of now, the robots.txt file remains something that is respected voluntarily, a position that several courts have confirmed.
Even so, copyright infringements and other legal issues can knock on your door, so carefulness is a must for web scraping professionals. And remember: trouble is more likely to come from violating a site’s terms of service than from ignoring its robots.txt file.




