The most dangerous robots.txt is not the one with obvious errors. It is the one that looks correct, that nobody has reviewed in months, and that is silently blocking key pages that Googlebot never reaches. According to Screaming Frog’s internal audit data, approximately 30% of sites they analyse have some problematic directive in their robots.txt affecting pages that should be indexed.
The robots.txt file has been there since the early days of SEO. Someone generated it, at some point, and then nobody touched it again. That is precisely the problem.
Unlike a sitemap configuration or a canonical tag, the robots.txt operates at a layer before any other SEO signal: if you block a URL here, Googlebot never reads the title, the meta tags, or the schema markup. All the on-page optimisation work becomes irrelevant before it even starts.
What robots.txt is and how Googlebot actually interprets it
The robots.txt is a plain text file that lives at the root of your domain (https://yourdomain.com/robots.txt). Its purpose is to communicate to search robots which parts of the site they can and cannot crawl. The specification is simple in theory. In practice, there are enough interpretation nuances for errors to be frequent and costly.
The first thing to understand: robots.txt is not a security mechanism. Google follows it by convention, not by obligation. A malicious robot will ignore the file without technical consequences. The robots.txt only works for bots that respect the specification, primarily search engines.
The second point: Google caches the robots.txt. It does not read it on every visit. It downloads it periodically (Google's documentation says the cached copy is generally kept for up to 24 hours, sometimes longer) and uses that cached version for all its crawl decisions until the next refresh. An urgent change to robots.txt can therefore take 24–48 hours to be reflected in Googlebot's actual behaviour.
The basic structure of a directive is straightforward:
User-agent: [bot name]
Disallow: [blocked path]
Allow: [permitted path]
Google accepts * as a universal wildcard for User-agent (all robots) and as a wildcard within paths. It also accepts $ to indicate the end of a URL. What Google does not accept are some advanced regular expression patterns that other bots do interpret — a frequent point of confusion when configurations are copied from other sources.
John Mueller, Senior Search Analyst at Google, has reiterated in multiple Q&A sessions that robots.txt does not guarantee privacy and that blocking a URL here does not remove it from the index. If the URL was already indexed when the Disallow directive is added, Google may keep it in results for months because it cannot visit it to verify whether it should be deindexed.
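Before moving on to the specific errors, a quick way to see how a parser reads these directives without guessing: the sketch below uses Python's standard urllib.robotparser. Note that it follows the original specification and does not implement Google's * and $ wildcards, so treat it as a first approximation rather than a substitute for testing in Search Console.

```python
# Minimal sketch: parse a robots.txt body and ask whether a crawler may
# fetch specific paths. urllib.robotparser is in the standard library.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /admin/
Disallow: /cart/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "/admin/settings"))  # False: matches Disallow: /admin/
print(parser.can_fetch("Googlebot", "/blog/post"))       # True: no rule matches
```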
Error #1 — Disallow: / (blocking the entire site by accident)
This is the most severe error and, surprisingly, it occurs frequently enough for Google to mention it specifically in its documentation. The result: Googlebot cannot crawl any page on the site.
# INCORRECT — blocks the entire site from all bots
User-agent: *
Disallow: /
The most common cause is not malice or ignorance: it is copying the staging environment’s robots.txt to production. Development and pre-production environments typically block all crawling to prevent them from appearing in search results. When the deployment to production happens and someone copies the full configuration file, the block travels with it.
The correct version for a site that wants to allow general crawling:
# CORRECT — allows crawling of the entire site
User-agent: *
Disallow:
# If there are specific sections to block:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
An empty Disallow: line is the standard way to tell Googlebot it can crawl any path. You do not need to write Allow: / (although that also works).
The clearest warning sign of this error: Google Search Console starts showing a sharp drop in pages crawled, and the Coverage report shows dozens or hundreds of pages with the status “Excluded: blocked by robots.txt”. If you see that combination, check the robots.txt immediately.
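Because this error usually arrives through a deployment rather than a deliberate edit, it is worth catching it automatically before the file reaches production. A minimal sketch of such a guard follows; the robots.txt path and the pipeline integration are assumptions about your setup, and the parsing is deliberately simplified (it assumes one User-agent line per group).

```python
# Fails the build if robots.txt contains "Disallow: /" inside a
# "User-agent: *" group. The path is hypothetical: adjust to your project.
import sys

def blocks_entire_site(path="robots.txt"):
    wildcard_group = False
    for raw in open(path, encoding="utf-8"):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            wildcard_group = (value == "*")
        elif field == "disallow" and wildcard_group and value == "/":
            return True                        # full block for all bots
    return False

if blocks_entire_site():
    sys.exit("robots.txt blocks the entire site for all bots: refusing to deploy")
```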
Error #2 — Blocking CSS and JavaScript resources critical for rendering
This error does not block complete pages. It blocks the resources that Googlebot needs to render those pages correctly. The effect is more subtle but equally damaging.
When Googlebot visits a URL, it does not merely read the HTML. It downloads the CSS and JavaScript files referenced in the page to render it as a real user would see it. If the robots.txt blocks those resources, Googlebot sees a degraded version of the content — and that degraded version is what it evaluates for ranking.
The blocking patterns that cause this problem most frequently:
# INCORRECT — blocks resources needed for rendering
User-agent: *
Disallow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /assets/
Disallow: /static/
Disallow: /css/
Disallow: /js/
The intention behind these blocks is often reasonable: preventing Google from indexing individual files that have no value as pages. The problem is that “not indexing” and “not crawling” are different things. Google will not index a /assets/main.css file as if it were a search results page, but it does need to download it to render any page that uses it.
# CORRECT — allows crawling of resources, blocks only what is necessary
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
# CSS, JS and image resources are NOT blocked
Google Search Central explicitly documents this recommendation: allowing access to all files that the browser needs to render the page is essential for Googlebot to correctly evaluate content.
To verify whether Googlebot can access a page’s resources, use the URL Inspection tool in Google Search Console. The report shows whether any resources were blocked during the last crawl and which ones they were. If the “Page resources blocked” section appears, you have this problem.
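You can also approximate this check outside Search Console. The sketch below (hypothetical page URL, plain server-rendered HTML assumed) extracts the stylesheets and scripts referenced by a page and tests each one against the live robots.txt with Python's urllib.robotparser; since that parser ignores Google-style wildcards, treat it as a first pass and confirm doubtful cases with URL Inspection.

```python
# Lists rendering resources (external CSS and JS) that robots.txt would
# prevent Googlebot from fetching.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

PAGE = "https://yourdomain.com/"  # hypothetical page to test

class ResourceCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.resources = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(attrs["src"])
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

robots = RobotFileParser()
robots.set_url(urljoin(PAGE, "/robots.txt"))
robots.read()

collector = ResourceCollector()
collector.feed(urlopen(PAGE).read().decode("utf-8", errors="ignore"))

for resource in collector.resources:
    absolute = urljoin(PAGE, resource)
    if not robots.can_fetch("Googlebot", absolute):
        print("Blocked rendering resource:", absolute)
```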
Error #3 — Incorrect syntax: case-sensitivity and spaces that cost indexation
The syntax of robots.txt is stricter than it looks. Two specific formatting errors cause problems that are difficult to detect without tools:
Case-sensitivity in paths: Paths in Disallow and Allow directives are case-sensitive according to Google’s specification. This means:
# INCORRECT if your real URL is /admin/ (lowercase)
User-agent: *
Disallow: /Admin/
# CORRECT — capitalisation must match the real URL exactly
User-agent: *
Disallow: /admin/
If your site has URLs with capital letters (something to avoid, but common in many CMS platforms), you need to block the exact versions. A block of /Admin/ does not affect /admin/ or /ADMIN/.
Spaces in the User-agent directive: The field name must be followed immediately by the colon; a stray space before the colon can cause some parsers to ignore the line. An incorrectly generated file may have:
# INCORRECT — space before the colon (causes problems in some parsers)
User-agent : *
Disallow: /admin/
# CORRECT
User-agent: *
Disallow: /admin/
Directives without a User-agent: Any directive that is not associated with a User-agent block is ignored. If someone adds a Disallow directive outside a block, it has no effect but also generates no visible error:
# INCORRECT — the Disallow without a preceding User-agent is ignored
Disallow: /private-area/
User-agent: *
Disallow: /admin/
Order matters for grouping: directives are grouped by User-agent block. Google documents that its crawlers merge multiple groups aimed at the same User-agent, but other parsers may process only one of them, and split blocks make mistakes easy to overlook:
# POTENTIALLY PROBLEMATIC — two separate blocks for the same user-agent
User-agent: *
Disallow: /admin/
User-agent: *
Disallow: /login/
# Some parsers may apply only one of the two blocks
The correct approach is to group all directives for the same User-agent into a single block:
# CORRECT — a single block per User-agent
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
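These formatting slips are easy to miss by eye. A rough lint pass like the sketch below (a simplified reading of the file, not a full implementation of Google's parser) can flag a space before the colon, directives that precede any User-agent, and duplicate blocks for the same User-agent.

```python
# Minimal robots.txt lint for the three issues above. Assumes one
# directive per line; the default path is hypothetical.
def lint_robots(path="robots.txt"):
    seen_agents, current_agents, in_group = {}, [], False
    for number, raw in enumerate(open(path, encoding="utf-8"), start=1):
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        field, _, value = line.partition(":")
        if field != field.rstrip():
            print(f"line {number}: space before the colon in '{field.strip()}:'")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:                      # a new group starts here
                current_agents, in_group = [], False
            current_agents.append(value)
            if value in seen_agents:
                print(f"line {number}: second block for user-agent '{value}' "
                      f"(first seen on line {seen_agents[value]})")
            else:
                seen_agents[value] = number
        elif field in ("disallow", "allow"):
            if not current_agents:
                print(f"line {number}: '{field}' appears before any User-agent and is ignored")
            in_group = True

lint_robots()
```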
Error #4 — Incorrectly applied wildcard that blocks pages you want indexed
Wildcards (* and $) are powerful but require precision. A poorly written pattern can block dozens or hundreds of URLs you wanted to keep accessible to Googlebot.
The * wildcard in a path matches any sequence of characters at that position. The problem arises when the pattern is too generic:
# INCORRECT — blocks ALL URLs containing "?", including valid product pages
User-agent: *
Disallow: /*?
# This blocks:
# /product/blue-shirt?colour=blue
# /blog/article?utm_source=newsletter
# /services?tab=pricing ← important page you wanted indexed
If the goal is to block dynamic filter pages while allowing base product URLs, the directive needs to be more specific:
# BETTER — blocks only specific filter parameters
User-agent: *
Disallow: /*?orderby=
Disallow: /*?filter_colour=
Disallow: /*?paged=
The $ wildcard indicates the end of the URL. Useful for blocking files with specific extensions without blocking paths that start the same way:
# CORRECT — blocks .pdf files but not the /documents/ section
User-agent: *
Disallow: /*.pdf$
Without the $, Disallow: /*.pdf also matches any URL that merely contains .pdf somewhere in its path, such as a hypothetical /downloads/manual.pdf-archive/. With the $, only URLs ending exactly in .pdf are blocked.
A particularly costly error in ecommerce: blocking pagination pages with an overly broad pattern:
# INCORRECT — blocks category pages with pagination (/category/page/2/)
User-agent: *
Disallow: /*/page/
# If your site has URLs like /services/page/2/, /blog/page/3/,
# those are also blocked even if you want them indexed
Before adding any wildcard directive, test it against real URLs on your site: check each important URL with the URL Inspection tool in Google Search Console, which reports whether that URL is blocked, so you can catch patterns that match more than you intended.
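If you want to dry-run a pattern before it goes live, the sketch below approximates Google's documented matching rules (* matches any run of characters, $ anchors the end of the URL, and a rule matches from the start of the path) by translating a Disallow pattern into a regular expression. It mirrors the documented behaviour but is not Google's actual parser.

```python
# Dry-run a Disallow pattern against sample paths using Google-style
# wildcard semantics (approximation, not Google's parser).
import re

def pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(disallow_pattern, url_path):
    return bool(pattern_to_regex(disallow_pattern).match(url_path))

# Test the over-broad pattern from the example above
for path in ["/product/blue-shirt?colour=blue",
             "/services?tab=pricing",
             "/blog/robots-txt-errors"]:
    print(path, "->", "blocked" if is_blocked("/*?", path) else "allowed")
```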
Error #5 — Conflict between robots.txt and meta robots: which wins?
This is the most common conceptual error: assuming that robots.txt and the meta robots tag work the same way or that they complement each other intuitively. The reality is more complex and can produce unexpected results.
Fundamental rule: if a URL is blocked in robots.txt, Google cannot crawl it. If it cannot crawl it, it cannot read the meta tags it contains. This includes the <meta name="robots" content="noindex"> tag.
The most problematic scenario:
# In robots.txt:
User-agent: *
Disallow: /landing-pages/
# In /landing-pages/special-offer/:
<meta name="robots" content="noindex, follow">
The apparent goal is to deindex the landing page. The actual result: Google cannot access the page to read the noindex, so it may keep it in the index indefinitely if it had inbound links that had previously indexed it.
The precedence rule is the opposite of what many expect:
- To prevent crawling: use robots.txt Disallow. The noindex on the page is irrelevant if Google cannot crawl the URL.
- To prevent indexation of a crawlable page: use <meta name="robots" content="noindex"> (or the HTTP header X-Robots-Tag: noindex). The robots.txt must allow access so Google can read this directive.
- To remove from the index a page that is already indexed: remove the robots.txt block, add noindex on the page, and wait for Google to crawl and process the directive.
Gary Illyes of Google summarised this conflict clearly at a Google Search Central event: “A page blocked by robots.txt is not the same as a noindexed page. If you want to make sure something does not appear in results, do not confuse both mechanisms.”
The correct combination depends on the objective:
| Objective | Robots.txt | Meta robots |
|---|---|---|
| Do not crawl, do not index | Disallow | (irrelevant, not read) |
| Crawl but do not index | Allow (or not mentioned) | noindex |
| Crawl, index, do not follow links | Allow | nofollow |
| Remove from index (already indexed) | Remove Disallow | noindex |
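A quick way to spot the conflicting combination described above is to check, for a given URL, whether it is simultaneously blocked by robots.txt and carrying a noindex in its HTML. The sketch below does exactly that (the URL is hypothetical and the meta detection is deliberately crude); note that the script can read the noindex only because it ignores robots.txt, which is precisely what Googlebot will not do.

```python
# Detects the "blocked by robots.txt but noindexed in HTML" conflict.
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

URL = "https://yourdomain.com/landing-pages/special-offer/"  # hypothetical

robots = RobotFileParser()
robots.set_url(urljoin(URL, "/robots.txt"))
robots.read()

blocked = not robots.can_fetch("Googlebot", URL)
html = urlopen(URL).read().decode("utf-8", errors="ignore").lower()
has_noindex = 'name="robots"' in html and "noindex" in html  # crude meta check

if blocked and has_noindex:
    print("Conflict: robots.txt blocks the URL, so the noindex will never be read")
elif blocked:
    print("Blocked by robots.txt (no noindex found in the HTML)")
else:
    print("Crawlable; any noindex on the page can be read and honoured")
```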
How to audit your robots.txt with Google Search Console
Google Search Console includes two complementary tools for verifying the robots.txt:
- robots.txt report (Settings → Robots.txt): shows the cached robots.txt file Googlebot is currently using, when it was last crawled, and whether there were any fetch errors.
- URL Inspection tool: lets you test a specific URL to see whether it is blocked by robots.txt, which rule applies, and the indexing status.
For a more thorough robots.txt audit, Screaming Frog SEO Spider has a specific function that crawls the site simulating Googlebot’s behaviour and shows which pages fall outside the crawl due to current directives. The “Blocked by Robots.txt” filter in the Response Codes tab shows all affected URLs.
Steps for a basic audit:
- Open https://yourdomain.com/robots.txt directly in your browser and review each directive.
- In Google Search Console, use the URL Inspection tool to verify that the 10–20 most important pages on your site are not blocked by robots.txt (a batch version of this check is sketched after this list).
- Review the Coverage report in Search Console and filter by “Excluded: blocked by robots.txt” to see if there are URLs that should not be blocked.
- If you use Screaming Frog, crawl the site and review the “Blocked by Robots.txt” report.
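The second step can be partly automated. A small batch version of that check, assuming you maintain a plain list of priority URLs (the list below is hypothetical), might look like this; remember that urllib.robotparser does not apply Google wildcards, so double-check any wildcard rules in Search Console.

```python
# Batch-check priority URLs against the live robots.txt.
from urllib.robotparser import RobotFileParser

IMPORTANT_URLS = [
    "https://yourdomain.com/",
    "https://yourdomain.com/services/",
    "https://yourdomain.com/blog/",
]  # hypothetical list: replace with your own priority pages

robots = RobotFileParser()
robots.set_url("https://yourdomain.com/robots.txt")
robots.read()

for url in IMPORTANT_URLS:
    status = "OK" if robots.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{status:8}{url}")
```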
A correctly configured robots.txt is one of the foundations of crawl budget management and proper site indexation. If critical pages never reach Googlebot, no other SEO optimisation has the opportunity to work.
For a deeper understanding of how robots.txt, sitemaps, and indexation strategy work together, the Google Search Console practical guide covers how to coordinate both mechanisms to maximise visibility in Google.