The most dangerous robots.txt is not the one with obvious errors. It is the one that looks correct, that nobody has reviewed in months, and that is silently blocking key pages that Googlebot never reaches. According to Screaming Frog’s internal audit data, approximately 30% of sites they analyse have some problematic directive in their robots.txt affecting pages that should be indexed.
The robots.txt file has been there since the early days of SEO. Someone generated it, at some point, and then nobody touched it again. That is precisely the problem.
Unlike a sitemap configuration or a canonical tag, the robots.txt operates at a layer before any other SEO signal: if you block a URL here, Googlebot never reads the title, the meta tags, or the schema markup. All the on-page optimisation work becomes irrelevant before it even starts.
What robots.txt is and how Googlebot actually interprets it
The robots.txt is a plain text file that lives at the root of your domain (https://yourdomain.com/robots.txt). Its purpose is to communicate to search robots which parts of the site they can and cannot crawl. The specification is simple in theory. In practice, there are enough interpretation nuances for errors to be frequent and costly.
The first thing to understand: robots.txt is not a security mechanism. Google follows it by convention, not by obligation. A malicious robot will ignore the file without technical consequences. The robots.txt only works for bots that respect the specification, primarily search engines.
The second point: Google caches the robots.txt. It does not read it on every visit. It downloads it periodically (Google's documentation says the cached copy is generally kept for up to 24 hours, sometimes longer) and uses that cached version for all its crawl decisions until the next refresh. An urgent change to robots.txt can therefore take 24–48 hours to be reflected in Googlebot's actual behaviour.
The basic structure of a directive is straightforward:
User-agent: [bot name]
Disallow: [blocked path]
Allow: [permitted path]
Google accepts * as a universal wildcard for User-agent (all robots) and as a wildcard within paths. It also accepts $ to indicate the end of a URL. What Google does not accept are some advanced regular expression patterns that other bots do interpret — a frequent point of confusion when configurations are copied from other sources.
John Mueller, Senior Search Analyst at Google, has reiterated in multiple Q&A sessions that robots.txt does not guarantee privacy and that blocking a URL here does not remove it from the index. If the URL was already indexed when the Disallow directive is added, Google may keep it in results for months because it cannot visit it to verify whether it should be deindexed.
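Before moving on to the specific errors, a quick way to see how a parser reads these directives without guessing: the sketch below uses Python's standard urllib.robotparser. Note that it follows the original specification and does not implement Google's * and $ wildcards, so treat it as a first approximation rather than a substitute for testing in Search Console.

```python
# Minimal sketch: parse a robots.txt body and ask whether a crawler may
# fetch specific paths. urllib.robotparser is in the standard library.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /admin/
Disallow: /cart/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "/admin/settings"))  # False: matches Disallow: /admin/
print(parser.can_fetch("Googlebot", "/blog/post"))       # True: no rule matches
```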
Error #1 — Disallow: / (blocking the entire site by accident)
This is the most severe error and, surprisingly, it occurs frequently enough for Google to mention it specifically in its documentation. The result: Googlebot cannot crawl any page on the site.
# INCORRECT — blocks the entire site from all bots
User-agent: *
Disallow: /
The most common cause is not malice or ignorance: it is copying the staging environment’s robots.txt to production. Development and pre-production environments typically block all crawling to prevent them from appearing in search results. When the deployment to production happens and someone copies the full configuration file, the block travels with it.
The correct version for a site that wants to allow general crawling:
# CORRECT — allows crawling of the entire site
User-agent: *
Disallow:
# If there are specific sections to block:
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
An empty Disallow: line is the standard way to tell Googlebot it can crawl any path. You do not need to write Allow: / (although that also works).
The clearest warning sign of this error: Google Search Console starts showing a sharp drop in pages crawled, and the Coverage report shows dozens or hundreds of pages with the status “Excluded: blocked by robots.txt”. If you see that combination, check the robots.txt immediately.
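Because this error usually arrives through a deployment rather than a deliberate edit, it is worth catching it automatically before the file reaches production. A minimal sketch of such a guard follows; the robots.txt path and the pipeline integration are assumptions about your setup, and the parsing is deliberately simplified (it assumes one User-agent line per group).

```python
# Fails the build if robots.txt contains "Disallow: /" inside a
# "User-agent: *" group. The path is hypothetical: adjust to your project.
import sys

def blocks_entire_site(path="robots.txt"):
    wildcard_group = False
    for raw in open(path, encoding="utf-8"):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            wildcard_group = (value == "*")
        elif field == "disallow" and wildcard_group and value == "/":
            return True                        # full block for all bots
    return False

if blocks_entire_site():
    sys.exit("robots.txt blocks the entire site for all bots: refusing to deploy")
```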
Error #2 — Blocking CSS and JavaScript resources critical for rendering
This error does not block complete pages. It blocks the resources that Googlebot needs to render those pages correctly. The effect is more subtle but equally damaging.
When Googlebot visits a URL, it does not merely read the HTML. It downloads the CSS and JavaScript files referenced in the page to render it as a real user would see it. If the robots.txt blocks those resources, Googlebot sees a degraded version of the content — and that degraded version is what it evaluates for ranking.
The blocking patterns that cause this problem most frequently:
# INCORRECT — blocks resources needed for rendering
User-agent: *
Disallow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /assets/
Disallow: /static/
Disallow: /css/
Disallow: /js/
The intention behind these blocks is often reasonable: preventing Google from indexing individual files that have no value as pages. The problem is that “not indexing” and “not crawling” are different things. Google will not index a /assets/main.css file as if it were a search results page, but it does need to download it to render any page that uses it.
# CORRECT — allows crawling of resources, blocks only what is necessary
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
# CSS, JS and image resources are NOT blocked
Google Search Central explicitly documents this recommendation: allowing access to all files that the browser needs to render the page is essential for Googlebot to correctly evaluate content.
To verify whether Googlebot can access a page’s resources, use the URL Inspection tool in Google Search Console. The report shows whether any resources were blocked during the last crawl and which ones they were. If the “Page resources blocked” section appears, you have this problem.
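You can also approximate this check outside Search Console. The sketch below (hypothetical page URL, plain server-rendered HTML assumed) extracts the stylesheets and scripts referenced by a page and tests each one against the live robots.txt with Python's urllib.robotparser; since that parser ignores Google-style wildcards, treat it as a first pass and confirm doubtful cases with URL Inspection.

```python
# Lists rendering resources (external CSS and JS) that robots.txt would
# prevent Googlebot from fetching.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

PAGE = "https://yourdomain.com/"  # hypothetical page to test

class ResourceCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.resources = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(attrs["src"])
        if tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

robots = RobotFileParser()
robots.set_url(urljoin(PAGE, "/robots.txt"))
robots.read()

collector = ResourceCollector()
collector.feed(urlopen(PAGE).read().decode("utf-8", errors="ignore"))

for resource in collector.resources:
    absolute = urljoin(PAGE, resource)
    if not robots.can_fetch("Googlebot", absolute):
        print("Blocked rendering resource:", absolute)
```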
Error #3 — Incorrect syntax: case-sensitivity and spaces that cost indexation
The syntax of robots.txt is stricter than it looks. Two specific formatting errors cause problems that are difficult to detect without tools:
Case-sensitivity in paths: Paths in Disallow and Allow directives are case-sensitive according to Google’s specification. This means:
# INCORRECT if your real URL is /admin/ (lowercase)
User-agent: *
Disallow: /Admin/
# CORRECT — capitalisation must match the real URL exactly
User-agent: *
Disallow: /admin/
If your site has URLs with capital letters (something to avoid, but common in many CMS platforms), you need to block the exact versions. A block of /Admin/ does not affect /admin/ or /ADMIN/.
Spaces in the User-agent directive: The field name must be followed immediately by the colon; a stray space before the colon can cause some parsers to ignore the line. An incorrectly generated file may have:
# INCORRECT — space before the colon (causes problems in some parsers)
User-agent : *
Disallow: /admin/
# CORRECT
User-agent: *
Disallow: /admin/
Directives without a User-agent: Any directive that is not associated with a User-agent block is ignored. If someone adds a Disallow directive outside a block, it has no effect but also generates no visible error:
# INCORRECT — the Disallow without a preceding User-agent is ignored
Disallow: /private-area/
User-agent: *
Disallow: /admin/
Order matters for grouping: directives are grouped by User-agent block. Google documents that its crawlers merge multiple groups aimed at the same User-agent, but other parsers may process only one of them, and split blocks make mistakes easy to overlook:
# POTENTIALLY PROBLEMATIC — two separate blocks for the same user-agent
User-agent: *
Disallow: /admin/
User-agent: *
Disallow: /login/
# Some parsers may apply only one of the two blocks
The correct approach is to group all directives for the same User-agent into a single block:
# CORRECT — a single block per User-agent
User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
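These formatting slips are easy to miss by eye. A rough lint pass like the sketch below (a simplified reading of the file, not a full implementation of Google's parser) can flag a space before the colon, directives that precede any User-agent, and duplicate blocks for the same User-agent.

```python
# Minimal robots.txt lint for the three issues above. Assumes one
# directive per line; the default path is hypothetical.
def lint_robots(path="robots.txt"):
    seen_agents, current_agents, in_group = {}, [], False
    for number, raw in enumerate(open(path, encoding="utf-8"), start=1):
        line = raw.split("#", 1)[0].rstrip()
        if not line.strip():
            continue
        field, _, value = line.partition(":")
        if field != field.rstrip():
            print(f"line {number}: space before the colon in '{field.strip()}:'")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:                      # a new group starts here
                current_agents, in_group = [], False
            current_agents.append(value)
            if value in seen_agents:
                print(f"line {number}: second block for user-agent '{value}' "
                      f"(first seen on line {seen_agents[value]})")
            else:
                seen_agents[value] = number
        elif field in ("disallow", "allow"):
            if not current_agents:
                print(f"line {number}: '{field}' appears before any User-agent and is ignored")
            in_group = True

lint_robots()
```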
Error #4 — Incorrectly applied wildcard that blocks pages you want indexed
Wildcards (* and $) are powerful but require precision. A poorly written pattern can block dozens or hundreds of URLs you wanted to keep accessible to Googlebot.
The * wildcard in a path matches any sequence of characters at that position. The problem arises when the pattern is too generic:
# INCORRECT — blocks ALL URLs containing "?", including valid product pages
User-agent: *
Disallow: /*?
# This blocks:
# /product/blue-shirt?colour=blue
# /blog/article?utm_source=newsletter
# /services?tab=pricing ← important page you wanted indexed
If the goal is to block dynamic filter pages while allowing base product URLs, the directive needs to be more specific:
# BETTER — blocks only specific filter parameters
User-agent: *
Disallow: /*?orderby=
Disallow: /*?filter_colour=
Disallow: /*?paged=
The $ wildcard indicates the end of the URL. Useful for blocking files with specific extensions without blocking paths that start the same way:
# CORRECT — blocks .pdf files but not the /documents/ section
User-agent: *
Disallow: /*.pdf$
Without the $, Disallow: /*.pdf also matches any URL that merely contains .pdf somewhere in its path, such as a hypothetical /downloads/manual.pdf-archive/. With the $, only URLs ending exactly in .pdf are blocked.
A particularly costly error in ecommerce: blocking pagination pages with an overly broad pattern:
# INCORRECT — blocks category pages with pagination (/category/page/2/)
User-agent: *
Disallow: /*/page/
# If your site has URLs like /services/page/2/, /blog/page/3/,
# those are also blocked even if you want them indexed
Before adding any wildcard directive, test it against real URLs on your site: check each important URL with the URL Inspection tool in Google Search Console, which reports whether that URL is blocked, so you can catch patterns that match more than you intended.
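If you want to dry-run a pattern before it goes live, the sketch below approximates Google's documented matching rules (* matches any run of characters, $ anchors the end of the URL, and a rule matches from the start of the path) by translating a Disallow pattern into a regular expression. It mirrors the documented behaviour but is not Google's actual parser.

```python
# Dry-run a Disallow pattern against sample paths using Google-style
# wildcard semantics (approximation, not Google's parser).
import re

def pattern_to_regex(pattern):
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(disallow_pattern, url_path):
    return bool(pattern_to_regex(disallow_pattern).match(url_path))

# Test the over-broad pattern from the example above
for path in ["/product/blue-shirt?colour=blue",
             "/services?tab=pricing",
             "/blog/robots-txt-errors"]:
    print(path, "->", "blocked" if is_blocked("/*?", path) else "allowed")
```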
Error #5 — Conflict between robots.txt and meta robots: which wins?
This is the most common conceptual error: assuming that robots.txt and the meta robots tag work the same way or that they complement each other intuitively. The reality is more complex and can produce unexpected results.
Fundamental rule: if a URL is blocked in robots.txt, Google cannot crawl it. If it cannot crawl it, it cannot read the meta tags it contains. This includes the <meta name="robots" content="noindex"> tag.
The most problematic scenario:
# In robots.txt:
User-agent: *
Disallow: /landing-pages/
# In /landing-pages/special-offer/:
<meta name="robots" content="noindex, follow">
The apparent goal is to deindex the landing page. The actual result: Google cannot access the page to read the noindex, so it may keep it in the index indefinitely if it had inbound links that had previously indexed it.
The precedence rule is the opposite of what many expect:
- To prevent crawling: use robots.txt Disallow. The noindex on the page is irrelevant if Google cannot crawl the URL.
- To prevent indexation of a crawlable page: use <meta name="robots" content="noindex"> (or the HTTP header X-Robots-Tag: noindex). The robots.txt must allow access so Google can read this directive.
- To remove from the index a page that is already indexed: remove the robots.txt block, add noindex on the page, and wait for Google to crawl and process the directive.
Gary Illyes of Google summarised this conflict clearly at a Google Search Central event: “A page blocked by robots.txt is not the same as a noindexed page. If you want to make sure something does not appear in results, do not confuse both mechanisms.”
The correct combination depends on the objective:
| Objective | Robots.txt | Meta robots |
|---|---|---|
| Do not crawl, do not index | Disallow | (irrelevant, not read) |
| Crawl but do not index | Allow (or not mentioned) | noindex |
| Crawl, index, do not follow links | Allow | nofollow |
| Remove from index (already indexed) | Remove Disallow | noindex |
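A quick way to spot the conflicting combination described above is to check, for a given URL, whether it is simultaneously blocked by robots.txt and carrying a noindex in its HTML. The sketch below does exactly that (the URL is hypothetical and the meta detection is deliberately crude); note that the script can read the noindex only because it ignores robots.txt, which is precisely what Googlebot will not do.

```python
# Detects the "blocked by robots.txt but noindexed in HTML" conflict.
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

URL = "https://yourdomain.com/landing-pages/special-offer/"  # hypothetical

robots = RobotFileParser()
robots.set_url(urljoin(URL, "/robots.txt"))
robots.read()

blocked = not robots.can_fetch("Googlebot", URL)
html = urlopen(URL).read().decode("utf-8", errors="ignore").lower()
has_noindex = 'name="robots"' in html and "noindex" in html  # crude meta check

if blocked and has_noindex:
    print("Conflict: robots.txt blocks the URL, so the noindex will never be read")
elif blocked:
    print("Blocked by robots.txt (no noindex found in the HTML)")
else:
    print("Crawlable; any noindex on the page can be read and honoured")
```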
How to audit your robots.txt with Google Search Console
Google Search Console includes two complementary tools for verifying the robots.txt:
- robots.txt report (Settings → Robots.txt): shows the cached robots.txt file Googlebot is currently using, when it was last crawled, and whether there were any fetch errors.
- URL Inspection tool: lets you test a specific URL to see whether it is blocked by robots.txt, which rule applies, and the indexing status.
For a more thorough robots.txt audit, Screaming Frog SEO Spider has a specific function that crawls the site simulating Googlebot’s behaviour and shows which pages fall outside the crawl due to current directives. The “Blocked by Robots.txt” filter in the Response Codes tab shows all affected URLs.
Steps for a basic audit:
- Open https://yourdomain.com/robots.txt directly in your browser and review each directive.
- In Google Search Console, use the URL Inspection tool to verify that the 10–20 most important pages on your site are not blocked by robots.txt (a batch version of this check is sketched after this list).
- Review the Coverage report in Search Console and filter by “Excluded: blocked by robots.txt” to see if there are URLs that should not be blocked.
- If you use Screaming Frog, crawl the site and review the “Blocked by Robots.txt” report.
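The second step can be partly automated. A small batch version of that check, assuming you maintain a plain list of priority URLs (the list below is hypothetical), might look like this; remember that urllib.robotparser does not apply Google wildcards, so double-check any wildcard rules in Search Console.

```python
# Batch-check priority URLs against the live robots.txt.
from urllib.robotparser import RobotFileParser

IMPORTANT_URLS = [
    "https://yourdomain.com/",
    "https://yourdomain.com/services/",
    "https://yourdomain.com/blog/",
]  # hypothetical list: replace with your own priority pages

robots = RobotFileParser()
robots.set_url("https://yourdomain.com/robots.txt")
robots.read()

for url in IMPORTANT_URLS:
    status = "OK" if robots.can_fetch("Googlebot", url) else "BLOCKED"
    print(f"{status:8}{url}")
```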
A correctly configured robots.txt is one of the foundations of crawl budget management and proper site indexation. If critical pages never reach Googlebot, no other SEO optimisation has the opportunity to work.
For a deeper understanding of how robots.txt, sitemaps, and indexation strategy work together, the Google Search Console practical guide covers how to coordinate both mechanisms to maximise visibility in Google.