The time budget Google assigns to your website
Think of Googlebot as an auditor with a fixed number of working hours per day. If your building has 50 well-labeled rooms with clear signage, the auditor can review them all efficiently. But if there are 500 rooms, many unlabeled, some filled with identical copies of the same documents, and others with corridors that end in dead walls, the auditor will leave without having seen what matters most.
Crawl budget works much like that. It is the set of resources (time, HTTP requests, bandwidth) that Googlebot has available to crawl your site in a given period of time. When that budget runs out, Google leaves. Pages it did not reach remain unindexed, at least until the next visit, which can take days or weeks.
Before any page can appear in search results, Google must complete three steps: crawl, index, and rank. Crawl budget affects the first link in that chain. If Google does not crawl a page, it cannot index it. If it cannot index it, it cannot rank it.
This guide explains how that budget works, when you should worry about it, and — most importantly — how to ensure Googlebot allocates its resources to your most valuable pages.
What determines crawl budget: two factors working together
Google formally introduced the concept of crawl budget in January 2017. According to that documentation — still current — the effective crawl budget results from the combination of two independent factors.
Crawl rate limit is the maximum speed at which Googlebot can crawl your site without degrading the real user experience. Google adjusts this limit automatically based on server response times. If your server responds in 200 ms, Googlebot can make more requests per second than if it responds in 1,500 ms. The manual crawl rate setting in Google Search Console was retired in January 2024, so the practical levers are now server-side: faster responses raise the limit, and temporary 503 or 429 responses lower it.
Crawl demand measures how much Google wants to crawl your site, determined by two sub-factors: URL popularity (roughly measured by PageRank) and update frequency. A page with many high-quality inbound links has high demand. A product page that updates its price and availability every hour also has high demand. Pages with no popularity and no recent changes have low demand and are visited less frequently.
Your effective crawl budget is the result of balancing both factors. A site with very fast servers but low-popularity URLs will not necessarily receive more crawls. A site with fresh, popular content but a slow server will see Googlebot throttle its visits to avoid overloading the infrastructure.
When does it actually matter?
John Mueller from Google has been direct: “IMO crawl-budget is over-rated. Most sites never need to worry about this.”
According to Google’s official documentation, crawl budget is primarily relevant in two scenarios:
- Sites with more than 1 million unique pages updated weekly or more frequently.
- Sites with more than 10,000 pages that change daily.
The 1 million page threshold has not changed since 2020, as Gary Illyes confirmed in 2025. This means that for the vast majority of business websites, problems that look like crawl budget issues are actually content quality, internal link structure, or server speed problems.
If your site has 5,000 pages and there is content not being indexed, before looking at crawl budget check whether that content has real user value and whether it is properly linked from authoritative pages.
The key insight: database speed matters more than page volume
For years, crawl budget conversations have revolved around page count. The intuition was logical: more pages means more URLs to crawl, which can exhaust the budget. However, in May 2025, Gary Illyes shared an insight that significantly inverts that priority.
On the Search Off the Record podcast, Illyes explained: “If you are making expensive database calls, that’s going to cost the server a lot.” He added a concrete example: a site with 500,000 pages making costly SQL queries can have more crawling problems than a site with 2 million pages serving static cached content.
This has immediate practical implications. If you run an e-commerce site with a medium-sized catalog (50,000–200,000 products) and are experiencing indexation issues, the first question should not be “how many URLs do I have?” but “how long do my pages take to respond, and what database queries do they execute?”
Google’s documentation confirms the reference point: “Aim for server response times below 300-400 milliseconds on average.” That threshold refers to Time to First Byte (TTFB) — what Googlebot actually experiences — not the perceived load time including rendering.
Measuring real TTFB for Googlebot
On its first pass, Googlebot behaves as a simple HTTP client: it requests the URL, waits for the server response, and processes the HTML. It does not execute JavaScript on the initial visit. To measure the TTFB Googlebot actually experiences:
- Google Search Console → Coverage → Indexed URLs: Check the last crawl date of sample URLs. Response-time figures themselves live in the Crawl Stats report.
- Server logs: Filter requests from Googlebot (user agent containing Googlebot/2.1) and analyze actual response times, not simulation estimates.
- Google Search Console → Settings → Crawl Stats: Shows the average response time Googlebot records for your pages.
- Third-party tools: Screaming Frog can simulate JavaScript-free HTTP requests to measure TTFB programmatically.
A consistently high TTFB (above 500 ms across a significant percentage of pages) is a warning signal even for mid-sized sites.
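The log-based check can be sketched in a few lines of Python. This is a minimal sketch, assuming access log lines in nginx "combined" format with the request time in seconds appended as the last field. That log layout, the function name, and the sample lines are illustrative assumptions, and a full request time is only an upper bound on TTFB.

```python
import re

# Assumed format: nginx "combined" plus the request time (seconds) as last field.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)" (?P<seconds>\d+\.\d+)$'
)

def slow_googlebot_share(lines, threshold_ms=500):
    """Return the fraction of Googlebot requests slower than threshold_ms."""
    times_ms = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match and "Googlebot" in match.group("agent"):
            times_ms.append(float(match.group("seconds")) * 1000)
    if not times_ms:
        return 0.0
    slow = sum(1 for t in times_ms if t > threshold_ms)
    return slow / len(times_ms)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Jan/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}" 0.180',
    f'66.249.66.1 - - [10/Jan/2026:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}" 0.820',
    f'66.249.66.1 - - [10/Jan/2026:10:00:02 +0000] "GET /c HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}" 0.240',
    '203.0.113.9 - - [10/Jan/2026:10:00:03 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0" 1.500',
]
print(f"{slow_googlebot_share(sample):.0%} of Googlebot requests exceeded 500 ms")
```

User agent strings can be spoofed, so a production version would also verify that the requesting IP really belongs to Google before trusting the numbers.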
The 10 factors wasting your crawl budget
Knowing that crawl budget can be exhausted is useful. Knowing exactly what exhausts it is what enables action. These are the main culprits.
1. Technical duplicate content
According to Ahrefs data, 60% of the web is duplicate content, primarily of a technical nature. The most common variants within a single site:
- http:// vs. https://
- www.domain.com vs. domain.com
- URLs with trailing slash vs. without: /category/ vs. /category
- Tracking parameters: ?utm_source=email, ?fbclid=...
- Session IDs in URLs: ?PHPSESSID=abc123
- Print or export versions: ?format=print
Each variant is a separate HTTP request for Googlebot. If you have 100,000 URLs with four variants each, you are generating 400,000 URLs competing for the same budget. The solution is to consolidate via canonical tags, configure your server to redirect variants to the canonical URL, and make sure internal links only ever point at the canonical form. The URL Parameters tool in Google Search Console was retired in 2022, so parameter handling now has to happen on the site itself.
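As an illustration of what "redirect variants to the canonical URL" means in practice, here is a minimal normalization sketch. The policy it encodes (HTTPS, no www, no trailing slash, tracking and session parameters dropped) is an assumption chosen to match the variants listed above, and the function and constant names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical policy: parameters that never change the page content.
NOISE_PARAMS = {"fbclid", "phpsessid", "format"}

def canonicalize(url):
    """Map any duplicate variant of a URL to one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop the trailing slash, but keep "/" for the homepage.
    path = parts.path.rstrip("/") or "/"
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS and not key.lower().startswith("utm_")
    ]
    return urlunsplit(("https", host, path, urlencode(kept), ""))

print(canonicalize("http://www.domain.com/category/?utm_source=email"))
# https://domain.com/category
```

A server would compare the requested URL with its canonical form and issue a single 301 whenever the two differ.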
2. Faceted navigation
Faceted navigation is the largest URL multiplier in e-commerce. A “sports shoes” category with size (12 options), color (8 options), brand (20 options) and price (5 ranges) filters generates 12 × 8 × 20 × 5 = 9,600 unique URL combinations even when only one value can be selected per filter. Allow several values to be active in the same filter and the number grows exponentially.
Most of those combinations have no independent SEO value — they are variants of the same category with similar or identical content. Yet if Googlebot can access them (because they are linked in the HTML), it will crawl them all.
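Mitigation strategies include: adding <meta name="robots" content="noindex"> tags on faceted URLs, implementing JavaScript filters that modify content without changing the URL, or blocking specific URL patterns in robots.txt. Keep in mind that noindex removes a page from the index but does not stop Googlebot from requesting it, because the tag can only be read after the page has been crawled, and that a URL blocked in robots.txt never has its noindex tag seen. Each option has its own implications for indexing relevant filtered content, so the right balance depends on the specific site.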
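The filter arithmetic for the “sports shoes” example can be checked directly. The sketch below only restates the numbers from the example; the three counting rules (one value per filter, optional filters, multi-select) are assumptions about how a given storefront exposes its facets.

```python
from math import prod

# Option counts taken from the "sports shoes" example.
filters = {"size": 12, "color": 8, "brand": 20, "price": 5}

# Exactly one value chosen in every filter.
single_choice = prod(filters.values())

# One value per filter, but each filter may also be left unset.
optional_single = prod(n + 1 for n in filters.values())

# Any subset of values may be active in each filter (multi-select).
multi_select = prod(2 ** n for n in filters.values())

print(single_choice)    # 9600
print(optional_single)  # 14742
print(multi_select)     # 35184372088832
```

The jump from thousands to trillions of crawlable URLs comes entirely from allowing multi-select, which is why that setting deserves the first look.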
3. Redirect chains
Each HTTP redirect is an additional request for Googlebot. If URL A redirects to B which redirects to C, Googlebot makes three requests to retrieve the final content. On sites with a history of migrations, chains of three, four, or five hops are common.
Tools like Screaming Frog allow you to crawl and map all redirect chains on a site. The general rule: no redirect should have more than one hop. Chains should be collapsed so they point directly to the final destination.
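Collapsing chains is mechanical once the redirect map has been exported. A minimal sketch, assuming the redirects are available as a simple source-to-target dictionary (the paths and function names are made up for illustration):

```python
def resolve(redirects, url, max_hops=10):
    """Follow a redirect map and return (final_url, hops)."""
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or overly long chain at {url}")
        seen.add(url)
    return url, hops

def collapse(redirects):
    """Return a map where every source points straight at its final destination."""
    return {source: resolve(redirects, source)[0] for source in redirects}

redirects = {"/a": "/b", "/b": "/c", "/old": "/a"}
print(resolve(redirects, "/old"))  # ('/c', 3)
print(collapse(redirects))         # {'/a': '/c', '/b': '/c', '/old': '/c'}
```

After the collapsed map is deployed, every legacy URL reaches its destination in one hop.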
4. Crawl errors (404, soft 404, 5xx)
Errors consume budget without delivering value. A 404 is a response, and Googlebot must process that response before continuing. Soft 404s (pages that return 200 but display “No results found” content) are particularly harmful because Googlebot needs more resources to detect them.
5xx errors (server errors) are the most damaging: they indicate server overload or infrastructure failures, and Googlebot may aggressively reduce its crawl rate if it encounters them frequently.
5. Poor internal link structure
Googlebot primarily discovers URLs by following links. If you have valuable pages more than three clicks deep from the homepage, Googlebot may not reach them on every crawl cycle. Worse, pages with no internal inbound links (“orphan pages”) may not be crawled at all, even if listed in the sitemap.
Internal link architecture is the most direct tool for guiding crawl budget toward priority pages. For a deeper look at this in the context of technical SEO, see our guide on technical SEO.
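Click depth and orphan pages can both be derived from a crawl export with a breadth-first search. This is a minimal sketch, assuming the internal link graph is available as a dictionary of page to linked pages; the example URLs are invented.

```python
from collections import deque

def click_depths(links, home="/"):
    """Breadth-first search: minimum number of clicks from the homepage."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

links = {
    "/": ["/shoes", "/about"],
    "/shoes": ["/shoes/running"],
    "/shoes/running": ["/shoes/running/trail"],
    "/shoes/running/trail": ["/product-42"],
    "/landing-2019": ["/"],  # links out, but nothing links to it
}

depths = click_depths(links)
all_pages = set(links) | {t for targets in links.values() for t in targets}
too_deep = sorted(page for page, depth in depths.items() if depth > 3)
orphans = sorted(all_pages - set(depths))

print(too_deep)  # ['/product-42']
print(orphans)   # ['/landing-2019']
```

In a real audit, the set of known pages would also include sitemap URLs, since orphans by definition do not show up in a link-following crawl.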
6. Incorrect XML sitemaps
A sitemap should list only URLs returning 200 that are canonical. However, it is common to find sitemaps containing: moved URLs (returning 301), error URLs (404), URLs with non-canonical parameters, or URLs with noindex still listed in the sitemap.
Each error in the sitemap is a negative quality signal to Google about the site. Additionally, Googlebot will crawl those non-canonical URLs, wasting budget.
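The sitemap rule above translates into a short filter. A minimal sketch, assuming a crawler has already collected the status code, canonical target, and noindex flag for each sitemap URL; the record layout and function name are hypothetical.

```python
def sitemap_problems(entries):
    """Return {url: reason} for entries that should not be in the sitemap."""
    problems = {}
    for entry in entries:
        url = entry["url"]
        if entry["status"] != 200:
            problems[url] = f"returns {entry['status']}"
        elif entry["canonical"] != url:
            problems[url] = "not canonical"
        elif entry["noindex"]:
            problems[url] = "noindex"
    return problems

entries = [
    {"url": "/shoes", "status": 200, "canonical": "/shoes", "noindex": False},
    {"url": "/old-shoes", "status": 301, "canonical": "/shoes", "noindex": False},
    {"url": "/shoes?sort=price", "status": 200, "canonical": "/shoes", "noindex": False},
    {"url": "/cart", "status": 200, "canonical": "/cart", "noindex": True},
]
for url, reason in sitemap_problems(entries).items():
    print(f"{url}: {reason}")
```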
7. Thin content and low-quality pages
Google combines crawl demand with quality signals. Pages with little original content, deep pagination pages (page 47 of 50 of internal search results), or tag/label pages with few entries have low crawl demand. Google visits them less frequently — and when budget is limited, it may skip them entirely.
Consolidating thin content, removing deep pagination pages without independent value, and disabling crawling of irrelevant taxonomies frees up budget for primary content.
8. Slow database queries: the hidden factor
As Gary Illyes noted in May 2025, expensive database queries can slow the server to the point where Googlebot reduces its crawl rate. This is especially relevant for:
- Product pages with real-time inventory (stock and price queries on every load)
- Listing pages with applied filters that generate complex queries
- CMS systems with complex entity relationships
Implementing full-page caching, with the same cached HTML served to users and Googlebot alike, or at least caching the most frequent response types, can dramatically improve TTFB and, therefore, the crawl rate.
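The idea behind full-page caching fits in a few lines. This is a minimal in-memory sketch, not a production cache: the class name, the 300-second lifetime, and the render callback are assumptions, and it serves the same cached HTML to every visitor rather than treating Googlebot specially.

```python
import time

class PageCache:
    """Keep rendered HTML for a fixed time so repeat hits skip the database."""

    def __init__(self, render, ttl_seconds=300, clock=time.monotonic):
        self._render = render      # expensive function: path -> HTML
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._store = {}           # path -> (expires_at, html)

    def get(self, path):
        now = self._clock()
        cached = self._store.get(path)
        if cached and cached[0] > now:
            return cached[1]       # fresh copy: no rendering, no queries
        html = self._render(path)
        self._store[path] = (now + self._ttl, html)
        return html

def render_product(path):
    # Stand-in for a view that runs stock and price queries.
    return f"<html><body>Product page for {path}</body></html>"

cache = PageCache(render_product, ttl_seconds=300)
first = cache.get("/product-42")   # rendered
second = cache.get("/product-42")  # served from memory
```

The lifetime is the design choice that matters: pages with hourly price changes need a short one, while category descriptions can be cached far longer.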
9. JavaScript-dependent content discovery
Googlebot renders JavaScript in a second wave, separate from the initial HTML crawl. If your critical internal links are rendered by JavaScript (navigation menus, “load more” buttons, infinite scroll), Googlebot may not discover many of your pages on the first pass. For large sites, this can create a significant crawl coverage gap.
Ensure all critical navigation links are present in the initial HTML response. Use server-side rendering or static generation for navigation elements that lead to important pages.
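One way to verify this is to parse the raw HTML response, before any JavaScript runs, and confirm the important URLs appear as ordinary anchor tags. A minimal sketch using Python's standard HTML parser; the sample markup and the list of required links are invented for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)

def missing_links(raw_html, required):
    """Return required URLs that are not linked in the initial HTML."""
    collector = LinkCollector()
    collector.feed(raw_html)
    return sorted(set(required) - collector.hrefs)

raw_html = """
<nav><a href="/shoes">Shoes</a></nav>
<button id="load-more" data-next="/shoes?page=2">Load more</button>
"""
print(missing_links(raw_html, ["/shoes", "/shoes?page=2"]))  # ['/shoes?page=2']
```

Here the second page of the listing is only reachable through a button handled by script, so it would be flagged as invisible on the first pass.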
10. AI crawlers competing for bandwidth
GPTBot (OpenAI), CCBot (Common Crawl), and other AI crawlers independently crawl the web and can consume up to 40% of available bandwidth during deep crawl cycles. (Google-Extended is a robots.txt control token rather than a separate crawler, so it adds no requests of its own.) This additional consumption has a side effect: it reduces server availability for Googlebot, which can suppress the effective crawl rate limit.
The obvious response, blocking those bots in robots.txt, comes with a documented cost. According to 2026 DEV Community data, sites that blocked GPTBot were cited 73% less in ChatGPT responses. For businesses that depend on generative AI visibility, that is not a trivial trade-off.
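A middle path between full access and a full block is to keep AI crawlers out of low-value sections only. The sketch below tests such a policy with Python's standard robots.txt parser before deployment. The rules themselves are a hypothetical example, not a recommendation, and the standard-library parser only handles simple path prefixes, so it will not reproduce Google's wildcard matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: GPTBot stays out of internal search, everything else is open.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /search/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Check whether the given crawler may fetch the path under this policy."""
    return parser.can_fetch(agent, "https://example.com" + path)

for agent, path in [
    ("GPTBot", "/search/shoes"),
    ("GPTBot", "/products/1"),
    ("Googlebot", "/search/shoes"),
]:
    print(agent, path, allowed(agent, path))
```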
Measuring your current crawl budget
Before optimizing, you need data. These are the most reliable information sources.
Google Search Console — Coverage
The Coverage report (labeled Page indexing in the current Search Console menu) shows the indexation status of all known URLs. Pay particular attention to the “Discovered – currently not indexed” and “Crawled – currently not indexed” categories. The first indicates URLs Google knows about but has not yet crawled, which makes it the most direct symptom of insufficient budget.
Google Search Console — Crawl Stats
Available at Settings → Crawl Stats. Shows the number of daily Googlebot requests, average response time, and the purpose of each request. Divide your total page count by the number of pages crawled per day to calculate the ratio:
- Pages/daily crawls ratio > 10:1 → urgent problem
- Ratio between 3:1 and 10:1 → monitor actively
- Ratio ≤ 3:1 → not a current priority
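The thresholds above can be expressed as a small helper. This sketch simply encodes the three bands listed; the function name and the sample figures are illustrative.

```python
def crawl_ratio_status(total_pages, daily_crawled_pages):
    """Classify the pages-to-daily-crawls ratio using the bands above."""
    if daily_crawled_pages <= 0:
        raise ValueError("daily_crawled_pages must be positive")
    ratio = total_pages / daily_crawled_pages
    if ratio > 10:
        return ratio, "urgent problem"
    if ratio > 3:
        return ratio, "monitor actively"
    return ratio, "not a current priority"

# Example: 200,000 pages with 10,000 crawled per day gives a ratio of 20:1.
print(crawl_ratio_status(200_000, 10_000))  # (20.0, 'urgent problem')
```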
Server log analysis
Logs are the source of truth. Tools like Screaming Frog Log Analyzer, SEOlyzer, or a custom ELK Stack configuration allow you to filter Googlebot requests, calculate crawl rates per section, identify error URLs, and detect bot-traffic spikes competing with Googlebot.
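For teams without a dedicated log tool, the core of this analysis is a pair of counters. A minimal sketch, assuming the logs have already been parsed into (user agent, path, status) tuples; that input shape and the sample records are assumptions for illustration.

```python
from collections import Counter

def crawl_profile(requests):
    """Count Googlebot hits per top-level section and per status class."""
    sections, statuses = Counter(), Counter()
    for agent, path, status in requests:
        if "Googlebot" not in agent:
            continue
        first_segment = path.split("?")[0].strip("/").split("/")[0]
        sections["/" + first_segment] += 1
        statuses[f"{status // 100}xx"] += 1
    return sections, statuses

requests = [
    ("Googlebot/2.1", "/products/42", 200),
    ("Googlebot/2.1", "/products/43?color=red", 200),
    ("Googlebot/2.1", "/search/?q=shoes", 200),
    ("Googlebot/2.1", "/old-page", 404),
    ("Googlebot/2.1", "/products/44", 503),
    ("GPTBot/1.0", "/products/42", 200),
]
sections, statuses = crawl_profile(requests)
print(sections.most_common())
print(statuses.most_common())
```

Comparing the section counts against where the valuable pages live shows at a glance whether Googlebot is spending its visits in the right place.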
Action plan: prioritization by impact
Not all optimizations deliver the same return. Here is the recommended intervention order based on potential impact:
Priority 1 — Immediate impact
- Fix 5xx server errors (directly improves crawl rate limit)
- Eliminate non-canonical URL variants (redirect HTTP→HTTPS, www→non-www, inconsistent trailing slash)
- Collapse redirect chains to single hops
- Fix XML sitemap: only canonical 200 URLs
Priority 2 — Medium impact
- Implement full-page caching to reduce TTFB
- Manage faceted URLs (noindex or robots.txt blocking)
- Remove or noindex thin pages without independent value
- Profile and optimize slow database queries (EXPLAIN ANALYZE in PostgreSQL/MySQL)
Priority 3 — Structural optimization
- Review and optimize internal linking so priority pages are within 3 clicks
- Slow Googlebot down only if the server has real capacity constraints (the GSC crawl rate setting was retired in January 2024, so this now means temporary 503 or 429 responses)
- Define a policy for AI crawlers (GPTBot, Google-Extended, etc.)
- Implement continuous log monitoring to detect regressions
The data point that corrects perspective
An e-commerce site implemented a complete crawl budget optimization program — duplicate consolidation, faceted navigation management, server speed improvements — and increased its crawl rate 10x over two years, from 60,000 to 600,000 URLs crawled daily, according to a case documented by Conductor Academy. The effect on indexation was proportional: pages that had been stuck in “Discovered – currently not indexed” for months started appearing in results within weeks.
However, a comparable crawl volume can sometimes be reached simply by improving server speed. There are documented cases of sites multiplying their crawl rate by 4x (from 150,000 to 600,000 URLs/day) solely by improving TTFB from 800 ms to 180 ms, without touching content architecture.
The correct priority hierarchy for crawl budget optimization, informed by Gary Illyes’s 2025 insight, is: server speed first, crawlable content quality second, and URL volume third.
Conclusion: when to act and when not to
John Mueller is right that crawl budget is overrated for most sites. If your website has fewer than 100,000 pages, a server responding in under 400 ms, and content without massive duplicates, crawl budget is not your problem.
But if you run a large e-commerce store, a portal with faceted navigation, a site with a history of migrations, or an architecture with many parametric URLs, crawl budget may be the bottleneck explaining why your pages do not appear in Google despite having good content.
The most underrated tool for diagnosing this is GSC → Coverage → “Discovered – currently not indexed.” If that number is large and growing, you have a crawling problem worth addressing. If it is small and stable, your priorities are elsewhere.
For sites combining technical SEO with presence in AI engines, the decision on AI crawlers adds a strategic layer that did not exist two years ago. Crawl budget optimization in 2026 is no longer just a conversation about Googlebot. For a deeper exploration of indexation issues and their technical causes, see our guide on Google indexation problems.