Crawl budget only matters for sites with 100,000+ pages. If your site has fewer than 10,000 URLs, skip this guide. For large sites — here's exactly what wastes crawl budget and how to fix it.

Let's establish the threshold before diving in: crawl budget optimisation is a concern for large sites. Google's own documentation is clear that "for most sites, crawl budget is not something to worry about." The practical threshold is approximately 100,000 pages — below that, Googlebot will crawl all your quality content without constraint. Below 10,000 pages, crawl budget issues are almost certainly not what's limiting your organic performance.
If you run a SaaS product with 50 landing pages and a 200-post blog, close this tab and spend time on content quality, backlinks, or schema. If you run an e-commerce site with 500,000 product and category URLs, or a news site publishing hundreds of articles daily, read on.
Google defines crawl budget as the number of URLs Googlebot will crawl on your site within a given time frame. The concept has two components that are often conflated:
Crawl rate limit: how fast Googlebot can crawl your site without overloading your server. Googlebot monitors server response times and adjusts its crawl speed to avoid degrading site performance: if your server responds slowly, Googlebot slows down; if it responds quickly, Googlebot speeds up. Search Console used to offer a manual crawl rate limiter, but Google retired that setting in early 2024, and you almost never wanted it anyway, because manually lowering the crawl rate delays indexing of new content. If Googlebot is genuinely overwhelming a server, temporarily returning 503 or 429 responses is the supported way to slow it down.
Crawl demand: how much of your site Googlebot wants to crawl based on the perceived value and freshness of your pages. Pages with more backlinks, more internal links, and more frequently updated content generate higher crawl demand. Pages that haven't changed in years and have no backlinks generate low crawl demand. Googlebot prioritises its crawling toward high-demand pages.
Crawl budget problems occur when that finite crawl capacity is wasted on low-value URLs, so the budget is consumed before Googlebot reaches your important pages.
Faceted navigation, the filter system on e-commerce category pages (brand, colour, size, price range, material), is the most common and most severe crawl budget problem. Each filter combination produces a unique URL. A category page with 5 filter dimensions, each offering 10 options, can generate 10^5 = 100,000 URL variants when every filter is applied, and even more once partially filtered states and sort orders are counted, all from a single category page.
Even if your site has 5,000 products, the faceted navigation URLs can create millions of indexable URLs — each a near-duplicate of the others, each consuming crawl budget, none of them adding unique value.
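The arithmetic can be sketched directly. The 100,000 figure counts the states where every filter is applied (10^5); counting states where some filters are left unset pushes it even higher:

```python
# Rough combinatorics of faceted-navigation URL variants.
# Assumes 5 filter dimensions with 10 options each, and that each
# dimension can be either unset or set to exactly one option.

def facet_variants(dimensions: int, options_per_dimension: int) -> int:
    """Count distinct filter states: each dimension has
    (options + 1) states -- the +1 is 'filter not applied'."""
    return (options_per_dimension + 1) ** dimensions

# 5 dimensions x 10 options -> 11^5 = 161,051 distinct filter states
# for one category page, before sort orders multiply it further.
print(facet_variants(5, 10))  # 161051
```

The exponent is what makes this dangerous: adding one more filter dimension multiplies the URL space by another factor of eleven.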
Example of the problem:
```
/category/shoes/
/category/shoes/?colour=red
/category/shoes/?colour=red&size=42
/category/shoes/?brand=nike&colour=red
/category/shoes/?brand=nike&colour=red&size=42&sort=price-asc
```
Each of these is a separate URL in Google's crawl queue. The canonical URL (/category/shoes/) is the same page with different filter states. The filtered variants are not meaningfully different content.
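One way a site might compute the canonical target for these variants server-side, as a minimal sketch; the parameter names mirror the example above and are illustrative assumptions, not a standard:

```python
# Sketch: derive the canonical URL for a faceted-navigation request
# by dropping parameters that do not change the core content.
# The parameter list below is an illustrative assumption; each site
# must decide which parameters are content-changing.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

NON_CANONICAL_PARAMS = {"colour", "size", "brand", "sort", "session", "utm_source"}

def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in NON_CANONICAL_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))

print(canonical_url("/category/shoes/?brand=nike&colour=red&size=42&sort=price-asc"))
# -> /category/shoes/
```

The returned value is what would go into the href of the rel="canonical" tag on every filtered variant.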
Sites publishing hundreds of articles daily with tag pages, category archives, author archives, date archives, and paginated index pages create an enormous URL surface. A site with 50,000 articles, an extensive tag taxonomy, and 10 archive page types can have millions of index URLs, most of them thin archive pages with overlapping content.
If your URLs include session identifiers, tracking tokens, or campaign parameters, whether as query strings or as path segments, Googlebot treats each variant as a unique URL:
```
/product/123?session=abc123def456
/product/123?session=xyz789ghi012
```
These are the same page with different session IDs. If your site has 10,000 products and generates a session ID per visit, the crawlable URL space is effectively infinite.
Problem: query parameters for sorting, filtering, pagination, session tracking, and affiliate attribution generate thousands of near-duplicate URLs.
Fix: use the <link rel="canonical"> tag on all parameter variants to point to the canonical (parameter-free or preferred-parameter) URL. Note that Google Search Console's URL Parameters tool, which let you declare which parameters don't change page content, was retired in 2022, so canonical tags, consistent internal linking, and targeted robots.txt rules are now the available controls.
For pagination specifically: paginated pages (?page=2, ?page=3) should not be canonicalised to the first page, because that removes them from the index. Use self-referencing canonicals on each paginated page and ensure clear internal linking between pages.
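As a sketch, the self-referencing canonical for page N of a category might be generated like this (the path and the ?page= parameter name are this example's assumptions):

```python
# Sketch: canonical tag for page N of a paginated listing, following
# the self-referencing-canonical rule: each paginated page keeps its
# own URL as the canonical, never pointing back at page 1.

def pagination_head(base_path: str, page: int) -> str:
    # Page 1 lives at the bare path; later pages canonicalise
    # to their own ?page=N URL.
    self_url = base_path if page == 1 else f"{base_path}?page={page}"
    return f'<link rel="canonical" href="{self_url}">'

print(pagination_head("/category/shoes/", 3))
# -> <link rel="canonical" href="/category/shoes/?page=3">
```

Pair this with ordinary <a href> links to adjacent pages so Googlebot can follow the sequence.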
Problem: session IDs in URLs create a new unique URL for every user session, making your site's URL space effectively infinite.
Fix: move session management to cookies rather than URL parameters. This is a development change, not an SEO configuration — session IDs should never appear in crawlable URLs. If a legacy system generates session IDs in URLs, use canonical tags to map all session variants to the session-free URL.
Problem: Googlebot's crawl demand algorithm deprioritises pages with thin content, no backlinks, and low engagement signals, but it still crawls them, just less often. A site with thousands of thin pages (stub articles, empty category pages, generated pages with fewer than 100 words) dilutes crawl efficiency by spreading Googlebot's budget across worthless URLs.
Fix: noindex thin pages that provide no user value. For categories with no products, show a redirect or a message rather than an empty indexed page. For stub articles, either complete them or delete them and redirect to related content.
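One way to wire the empty-category rule, as a sketch; the specific policy chosen here (noindex but keep links followable for empty categories) is a common pattern, not the only valid one:

```python
# Sketch: choose the robots meta value for a category page based on
# whether it currently has products. The 'noindex, follow' choice for
# empty categories is an assumption: it keeps the page out of the
# index while still letting Googlebot follow its links.

def robots_meta_for_category(product_count: int) -> str:
    if product_count == 0:
        return "noindex, follow"
    return "index, follow"

print(robots_meta_for_category(0))   # noindex, follow
print(robots_meta_for_category(37))  # index, follow
```

The value is emitted into the page as <meta name="robots" content="...">; the point is that the decision is automated, so categories that gain or lose stock flip state without manual SEO work.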
Problem: infinite scroll pages that load content dynamically, without URL changes or pagination links, are both a crawl budget problem and a crawlability problem. Googlebot renders JavaScript but does not scroll or click, so content that loads only on a scroll event is typically never fetched or indexed.
Fix: implement proper pagination with <a href> links to paginated pages. Each paginated page should have a self-referencing canonical and clear internal navigation to adjacent pages.
Problem: the same content accessible at multiple URLs — with and without trailing slashes, with www and without, via HTTP and HTTPS, via both /products/ and /shop/ — forces Googlebot to crawl and compare variants to determine the canonical version. This wastes crawl budget and dilutes page authority.
Fix: enforce a single canonical URL variant at the server level with 301 redirects (not just rel="canonical" tags). Choose www or non-www, HTTPS, no trailing slash (or consistent trailing slash), and enforce this universally.
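The normalisation logic can be sketched as a pure function that computes the single 301 target for any requested variant; the specific policy choices here (HTTPS, non-www, no trailing slash) are this example's assumptions, and the real enforcement belongs in server or CDN config:

```python
# Sketch: compute the one canonical variant of a requested URL so the
# server can issue a single 301 straight to it, avoiding redirect
# chains (http -> https -> non-www -> no-slash).
from urllib.parse import urlsplit, urlunsplit

def canonical_host_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):          # policy: non-www
        host = host[4:]
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")          # policy: no trailing slash
    return urlunsplit(("https", host, path or "/", parts.query, ""))

print(canonical_host_url("http://www.example.com/products/"))
# -> https://example.com/products
```

If the computed URL differs from the requested one, respond 301 to it; if they match, serve the page.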
The Page indexing report in Google Search Console (formerly the Coverage report) shows how many pages are indexed and excluded. A large volume of "Crawled — currently not indexed" or "Discovered — currently not indexed" URLs indicates Googlebot found pages but didn't consider them worth indexing, a crawl demand signal.
The Crawl Stats report (Settings > Crawl stats) shows your daily crawl rate, average response time, and crawl requests by file type. A high ratio of Googlebot crawl requests to indexed pages suggests crawl budget waste.
Server access logs show exactly what Googlebot crawled, when, and how many times. Filter the logs for the Googlebot user-agent (verifying requests against Google's published IP ranges, since the user-agent string is easily spoofed) and look for: a high share of requests hitting parameterised or session-ID URLs, repeated crawling of redirect chains and 404s, and important pages that Googlebot rarely visits.
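As a sketch, the filtering-and-tallying step might look like this, assuming the standard Apache/Nginx combined log format; the bucketing into parameterised versus clean URLs is an illustrative choice, and the sample lines are invented:

```python
# Sketch: tally Googlebot requests from a combined-format access log
# to see where crawl budget actually goes. Splits hits into
# parameterised URLs (candidate crawl waste) vs clean URLs.
import re
from collections import Counter

LINE = re.compile(r'"(?:GET|HEAD) (?P<url>\S+) HTTP/[\d.]+" \d{3}')

def googlebot_url_counts(lines):
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:   # skip other user-agents
            continue
        m = LINE.search(line)
        if m:
            url = m.group("url")
            counts["parameterised" if "?" in url else "clean"] += 1
    return counts

sample = [
    '66.249.66.1 - - [01/Jan/2025:00:00:01 +0000] "GET /category/shoes/?colour=red HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [01/Jan/2025:00:00:02 +0000] "GET /category/shoes/ HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '10.0.0.1 - - [01/Jan/2025:00:00:03 +0000] "GET /category/shoes/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_url_counts(sample))
```

On a real large site you would run this over weeks of logs; a parameterised share above a few percent of Googlebot's requests is usually the faceted-navigation problem described earlier.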
Choosing the wrong tool for crawl budget management is a common mistake with serious consequences.
| Goal | Correct Tool | Common Mistake |
|---|---|---|
| Prevent indexing of a page | `<meta name="robots" content="noindex">` | `Disallow` in robots.txt |
| Prevent crawling of private paths | `Disallow` in robots.txt | `noindex` (requires crawl to apply) |
| Consolidate duplicate URLs | `rel="canonical"` pointing to preferred URL | `Disallow` on variant URLs |
| Remove a page from index | `noindex` + allow crawl | `Disallow` (page stays indexed) |
The critical nuance: Disallow in robots.txt prevents Googlebot from crawling a URL but does not prevent the URL from being indexed if it is linked from other pages. A page blocked by robots.txt can still appear in the index as a "referenced but not crawled" URL. noindex is the correct signal for "don't index this".
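The distinction can be demonstrated with Python's standard-library robots.txt parser: Disallow controls only whether a URL may be fetched, and says nothing about indexing:

```python
# Sketch: what a robots.txt Disallow actually governs. It blocks
# *crawling* of matching paths; an already-known URL can remain in
# the index because Googlebot never gets to see a noindex on it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("Googlebot", "https://example.com/private/report"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/pricing"))         # True
# A noindex meta tag, by contrast, only takes effect if the page CAN
# be crawled -- never combine noindex with Disallow on the same URL.
```

This is why "Disallow the page to deindex it" backfires: the block prevents Googlebot from ever reading the noindex.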
AI crawlers — GPTBot, ClaudeBot, PerplexityBot — do not publish detailed crawl budget documentation equivalent to Google's. However, the operational logic is analogous: AI crawlers have finite crawl capacity and must prioritise which pages to crawl and process thoroughly.
The practical implications:
Thin and duplicate pages waste AI crawler capacity: an AI crawler that encounters 50,000 faceted navigation URL variants has less capacity to process your feature pages, blog posts, and product documentation thoroughly. The same crawl budget optimisation that benefits Googlebot also benefits AI crawlers.
AI crawlers are more likely to skip low-quality pages: AI systems have quality thresholds for content they will process and potentially cite. Thin pages (under 200 words), duplicate pages, and parameter-variant pages are less likely to be processed by AI crawlers even if they are technically crawlable.
Blocked AI crawlers are a separate issue from crawl budget: if AI crawlers are blocked in your robots.txt, that is not a crawl budget problem — it is an access problem. Fix AI crawler access first (see the robots.txt guide for AI crawlers), then address crawl efficiency.
To be explicit: if your site has fewer than 10,000 indexable pages and you are not running faceted navigation, session IDs in URLs, or a high-volume content publication system, crawl budget is not the reason you're not ranking.
Common actual problems masquerading as crawl budget problems include thin or duplicate content, a weak backlink profile, accidental noindex tags or misconfigured canonicals, and client-side rendering that hides content from crawlers.
Run a full technical SEO audit to identify the actual issues before spending engineering time on crawl budget optimisation.
Run a free technical SEO audit at seo.yatna.ai — the audit checks crawl accessibility, indexing signals, schema validity, AI crawler configuration, and page-level technical issues in a single run.
Should I use crawl-delay in robots.txt to manage crawl budget?
Almost never. crawl-delay tells crawlers to wait N seconds between requests, and Googlebot ignores the directive entirely (Bing and Yandex honour it). Even for crawlers that respect it, throttling delays indexing of new content and provides no indexing quality benefit. The only legitimate use case is a server with extremely limited capacity being overwhelmed by crawl requests, and in that case fixing server capacity is the right solution, not throttling crawlers.
Does sitemap.xml affect crawl budget?
Submitting an accurate sitemap improves crawl demand for your important pages — Googlebot knows these URLs exist and can prioritise them. It does not increase your crawl rate limit. Including low-quality or duplicate URLs in your sitemap wastes the signal; only include canonical, indexable, high-quality URLs in your sitemap.
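A minimal sketch of generating such a sitemap, keeping only the canonical, indexable URLs you choose to include (the URLs below are placeholders):

```python
# Sketch: emit a minimal sitemap.xml containing only canonical,
# indexable URLs. The input list is assumed to be pre-filtered --
# no parameter variants, no noindexed or redirecting pages.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = loc
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap([
    "https://example.com/",
    "https://example.com/pricing",
])
print(xml)
```

The filtering step before this function is where the real work lives: every URL that reaches build_sitemap should already have passed your canonical and indexability checks.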
How long does it take for crawl budget improvements to show results?
Crawl budget optimisation results are measured in crawl efficiency, not direct rankings. After fixing crawl budget wasters, expect the affected low-quality URLs to drop out of the crawled-URL counts over 4–12 weeks, and important pages to receive more frequent crawl visits. Ranking improvements from previously under-crawled pages can take an additional 4–8 weeks.
Does server speed affect crawl budget?
Yes: server response time is a direct input to Googlebot's crawl rate limit. Faster servers receive more crawl requests per day. Reducing your server's time-to-first-byte (TTFB) to under 200 ms removes a crawl rate constraint for large sites on slow infrastructure.
About the Author

Ishan Sharma
Head of SEO & AI Search Strategy
Ishan Sharma is Head of SEO & AI Search Strategy at seo.yatna.ai. With over 10 years of technical SEO experience across SaaS, e-commerce, and media brands, he specialises in schema markup, Core Web Vitals, and the emerging discipline of Generative Engine Optimisation (GEO). Ishan has audited over 2,000 websites and writes extensively about how structured data and AI readiness signals determine which sites get cited by ChatGPT, Perplexity, and Claude. He is a contributor to Search Engine Journal and speaks regularly at BrightonSEO.