Friday, March 27, 2026
Why Google's Crawl Limit Clarification Demands a Rethink of Your Crawl Budget Strategy
9 min read

Google just told us we've been optimising crawl budget wrong. After five years of treating the one-million-page rule as gospel, Googlebot's behaviour turns out to be constrained by something quite different: how fast your database responds to a crawler's request. In January 2026, this shift became official when Google reduced its HTML crawl limit from 15MB to 2MB[1] whilst Gary Illyes, a member of Google's Search Relations team, disclosed that infrastructure speed matters more than site scale.[2]
For large websites and frequently updated platforms, this distinction matters enormously. A site wasting crawl budget on low-value URLs still burns resources, but the real bottleneck isn't the number of pages you're asking Google to crawl. It's whether your infrastructure can serve those pages quickly enough to satisfy the crawl capacity limit whilst generating sufficient crawl demand through quality content. This revelation contradicts half a decade of SEO guidance built around URL inventory management and forces a fundamental reassessment of what crawl budget optimisation actually means in the context of technical SEO implementation.
The Evolution of Crawl Budget Guidance: Why the One-Million-Page Rule No Longer Tells the Full Story
The widespread counterargument holds that crawl budget optimisation remains irrelevant for 99% of websites. Most sites operate below the one-million-page threshold and publish new content infrequently enough that Google crawls it the same day it goes live. Industry sceptics point out that the 2MB HTML limit affects only extreme outliers: HTTPArchive data shows the median HTML file size across the web is 33 kilobytes, with the 90th percentile at 155 kilobytes.[3] Pages exceeding 2 megabytes sit at the very tail of the distribution, making them genuine statistical anomalies.
This argument is technically correct but strategically incomplete. Asked about the threshold, Gary Illyes said: 'I would say 1 million is probably okay,' indicating it has remained unchanged since 2020.[4] However, the stability of this threshold masks a more fundamental shift in how Google allocates crawl resources within that boundary, a shift that can be monitored through Google Search Console's crawl budget reporting features. The efficiency-over-scale principle applies to medium-sized sites with 50,000+ pages and dynamic content far more than previously understood.
Database Latency Over URL Quantity: What Gary Illyes Actually Revealed About Resource Consumption
The most significant revelation in Google's updated guidance centres on where computational resources are actually consumed. Gary Illyes stated: 'If you are making expensive database calls, that's going to cost the server a lot,' indicating that database latency is a primary constraint on the crawl capacity limit, separate from page volume.[5]
This fundamentally reframes the crawl budget equation. A site with 500,000 pages that trigger complex database queries will face more crawl budget pressure than a static site with two million pre-rendered pages. The distinction matters because it shifts optimisation focus from content architecture to infrastructure architecture, making database response time a crucial component of technical SEO strategy.
Illyes further clarified: 'It's not crawling that is eating up the resources, it's indexing and potentially serving, or what you are doing with the data when you are processing that data'.[6] This revelation exposes a critical misunderstanding in how SEOs conceptualise resource consumption. The act of fetching an HTML file is computationally trivial compared to processing, analysing, and storing that content in Google's index.
For practical implementation, this means crawl budget optimisation now requires collaboration between SEO teams and database administrators. Query optimisation, connection pooling, and caching strategies become SEO considerations, not merely backend engineering decisions.
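One of those shared concerns is caching. As a minimal illustration of the idea (not any particular platform's implementation), the sketch below serves repeat page requests, including Googlebot's, from an in-memory TTL cache instead of re-running an expensive query; `fetch_product_from_db` is a hypothetical stand-in for a real database call.

```python
import time

_cache = {}
TTL_SECONDS = 300  # five minutes; tune to how often the data actually changes

def fetch_product_from_db(product_id):
    # Placeholder for an expensive query (joins, aggregation, etc.).
    return {"id": product_id, "name": f"Product {product_id}"}

def get_product(product_id):
    now = time.monotonic()
    hit = _cache.get(product_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: no database work
    data = fetch_product_from_db(product_id)
    _cache[product_id] = (now, data)       # cache miss: store with timestamp
    return data
```

In production this role is usually played by a dedicated cache such as Redis or a CDN edge cache, but the principle is the same: the crawler's request never reaches the slow query path.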
Google's own documentation supports this infrastructure-first approach. The crawl capacity limit adjusts based on crawl health: 'If the site responds quickly for a while, the limit goes up. If the site slows down or responds with server errors, the limit goes down and Google crawls less'.[7] This dynamic adjustment mechanism rewards speed over scale, with site owners able to monitor these capacity adjustments directly through Google Search Console reporting.
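Google does not publish the actual adjustment algorithm, but the feedback loop it describes can be sketched as a toy model; the step sizes, thresholds, and bounds below are invented purely for illustration.

```python
# Capacity rises while responses stay fast and error-free, and falls sharply
# on slowdowns or server errors. All numbers here are illustrative.

def adjust_capacity(limit, response_ms, status, lo=10, hi=1000):
    healthy = status < 500 and response_ms < 300
    if healthy:
        return min(hi, int(limit * 1.1))   # fast and healthy: crawl more
    return max(lo, int(limit * 0.5))       # slow or erroring: back off

limit = 100
for ms, status in [(120, 200), (150, 200), (900, 200), (80, 503)]:
    limit = adjust_capacity(limit, ms, status)
```

The asymmetry is the point: in a model like this, capacity grows gradually but collapses quickly, which is why a single bad deployment or database slowdown can depress crawling for days afterwards.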
The 2MB Crawl Limit Clarification: Technical Reality vs. Practical Impact
Google's reduction of the HTML crawl limit from 15 megabytes to 2 megabytes, with PDFs capped at 64 megabytes, appears inconsequential by current web standards. The HTTPArchive data reinforces this perception, showing that the vast majority of web content operates well within the new boundaries.
However, this analysis misses the strategic signal embedded in the change. The 2MB limit represents Google's attempt to discourage resource-intensive page architectures that burden both Googlebot and user experience whilst increasing crawl demand through cleaner, more efficient content delivery. Pages approaching 2MB typically suffer from bloated inline styles, embedded base64 images, or server-side rendering of massive data sets within the HTML document itself.
The limit also highlights the distinction between efficient and inefficient content delivery. A properly optimised e-commerce category page displaying 100 products should load core HTML under 200KB, with product images, reviews, and dynamic content loaded asynchronously. A poorly optimised equivalent might embed all product data, user-generated content, and navigation elements directly in the HTML, pushing file sizes toward the 2MB threshold.
This architectural difference has cascading effects on crawl budget consumption. The bloated version requires more bandwidth, longer processing time, and more complex indexing operations. Meanwhile, the optimised version allows Googlebot to quickly assess page structure and content hierarchy before deciding which additional resources to fetch.
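A simple payload audit makes these thresholds concrete. The sketch below flags HTML documents that approach the 2MB crawl limit or exceed the 200KB core-HTML budget suggested above; the sample pages are invented for illustration.

```python
CRAWL_LIMIT = 2 * 1024 * 1024   # Google's 2MB HTML crawl limit
CORE_BUDGET = 200 * 1024        # the suggested budget for core HTML

def audit_html(html: str) -> str:
    size = len(html.encode("utf-8"))
    if size > CRAWL_LIMIT:
        return "over crawl limit: content beyond 2MB will be truncated"
    if size > CORE_BUDGET:
        return "over core budget: consider loading data asynchronously"
    return "ok"

lean_page = "<html><body>" + "<li>product</li>" * 1000 + "</body></html>"
bloated_page = "<html><body>" + "x" * (3 * 1024 * 1024) + "</body></html>"
```

A check like this fits naturally into a CI pipeline or crawl audit, catching base64-embedded images or server-rendered data dumps before they ship.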
Measuring What Actually Matters: Server Performance Metrics That Replace Page Count
Traditional crawl budget optimisation focused on URL inventory management: blocking low-value pages through robots.txt, cleaning up duplicate content, and streamlining site architecture. These tactics remain valuable but miss the infrastructure layer that now determines crawl efficiency, a layer whose metrics can be tracked through Google Search Console's performance and indexing reports.
Pages with Largest Contentful Paint (LCP) exceeding 2.5 seconds, Interaction to Next Paint (INP) exceeding 200ms, and Cumulative Layout Shift (CLS) exceeding 0.1 directly reduce the crawl capacity limit as Googlebot can process fewer pages per session.[8] This connection between Core Web Vitals and crawl budget reveals how user experience metrics and search engine accessibility intertwine within technical SEO implementation.
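Those three thresholds are easy to encode in a monitoring helper. The sketch below classifies a page's field metrics against them; the metric field names are assumptions, not any particular tool's schema.

```python
# Core Web Vitals thresholds cited above: LCP 2.5s, INP 200ms, CLS 0.1.
THRESHOLDS = {"lcp_s": 2.5, "inp_ms": 200, "cls": 0.1}

def failing_vitals(metrics: dict) -> list:
    """Return the names of any Core Web Vitals exceeding their threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

page = {"lcp_s": 3.1, "inp_ms": 180, "cls": 0.25}
```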

Server-side rendering (SSR) reduces crawl budget consumption by delivering fully rendered pages to Googlebot without requiring it to execute resource-heavy JavaScript, allowing crawlers to focus their resources on critical content rather than JavaScript files and API calls. This architectural choice eliminates the computational overhead of client-side rendering whilst giving Googlebot complete content access: a technical SEO best practice that improves both crawling efficiency and user experience.
Crawl Efficiency in Practice: Real Case Studies Showing Performance Beats Scale
The theoretical shift from quantity-focused to efficiency-focused crawl budget optimisation gains credibility through documented case studies showing measurable results. A mid-size e-commerce platform reduced crawl waste from 45% to 12% within 90 days through robots.txt optimisation, sitemap cleanup, and server improvements, improving new product indexing from 21 days to 4 days and increasing organic traffic from 125K/month to 198K/month (58% increase).[9] The improvements were clearly trackable through Google Search Console's crawl statistics, which showed increased crawl demand for valuable product pages.
This case study illustrates the compounding effects of infrastructure optimisation. Faster server response times allowed Googlebot to crawl more pages per session within the crawl capacity limit. Reduced crawl waste meant higher percentage of crawl budget allocated to valuable content. Improved indexing speed enabled faster ranking of new products. The 58% traffic increase demonstrates how crawl budget optimisation translates to business results when executed systematically with proper technical SEO implementation.
Technical Optimisation Strategies That Actually Move the Needle
Gary Illyes confirmed at Search Central Live Asia Pacific 2025 that soft 404 pages (returning 200 OK but indicating 'not found' or 'out of stock') consume crawl budget despite their success status code.[10] This clarification resolves a long-standing confusion about HTTP status codes and crawl budget allocation within technical SEO best practices.
The practical implication: e-commerce sites must implement proper 404 responses for discontinued products rather than displaying 'out of stock' messages on pages that return 200 OK status codes. This technical detail prevents ongoing crawl budget consumption for products that will never return to inventory whilst preserving any existing backlinks' authority through proper redirects.
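The status-code policy described above can be expressed as a small routing rule. This is an illustrative sketch, not a drop-in implementation; the product fields (`discontinued`, `successor_url`) are invented names.

```python
def status_for_product(product: dict):
    """Map a product's state to the HTTP status its page should return."""
    if product.get("discontinued"):
        if product.get("successor_url"):
            return 301, product["successor_url"]   # preserve backlink equity
        return 410, None        # gone permanently: stop spending crawl budget
    return 200, None            # live product, even if temporarily out of stock
```

Usage: `status_for_product({"discontinued": True})` yields `(410, None)`, whereas a temporarily out-of-stock product still returns 200, because it genuinely exists and should stay indexed.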
Google explicitly states that using noindex meta tags does not save crawl budget because Googlebot must fetch and read the page to detect the noindex directive. Crawl budget is preserved only through robots.txt disallow rules or proper HTTP status codes (404/410).[11] This technical distinction has significant implications for large sites attempting to manage crawl budget through meta tags, a common misconception in technical SEO implementation that can be monitored through Google Search Console's indexing reports.
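The mechanism is straightforward to demonstrate with Python's standard-library robots.txt parser: a compliant crawler consults robots.txt before requesting a URL, so disallowed paths are never fetched at all, whereas a noindex directive is only discoverable after the page has already been downloaded. The rules and URLs below are invented for illustration.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",   # crawl budget saved: these URLs are never requested
    "Disallow: /cart",
])

# A noindex page offers no such pre-fetch signal: the crawler must download
# it to see the directive, consuming crawl budget either way.
blocked = not rp.can_fetch("Googlebot", "https://example.com/search?q=shoes")
allowed = rp.can_fetch("Googlebot", "https://example.com/products/widget")
```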
Pages buried at crawl depth 3 or deeper (3+ clicks from homepage) generally perform worse in organic search results because finite crawl budget means these pages are visited less frequently than shallow-depth pages.[12] The architectural implication: site structure should prioritise flat hierarchies that minimise database joins and complex query operations, not just user experience considerations, whilst ensuring high-value deep pages receive sufficient backlinks to maintain crawl demand.
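Crawl depth is just shortest-path distance from the homepage over the internal link graph, so a breadth-first search surfaces buried pages quickly. The site graph below is invented for illustration.

```python
from collections import deque

def crawl_depths(links: dict, home: str) -> dict:
    """Breadth-first search: clicks from the homepage to each internal page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:          # first visit = shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

site = {
    "/": ["/category", "/about"],
    "/category": ["/sub"],
    "/sub": ["/deep-product"],
}
depths = crawl_depths(site, "/")
buried = [page for page, d in depths.items() if d >= 3]
```

Run against a real crawl export, a list like `buried` identifies the high-value pages that need internal links from shallower templates.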
Google's official guidance specifies two mechanisms to increase crawl budget: (1) Add more server resources if hitting 'Hostload exceeded' errors, or (2) Optimise content quality and popularity signals (uniqueness, freshness, authority) to increase crawl demand.[13] The first mechanism acknowledges infrastructure constraints as primary limiting factors for the crawl capacity limit, whilst the second recognises that high-quality backlinks and content freshness directly influence how frequently Google attempts to crawl your site.
Why This Matters More in an AI-Dominated Search Landscape
The efficiency-first approach to crawl budget optimisation gains additional urgency in an AI-dominated search landscape. SISTRIX analysed over 100 million German keywords and found AI Overviews now appear on about 20% of them, up from 17% in August.[14] This reduction in organic visibility increases the importance of crawl efficiency for content discovery, making technical SEO implementation more crucial than ever for maintaining search visibility.
When AI systems synthesise search results from multiple sources, they require access to comprehensive, up-to-date content across the web. Sites with efficient crawl budget utilisation ensure their latest content is available for AI training and response generation. Sites with inefficient crawl patterns risk having stale content represented in AI-generated responses, particularly when their crawl demand is insufficient to attract regular Googlebot visits.
The computational demands of AI processing also influence how Google allocates crawl resources. Content that requires expensive post-processing for AI understanding may receive lower crawl priority than efficiently structured, semantically clear content. This creates an additional incentive for infrastructure optimisation beyond traditional SEO metrics, with site performance becoming increasingly important within technical SEO strategy.
The Counter-Narrative: When Database Optimisation Isn't Worth the Effort
Critics argue correctly that the database-first approach to crawl budget optimisation creates unnecessary technical complexity for sites that don't need it. A local business website with 200 pages updated monthly gains nothing from query optimisation or caching strategies. A corporate brochure site with 1,000 static pages faces no meaningful crawl capacity limit constraints regardless of server response times, and basic technical SEO implementation suffices for their needs.
The one-million-page threshold remains relevant for determining whether crawl budget optimisation deserves engineering resources. Gary Illyes confirmed this threshold's continued validity,[15] suggesting that Google's internal systems still use page count as an initial filter for crawl budget allocation decisions, easily verifiable through Google Search Console reporting.
However, the counter-narrative breaks down for dynamic sites, e-commerce platforms, news publishers, and content aggregators. These site types generate pages programmatically, update inventory frequently, and serve personalised content that triggers database queries. For these architectures, the efficiency-first approach provides measurable competitive advantage in content discovery and indexing speed, particularly when combined with strategic backlinks that boost crawl demand for key pages.
The tipping point appears around 50,000 pages with regular updates or any site architecture that generates pages through database queries. Below this threshold, traditional URL management approaches remain sufficient. Above this threshold, infrastructure optimisation becomes the primary lever for crawl budget improvement within comprehensive technical SEO audits.
Beyond 2026: What the Infrastructure-First Era Means for SEO
The next evolution in crawl budget optimisation won't be about blocking more URLs or creating flatter site structures. It will be about building infrastructure that responds in milliseconds, not seconds, whilst strategically developing backlinks that maintain strong crawl demand signals. Organisations that treat database performance as a search visibility issue (not just an engineering problem) will control their crawl destiny in 2026 and beyond, with Google Search Console providing the essential monitoring capabilities to track their progress.
This shift demands new collaborations between SEO teams and infrastructure teams. Database administrators become SEO stakeholders. DevOps engineers need to understand crawl patterns. Site reliability engineers must consider Googlebot performance in their monitoring and alerting systems. Technical SEO specialists must understand both infrastructure optimisation and the role of backlinks in maintaining crawl demand.
The implications extend to technology selection decisions. Content management systems will be evaluated for crawl efficiency, not just editorial workflow. Hosting providers will compete on crawler-optimised infrastructure. CDN configurations will balance user experience with search engine accessibility. All of these decisions now fall under the expanded umbrella of technical SEO implementation.
Most significantly, the efficiency-first approach rewards sites that prioritise architectural excellence over content volume. A well-engineered site with 100,000 pages will outperform a poorly-engineered site with 500,000 pages in search visibility and user satisfaction. This alignment between technical quality and search success represents a fundamental improvement in how search engines reward best practices.
Footnotes
1. https://developers.google.com/search/docs/crawling-indexing/googlebot#how-googlebot-accesses-your-site
2. How Googlebot Crawls the Web, https://www.youtube.com/watch?v=iGguggoNZ1E&t=1s
3. HTTP Archive's annual state of the web report, https://almanac.httparchive.org/en/2025/
4. How Googlebot Crawls the Web, https://www.youtube.com/watch?v=iGguggoNZ1E&t=1s
5. How Googlebot Crawls the Web, https://www.youtube.com/watch?v=iGguggoNZ1E&t=1s
6. How Googlebot Crawls the Web, https://www.youtube.com/watch?v=iGguggoNZ1E&t=1s
7. Optimize your crawl budget, https://developers.google.com/crawling/docs/crawl-budget
8. Understanding Core Web Vitals and Google search results, https://developers.google.com/search/docs/appearance/core-web-vitals
9. Crawl Budget Optimization: The Complete Guide to Maximizing Google's Crawling Efficiency, 12 January 2026, https://www.linkgraph.com/blog/crawl-budget-optimization-2/
10. Google: Soft 404s Use Crawl Budget Despite 200 OK Status, Matt G. Southern, 28 July 2025, https://www.searchenginejournal.com/google-soft-404s-use-crawl-budget-despite-200-ok-status/552301/
11. How To Fix "Discovered ‐ Currently Not Indexed" in Google Search Console, Bartosz Góralewicz, 13 March 2026, https://www.onely.com/blog/how-to-fix-discovered-currently-not-indexed-in-google-search-console/
12. Crawl Depth in SEO: How to Increase Crawl Efficiency, Suraj Lalchandani, 7 March 2024, https://www.seoclarity.net/blog/what-is-crawl-depth
13. Optimize your crawl budget, https://developers.google.com/crawling/docs/crawl-budget
14. AI Search: How the data shows that Google is winning – yet everything is changing, Johannes Beus, 19 March 2026, https://www.sistrix.com/blog/ai-search-how-data-shows-that-google-is-winning/
15. How Googlebot Crawls the Web, https://www.youtube.com/watch?v=iGguggoNZ1E&t=1s
Related glossary terms
Crawl Budget
Crawl budget is the number of URLs that Googlebot can and wants to crawl on a website within a given timeframe, determined by crawl capacity and demand factors.
Noindex Tag
A noindex tag is an HTML meta tag or HTTP response header that instructs search engines not to include a specific webpage in their search results.
robots.txt
robots.txt is a plain text file implementing the Robots Exclusion Protocol that communicates crawling preferences to web crawlers, though compliance is entirely voluntary.