Should You Care About Crawl Budget? No, but…

Whether you’re an SEO pro or have a passing familiarity with SEO, or (especially) if you’ve ever gotten an SEO proposal, you will have heard about crawl budget.

The concept is pretty simple. Even from context you can probably figure out that it’s the (theoretical) resources available to crawl online content.

Search engines need to know what’s out there on the internet but they don’t want to spend a lot of resources finding out.

The question, then, is whether this is a problem for search engines (after all, they’re the ones who want to know what you have on your site) or a problem that you need to help them solve (after all, you want to appear in search results).

The answer lies somewhere between “it depends” and “not really”, but the reason why will change how you think about your site.

How the web gets read

First, we should just be clear on the concept that Googlebot (as a crawler) is a program.

Like any other crawler, it visits URLs, downloads the HTML, and passes the content to Google’s indexing system. Then it moves on to the next URL.
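If it helps to see that loop written down, here is a minimal sketch. It is nothing like Googlebot’s real code: the `crawl` function, the `fetch` callable, and the in-memory `index` are all names I made up for illustration, and a real crawler would also discover new links as it goes.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          index: Dict[str, str],
          max_pages: int = 100) -> None:
    """Visit each URL once, download its HTML, and hand the copy to the index."""
    queue = deque(seed_urls)
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)   # download the page
        index[url] = html   # store the copy; search reads this, not the live page


# A stand-in for the live web so the sketch runs without network access.
live_web = {
    "https://example.com/": "<h1>Home</h1>",
    "https://example.com/pricing": "<p>Price: $10</p>",
}
index: Dict[str, str] = {}
crawl(live_web.keys(), live_web.__getitem__, index)

# The live page changes, but the index keeps the old copy until the next crawl.
live_web["https://example.com/pricing"] = "<p>Price: $12</p>"
print(index["https://example.com/pricing"])
```

The last line prints the old price, which is the whole point: the index only knows what the crawler saw on its last visit.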

That’s essentially it. A very fast, very sophisticated librarian making photocopies.

The indexed copies are what Google searches when you type a query. Not the live web (meaning what’s actually stored on servers).

Which means the quality of your presence in search is only as good as the last time Googlebot came by and took a copy.

For most pages, that’s fine. For some, it matters a lot.

The problem Google is continuously solving

Here’s what makes this interesting.

In 2018, a data scientist at Google named Bill Richoux published a post on the Unofficial Google Data Science Blog. It’s a fascinating read that gives you an insight into how much people can care about something like crawler optimization.

But, more to the point…

It describes, in unusual detail, the actual optimization problem Google had to solve in order to decide when to recrawl web pages.

The problem is harder than it sounds. Google’s index contains billions of pages. Many of those pages change regularly, some every few seconds.

Others haven’t changed in years. And even Google has finite crawl capacity, both globally and per host.

To top it off, your server can only handle so many requests at once before it starts to struggle, so it’s in Googlebot’s interest to be polite about how it requests data.

So: how do you allocate crawl capacity across billions of pages to keep your index as fresh as possible, while wasting as little of that capacity as possible on recrawling pages that haven’t changed?

That’s the problem.

I’ll show you the solution Google’s scientists came up with. But first, a more immediate question.

Should you even care?

Probably not.

Gary Illyes, one of Google’s most candid public voices on search infrastructure, has said publicly that more than 90% of websites don’t need to think about crawl budget.

He’s also pointed out that “crawl budget” as a concept was coined outside Google. Internally, it’s not one thing but rather a collection of metrics that the SEO community bundled into a convenient shorthand.

A term that generates entire agency audits and conference panels doesn’t have a direct internal counterpart at Google.

For a typical site, Googlebot will find what it needs to find. You don’t need to optimize for it.

But “you don’t need to optimize for it” is different from “it doesn’t matter how the system works.”

Within the same video, Gary mentions that you can have indirect control over how much “budget” Googlebot thinks your site deserves.

Understanding the mechanics changes how you think about content, architecture, and freshness, even if you never open a crawl stats report.

The mechanics of crawl budgets (for Google)

Back to Richoux.

The Google team framed the problem as a mathematical optimisation: given crawl capacity constraints, how do you choose recrawl intervals for each page to maximise weighted average freshness across the entire index?
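To make “weighted average freshness” a bit more concrete, here is a rough sketch under a common textbook assumption: pages change at random (a Poisson process) and each page is recrawled on a fixed interval. Under that assumption, the fraction of time the indexed copy matches the live page is (1 - e^(-rate × interval)) / (rate × interval). The function names and the example numbers are mine, not taken from the post.

```python
import math


def expected_freshness(change_rate: float, interval: float) -> float:
    """Average fraction of time the indexed copy matches the live page.

    Assumes Poisson changes at `change_rate` per day and a recrawl
    every `interval` days.
    """
    x = change_rate * interval
    if x == 0:
        return 1.0  # a page that never changes is always fresh
    return (1 - math.exp(-x)) / x


def weighted_freshness(pages):
    """Value-weighted average freshness over (value, change_rate, interval) tuples."""
    total_value = sum(value for value, _, _ in pages)
    weighted = sum(value * expected_freshness(rate, interval)
                   for value, rate, interval in pages)
    return weighted / total_value


def crawls_per_day(pages):
    """Crawl capacity a schedule consumes."""
    return sum(1 / interval for _, _, interval in pages)


# Two hypothetical pages: (value, changes per day, recrawl interval in days).
# Both schedules spend the same budget of one crawl per day.
even_split = [(10.0, 1.0, 2.0), (1.0, 0.01, 2.0)]
skewed = [(10.0, 1.0, 4 / 3), (1.0, 0.01, 4.0)]

for name, schedule in (("even", even_split), ("skewed", skewed)):
    print(name, round(crawls_per_day(schedule), 2), round(weighted_freshness(schedule), 3))
```

With the same total budget, shifting crawls toward the valuable, fast-changing page gives a higher weighted freshness. That trade-off is what the optimisation is trying to get right across billions of pages.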

Their solution was elegant.

Every page has what they called a crawl value function.

It’s a continuously rising score that reflects the growing case for recrawling a page since it was last crawled, and it’s governed by two parameters.

The first is page value, roughly equivalent to PageRank in the sense that links to the page increase its value.

High-authority pages, the ones that are well-linked internally and externally, start with a higher baseline claim on Googlebot’s attention. This is why internal linking is not just a ranking tactic. It’s a crawl tactic.

Pages you don’t link to are pages that may rarely, if ever, be revisited.
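A quick way to spot those pages is to count inbound internal links. The sketch below assumes you already have a link graph from a site crawl (the `site_links` dictionary is invented for the example) and flags pages that nothing else links to.

```python
from typing import Dict, List, Sequence


def inbound_link_counts(links: Dict[str, List[str]]) -> Dict[str, int]:
    """Count how many distinct pages link to each page (self-links ignored)."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def orphan_pages(links: Dict[str, List[str]],
                 entry_points: Sequence[str] = ("/",)) -> List[str]:
    """Pages with no inbound internal links, excluding known entry points."""
    counts = inbound_link_counts(links)
    return sorted(page for page, n in counts.items()
                  if n == 0 and page not in entry_points)


# Hypothetical site: each key is a page, each value the pages it links to.
site_links = {
    "/": ["/guides", "/pricing"],
    "/guides": ["/guides/crawl-budget", "/"],
    "/guides/crawl-budget": ["/pricing"],
    "/pricing": ["/"],
    "/old-landing-page": ["/pricing"],
}
print(orphan_pages(site_links))
```

Here only `/old-landing-page` comes back: it links out, but nothing links in.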

The second is the estimated change rate. Google estimates how often a page meaningfully changes based on crawl history. I’ll get into what “meaningful” means below, but first let’s finish the solution.

If the page tends to be different each time Googlebot visits, that rate goes up, and the page gets visited more often. If it looks the same every time, the rate drops.
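Here is one way such an estimate could be computed from crawl history. This is a sketch under my own assumptions (evenly spaced visits, Poisson changes), not the estimator Google uses. The subtlety is that a crawler only sees whether a page changed between visits, not how many times, so a plain changes-per-visit ratio undercounts fast-changing pages.

```python
import math
from typing import Sequence


def estimate_change_rate(changed: Sequence[bool], interval_days: float) -> float:
    """Estimate changes per day from a crawl history.

    changed[i] is True if visit i found the page different from the previous
    copy. Visits are assumed to be `interval_days` apart. The log corrects for
    multiple changes hiding between two visits, and the +0.5 terms keep the
    estimate finite when every single visit saw a change.
    """
    n = len(changed)
    if n == 0:
        raise ValueError("need at least one visit")
    unchanged = n - sum(changed)
    return -math.log((unchanged + 0.5) / (n + 0.5)) / interval_days


# Hypothetical histories.
news_homepage = [True] * 10                 # visited daily, different every time
evergreen_guide = [False] * 9 + [True]      # visited monthly, changed once

print(estimate_change_rate(news_homepage, interval_days=1))
print(estimate_change_rate(evergreen_guide, interval_days=30))
```

The first page gets a rate of several changes per day and the second a small fraction of one, which is the gap that ends up driving how often each gets revisited.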

The two parameters combine into a function that rises over time since the last crawl. When the score hits a threshold, the page gets recrawled. When it doesn’t, it doesn’t.

And here’s an interesting implication: for pages with sufficiently low value and sufficiently low change rate, the function never reaches the threshold. Those pages may never be recrawled so they’re just sitting on your site providing ZERO value for SEO.

That’s the mathematical basis for crawl waste.

Not “Googlebot can’t find your pages.” It can. It just doesn’t think they’re worth coming back to.
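To see the threshold mechanic in action, here is a deliberately simple toy model. The curve below is something I made up for illustration (a score that climbs toward a ceiling set by page value, at a speed set by change rate). It is not the function derived in the post, so treat the numbers as a cartoon of the idea.

```python
import math
from typing import Optional


def crawl_score(value: float, change_rate: float, days_since_crawl: float) -> float:
    """Toy score: rises with time since the last crawl and levels off at `value`."""
    return value * (1 - math.exp(-change_rate * days_since_crawl))


def days_until_recrawl(value: float, change_rate: float,
                       threshold: float) -> Optional[float]:
    """Days until the toy score reaches the threshold, or None if it never does."""
    if change_rate <= 0 or value <= threshold:
        return None  # the curve levels off at or below the threshold
    return -math.log(1 - threshold / value) / change_rate


THRESHOLD = 2.0  # hypothetical, in the same made-up units as `value`

# (label, value, changes per day)
pages = [
    ("well-linked, changes daily", 10.0, 1.0),
    ("modest value, rarely changes", 3.0, 0.01),
    ("unlinked, rarely changes", 1.0, 0.01),
]
for label, value, rate in pages:
    print(label, days_until_recrawl(value, rate, THRESHOLD))
```

In this toy version the first page is due again within hours, the second waits months, and the third never crosses the line at all.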

Does your site actually need to be recrawled often?

This is the question that most crawl budget advice never asks.

For an ecommerce site, fresh crawl data is a product requirement. A price change that isn’t reflected in the index is a customer experience problem. Out-of-stock products that still appear in search results drive clicks that lead nowhere. Recrawl frequency directly affects revenue.

For a news site or anything time-sensitive, the same logic applies. The indexed version of a story needs to match the live version.

But for a well-researched guide that hasn’t changed in eighteen months?

For a service page whose content is stable?

For a technical documentation page that describes how something works and will continue to work that way?

The index doesn’t need to update frequently for those. And Google’s system, if functioning well, will already reflect this, visiting those pages less often because their estimated change rate is low.

The right question isn’t “how do I get Google to crawl my site more?” It’s “which pages on my site actually benefit from frequent recrawling, and am I encouraging Google to find and revisit them?”

What counts as a meaningful change?

Richoux’s model uses an estimated change rate but Google has to figure out what counts as a change in the first place.

The post doesn’t go into that detail, which is fair enough since it would be out of scope. And to be honest, there’s no magic threshold, no X or Y percent of a page that has to change before the change counts as meaningful.

However, we can reference another known artifact of Google’s architecture: protobufs.

Protocol buffers (protobufs) are Google’s binary serialisation format, used extensively across its internal systems to encode, compare, and transmit structured data.

When Google stores a snapshot of a web page, it isn’t storing a raw HTML file. It’s encoding the content in a structured format that allows precise field-level comparison on the next visit.

Research into Google’s internal data infrastructure shows how this approach works at scale: Google collects “fingerprints that have checksums for the individual fields and locality-sensitive hash (LSH) values for the content” to detect which datasets have meaningfully changed.

The same logic underpins web page change detection.

A minor HTML tweak, a changed date in the footer, a CSS adjustment will not produce a meaningful change in the underlying content representation. The estimated change rate doesn’t move. The recrawl interval doesn’t shorten.

This matters for anyone tempted to “refresh” pages by making cosmetic edits. If the substantive content hasn’t changed, neither has the signal and the crawler will not care.
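You can approximate the idea on your own pages. The sketch below strips out boilerplate regions, normalises what is left, and hashes it. To be clear about the assumptions: the list of ignored tags is my guess at what counts as boilerplate, this uses an exact hash rather than a locality-sensitive one, and none of it is Google’s actual implementation.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these regions are treated as boilerplate, not primary content.
IGNORED_TAGS = {"script", "style", "nav", "aside", "footer"}


class MainTextExtractor(HTMLParser):
    """Collects text that sits outside the ignored tags."""

    def __init__(self):
        super().__init__()
        self._ignored_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        if self._ignored_depth == 0:
            self.chunks.append(data)


def content_fingerprint(html: str) -> str:
    """Hash of the page's main text, ignoring markup, case, and whitespace."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


before = "<main><h1>Crawl budget</h1><p>Most sites can ignore it.</p></main><footer>Updated 2024</footer>"
cosmetic = '<main class="wide"><h1>Crawl budget</h1><p>Most sites can ignore it.</p></main><footer>Updated 2025</footer>'
substantive = "<main><h1>Crawl budget</h1><p>Large sites cannot ignore it.</p></main><footer>Updated 2025</footer>"

print(content_fingerprint(before) == content_fingerprint(cosmetic))
print(content_fingerprint(before) == content_fingerprint(substantive))
```

The footer date and the new CSS class leave the fingerprint untouched. Rewriting the paragraph changes it.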

There’s another data point we can reference: Google’s search documentation on sitemaps, specifically the lastmod element in XML sitemaps.

From the documentation:

For the lastmod element to be useful, first it needs to be in a supported date format (which is documented on sitemaps.org); Search Console will tell you if it’s not once you submit your sitemap. Second, it needs to consistently match reality: if your page changed 7 years ago, but you’re telling us in the lastmod element that it changed yesterday, eventually we’re not going to believe you anymore when it comes to the last modified date of your pages. You can use a lastmod element for all the pages in your sitemap, or just the ones you’re confident about. For instance, some site software may not be able to easily tell the last modification date of the homepage or a category page because it just aggregates the other pages on the site. In these cases it’s fine to leave out lastmod for those pages. And when we say “last modification”, we actually mean “last significant modification”. If your CMS changed an insignificant piece of text in the sidebar or footer, you don’t have to update the lastmod value for that page. However if you changed the primary text, added or changed structured data, or updated some links, do update the lastmod value.
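In practice, that guidance translates into something like the sketch below: write lastmod only for pages where you track the last significant change, and leave it out everywhere else. The `build_sitemap` helper and the example URLs are hypothetical, and how you decide that a change was significant is up to your own CMS.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Iterable, Optional, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, Optional[date]]]) -> str:
    """Build sitemap XML from (url, last_significant_change) pairs.

    Pass None as the date when you can't vouch for it (an aggregating
    homepage, say); the lastmod element is then simply omitted.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_significant_change in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if last_significant_change is not None:
            # YYYY-MM-DD is one of the date formats sitemaps.org accepts.
            ET.SubElement(url, "lastmod").text = last_significant_change.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


sitemap_xml = build_sitemap([
    ("https://example.com/", None),
    ("https://example.com/guides/crawl-budget", date(2025, 3, 14)),
])
print(sitemap_xml)
```

The homepage entry has no lastmod at all, which the documentation says is fine. The guide carries the date its primary text last changed, not the date someone fixed a typo in the footer.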

There’s another risk worth naming directly, though.

When you meaningfully update a page that currently ranks well, you open it up to re-evaluation. Rankings can improve but they can drop instead. The algorithm is reading the page again and making new judgements. Update pages when there’s a genuine reason: new data, an outdated claim, a better explanation. Not because you’re trying to manufacture a freshness signal.

A new reason to care: LLMs favor fresh content

Here’s where things get more interesting and in some sense more urgent.

A September 2025 paper from Waseda University tested seven LLMs as search rerankers. The researchers injected artificial publication dates into passages and measured how rankings shifted.

The results were unambiguous. Every model systematically promoted passages with newer timestamps. The mean publication year of the top-10 results shifted forward by up to 4.78 years. Individual passages moved by as many as 95 ranking positions.

In pairwise tests, a simple date tag reversed 25% of preferences between equally relevant passages. All results were statistically significant (p < 0.05).

Larger models attenuated the effect. None eliminated it.

OK, so how does this relate to what we’ve been discussing here?

This matters because LLM-based reranking is no longer theoretical. As AI-assisted search (AEO/GEO) becomes more central to how results are assembled and presented, the timestamp signal gets amplified by a second layer that is demonstrably biased toward newer content.

You no longer just need Google to have a fresh copy of your page. You need the AI reading that copy to recognise it as current. That means not only having fresher content but also having content that is explicit about its freshness.

All the typical caveats apply: LLMs are developing rapidly, the research may already be outdated, and so forth.
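One low-effort way to be explicit is to publish both dates in structured data, alongside a visible “last updated” line on the page. The sketch below builds schema.org Article markup with datePublished and dateModified. Whether any particular LLM reranker actually reads this markup is an assumption on my part, and the helper name and example values are made up.

```python
import json
from datetime import date

SCRIPT_OPEN = '<script type="application/ld+json">'
SCRIPT_CLOSE = "</script>"


def article_jsonld(headline: str, url: str, published: date, modified: date) -> str:
    """Return a JSON-LD script tag stating when the article was published and last updated."""
    if modified < published:
        raise ValueError("modified date cannot be earlier than published date")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),  # last significant update only
    }
    return SCRIPT_OPEN + json.dumps(data, indent=2) + SCRIPT_CLOSE


snippet = article_jsonld(
    headline="Should You Care About Crawl Budget?",
    url="https://example.com/guides/crawl-budget",
    published=date(2024, 1, 10),
    modified=date(2025, 6, 1),
)
print(snippet)
```

The same rule applies here as with lastmod: only bump dateModified when the content really changed.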

Living content: a new way of thinking about what’s on your site

Now for some slightly indulgent theorizing.

If LLM rerankers systematically favour freshness and the bias is pervasive across model families and sizes, the implications for content strategy are worth thinking through carefully.

There’s a content model we can propose that deserves its own name: living content.

Pages that aren’t published and left to age, but are designed to evolve constantly, with updated examples, current data, and new context added on a rolling basis.

Not manufactured churn, but a genuine content flywheel that keeps the page substantively current over time.

It’s not particularly novel in some fields, but it’s certainly not something we usually consider when we think of informational content at scale.

The case for this approach now rests on two foundations: traditional freshness signals in search ranking, and the recency bias baked into LLM rerankers. That’s a more compelling argument than either one alone.

This deserves a full treatment in a separate post. But the concept is worth naming here, because it changes the frame.

The question stops being “how often should I publish new content?” and becomes “how do I keep my best content earning its place?”

What to actually do, then?

If you’re running a site with a few hundred pages and a reasonable server, you’re wasting time even thinking about “crawl budget”.

Instead, think about which parts of the site always need to have the freshest content in Google’s index.

If you’re running a large site with hundreds of thousands or millions of pages, crawl waste is a real problem worth addressing.

Google’s own engineers note that faceted navigation alone accounts for roughly 75% of the crawling issues they encounter.
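If you want a rough read on how much of Googlebot’s activity on your own site goes to faceted URLs, you can classify the URLs from your server logs. The sketch below assumes you have already extracted the URLs Googlebot requested. The parameter names in `FACET_PARAMS` are hypothetical, so swap in whatever your own filters and sort options use.

```python
from typing import Sequence
from urllib.parse import parse_qs, urlsplit

# Hypothetical facet parameters; replace with the ones your site generates.
FACET_PARAMS = {"color", "size", "sort", "price_min", "price_max"}


def is_faceted(url: str) -> bool:
    """True if the URL's query string contains any known facet parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return any(param in FACET_PARAMS for param in query)


def facet_share(crawled_urls: Sequence[str]) -> float:
    """Fraction of crawled URLs that are faceted variants."""
    if not crawled_urls:
        return 0.0
    return sum(is_faceted(url) for url in crawled_urls) / len(crawled_urls)


# Stand-in for URLs pulled from Googlebot entries in an access log.
crawled = [
    "https://shop.example.com/shoes",
    "https://shop.example.com/shoes?color=red",
    "https://shop.example.com/shoes?color=red&size=9",
    "https://shop.example.com/shoes?sort=price",
    "https://shop.example.com/shoes/trail-runner?utm_source=news",
]
print(f"{facet_share(crawled):.0%} of crawled URLs are faceted")
```

A high share is a hint that crawl capacity is going to filter combinations rather than to the pages you actually want refreshed.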

Eliminate pages that don’t earn their existence.

Fix the internal linking architecture so that valuable pages accumulate authority rather than getting buried.

Make sure your server responds quickly; a slow server directly limits how aggressively Google is willing to crawl. Although this is usually just a question of how much you want to spend on hosting.

For all sites: when you update a page, update it because the content genuinely got better, not because a timestamp changed.

And start thinking about which of your key pages are living, and which ones are just waiting to become stale.