AI Crawler Readiness Checklist

This checklist is the working list used to verify that a site can be read by AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot and their peers. It is drafted from operational evidence gathered across a fleet of static affiliate content sites; specific sites are not named. The checklist is intentionally narrow: it covers what an operator can verify before launch, using curl, the live HTML response, and a sitemap parser.

Each item below has a single failure mode. If the item fails, the site is shipping something other than what the operator believes it is shipping to AI crawlers.

# probe one page with each AI crawler user agent
$ for ua in GPTBot ClaudeBot PerplexityBot OAI-SearchBot Google-Extended CCBot; do
    curl -A "$ua/1.0" -s https://example.com/page/ | grep -c "<h1"
  done

GPTBot          1  ✓  h1 present  ✓  jsonld baked  ✓  200
ClaudeBot       1  ✓  h1 present  ✓  jsonld baked  ✓  200
PerplexityBot   1  ✓  h1 present  ✓  jsonld baked  ✓  200
OAI-SearchBot   1  ✓  h1 present  ✓  jsonld baked  ✓  200
Google-Extended 1  ✓  h1 present  ✓  jsonld baked  ✓  200
CCBot           1  ✓  h1 present  ✓  jsonld baked  ✓  200

# pass — all 6 user agents see the same H1 and the same JSON-LD

What the checklist looks like when it passes — six AI crawler user agents, one shared result.

The checklist

1. Static HTML body contains H1 and the main paragraph text

With JavaScript disabled in the browser, or with a curl request that does not execute scripts, the body of the response must contain the page's H1 and the first one or two paragraphs of the article text as readable HTML — not as placeholder elements waiting for a client-side framework to hydrate them.

How to verify: curl -A "GPTBot/1.0" https://example.com/page/ and search the response for the H1 text. If the response is a near-empty document with a single <div id="root"></div> and a bundle of JavaScript, the page fails this item.
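The same check can be run on the saved response body rather than by eye. The sketch below is a minimal illustration using only Python's standard library; the names `BodyTextProbe` and `static_body_ok`, and the 80-character paragraph threshold, are assumptions made for this example and not part of any existing tool.

```python
from html.parser import HTMLParser


class BodyTextProbe(HTMLParser):
    """Collects H1 text and paragraph text from raw, unexecuted HTML."""

    def __init__(self):
        super().__init__()
        self._stack = []    # currently open h1/p tags, innermost last
        self.h1_texts = []
        self.p_texts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._stack.append(tag)
            (self.h1_texts if tag == "h1" else self.p_texts).append("")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        # Script content arrives here too, but is ignored: no h1/p is open.
        if self._stack:
            target = self.h1_texts if self._stack[-1] == "h1" else self.p_texts
            target[-1] += data


def static_body_ok(html, min_paragraph_chars=80):
    """True when the raw HTML carries a non-empty H1 and real paragraph text."""
    probe = BodyTextProbe()
    probe.feed(html)
    has_h1 = any(text.strip() for text in probe.h1_texts)
    paragraph_chars = sum(len(text.strip()) for text in probe.p_texts)
    return has_h1 and paragraph_chars >= min_paragraph_chars
```

Feeding it the body saved from the curl request above gives a yes/no answer that can be wired into a build step; the threshold should be tuned to the shortest legitimate page on the site.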

2. Per-route JSON-LD is baked into the static HTML head

JSON-LD blocks must appear in the static HTML response, not be injected at runtime by a client-side library. Runtime-injected JSON-LD via document-head libraries is not equivalent to baked JSON-LD for crawlers that do not execute JavaScript.

How to verify: curl the page and grep for application/ld+json. If the only JSON-LD on the page is the homepage's Organization block injected by the framework, but the article's Article or FAQPage block is absent from the raw response, the page fails this item.
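Grepping for the MIME type confirms that a block exists, but not which types it declares. A minimal sketch of extracting the baked `@type` values from the raw response follows; `JsonLdExtractor` and `baked_types` are hypothetical names, and the sketch assumes top-level objects or arrays (it does not descend into `@graph`).

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON-LD blocks from raw, unexecuted HTML."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: treated as absent in this sketch


def baked_types(html):
    """Return the set of @type values found in baked JSON-LD blocks."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    types = set()
    for block in extractor.blocks:
        for item in (block if isinstance(block, list) else [block]):
            if not isinstance(item, dict):
                continue
            declared = item.get("@type", [])
            types.update(declared if isinstance(declared, list) else [declared])
    return types
```

An article route that returns an empty set here, or only `Organization`, is the failure this item describes.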

3. JSON-LD types match the page's content

A page that marks up a product its operator does not sell as a SoftwareApplication, or marks a partner tool review as a Review without an itemReviewed that the reviewer actually operates, fails this item. The schema should describe what the page is (Article, FAQPage, BreadcrumbList, WebPage), not what the operator wishes search engines to think it is.

How to verify: extract every JSON-LD block and confirm the @type against a route-to-types whitelist. The whitelist approach is the basis of the schema_validator.py check used on VisibilityTrace itself.
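The whitelist idea can be sketched in a few lines. This is not the schema_validator.py implementation, which is not shown in this document; the `ROUTE_TYPES` table, its route prefixes, and the function names are hypothetical and chosen only to illustrate a default-deny lookup.

```python
# Hypothetical whitelist: route prefix -> schema types allowed under it.
# Routes with no matching prefix are allowed nothing (default deny).
ROUTE_TYPES = {
    "/reviews/": {"Article", "BreadcrumbList", "FAQPage"},
    "/guides/": {"Article", "BreadcrumbList", "WebPage"},
}


def allowed_types(route, whitelist=ROUTE_TYPES):
    """Return the allowed set for the longest matching route prefix."""
    matches = [prefix for prefix in whitelist if route.startswith(prefix)]
    if not matches:
        return set()
    return whitelist[max(matches, key=len)]


def disallowed_types(route, found_types, whitelist=ROUTE_TYPES):
    """Return the types found on the page that the whitelist does not permit."""
    return set(found_types) - allowed_types(route, whitelist)
```

For example, `disallowed_types("/reviews/some-tool/", {"Article", "SoftwareApplication"})` returns `{"SoftwareApplication"}`, which is the schema-inflation case this item exists to catch.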

4. One canonical URL per route, consistent with the trailing-slash policy

The canonical link tag in the head must point at a single URL per route. That URL must agree with the site's trailing-slash policy — either every internal canonical ends in /, or none does. Mixing the two within a single deployment is a common source of duplicate-URL signals in Search Console and equivalent diagnostics from AI vendors.

How to verify: the canonical sweep counts <link rel="canonical"> tags per response (must be exactly one), parses the URL, and asserts the trailing slash matches the policy declared in site config.
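A minimal sketch of such a sweep for a single response is shown below. The names `CanonicalCollector` and `canonical_problems` are illustrative, the `trailing_slash` argument stands in for whatever the site config declares, and treating the site root as exempt from the slash rule is an assumption of this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalCollector(HTMLParser):
    """Collects the href of every <link rel="canonical"> in the response."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_tokens:
            self.hrefs.append(attributes.get("href") or "")


def canonical_problems(html, trailing_slash=True):
    """Return a list of problems; an empty list means the page passes."""
    collector = CanonicalCollector()
    collector.feed(html)
    if len(collector.hrefs) != 1:
        return [f"expected exactly one canonical tag, found {len(collector.hrefs)}"]
    path = urlsplit(collector.hrefs[0]).path
    if path in ("", "/"):
        return []  # the site root has no meaningful slash choice
    if path.endswith("/") != trailing_slash:
        wanted = "a trailing slash" if trailing_slash else "no trailing slash"
        return [f"canonical path {path!r} violates policy: expected {wanted}"]
    return []
```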

5. Sitemap and robots.txt (and llms.txt, if used) are present at the standard paths

/sitemap.xml and /robots.txt must respond with 200 OK. The sitemap must list every public route the operator intends crawlers to discover, with the same trailing-slash convention as the canonical tags. The optional /llms.txt file — when present — should mirror the high-priority routes for AI consumption and link to the methodology and policy pages.

How to verify: a sitemap parser cross-checks the URL list against the route catalog declared in site config. URLs that appear in the sitemap but not the catalog, or vice versa, are reported as drift.
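The drift report reduces to a set difference once the sitemap is parsed. The sketch below assumes a plain urlset sitemap (not a sitemap index) and a route catalog already expanded to absolute URLs; `sitemap_urls` and `sitemap_drift` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def sitemap_drift(xml_text, catalog_urls):
    """Compare sitemap URLs with the declared catalog, in both directions."""
    listed = sitemap_urls(xml_text)
    catalog = set(catalog_urls)
    return {
        "only_in_sitemap": sorted(listed - catalog),
        "only_in_catalog": sorted(catalog - listed),
    }
```

Because the comparison is on exact strings, a URL that differs only by its trailing slash shows up on both sides of the report, which also surfaces slash drift between the sitemap and the catalog.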

6. UA-probe evidence matches what the page intends to expose

A controlled probe with each AI crawler's declared user agent should produce a response that contains the same H1 and main paragraph text as the rendered browser view. The check exists because some hosts gate requests by user agent, or strip parts of the response for bots they treat as unwanted traffic. The operator should know, before launch, whether their host does this — and whether the result is what they intended.

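Recommended probe set: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, CCBot. Compare the H1 text and the first 200 characters of body copy across all six responses and the rendered browser view. Anything other than identical strings is worth investigating. Note that Google documents Google-Extended as a robots.txt control token rather than a distinct HTTP user-agent string, so that probe shows how the host reacts to the string, not what Google's crawlers actually send.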
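The comparison can be sketched as a fingerprint of each saved response. In the example below, `responses` is assumed to be a dict mapping a probe label to the HTML that probe received, with the rendered browser view stored under the hypothetical key `"browser"`; all names are illustrative.

```python
from html.parser import HTMLParser


class Fingerprint(HTMLParser):
    """Captures the first H1 and the running paragraph text of a page."""

    def __init__(self):
        super().__init__()
        self._open = None       # "h1", "p", or None
        self._h1_done = False
        self.h1_parts = []
        self.p_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._h1_done:
            self._open = "h1"
        elif tag == "p":
            self._open = "p"

    def handle_endtag(self, tag):
        if tag == self._open:
            if tag == "h1":
                self._h1_done = True
            else:
                self.p_parts.append(" ")  # keep paragraphs separated
            self._open = None

    def handle_data(self, data):
        if self._open == "h1":
            self.h1_parts.append(data)
        elif self._open == "p":
            self.p_parts.append(data)


def fingerprint(html):
    """Return (H1 text, first 200 characters of body copy), whitespace-normalised."""
    parser = Fingerprint()
    parser.feed(html)
    h1 = " ".join("".join(parser.h1_parts).split())
    copy = " ".join("".join(parser.p_parts).split())
    return h1, copy[:200]


def mismatched_probes(responses, baseline="browser"):
    """Return the labels whose fingerprint differs from the baseline's."""
    expected = fingerprint(responses[baseline])
    return sorted(
        name for name, html in responses.items() if fingerprint(html) != expected
    )
```

An empty list means every probe saw the same H1 and opening copy as the browser; any label that appears is a response worth opening and reading.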

7. Header behaviour: no aggressive cache-busting on read-only assets

AI crawlers will revisit pages on their own schedule. If the cache headers on the HTML response or on the sitemap force a fresh fetch on every request, that behaviour can be misread as instability by crawlers that prefer stable content. Conversely, headers that mark content as immutable for years prevent corrections from propagating. The middle ground — short max-age, must-revalidate — is what most static-site hosts produce by default; the failure mode is overriding that with custom headers without understanding the trade-off.
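This item has no single command, but the Cache-Control value returned by `curl -sI` can be sorted into the three cases described above. The sketch below is one possible classification; the 30-day cut-off for "too sticky", the bucket names, and the decision to treat only `no-store` as forcing a refetch are assumptions of this example, not thresholds taken from any crawler's documentation.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into {directive: argument or None}."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            directives[name.lower()] = argument.strip('"') or None
    return directives


def classify_cache_control(value, sticky_after=30 * 24 * 3600):
    """Sort a header value into one of three illustrative buckets."""
    directives = parse_cache_control(value)
    if "no-store" in directives:
        return "forces-refetch"
    max_age = directives.get("max-age")
    seconds = int(max_age) if max_age and max_age.isdigit() else 0
    if "immutable" in directives or seconds > sticky_after:
        return "too-sticky"
    return "revalidating"
```

A value such as `public, max-age=0, must-revalidate` lands in the "revalidating" bucket, which corresponds to the middle ground described above.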

8. Response codes are clean on the canonical URL set

Every URL in the sitemap must respond with 200 OK on a HEAD request from a vanilla user agent and from each AI crawler user agent in the probe set. Redirect chains longer than one hop, or redirects that change protocol or host mid-chain, fail this item.
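Once the hops for a URL have been recorded (for instance from the status lines and Location headers printed by `curl -sIL`), the pass/fail rule can be applied offline. The sketch below reads the rule as: the final response must be 200, at most one redirect is tolerated, and any hop that changes scheme or host is flagged. That reading, and the name `chain_problems`, are assumptions of this example.

```python
from urllib.parse import urlsplit


def chain_problems(hops):
    """hops: [(status_code, url), ...] in the order the responses arrived."""
    problems = []
    final_status = hops[-1][0]
    if final_status != 200:
        problems.append(f"final status {final_status}, expected 200")
    redirects = len(hops) - 1
    if redirects > 1:
        problems.append(f"{redirects} redirects, more than one hop")
    # Compare each hop with the next one to spot scheme or host changes.
    for (_, before), (_, after) in zip(hops, hops[1:]):
        a, b = urlsplit(before), urlsplit(after)
        if (a.scheme, a.netloc) != (b.scheme, b.netloc):
            problems.append(f"scheme or host changed: {before} -> {after}")
    return problems
```

Running it once per user agent in the probe set, and requiring an empty list every time, matches the per-agent requirement stated above.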

[Figure: 6-UA probe grid: one curl, six user-agent strings, one pass/fail matrix]

Common failure modes

The recurring failure patterns observed across a fleet of static sites fall into a small number of categories.

Empty-body SPA shipping behind a CDN

The most common failure mode in the fleet was a single-page application that rendered fully in a logged-in browser but shipped a near-empty body to a non-JavaScript crawler. The operator's own browser tests passed because the operator had JavaScript enabled and the CDN's edge cached the hydrated DOM. Crawlers received the raw template. The fix is a static-site generator with a body-content probe in the build.

JSON-LD injected at runtime by a head-management library

Frameworks that use a document-head library to inject meta tags and JSON-LD at runtime produce pages whose head looks correct in a browser but whose raw response, as fetched by curl, lacks the injected tags. The fix is either server-side rendering of the head or build-time generation of the head into the static response.

Trailing-slash drift between canonical, sitemap, and internal links

A common pattern is: canonical tags end in /, the sitemap omits the trailing slash, and internal links use a mix of both. Crawlers see duplicate URLs and ranking signals split between them. The fix is to declare the policy once in site config and enforce it across all three surfaces with an automated check.

Schema-type inflation for affiliate or aggregator pages

Affiliate review pages sometimes mark themselves as SoftwareApplication or Product for tools the operator does not sell. This is a category error and surfaces as rich-result eligibility failures in Search Console. The fix is a route-to-types whitelist that allows only the schema types the page genuinely is.

Robots.txt rules that block the AI crawlers the operator intends to allow

Default-deny robots files written when GPTBot first appeared sometimes still block crawlers that later changed their user-agent string, or fail to allow a newly registered crawler the operator does intend to admit. A periodic review of the robots.txt against the user-agent tokens each AI vendor currently documents is the cheapest fix.
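That review can be partly automated with Python's standard `urllib.robotparser`. In the sketch below, `INTENDED_AGENTS` is a hypothetical list the operator maintains by hand, and `blocked_agents` is a name chosen for this example. One caveat: the standard-library parser applies the first matching rule in a group, which can differ from the longest-match behaviour some crawlers document, so a clean result here is a hint rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of agents the operator intends to admit.
INTENDED_AGENTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "OAI-SearchBot", "Google-Extended", "CCBot",
]


def blocked_agents(robots_txt, url, agents=INTENDED_AGENTS):
    """Return the intended agents that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]
```

Passing the text of the live /robots.txt and a representative article URL returns the agents that are shut out; an empty list means every intended agent is admitted under this parser's reading of the rules.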

How to run the check

For a single page, the minimum sequence is:

1. Run curl -A "GPTBot/1.0" -s https://example.com/page/ | grep -c "<h1" to confirm exactly one H1 in the static response. grep -c counts matching lines, so for minified single-line HTML use grep -o "<h1" | wc -l instead.
2. Run curl -A "GPTBot/1.0" -s https://example.com/page/ | grep -c "application/ld+json" to confirm baked JSON-LD blocks.
3. Run curl -A "GPTBot/1.0" -sI https://example.com/page/ to confirm a 200 status (HTTP/2 200 or HTTP/1.1 200 OK) and inspect the cache headers.
4. Repeat with the ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, and CCBot user-agent strings.
5. Diff the H1 and first paragraph across all six probes and the rendered browser view.

For a multi-page site the same sequence is wrapped in a script that consumes the sitemap and reports per-URL pass/fail. The "What AI bots see" page explains why these specific probes catch the failure modes that matter.
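A minimal sketch of such a wrapper is given below. It is not the script used on the fleet; the names `http_fetch`, `page_passes`, and `site_report` are invented for this example, the pass rule is reduced to the three single-page checks (200 status, exactly one `<h1`, JSON-LD present), and the fetch function is injectable so the loop can be exercised without network access.

```python
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
          "OAI-SearchBot", "Google-Extended", "CCBot"]
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def http_fetch(url, agent):
    """Fetch url with the given agent token; return (status, body text)."""
    request = urllib.request.Request(url, headers={"User-Agent": f"{agent}/1.0"})
    try:
        with urllib.request.urlopen(request, timeout=15) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        return error.code, ""  # 4xx/5xx arrive as exceptions


def page_passes(status, html):
    """Reduced pass rule: 200, exactly one <h1, and a JSON-LD block present."""
    return status == 200 and html.count("<h1") == 1 and "application/ld+json" in html


def site_report(sitemap_xml, fetch=http_fetch, agents=AGENTS):
    """Return {url: [agents that failed]} for every URL in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    report = {}
    for url in urls:
        report[url] = [
            agent for agent in agents if not page_passes(*fetch(url, agent))
        ]
    return report
```

A URL mapped to an empty list passed for every agent; any agent named in a list points at the probe to rerun by hand with curl.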