This checklist is the working list used to verify that a site can be read by AI
crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, OAI-SearchBot and
their peers. It is drafted from operational evidence gathered across a fleet of
static affiliate content sites; specific sites are not named. The checklist is
intentionally narrow: it covers what an operator can verify before launch,
using curl, the live HTML response, and a sitemap parser.
Each item below has a single failure mode. If the item fails, the site is shipping something other than what the operator believes it is shipping to AI crawlers.
The checklist
1. Static HTML body contains H1 and the main paragraph text
With JavaScript disabled in the browser, or with a curl request that
does not execute scripts, the body of the response must contain the page's H1 and
the first one or two paragraphs of the article text as readable HTML — not as
placeholder elements waiting for a client-side framework to hydrate them.
How to verify: curl -A "GPTBot/1.0" https://example.com/page/ and
search the response for the H1 text. If the response is a near-empty document with
a single <div id="root"></div> and a bundle of JavaScript,
the page fails this item.
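The same check can be scripted against the raw response body. The following is a minimal sketch, assuming the HTML has already been fetched as a string; the names BodyTextProbe and static_body_passes are hypothetical and not part of any tool named in this checklist.

```python
from html.parser import HTMLParser


class BodyTextProbe(HTMLParser):
    """Collects the text of <h1> and <p> elements from raw, unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.h1_texts = []
        self.paragraphs = []
        self._current = None  # "h1" or "p" while capturing, else None
        self._buffer = []

    def _flush(self):
        # Store whatever was captured, collapsing runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if self._current == "h1" and text:
            self.h1_texts.append(text)
        elif self._current == "p" and text:
            self.paragraphs.append(text)
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._flush()  # an unclosed <p> ends where the next block starts
            self._current = tag

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()


def static_body_passes(html, expected_h1, min_paragraphs=1):
    """True when the raw HTML carries the expected H1 and some paragraph text."""
    probe = BodyTextProbe()
    probe.feed(html)
    probe.close()
    return expected_h1 in probe.h1_texts and len(probe.paragraphs) >= min_paragraphs
```

An empty single-page-application shell returns False here because the parser never sees an h1 or p element, which is the failure this item describes.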
2. Per-route JSON-LD is baked into the static HTML head
JSON-LD blocks must appear in the static HTML response, not be injected at runtime by a client-side library. Runtime-injected JSON-LD via document-head libraries is not equivalent to baked JSON-LD for crawlers that do not execute JavaScript.
How to verify: curl the page and grep for
application/ld+json. If the only JSON-LD on the page is the homepage's
Organization block injected by the framework, but the article's Article
or FAQPage block is absent from the raw response, the page fails
this item.
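A grep confirms that some JSON-LD is present; parsing the blocks shows which types are actually baked in. This is a sketch under the assumption that the raw response is available as a string. JsonLdExtractor, extract_jsonld and has_baked_type are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Parses every <script type="application/ld+json"> block in raw HTML."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # parsed JSON values; None marks a block that failed to parse
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.blocks.append(None)


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def has_baked_type(html, wanted_type):
    """True when a top-level JSON-LD node in the raw HTML declares wanted_type."""
    for block in extract_jsonld(html):
        nodes = block if isinstance(block, list) else [block]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            declared = node.get("@type")
            declared = declared if isinstance(declared, list) else [declared]
            if wanted_type in declared:
                return True
    return False
```

For an article route, has_baked_type(html, "Article") returning False while has_baked_type(html, "Organization") returns True corresponds to the failing case described above.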
3. JSON-LD types match the page's content
A page that marks up a product its operator does not sell as a
SoftwareApplication, or marks a partner tool review as a
Review without an itemReviewed that the reviewer
actually operates, fails this item. The schema should describe what the page is —
Article, FAQPage, BreadcrumbList, WebPage — not what the operator wishes search
engines to think it is.
How to verify: extract every JSON-LD block and confirm the @type
against a route-to-types whitelist. The whitelist approach is the basis of the
schema_validator.py check used on VisibilityTrace itself.
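A route-to-types whitelist can be sketched as follows. This is an illustration only and does not reproduce schema_validator.py; the ROUTE_TYPES table, the route patterns and the function names are all assumptions, and the input is a list of JSON-LD blocks that have already been parsed.

```python
from fnmatch import fnmatchcase

# Hypothetical whitelist: route pattern -> schema types allowed on that route.
ROUTE_TYPES = {
    "/": {"Organization", "WebSite", "WebPage"},
    "/reviews/*": {"Article", "BreadcrumbList", "FAQPage"},
}


def top_level_types(blocks):
    """Gather @type values from top-level nodes and from @graph members."""
    found = set()
    for block in blocks:
        nodes = block if isinstance(block, list) else [block]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            for member in node.get("@graph", [node]):
                if not isinstance(member, dict):
                    continue
                declared = member.get("@type", [])
                declared = [declared] if isinstance(declared, str) else declared
                found.update(t for t in declared if isinstance(t, str))
    return found


def disallowed_types(route, blocks, route_types=ROUTE_TYPES):
    """Return the declared types that the whitelist does not allow on this route."""
    allowed = set()
    for pattern, types in route_types.items():
        if fnmatchcase(route, pattern):
            allowed |= types
    return top_level_types(blocks) - allowed
```

Nested types such as an article's author are deliberately ignored in this sketch; only the nodes that describe the page itself are compared against the whitelist.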
4. One canonical URL per route, consistent with the trailing-slash policy
The canonical link tag in the head must point at a single URL per route. That URL
must agree with the site's trailing-slash policy — either every internal canonical
ends in /, or none does. Mixing the two within a single deployment is
a common source of duplicate-URL signals in Search Console and equivalent
diagnostics from AI vendors.
How to verify: the canonical sweep counts <link rel="canonical">
tags per response (must be exactly one), parses the URL, and asserts the trailing
slash matches the policy declared in site config.
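The sweep described above might look like the sketch below. It assumes the policy is passed in as a boolean rather than read from site config, and it exempts the site root, which always ends in a slash. CanonicalCollector and check_canonical are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class CanonicalCollector(HTMLParser):
    """Records the href of every <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.hrefs.append(attributes.get("href") or "")


def check_canonical(html, trailing_slash=True):
    """Return a list of problems; an empty list means the page passes."""
    collector = CanonicalCollector()
    collector.feed(html)
    collector.close()
    if len(collector.hrefs) != 1:
        return [f"expected exactly one canonical tag, found {len(collector.hrefs)}"]
    path = urlsplit(collector.hrefs[0]).path or "/"
    if path != "/" and path.endswith("/") != trailing_slash:
        wanted = "end in" if trailing_slash else "not end in"
        return [f"canonical path {path!r} should {wanted} a trailing slash"]
    return []
```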
5. Sitemap, robots.txt, and llms.txt are present at the standard paths
/sitemap.xml and /robots.txt must respond with
200 OK. The sitemap must list every public route the operator
intends crawlers to discover, with the same trailing-slash convention as the
canonical tags. The optional /llms.txt file — when present —
should mirror the high-priority routes for AI consumption and link to the
methodology and policy pages.
How to verify: a sitemap parser cross-checks the URL list against the route catalog declared in site config. URLs that appear in the sitemap but not the catalog, or vice versa, are reported as drift.
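A drift report of this kind can be sketched with the standard-library XML parser. The route catalog is assumed here to be a plain list of paths; how it is stored in site config is not specified in this checklist, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_paths(sitemap_xml):
    """Return the set of URL paths listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        urlsplit(loc.text.strip()).path or "/"
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text and loc.text.strip()
    }


def sitemap_drift(sitemap_xml, catalog_routes):
    """Compare the sitemap with the route catalog and report both directions of drift."""
    listed = sitemap_paths(sitemap_xml)
    catalog = set(catalog_routes)
    return {
        "missing_from_sitemap": sorted(catalog - listed),
        "missing_from_catalog": sorted(listed - catalog),
    }
```

Because paths are compared as exact strings, a sitemap entry of /reviews against a catalog entry of /reviews/ is reported as drift in both directions, which also surfaces trailing-slash mismatches.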
6. UA-probe evidence matches what the page intends to expose
A controlled probe with each AI crawler's declared user agent should produce a response that contains the same H1 and main paragraph text as the rendered browser view. The check exists because some hosts gate requests by user agent, or strip parts of the response for bots they treat as unwanted traffic. The operator should know, before launch, whether their host does this — and whether the result is what they intended.
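Recommended probe set: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, CCBot. Compare the H1 text and the first 200 characters of body copy across all six responses and the rendered browser view. Anything other than identical strings is worth investigating. Note that Google-Extended is a robots.txt control token rather than a distinct HTTP user agent (Google fetches with its existing crawler user agents), so a probe sent under that name only shows how the host reacts to the string itself.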
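The comparison step can be sketched as a small function. It assumes the H1 and body text have already been extracted for each probe and for the browser view; the "browser" label and the function name are hypothetical.

```python
def probe_mismatches(fingerprints, baseline="browser"):
    """Return the probe labels whose H1 or first 200 body characters differ from the baseline.

    fingerprints maps a probe label to a (h1_text, body_text) pair.
    """
    def normalise(pair):
        h1_text, body_text = pair
        # Collapse whitespace so formatting differences are not reported as content drift.
        return " ".join(h1_text.split()), " ".join(body_text.split())[:200]

    expected = normalise(fingerprints[baseline])
    return sorted(
        label for label, pair in fingerprints.items()
        if normalise(pair) != expected
    )
```

An empty list means every probe saw the same leading content as the browser view; any label in the result is a response worth inspecting by hand.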
7. Header behaviour: no aggressive cache-busting on read-only assets
AI crawlers will revisit pages on their own schedule. If the cache headers on the HTML response or on the sitemap force a fresh fetch on every request, that behaviour can be misread as instability by crawlers that prefer stable content. Conversely, headers that mark content as immutable for years prevent corrections from propagating. The middle ground — short max-age, must-revalidate — is what most static-site hosts produce by default; the failure mode is overriding that with custom headers without understanding the trade-off.
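One way to flag the two extremes is to classify the Cache-Control header value. The sketch below treats no-store as forcing a fresh fetch and treats immutable, or a max-age above one day, as too long-lived. The one-day threshold and the three labels are assumptions chosen for illustration, not values taken from any crawler's documentation.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} mapping."""
    directives = {}
    for part in value.split(","):
        name, _, argument = part.strip().partition("=")
        if name:
            directives[name.lower()] = argument.strip('"') or None
    return directives


def classify_cache_policy(value, max_reasonable_age=86400):
    """Label a Cache-Control value as 'forces-refetch', 'too-sticky' or 'ok'."""
    directives = parse_cache_control(value)
    if "no-store" in directives:
        return "forces-refetch"
    max_age = directives.get("max-age")
    age = int(max_age) if max_age and max_age.isdigit() else None
    if "immutable" in directives or (age is not None and age > max_reasonable_age):
        return "too-sticky"
    return "ok"
```

A value such as "public, max-age=0, must-revalidate" is classified as ok, since it still allows a cheap conditional revalidation instead of a full download.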
8. Response codes are clean on the canonical URL set
Every URL in the sitemap must respond with 200 OK on a
HEAD request from a vanilla user agent and from each AI crawler
user agent in the probe set. Redirect chains longer than one hop, or redirects
that change protocol or host mid-chain, fail this item.
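The rules in this item can be expressed as a check over a recorded redirect chain. The sketch assumes the chain has already been captured as a list of (status, URL) pairs, one per request, and reads the item as tolerating a single same-origin redirect that ends in 200. check_response_chain is a hypothetical name.

```python
from urllib.parse import urlsplit


def check_response_chain(hops):
    """Return a list of problems for one URL; an empty list means it passes.

    hops is an ordered list of (status_code, url) pairs ending with the final response.
    """
    if not hops:
        return ["no response recorded"]
    problems = []
    final_status = hops[-1][0]
    if final_status != 200:
        problems.append(f"final status is {final_status}, expected 200")
    redirect_count = len(hops) - 1
    if redirect_count > 1:
        problems.append(f"redirect chain has {redirect_count} hops, expected at most 1")
    # Compare each request with the next one to catch protocol or host changes.
    for (_, source), (_, target) in zip(hops, hops[1:]):
        before, after = urlsplit(source), urlsplit(target)
        if (before.scheme, before.netloc.lower()) != (after.scheme, after.netloc.lower()):
            problems.append(f"redirect changes protocol or host: {source} -> {target}")
    return problems
```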
Common failure modes
The recurring failure patterns observed across a fleet of static sites fall into a small number of categories.
Empty-body SPA shipping behind a CDN
The most common failure mode in the fleet was a single-page application that rendered fully in a logged-in browser but shipped a near-empty body to a non-JavaScript crawler. The operator's own browser tests passed because the operator had JavaScript enabled and the CDN's edge cached the hydrated DOM. Crawlers received the raw template. The fix is a static-site generator with a body-content probe in the build.
JSON-LD injected at runtime by a head-management library
Frameworks that use a document-head library to inject meta tags and JSON-LD at
runtime produce pages whose head is complete in a browser, while the same tags
are missing from the raw response that curl receives. The fix is either
server-side rendering of the head or build-time generation of the head into the
static response.
Trailing-slash drift between canonical, sitemap, and internal links
A common pattern is: canonical tags end in /, the sitemap omits the
trailing slash, and internal links use a mix of both. Crawlers see duplicate
URLs and ranking signals split between them. The fix is to declare the policy
once in site config and enforce it across all three surfaces with an automated
check.
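An automated check across the three surfaces could be sketched as below. The surface names, the boolean policy argument, and the exemption for the root path and for file-like paths such as /sitemap.xml are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def slash_violations(surfaces, trailing_slash=True):
    """Return {surface_name: [urls that break the trailing-slash policy]}.

    surfaces maps a surface name (for example "canonical", "sitemap",
    "internal") to an iterable of URLs or paths collected from that surface.
    """
    violations = {}
    for name, urls in surfaces.items():
        offending = []
        for url in urls:
            path = urlsplit(url).path or "/"
            last_segment = path.rsplit("/", 1)[-1]
            if path == "/" or "." in last_segment:
                continue  # root and file-like paths are exempt in this sketch
            if path.endswith("/") != trailing_slash:
                offending.append(url)
        if offending:
            violations[name] = offending
    return violations
```

Running it with one policy value over all three URL lists in the same build step is what keeps the surfaces from drifting apart.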
Schema-type inflation for affiliate or aggregator pages
Affiliate review pages sometimes mark themselves as SoftwareApplication
or Product for tools the operator does not sell. This is a category
error and surfaces as rich-result eligibility failures in Search Console. The
fix is a route-to-types whitelist that allows only the schema types the page
genuinely is.
Robots.txt rules that block the AI crawlers the operator intends to allow
Default-deny robots files written when GPTBot first appeared sometimes still block crawlers that later changed their user-agent string, or fail to allow a newly registered crawler the operator does intend to admit. A periodic review of the robots.txt against the current AI crawler user-agent registry is the cheapest fix.
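The periodic review can be partly automated with Python's standard-library robots.txt parser. In this sketch the crawler list is copied from the probe set in this checklist and stands in for whatever registry the operator maintains; blocked_crawlers is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

# Crawlers the operator intends to admit; adjust to the current registry.
INTENDED_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot",
    "OAI-SearchBot", "Google-Extended", "CCBot",
]


def blocked_crawlers(robots_txt, url, crawlers=INTENDED_CRAWLERS):
    """Return the intended crawlers that the given robots.txt text blocks from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in crawlers if not parser.can_fetch(agent, url)]
```

Feeding it the deployed robots.txt and a representative article URL should return an empty list when the file matches the operator's intent.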
How to run the check
For a single page, the minimum sequence is:
- curl -A "GPTBot/1.0" -s https://example.com/page/ | grep -o "<h1" | wc -l (confirm exactly one H1 in the static response). Counting occurrences with grep -o and wc -l, rather than matching lines with grep -c, keeps the count correct on minified HTML.
- curl -A "GPTBot/1.0" -s https://example.com/page/ | grep -o "application/ld+json" | wc -l (confirm baked JSON-LD blocks).
- curl -A "GPTBot/1.0" -sI https://example.com/page/ (confirm HTTP/2 200 and inspect the cache headers).
- Repeat with ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended, CCBot user-agent strings.
- Diff the rendered H1 and first paragraph across all six probes and the browser view.
For a multi-page site the same sequence is wrapped in a script that consumes the sitemap and reports per-URL pass/fail. The "what AI bots see" page explains why these specific probes catch the failure modes that matter.
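A wrapper of that kind might be structured as in the sketch below. It is not the script used on the fleet. The user-agent strings are simplified placeholders in the style of the curl examples, not the crawlers' full published strings, and fetch, run_probes and the checks mapping are hypothetical names.

```python
import urllib.request

# Simplified placeholder user-agent strings (assumption, see lead-in).
PROBE_USER_AGENTS = [
    "GPTBot/1.0", "ClaudeBot/1.0", "PerplexityBot/1.0",
    "OAI-SearchBot/1.0", "Google-Extended", "CCBot/2.0",
]


def fetch(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent and return (status, body_text)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def run_probes(urls, checks, user_agents=PROBE_USER_AGENTS, fetcher=fetch):
    """Run every check on every (url, user agent) pair.

    checks maps a check name to a function that takes the HTML and returns
    True on pass. The report maps (url, user_agent) to a list of failures;
    an empty list means that pair passed.
    """
    report = {}
    for url in urls:
        for agent in user_agents:
            try:
                status, html = fetcher(url, agent)
            except OSError as error:  # urllib's URLError and HTTPError derive from OSError
                report[(url, agent)] = [f"fetch failed: {error}"]
                continue
            failures = [] if status == 200 else [f"status {status}"]
            failures += [name for name, check in checks.items() if not check(html)]
            report[(url, agent)] = failures
    return report
```

The fetcher argument exists so the loop can be exercised against canned responses before it is pointed at a live site; the URL list would come from the sitemap.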