Run these 12 checks to find out whether AI assistants like ChatGPT, Perplexity, and Claude can access, understand, and cite your site.

A single `Disallow: /` rule silently removes your entire site from AI search — training data and live retrieval alike.

AI assistants now handle a significant share of informational search. Perplexity serves over 100 million queries per month. ChatGPT's browsing mode retrieves live web content with every Plus query. Claude surfaces citations in research-mode responses. If your site is not accessible, credible, and structured in ways these systems can process, it does not exist for the users asking questions in those interfaces.
The encouraging reality: AI readiness is auditable. Every signal that determines whether an AI assistant can find and cite your site is measurable, and most gaps are fixable in hours, not weeks. This guide walks through all 12 checks, explains what each one means, and tells you how to score your results.
An AI search readiness audit checks whether your site passes the technical, content, and authority signals that AI assistants use to decide whether to access, trust, and cite your pages. The 12 checks below cover crawl access, content structure, schema markup, page speed, link architecture, and authority signals. Score 10 or more and you are well-positioned. Score below 4 and AI assistants may not be able to describe your site accurately at all.
Open https://yoursite.com/robots.txt in a browser. You are looking for any rules that apply to these user-agent strings:
| AI System | User-Agent |
|---|---|
| ChatGPT / OpenAI | GPTBot, ChatGPT-User |
| Claude / Anthropic | ClaudeBot, anthropic-ai |
| Perplexity | PerplexityBot |
| Common Crawl (AI training datasets) | CCBot |
| Google Gemini training | Google-Extended |
A blocking rule looks like this:
```
User-agent: GPTBot
Disallow: /
```
That single declaration makes your entire site invisible to ChatGPT's crawler — training data and live retrieval. Many sites introduced these blocks in 2023–2024 as a precaution against AI scraping, without understanding the long-term cost to AI search visibility.
What you want: Either no rules for these agents (allowed by default) or explicit Allow: / entries for each. You can audit all known AI crawlers at once using the seo.yatna.ai robots.txt checker, which flags every blocked agent in a single pass.
Important nuance: Blocking affects training data immediately and live retrieval within days. Removing a block has a faster effect on live-retrieval systems (Perplexity, Bing AI, ChatGPT browsing) than on training-based knowledge — which can lag by months.
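The robots.txt check can also be scripted with Python's standard library. This is a sketch: the user-agent list mirrors the table above, and `yoursite.com` is a placeholder for your own domain.

```python
# Sketch: find which AI crawlers a robots.txt blocks, using only the
# standard library. AI_AGENTS mirrors the user-agent table above.
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
             "PerplexityBot", "CCBot", "Google-Extended"]

def blocked_ai_agents(robots_txt: str, url: str = "https://yoursite.com/") -> list[str]:
    """Return the AI user-agents that may not fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_agents(sample))  # → ['GPTBot']
```

In practice you would fetch `https://yoursite.com/robots.txt` and pass its body to `blocked_ai_agents`; an empty result means every listed crawler is allowed.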
Navigate to https://yoursite.com/llms.txt. If you get a 404, you do not have one. llms.txt is a plain-text file placed at the root of your site that describes your site's purpose, key pages, and content focus in a format written explicitly for large language models — a README your site sends to AI systems before they start crawling.
A minimal, effective llms.txt looks like this:
```
# Company Name
> One-sentence description of what this site does and who it serves.

## Key Pages
- [About](https://yoursite.com/about): Who we are and what we offer
- [Services](https://yoursite.com/services): Full service descriptions
- [Blog](https://yoursite.com/blog): Guides on [your topic]

## Contact
- support@yoursite.com
```
Sites with a well-written llms.txt give AI models a clarity advantage, especially for niche topics where the AI might otherwise be uncertain what the site covers or which pages are most authoritative. For a complete implementation guide, read The llms.txt Complete Guide.
Pass criteria: /llms.txt returns a 200 status, contains a clear one-sentence site description, and lists at least your key landing pages and blog hub.
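The pass criteria can be approximated with a small validator. The structural heuristics below (a `#` title line, a `>` description, at least one markdown link) are assumptions derived from the example above, not a formal llms.txt specification.

```python
# Sketch: minimal structural checks for an llms.txt body. An empty
# result means the file passes these (assumed) heuristics.
import re

def llms_txt_issues(text: str) -> list[str]:
    """Return a list of problems found in an llms.txt body (empty list = pass)."""
    issues = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# "):
        issues.append("missing '# Site Name' title on the first line")
    if not any(line.startswith("> ") for line in lines):
        issues.append("missing '> one-sentence description' blockquote")
    if not re.search(r"\[.+?\]\(https?://.+?\)", text):
        issues.append("no markdown links to key pages")
    return issues

good = "# Acme\n> Acme sells widgets to plumbers.\n## Key Pages\n- [About](https://acme.com/about): Who we are\n"
print(llms_txt_issues(good))  # → []
print(llms_txt_issues("hello"))
```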
Schema markup tells AI crawlers — in machine-readable JSON-LD — exactly what a page is, who wrote it, when it was published, and what organization stands behind it. Without schema, an AI system has to infer all of that from the prose text, which is slower, less reliable, and less citation-worthy.
Three schema types matter most for AI readiness:
Article schema on blog posts and guides. Every long-form content page should have Article or BlogPosting schema that includes a named author (with @type: Person, name, url, and jobTitle), a datePublished, and a publisher organization. This is the foundation of E-E-A-T signaling for AI.
Organization schema in the site header. A single Organization schema on your homepage or site-wide `<head>` tells AI systems your company name, URL, logo, and description. Without it, AI models may describe your organization inaccurately or not at all.
FAQPage schema on Q&A content. AI assistants are trained to extract question-and-answer pairs. FAQPage schema formats your answers in the exact structure these models prefer to cite. Any page with an FAQ section should have this markup.
How to check: Paste any key page URL into Google's Rich Results Test (search.google.com/test/rich-results) or the Schema Markup Validator (validator.schema.org). Both will show what structured data is present and flag validation errors.
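As a concrete reference, here is the minimal Article schema described above, built as a Python dict so each required field is explicit. Every name, date, and URL is a placeholder.

```python
# Sketch: minimal Article JSON-LD with a named author and publisher,
# the fields called out in the schema check above. All values are
# placeholders to be replaced with your own.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Guide Title",
    "datePublished": "2025-06-01",
    "author": {
        "@type": "Person",
        "name": "Jane Author",
        "url": "https://yoursite.com/authors/jane-author",
        "jobTitle": "Head of Content",
    },
    "publisher": {
        "@type": "Organization",
        "name": "Your Company",
        "url": "https://yoursite.com",
    },
}

# Embed the output in the page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(article_schema, indent=2))
```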
Named, verifiable authorship is the strongest single E-E-A-T signal for AI citation, and it is missing from the majority of sites that fail AI readiness audits.
AI assistants are not just looking for a byline. They are looking for a verifiable identity: a named person with a consistent track record across multiple pages, a bio that states their expertise, and external links that confirm who they are. When an AI model decides whether to cite a piece of content, it is essentially asking: "Is there a real, credentialed person I can attribute this to?"
What a complete author signal looks like:
- A dedicated author page at /authors/[name] or /about/[name] with a bio that states the author's expertise
- sameAs links in the author's schema pointing to their LinkedIn profile and any other authoritative presence
- Person schema attached to each article's author property

An author page with the above signals creates a verifiable entity that AI models can reference with confidence. A byline that says "Posted by Marketing Team" creates none.
AI assistants extract content to answer user queries. They read the first 100–200 words of a page first. If those words answer the query the page targets, the page gets cited. If those words are an introductory paragraph that builds context before getting to the point, the page gets skipped in favor of a competitor that leads with the answer.
Run this test on your most important pages: read the opening paragraph. Does it answer the primary question the page is targeting within the first two sentences? Or does it introduce the topic, acknowledge that it's an important question, and promise to explain it shortly?
Rewrite pattern for answer-first content:
Before: "Many businesses struggle with [topic]. In this guide, we will explore the key factors involved and walk you through the best practices for addressing them."
After: "[Topic] works by [direct explanation]. The three factors that matter most are [X], [Y], and [Z], and the fastest way to implement them is [specific action]."
The second version gets cited. The first does not — even if the rest of the article is excellent.
AI assistants — and the users interacting with them — look for content they can extract and paraphrase without reading an entire article. A Key Takeaways section at the top of a post, a highlighted summary box, or a bulleted "What you need to know" block serves as a pre-extracted citation. You are doing the AI's work for it, which increases the probability your content gets used.
Best practices for Key Takeaway blocks:
- Place the block at the top of the post, before the first section
- Use three to five bullets, each a complete sentence that stands alone when quoted
- Lead each bullet with the specific fact, number, or recommendation, not a teaser
- Write bullets that an AI can quote verbatim with no reformatting
Pages with well-written Key Takeaway sections consistently outperform pages without them in AI citation rates, because the extracted quote requires no reformatting.
Vague claims are citation-unfriendly. When an AI assistant encounters "many studies suggest that AI search is growing rapidly," it has nothing quotable — because there is no specificity, no source, and no way to verify the claim.
Replace vague language with named, attributable, verifiable data:
| Vague (not citable) | Specific (citable) |
|---|---|
| "Many companies are adopting AI search" | "A 2025 Gartner survey of 1,400 CMOs found 68% had integrated AI search into their content strategy" |
| "Site speed affects rankings" | "Google's 2024 Core Web Vitals report found that pages loading under 2.5 seconds had a 24% higher click-through rate in AI Overview results" |
| "Structured data helps visibility" | "According to Search Engine Journal's 2025 analysis of 10,000 URLs, pages with Article schema were cited by Perplexity 3.2× more often than pages without it" |
The specific version gives AI assistants a quotable, attributable data point. The vague version gives them nothing to work with.
This applies to your product and service pages too — not just your blog. Replace every claim about your offering that uses qualifiers like "many," "most," "often," or "typically" with a specific, verifiable assertion.
Checking for accidental noindex sounds obvious, but misconfigured content management platforms and deployment pipelines regularly noindex pages that should be public. Check your most important pages — homepage, key service pages, blog posts — for this meta tag in the `<head>`:
```html
<meta name="robots" content="noindex">
```
Or its X-Robots-Tag HTTP header equivalent:
```
X-Robots-Tag: noindex
```
Either declaration tells all crawlers — including AI crawlers — to skip the page entirely. AI systems cannot cite content they are not allowed to index.
Common causes of accidental noindex:
- Staging or development noindex settings carried into a production deployment
- CMS or SEO-plugin defaults that noindex new templates, tags, or page types
- A site-wide "discourage search engines" toggle left on after a redesign or migration
- A CDN, security layer, or server configuration adding an `X-Robots-Tag: noindex` header
Use View Page Source rather than Inspect — Inspect shows the rendered DOM, not the raw HTML crawlers receive — to check the `<meta>` tags on your key pages. Or run a full crawl with an SEO crawler to flag every noindexed URL at once.
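The source-level check can be automated with the standard library. This sketch covers the `<meta>` form only; the `X-Robots-Tag` variant has to be read from the HTTP response headers instead.

```python
# Sketch: detect a noindex directive in raw page HTML, using only the
# standard library's HTMLParser.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Flags any <meta name="robots"|"googlebot"> tag containing noindex."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True

def has_noindex(html_source: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html_source)
    return finder.noindex

print(has_noindex('<head><meta name="robots" content="noindex"></head>'))   # → True
print(has_noindex('<head><meta name="robots" content="index,follow"></head>'))  # → False
```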
AI assistants that use real-time browsing — ChatGPT-User, ClaudeBot in retrieval mode, Bing's AI features — time out on pages that take too long to return content. A page that consistently loads in 4–6 seconds may never be successfully fetched during a live retrieval session, meaning the AI gives an answer based on older cached data or skips your site entirely.
Target thresholds for AI readiness:
- Time to First Byte (TTFB): under 800 ms
- Largest Contentful Paint (LCP): 2.5 seconds or less
- Interaction to Next Paint (INP): 200 ms or less
- Cumulative Layout Shift (CLS): 0.1 or less
Test your page speed at PageSpeed Insights (pagespeed.web.dev). The tool measures real-world Core Web Vitals from Chrome User Experience Report data — the same data that feeds Google's AI Overview rankings.
The highest-impact fixes for slow load times are: serve images in WebP format with explicit width and height attributes, enable HTTP/2 or HTTP/3, remove render-blocking third-party scripts from the critical path, and use a CDN for static assets.
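The speed targets can be encoded as a simple pass/fail helper. The thresholds below are Google's published "good" Core Web Vitals boundaries; treating them as AI-readiness gates is this guide's assumption, not documented crawler behavior.

```python
# Sketch: classify Core Web Vitals field values against Google's
# published "good" thresholds.
def cwv_passes(lcp_s: float, inp_ms: float, cls: float) -> dict:
    return {
        "LCP": lcp_s <= 2.5,   # Largest Contentful Paint, seconds
        "INP": inp_ms <= 200,  # Interaction to Next Paint, milliseconds
        "CLS": cls <= 0.1,     # Cumulative Layout Shift, unitless
    }

print(cwv_passes(2.1, 150, 0.05))  # → {'LCP': True, 'INP': True, 'CLS': True}
print(cwv_passes(4.8, 150, 0.05))  # a 4-6 second page fails the LCP gate
```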
AI crawlers follow canonical tags. If your page at https://yoursite.com/guide/ai-seo has a canonical tag pointing to https://www.yoursite.com/guide/ai-seo (note the www vs. non-www discrepancy), the AI will attribute the content to the canonical URL. If both URLs serve similar content but the canonical is the one with less inbound link equity, you are splitting your authority signal.
More critically: if a CMS or e-commerce platform has set canonicals to paginated variants, parameter URLs, or staging domains, the AI cites the wrong page — which may not exist for real users, or may return a 404.
Check canonical tags by viewing page source and searching for:
```html
<link rel="canonical" href="...">
```
Confirm that:
- Each page's canonical points to its own preferred URL (self-referencing), not a different page
- The canonical matches your site-wide protocol and host policy (HTTPS, www or non-www)
- The canonical URL returns a 200 status, not a redirect, a 404, or a staging domain
Common canonical errors to fix: non-www → www inconsistency across pages, HTTP canonicals on HTTPS pages, canonicals pointing to trailing-slash vs. no-trailing-slash variants without a site-wide policy, and pagination pages canonicalizing to the first page of a series when each page has unique content.
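A sketch of an automated check for the protocol and host mismatches described above. The regex assumes the common `rel`-before-`href` attribute order, so a production audit should use a real HTML parser instead.

```python
# Sketch: extract the canonical URL from page HTML and flag
# protocol/host mismatches against the URL that was fetched.
import re
from urllib.parse import urlparse

def canonical_issues(html_source: str, page_url: str) -> list[str]:
    match = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']',
        html_source, re.I)
    if not match:
        return ["no canonical tag found"]
    canonical, page = urlparse(match.group(1)), urlparse(page_url)
    issues = []
    if canonical.scheme != page.scheme:
        issues.append(f"protocol mismatch: {canonical.scheme} vs {page.scheme}")
    if canonical.netloc != page.netloc:
        issues.append(f"host mismatch: {canonical.netloc} vs {page.netloc}")
    return issues

sample = '<link rel="canonical" href="https://www.yoursite.com/guide/ai-seo">'
print(canonical_issues(sample, "https://yoursite.com/guide/ai-seo"))
# → ['host mismatch: www.yoursite.com vs yoursite.com']
```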
AI assistants have a revealed preference for citing substantive content. Pages with fewer than 400 words rarely appear in AI citations — not because there is a hard word-count rule, but because thin pages tend not to answer questions with enough specificity to be useful as a cited source.
The practical threshold:
- Treat 400 words as the floor for any page you want cited; below that, citation odds drop sharply
- Pages competing on substantive informational queries generally need far more depth than the floor
What to do with thin pages: Either expand them with genuine informational depth (not filler paragraphs) or consolidate multiple thin pages on related topics into one comprehensive guide. Thin pages also dilute your site's topical authority signal — AI systems assess sites as a whole, not just individual pages.
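A rough way to flag thin pages at scale, using the 400-word floor mentioned above. Regex tag-stripping is a simplification; pages that render content with JavaScript need a rendered crawl instead.

```python
# Sketch: approximate visible word count by stripping scripts, styles,
# and tags before counting whitespace-separated tokens.
import re

def word_count(html_source: str) -> int:
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ",
                  html_source, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(text.split())

page = "<article><h1>Title</h1><p>" + "word " * 450 + "</p></article>"
print(word_count(page))  # → 451
print(word_count(page) >= 400)  # passes the thin-content floor
```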
AI assistants do not operate on your content in isolation. They use web graph signals — the same signals that underpin traditional PageRank — to assess whether a site is authoritative enough to cite. A site with no inbound links from credible external sources, no mentions in industry directories, and no references from DR 50+ domains is a weak authority signal regardless of how well the on-site signals are optimized.
What counts as positive social proof for AI:
- Inbound links from credible, high-authority external domains (the DR 50+ range referenced above)
- Listings in recognized industry directories
- Mentions and references in trade press and industry publications
- Consistent brand and author mentions across the web that corroborate who you are
This is the hardest check to pass quickly — external authority takes time to build. But it is worth auditing because it tells you whether your AI visibility problem is on-site (technical and content signals) or off-site (authority and web graph signals). The fix strategy is different for each.
After running all 12 checks, count how many your site passes:
| Score | Readiness Level | What to Expect |
|---|---|---|
| 10–12 | High AI Readiness | Your site is routinely accessed and cited by AI assistants. Focus on maintaining quality and expanding topical authority. |
| 7–9 | Medium Readiness | You get occasional citations, especially for your strongest pages. Prioritize the failing checks starting with robots.txt and schema. |
| 4–6 | Low Readiness | AI assistants rarely cite you and may describe your site inaccurately when they do. Systematic fixes needed across multiple areas. |
| 0–3 | Critical | AI assistants likely cannot access, understand, or accurately describe your site. Start with crawl access (Check 1), then indexability (Check 8), then schema (Check 3). |
One important clarification: a site that scores 12/12 is not guaranteed constant citation. Scores measure whether your site can be found and cited — whether it is depends on topical relevance and content quality. But a site scoring below 7 is effectively competing with one hand tied behind its back.
How often should I run an AI search readiness audit?
Run a full audit whenever you make significant changes to your site architecture, CMS, or content strategy — and at minimum once per quarter. AI crawler behavior and supported schema types evolve quickly. A robots.txt change that seems routine (a new CDN deployment, a CMS update) can inadvertently reintroduce blocks. Quarterly audits catch regressions before they compound.
Do these 12 checks apply equally to ChatGPT, Perplexity, and Claude?
The core checks — crawl access, indexability, page speed, canonicals, and schema — apply to all three. The signals that vary by platform are weights: Perplexity weights recency and real-time retrievability more heavily because it does live crawls for every query. ChatGPT in training-data mode weights domain authority and external citation signals more. Claude's research mode weights named authorship and content specificity especially highly. Optimizing for all 12 checks means you are covered across all three.
My robots.txt does not block AI crawlers but my site still does not appear in AI results. Why?
Crawl access is the necessary condition, not the sufficient one. A site can be crawlable but still invisible in AI results because of thin content, missing schema, no external citations, or slow page load times that cause real-time retrieval to time out. Work through all 12 checks — crawl access is only the first gate.
Does adding llms.txt guarantee AI assistants will read it?
No. llms.txt is a convention, not a formal standard with universal enforcement. As of early 2026, Perplexity supports it in its crawler; OpenAI and Anthropic have acknowledged awareness but have not published official support statements. Creating llms.txt costs nothing and signals that your site is deliberately designed for AI access — which is a positive signal even if not every crawler processes it yet.
What is the fastest single fix for a site scoring below 4?
Check robots.txt first. A single `Disallow: /` under `User-agent: *` blocks every crawler on Earth, including all AI systems. If that is not the issue, check for accidental noindex tags on key pages. Both are zero-cost, sub-30-minute fixes that can move a site from invisible to crawlable in a single deployment.
The manual checks above tell you what to look for. An automated audit checks all 12 signals simultaneously, scores each one, and produces a prioritized fix list ranked by impact — in about 60 seconds.
Run a free automated AI readiness audit with seo.yatna.ai. The free tier audits up to 5 pages and covers every check in this guide: robots.txt AI crawler access, llms.txt presence, schema completeness, author signals, page speed, canonical accuracy, noindex detection, content depth, and external authority signals.
If you want to understand the broader strategic context behind AI search readiness, read What Is GEO (Generative Engine Optimization)? — it covers the framework that makes these 12 checks meaningful together, not just as a checklist.
About the Author

Ishan Sharma
Head of SEO & AI Search Strategy
Ishan Sharma is Head of SEO & AI Search Strategy at seo.yatna.ai. With over 10 years of technical SEO experience across SaaS, e-commerce, and media brands, he specialises in schema markup, Core Web Vitals, and the emerging discipline of Generative Engine Optimisation (GEO). Ishan has audited over 2,000 websites and writes extensively about how structured data and AI readiness signals determine which sites get cited by ChatGPT, Perplexity, and Claude. He is a contributor to Search Engine Journal and speaks regularly at BrightonSEO.