Run these 12 checks to find out whether AI assistants like ChatGPT, Perplexity, and Claude can access, understand, and cite your site.

A single `Disallow: /` rule silently removes your entire site from AI search — training data and live retrieval alike.

AI assistants now handle a significant share of informational search. Perplexity serves over 100 million queries per month. ChatGPT's browsing mode retrieves live web content with every Plus query. Claude surfaces citations in research-mode responses. If your site is not accessible, credible, and structured in ways these systems can process, it does not exist for the users asking questions in those interfaces.
The encouraging reality: AI readiness is auditable. Every signal that determines whether an AI assistant can find and cite your site is measurable, and most gaps are fixable in hours, not weeks. This guide walks through all 12 checks, explains what each one means, and tells you how to score your results.
An AI search readiness audit checks whether your site passes the technical, content, and authority signals that AI assistants use to decide whether to access, trust, and cite your pages. The 12 checks below cover crawl access, content structure, schema markup, page speed, link architecture, and authority signals. Score 10 or more and you are well-positioned. Score below 4 and AI assistants may not be able to describe your site accurately at all.
Open https://yoursite.com/robots.txt in a browser. You are looking for any rules that apply to these user-agent strings:
| AI System | User-Agent |
|---|---|
| ChatGPT / OpenAI | GPTBot, ChatGPT-User |
| Claude / Anthropic | ClaudeBot, anthropic-ai |
| Perplexity | PerplexityBot |
| Common Crawl (AI training datasets) | CCBot |
| Google Gemini training | Google-Extended |
A blocking rule looks like this:
```
User-agent: GPTBot
Disallow: /
```
That single declaration makes your entire site invisible to ChatGPT's crawler — training data and live retrieval. Many sites introduced these blocks in 2023–2024 as a precaution against AI scraping, without understanding the long-term cost to AI search visibility.
What you want: Either no rules for these agents (allowed by default) or explicit Allow: / entries for each. You can audit all known AI crawlers at once using the seo.yatna.ai robots.txt checker, which flags every blocked agent in a single pass.
Important nuance: Blocking affects training data immediately and live retrieval within days. Removing a block has a faster effect on live-retrieval systems (Perplexity, Bing AI, ChatGPT browsing) than on training-based knowledge — which can lag by months.
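The robots.txt check can also be scripted with Python's standard library. This is a sketch: the user-agent list mirrors the table above, and `yoursite.com` is a placeholder for your own domain.

```python
# Sketch: find which AI crawlers a robots.txt blocks, using only the
# standard library. AI_AGENTS mirrors the user-agent table above.
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
             "PerplexityBot", "CCBot", "Google-Extended"]

def blocked_ai_agents(robots_txt: str, url: str = "https://yoursite.com/") -> list[str]:
    """Return the AI user-agents that may not fetch `url` under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]

sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_agents(sample))  # → ['GPTBot']
```

In practice you would fetch `https://yoursite.com/robots.txt` and pass its body to `blocked_ai_agents`; an empty result means every listed crawler is allowed.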
Navigate to https://yoursite.com/llms.txt. If you get a 404, you do not have one. llms.txt is a plain-text file placed at the root of your site that describes your site's purpose, key pages, and content focus in a format written explicitly for large language models — a README your site sends to AI systems before they start crawling.
A minimal, effective llms.txt looks like this:
```
# Company Name
> One-sentence description of what this site does and who it serves.

## Key Pages
- [About](https://yoursite.com/about): Who we are and what we offer
- [Services](https://yoursite.com/services): Full service descriptions
- [Blog](https://yoursite.com/blog): Guides on [your topic]

## Contact
- support@yoursite.com
```
Sites with a well-written llms.txt give AI models a clarity advantage, especially for niche topics where the AI might otherwise be uncertain what the site covers or which pages are most authoritative. For a complete implementation guide, read The llms.txt Complete Guide.
Pass criteria: /llms.txt returns a 200 status, contains a clear one-sentence site description, and lists at least your key landing pages and blog hub.
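The pass criteria can be approximated with a small validator. The structural heuristics below (a `#` title line, a `>` description, at least one markdown link) are assumptions derived from the example above, not a formal llms.txt specification.

```python
# Sketch: minimal structural checks for an llms.txt body. An empty
# result means the file passes these (assumed) heuristics.
import re

def llms_txt_issues(text: str) -> list[str]:
    """Return a list of problems found in an llms.txt body (empty list = pass)."""
    issues = []
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# "):
        issues.append("missing '# Site Name' title on the first line")
    if not any(line.startswith("> ") for line in lines):
        issues.append("missing '> one-sentence description' blockquote")
    if not re.search(r"\[.+?\]\(https?://.+?\)", text):
        issues.append("no markdown links to key pages")
    return issues

good = "# Acme\n> Acme sells widgets to plumbers.\n## Key Pages\n- [About](https://acme.com/about): Who we are\n"
print(llms_txt_issues(good))  # → []
print(llms_txt_issues("hello"))
```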
Schema markup tells AI crawlers — in machine-readable JSON-LD — exactly what a page is, who wrote it, when it was published, and what organization stands behind it. Without schema, an AI system has to infer all of that from the prose text, which is slower, less reliable, and less citation-worthy.
Three schema types matter most for AI readiness:
Article schema on blog posts and guides. Every long-form content page should have Article or BlogPosting schema that includes a named author (with @type: Person, name, url, and jobTitle), a datePublished, and a publisher organization. This is the foundation of E-E-A-T signaling for AI.
Organization schema in the site header. A single Organization schema on your homepage or site-wide `<head>` tells AI systems your company name, URL, logo, and description. Without it, AI models may describe your organization inaccurately or not at all.
FAQPage schema on Q&A content. AI assistants are trained to extract question-and-answer pairs. FAQPage schema formats your answers in the exact structure these models prefer to cite. Any page with an FAQ section should have this markup.
How to check: Paste any key page URL into Google's Rich Results Test (search.google.com/test/rich-results) or the Schema Markup Validator (validator.schema.org). Both will show what structured data is present and flag validation errors.
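As a concrete reference, here is the minimal Article schema described above, built as a Python dict so each required field is explicit. Every name, date, and URL is a placeholder.

```python
# Sketch: minimal Article JSON-LD with a named author and publisher,
# the fields called out in the schema check above. All values are
# placeholders to be replaced with your own.
import json

article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example Guide Title",
    "datePublished": "2025-06-01",
    "author": {
        "@type": "Person",
        "name": "Jane Author",
        "url": "https://yoursite.com/authors/jane-author",
        "jobTitle": "Head of Content",
    },
    "publisher": {
        "@type": "Organization",
        "name": "Your Company",
        "url": "https://yoursite.com",
    },
}

# Embed the output in the page as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(article_schema, indent=2))
```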
Named, verifiable authorship is the strongest single E-E-A-T signal for AI citation, and it is missing from the majority of sites that fail AI readiness audits.
AI assistants are not just looking for a byline. They are looking for a verifiable identity: a named person with a consistent track record across multiple pages, a bio that states their expertise, and external links that confirm who they are. When an AI model decides whether to cite a piece of content, it is essentially asking: "Is there a real, credentialed person I can attribute this to?"
What a complete author signal looks like:
- A dedicated author page at /authors/[name] or /about/[name] with a bio that states the author's expertise
- sameAs links in the author's schema pointing to their LinkedIn profile and any other authoritative presence
- Person schema attached to each article's author property

An author page with the above signals creates a verifiable entity that AI models can reference with confidence. A byline that says "Posted by Marketing Team" creates none.
AI assistants extract content to answer user queries. They read the first 100–200 words of a page first. If those words answer the query the page targets, the page gets cited. If those words are an introductory paragraph that builds context before getting to the point, the page gets skipped in favor of a competitor that leads with the answer.
Run this test on your most important pages: read the opening paragraph. Does it answer the primary question the page is targeting within the first two sentences? Or does it introduce the topic, acknowledge that it's an important question, and promise to explain it shortly?
Rewrite pattern for answer-first content:
Before: "Many businesses struggle with [topic]. In this guide, we will explore the key factors involved and walk you through the best practices for addressing them."
After: "[Topic] works by [direct explanation]. The three factors that matter most are [X], [Y], and [Z], and the fastest way to implement them is [specific action]."
The second version gets cited. The first does not — even if the rest of the article is excellent.
AI assistants — and the users interacting with them — look for content they can extract and paraphrase without reading an entire article. A Key Takeaways section at the top of a post, a highlighted summary box, or a bulleted "What you need to know" block serves as a pre-extracted citation. You are doing the AI's work for it, which increases the probability your content gets used.
Best practices for Key Takeaway blocks:
- Place the block at the top of the post, before the first section
- Use three to five bullets, each a complete sentence that stands alone when quoted
- Lead each bullet with the specific fact, number, or recommendation, not a teaser
- Write bullets that an AI can quote verbatim with no reformatting
Pages with well-written Key Takeaway sections consistently outperform pages without them in AI citation rates, because the extracted quote requires no reformatting.
Vague claims are citation-unfriendly. When an AI assistant encounters "many studies suggest that AI search is growing rapidly," it has nothing quotable — because there is no specificity, no source, and no way to verify the claim.
Replace vague language with named, attributable, verifiable data:
| Vague (not citable) | Specific (citable) |
|---|---|
| "Many companies are adopting AI search" | "A 2025 Gartner survey of 1,400 CMOs found 68% had integrated AI search into their content strategy" |
| "Site speed affects rankings" | "Google's 2024 Core Web Vitals report found that pages loading under 2.5 seconds had a 24% higher click-through rate in AI Overview results" |
| "Structured data helps visibility" | "According to Search Engine Journal's 2025 analysis of 10,000 URLs, pages with Article schema were cited by Perplexity 3.2× more often than pages without it" |
The specific version gives AI assistants a quotable, attributable data point. The vague version gives them nothing to work with.
This applies to your product and service pages too — not just your blog. Replace every claim about your offering that uses qualifiers like "many," "most," "often," or "typically" with a specific, verifiable assertion.
Checking for accidental noindex sounds obvious, but misconfigured content management platforms and deployment pipelines regularly noindex pages that should be public. Check your most important pages — homepage, key service pages, blog posts — for this meta tag in the `<head>`:
```html
<meta name="robots" content="noindex">
```
Or its X-Robots-Tag HTTP header equivalent:
```
X-Robots-Tag: noindex
```
Either declaration tells all crawlers — including AI crawlers — to skip the page entirely. AI systems cannot cite content they are not allowed to index.
Common causes of accidental noindex:
- Staging or development noindex settings carried into a production deployment
- CMS or SEO-plugin defaults that noindex new templates, tags, or page types
- A site-wide "discourage search engines" toggle left on after a redesign or migration
- A CDN, security layer, or server configuration adding an `X-Robots-Tag: noindex` header
Use View Page Source rather than Inspect — Inspect shows the rendered DOM, not the raw HTML crawlers receive — to check the `<meta>` tags on your key pages. Or run a full crawl with an SEO crawler to flag every noindexed URL at once.
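The source-level check can be automated with the standard library. This sketch covers the `<meta>` form only; the `X-Robots-Tag` variant has to be read from the HTTP response headers instead.

```python
# Sketch: detect a noindex directive in raw page HTML, using only the
# standard library's HTMLParser.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Flags any <meta name="robots"|"googlebot"> tag containing noindex."""
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True

def has_noindex(html_source: str) -> bool:
    finder = RobotsMetaFinder()
    finder.feed(html_source)
    return finder.noindex

print(has_noindex('<head><meta name="robots" content="noindex"></head>'))   # → True
print(has_noindex('<head><meta name="robots" content="index,follow"></head>'))  # → False
```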
AI assistants that use real-time browsing — ChatGPT-User, ClaudeBot in retrieval mode, Bing's AI features — time out on pages that take too long to return content. A page that consistently loads in 4–6 seconds may never be successfully fetched during a live retrieval session, meaning the AI gives an answer based on older cached data or skips your site entirely.
Target thresholds for AI readiness:
- Time to First Byte (TTFB): under 800 ms
- Largest Contentful Paint (LCP): 2.5 seconds or less
- Interaction to Next Paint (INP): 200 ms or less
- Cumulative Layout Shift (CLS): 0.1 or less
Test your page speed at PageSpeed Insights (pagespeed.web.dev). The tool measures real-world Core Web Vitals from Chrome User Experience Report data — the same data that feeds Google's AI Overview rankings.
The highest-impact fixes for slow load times are: serve images in WebP format with explicit width and height attributes, enable HTTP/2 or HTTP/3, remove render-blocking third-party scripts from the critical path, and use a CDN for static assets.
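The speed targets can be encoded as a simple pass/fail helper. The thresholds below are Google's published "good" Core Web Vitals boundaries; treating them as AI-readiness gates is this guide's assumption, not documented crawler behavior.

```python
# Sketch: classify Core Web Vitals field values against Google's
# published "good" thresholds.
def cwv_passes(lcp_s: float, inp_ms: float, cls: float) -> dict:
    return {
        "LCP": lcp_s <= 2.5,   # Largest Contentful Paint, seconds
        "INP": inp_ms <= 200,  # Interaction to Next Paint, milliseconds
        "CLS": cls <= 0.1,     # Cumulative Layout Shift, unitless
    }

print(cwv_passes(2.1, 150, 0.05))  # → {'LCP': True, 'INP': True, 'CLS': True}
print(cwv_passes(4.8, 150, 0.05))  # a 4-6 second page fails the LCP gate
```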
AI crawlers follow canonical tags. If your page at https://yoursite.com/guide/ai-seo has a canonical tag pointing to https://www.yoursite.com/guide/ai-seo (note the www vs. non-www discrepancy), the AI will attribute the content to the canonical URL. If both URLs serve similar content but the canonical is the one with less inbound link equity, you are splitting your authority signal.
More critically: if a CMS or e-commerce platform has set canonicals to paginated variants, parameter URLs, or staging domains, the AI cites the wrong page — which may not exist for real users, or may return a 404.
Check canonical tags by viewing page source and searching for:
```html
<link rel="canonical" href="...">
```
Confirm that:
- Each page's canonical points to its own preferred URL (self-referencing), not a different page
- The canonical matches your site-wide protocol and host policy (HTTPS, www or non-www)
- The canonical URL returns a 200 status, not a redirect, a 404, or a staging domain
Common canonical errors to fix: non-www → www inconsistency across pages, HTTP canonicals on HTTPS pages, canonicals pointing to trailing-slash vs. no-trailing-slash variants without a site-wide policy, and pagination pages canonicalizing to the first page of a series when each page has unique content.
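A sketch of an automated check for the protocol and host mismatches described above. The regex assumes the common `rel`-before-`href` attribute order, so a production audit should use a real HTML parser instead.

```python
# Sketch: extract the canonical URL from page HTML and flag
# protocol/host mismatches against the URL that was fetched.
import re
from urllib.parse import urlparse

def canonical_issues(html_source: str, page_url: str) -> list[str]:
    match = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']',
        html_source, re.I)
    if not match:
        return ["no canonical tag found"]
    canonical, page = urlparse(match.group(1)), urlparse(page_url)
    issues = []
    if canonical.scheme != page.scheme:
        issues.append(f"protocol mismatch: {canonical.scheme} vs {page.scheme}")
    if canonical.netloc != page.netloc:
        issues.append(f"host mismatch: {canonical.netloc} vs {page.netloc}")
    return issues

sample = '<link rel="canonical" href="https://www.yoursite.com/guide/ai-seo">'
print(canonical_issues(sample, "https://yoursite.com/guide/ai-seo"))
# → ['host mismatch: www.yoursite.com vs yoursite.com']
```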
AI assistants have a revealed preference for citing substantive content. Pages with fewer than 400 words rarely appear in AI citations — not because there is a hard word-count rule, but because thin pages tend not to answer questions with enough specificity to be useful as a cited source.
The practical threshold:
- Treat 400 words as the floor for any page you want cited; below that, citation odds drop sharply
- Pages competing on substantive informational queries generally need far more depth than the floor
What to do with thin pages: Either expand them with genuine informational depth (not filler paragraphs) or consolidate multiple thin pages on related topics into one comprehensive guide. Thin pages also dilute your site's topical authority signal — AI systems assess sites as a whole, not just individual pages.
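A rough way to flag thin pages at scale, using the 400-word floor mentioned above. Regex tag-stripping is a simplification; pages that render content with JavaScript need a rendered crawl instead.

```python
# Sketch: approximate visible word count by stripping scripts, styles,
# and tags before counting whitespace-separated tokens.
import re

def word_count(html_source: str) -> int:
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ",
                  html_source, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", text)
    return len(text.split())

page = "<article><h1>Title</h1><p>" + "word " * 450 + "</p></article>"
print(word_count(page))  # → 451
print(word_count(page) >= 400)  # passes the thin-content floor
```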
AI assistants do not operate on your content in isolation. They use web graph signals — the same signals that underpin traditional PageRank — to assess whether a site is authoritative enough to cite. A site with no inbound links from credible external sources, no mentions in industry directories, and no references from DR 50+ domains is a weak authority signal regardless of how well the on-site signals are optimized.
What counts as positive social proof for AI:
- Inbound links from credible, high-authority external domains (the DR 50+ range referenced above)
- Listings in recognized industry directories
- Mentions and references in trade press and industry publications
- Consistent brand and author mentions across the web that corroborate who you are
This is the hardest check to pass quickly — external authority takes time to build. But it is worth auditing because it tells you whether your AI visibility problem is on-site (technical and content signals) or off-site (authority and web graph signals). The fix strategy is different for each.
After running all 12 checks, count how many your site passes:
| Score | Readiness Level | What to Expect |
|---|---|---|
| 10–12 | High AI Readiness | Your site is routinely accessed and cited by AI assistants. Focus on maintaining quality and expanding topical authority. |
| 7–9 | Medium Readiness | You get occasional citations, especially for your strongest pages. Prioritize the failing checks starting with robots.txt and schema. |
| 4–6 | Low Readiness | AI assistants rarely cite you and may describe your site inaccurately when they do. Systematic fixes needed across multiple areas. |
| 0–3 | Critical | AI assistants likely cannot access, understand, or accurately describe your site. Start with crawl access (Check 1), then indexability (Check 8), then schema (Check 3). |
One important clarification: a site that scores 12/12 is not guaranteed constant citation. Scores measure whether your site can be found and cited — whether it is depends on topical relevance and content quality. But a site scoring below 7 is effectively competing with one hand tied behind its back.
How often should I run an AI search readiness audit?
Run a full audit whenever you make significant changes to your site architecture, CMS, or content strategy — and at minimum once per quarter. AI crawler behavior and supported schema types evolve quickly. A robots.txt change that seems routine (a new CDN deployment, a CMS update) can inadvertently reintroduce blocks. Quarterly audits catch regressions before they compound.
Do these 12 checks apply equally to ChatGPT, Perplexity, and Claude?
The core checks — crawl access, indexability, page speed, canonicals, and schema — apply to all three. The signals that vary by platform are weights: Perplexity weights recency and real-time retrievability more heavily because it does live crawls for every query. ChatGPT in training-data mode weights domain authority and external citation signals more. Claude's research mode weights named authorship and content specificity especially highly. Optimizing for all 12 checks means you are covered across all three.
My robots.txt does not block AI crawlers but my site still does not appear in AI results. Why?
Crawl access is the necessary condition, not the sufficient one. A site can be crawlable but still invisible in AI results because of thin content, missing schema, no external citations, or slow page load times that cause real-time retrieval to time out. Work through all 12 checks — crawl access is only the first gate.
Does adding llms.txt guarantee AI assistants will read it?
No. llms.txt is a convention, not a formal standard with universal enforcement. As of early 2026, Perplexity supports it in its crawler; OpenAI and Anthropic have acknowledged awareness but have not published official support statements. Creating llms.txt costs nothing and signals that your site is deliberately designed for AI access — which is a positive signal even if not every crawler processes it yet.
What is the fastest single fix for a site scoring below 4?
Check robots.txt first. A single `Disallow: /` under `User-agent: *` blocks every crawler on Earth, including all AI systems. If that is not the issue, check for accidental noindex tags on key pages. Both are zero-cost, sub-30-minute fixes that can move a site from invisible to crawlable in a single deployment.
The manual checks above tell you what to look for. An automated audit checks all 12 signals simultaneously, scores each one, and produces a prioritized fix list ranked by impact — in about 60 seconds.
Run a free automated AI readiness audit with seo.yatna.ai. The free tier audits up to 5 pages and covers every check in this guide: robots.txt AI crawler access, llms.txt presence, schema completeness, author signals, page speed, canonical accuracy, noindex detection, content depth, and external authority signals.
If you want to understand the broader strategic context behind AI search readiness, read What Is GEO (Generative Engine Optimization)? — it covers the framework that makes these 12 checks meaningful together, not just as a checklist.
About the Author

Ishan Sharma
Head of SEO & AI Search Strategy
Ishan Sharma is Head of SEO & AI Search Strategy at seo.yatna.ai. With over 10 years of technical SEO experience across SaaS, e-commerce, and media brands, he specialises in schema markup, Core Web Vitals, and the emerging discipline of Generative Engine Optimisation (GEO). Ishan has audited over 2,000 websites and writes extensively about how structured data and AI readiness signals determine which sites get cited by ChatGPT, Perplexity, and Claude. He is a contributor to Search Engine Journal and speaks regularly at BrightonSEO.