Technical SEO 10 min

llms.txt for AI SEO: a realistic technical guide | Ighenatt

What llms.txt can and cannot do for AI SEO: robots.txt and sitemap differences, a practical example, llms-full.txt governance and validation.


Elu Gonzalez

Author

The easiest mistake with llms.txt is treating it as the new magic tag for AI SEO. Upload a file to the root, add your best URLs, and wait for ChatGPT, Claude, Perplexity, or Google to read you more kindly. Comfortable idea. Too comfortable.

The reality is more useful and less theatrical: llms.txt is a proposal for publishing a Markdown map of your site for language models and retrieval tools. It helps explain which content matters, which URLs are canonical, and what context should travel with them. It is not an indexing promise. It is not a blocking directive. It is not a replacement for a clear site architecture.

Jeremy Howard, author of the proposal and co-founder of fast.ai, frames it on llmstxt.org as a way to provide LLMs with helpful information at inference time. The important word is “proposal”. Not a universal standard. Not an RFC. Not official documentation from Google, OpenAI, Anthropic, or Perplexity saying “we depend on this”.

Used well, llms.txt is a table of contents for machines. It does not cook the meal, but it stops the waiter from bringing the cutlery first, then last year’s menu, and only then the actual dish. It puts the experience in order.

What llms.txt is and why it is not a magic wand

llms.txt is a Markdown text file normally published at https://yourdomain.com/llms.txt. The original proposal describes a simple structure: site title, short description, thematic sections, and links to important resources, ideally in language-model-friendly versions. It comes from a real problem: modern webpages mix navigation, banners, JavaScript, repeated components, menus, and main content into HTML that can be noisy to process.

The contrarian point: the best llms.txt is not the longest one. It is the most selective one.

If you add 900 URLs because “more coverage is better”, you end up with a sitemap wearing a Markdown costume. A model, documentation tool, or agent that checks the file needs to know where to start. Editorial priority is the signal. That is why the file should include pillar guides, evergreen resources, service pages, product documentation, and content that answers recurring questions, not every post, tag, and paginated archive.

Expectation and utility also need to be separated. Google’s documentation for AI features says you do not need to create AI text files or special markup to appear in AI Overviews or AI Mode. OpenAI documents its crawlers and distinguishes GPTBot, OAI-SearchBot, and ChatGPT-User, but it does not state that llms.txt is a ranking, training, or retrieval signal. Perplexity publishes an llms.txt for its own documentation, but that proves adoption as a documentation format, not universal dependence across the ecosystem.

In a generative engine optimization (GEO) strategy, the file makes sense as a clarity layer. It reduces ambiguity, helps audits, and forces a decision about which pages best represent your expertise. But if the content lacks authority, sources, structure, and real usefulness, llms.txt will not fix it.

The real differences from robots.txt and sitemap.xml

The confusion comes from location: robots.txt, sitemap.xml, and llms.txt all live near the domain root. That is where the similarity ends. They perform different jobs, are validated differently, and fail differently.

robots.txt is a crawl-control mechanism. According to Google Search Central, it is mainly used to manage which URLs crawlers can access and to avoid overload, not to keep a page out of the index. If you need to block indexing, use noindex (as a meta tag or an X-Robots-Tag header) or put the page behind authentication; snippet controls such as nosnippet only limit what is shown in results. For AI bot decisions, start with the guide to robots.txt configuration and SEO errors, because robots.txt is where you can allow or block specific agents such as GPTBot, ClaudeBot, or PerplexityBot.
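To make the access layer concrete, here is a minimal sketch that checks which agents a robots.txt allows, using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical placeholders, not a recommended policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot everywhere, allow every other agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed_agents(robots_txt, agents, url):
    """Return, per user agent, whether robots.txt allows fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "PerplexityBot"],
    "https://example.com/blog/",
)
for agent, allowed in result.items():
    print(agent, "allowed" if allowed else "blocked")
```

The check only tells you what the file declares. Whether a given crawler honors it is something you verify in server logs.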

sitemap.xml is a discovery inventory. Google recommends including absolute, canonical, 200-status URLs that are relevant for search. The sitemap can declare lastmod, be segmented by content type, and be submitted through Search Console or declared with Sitemap: in robots.txt. If you mix blocked, noindex, redirected, or duplicate URLs, you send contradictory signals. The sitemap and robots audit guide covers exactly that coherence.
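One way to test that coherence is to compare the URLs you plan to promote in llms.txt against the sitemap. The sketch below parses a sitemap with the standard library and lists llms.txt URLs the sitemap does not declare. The sitemap content and all example.com URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; example.com URLs are placeholders.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/guides/technical-seo/</loc>
    <lastmod>2026-01-10</lastmod>
  </url>
  <url>
    <loc>https://example.com/services/consulting/</loc>
  </url>
</urlset>
"""

def sitemap_entries(xml_text):
    """Map each <loc> in the sitemap to its <lastmod> value (or None)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        if loc:
            entries[loc] = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
    return entries

def missing_from_sitemap(llms_urls, entries):
    """URLs promoted in llms.txt that the sitemap does not declare."""
    return [u for u in llms_urls if u not in entries]

entries = sitemap_entries(SITEMAP_XML)
llms_urls = [
    "https://example.com/guides/technical-seo/",
    "https://example.com/blog/old-post/",
]
missing = missing_from_sitemap(llms_urls, entries)
print(missing)
```

A URL that appears in llms.txt but not in the sitemap is not automatically wrong, but it deserves a second look before you promote it as a priority source.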

llms.txt is an interpretive map. It does not say “you may crawl this” or “this URL should be indexed”. It says: “if you want to understand this site, start here and read these pieces in this order”. Its role is closer to an editorial index or a product README than to an exclusion protocol.

The mental model is simple: robots.txt manages access, sitemap.xml manages discovery, llms.txt manages context. The expensive mistake is asking one to do another’s job. Blocking GPTBot in llms.txt does not block GPTBot. Listing a URL in llms.txt does not make it canonical. Adding a private page to the file can expose a path you did not want to highlight.

What it can do and what it cannot do

What llms.txt can do: improve your AI content governance. It forces you to answer questions many sites postpone: which pages are sources of truth, which resources are current, which language version should be prioritized, which content should not appear as a primary reference, and which URLs best explain your value proposition.

It can also help compatible tools. Some documentation platforms, development assistants, and internal workflows can read llms.txt to discover relevant pages before going deeper. Perplexity, for example, links its own llms.txt index from its documentation so available pages can be discovered. That is a practical sign of usefulness: as a controlled index, not as an algorithmic guarantee.

It also works as a gap audit. If your llms.txt includes a 2024 pillar guide that has not been updated, while the blog has a better 2026 version that is missing, you have found an editorial issue. If the file recommends a URL that returns a 301, you have found a technical issue. If the same topic appears in three URLs without hierarchy, you have found a cannibalization signal.
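The three findings above can be sketched as a small audit over data you have already collected. The `Entry` fields, the `topic` label, the one-year staleness threshold, and the "more than one URL per topic" rule are all assumptions for illustration; the status codes would come from your own crawl.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    url: str
    status: int          # HTTP status observed by your own crawler
    topic: str           # editorial topic label (hypothetical field)
    last_updated: date

def audit(entries, today, max_age_days=365):
    """Return (kind, subject, detail) tuples for editorial and technical gaps."""
    findings = []
    by_topic = defaultdict(list)
    for e in entries:
        if e.status in (301, 302, 307, 308):
            findings.append(("technical", e.url, f"returns {e.status}"))
        if (today - e.last_updated).days > max_age_days:
            findings.append(("editorial", e.url, "older than max_age_days"))
        by_topic[e.topic].append(e.url)
    for topic, urls in by_topic.items():
        if len(urls) > 1:
            findings.append(("cannibalization", topic, f"{len(urls)} URLs share this topic"))
    return findings

entries = [
    Entry("https://example.com/guide-2024/", 200, "geo", date(2024, 3, 1)),
    Entry("https://example.com/old-url/", 301, "audit", date(2026, 1, 5)),
    Entry("https://example.com/geo-basics/", 200, "geo", date(2026, 1, 2)),
]
findings = audit(entries, today=date(2026, 2, 1))
for kind, subject, detail in findings:
    print(kind, subject, detail)
```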

What it cannot do matters more. It cannot force Google to use your content in AI Overviews. It cannot instruct OpenAI to train or not train on your pages. It cannot replace documented controls for each crawler. It cannot improve weak content. It cannot solve rendering, architecture, duplication, or authority problems.

Think of llms.txt as a dossier cover page. It helps someone understand what is inside, but if the pages inside are empty, disordered, or unsourced, the cover does not save the work. To measure which bots actually reach the site, combine this layer with GPTBot, ClaudeBot, and AI bot log analysis. That gives you visits, user agents, frequency, and requested URLs. llms.txt does not provide that evidence.
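As a rough sketch of that log layer, the snippet below counts requests per AI bot and path from access-log lines. It assumes the common "combined" log format; the sample lines, IP addresses, and user-agent strings are illustrative, and real user agents should be checked against each vendor's documentation.

```python
import re
from collections import Counter

# Substrings to look for in the user-agent header.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Combined format: ip - - [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def count_ai_bot_hits(lines):
    """Count requests per (bot, path) for known AI crawler user agents."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        for bot in AI_BOTS:
            if bot in match.group("ua"):
                hits[(bot, match.group("path"))] += 1
                break
    return hits

# Illustrative log lines, not real traffic.
sample = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Jan/2026:10:01:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2026:10:02:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
hits = count_ai_bot_hits(sample)
for (bot, path), count in hits.items():
    print(bot, path, count)
```

User-agent strings can be spoofed, so treat these counts as a first signal and confirm with the IP ranges or verification methods each vendor publishes.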

A practical llms.txt example for an SEO site

A good file starts small. For an SEO agency, the first version might include 20 to 60 links: service pages, pillar guides, GEO resources, case studies, and contact pages. Every link should have a short description that explains why it matters rather than repeating the title tag.

Simplified example:

# Ighenatt

> SEO agency specializing in technical SEO, content strategy, and generative engine visibility for companies in Spain.

## High priority

- [Technical SEO audit](https://ighenatt.es/recursos/auditoria-seo/auditoria-seo-tecnica/): methodology for detecting crawling, indexing, architecture, and performance issues.
- [GEO for generative engines](https://ighenatt.es/recursos/geo/geo-optimizacion-motores-generativos/): pillar guide on visibility in ChatGPT, Perplexity, and AI Overviews.
- [AI bot log analysis](https://ighenatt.es/blog/analisis-logs-bots-ia-gptbot-claudebot/): process for identifying GPTBot, ClaudeBot, PerplexityBot, and other crawlers.

## Crawl control

- [Robots.txt and SEO errors](https://ighenatt.es/blog/robots-txt-configuracion-errores-seo/): differences between crawl blocking, indexing, and bot management.
- [XML sitemap and robots.txt](https://ighenatt.es/recursos/auditoria-seo/sitemap-robots-configuracion/): coordinated configuration for discovery and access.

## Commercial contact

- [Technical SEO consulting](https://ighenatt.es/servicios/consultoria-seo-tecnica/): service for technical audits, migrations, and SEO architecture.

Several decisions in the example are intentional. The URLs are absolute, not relative. Sections separate intent, not content format. Descriptions state what a reader or agent will find instead of offering promotional copy. And high priority is limited to pieces that explain the site better than the homepage alone.
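To audit a file like this you first need to read it as data. The sketch below parses the structure used in the example (one H1, one blockquote summary, H2 sections, and `- [title](url): description` links). It is a simplified reader written for this layout, not a reference parser for the proposal, and the `LlmsTxt` class is a hypothetical helper.

```python
import re
from dataclasses import dataclass, field

# Matches "- [Title](https://...): optional description"
LINK = re.compile(r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<desc>.*))?$")

@dataclass
class LlmsTxt:
    title: str = ""
    summary: str = ""
    sections: dict = field(default_factory=dict)  # section name -> list of links

def parse_llms_txt(text):
    """Parse title, summary, and per-section (title, url, description) tuples."""
    doc = LlmsTxt()
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and not doc.title:
            doc.title = line[2:].strip()
        elif line.startswith("> ") and not doc.summary:
            doc.summary = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc.sections[current] = []
        elif current is not None:
            m = LINK.match(line)
            if m:
                doc.sections[current].append(
                    (m["title"], m["url"], (m["desc"] or "").strip())
                )
    return doc

SAMPLE = """\
# Ighenatt

> SEO agency specializing in technical SEO, content strategy, and generative engine visibility for companies in Spain.

## High priority

- [Technical SEO audit](https://ighenatt.es/recursos/auditoria-seo/auditoria-seo-tecnica/): methodology for detecting crawling, indexing, architecture, and performance issues.

## Crawl control

- [Robots.txt and SEO errors](https://ighenatt.es/blog/robots-txt-configuracion-errores-seo/): differences between crawl blocking, indexing, and bot management.
"""

doc = parse_llms_txt(SAMPLE)
for section, links in doc.sections.items():
    print(section, [url for _, url, _ in links])
```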

For multilingual sites, do not mix languages casually. You can create a main llms.txt with language sections or publish auxiliary files linked from the main one. The important part is declaring the language of each resource and making sure a Catalan section does not point to the Spanish version of a guide when a Catalan equivalent exists.

A useful rule: every included URL must pass three tests before entry. It returns 200, it is canonical, and it has a review date. If it fails one of the three, it does not go in the file. Strict, yes. But it keeps llms.txt from becoming a storefront for technical debt.
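The three-test rule is easy to encode as a gate. This is a minimal sketch: the `Candidate` record and its field names are hypothetical, and the status and canonical values are assumed to come from your own crawl rather than being fetched here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Candidate:
    url: str
    status: int                   # HTTP status from your own crawl
    canonical: str                # rel="canonical" value found on the page
    review_date: Optional[date]   # editorial review date, if one exists

def failed_tests(c):
    """Return the names of the entry tests this candidate fails."""
    failures = []
    if c.status != 200:
        failures.append("status")
    if c.canonical != c.url:
        failures.append("canonical")
    if c.review_date is None:
        failures.append("review_date")
    return failures

good = Candidate("https://example.com/guide/", 200, "https://example.com/guide/", date(2026, 1, 15))
bad = Candidate("https://example.com/guide?ref=nav", 200, "https://example.com/guide/", None)

print(failed_tests(good))
print(failed_tests(bad))
```

A candidate enters the file only when the returned list is empty.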

How to maintain llms-full.txt without creating technical debt

The original llmstxt.org proposal mentions expanded files derived from the main index, including versions that contain the full text of linked URLs. In practice, many teams call this idea llms-full.txt: a larger Markdown file that bundles priority content so a system does not have to visit each URL individually. Useful, yes. Risky if automated without control.

The problem with llms-full.txt is not generation. It is cleanliness. A CMS can export HTML converted to Markdown, but it often drags along menus, CTAs, repeated blocks, breadcrumbs, legal text, forms, and related-article modules. That consumes tokens and muddies context. The value is in extracting main content, preserving H2-H3 hierarchy, keeping tables, including sources, and removing anything that does not improve understanding.

The update cadence depends on the site. For an active technical blog, regenerate llms-full.txt weekly or whenever you publish a pillar guide. For a stable corporate site, monthly is usually enough. For product documentation, tie regeneration to documentation deployments. In every case, keep a control fingerprint: generation date, URL count, file size, language, commit, or CMS version.
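A control fingerprint can be as simple as a short header generated with the file. The sketch below builds one from the fields listed above. Writing it as an HTML comment at the top of the Markdown file is an assumption; a sidecar file or a deployment log would work just as well, and the commit value is a placeholder.

```python
from datetime import datetime, timezone

def fingerprint(body, urls, language, commit, generated_at):
    """Build a comment header describing one generation of llms-full.txt."""
    size = len(body.encode("utf-8"))  # size in bytes, not characters
    lines = [
        f"generated: {generated_at.isoformat()}",
        f"url_count: {len(urls)}",
        f"size_bytes: {size}",
        f"language: {language}",
        f"commit: {commit}",
    ]
    return "<!--\n" + "\n".join(lines) + "\n-->\n"

body = "# Guía técnica\n\nContenido principal."
header = fingerprint(
    body,
    urls=["https://example.com/guia/"],
    language="es",
    commit="abc1234",  # placeholder commit hash
    generated_at=datetime(2026, 1, 10, tzinfo=timezone.utc),
)
print(header + body)
```

With the fingerprint in place, two generations of the file can be compared at a glance before anyone reads the content itself.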

You also need limits. If the file grows beyond a few megabytes, split it by topic or language. If a URL contributes only 150 shallow words, it probably does not deserve inclusion. If a page changes daily, maybe it should be linked from llms.txt but left out of the full file until it stabilizes.

Operational rule: llms.txt decides what enters; llms-full.txt packages what was already approved. Never reverse that order. If the full file is generated by crawling the whole site without governance, you have only created a heavy copy of the original mess.
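That ordering can be enforced in code: the builder receives the approved list and never discovers URLs on its own. The sketch below assumes the page bodies have already been cleaned to main-content Markdown. The 150-word floor comes from the limit discussed above, while the 2 MB ceiling and the source comment marker are assumptions.

```python
def build_llms_full(approved_urls, pages, min_words=150, max_bytes=2_000_000):
    """Bundle approved pages in llms.txt order.

    pages maps URL -> cleaned Markdown body. URLs that are not in
    approved_urls are ignored even if a body exists for them.
    """
    parts, skipped = [], []
    for url in approved_urls:
        body = pages.get(url, "").strip()
        if len(body.split()) <= min_words:
            skipped.append(url)  # too thin to deserve a place in the full file
            continue
        parts.append(f"<!-- source: {url} -->\n{body}\n")
    full = "\n".join(parts)
    if len(full.encode("utf-8")) > max_bytes:
        raise ValueError("llms-full.txt exceeds max_bytes; split by topic or language")
    return full, skipped

pages = {
    "https://example.com/pillar/": "word " * 200,
    "https://example.com/thin/": "word " * 40,
    "https://example.com/not-approved/": "word " * 500,
}
approved = ["https://example.com/pillar/", "https://example.com/thin/"]
full, skipped = build_llms_full(approved, pages)
print(len(full.split()), skipped)
```

Because the function only iterates over `approved_urls`, a page that was never approved cannot leak into the bundle by accident.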

Priority URL governance: who decides what enters

The file looks technical, but the decision is editorial and commercial. On a serious site, the developer should not be the only person deciding which URLs represent the company to AI assistants. SEO, content, legal, product, and sales may all have different priorities. The solution is not to add everything. The solution is an entry policy.

A simple model works well. Every candidate URL gets an owner, objective, language, status, priority, and next review date. Priority 1 covers pages that explain the entity: homepage, strategic services, pillar guides, sourced resources, and strong case studies. Priority 2 covers supporting articles, comparisons, and FAQs. Priority 3 covers tactical pieces that can rotate into or out of the file.

The key question is not “do we want an AI to see this page?”. The better question is: “if an AI could read only 30 of our URLs, should this be one of them?”. That limit forces clarity.
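The entry policy and the 30-URL question translate into a small record and a selection step. This is a sketch under assumptions: the `UrlRecord` class, the status values, and the owners are hypothetical, and a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class UrlRecord:
    url: str
    owner: str
    objective: str
    language: str
    status: str        # hypothetical values: "approved", "draft", "retired"
    priority: int      # 1, 2, or 3 as described above
    next_review: date

def select_for_llms_txt(records, limit=30):
    """Keep approved records only, ordered by priority, capped at the limit."""
    approved = [r for r in records if r.status == "approved"]
    # sorted() is stable, so records keep their input order within a priority.
    return sorted(approved, key=lambda r: r.priority)[:limit]

records = [
    UrlRecord("https://example.com/faq/", "content", "support", "en", "approved", 2, date(2026, 6, 1)),
    UrlRecord("https://example.com/", "seo", "explain the entity", "en", "approved", 1, date(2026, 3, 1)),
    UrlRecord("https://example.com/promo/", "sales", "campaign", "en", "draft", 3, date(2026, 2, 1)),
]
selected = select_for_llms_txt(records, limit=30)
print([r.url for r in selected])
```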

Exclusions need documentation too. Sensitive legal content, outdated pricing pages, temporary offers, campaign landing pages, internal search results, tags, and paginated archives almost never belong. If you need to block crawlers, that belongs in robots.txt or access controls, not llms.txt. But if you simply do not want to promote a URL as a primary source, leaving it out is enough.

Governance becomes even more important in GEO, where citability depends on clear sources and consistent entities. A pillar guide on generative AI should link to related resources, show visible authorship, cite real sources, and provide a direct answer. llms.txt can point to it, but the page has to deserve the pointer. The file does not create authority; it orders it.

Validation checklist, update cadence, and signals to measure

Before publishing, validate the file like a technical deployment. Opening it in a browser and seeing text is not enough.

Minimum checklist:

  • The file is available at https://yourdomain.com/llms.txt with status 200.
  • It uses Content-Type: text/plain or a compatible type that does not force odd downloads.
  • It is encoded in UTF-8 and special characters render correctly.
  • Every included URL is absolute, canonical, crawlable, and returns 200.
  • Internal routes end with a trailing slash if the site uses trailing slashes.
  • It does not include noindex, robots-blocked, redirected, or private pages.
  • Every link has a specific description, not a title repetition.
  • Sections reflect editorial priority, not only CMS categories.
  • llms-full.txt, if present, is generated from the same approved list.
  • The file has an owner, review date, and internal changelog.
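Several checklist items can be automated offline once you have the response metadata and the parsed links. The sketch below covers status, Content-Type, UTF-8 encoding, absolute URLs, trailing slashes, and descriptions that merely repeat the title. Accepting `text/markdown` as a compatible type is an assumption, and canonical, noindex, and robots checks still need a crawl.

```python
from urllib.parse import urlparse

def validate_file(status, content_type, raw):
    """Check the HTTP response that served llms.txt (raw is the body as bytes)."""
    problems = []
    if status != 200:
        problems.append(f"status is {status}, expected 200")
    if not content_type.lower().startswith(("text/plain", "text/markdown")):
        problems.append(f"unexpected Content-Type: {content_type}")
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("body is not valid UTF-8")
    return problems

def validate_link(title, url, description, trailing_slash=True):
    """Check one link entry against the offline parts of the checklist."""
    problems = []
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("URL is not absolute")
    if trailing_slash and not parts.path.endswith("/"):
        problems.append("missing trailing slash")
    if not description.strip() or description.strip().lower() == title.strip().lower():
        problems.append("description is empty or repeats the title")
    return problems

ok_file = validate_file(200, "text/plain; charset=utf-8", "café".encode("utf-8"))
bad_file = validate_file(404, "application/octet-stream", b"\xff")
ok_link = validate_link("Guide", "https://example.com/guide/", "What the guide covers and for whom.")
bad_link = validate_link("Guide", "/guide", "Guide")
print(ok_file, bad_file, ok_link, bad_link)
```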

Then measure with humility. Do not look for a clean “ranking lift from llms.txt”, because you will not isolate that variable. Measure verifiable signals: AI bots in logs, requests to /llms.txt, URLs most crawled by user agent, citations in AI responses, referred traffic from Perplexity or ChatGPT, and alignment between priority pages and content that actually receives crawling.

A reasonable cadence is monthly for sites that publish often, quarterly for stable sites, and mandatory after migrations, robots.txt changes, new pillar guides, or service restructures. The owner should review three columns: additions, removals, and priority changes.
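The three review columns fall out of a simple comparison between two versions of the approved list. The sketch below assumes each version is stored as a mapping from URL to priority; the URLs are placeholders.

```python
def review_changes(previous, current):
    """Compare two {url: priority} mappings and return the three review columns."""
    additions = sorted(set(current) - set(previous))
    removals = sorted(set(previous) - set(current))
    priority_changes = sorted(
        (url, previous[url], current[url])
        for url in set(previous) & set(current)
        if previous[url] != current[url]
    )
    return additions, removals, priority_changes

previous = {
    "https://example.com/": 1,
    "https://example.com/old-guide/": 2,
    "https://example.com/faq/": 3,
}
current = {
    "https://example.com/": 1,
    "https://example.com/faq/": 2,
    "https://example.com/new-guide/": 1,
}
additions, removals, priority_changes = review_changes(previous, current)
print(additions, removals, priority_changes)
```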

llms.txt deserves neither automatic cynicism nor blind faith. It is cheap, readable, easy to audit, and useful for ordering an AI-oriented content strategy. But its strength is the discipline it requires: choose, explain, maintain, and verify. Put differently: it does not optimize for you. It forces you to show what you would have optimized if you had to explain it to a machine with limited time.

Tags: #llms.txt #AI SEO technical guide #GEO #robots.txt #XML sitemap #llms-full.txt #AI crawlers #technical SEO
Elu Gonzalez

SEO Expert & Web Optimization