AI Bot Log Analysis: GPTBot and ClaudeBot 2026 | Ighenatt

Elu Gonzalez

Author

Google Analytics shows zero visits. The server logs show thousands of requests. The gap between those two figures — which once pointed to spam traffic or Googlebot — now includes a third actor that no analytics dashboard tracks by default: AI model bots.

Between February and March 2026, AI visibility firm WISLR analysed 48 days of server logs and documented 12,099 AI bot requests during that period. The most active bot was not GPTBot: it was Meta-WebIndexer, with 1,833 requests, followed by ChatGPT-User (923), Claude-SearchBot (549), and PerplexityBot (456). GPTBot contributed only 187 direct requests, yet its weight in terms of model training impact is disproportionate to its request volume.

The same analysis detected a pattern in how these systems discover content: on 18 and 19 March 2026, ClaudeBot and GPTBot requested the sitemap.xml file on the same days, despite belonging to different companies with no apparent technical connection. This suggests that content discovery conventions for LLM crawlers are quietly converging.

AI bots crawling your site in 2026: the complete user-agent list

The first step in managing AI bot traffic is knowing exactly which bots visit your site and for what purpose. Each company operates multiple bots with distinct roles: model training, indexing for real-time search, and user-initiated requests.

OpenAI operates three documented bots: GPTBot (model training, user-agent: GPTBot/1.2), OAI-SearchBot (indexing for ChatGPT Search), and ChatGPT-User (real-time requests initiated by ChatGPT users). The distinction is critical: blocking GPTBot affects future training but not ChatGPT Search citations, which rely on OAI-SearchBot.

Anthropic has the same three-part structure: ClaudeBot (training, ClaudeBot/0.1), Claude-SearchBot (indexing for search within Claude.ai), and Claude-User (user requests). All three are documented at support.anthropic.com.

Perplexity distinguishes between PerplexityBot (periodic indexing) and Perplexity-User (real-time retrieval per user query), both documented at docs.perplexity.ai.

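Google adds Google-Extended to the usual Googlebot catalogue. It is a robots.txt control token rather than a separate crawler: it governs whether content fetched by Google's existing crawlers may be used for training Gemini and Vertex AI, independently of search indexing. Because it has no user-agent string of its own, it never appears in access logs. Blocking it with Disallow: / for User-agent: Google-Extended does not affect organic rankings.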

The rest of the ecosystem includes CCBot (Common Crawl, the training base for many LLMs), Applebot-Extended (Apple Intelligence), Amazonbot (Alexa AI), Bytespider (ByteDance), and over a dozen additional agents with no official public documentation.

How to detect AI bots in server logs

The Apache or Nginx access log records every HTTP request with a timestamp, source IP, requested URL, response code, and client user-agent. It is the only source that sees AI bot traffic in its entirety, because — unlike Google Analytics — it does not depend on JavaScript.

To filter exclusively for GPTBot traffic in an Nginx log:

grep "GPTBot" /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20

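That command returns the 20 URLs most crawled by GPTBot, sorted by frequency. It assumes the default combined log format, where field $7 is the request path. Replacing "GPTBot" with another crawler user-agent from the list above gives you the same analysis for ClaudeBot or PerplexityBot. Google-Extended is the exception: it is a robots.txt token with no user-agent string of its own, so searching for it in logs returns nothing.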
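
To compare bots side by side rather than one at a time, the same idea can be wrapped in a loop. The following is a minimal sketch, not a definitive tool: the function name count_ai_bots and the AI_BOTS variable are hypothetical, and the token list is simply the crawler names mentioned in this article. Google-Extended and Applebot-Extended are left out because, according to their vendors' documentation, they are robots.txt tokens and do not send requests under their own user-agent.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count requests per AI bot user-agent in an access log.
# The token list is taken from the bots named in this article; extend as needed.
AI_BOTS="GPTBot OAI-SearchBot ChatGPT-User ClaudeBot Claude-SearchBot Claude-User PerplexityBot Perplexity-User CCBot Amazonbot Bytespider"

count_ai_bots() {
  local log="$1" bot hits
  for bot in $AI_BOTS; do
    # -F matches the token as a fixed string, -c counts matching lines.
    # grep exits non-zero on zero matches, so "|| true" keeps the loop going.
    hits=$(grep -cF -- "$bot" "$log" || true)
    printf '%s\t%s\n' "$hits" "$bot"
  done | sort -rn   # busiest bot first
}

# Usage (path is an example):
#   count_ai_bots /var/log/nginx/access.log
```

The output is one line per bot, with the request count first, so the most active crawler appears at the top.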

For Cloudflare users, the Analytics panel under “Security > Bots” shows bot traffic with automatic classification, though it groups categories together. Cloudflare Workers Logs and access to CDN logs via the API offer more granularity if you need to distinguish between training bots and retrieval bots.

Specialised tools — such as GoAccess for real-time log visualisation or Screaming Frog Log Analyser — let you load logs and segment by user-agent with a graphical interface, which is useful for high-volume sites where grep-based analysis becomes slow.

Unlike log analysis for Googlebot, where the focus is on crawl budget and indexing behaviour, AI bot log analysis aims to answer three distinct questions: how much are they crawling, which pages do they prioritise, and which type of bot dominates the traffic (training vs. retrieval), since that last answer informs blocking decisions.
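
The third question can be approximated with a single pass over the log. This is a rough sketch under stated assumptions: the function name classify_ai_traffic is hypothetical, and the three buckets follow the roles described in this article (training, search indexing, user-initiated requests). Bots not named in the patterns are ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split AI bot requests by role, using the bot names
# and roles given in this article.
classify_ai_traffic() {
  awk '
    # Training crawlers
    /GPTBot|ClaudeBot|CCBot/                      { training++; next }
    # Indexing crawlers that feed real-time search
    /OAI-SearchBot|Claude-SearchBot|PerplexityBot/ { search++; next }
    # Fetches triggered by a user inside the assistant
    /ChatGPT-User|Claude-User|Perplexity-User/     { user++; next }
    END { printf "training=%d search=%d user=%d\n", training, search, user }
  ' "$1"
}

# Usage (path is an example):
#   classify_ai_traffic /var/log/nginx/access.log
```

If the training count dominates, blocking decisions mostly concern future model training; if search and user requests dominate, they mostly concern citations.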

Crawl frequency and behaviour: GPTBot vs. ClaudeBot vs. Googlebot

Cloudflare Radar data shows that GPTBot grew 305% in request volume between May 2024 and May 2025, increasing its share of crawler traffic from 4.7% to 11.7% over that period. In the same interval, Googlebot traffic also grew by 96%, signalling that the bot ecosystem is expanding globally rather than displacing existing crawlers.

The most striking behavioural difference is the crawl-to-referral ratio: how many pages a bot crawls per real visit it sends to the site. For Googlebot, that ratio ranges between 3:1 and 30:1 depending on the site type. For Anthropic, Cloudflare documented a ratio of 38,000:1 in July 2025 — 38,000 pages crawled for every referred visit sent to external sites. This figure explains why many webmasters see spikes in ClaudeBot traffic in their logs with no corresponding referred visits in Analytics.
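
A site-level approximation of that ratio can be computed from your own logs. The sketch below is not Cloudflare's methodology, which is measured across its network. It assumes the combined log format (referrer and user-agent as the second and third quoted fields), and both the function name crawl_to_referral and the referrer domain passed in (for example claude.ai) are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: crawl requests from one bot per referred visit
# from the matching assistant domain. Assumes combined log format.
crawl_to_referral() {
  local log="$1" ua_token="$2" ref_domain="$3"
  # Splitting on double quotes: $2 = request line, $4 = referrer, $6 = user-agent
  awk -F'"' -v ua="$ua_token" -v ref="$ref_domain" '
    index($6, ua)  { crawls++ }      # request made by the bot
    index($4, ref) { referrals++ }   # human visit referred by the assistant
    END {
      if (referrals > 0) printf "%d:1\n", crawls / referrals
      else               printf "%d:0\n", crawls
    }
  ' "$log"
}

# Usage (path and domain are examples):
#   crawl_to_referral /var/log/nginx/access.log ClaudeBot claude.ai
```

The ratio is truncated to a whole number, and a result ending in :0 means the bot crawled but no referred visit was recorded in that log.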

GPTBot’s behaviour more closely resembles a traditional indexing crawl: incremental crawling, respect for crawl-delay in robots.txt when specified, and a preference for high-authority pages according to third-party analyses. ClaudeBot has a more aggressive deep-exploration pattern, particularly on sites with dense content architecture.

In terms of content-type distribution, AI bots show a preference for articles with a clear H2-H3 structure, statistical data, comparison tables, and FAQ sections. These are the same formats that tend to be cited in LLM answers. Your robots.txt rules, considered together with AI's impact on search rankings and crawling, determine how much of that content is available to be cited.

Block or allow? The strategic decision

The most relevant study on this question is BuzzStream's, published in March 2026 and based on 4 million citations analysed across ChatGPT, Gemini, AI Overviews, and AI Mode. Its principal finding is counterintuitive: 95% of all cited sites blocked at least one training bot via robots.txt, and 70% of ChatGPT citations came from sites that specifically blocked ChatGPT's retrieval bot.

The technical explanation is that many AI retrieval systems never reach the origin server: they extract data from SERP snippets (title, URL, indexed excerpt from Google) or from cached versions of content. Blocking via robots.txt is partially ineffective because the data was already in training datasets or in Google’s cache before the block was put in place.

The operational recommendation distinguishes between two types of decision. For training bots (GPTBot, ClaudeBot, Google-Extended, CCBot): blocking protects content from being used in future training cycles, but does not affect citations in current model versions. If the content is sensitive or proprietary, blocking makes sense; if it is public marketing content, the opportunity cost may outweigh the benefit.

For retrieval bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot): these bots feed the real-time search systems of each LLM. Blocking them does reduce the likelihood of being cited in responses to recent queries. For sites that want to maximise AI citability, these bots should be allowed.
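
As an illustration of the two recommendations above, a robots.txt that blocks the training bots while leaving the retrieval bots free might look like the following. It is a sketch rather than a recommended default: the bot names are the ones listed in this article, and whether each crawler honours the rules is up to its operator.

```text
# Training crawlers: keep content out of future training cycles
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# Retrieval crawlers: keep content eligible for real-time citations
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
```

Several User-agent lines can share one rule group, which keeps the file short as the list of bots grows.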

llms.txt: the protocol competing with robots.txt for AI

When Jeremy Howard published the llms.txt specification in September 2024, the premise was straightforward: just as robots.txt tells crawlers what not to crawl, llms.txt tells LLMs what to read first. The format is plain Markdown at the root of the domain, with a site description and a structured list of relevant resources with their URLs and descriptions.

# Ighenatt — Agencia SEO Barcelona

> Agencia SEO especializada en técnico, contenido y IA generativa.

## Blog SEO
- [Entity SEO y Knowledge Graph](/blog/entity-seo-optimizacion-entidades-ia/): Construir marca como entidad
- [Guía de auditoría SEO técnica](/blog/guia-auditorias-seo-tecnicas/): Proceso paso a paso

## Recursos
- [Recursos SEO](/recursos/): Guías técnicas descargables
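
A file like the one above can be given a quick structural check before publishing. The sketch below assumes only what this article describes: a title on the first non-empty line and at least one linked resource. The function name check_llms_txt is hypothetical, and this is a sanity check, not a validator for the full specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sanity-check the basic shape of a local llms.txt file.
check_llms_txt() {
  local file="$1"
  # The first non-empty line should be a Markdown H1 title ("# Name")
  awk 'NF { print; exit }' "$file" | grep -q '^# ' \
    || { echo "missing H1 title"; return 1; }
  # Expect at least one "- [name](url)" resource entry
  grep -Eq '^- \[[^]]+\]\([^)]+\)' "$file" \
    || { echo "no link entries"; return 1; }
  echo "ok"
}

# Usage (path is an example):
#   check_llms_txt ./public/llms.txt
```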

By October 2025, more than 844,000 sites had implemented llms.txt, including Anthropic and Cloudflare themselves. The issue is compliance: no major LLM provider has confirmed that their crawlers read llms.txt consistently. Anthropic acknowledged in internal documentation that “their systems consider the file when it exists”, without offering further technical detail. OpenAI and Google have made no public statements about their use of the standard.

The fundamental difference from robots.txt is that the latter has immediate, verifiable consequences (within 24–48 hours you can confirm in Search Console that Googlebot respects a block); llms.txt lacks that verification mechanism. Nevertheless, the implementation cost is minimal and the risk is zero: if LLMs begin following it consistently in the future, sites that have already implemented it will have a structural advantage with no additional effort required.

AI citation impact: technical checklist for 2026

The final decision on what to block depends on business objectives and content type. The following decision framework applies to the majority of company and agency websites:

For public marketing and blog content: allow all retrieval bots, evaluate training bot blocking according to content usage policy. Implement llms.txt with the most relevant resources.

For tools or content with a technical competitive advantage: block training bots (GPTBot, ClaudeBot, Google-Extended), allow retrieval bots. Add X-Robots-Tag: noai in the HTTP headers of pages with sensitive content.

For news sites or sites with time-sensitive content: allow all retrieval bots to maximise citations in responses to recent queries. For these sites, the value of being cited in AI answers offsets the cost of being crawled.
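
Once the X-Robots-Tag header from the second case is configured, it is worth confirming that the server actually sends it. The following is a small sketch: the function name has_noai_header is hypothetical and the URL in the usage line is a placeholder. Note that noai is an informal directive, so the check confirms only that the header is present, not that any crawler honours it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the HTTP response headers on stdin
# contain an X-Robots-Tag header that includes "noai".
has_noai_header() {
  grep -i '^x-robots-tag:' | grep -qi 'noai'
}

# Usage (URL is a placeholder):
#   curl -sI https://example.com/sensitive-page/ | has_noai_header && echo "noai is set"
```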

The natural next step is programmatic SEO to systematically generate the type of content that retrieval bots prioritise: pages with clear structure, verifiable data, and direct answers to high-frequency search queries.

Frequently Asked Questions

How often do you publish new content?

We publish new articles weekly, focused on the latest technical SEO trends, real case studies, and best practices. Subscribe to our newsletter so you don't miss any updates.

Does the advice apply to any type of website?

Our advice adapts to different types of sites: ecommerce, blogs, corporate sites, and web applications. We always indicate when a technique is specific to a certain type of site or technical requirement.

Can I implement these techniques myself?

You can implement many basic techniques yourself by following our step-by-step guides. For advanced optimisations or full audits, we recommend consulting technical SEO specialists such as our team.

Do you offer personalised consulting services?

Yes, we offer personalised technical SEO consulting, full audits, and end-to-end optimisation. Contact us to discuss your project's specific needs and how we can help.

Tags: #GPTBot logs #ClaudeBot analysis #AI bots crawling #llms.txt protocol #server log analysis #block AI crawlers #PerplexityBot #technical SEO AI
Elu Gonzalez

SEO Expert & Web Optimization