Most SEOs set up Google Search Console, run a crawler and consider the technical diagnosis complete. What they do not see is the full picture: everything that happens on the server before GSC receives the filtered data. Server logs are that full picture.
An Apache access log records every HTTP request that reaches the server: exact URL, millisecond-precise timestamp, source IP, user-agent and response code. No sampling, no filters, no 48-hour delays. If Googlebot crawled your privacy policy page 47 times in a month while your main category received no visits at all, the log shows it. GSC probably does not.
Server logs record every HTTP request unfiltered, capturing bot crawls that analytics tools like Google Analytics do not collect by design. For any site with more than 10,000 pages, they are the most accurate source for diagnosing crawlability and indexation problems.
This guide covers log analysis at the technical level: how to read Apache and Nginx format, how to identify Googlebot (and verify it is actually Googlebot), how to detect problematic crawl patterns, which tools to use and how to interpret what the data reveals about the site’s SEO health.
What a server log contains and how to read it
An access log entry in Apache’s Combined Log Format looks like this:
66.249.73.135 - - [28/Mar/2026:08:42:17 +0100] "GET /category/shoes/ HTTP/1.1" 200 4521 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
Each field has a precise meaning:
- 66.249.73.135 — Client IP (in this case, a Google range)
- [28/Mar/2026:08:42:17 +0100] — Timestamp with timezone
- "GET /category/shoes/ HTTP/1.1" — HTTP method, requested URL and protocol
- 200 — Server response code
- 4521 — Response size in bytes
- "-" — Referrer (empty here)
- "Mozilla/5.0 (compatible; Googlebot/2.1; +…)" — User-agent
Nginx uses a similar format by default. The practical difference is in configuration: Nginx defines access_log and error_log per server block (access.log and error.log), while Apache's CustomLog and ErrorLog directives can be set globally or per VirtualHost, so several sites may end up sharing a single access log.
IIS (Internet Information Services) uses the W3C Extended Log File Format, which has column headers at the start of the file and orders fields slightly differently. Screaming Frog Log Analyzer accepts all three formats.
The fields that matter most for SEO
For an SEO-focused analysis, four fields are critical:
The user-agent identifies who is making the request. Desktop Googlebot identifies itself as Googlebot/2.1; the smartphone Googlebot uses the same token preceded by an Android device string. Bingbot identifies as bingbot/2.0 and OpenAI's GPTBot as GPTBot/1.2. An empty user-agent or suspicious strings may indicate scraping or malicious bots.
The response code is the instant diagnosis: 200 (OK), 301/302 (redirect), 404 (not found), 500/503 (server error). The distribution of these codes in Googlebot traffic reveals the technical health of the site.
The URL allows grouping crawls by section: categories, products, parameters, pagination URLs. Crawl frequency by section is the most direct signal of where Googlebot is spending its budget.
The timestamp allows building time series: how often does Googlebot visit each URL? Are there critical URLs that have not been crawled in 30 days?
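Both questions can be answered directly from the raw file. A minimal sketch, assuming an Apache Combined Log Format file named access.log and filtering on the Googlebot user-agent string only (IP verification comes in the next section):

import re
from datetime import datetime, timedelta

log_pattern = re.compile(r'\S+ \S+ \S+ \[([^\]]+)\] "\S+ (\S+) \S+" \d+ \S+ "[^"]*" "([^"]*)"')

last_seen = {}
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = log_pattern.match(line)
        if not m or 'Googlebot' not in m.group(3):
            continue
        # Apache timestamps look like 28/Mar/2026:08:42:17 +0100
        ts = datetime.strptime(m.group(1), '%d/%b/%Y:%H:%M:%S %z')
        url = m.group(2)
        if url not in last_seen or ts > last_seen[url]:
            last_seen[url] = ts

# URLs Googlebot has not requested in the last 30 days of the log
cutoff = max(last_seen.values()) - timedelta(days=30)
stale = [url for url, ts in last_seen.items() if ts < cutoff]
print(f'{len(stale)} crawled URLs with no Googlebot visit in 30+ days')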
How to identify Googlebot (and verify it is real)
Here is the problem few people mention: any bot can spoof Googlebot’s user-agent. A scraper can send requests with Googlebot/2.1 in the user-agent and appear legitimate in the logs. To confirm that an access is genuinely from Google, you need to perform a reverse DNS lookup.
The process has two steps. First, look up the hostname from the log’s IP:
host 66.249.73.135
# Result: 66.249.73.135.in-addr.arpa domain name pointer crawl-66-249-73-135.googlebot.com.
Second, verify that the hostname resolves back to the same IP:
host crawl-66-249-73-135.googlebot.com
# Result: crawl-66-249-73-135.googlebot.com has address 66.249.73.135
If both steps match and the domain ends in .googlebot.com or .google.com, the request is legitimately from Google. Google also publishes its IP ranges at https://developers.google.com/static/search/apis/ipranges/googlebot.json. Log analysis tools like Screaming Frog Log Analyzer automate this verification.
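The two host commands translate directly into a small function. A minimal sketch using only Python's standard library socket module; pass it any IP taken from the logs:

import socket

def is_verified_googlebot(ip):
    # Step 1: reverse DNS lookup of the IP from the log
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    # Step 2: forward lookup of that hostname must return the original IP
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_verified_googlebot('66.249.73.135'))  # True only if both lookups match

In practice, cache the result per IP: Googlebot crawls from a limited set of ranges, so a few hundred lookups typically cover millions of log lines.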
Both Screaming Frog Log Analyzer and OnCrawl run this check automatically against the IP ranges published by search engines, separating verified bots from impostors spoofing their identity. This is particularly relevant in 2025, when total crawler traffic grew 18% year-on-year and AI bots now account for a growing share of non-human server traffic.
AI bots in 2025: a new actor in the logs
Between May 2024 and May 2025, OpenAI’s GPTBot traffic grew 305% on analysed servers. Googlebot, for its part, increased by 96%. This data, published by Single Grain based on client server log analysis, has practical implications: logs no longer just reveal search engine behaviour — they also show who is training AI models on your content.
ClaudeBot (Anthropic), GPTBot (OpenAI), CCBot (Common Crawl) and Bytespider (ByteDance/TikTok) are the most common AI bots appearing in current logs. None of them appear in Google Search Console. They are only visible in server logs.
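A quick way to measure that share is to bucket requests by user-agent substring. A minimal sketch over a raw access.log (substring matching only, without IP verification):

from collections import Counter

BOT_FAMILIES = ['Googlebot', 'bingbot', 'GPTBot', 'ClaudeBot', 'CCBot', 'Bytespider']

counts = Counter()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.rsplit('"', 2)
        if len(parts) < 3:
            continue
        ua = parts[-2]  # the user-agent is the last quoted field
        for bot in BOT_FAMILIES:
            if bot in ua:
                counts[bot] += 1
                break
        else:
            counts['other'] += 1

for bot, hits in counts.most_common():
    print(f'{bot}: {hits}')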
Problematic patterns that logs reveal before any other tool
Dana Tan, Director of SEO at Under Armour, captures it with surgical precision: “Getting server logs takes the conjecture out of SEO and it’s 100% scientific. It’s data.” There is no possible misinterpretation when the log shows Googlebot crawling a URL 200 times in a month with a 404 response code.
These are the most common patterns that logs detect before any other source:
Excessive crawling of low-value URLs
E-commerce faceted navigation is the most common case. A site with 50,000 products can generate millions of filter URL combinations: /shoes/?colour=red&size=42&price=50-100. Many of these URLs have no real SEO value. If Googlebot is crawling these combinations frequently, crawl budget is consumed on URLs that will never rank.
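A minimal sketch to quantify that pattern: count Googlebot requests to parameterised versus clean URLs and rank the parameter names involved (substring matching on the user-agent, as an approximation):

import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

param_hits = Counter()
clean_hits = 0
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' not in line:
            continue
        m = re.search(r'"(?:GET|POST|HEAD) (\S+) ', line)
        if not m:
            continue
        query = urlsplit(m.group(1)).query
        if query:
            # count each parameter name so the worst offenders stand out
            param_hits.update(name for name, _ in parse_qsl(query, keep_blank_values=True))
        else:
            clean_hits += 1

print(f'Googlebot hits on clean URLs: {clean_hits}')
print('Googlebot hits by URL parameter:', param_hits.most_common(10))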
OnCrawl documented a case where 4.5 million URLs were being crawled unnecessarily on an e-commerce site. The problem was not visible in Google Search Console because the URLs returned 200 responses and had canonicals implemented — but the logs showed Googlebot continued crawling them regularly despite the canonical. The solution combined robots.txt for the most aggressive parameter URLs and a sitemap review to prioritise valuable URLs.
High-value pages with insufficient or zero crawling
The opposite problem. New category pages or recently published products that receive no Googlebot visits for weeks. Logs allow you to cross-reference the list of important URLs (obtained from the sitemap or a crawl) against the record of actual crawls. The discrepancy between what should be crawled and what is actually being crawled points to internal linking or architecture issues.
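A minimal sketch of that cross-reference, assuming a file important_urls.txt with one URL path per line (exported from the sitemap or from a crawl):

import re

# Paths Googlebot actually requested (query strings stripped)
crawled = set()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' in line:
            m = re.search(r'"(?:GET|HEAD) (\S+) ', line)
            if m:
                crawled.add(m.group(1).split('?')[0])

# Paths that should be getting crawled
with open('important_urls.txt', encoding='utf-8') as f:
    important = {line.strip() for line in f if line.strip()}

never_crawled = sorted(important - crawled)
print(f'{len(never_crawled)} important URLs with zero Googlebot visits in this log period')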
PJ Howland, VP of Industry Insights at 97th Floor, places it in the correct context: “Crawlability is the foundation of any technical SEO rollout. Without crawlability sites won’t get indexed. Without getting indexed, they won’t rank.” Logs are the only place where you can directly verify whether that foundation is working.
5xx errors that do not appear in GSC
Server errors (500, 502, 503) that occur during Googlebot’s crawl only appear in GSC’s Coverage report if they are persistent. A transient 503 error lasting 2 minutes can coincide exactly with a Googlebot visit, and that crawl is recorded as failed in the logs without leaving a trace in GSC. If logs show a pattern of 5xx errors at specific times (load spikes, deployments), there is an availability problem affecting the crawl.
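A minimal sketch that surfaces the pattern by bucketing Googlebot 5xx responses per hour, so spikes can be lined up against deployments or traffic peaks:

import re
from collections import Counter

hour_pattern = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2})')  # day/month/year:hour
errors_by_hour = Counter()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' not in line:
            continue
        if not re.search(r'" 5\d{2} ', line):  # status code sits right after the request field
            continue
        m = hour_pattern.search(line)
        if m:
            errors_by_hour[m.group(1)] += 1

for hour, count in sorted(errors_by_hour.items(), key=lambda item: -item[1])[:10]:
    print(hour, count)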
Redirect chains consuming budget
A URL that returns a 301 pointing to another URL that returns another 301 before reaching the final destination. Googlebot follows redirects, but each hop consumes crawl time. Logs allow you to identify which URLs return redirects and how many hops there are to the final URL. Google’s recommendation is that redirects be direct (a single 301 to the final destination).
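A minimal sketch that measures those chains, using the requests library on URLs the logs flagged as 3xx for Googlebot (the domain and input file are placeholders):

import requests

BASE = 'https://www.example.com'  # placeholder: your own domain

# redirecting_paths.txt: paths that returned 301/302 to Googlebot, extracted from the logs
with open('redirecting_paths.txt', encoding='utf-8') as f:
    paths = [line.strip() for line in f if line.strip()]

for path in paths:
    resp = requests.get(BASE + path, allow_redirects=True, timeout=10)
    hops = len(resp.history)  # each entry in history is one intermediate redirect
    if hops > 1:
        chain = ' -> '.join([r.url for r in resp.history] + [resp.url])
        print(f'{hops} hops: {chain}')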
The OnCrawl case: +37% sessions through log analysis
The most documented case study of SEO improvement through log analysis comes from OnCrawl’s work with an e-commerce client selling high-value products with fast turnover (pages were removed after sales).
Log analysis revealed three simultaneous problems:
- Duplicate subfolders with outdated content receiving frequent Googlebot crawling
- 404 error pages from sold products being crawled repeatedly
- Low-value subfolders receiving more crawl budget than main categories
The implemented solutions were: deletion and 301 redirection of redundant subfolders, strategic internal links from low-priority to main category pages, sitemap reorganisation to prioritise critical URLs, and canonical and meta robots review on high-priority pages.
The documented result was a 37% increase in sessions and 22% in transactions after implementation. The root cause — crawl budget waste — was not visible from GSC or from a conventional crawler. Only the logs showed where Googlebot’s time was actually going.
Tools for SEO log analysis
Screaming Frog Log Analyzer
The most straightforward option for teams already using Screaming Frog SEO Spider. The Log Analyzer accepts files in Apache, Nginx and W3C Extended Log Format. The workflow is: export the logs from the server (or request them from your host), load the file into the tool, filter by verified Googlebot.
Screaming Frog Log Analyzer allows segmentation by: most and least frequently crawled URLs, response code distribution for Googlebot, and comparison between Log Analyzer and SEO Spider crawl data in the same interface. The advantage is direct integration with the Spider for cross-referencing crawlability data and log data.
The limit is volume: for very large sites, processing log files of several GB can be slow. For those cases, OnCrawl or Botify are more appropriate.
OnCrawl
OnCrawl is the reference platform for technical SEO analysis at enterprise scale. Unlike Screaming Frog, OnCrawl allows direct server integration to receive logs continuously (not just point-in-time files). It combines log data with its own crawl data and Google Search Console data in a single dashboard.
The most valuable functionality for crawl diagnostics is the correlation between crawl frequency and page performance: OnCrawl automatically cross-references which pages have the most organic traffic with which pages receive the most Googlebot visits. Discrepancies (high-traffic pages with little crawling, or no-traffic pages with frequent crawling) are the highest-priority alerts.
Seolyzer
Seolyzer is oriented towards medium-sized sites and its differentiator is real-time log error detection. No file downloads required: a snippet is installed on the server that sends logs directly to the platform. It automatically identifies error patterns, bot crawls, and generates alerts when it detects anomalies (spike in 404s, drop in Googlebot crawl frequency).
Python analysis for large volumes
For log files exceeding several GB, GUI tools have performance limitations. Analysing them with Python and pandas allows processing millions of lines in minutes; the example below parses the Combined Log Format with a plain regular expression (dedicated parsers such as apache-log-parser are an alternative).
The basic Python workflow:
import pandas as pd
import re
# Regex for Combined Log Format
log_pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) (\S+) "([^"]*)" "([^"]*)"'
rows = []
with open('access.log', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        match = re.match(log_pattern, line)
        if match:
            rows.append({
                'ip': match.group(1),
                'datetime': match.group(2),
                'method': match.group(3),
                'url': match.group(4),
                'status': int(match.group(5)),
                'user_agent': match.group(8)
            })
df = pd.DataFrame(rows)
# Filter Googlebot only
googlebot = df[df['user_agent'].str.contains('Googlebot', case=False, na=False)]
# Most crawled URLs by Googlebot
top_crawled = googlebot['url'].value_counts().head(50)
print(top_crawled)
# Response code distribution for Googlebot
print(googlebot['status'].value_counts())
This script processes a 2GB log file in under 2 minutes on a standard computer. From there, more complex segmentations can be built: grouping URLs by section (using regex on the url column), calculating average crawl frequency per page, or identifying 404 URLs that Googlebot keeps visiting.
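Continuing from the googlebot DataFrame above, a minimal sketch of those segmentations; the section regex is an assumption and should be adapted to your own URL structure:

# Group Googlebot hits by top-level URL section, e.g. /category/shoes/ -> "category"
googlebot = googlebot.copy()
googlebot['section'] = googlebot['url'].str.extract(r'^/([^/?]+)', expand=False).fillna('root')
print(googlebot.groupby('section').size().sort_values(ascending=False).head(20))

# 404 URLs that Googlebot keeps revisiting
repeat_404 = googlebot[googlebot['status'] == 404]['url'].value_counts()
print(repeat_404[repeat_404 > 5].head(20))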
What to do with what you find in the logs
Log analysis does not end at diagnosis. Each problematic pattern has a concrete action:
Low-value URLs crawled at high frequency: Add Disallow directives in robots.txt for sections with no SEO value (filter parameter URLs, internal search URLs, session URLs); a quick way to verify those rules is sketched after this list. For URLs that must remain accessible but should not appear in the index, use a noindex meta robots tag or a canonical to the preferred version, and keep those URLs out of robots.txt so Google can actually see the directive.
Important pages with insufficient crawling: Review internal linking to those URLs. A page without sufficient internal links receives little PageRank and therefore less Googlebot interest. Add those URLs to sitemap.xml with high priority. Verify they are not accidentally blocked in robots.txt.
404 errors crawled repeatedly: For deleted pages that had traffic or backlinks, implement a 301 redirect to the most relevant available content. For URLs that should never have existed (junk parameters, script-generated URLs), block in robots.txt.
5xx errors during crawling: Investigate load spikes coinciding with 5xx errors. If the server genuinely cannot handle Googlebot's crawl rate, serving temporary 503 or 429 responses makes Googlebot slow down on its own, although Google recommends this only as a short-term measure; the crawl rate limiter that used to sit in Search Console settings was retired in early 2024.
Redirect chains: Update source URLs to point directly to the final destination. Each intermediate 301 can be eliminated by updating the CMS or internal systems that generate the links.
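For the robots.txt piece, the rules can be checked programmatically before and after deployment. A minimal sketch using urllib.robotparser from the standard library; the domain and example paths are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.example.com/robots.txt')  # placeholder domain
rp.read()

should_be_blocked = ['/shoes/?colour=red&size=42', '/search?q=test']      # low-value patterns
must_stay_crawlable = ['/category/shoes/', '/product/example-product/']  # priority pages

for path in should_be_blocked:
    if rp.can_fetch('Googlebot', 'https://www.example.com' + path):
        print(f'STILL CRAWLABLE: {path}')

for path in must_stay_crawlable:
    if not rp.can_fetch('Googlebot', 'https://www.example.com' + path):
        print(f'BLOCKED BY MISTAKE: {path}')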
Googlebot’s crawl frequency is an indirect signal of the quality perception Google has of a site. A fast site with frequently updated content and strong internal architecture receives more frequent visits. A site with many errors or outdated content sees Googlebot reduce its cadence. Logs record that pulse objectively, without interpretations or sampling.
If you want to know exactly how Googlebot is crawling your site and where crawl budget is being wasted, log analysis is part of every technical SEO audit we carry out. Tell us about your case.
Frequently Asked Questions
How often do you publish new content?
We publish new articles weekly, focused on the latest technical SEO trends, real case studies and best practices. Subscribe to our newsletter so you don't miss any updates.
Are the tips applicable to any type of website?
Our advice adapts to different types of sites: ecommerce, blogs, corporate sites and web applications. We always indicate when a technique is specific to a certain type of site or set of technical requirements.
Can I implement these techniques myself?
You can implement many basic techniques yourself by following our step-by-step guides. For advanced optimisations or full audits, we recommend consulting technical SEO specialists such as our team.
Do you offer personalised consulting services?
Yes, we offer personalised technical SEO consulting, full audits and end-to-end optimisation. Contact us to discuss your project's specific needs and how we can help.