Most SEOs set up Google Search Console, run a crawler and consider the technical diagnosis complete. What they do not see is the full picture: everything that happens on the server before GSC receives the filtered data. Server logs are that full picture.
An Apache access log records every HTTP request that reaches the server: exact URL, millisecond-precise timestamp, source IP, user-agent and response code. No sampling, no filters, no 48-hour delays. If Googlebot crawled your privacy policy page 47 times in a month while your main category received no visits at all, the log shows it. GSC probably does not.
Server logs record every HTTP request unfiltered, capturing bot crawls that analytics tools like Google Analytics do not collect by design. For any site with more than 10,000 pages, they are the most accurate source for diagnosing crawlability and indexation problems.
This guide covers log analysis at the technical level: how to read Apache and Nginx format, how to identify Googlebot (and verify it is actually Googlebot), how to detect problematic crawl patterns, which tools to use and how to interpret what the data reveals about the site’s SEO health.
What a server log contains and how to read it
An access log entry in Apache’s Combined Log Format looks like this:
66.249.73.135 - - [28/Mar/2026:08:42:17 +0100] "GET /category/shoes/ HTTP/1.1" 200 4521 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"
Each field has a precise meaning:
- 66.249.73.135 — Client IP (in this case, a Google range)
- [28/Mar/2026:08:42:17 +0100] — Timestamp with timezone
- "GET /category/shoes/ HTTP/1.1" — HTTP method, requested URL and protocol
- 200 — Server response code
- 4521 — Response size in bytes
- "-" — Referrer (empty here)
- "Mozilla/5.0 (compatible; Googlebot/2.1; +…)" — User-agent
Nginx uses a similar format by default. The practical difference is in configuration: Nginx defines access_log and error_log per server block (access.log and error.log), while Apache's CustomLog and ErrorLog directives can be set globally or per VirtualHost, so several sites may end up sharing a single access log.
IIS (Internet Information Services) uses the W3C Extended Log File Format, which has column headers at the start of the file and orders fields slightly differently. Screaming Frog Log Analyzer accepts all three formats.
The fields that matter most for SEO
For an SEO-focused analysis, four fields are critical:
The user-agent identifies who is making the request. Desktop Googlebot identifies itself as Googlebot/2.1; the smartphone Googlebot uses the same token preceded by an Android device string. Bingbot identifies as bingbot/2.0 and OpenAI's GPTBot as GPTBot/1.2. An empty user-agent or suspicious strings may indicate scraping or malicious bots.
The response code is the instant diagnosis: 200 (OK), 301/302 (redirect), 404 (not found), 500/503 (server error). The distribution of these codes in Googlebot traffic reveals the technical health of the site.
The URL allows grouping crawls by section: categories, products, parameters, pagination URLs. Crawl frequency by section is the most direct signal of where Googlebot is spending its budget.
The timestamp allows building time series: how often does Googlebot visit each URL? Are there critical URLs that have not been crawled in 30 days?
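Both questions can be answered directly from the raw file. A minimal sketch, assuming an Apache Combined Log Format file named access.log and filtering on the Googlebot user-agent string only (IP verification comes in the next section):

import re
from datetime import datetime, timedelta

log_pattern = re.compile(r'\S+ \S+ \S+ \[([^\]]+)\] "\S+ (\S+) \S+" \d+ \S+ "[^"]*" "([^"]*)"')

last_seen = {}
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = log_pattern.match(line)
        if not m or 'Googlebot' not in m.group(3):
            continue
        # Apache timestamps look like 28/Mar/2026:08:42:17 +0100
        ts = datetime.strptime(m.group(1), '%d/%b/%Y:%H:%M:%S %z')
        url = m.group(2)
        if url not in last_seen or ts > last_seen[url]:
            last_seen[url] = ts

# URLs Googlebot has not requested in the last 30 days of the log
cutoff = max(last_seen.values()) - timedelta(days=30)
stale = [url for url, ts in last_seen.items() if ts < cutoff]
print(f'{len(stale)} crawled URLs with no Googlebot visit in 30+ days')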
How to identify Googlebot (and verify it is real)
Here is the problem few people mention: any bot can spoof Googlebot’s user-agent. A scraper can send requests with Googlebot/2.1 in the user-agent and appear legitimate in the logs. To confirm that an access is genuinely from Google, you need to perform a reverse DNS lookup.
The process has two steps. First, look up the hostname from the log’s IP:
host 66.249.73.135
# Result: 66.249.73.135.in-addr.arpa domain name pointer crawl-66-249-73-135.googlebot.com.
Second, verify that the hostname resolves back to the same IP:
host crawl-66-249-73-135.googlebot.com
# Result: crawl-66-249-73-135.googlebot.com has address 66.249.73.135
If both steps match and the domain ends in .googlebot.com or .google.com, the request is legitimately from Google. Google also publishes its IP ranges at https://developers.google.com/static/search/apis/ipranges/googlebot.json. Log analysis tools like Screaming Frog Log Analyzer automate this verification.
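The two host commands translate directly into a small function. A minimal sketch using only Python's standard library socket module; pass it any IP taken from the logs:

import socket

def is_verified_googlebot(ip):
    # Step 1: reverse DNS lookup of the IP from the log
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    if not hostname.endswith(('.googlebot.com', '.google.com')):
        return False
    # Step 2: forward lookup of that hostname must return the original IP
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_verified_googlebot('66.249.73.135'))  # True only if both lookups match

In practice, cache the result per IP: Googlebot crawls from a limited set of ranges, so a few hundred lookups typically cover millions of log lines.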
Both Screaming Frog Log Analyzer and OnCrawl run this check automatically against the IP ranges published by search engines, separating verified bots from impostors spoofing their identity. This is particularly relevant in 2025, when total crawler traffic grew 18% year-on-year and AI bots now account for a growing share of non-human server traffic.
AI bots in 2025: a new actor in the logs
Between May 2024 and May 2025, OpenAI’s GPTBot traffic grew 305% on analysed servers. Googlebot, for its part, increased by 96%. This data, published by Single Grain based on client server log analysis, has practical implications: logs no longer just reveal search engine behaviour — they also show who is training AI models on your content.
ClaudeBot (Anthropic), GPTBot (OpenAI), CCBot (Common Crawl) and Bytespider (ByteDance/TikTok) are the most common AI bots appearing in current logs. None of them appear in Google Search Console. They are only visible in server logs.
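A quick way to measure that share is to bucket requests by user-agent substring. A minimal sketch over a raw access.log (substring matching only, without IP verification):

from collections import Counter

BOT_FAMILIES = ['Googlebot', 'bingbot', 'GPTBot', 'ClaudeBot', 'CCBot', 'Bytespider']

counts = Counter()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        parts = line.rsplit('"', 2)
        if len(parts) < 3:
            continue
        ua = parts[-2]  # the user-agent is the last quoted field
        for bot in BOT_FAMILIES:
            if bot in ua:
                counts[bot] += 1
                break
        else:
            counts['other'] += 1

for bot, hits in counts.most_common():
    print(f'{bot}: {hits}')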
Problematic patterns that logs reveal before any other tool
Dana Tan, Director of SEO at Under Armour, captures it with surgical precision: “Getting server logs takes the conjecture out of SEO and it’s 100% scientific. It’s data.” There is no possible misinterpretation when the log shows Googlebot crawling a URL 200 times in a month with a 404 response code.
These are the most common patterns that logs detect before any other source:
Excessive crawling of low-value URLs
E-commerce faceted navigation is the most common case. A site with 50,000 products can generate millions of filter URL combinations: /shoes/?colour=red&size=42&price=50-100. Many of these URLs have no real SEO value. If Googlebot is crawling these combinations frequently, crawl budget is consumed on URLs that will never rank.
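A minimal sketch to quantify that pattern: count Googlebot requests to parameterised versus clean URLs and rank the parameter names involved (substring matching on the user-agent, as an approximation):

import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

param_hits = Counter()
clean_hits = 0
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' not in line:
            continue
        m = re.search(r'"(?:GET|POST|HEAD) (\S+) ', line)
        if not m:
            continue
        query = urlsplit(m.group(1)).query
        if query:
            # count each parameter name so the worst offenders stand out
            param_hits.update(name for name, _ in parse_qsl(query, keep_blank_values=True))
        else:
            clean_hits += 1

print(f'Googlebot hits on clean URLs: {clean_hits}')
print('Googlebot hits by URL parameter:', param_hits.most_common(10))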
OnCrawl documented a case where 4.5 million URLs were being crawled unnecessarily on an e-commerce site. The problem was not visible in Google Search Console because the URLs returned 200 responses and had canonicals implemented — but the logs showed Googlebot continued crawling them regularly despite the canonical. The solution combined robots.txt for the most aggressive parameter URLs and a sitemap review to prioritise valuable URLs.
High-value pages with insufficient or zero crawling
The opposite problem. New category pages or recently published products that receive no Googlebot visits for weeks. Logs allow you to cross-reference the list of important URLs (obtained from the sitemap or a crawl) against the record of actual crawls. The discrepancy between what should be crawled and what is actually being crawled points to internal linking or architecture issues.
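A minimal sketch of that cross-reference, assuming a file important_urls.txt with one URL path per line (exported from the sitemap or from a crawl):

import re

# Paths Googlebot actually requested (query strings stripped)
crawled = set()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' in line:
            m = re.search(r'"(?:GET|HEAD) (\S+) ', line)
            if m:
                crawled.add(m.group(1).split('?')[0])

# Paths that should be getting crawled
with open('important_urls.txt', encoding='utf-8') as f:
    important = {line.strip() for line in f if line.strip()}

never_crawled = sorted(important - crawled)
print(f'{len(never_crawled)} important URLs with zero Googlebot visits in this log period')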
PJ Howland, VP of Industry Insights at 97th Floor, places it in the correct context: “Crawlability is the foundation of any technical SEO rollout. Without crawlability sites won’t get indexed. Without getting indexed, they won’t rank.” Logs are the only place where you can directly verify whether that foundation is working.
5xx errors that do not appear in GSC
Server errors (500, 502, 503) that occur during Googlebot’s crawl only appear in GSC’s Coverage report if they are persistent. A transient 503 error lasting 2 minutes can coincide exactly with a Googlebot visit, and that crawl is recorded as failed in the logs without leaving a trace in GSC. If logs show a pattern of 5xx errors at specific times (load spikes, deployments), there is an availability problem affecting the crawl.
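A minimal sketch that surfaces the pattern by bucketing Googlebot 5xx responses per hour, so spikes can be lined up against deployments or traffic peaks:

import re
from collections import Counter

hour_pattern = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2})')  # day/month/year:hour
errors_by_hour = Counter()
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if 'Googlebot' not in line:
            continue
        if not re.search(r'" 5\d{2} ', line):  # status code sits right after the request field
            continue
        m = hour_pattern.search(line)
        if m:
            errors_by_hour[m.group(1)] += 1

for hour, count in sorted(errors_by_hour.items(), key=lambda item: -item[1])[:10]:
    print(hour, count)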
Redirect chains consuming budget
A URL that returns a 301 pointing to another URL that returns another 301 before reaching the final destination. Googlebot follows redirects, but each hop consumes crawl time. Logs allow you to identify which URLs return redirects and how many hops there are to the final URL. Google’s recommendation is that redirects be direct (a single 301 to the final destination).
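A minimal sketch that measures those chains, using the requests library on URLs the logs flagged as 3xx for Googlebot (the domain and input file are placeholders):

import requests

BASE = 'https://www.example.com'  # placeholder: your own domain

# redirecting_paths.txt: paths that returned 301/302 to Googlebot, extracted from the logs
with open('redirecting_paths.txt', encoding='utf-8') as f:
    paths = [line.strip() for line in f if line.strip()]

for path in paths:
    resp = requests.get(BASE + path, allow_redirects=True, timeout=10)
    hops = len(resp.history)  # each entry in history is one intermediate redirect
    if hops > 1:
        chain = ' -> '.join([r.url for r in resp.history] + [resp.url])
        print(f'{hops} hops: {chain}')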
The OnCrawl case: +37% sessions through log analysis
The most documented case study of SEO improvement through log analysis comes from OnCrawl’s work with an e-commerce client selling high-value products with fast turnover (pages were removed after sales).
Log analysis revealed three simultaneous problems:
- Duplicate subfolders with outdated content receiving frequent Googlebot crawling
- 404 error pages from sold products being crawled repeatedly
- Low-value subfolders receiving more crawl budget than main categories
The implemented solutions were: deletion and 301 redirection of redundant subfolders, strategic internal links from low-priority to main category pages, sitemap reorganisation to prioritise critical URLs, and canonical and meta robots review on high-priority pages.
The documented result was a 37% increase in sessions and 22% in transactions after implementation. The root cause — crawl budget waste — was not visible from GSC or from a conventional crawler. Only the logs showed where Googlebot’s time was actually going.
Tools for SEO log analysis
Screaming Frog Log Analyzer
The most straightforward option for teams already using Screaming Frog SEO Spider. The Log Analyzer accepts files in Apache, Nginx and W3C Extended Log Format. The workflow is: export the logs from the server (or request them from your host), load the file into the tool, filter by verified Googlebot.
Screaming Frog Log Analyzer allows segmentation by: most and least frequently crawled URLs, response code distribution for Googlebot, and comparison between Log Analyzer and SEO Spider crawl data in the same interface. The advantage is direct integration with the Spider for cross-referencing crawlability data and log data.
The limit is volume: for very large sites, processing log files of several GB can be slow. For those cases, OnCrawl or Botify are more appropriate.
OnCrawl
OnCrawl is the reference platform for technical SEO analysis at enterprise scale. Unlike Screaming Frog, OnCrawl allows direct server integration to receive logs continuously (not just point-in-time files). It combines log data with its own crawl data and Google Search Console data in a single dashboard.
The most valuable functionality for crawl diagnostics is the correlation between crawl frequency and page performance: OnCrawl automatically cross-references which pages have the most organic traffic with which pages receive the most Googlebot visits. Discrepancies (high-traffic pages with little crawling, or no-traffic pages with frequent crawling) are the highest-priority alerts.
Seolyzer
Seolyzer is oriented towards medium-sized sites and its differentiator is real-time log error detection. No file downloads required: a snippet is installed on the server that sends logs directly to the platform. It automatically identifies error patterns, bot crawls, and generates alerts when it detects anomalies (spike in 404s, drop in Googlebot crawl frequency).
Python analysis for large volumes
For log files exceeding several GB, GUI tools have performance limitations. Analysing them with Python and pandas allows processing millions of lines in minutes; the example below parses the Combined Log Format with a plain regular expression (dedicated parsers such as apache-log-parser are an alternative).
The basic Python workflow:
import pandas as pd
import re
# Regex for Combined Log Format
log_pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) (\S+) "([^"]*)" "([^"]*)"'
rows = []
with open('access.log', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        match = re.match(log_pattern, line)
        if match:
            rows.append({
                'ip': match.group(1),
                'datetime': match.group(2),
                'method': match.group(3),
                'url': match.group(4),
                'status': int(match.group(5)),
                'user_agent': match.group(8)
            })
df = pd.DataFrame(rows)
# Filter Googlebot only
googlebot = df[df['user_agent'].str.contains('Googlebot', case=False, na=False)]
# Most crawled URLs by Googlebot
top_crawled = googlebot['url'].value_counts().head(50)
print(top_crawled)
# Response code distribution for Googlebot
print(googlebot['status'].value_counts())
This script processes a 2GB log file in under 2 minutes on a standard computer. From there, more complex segmentations can be built: grouping URLs by section (using regex on the url column), calculating average crawl frequency per page, or identifying 404 URLs that Googlebot keeps visiting.
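Continuing from the googlebot DataFrame above, a minimal sketch of those segmentations; the section regex is an assumption and should be adapted to your own URL structure:

# Group Googlebot hits by top-level URL section, e.g. /category/shoes/ -> "category"
googlebot = googlebot.copy()
googlebot['section'] = googlebot['url'].str.extract(r'^/([^/?]+)', expand=False).fillna('root')
print(googlebot.groupby('section').size().sort_values(ascending=False).head(20))

# 404 URLs that Googlebot keeps revisiting
repeat_404 = googlebot[googlebot['status'] == 404]['url'].value_counts()
print(repeat_404[repeat_404 > 5].head(20))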
What to do with what you find in the logs
Log analysis does not end at diagnosis. Each problematic pattern has a concrete action:
Low-value URLs crawled at high frequency: Add Disallow directives in robots.txt for sections with no SEO value (filter parameter URLs, internal search URLs, session URLs); a quick way to verify those rules is sketched after this list. For URLs that must remain accessible but should not appear in the index, use a noindex meta robots tag or a canonical to the preferred version, and keep those URLs out of robots.txt so Google can actually see the directive.
Important pages with insufficient crawling: Review internal linking to those URLs. A page without sufficient internal links receives little PageRank and therefore less Googlebot interest. Add those URLs to sitemap.xml with high priority. Verify they are not accidentally blocked in robots.txt.
404 errors crawled repeatedly: For deleted pages that had traffic or backlinks, implement a 301 redirect to the most relevant available content. For URLs that should never have existed (junk parameters, script-generated URLs), block in robots.txt.
5xx errors during crawling: Investigate load spikes coinciding with 5xx errors. If the server genuinely cannot handle Googlebot's crawl rate, serving temporary 503 or 429 responses makes Googlebot slow down on its own, although Google recommends this only as a short-term measure; the crawl rate limiter that used to sit in Search Console settings was retired in early 2024.
Redirect chains: Update source URLs to point directly to the final destination. Each intermediate 301 can be eliminated by updating the CMS or internal systems that generate the links.
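For the robots.txt piece, the rules can be checked programmatically before and after deployment. A minimal sketch using urllib.robotparser from the standard library; the domain and example paths are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.example.com/robots.txt')  # placeholder domain
rp.read()

should_be_blocked = ['/shoes/?colour=red&size=42', '/search?q=test']      # low-value patterns
must_stay_crawlable = ['/category/shoes/', '/product/example-product/']  # priority pages

for path in should_be_blocked:
    if rp.can_fetch('Googlebot', 'https://www.example.com' + path):
        print(f'STILL CRAWLABLE: {path}')

for path in must_stay_crawlable:
    if not rp.can_fetch('Googlebot', 'https://www.example.com' + path):
        print(f'BLOCKED BY MISTAKE: {path}')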
Googlebot’s crawl frequency is an indirect signal of the quality perception Google has of a site. A fast site with frequently updated content and strong internal architecture receives more frequent visits. A site with many errors or outdated content sees Googlebot reduce its cadence. Logs record that pulse objectively, without interpretations or sampling.
If you want to know exactly how Googlebot is crawling your site and where crawl budget is being wasted, log analysis is part of every technical SEO audit we carry out. Tell us about your case.
Frequently Asked Questions
How often do you publish new content?
We publish new articles weekly, focused on the latest technical SEO trends, real case studies and best practices. Subscribe to our newsletter so you don't miss any updates.
Are the tips applicable to any type of website?
Our advice adapts to different types of sites: ecommerce, blogs, corporate sites and web applications. We always indicate when a technique is specific to a certain type of site or set of technical requirements.
Can I implement these techniques myself?
You can implement many basic techniques yourself by following our step-by-step guides. For advanced optimisations or full audits, we recommend consulting technical SEO specialists such as our team.
Do you offer personalised consulting services?
Yes, we offer personalised technical SEO consulting, full audits and end-to-end optimisation. Contact us to discuss your project's specific needs and how we can help.