Duplicate Content: How to Detect and Fix It

What is duplicate content in SEO and how does it affect rankings?

Duplicate content occurs when the same text — or very similar text — appears on two or more different URLs of your site (or external sites). Google cannot determine which version to show in search results, so it splits authority across all affected URLs and may choose not to index any of them. The result is a silent PageRank dilution that reduces the ranking potential of the duplicate pages without triggering any error message in Google Search Console.

Key takeaways

Technical duplicate content (URL parameters, HTTP/HTTPS versions, www/non-www, pagination) can automatically generate hundreds or thousands of duplicate URLs with no editorial intervention — source: Google Search Central.
The canonical tag is the correct solution for keeping URLs accessible to users while consolidating SEO signals into a single version, but Google treats it as a strong hint, not an absolute directive — source: Google Search Central.
An ecommerce site with 5 combinable filters on a catalogue of 1,000 products can generate more than 32,000 URL variants if parameter indexation is not properly managed — based on standard combinatorics.
Screaming Frog SEO Spider detects duplicate content at title, H1, meta description and full content hash level, without needing to export data manually — source: Screaming Frog blog.
Duplicate content and keyword cannibalisation are distinct problems: the former is identical or near-identical text on multiple URLs; the latter is different pages competing for the same search intent — source: Moz.

There is an SEO problem that does not appear in any error report and that Google does not flag in Search Console: the technical duplicate content your own CMS or infrastructure generates automatically. An ecommerce site without specific configuration — a shop with colour and size filters — can turn ten product pages into several hundred indexable URLs within weeks. All with the same content. All silently competing for the same authority.

Technical duplicate content — identical or near-identical text distributed across multiple URLs — is one of the most frequent and least visible technical SEO problems. Unlike a 404 error or an overly long title, it triggers no alert. It simply dilutes your PageRank gradually until none of the affected URLs reaches its full ranking potential.

This guide differentiates duplicate content from keyword cannibalisation (they are distinct problems), explains the most common technical origins, and establishes a concrete action plan for resolving it without losing rankings in the process.

Duplicate Content vs Cannibalisation: The Critical Difference Everyone Confuses

Before getting into solutions, it is necessary to separate two concepts that are frequently conflated even in industry reference sources.

Duplicate content is identical or substantially similar text appearing on two or more URLs. The problem is purely one of technical signal: Google receives the same content from multiple addresses, cannot determine which to show as the canonical version, and splits authority across all of them. There is no difference of editorial intent between the URLs — the same content simply exists in multiple places.

Keyword cannibalisation is a search intent problem. Two pages with different content compete to rank for the same query because Google interprets them as equivalent answers to the same intent. The content may be entirely original on each page — the problem is not the text, but the overlap of relevance signals for the same query.

A site can have one without the other. A blog with a print version of its articles — /article/ and /article/print/ — has duplicate content but not cannibalisation if the print version has a correct canonical. A blog with two distinct guides on “SEO tools” has cannibalisation but not duplicate content if each guide contains original material.

Confusing the two problems leads to applying the wrong solutions. Canonicals resolve technical duplicate content; they do not resolve cannibalisation between pages with different content. 301 redirects can resolve cannibalisation; applying them to necessary technical versions (print, UTM) breaks functionality unnecessarily. Correctly identifying the type of problem is the first diagnostic step.

Types of Duplicate Content: Internal, External and Technical

There are three main categories of duplicate content, each with different causes and solutions.

Internal Duplicate

The same content exists on multiple URLs within the same domain. This is the most common type and the one with the greatest SEO impact because it directly divides the authority of the domain itself. The most frequent causes:

Protocol versions: http://domain.com/page/ and https://domain.com/page/ are different URLs for Google if there is no 301 redirect from HTTP to HTTPS correctly configured.
WWW and non-WWW versions: www.domain.com/page/ and domain.com/page/ are two distinct URLs. Without a canonical or redirect consolidating both versions, Google autonomously chooses which to index — and may choose the wrong one.
Trailing slash: /page and /page/ are technically different. In most modern frameworks and CMSs they are configured as equivalent, but in older or custom configurations they can be independent.
Session or tracking parameters: URLs generated by the system with parameters such as ?sessionid=, ?ref=, ?source=. Each variant is a new URL for Google’s crawler.

External Duplicate

The same content appears on different domains. Causes include:

Syndication without canonical: An article published on the original site and replicated on syndication platforms (Medium, LinkedIn Pulse, affiliate sites) without the copy indicating a canonical pointing to the original.
Scraping: Other sites copy your content without permission. Google usually identifies the original correctly by first-indexation dates, but on domains with higher authority than yours it may index the copy instead of the original.
Manufacturer content in ecommerce: Product sheets using the manufacturer’s literal descriptions, also used by other distributors. Dozens of sites publish exactly the same text for the same product.

Technical Duplicate

Generated by site architecture, not editorial decisions. The most underestimated in terms of volume because it scales automatically:

Pagination: /blog/, /blog/page/2/, /blog/page/3/. Paginated pages from the second onwards share the same H1 and structure as the first, with different content but identical metadata.
Filter and sort parameters: The most severe case in ecommerce. A category with colour, size, price and brand filters can generate thousands of URL combinations with the same base content.
Print versions: /article/?print=1 or /print/article/ with the same content as the original version.
WordPress date archives: Date archive pages (/2024/03/, /2024/03/15/) and category archives can duplicate content from individual posts.

How to Detect Duplicate Content With the Right Tools

Effective diagnosis combines crawl data (what the bot sees) with indexation data (what Google has in its index).

Google Search Console: The Mandatory Starting Point

In GSC, go to Indexing > Pages. The panel shows the indexation status of all detected URLs. The most relevant statuses for detecting duplicates:

“Duplicate page, Google chose different canonical than user”: You have a canonical declared but Google is not respecting it. This typically occurs because it considers the pages have sufficient differences not to be duplicates, or because there are contradictory signals (for example, the non-canonical URL has more backlinks).
“Duplicate without user-selected canonical”: Google has detected pages with near-identical content and has autonomously chosen which to index. If the URL it chose is not the one you want as canonical, you need to implement an explicit canonical.
“Crawled, currently not indexed”: Not necessarily duplicates, but many times these pages are excluded precisely because Google has preferred to index another version.

Screaming Frog: Full Crawl With Duplicate Detection

Screaming Frog SEO Spider is the industry standard for detecting duplicate content at a technical level. The process:

Crawl the complete site (Configuration > Spider > All).
Go to the Content tab and filter by “Duplicate”.
The report shows groups of URLs with the same content hash (exact duplicate) or a high similarity percentage (partial duplicate).
Export the report and group by hash to see how many URLs share exactly the same content.

The duplicate Page Titles and duplicate H1 reports are especially revealing: two pages with the exact same H1 are almost always a case of duplicate content or cannibalisation.

Screaming Frog’s free version limits crawling to 500 URLs. The paid version (£259/year) allows unlimited crawls with full-hash-based content similarity detection — essential for sites with thousands of pages.

Semrush Site Audit and Ahrefs Site Audit

Semrush Site Audit includes a dedicated duplicate content module in the Issues > Warnings section. It automatically detects:

Pages with the same title.
Pages with the same meta description.
Pages with internal duplicate content above a configurable threshold.

Ahrefs Site Audit performs the same analysis in Content Quality > Duplicate Content. One advantage of Ahrefs is that it cross-references duplicate content data with backlink data: you can immediately see which duplicate URLs have external links, which is crucial for deciding which version to consolidate.

Siteliner: Internal Duplicate Detection in One Click

Siteliner (siteliner.com) is a free tool specialised in detecting internal duplicate content. Enter your domain and the system crawls up to 250 pages in the free version, showing the percentage of shared content between pages. It is useful as a quick diagnostic before a deeper audit with Screaming Frog.

URL Parameters: The Most Underestimated Source of Duplication

URL parameters are the most frequent cause of unintentional large-scale duplicate content, especially in ecommerce and sites with internal search.

When a product catalogue has combinable filters — for example, colour, size, price, brand and rating — each possible combination generates a different URL. A catalogue with 5 binary filters can generate up to 2⁵ = 32 URL variants for the same category. With multi-value filters, the number grows exponentially.

The problem is not only content duplication but crawl budget consumption. As Google Search Central’s official documentation on URL parameters states: “Parameters can create URLs that show duplicate content or that vary only slightly from the content of other URLs.” Googlebot will crawl all these variants, consuming crawl budget that could be spent on pages with unique content.

The most effective solutions for URL parameters:

Canonical on each parametrised URL: Each URL with parameters includes a <link rel="canonical"> pointing to the clean category URL. This is the standard solution for sites where filters are functional for users but should not generate independent URLs in the index.

<!-- On /products/trainers/?colour=red&size=42 -->
<link rel="canonical" href="https://myshop.com/products/trainers/" />

Rely on Google’s automatic parameter handling: Google retired the URL Parameters tool in Google Search Console in April 2022. The current recommended approach is to use canonical tags to consolidate parameter variants, block unwanted parameter patterns in robots.txt, use hreflang for locale variants, and rely on Google’s automatic parameter handling for parameters that do not affect page content. The canonical tag remains the most reliable mechanism for signalling the preferred URL to Google.

Robots.txt for session parameters: Session or tracking parameters that have no value for users or crawling can be blocked in robots.txt. Note: blocking in robots.txt does not deindex already-indexed URLs — it only prevents future crawling.

Internal site search results pages: URLs generated by the site’s internal search (/search/?q=red+trainers) should be systematically blocked in robots.txt or with a noindex meta robots tag. Google has explicitly stated that indexing internal search result pages is not a recommended practice.

The Canonical Tag: When to Use It and When It Is Not Enough

The canonical tag (<link rel="canonical">) is the primary mechanism for consolidating SEO signals when multiple versions of the same URL exist. But it has important limitations that are frequently overlooked.

How it works: The non-canonical page includes in its <head> a canonical pointing to the version Google should index. Google consolidates the signals of both pages (backlinks, interaction data) into the canonical URL. The non-canonical URL remains accessible to users.

<!-- Print version of an article -->
<head>
  <link rel="canonical" href="https://ighenatt.es/en/blog/my-article/" />
</head>

<!-- UTM-parametrised version -->
<head>
  <link rel="canonical" href="https://ighenatt.es/en/landing/my-landing/" />
</head>

When the canonical is the correct solution:

Print versions of articles that must remain accessible.
URLs with UTM parameters for campaign tracking.
Product pages with attribute parameters (colour, size) in ecommerce.
HTTP versions of pages when a 301 redirect is not temporarily available.
Sites serving the same content on multiple domains (e.g. regional versions with the same content).

When the canonical is NOT enough:

As John Mueller, Search Advocate at Google, explained in a Google Search Central Live session: “The canonical is a strong hint, not a directive. Google may decide not to respect it if it has reasons to choose another URL.” This occurs especially when the non-canonical URL has significantly more backlinks than the declared canonical, or when the content of both pages has substantial differences that Google interprets as different pages.

For truly duplicated content where no URL needs to be independently accessible, a 301 redirect is always cleaner and more effective: it is a directive, not a suggestion, and permanently removes the duplicate URL from the index.

Self-referencing canonical: Every page should have a canonical pointing to itself. This self-referential canonical confirms to Google that this URL is the canonical version and avoids ambiguity when URL parameters could be interpreted as variants.

<!-- On /blog/my-article/ -->
<head>
  <link rel="canonical" href="https://ighenatt.es/en/blog/my-article/" />
</head>

Duplicate Content in Ecommerce: The Most Complex Case

Ecommerce is the scenario where duplicate content reaches its greatest scale and where the business impact is most direct. A large catalogue without proper duplicate management can have more duplicate URLs than unique pages.

Manufacturer Product Descriptions

The most widespread problem among distributors and retailers: using the manufacturer’s literal descriptions. When twenty distributors publish exactly the same description for the same product, Google indexes one of them — typically the manufacturer’s or the retailer with the most authority — and ignores the rest.

Writing original descriptions for every product is the optimal solution but not always scalable. For catalogues with thousands of SKUs, the most practical strategy is to prioritise: write original content for the highest-traffic or highest-margin products, and implement templating with text variations for the rest. Even small variations in the first paragraph and in feature bullet points significantly reduce the percentage of identical text.

Product Pages With Colour and Size Variants

The most common pattern in fashion and technology: /trainers-model-x/, /trainers-model-x/?colour=red, /trainers-model-x-red/. Three URLs, same product, same description with minimal variations.

The correct architecture depends on the search volume per variant. If “trainers model x red” has relevant search volume, a dedicated URL with specific content may be justified. If volume is marginal, the variant should be an option on the main product sheet with a canonical pointing to the base model URL.

Category Pagination

/shoes/, /shoes/?page=2, /shoes/?page=3. Paginated pages should not be independently indexed: the H1 is the same, the meta description is identical, and the content of each page is simply a subset of the total listing.

The standard solution is a canonical from all paginated pages to the first page of the category. Google can crawl paginated pages to discover products, but the indexation signal consolidates on the first page URL.

According to data published by Semrush in their analysis of the most common SEO errors in ecommerce (2024), pagination without canonical management is the third most frequent technical problem in online shops, present in 47% of audited sites with more than 1,000 pages.

Action Plan for Resolving Duplicate Content Without Losing Rankings

The order of intervention is as important as the technical solutions. Acting in the wrong order can consolidate authority on the wrong URL.

Step 1: Full Inventory Before Any Change

Before implementing a single canonical tag or redirect, you need a complete map of the problem. Crawl the site with Screaming Frog and export:

All URLs with the same content hash (exact duplicates).
All URLs with duplicate titles.
All URLs with duplicate meta descriptions.
All URLs with more than 80% similar content.

For each group of duplicates, identify: how many URLs are in the group, which has the most external backlinks (using Ahrefs or Semrush), which has the most historical traffic (using GSC), and which is currently declared as canonical in the code.

Step 2: Choose the Winning URL Per Group

For each group of duplicates, choose the winning URL following this priority:

The URL with the most unique referring domains.
If tied, the URL with the most traffic in the last 12 months according to GSC.
If still tied, the shorter and semantically cleaner URL.

Do not choose the winning URL on aesthetic or ideal structure grounds. The URL with the most accumulated authority must always be the destination — you can adjust the structure after consolidation with an additional redirect if necessary.

Step 3: Implement in This Order

Self-referencing canonical on the winning URL: First, ensure the URL you want to keep has a canonical pointing to itself.
Canonical on non-canonical URLs (if they must remain accessible).
301 redirect from duplicate URLs that do not need to be accessible.
Update internal links: Change all internal links pointing to duplicate URLs so they point directly to the winning URL. This step is omitted by almost everyone and is the one that creates unnecessary redirect chains.
Update the sitemap: Include only canonical URLs. Never include in the sitemap URLs with a canonical declared to another address.

Step 4: Monitor in GSC for 4–8 Weeks

After implementing the changes, monitor the Indexing > Pages report in GSC. The “Duplicate without user-selected canonical” status should progressively decrease. The re-crawl and re-indexation process can take between 2 and 8 weeks depending on the site’s crawl budget and how frequently Google visits it.

If after 8 weeks Google is still ignoring your canonicals for a group of URLs, check:

Do the non-canonical URLs have more backlinks than the canonical? You may need a 301 redirect instead of a canonical.
Is there enough content difference between the URLs for Google not to consider them duplicates? Canonicals only work when Google determines the content is substantially similar.

The guide on 301 and 302 redirects details the technical workings of redirects and the most costly implementation mistakes — especially relevant for the URL consolidation step.

If the duplicate content diagnosis reveals broader architectural problems — URL parameters proliferating, unmanaged pagination, unresolved HTTP/HTTPS versions — you are likely facing a systemic issue that requires a full technical audit. At Ighenatt, we audit duplicate content management as part of the technical SEO diagnosis process: we identify the real volume of the problem, prioritise interventions by potential impact, and define the correct implementation plan to avoid breaking anything in the process.

Sources and references

Consolidate duplicate URLs — Google Search Central (developers.google.com)
Duplicate Content — Google Search Central (developers.google.com)
Duplicate Content Guide — Moz (moz.com)
How to Find and Fix Duplicate Content — Screaming Frog (screamingfrog.co.uk)
Duplicate Content: Causes, Effects and Solutions — Semrush (semrush.com)
How to Handle URL Parameters — Google Search Central (developers.google.com)
Duplicate Content in a Post-Panda World — Ahrefs (ahrefs.com)

Share this article

If you found this content useful, share it with your colleagues.

Twitter LinkedIn

Back to

All articles

Subscribe to

Our newsletter

Frequently Asked Questions

Does Google penalise duplicate content?

Google does not automatically apply a manual penalty for duplicate content. What it does is attempt to select the most relevant version to index and display, ignoring the others. The damage is not a penalty but a dilution: authority splits across multiple URLs, none accumulates enough signal, and none reaches its full ranking potential. In extreme cases — content scraped from other sites with spammy intent — a manual action can be triggered.

What is the difference between duplicate content and keyword cannibalisation?

They are distinct problems. Duplicate content is identical or near-identical text on multiple URLs — it can be technical (HTTP/HTTPS versions, URL parameters) or editorial (the same article published under two different paths). Keyword cannibalisation is when two pages with different content compete for the same search intent. A site can have cannibalisation without duplicate content (two distinct guides on the same topic) or duplicate content without cannibalisation (print versions with a correct canonical).

How do I know which URL Google has chosen as the canonical version?

In Google Search Console, go to Indexing > Pages and filter by 'Duplicate page, Google chose different canonical than user'. Google shows which URL it chose as canonical and which it ignored. You can also use the URL Inspection tool on a specific URL: the 'Google-selected canonical' field shows which version Google has in its index — which may differ from your declared canonical if Google does not accept it.

Does the ?page=1 parameter create duplicate content with the root URL?

Yes, in many cases. If /category/ and /category/?page=1 serve exactly the same content, Google treats them as duplicates and must choose which to index. The cleanest solution is for pagination to start at ?page=2 (the first page without a parameter) and for paginated pages to carry a canonical pointing to /category/.

How does duplicate content affect crawl budget?

Duplicate content consumes crawl budget without generating value. Googlebot crawls duplicate URLs — with parameters, www and non-www versions, HTTP/HTTPS protocols — spending the same crawl allowance as it would on unique content pages. For large sites (over 10,000 pages), this unnecessary consumption can cause important pages to take weeks to be crawled and updated in the index.

Duplicate Content: How to Detect and Fix It - Ighenatt Blog

What is duplicate content in SEO and how does it affect rankings?

Key takeaways

Duplicate Content vs Cannibalisation: The Critical Difference Everyone Confuses

Types of Duplicate Content: Internal, External and Technical

Internal Duplicate

External Duplicate

Technical Duplicate

How to Detect Duplicate Content With the Right Tools

Google Search Console: The Mandatory Starting Point

Screaming Frog: Full Crawl With Duplicate Detection

Semrush Site Audit and Ahrefs Site Audit

Siteliner: Internal Duplicate Detection in One Click

URL Parameters: The Most Underestimated Source of Duplication

The Canonical Tag: When to Use It and When It Is Not Enough

Duplicate Content in Ecommerce: The Most Complex Case

Manufacturer Product Descriptions

Product Pages With Colour and Size Variants

Action Plan for Resolving Duplicate Content Without Losing Rankings

Step 1: Full Inventory Before Any Change

Step 2: Choose the Winning URL Per Group

Step 3: Implement in This Order

Step 4: Monitor in GSC for 4–8 Weeks

Sources and references

Share this article

Frequently Asked Questions

Related Posts

Technical SEO Observability for CI/CD Releases

XML Sitemaps: technical guide for Google indexing

Robots.txt: errors that block Googlebot without you knowing

What is duplicate content in SEO and how does it affect rankings?

Key takeaways

Duplicate Content vs Cannibalisation: The Critical Difference Everyone Confuses

Types of Duplicate Content: Internal, External and Technical

Internal Duplicate

External Duplicate

Technical Duplicate

How to Detect Duplicate Content With the Right Tools

Google Search Console: The Mandatory Starting Point

Screaming Frog: Full Crawl With Duplicate Detection

Semrush Site Audit and Ahrefs Site Audit

Siteliner: Internal Duplicate Detection in One Click

URL Parameters: The Most Underestimated Source of Duplication

The Canonical Tag: When to Use It and When It Is Not Enough

Duplicate Content in Ecommerce: The Most Complex Case

Manufacturer Product Descriptions

Product Pages With Colour and Size Variants

Category Pagination

Action Plan for Resolving Duplicate Content Without Losing Rankings

Step 1: Full Inventory Before Any Change

Step 2: Choose the Winning URL Per Group

Step 3: Implement in This Order

Step 4: Monitor in GSC for 4–8 Weeks

Sources and references

Share this article

Frequently Asked Questions

Stay updated

Related Posts

Technical SEO Observability for CI/CD Releases

XML Sitemaps: technical guide for Google indexing

Robots.txt: errors that block Googlebot without you knowing