This project applies contextualized language representations, produced by a fine-tuned transformer, to analyze page-level semantic similarity for SEO insights. By representing words as context-sensitive embeddings, the approach captures nuanced meanings beyond simple term matching, offering a deeper understanding of how different web pages relate to one another.
The system computes similarity scores between various types of web pages — such as homepages, service pages, contact pages, and blog articles — using context-aware embeddings. This allows for detection of:
- Semantic overlap or redundancy across internal site pages
- Misaligned or off-topic content clusters
- Repetition between newly published and existing blog content
- Content duplication or similarity across competitor domains
By leveraging transformer-based language models that understand word usage in real-world contexts, this project showcases how contextualized embeddings can drive more accurate and actionable SEO diagnostics.
Purpose
The purpose of this project is to apply contextualized language representations using a fine-tuned transformer model to analyze page-level semantic similarity for SEO insights. By leveraging context-sensitive word embeddings, the model identifies nuanced relationships between different pages, such as homepages, service pages, and blog articles.
The project aims to uncover:
- Semantic overlap and redundancy within site pages
- Misaligned content clusters that deviate from the website’s focus
- Competitor content comparisons for identifying optimization opportunities
By capturing how words function in context, this approach provides deeper SEO insights that go beyond traditional keyword-based methods.
What Are Contextualized Language Representations?
This refers to the way modern language models understand words not just by their dictionary definition, but by considering the context in which they appear. For example, the word “charge” means something entirely different in “credit card charge” vs. “charge the battery.” Traditional models treat these as the same word, but contextualized models — like the one used in this project — can tell the difference by analyzing surrounding words and sentence structure.
Contextualized representations help the model understand how meaning shifts depending on usage. This makes it possible to measure not just whether two pages use similar keywords, but whether they communicate similar ideas — even if the exact words differ.
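As a rough, hypothetical illustration (the sentences and scores below are not from the project), a sentence-embedding model can score the two uses of "charge" directly; the pair scores low despite sharing the word, because the surrounding context differs:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Same surface word ("charge"), two different meanings in context
a = model.encode("There is an unexpected charge on my credit card.", convert_to_tensor=True)
b = model.encode("Remember to charge the battery overnight.", convert_to_tensor=True)

# Contextual embeddings keep these apart; simple keyword matching would not.
print(util.cos_sim(a, b).item())  # typically a low score despite the shared word
```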
What is Embedding?
“Embedding” is a technical term for converting words, sentences, or entire web pages into numerical formats (called vectors) that the model can compare. These embeddings reflect both the meaning of the words and the context in which they’re used.
In this project, embedding context-sensitive representations means that each page is analyzed in terms of its true meaning — not just word matching. For instance, if one page talks about “search visibility strategies” and another mentions “SEO ranking improvements,” their embeddings may still be similar because the topics are conceptually close.
These embeddings are generated using a pre-trained transformer model, fine-tuned for high accuracy in detecting semantic similarity — a core part of advanced SEO diagnostics.
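As a small illustration of that point (a sketch using the project's library; the exact score will vary), the two phrases above share no keywords, yet their embeddings land close together:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb1 = model.encode("search visibility strategies", convert_to_tensor=True)
emb2 = model.encode("SEO ranking improvements", convert_to_tensor=True)

# No shared keywords, but the topics are conceptually close,
# so the score is well above that of unrelated text.
print(util.cos_sim(emb1, emb2).item())
```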
What is Semantic Similarity?
Semantic similarity refers to how closely two pieces of text (such as web pages) relate to one another in terms of meaning. It goes beyond just matching words; it considers the underlying concepts and how they are expressed in different contexts. This is crucial for SEO, as pages with similar content could lead to duplication issues or missed optimization opportunities.
In what ways does this project contribute to SEO performance?
This project enables deeper insights into how different web pages relate to each other based on their actual meaning. Instead of only analyzing keywords, it considers how words are used in context, helping identify overlapping topics, redundant sections, and content that may not be clearly aligned with the rest of the website. This leads to more focused content, better keyword distribution, and reduced internal competition—key factors that influence how search engines rank pages.
Why is understanding page-level content similarity important for a website?
When multiple pages cover similar ground without intentional differentiation, search engines may struggle to decide which one to rank—this is known as keyword cannibalization. Content similarity analysis helps identify which pages are too alike or too different in the wrong contexts. This insight can guide restructuring efforts, such as merging similar content or separating topics more clearly, improving both discoverability and user navigation.
The result is a more coherent site that’s easier for both users and search engines to understand.
How can this approach highlight missed opportunities in content?
By comparing one page to another, or one page to a competitor’s, the system identifies pairs with low semantic similarity. When two pages are expected to be related but aren’t, it may point to a gap in topical coverage or off-target messaging. For example, if a homepage shares little meaning with the core service pages, it could signal weak internal linking or unclear positioning. These gaps are useful for guiding new content development.
Can it identify pages that are off-topic or don’t fit well within the site?
Yes. If a page shows unusually low similarity when compared to others in the same section (e.g., among all service pages), it may be off-topic or poorly integrated into the site’s content strategy. Such detection is especially useful during audits or content restructuring, ensuring all pages support the site’s core themes and user journey.
Is this limited to analyzing only internal pages?
No. This system supports both internal comparisons (within a single site) and external comparisons (with competitor content). This allows a business to:
- Benchmark its content coverage against others in the industry
- Identify unique strengths or blind spots
- Strategize new content based on what competitors are doing differently
This competitive angle enhances decision-making in content planning and SEO campaigns.
How are these comparisons made technically?
Each web page is converted into a contextual embedding—a numerical representation of its overall meaning—using a fine-tuned transformer model (all-MiniLM-L6-v2). These embeddings are then compared using cosine similarity, resulting in a score between 0 (completely different) and 1 (identical in meaning). This allows for objective, scalable, and fast content comparison.
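A condensed sketch of that pipeline, assuming the page text has already been extracted and cleaned (the two page_text variables below are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder page text; in practice this comes from the extraction step
page_text_a = "Our technical SEO services cover crawlability and indexing."
page_text_b = "We improve how search engines crawl and index your website."

emb_a = model.encode([page_text_a], convert_to_tensor=True)
emb_b = model.encode([page_text_b], convert_to_tensor=True)

score = util.cos_sim(emb_a, emb_b).item()  # 0 = unrelated, 1 = identical meaning
print(round(score, 4))
```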
What outcomes can a website owner expect from this analysis?
Website owners can use this system to:
- Discover which service pages or blogs are repeating information unnecessarily
- Identify underutilized areas of the site where fresh content can be added
- Streamline site structure by eliminating or merging overlapping pages
- Compare new content ideas against existing material to avoid duplication
- Align page themes better with searcher intent for stronger ranking signals
Libraries Used
sentence_transformers
sentence_transformers is a Python library built on top of Hugging Face Transformers. It specializes in producing semantically meaningful sentence and paragraph embeddings using transformer-based models like BERT or MiniLM.
It is used to load the all-MiniLM-L6-v2 model, which generates context-aware embeddings of webpage content. These embeddings are then compared to compute semantic similarity scores between different pages.
requests
requests is a simple yet powerful library for making HTTP requests in Python. It allows downloading web content such as HTML pages from specified URLs.
It is used to fetch the raw HTML content of the target webpages (both internal and external) so that the content can be analyzed for semantic similarity.
BeautifulSoup (from bs4)
BeautifulSoup is a Python library for parsing HTML and XML documents. It makes it easy to extract data from web pages in a structured format.
It helps extract meaningful content (like headings, paragraphs, and lists) from HTML pages, removing layout elements and boilerplate so that only relevant text is processed.
numpy
numpy is a fundamental library for numerical computing in Python, widely used for operations involving arrays, matrices, and linear algebra.
It is used to compute cosine similarity between embedding vectors, which is essential for measuring how closely two web pages relate in terms of content.
re (Regular Expressions)
The re module is Python’s built-in support for regular expressions, which are used for advanced pattern matching and text manipulation.
It helps clean and preprocess raw text data by removing unwanted characters, HTML artifacts, and irrelevant symbols before the content is embedded and analyzed.
csv
The csv module is a standard Python library for reading and writing data in CSV (Comma Separated Values) format, often used for tabular data.
It is used to store the output results—such as similarity scores—in a structured file format that can be opened in spreadsheet software for client reporting or further analysis.
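A minimal sketch of that export step (the column names and file name here are illustrative, not taken from the project code):

```python
import csv

# Illustrative results: (page A, page B, similarity score)
results = [
    ("https://example.com/", "https://example.com/services", 0.52),
    ("https://example.com/", "https://example.com/contact", 0.28),
]

with open("similarity_scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["page_a", "page_b", "similarity"])
    writer.writerows(results)
```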
Function extract_content
The extract_content function is designed to scrape and clean text from web pages. Web scraping is a common practice in SEO for collecting data from different parts of a website, such as blogs, product pages, and service pages. This function does the following:
- Input: It accepts a URL as input, which points to the specific web page that needs to be analyzed.
- Web Scraping: It uses the requests library to fetch the webpage’s content and the BeautifulSoup library to parse the HTML of the page.
- Tag Removal: It removes unnecessary HTML elements that do not contribute meaningful content, such as <script>, <style>, <header>, <footer>, and other non-content tags like forms and iframes.
- Content Extraction: After cleaning up the page structure, the function extracts the main body content, specifically looking for textual data from <p>, <h1>, <h2>, <h3>, and <li> tags (which are commonly used to store paragraphs, headers, and list items).
- Output: It returns the cleaned and formatted text, which is then passed on for further processing (like embedding generation and similarity scoring).
This function ensures that only the most relevant and readable content is extracted from a webpage. In SEO, having access to clean and structured content is essential for analyzing the textual components of a page, which can then be used for tasks like content similarity, detecting gaps, and improving SEO rankings.
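A minimal sketch of such a function, based on the behavior described above (the project's exact tag list and error handling may differ):

```python
import requests
from bs4 import BeautifulSoup

def extract_content(url: str) -> str:
    """Fetch a page and return its readable text content."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop non-content elements before extracting text
    for tag in soup(["script", "style", "header", "footer", "nav", "form", "iframe"]):
        tag.decompose()

    # Keep text from common content-bearing tags
    parts = [el.get_text(" ", strip=True)
             for el in soup.find_all(["p", "h1", "h2", "h3", "li"])]
    return " ".join(part for part in parts if part)
```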
Function preprocess_text
This function cleans and normalizes the raw textual content extracted from web pages before it is processed by the language model. Web data often contains formatting issues, extra whitespace, special characters, and other non-standard elements that can interfere with text embedding models, so this step ensures the text is as clean and consistent as possible before it is embedded by the transformer.
The function performs several sequential cleaning steps:
· Multiple Space Removal
text = re.sub(r'\s+', ' ', text)
Replaces multiple consecutive whitespace characters (tabs, newlines, extra spaces) with a single space to normalize spacing across the content.
· Unusual Character Removal
text = re.sub(r'[^A-Za-z0-9.,;!?()\[\]\'\"\s]', '', text)
Filters out non-alphanumeric characters except basic punctuation. This removes symbols, emojis, and other non-essential elements.
· Short Word Removal
text = re.sub(r'\b\w{1,2}\b', '', text)
Removes very short words (1–2 characters) that often don’t contribute meaningful information and could be artifacts from HTML elements or abbreviations.
· Final Cleanup
return text.strip()
Trims any leading or trailing whitespace left after the cleaning process.
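Assembled from the steps above, the full function reads:

```python
import re

def preprocess_text(text: str) -> str:
    """Normalize raw page text before embedding."""
    text = re.sub(r'\s+', ' ', text)                           # collapse whitespace
    text = re.sub(r'[^A-Za-z0-9.,;!?()\[\]\'\"\s]', '', text)  # drop unusual characters
    text = re.sub(r'\b\w{1,2}\b', '', text)                    # remove 1-2 character words
    return text.strip()                                        # final trim
```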
By preprocessing the text, the function contributes significantly to the overall accuracy and effectiveness of content similarity analysis — a core part of this SEO-focused project.
Function load_model
Loads the all-MiniLM-L6-v2 model from the SentenceTransformers library.
Used to initialize the pretrained transformer model that generates contextualized embeddings for SEO content similarity analysis.
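In its simplest form, this is a thin wrapper around the SentenceTransformers loader (a sketch consistent with the description above):

```python
from sentence_transformers import SentenceTransformer

def load_model(name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
    """Load the pretrained embedding model (downloads weights on first use)."""
    return SentenceTransformer(name)
```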
Model Used: all-MiniLM-L6-v2
Overview
The all-MiniLM-L6-v2 model is a compact, high-performance sentence embedding model from the SentenceTransformers library. It is designed to convert sentences and paragraphs into dense vector representations that capture contextual meaning. Despite being lightweight, it delivers strong performance on tasks such as semantic similarity, clustering, and information retrieval.
How It Works
Architecture:
Based on the MiniLM architecture, which is a distilled version of larger transformer models like BERT. It includes 6 transformer layers and leverages self-attention mechanisms to understand relationships between words in a sentence.
Training Objective:
The model has been fine-tuned with a contrastive learning objective on a very large collection of sentence pairs (over one billion, drawn from diverse sources including Natural Language Inference datasets). This enables it to produce embeddings that preserve semantic relationships, not just surface-level token similarity.
Output:
Given a sentence or paragraph, the model returns a fixed-size embedding (384-dimensional vector). Similar sentences will produce embeddings that are close in the vector space, enabling semantic comparison.
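A quick sketch to confirm the fixed output size (not project code):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("A short test sentence.")
print(vec.shape)  # (384,) -- one fixed-size embedding regardless of input length
```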
Why It’s Used in This Project
Contextual Understanding:
Unlike traditional keyword matching, this model understands how words are used in context. This is critical for SEO, where the same term might mean different things on different pages.
Performance-Optimized:
Balances accuracy with computational efficiency, making it ideal for comparing multiple web pages at scale without significant processing overhead.
Pretrained and Ready:
Requires no additional training, which allows quick integration into real-world applications while still delivering reliable results.
How It Helps in SEO Analysis
· Detects Semantic Similarity Across Pages:
Allows the identification of pages that cover overlapping topics or content, even if they use different wording. This helps in:
- Identifying internal content redundancy.
- Ensuring content variety across services, blogs, and landing pages.
· Content Strategy Optimization:
By comparing similarity scores between service pages, homepages, and blog content, it’s possible to:
- Pinpoint misaligned or irrelevant content clusters.
- Refine site architecture and internal linking for better relevance.
· Competitive Benchmarking:
When applied across domains, it detects content overlaps or gaps between a site and its competitors, helping with:
- Unique value proposition assessment.
- Identifying opportunities for new content development.
· Improved User Experience and Search Intent Match:
Semantic similarity ensures content clusters align better with searcher intent, which can reduce bounce rates and improve on-site engagement.
Function get_embedding
This function plays a pivotal role in transforming raw text into vector embeddings using a transformer-based model. Here’s how it works:
- Input: It accepts a cleaned text string (the output of extract_content) and a transformer model (like MiniLM).
- Processing: The text is passed through a pre-trained transformer model, such as all-MiniLM-L6-v2, to generate a semantic vector representation. This is where context-sensitive word embeddings come into play: the model captures the meaning of words in context rather than just matching keywords. The call model.encode([text], convert_to_tensor=True) encodes the text into the vector embedding used for similarity comparison.
- Output: The result is a numerical vector (embedding) that encapsulates the semantic meaning of the input text. These embeddings are crucial because they allow us to compare different web pages based on their contextual similarity.
This function generates the core components needed to compare the content of different pages semantically. Embeddings make it possible to assess not just exact keyword matches but also the deeper meaning and context of the text on a page, which is essential for SEO tasks like comparing home, service, and contact pages, identifying content gaps, and detecting duplicate content.
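Based on that description, a minimal version of the function might look like this (the project's exact signature may differ):

```python
def get_embedding(text, model):
    """Encode cleaned page text into a context-aware embedding tensor."""
    # Keeping the batch dimension yields a 2-D tensor of shape (1, 384),
    # which cosine-similarity utilities accept directly.
    return model.encode([text], convert_to_tensor=True)
```

Printing the tensor returned for a page produces the dense numerical output interpreted next.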
Explanation:
The above output represents the dense vector embedding of a webpage, generated by the all-MiniLM-L6-v2 model. Each number in the tensor corresponds to a specific dimension in a 384-dimensional space, capturing the semantic features of the content. These embeddings allow the system to compute how similar two pieces of content are by comparing their vector distances, rather than relying on exact word matches. This enables a more accurate, context-aware comparison of web pages for SEO analysis.
Function calculate_similarity
The calculate_similarity function calculates the semantic similarity between two pages by comparing their embeddings:
- Input: It takes in two text embeddings — one from each page — and calculates the cosine similarity between them.
- Processing: The function computes the cosine of the angle between the two embedding vectors, i.e. their dot product divided by the product of their magnitudes. A score of 1 indicates that the vectors point in the same direction (identical meaning), while a score near 0 indicates that the pages are semantically unrelated.
- Output: It returns a similarity score that quantifies how similar the two pages are to each other in terms of their content.
This function enables the comparison of different pages or web content based on their semantic meaning, rather than just surface-level keyword overlap. In SEO, comparing page similarity is crucial for identifying duplicate content, assessing the effectiveness of content clustering, and ensuring that content is relevant and aligned with SEO goals.
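A sketch of that computation using numpy, matching the description above (embeddings are flattened to 1-D arrays first; the project's implementation may differ in detail):

```python
import numpy as np

def calculate_similarity(emb1, emb2) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    v1 = np.asarray(emb1).ravel()
    v2 = np.asarray(emb2).ravel()
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```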
Comparison of Two Individual Pages
The similarity score between the two compared pages is 0.3307.
This relatively low score suggests that the two pages have limited semantic overlap and likely target distinct topics. In a well-optimized website, such a separation is desirable, as it helps avoid content redundancy and ensures that each page serves a unique purpose for both users and search engines. This test confirms that the implemented similarity model effectively identifies topical differences between web pages.
Multi-Page Semantic Similarity Assessment
This section evaluates the contextual similarity between multiple webpages. The goal is to understand how semantically close different pieces of content are to each other, which is crucial for identifying overlaps, redundancies, or content gaps across a site.
How Similarity Is Measured
The project uses cosine similarity, a numerical score that quantifies how similar two pieces of text are in terms of their meaning, not just keyword matching. The model used—all-MiniLM-L6-v2—converts each page’s content into dense vectors (embeddings) that reflect the meaning of the page as a whole. The cosine similarity between these embeddings provides a semantic similarity score ranging from 0 to 1:
Similarity Score Ranges and Their Interpretation
Cosine similarity measures how close two pages are in terms of meaning and topic. The value ranges between 0 (completely different) and 1 (identical).
- 0.70 – 1.00: Highly similar – Likely about the same topic
- 0.50 – 0.69: Moderately similar – Overlapping or related topics
- 0.30 – 0.49: Slightly similar – Some shared context or terms
- Below 0.30: Low similarity – Different subjects entirely
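These bands can also be applied programmatically when reporting results; a small helper reflecting the table above:

```python
def interpret_score(score: float) -> str:
    """Map a cosine-similarity score to the interpretation bands above."""
    if score >= 0.70:
        return "Highly similar - likely about the same topic"
    if score >= 0.50:
        return "Moderately similar - overlapping or related topics"
    if score >= 0.30:
        return "Slightly similar - some shared context or terms"
    return "Low similarity - different subjects entirely"

print(interpret_score(0.3307))  # Slightly similar - some shared context or terms
```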
Result Analysis
Handling Document URLs vs. GBP Listing Analysis
Similarity Score: 0.3307
Interpretation: These pages have minimal overlap. They might share a few technical SEO terms, but the overall focus is different. This is considered a healthy level of separation for internal SEO content.
Handling Document URLs vs. Crawl Efficiency via Log Files
Similarity Score: 0.486
Interpretation: This pair is moderately close. Both touch on backend optimization practices, which could create some thematic similarity. However, they still maintain distinct purposes.
GBP Listing Analysis vs. Crawl Efficiency via Log Files
Similarity Score: 0.4182
Interpretation: These pages are slightly related. While both address optimization in different areas (local vs. technical SEO), the overlap is minor and does not raise concerns for redundancy.
What This Means for the Website
- Content Redundancy Check: No pair shows strong similarity (score > 0.70), so there is no duplication issue across these URLs' content.
- Topical Distinction: Each page maintains its own niche or focus, supporting a broader content strategy.
- SEO Planning Insight: This analysis provides confidence that new articles are not cannibalizing existing content and are aligned with distinct SEO goals.
Interpreting Internal Page Similarity
In this section, each group of similarity scores represents comparisons between internal pages of the same website. The objective here is to assess how distinct or overlapping the content is within a site’s key pages — such as the homepage, service-related pages, and contact or about pages.
Across the analyzed websites, a pattern emerges that reveals important insights into content structure and potential SEO opportunities:
Pages with Higher Similarity (Scores > 0.5)
Some page pairs show relatively high semantic similarity — for example, homepage vs. service page, or homepage vs. another major content page. While it’s expected for pages to share brand language and site-wide messaging, higher similarity scores suggest that the core informational content across these pages may not be distinct enough. This can blur the lines between what each page is supposed to communicate.
What this means:
- There could be reused sections like intros, headers, or CTAs that are too similar.
- Search engines might struggle to differentiate which page is more relevant for specific queries.
- These pages may benefit from clearer content segmentation and more topic-specific wording.
Pages with Moderate Similarity (Around 0.4–0.5)
This range generally reflects a balanced relationship: the pages share some thematic overlap but still maintain their own identity. It’s common between pages that touch on related topics (e.g., homepage vs. service overview) or that share certain phrasing but focus on different user needs.
What this means:
- The pages are likely aligned under a broader content theme, but they’re not redundant.
- This is often an acceptable and expected similarity for SEO, especially in navigational or branded content.
- Minor refinements can further sharpen the focus without needing a complete rewrite.
Pages with Low Similarity (Below 0.3)
These pairs show a strong level of distinction in terms of meaning and topic. This is generally a positive signal from an SEO perspective, especially between pages that should be targeting entirely different user intents — like service details versus contact or legal pages.
What this means:
- Content is clearly separated by purpose and likely serves different stages of the user journey.
- These pages help define a site’s topical hierarchy, improving crawl efficiency and page relevance.
- No immediate changes needed — this is usually the ideal scenario.
Overall Insight
The scores across different websites demonstrate how internal content structure varies. In some cases, there’s too much overlap between high-priority pages, which can lead to diluted messaging or SEO cannibalization. In other cases, the content is well-separated, helping each page stand on its own in the eyes of search engines.
The key takeaway here is that similarity scores are a powerful diagnostic — they don’t just measure content repetition but highlight whether a site’s messaging strategy is aligned with its page structure. This allows teams to pinpoint areas where content needs more differentiation or consolidation based on how meaningfully unique each page actually is.
What should I do after receiving the similarity scores?
Start by reviewing the highest similarity pairs. Focus first on core landing pages (e.g., homepage, services, product pages) that should have clear distinctions in content and purpose. Action steps:
- Prioritize content that has a score above 0.6 with other key pages.
- Check if these pages target overlapping keywords or user intents.
- Create a short audit summary of which pages need rewriting, consolidation, or structural adjustments.
What do the similarity scores actually represent?
The similarity scores reflect how semantically close two web pages are in terms of their content. A higher score means the pages convey similar ideas or topics; a lower score means the content is more distinct. These scores are based on contextual language representations, which go beyond keywords to understand actual meaning.
What is considered a “good” or “bad” score?
There’s no absolute good or bad score — it depends on which pages are being compared. However, general interpretations are:
- Above 0.6 -> High similarity (possibly overlapping content)
- 0.4 to 0.6 -> Moderate similarity (some overlap, but likely intentional)
- Below 0.4 -> Low similarity (distinct topics, often ideal for internal SEO)
For example, high similarity between a service page and a contact page is not ideal, while moderate similarity between related service pages may be acceptable.
Why does content similarity matter for SEO?
Search engines aim to rank the most relevant and unique page for a given query. If multiple pages on a site are too similar, they may compete with each other (called keyword cannibalization) or confuse the search engine about which one to prioritize. Distinct, well-structured content improves ranking potential and user clarity.
How can I use this data to improve my site’s content structure?
Use the scores as a roadmap:
- High-similarity pairs -> Reassess and differentiate
- Moderate-similarity pairs -> Fine-tune focus and value
- Low-similarity pairs -> Likely aligned well; ensure they’re linked where relevant
Also consider whether content silos or topic clusters are clearly implemented — each page should serve a unique purpose and target a distinct query intent.
Are there situations where high similarity is acceptable?
Yes, for some templated pages (e.g., legal disclaimers, location pages with small variations), higher similarity is expected. In such cases, technical solutions like canonicalization or structured data can help search engines handle the duplication properly.
What should I do if many pages across my site are too similar?
This suggests an architectural or strategic content issue. Consider:
- Auditing your content categories or silos
- Implementing clearer page-level intent
- Creating content briefs or templates to ensure uniqueness going forward
- Consolidating similar articles into a single authoritative guide with section anchors
Final Thoughts
This analysis has demonstrated the value of leveraging contextual language models to evaluate content similarity at scale. By quantifying the semantic overlap between web pages, it becomes possible to uncover structural and strategic issues that may not be immediately visible through manual audits or keyword-based tools alone.
The similarity scores offer actionable insights into how well each page is differentiated in terms of topic coverage, user intent, and SEO purpose. High similarity between key pages often indicates redundancy, keyword cannibalization, or diluted topical focus — all of which can negatively impact organic performance. On the other hand, low similarity generally signals well-differentiated content and a healthy content architecture.
Ultimately, this method empowers decision-makers to:
- Audit content quality and uniqueness more objectively
- Align pages with distinct search intents and funnel stages
- Streamline future content strategies and avoid overlap from the outset
As content continues to drive digital visibility, ensuring clear topical separation between core pages is critical. This approach helps not only in improving search engine rankings but also in enhancing the user experience by providing focused, relevant, and non-redundant information throughout the site.
For optimal results, it is recommended to integrate similarity analysis into regular SEO and content workflows — especially when expanding content, redesigning site structures, or merging pages. When applied consistently, it becomes a powerful tool in building sustainable and search-optimized content ecosystems.