SEO Optimization with Gemma — Leveraging Embeddings for Query–Content Alignment and Insights

    This project demonstrates the application of advanced embedding models to search engine optimization (SEO) for analyzing and organizing website content. Webpages are processed alongside defined queries to evaluate how effectively each section aligns with search intent. This process provides a detailed view of content coverage, relevance, and areas where optimization opportunities exist.

    SEO Optimization with Gemma

    At the core of the project is Gemma, a state-of-the-art language model. Its capability to generate highly contextual embeddings captures semantic meaning beyond simple keyword matching. This makes it possible to assess relationships between queries and content at a deeper level, revealing subtle patterns of relevance and alignment that conventional approaches may overlook.

    The framework integrates embeddings, clustering, and semantic alignment to deliver actionable insights on content performance across multiple queries and URLs. In addition, the analysis highlights opportunities such as improved internal linking and identification of semantic gaps that can enhance coverage.

    By positioning Gemma as a central enabler of next-generation SEO analysis, the project establishes a methodology that not only explains current content performance but also informs strategies for content expansion, refinement, and optimization.

    Project Purpose

    The purpose of this project is to apply advanced embedding techniques, with a focus on Gemma, to strengthen the analytical foundation of SEO-focused content evaluation. Modern search performance depends on how well webpages align with specific queries, how consistently they cover important topics, and how effectively they interconnect with related content. Traditional keyword-based methods provide only surface-level insights, while embedding-driven approaches capture deeper semantic relationships.

    This project introduces a structured pipeline that uses Gemma embeddings to analyze query–content alignment, highlight clusters of related sections, identify opportunities for internal linking, and uncover semantic gaps across multiple URLs. Each deliverable directly supports critical aspects of content optimization. Similarity scoring ensures queries are matched with the most relevant page sections, clustering reveals thematic structure, internal link recommendations encourage stronger site connectivity, and semantic gap detection ensures key queries are adequately addressed.

    The purpose is not limited to demonstrating technical capability, but to show how embedding-based analysis can be used to refine SEO strategy. By systematically evaluating relevance, structure, and connectivity, the project builds a framework where optimization decisions are guided by measurable evidence, leading to stronger query alignment, reduced content redundancy, and more comprehensive topic coverage.

    Introduction to Gemma

    Gemma is a suite of advanced AI models developed by Google, designed for understanding and representing language with high semantic fidelity. It provides versatile tools for natural language processing, including text classification, retrieval, and semantic analysis. In this project, the focus is on the embedding component of Gemma.

    EmbeddingGemma Overview

    EmbeddingGemma generates vector representations of text that capture meaning and context. These embeddings enable precise comparisons between queries and content, allowing for advanced semantic analysis in SEO applications.

    Benefits of EmbeddingGemma in This Project

    • Query–Content Semantic Alignment: Matches user queries with the most relevant content sections using semantic similarity rather than keyword overlap.
    • Content Clustering: Groups related sections across pages to identify topic overlaps or content cannibalization risks.
    • Internal Linking Recommendations: Suggests links between semantically close sections to improve site architecture and authority distribution.
    • Semantic Gap Detection: Identifies queries not well-covered by existing content, guiding content creation priorities.

    Technical Highlights Relevant to the Project

    • Separate encoders for queries and documents (encode_query, encode_document) optimize retrieval accuracy.
    • Section-level embeddings allow granular insights into relevance, clustering, and coverage gaps.
    • Compact and efficient model design supports multi-page analysis without excessive computational cost.

    Practical Application in This Project

    Pages are segmented into sections, each embedded using EmbeddingGemma. Queries are embedded separately, then aligned with sections to measure relevance. Clustering identifies thematic relationships, while internal linking and gap analysis provide actionable SEO insights.

    Understanding Gemma Embeddings

    • Purpose: Gemma is a large-scale embedding model designed to represent textual data in a dense, high-dimensional vector space. Embeddings capture semantic meaning beyond literal word matches, enabling deeper comparisons between text snippets.
    • Functionality: Queries and content sections are transformed into dense vectors, where semantic similarity can be computed using measures like cosine similarity; a minimal example of this computation follows this list. The embeddings encode not just individual words but contextual and relational meaning across the text.
    • Practical Implications: Sections with high similarity in embedding space are semantically aligned with the queries, even if exact keywords are not present. This allows detection of relevance and meaning at a conceptual level, supporting better content alignment, clustering, and gap analysis.
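
    As referenced above, here is a minimal, self-contained sketch of the cosine comparison. The vectors are tiny placeholders rather than real Gemma embeddings (which have hundreds of dimensions); the formula is the same.

    import numpy as np

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        # Dot product of the vectors divided by the product of their lengths.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query_vec = np.array([0.2, 0.7, 0.1])           # placeholder query embedding
    section_vecs = [np.array([0.25, 0.65, 0.05]),   # placeholder section embeddings
                    np.array([0.90, 0.05, 0.40])]

    print([round(cosine_sim(query_vec, v), 3) for v in section_vecs])
    # The first section scores far higher, i.e. it is semantically closer to the query.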

    Understanding Query–Content Semantic Alignment

    • Purpose: Ensures that each search query is matched to the most relevant content section based on semantic meaning, rather than exact keyword matches.
    • Mechanism:
      • Both queries and content sections are represented as Gemma embeddings.
      • Cosine similarity is computed between the query vector and section vectors.
      • Sections are ranked by similarity, highlighting areas that align strongly with the query intent.
    • Insights Derived: This method identifies content sections that best satisfy the underlying meaning of queries, revealing opportunities to enhance coverage or optimize sections for semantic relevance.

    Understanding Content Coverage Clustering

    • Purpose: Groups semantically similar sections across pages into thematic clusters to understand coverage overlap and redundancy.
    • Mechanism:
      • Sections with embeddings that are close in vector space are assigned the same cluster label.
      • Clustering highlights overlapping or related content, helping to organize information coherently.
    • Insights Derived: Identifies redundancies, thematic gaps, and sections that require content enhancement or reorganization for better topical coverage.

    Understanding Internal Linking Recommendations

    • Purpose: Suggests links between semantically related sections to improve content navigation and topical authority.
    • Mechanism:
      • Sections with high semantic similarity but lacking existing links are identified.
      • Source and target sections are paired, along with relevant text snippets for context.
    • Insights Derived: Enables structuring of internal links that naturally connect related content, improving discoverability and reinforcing semantic relationships across content clusters.

    Understanding Semantic Gap Analysis

    • Purpose: Detects queries that are insufficiently addressed in existing content, highlighting gaps in coverage.
    • Mechanism:
      • The highest similarity score between each query and available sections is evaluated.
      • Queries with low matching scores are flagged as coverage gaps.
    • Insights Derived: Reveals underrepresented topics, guiding content creation to ensure comprehensive coverage of relevant search intents.

    Questions and Answers on Project Value and Importance

    How does embedding-based analysis strengthen the alignment between search intent and website content?

    Embedding-based analysis ensures that webpage sections are evaluated not only for keyword presence but for true semantic meaning. This strengthens alignment with search intent by capturing context, synonyms, and related expressions that traditional keyword matching overlooks. When search intent and content meaning are closely aligned, webpages stand a greater chance of ranking higher, achieving stronger visibility, and sustaining relevance across evolving search queries.

    What SEO advantages are created by applying a semantic-first approach with Gemma embeddings?

    A semantic-first approach powered by Gemma embeddings provides advantages that extend beyond surface-level keyword optimization. Search engines increasingly evaluate relevance based on context and meaning, rather than exact phrases. Embedding-driven analysis captures these relationships, enabling optimization that mirrors how modern ranking systems interpret content. This leads to improved topical authority, reduced keyword cannibalization, and more natural content flows that benefit both visibility and user experience.

    Why is analyzing intent consistency across webpages critical for SEO success?

    Consistency of intent across webpages ensures that content fulfills distinct but complementary roles within a broader site structure. Without consistent intent mapping, overlapping or mismatched content may dilute relevance and confuse both search engines and audiences. By analyzing and maintaining intent consistency, webpages reinforce each other, establish stronger authority within specific topics, and create a cohesive ecosystem of content that aligns with long-term ranking performance.

    In what ways does embedding-driven analysis support more effective content strategy development?

    Embedding-driven analysis reveals patterns in how different topics, subtopics, and queries relate to one another. These insights form the foundation of effective content strategies by highlighting strengths in topical coverage, detecting weak areas, and organizing content into structured hierarchies. The approach supports long-term growth strategies by guiding decisions on which areas to expand, consolidate, or interlink, ensuring that every addition strengthens the overall site’s authority and relevance.

    How does this project align with the direction of modern search engine ranking systems?

    Modern ranking systems increasingly depend on machine learning models that evaluate meaning and context, rather than relying on keyword frequency. By adopting embeddings, the methodology in this project mirrors the core mechanics of these ranking systems. This alignment ensures that optimization strategies remain future-proof, adapting naturally as search engines evolve, and maintaining competitiveness even as algorithms become more context-sensitive and intent-driven.

    What broader business outcomes can be achieved through this type of SEO-focused embedding analysis?

    Beyond improved rankings, embedding-based SEO analysis translates into tangible business outcomes. Higher visibility for relevant queries drives qualified traffic, while better content alignment increases engagement and retention. Stronger topical authority enhances trust and reputation, improving conversion potential. At a strategic level, embedding-driven insights support sustainable growth by ensuring that investment in content directly contributes to visibility, authority, and measurable business value.

    Libraries Used

    Requests

    The requests library is a widely used Python module designed for sending HTTP requests in a simple and human-friendly manner. It abstracts the complexities of network communication and provides methods to retrieve data from webpages, APIs, or any online resource. Its flexibility and simplicity make it the default choice for many data science and web-related workflows.

    In this project, requests is used to fetch raw webpage content directly from URLs. Since the analysis requires real-world webpage text, this library provides the foundation for accessing the live content that forms the input for subsequent cleaning, processing, and embedding steps.

    Logging

    The logging library is part of Python’s standard utilities that enable structured tracking of application events. It provides levels of severity (debug, info, warning, error, critical) to monitor the flow of execution and capture important events without interrupting the program.

    Here, logging is configured to track warnings and essential messages during the execution pipeline. It ensures that errors in data fetching, processing, or embedding are captured and reported, which is crucial in maintaining stability and traceability in a real-world project.

    Re

    The re module is Python’s built-in library for working with regular expressions. It is primarily used for identifying, matching, and transforming text patterns with high flexibility. Regular expressions are widely applied in text processing tasks that require precise matching of characters, words, or structures.

    Within this project, re plays a role in text preprocessing and cleaning by identifying unwanted characters, HTML fragments, or irregular patterns in webpage content. This ensures that only meaningful and structured text is passed forward for embedding and analysis.

    Html

    The html library in Python provides tools to work with HTML entities and text formatting. It is particularly useful when decoding or escaping characters that appear in webpages, such as &amp;, &nbsp;, or other HTML-coded entities.

    Here, html is used to clean webpage text after extraction, ensuring that the textual data is free from encoded characters that could interfere with embeddings or similarity analysis. It contributes to improving text readability and accuracy in the analysis pipeline.

    Unicodedata

    The unicodedata module provides access to the Unicode Character Database and offers tools for character-level transformations. It is especially useful for normalizing text by handling accented characters, diacritics, and other multilingual inputs.

    In this project, unicodedata is used to normalize webpage text, making sure that characters are standardized before embeddings are generated. This step avoids mismatches and inconsistencies, particularly when dealing with multilingual or diverse content sources.

    Time

    The time library offers functionality related to handling time and delays in execution. It is a core Python module used for measuring execution speed, adding pauses, or handling scheduling.

    This project uses time in areas where controlled delays are needed, such as handling multiple requests to webpages without overloading servers. It helps maintain stability and ethical execution while fetching and processing content.

    BeautifulSoup (bs4)

    BeautifulSoup is a Python library specifically designed for parsing and extracting data from HTML and XML documents. It provides an intuitive way to navigate webpage structures, search for tags, and clean raw markup into structured text.

    In this project, BeautifulSoup is a core component for extracting structured blocks from webpages. It identifies headings, paragraphs, and text elements, allowing the content to be broken into meaningful sections for embedding and similarity evaluation.

    Typing

    The typing module is a Python feature that supports type hinting, enabling developers to specify expected data types for variables, functions, and structures. It enhances readability and makes large-scale projects more maintainable.

    This project uses typing to clearly define input and output structures across functions such as dictionaries, lists, sets, and tuples. This ensures that the project codebase remains reliable, easy to understand, and less prone to type-related bugs.

    NumPy (numpy)

    NumPy is a core numerical computing library in Python that provides efficient operations on arrays and matrices, along with a wide range of mathematical functions. It underpins much of the Python scientific computing ecosystem.

    Here, NumPy is used to handle embeddings and similarity computations. Embeddings are represented as vectors, and NumPy enables fast operations on these vectors, including similarity calculations and clustering preparation.

    SentenceTransformers

    SentenceTransformers is a library built on top of PyTorch and Hugging Face Transformers, designed to generate semantically meaningful sentence and paragraph embeddings. It provides pre-trained models and tools for embedding generation and comparison.

    In this project, SentenceTransformers is used to embed webpage sections and queries into vector space. These embeddings form the foundation of semantic similarity analysis, intent alignment, and clustering operations.

    Scikit-learn (AgglomerativeClustering, cosine_similarity)

    Scikit-learn is a machine learning library offering algorithms for clustering, classification, regression, and more. The specific tools used here are Agglomerative Clustering and cosine similarity metrics.

    Agglomerative Clustering groups similar sections into clusters, highlighting content themes across webpages. Cosine similarity quantifies the closeness between query embeddings and content embeddings, directly supporting alignment and gap analysis.

    Deepcopy

    The deepcopy function from the copy module allows creating entirely new copies of complex objects in Python. Unlike shallow copies, it ensures that all nested elements are duplicated without reference overlap.

    In this project, deepcopy is critical to avoid query overwriting issues when multiple queries are evaluated per URL. It ensures that each query-result pair is maintained independently, preventing data leakage or unintended modifications.

    Torch (PyTorch)

    PyTorch is a deep learning framework widely used for building, training, and deploying machine learning models. It offers dynamic computation graphs and extensive GPU support for efficient training and inference.

    In this project, PyTorch underpins the embedding generation process used by SentenceTransformers and Gemma-based models. It enables efficient model execution, ensuring fast and accurate embedding computations.

    Transformers (utils)

    The Hugging Face Transformers library provides state-of-the-art models for natural language processing tasks, including embeddings, classification, and generation. The utils submodule includes tools for controlling model behavior, logging, and resource handling.

    Here, Transformers’ utilities are used to reduce unnecessary logging and disable progress bars during execution. This improves readability and ensures that only meaningful information is surfaced during pipeline runs.

    Matplotlib (pyplot)

    Matplotlib is a widely used library for data visualization in Python. The pyplot interface provides simple functions to create a variety of charts and graphs.

    In this project, Matplotlib is used to generate visual insights such as heatmaps, distributions, and semantic coverage summaries. These visualizations make results easier to interpret and communicate effectively.

    Seaborn

    Seaborn is a visualization library built on top of Matplotlib, designed to provide high-level interfaces for creating statistically rich plots. It enhances the aesthetic quality of visualizations and simplifies complex plotting tasks.

    Within this project, Seaborn is used to create clear and informative heatmaps, pie charts, and distribution plots. These plots visually demonstrate semantic relationships, gaps, and coverage patterns in ways that raw numbers cannot.

    Function: extract_structured_blocks

    Overview

    The extract_structured_blocks function is responsible for extracting structured textual content from a webpage URL. It ensures that the extracted text is usable for downstream NLP tasks by cleaning and organizing it into consistent content blocks. The function works in a hierarchical manner:

    • HTML Fetching – Downloads page HTML using a polite request strategy.

    • Parsing & Cleaning – Strips out irrelevant tags like <script>, <style>, and navigational elements.

    • Content Structuring – Attempts three levels of extraction:
      • Hierarchical: Organizes content under <h2> and <h3> headings.
      • Section-based Fallback: Groups paragraphs under their nearest <h2>.
      • Block-based Fallback: Uses standalone paragraphs if structured headings are missing.

    • Block Length Handling – Ensures minimum block length and splits overly long sections into manageable chunks.

    • Return Schema – Outputs a consistent structure containing the URL and a list of content blocks, where each block may include heading, subheading, text, and extraction method.

    This modular design ensures that even poorly structured pages can be processed into clean, analyzable sections.

    Key Line Explanation

    • Fetching HTML Safely

    response = requests.get(url, timeout=request_timeout, headers={"User-Agent": "Mozilla/5.0"})

    This line requests the webpage content while using a timeout and a browser-like user agent to reduce the risk of being blocked.

    • Removing Irrelevant Tags

    Cleans the HTML by removing non-content elements, leaving only textual sections relevant for analysis.

    • Splitting Long Blocks into Chunks

    Prevents excessively long paragraphs by breaking them into smaller, more manageable chunks suitable for NLP models.

    • Hierarchical Extraction Logic

    for tag in soup.find_all(["h2", "h3", "p", "li", "blockquote"]):

    This iterates through important structural tags in order to capture both headings and the associated text blocks.

    • Fallback Extraction

    Ensures robustness by trying multiple extraction methods if the hierarchical structure is missing or too sparse.
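
    The following condensed sketch, shown for illustration only, captures the fetch-clean-group flow described above. The fallback levels, minimum-length checks, and chunk splitting of the full function are omitted here.

    import requests
    from bs4 import BeautifulSoup
    from typing import Dict, List

    def extract_structured_blocks(url: str, request_timeout: int = 10) -> Dict:
        response = requests.get(url, timeout=request_timeout,
                                headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")

        # Strip non-content elements before any extraction.
        for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
            tag.decompose()

        # Hierarchical pass: group running text under the nearest h2/h3 heading.
        sections: List[Dict] = []
        heading, buffer = None, []
        for tag in soup.find_all(["h2", "h3", "p", "li", "blockquote"]):
            text = tag.get_text(" ", strip=True)
            if tag.name in ("h2", "h3"):
                if buffer:  # close the block collected under the previous heading
                    sections.append({"heading": heading, "text": " ".join(buffer)})
                    buffer = []
                heading = text
            elif text:
                buffer.append(text)
        if buffer:
            sections.append({"heading": heading, "text": " ".join(buffer)})

        return {"url": url, "sections": sections}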

    Function: preprocess_text

    Overview

    The preprocess_text function standardizes and cleans raw webpage text so it is ready for downstream NLP tasks. Since raw content often includes noise such as boilerplate text, inline URLs, or inconsistent punctuation, this function ensures that only meaningful, clean text remains.

    Key operations include:

    • Normalization: Handles HTML entities and Unicode inconsistencies.
    • Noise Removal: Strips out common boilerplate phrases (e.g., “privacy policy”, “read more”) and inline URLs.
    • Punctuation Standardization: Converts curly quotes, dashes, and other variations into consistent forms.
    • Whitespace Handling: Collapses unnecessary whitespace into clean spacing.

    This results in a text block that is lean, consistent, and optimized for embeddings, classification, or similarity tasks.

    Key Line Explanation

    • Boilerplate Filtering

    boilerplate_regex = re.compile(r"\b(" + "|".join(base_patterns) + r")\b", re.IGNORECASE)

    Creates a case-insensitive regex to detect and remove recurring boilerplate phrases that typically add no semantic value.

    • URL Removal

    Ensures inline URLs are stripped out so they don’t interfere with NLP models or similarity scoring.

    • HTML and Unicode Normalization

    Converts HTML entities (like &amp;) and different Unicode forms into a consistent representation.

    • Substitutions for Clean Punctuation

    substitutions = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u2013": "-", "\u2014": "-"}

    Replaces typographic variations with standardized ASCII equivalents to maintain consistency across text.

    • Whitespace Collapsing

    text = re.sub(r"\s+", " ", text).strip()

    Removes unnecessary line breaks, tabs, or multiple spaces to ensure a clean flow of text.
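
    Taken together, the cleaning steps above can be sketched as follows; the boilerplate list and substitution map are shortened, illustrative versions of the fuller lists discussed in this section.

    import html
    import re
    import unicodedata

    def preprocess_text(text: str) -> str:
        text = html.unescape(text)                  # decode HTML entities
        text = unicodedata.normalize("NFKC", text)  # standardize Unicode forms
        text = re.sub(r"https?://\S+", "", text)    # strip inline URLs
        boilerplate = re.compile(r"\b(privacy policy|read more)\b", re.IGNORECASE)
        text = boilerplate.sub("", text)            # drop boilerplate phrases
        # Map typographic quotes and dashes to plain ASCII equivalents.
        substitutions = {"\u201c": '"', "\u201d": '"', "\u2018": "'",
                         "\u2019": "'", "\u2013": "-", "\u2014": "-"}
        for old, new in substitutions.items():
            text = text.replace(old, new)
        return re.sub(r"\s+", " ", text).strip()    # collapse whitespace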

    Function: preprocess_page

    Overview

    The preprocess_page function applies preprocess_text to every section within a page object. Since extracted blocks may contain boilerplate or messy text, this function ensures that all sections are uniformly cleaned before downstream tasks like embeddings or intent classification.

    By iterating over each section in the page[“sections”] list, the function guarantees that the entire page is processed consistently. It also supports adding custom boilerplate removal patterns when needed for domain-specific cases.

    Key Line Explanation

    • Iterating Over Sections

    for section in page.get("sections", []):

    Safely loops through all content sections. The get method ensures the function won’t fail if “sections” is missing.

    • Applying Text Preprocessing

    section["text"] = preprocess_text(section["text"], boilerplate_extra)

    Cleans each section’s text individually using the preprocess_text function, ensuring consistent formatting across the page.

    Function: load_embedding_model

    Overview

    The load_embedding_model function initializes and loads the Gemma embedding model into the environment using the SentenceTransformer library. Since embeddings are the foundation for similarity calculations, clustering, and downstream SEO-focused NLP tasks, this function ensures the model is properly set up before any analysis begins.

    It dynamically checks whether GPU (cuda) is available and assigns the model to the appropriate device, allowing efficient performance across different system configurations. By leveraging Hugging Face integration, the function can securely authenticate with an access token and fetch the specified Gemma model (google/embeddinggemma-300m by default).

    The output is a ready-to-use SentenceTransformer object that can directly transform text into dense vector embeddings for advanced processing.

    Key Line Explanation

    • Device Detection

    device = "cuda" if torch.cuda.is_available() else "cpu"

    Automatically selects GPU when available, ensuring maximum efficiency for embedding generation. If GPU is not available, the function defaults to CPU.

    • Model Initialization

    model = SentenceTransformer(model_name, token=hf_token, device=device)

    Loads the Gemma embedding model from Hugging Face, authenticated with the given token. This line establishes the core embedding functionality that will be used throughout the project.
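
    A minimal version of the loader, assuming a Hugging Face access token that is authorized to download google/embeddinggemma-300m:

    import torch
    from sentence_transformers import SentenceTransformer

    def load_embedding_model(hf_token: str,
                             model_name: str = "google/embeddinggemma-300m") -> SentenceTransformer:
        # Prefer GPU when present; embedding generation is much faster on CUDA.
        device = "cuda" if torch.cuda.is_available() else "cpu"
        return SentenceTransformer(model_name, token=hf_token, device=device)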

    Function: embed_sections

    Overview

    The embed_sections function generates vector embeddings for each section of a webpage using the pre-loaded Gemma embedding model. After a webpage has been processed and structured into clean text blocks, this function transforms the text into high-dimensional numerical vectors.

    These embeddings form the backbone of the entire analysis, enabling semantic similarity calculations, clustering, coverage detection, and intent analysis. Each embedding represents the meaning of a text block, making it possible to compare sections and queries in a context-aware manner.

    By appending the embeddings directly into each section dictionary, the function ensures that the enriched data is ready for subsequent steps in the pipeline.

    Key Line Explanation

    • Extracting Section Texts

    texts = [s.get("text", "") for s in page.get("sections", [])]

    Collects all cleaned text blocks from the structured sections of the page to prepare them for embedding.

    • Handling Empty Sections

    Provides a warning in case no text sections exist, ensuring potential extraction issues are flagged early.

    • Generating Embeddings

    doc_embeddings = model.encode_document(texts).tolist()

    Uses the embedding model to encode the list of text blocks into numerical vectors. The .tolist() conversion makes the embeddings compatible with JSON-like structures.

    • Storing Embeddings in Sections

    Iterates over each text block and assigns the corresponding embedding, enriching the section data with semantic representation.
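
    A compact sketch of this step, assuming the page dictionary produced by the extraction and preprocessing functions above:

    from typing import Dict
    from sentence_transformers import SentenceTransformer

    def embed_sections(model: SentenceTransformer, page: Dict) -> Dict:
        texts = [s.get("text", "") for s in page.get("sections", [])]
        if not texts:
            return page  # empty pages are flagged upstream via logging
        # encode_document is the passage-side encoder of the dual-encoder setup.
        doc_embeddings = model.encode_document(texts).tolist()
        for section, emb in zip(page["sections"], doc_embeddings):
            section["embedding"] = emb  # enrich each section in place
        return page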

    Function: embed_query

    Overview

    The embed_query function is responsible for generating the vector embedding of a search query. Unlike webpage sections that are encoded with encode_document, queries are encoded using encode_query. This distinction is important because embedding models are often optimized differently for document text and query text, ensuring that relevance alignment reflects real-world search behavior.

    The output is a numerical vector representation of the query, which allows direct comparison with section embeddings through similarity measures such as cosine similarity. This forms the basis for aligning queries with content, detecting semantic gaps, and generating internal linking insights.

    By embedding queries separately, the system ensures a dual-encoder approach that mirrors how modern retrieval systems work: one encoder tuned for documents and another for queries.

    Key Line Explanation

    • Encoding the Query

    embedding = model.encode_query(query).tolist()

    Transforms the query text into a semantic vector using the model’s encode_query method. The .tolist() ensures the vector is in a Python-native format, making it easy to store and manipulate downstream.

    Function: align_query_to_sections

    Overview

    The align_query_to_sections function is where query embeddings meet content embeddings. Its main role is to compute similarity scores between a given query and all sections of a webpage. Each section embedding is compared with the query embedding, and a cosine similarity score is added to the section.

    This step transforms raw embeddings into actionable relevance signals. By sorting sections based on similarity scores, the function highlights which parts of a webpage are most aligned with the query. These scores later drive semantic gap analysis, internal linking recommendations, and visualization insights.

    The function ensures robustness by validating the presence of sections and embeddings, normalizing vectors for proper cosine similarity calculation, and safely handling errors without breaking the pipeline.

    Key Line Explanation

    • Input Validation for Sections

    Confirms that the page has content sections to process; otherwise, logs a warning and exits early.

    • Collecting Section Embeddings

    Retrieves embeddings from each section. If no embeddings exist, the function terminates safely.

    • Normalization for Cosine Similarity

    Both section and query embeddings are normalized to unit length. This ensures that cosine similarity (dot product of normalized vectors) is correctly calculated.

    • Similarity Calculation

    Computes cosine similarity between each section and the query. Each section receives a numerical score representing alignment with the query.

    • Sorting by Relevance

    Reorders sections in descending order of similarity, making the most relevant sections appear first.

    • Attaching Query Metadata

    page["query"] = query

    Stores the query alongside the sections for traceability in downstream analysis.
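
    The alignment logic reduces to normalize, score, and sort, as in the sketch below. It assumes each section carries an "embedding" key from embed_sections; the defensive error handling of the full function is omitted.

    from typing import Dict, List
    import numpy as np

    def align_query_to_sections(query: str, query_embedding: List[float], page: Dict) -> Dict:
        sections = page.get("sections", [])
        if not sections:
            return page
        sec_matrix = np.array([s["embedding"] for s in sections], dtype=float)
        q = np.array(query_embedding, dtype=float)
        # Unit-normalize so a plain dot product equals cosine similarity.
        sec_matrix /= np.linalg.norm(sec_matrix, axis=1, keepdims=True)
        q /= np.linalg.norm(q)
        for section, score in zip(sections, sec_matrix @ q):
            section["similarity"] = float(score)
        page["sections"] = sorted(sections, key=lambda s: s["similarity"], reverse=True)
        page["query"] = query  # keep the query attached for traceability
        return page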

    Function: cluster_sections

    Overview

    The cluster_sections function identifies semantic groupings among webpage sections. After embeddings are generated and similarity with queries is calculated, clustering provides a higher-level understanding of content. Sections with related themes or overlapping meaning are grouped into clusters, helping uncover topical structures across multiple URLs.

    This process is particularly important for detecting content overlap, consolidating internal linking opportunities, and evaluating how well different pages address the same query intent. The function uses hierarchical clustering (AgglomerativeClustering) with either a fixed number of clusters or a distance threshold to determine natural groupings.

    By assigning a cluster label to every section, the function makes it easier to visualize, compare, and interpret semantic relationships. This enables actionable insights such as internal linking recommendations and identifying content duplication or gaps.

    Key Line Explanation

    • Collecting Embeddings Across Pages

    Extracts embeddings from all sections across multiple pages. section_refs keeps track of where each embedding came from, ensuring clusters can be mapped back correctly.

    • Clustering Setup

    Applies hierarchical clustering to the section embeddings. If n_clusters is not specified, the distance_threshold controls how granular or broad clusters become.

    • Assigning Cluster Labels

    Each section is assigned a numeric cluster identifier, connecting it to its semantically closest neighbors.
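
    A condensed sketch of the clustering step; the distance_threshold default here is illustrative and would normally be tuned per site.

    from typing import Dict, List, Optional
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_sections(pages: List[Dict], n_clusters: Optional[int] = None,
                         distance_threshold: float = 0.35) -> List[Dict]:
        embeddings, section_refs = [], []
        for page in pages:
            for section in page.get("sections", []):
                embeddings.append(section["embedding"])
                section_refs.append(section)  # remember where each vector came from
        clustering = AgglomerativeClustering(
            n_clusters=n_clusters,
            distance_threshold=None if n_clusters else distance_threshold,
            metric="cosine", linkage="average")
        labels = clustering.fit_predict(np.array(embeddings))
        for section, label in zip(section_refs, labels):
            section["cluster"] = int(label)  # map cluster labels back to sections
        return pages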

    Function: recommend_internal_links

    Overview

    The recommend_internal_links function generates internal linking suggestions between semantically similar sections located on different pages. It leverages cosine similarity of embeddings to identify passages that closely align in meaning and would benefit from being connected through internal links.

    This method ensures that linking decisions are made based on semantic relevance rather than just keyword overlap. By doing so, the function helps improve site navigation, topical authority, and user experience while also supporting stronger SEO signals for related content.

    The recommendations include the source URL, target URL, a similarity score, and short text snippets from each section to provide context for why the link is suggested.

    Key Line Explanation

    • Collect Section Embeddings Across Pages

    Gathers all sections and their embeddings from every page. Each section is tied to its URL and raw text for later linking suggestions.

    • Build Similarity Matrix

    sim_matrix = cosine_similarity(embeddings)

    Computes pairwise semantic similarity between every section. The higher the cosine similarity, the closer two passages are in meaning.

    • Filter for Cross-URL Matches

    if src_url == tgt_url:
        continue

    Ensures only cross-page links are recommended. Links between sections of the same page are excluded to maintain meaningful internal linking.

    • Apply Similarity Threshold

    if sim_score >= similarity_threshold:
        sims.append((sim_score, tgt_url, tgt_text))

    Only suggests links if the similarity exceeds the defined threshold (default 0.75), which helps avoid irrelevant recommendations.

    • Rank and Select Top-k

    sims = sorted(sims, key=lambda x: x[0], reverse=True)[:top_k]

    Sorts potential link candidates by similarity and limits to the strongest top_k (default 5).

    • Generate Final Recommendations

    Prepares structured recommendations with URLs, similarity score, and trimmed snippets for readability in reports.
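
    A simplified sketch of the recommendation logic, using the 0.75 threshold and top-5 limit mentioned above; the 120-character snippet length is an illustrative choice.

    from typing import Dict, List
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def recommend_internal_links(pages: List[Dict], similarity_threshold: float = 0.75,
                                 top_k: int = 5) -> List[Dict]:
        urls, texts, embeddings = [], [], []
        for page in pages:
            for section in page.get("sections", []):
                urls.append(page["url"])
                texts.append(section["text"])
                embeddings.append(section["embedding"])
        sim_matrix = cosine_similarity(np.array(embeddings))
        recommendations = []
        for i, (src_url, src_text) in enumerate(zip(urls, texts)):
            sims = []
            for j, (tgt_url, tgt_text) in enumerate(zip(urls, texts)):
                if src_url == tgt_url:
                    continue  # only cross-page links are useful here
                if sim_matrix[i, j] >= similarity_threshold:
                    sims.append((float(sim_matrix[i, j]), tgt_url, tgt_text))
            for score, tgt_url, tgt_text in sorted(sims, key=lambda x: x[0], reverse=True)[:top_k]:
                recommendations.append({"source_url": src_url, "target_url": tgt_url,
                                        "similarity": round(score, 3),
                                        "source_snippet": src_text[:120],
                                        "target_snippet": tgt_text[:120]})
        return recommendations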

    Function: semantic_gap_analysis

    Overview

    The semantic_gap_analysis function evaluates how well client queries are covered across a site’s content. It scans through the results (with query-to-section similarity scores already computed) and identifies queries that lack strong representation in the site.

    By comparing the best similarity score per query against a configurable threshold (coverage_threshold), the function marks queries as either:

    • Covered: The site has at least one section that adequately matches the query.
    • Gap: The site lacks strong enough coverage, signaling a content opportunity.

    The output gives a clear map of queries with strong matches vs. weak or missing ones, enabling data-driven SEO content expansion.

    Key Line Explanation

    • Group Results by Query

    queries = set([res["query"] for res in results])

    Ensures that analysis is done query by query across all URLs and sections.

    • Find Best Section Coverage per Query

    Loops through every page’s sections to find the highest similarity score for the query. Stores both the score and the URL of the best-matching page.

    • Apply Coverage Threshold

    status = "covered" if best_score >= coverage_threshold else "gap"

    Marks the query as a gap if its best score falls below the threshold (default 0.6). This ensures only strong semantic matches count as covered.

    • Store Final Findings

    Prepares structured output per query with its best score, best URL, and status (covered / gap).
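
    Sketched in code, the gap check looks like the following; it assumes the aligned results produced earlier, where each result carries a query, a url, and similarity-scored sections, and it uses the 0.6 default threshold noted above.

    from typing import Dict, List

    def semantic_gap_analysis(results: List[Dict], coverage_threshold: float = 0.6) -> List[Dict]:
        findings = []
        for query in sorted({res["query"] for res in results}):
            best_score, best_url = 0.0, None
            for res in (r for r in results if r["query"] == query):
                for section in res.get("sections", []):
                    if section.get("similarity", 0.0) > best_score:
                        best_score, best_url = section["similarity"], res["url"]
            status = "covered" if best_score >= coverage_threshold else "gap"
            findings.append({"query": query, "best_score": round(best_score, 3),
                             "best_url": best_url, "status": status})
        return findings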

    Function: run_pipeline

    Overview

    The run_pipeline function acts as the central execution hub for the entire system. It takes in a list of URLs (client web pages) and queries (search terms to align), and then orchestrates the full workflow:

    • Model loading
    • Content extraction and preprocessing
    • Embedding generation for both sections and queries
    • Query-to-content alignment
    • Cross-page clustering of sections
    • Internal linking recommendations
    • Semantic gap analysis

    The output is packaged into a structured dictionary that clients can use directly for SEO insights. This function ensures all components work together in a single, repeatable process.

    Key Line Explanation

    • Page-Level Processing Loop

    for url in urls:
        page = extract_structured_blocks(url)
        page = preprocess_page(page)
        embedded_data = embed_sections(model, page)

    For each URL:

    • Extracts structured sections from raw HTML.
    • Cleans and normalizes the extracted text.
    • Generates embeddings for all sections on the page.

    • Query Alignment Loop

    for query in queries:
        embedded_query = embed_query(model, query)
        page_copy = deepcopy(embedded_data)
        aligned_page = align_query_to_sections(query, embedded_query, page_copy)
        results.append(aligned_page)

    For each query:

    • Embeds the query into vector space.
    • Makes a deep copy of the embedded page to prevent overwriting.
    • Aligns the query with all page sections and appends the result.

    • Clustering Across Pages

    clustered_page = cluster_sections(results)

    Groups semantically similar sections across different URLs, revealing content relationships and coverage clusters.

    • Generate Deliverables

    internal_links = recommend_internal_links(results)
    semantic_gaps = semantic_gap_analysis(results)

    Produces two key insights:

    • Internal link opportunities between related sections across pages.
    • Semantic gap findings showing where queries lack adequate content coverage.

    • Bundle Final Output

    Packages all processed data into a single dictionary for downstream use.
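
    As a condensed sketch, the orchestration can be wired together as below, assuming the helper functions sketched in the earlier sections are in scope; embed_query is included since it is a one-line wrapper around encode_query.

    from copy import deepcopy
    from typing import Dict, List
    from sentence_transformers import SentenceTransformer

    def embed_query(model: SentenceTransformer, query: str) -> List[float]:
        return model.encode_query(query).tolist()  # query-side encoder

    def run_pipeline(urls: List[str], queries: List[str], hf_token: str) -> Dict:
        model = load_embedding_model(hf_token)           # sketched earlier
        results = []
        for url in urls:
            page = extract_structured_blocks(url)        # extract structured sections
            page = preprocess_page(page)                 # clean each section's text
            embedded_data = embed_sections(model, page)  # attach section embeddings
            for query in queries:
                embedded_query = embed_query(model, query)
                page_copy = deepcopy(embedded_data)      # keep per-query scores independent
                results.append(align_query_to_sections(query, embedded_query, page_copy))
        cluster_sections(results)                        # label sections in place
        return {"results": results,
                "internal_links": recommend_internal_links(results),
                "semantic_gaps": semantic_gap_analysis(results)}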

    Result Analysis and Explanation

    Overall Content Evaluation

    The analysis processed a large number of content sections, identifying distinct clusters of related content topics. Twenty-five clusters indicate a wide variety of themes are being covered, providing a clear view of how content is structured across the website. Clustering also highlights potential areas of content overlap or underrepresentation.

    Query Alignment and Relevance

    The alignment of content sections with the queries shows variation in relevance:

    • Some pages show medium-level matches, indicating that sections contain relevant technical or explanatory information directly related to the query topics.
    • Other pages show low-level matches, suggesting that content addresses the topic more generally or indirectly.

    This alignment highlights areas where content is effectively covering topics versus areas where expansion or refinement could increase relevance.

    Cluster Analysis and Content Structuring

    Clusters provide insight into content grouping and topical distribution:

    • Clusters allow identification of sections that share semantic similarity, enabling understanding of which content areas are addressing similar topics.
    • The analysis shows that certain clusters contain multiple medium-level relevance matches, suggesting overlapping content that may benefit from consolidation or restructuring.

    Internal Linking Opportunities

    Sixteen internal linking opportunities were identified, many with high similarity scores:

    • Sections with exact or near-exact content matches are flagged for cross-linking.
    • Linking high-similarity sections strengthens thematic connections and improves the structural coherence of content across pages.
    • Lower-similarity but still relevant sections can be linked to provide additional pathways and contextual reinforcement.

    Semantic Gaps and Coverage

    Two queries were analyzed for coverage:

    • Partially covered queries have medium relevance matches but do not fully satisfy the topic, indicating sections may need additional depth, examples, or technical explanation.
    • Well-covered queries have medium-to-high relevance matches, suggesting existing content addresses the topic but could still benefit from reinforcement to increase clarity and completeness.

    Similarity Score Thresholds

    Similarity scores were categorized into three levels: High, Medium, and Low. Thresholds used for these categories are:

    • High: ≥ 0.65
    • Medium: ≥ 0.50
    • Low: < 0.50

    These thresholds provide a reference for interpreting the strength of alignment between a query and page sections. Scores above the high threshold indicate strong semantic alignment, scores in the medium range indicate partial coverage, and scores below the medium threshold indicate weaker relevance. The thresholds are designed to balance sensitivity and selectivity, ensuring meaningful matches are highlighted without overestimating weaker associations.
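
    As a small illustration, the binning reduces to a three-way comparison; the threshold values mirror those listed above.

    def relevance_bin(score: float, high: float = 0.65, medium: float = 0.50) -> str:
        # Map a raw cosine similarity to the High/Medium/Low bins.
        if score >= high:
            return "High"
        if score >= medium:
            return "Medium"
        return "Low"

    print([relevance_bin(s) for s in (0.72, 0.55, 0.31)])  # ['High', 'Medium', 'Low']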

    Practical Interpretation

    • Sections with medium or high relevance indicate areas where content effectively addresses query topics.
    • Low-relevance sections identify opportunities for content expansion or refinement.
    • Cluster patterns and internal linking opportunities reveal structural connections between related sections, supporting content discoverability and cohesion.
    • Semantic gap identification highlights queries that require additional coverage or reinforcement, providing a roadmap for optimizing content comprehensiveness.

    Result Analysis and Explanation

    Overview of Content Coverage and Relevance

    The analysis evaluated multiple web pages against a set of targeted queries, breaking each page into hundreds of discrete sections for granular evaluation. These sections were then clustered to identify thematically similar content areas across the pages.

    • Section evaluation: By examining over two thousand content sections, it is possible to understand how content is distributed across topics. Clusters group similar content together, enabling a macro-level view of thematic coverage.
    • Content alignment: Each section is assessed for semantic similarity with the target queries, producing a spectrum of relevance. This reveals which areas of the website directly satisfy informational intent and which areas may only partially address it.
    • Strategic insight: Understanding section-level coverage enables prioritization for content expansion, restructuring, or linking to strengthen topical authority.

    Query-to-Content Alignment

    The system matches each query against all content sections, producing multiple ranked matches per query:

    • High relevance matches: These sections strongly align with the query, containing comprehensive coverage of the topic. High-scoring sections are ideal for forming the core content that answers a user’s intent.
    • Medium relevance matches: Sections in this range provide partial coverage or contextually relevant information but may omit certain critical aspects. These sections are valuable as supplementary content or starting points for expansion.
    • Low relevance matches: Sections with low similarity indicate minimal alignment with the query, highlighting areas that may need rewriting, addition of new content, or reorganization to improve topic coverage.

    This multi-level relevance scoring enables a detailed view of how well each page addresses different queries and where gaps or redundancies may exist.

    Threshold Score Values and Their Role

    To simplify interpretation, similarity scores are divided into bins: High, Medium, and Low. These thresholds are not raw data for decision-making but act as reference points to understand content alignment more clearly:

    • High threshold (e.g., 0.65): Sections scoring above this threshold are considered strong matches. These sections reliably answer the query or cover the topic comprehensively.
    • Medium threshold (e.g., 0.50–0.65): Sections in this range are moderately relevant. They may cover certain aspects of the topic but require additional content or contextual linking to fully satisfy the query.
    • Low threshold (below 0.50): Sections falling below the medium threshold have weak alignment, indicating limited coverage. These areas can be targeted for content creation, restructuring, or interlinking to improve relevance.

    By establishing these bins, the evaluation translates numeric similarity into meaningful qualitative insights. It allows content assessment without needing to interpret the raw similarity scores directly, making the results more actionable.

    Internal Linking Opportunities

    The analysis identifies potential internal links between pages based on semantic similarity:

    • Link quality: Recommendations are grouped into high, medium, and low similarity buckets. High-similarity links connect sections with strong topical relevance, enhancing user navigation and reinforcing semantic relationships.
    • Volume and distribution: A significant number of internal linking opportunities were identified, suggesting that multiple pages can be interconnected to improve both SEO value and content discoverability.
    • Strategic impact: Internal linking strengthens the semantic network of the website, supports better indexing by search engines, and guides users to complementary content, increasing engagement and retention.

    Semantic Gap Analysis

    Semantic gaps indicate queries that are not fully addressed by the existing content:

    • Covered topics: Queries with high relevance matches demonstrate comprehensive coverage, meaning existing content satisfies the information need effectively.
    • Gaps: Queries with lower relevance or missing strong matches highlight areas where additional content, optimization, or restructuring could be beneficial.
    • Actionable insight: Recognizing these gaps allows for targeted content development to close coverage gaps, ensuring that all strategic queries are addressed across the website.

    Visualization Insights

    Visual representations provide a holistic and intuitive understanding of content alignment and coverage:

    • Query–URL heatmap: Displays average similarity scores for all queries across all URLs, quickly highlighting pages that best address specific topics. It visually identifies both strong coverage and underperforming areas.
    • Section relevance distribution (by URL): Illustrates how many sections per page fall into high, medium, or low relevance bins. This reveals which pages contribute most to covering key queries and where content is weak or fragmented.
    • Section relevance distribution (by query): Shows the number of sections across all URLs for each query grouped by relevance. This highlights which queries are well-covered across the website and which need additional attention.
    • Internal linking distribution: Graphs the number of recommended links per page, categorized by similarity. It identifies the strongest linking candidates and provides a roadmap for enhancing content interconnectivity.
    • Semantic gap summary: Pie charts or proportion plots represent the fraction of queries that are covered versus those that show gaps. This provides an at-a-glance view of overall content completeness.

    Together, these visualizations convert complex data into actionable insights, making it easier to prioritize content optimization, linking, and coverage strategies.

    Key Takeaways

    • Content coverage: Multiple pages exhibit varying levels of relevance across different queries, highlighting strengths and areas for improvement.
    • Relevance thresholds: High, medium, and low bins simplify interpretation of semantic similarity, allowing prioritization of content refinement without needing to analyze raw scores.
    • Linking potential: A significant number of internal linking opportunities can enhance navigation, topical authority, and indexing efficiency.
    • Semantic gaps: Identifying partially or under-covered queries guides strategic content development, ensuring all targeted topics are addressed.
    • Visualization support: Graphical representations provide a clear, actionable view of content alignment, section relevance, interlinking, and coverage gaps, supporting data-driven optimization decisions.

    This detailed analysis supports precise, evidence-based strategies for enhancing content quality, improving internal linking, and closing semantic gaps to maximize the effectiveness of website content in achieving strategic goals.

    Q&A: Understanding Results and Recommended Actions

    How can the relevance scores guide content improvement?

    Relevance scores categorize content sections into high, medium, and low alignment with the targeted queries. High-scoring sections represent strong coverage and can serve as reference content. Medium-scoring sections highlight partially covered topics, indicating areas where content can be enriched with additional details, examples, or context. Low-scoring sections point to gaps or weak alignment, suggesting the need for rewriting, expansion, or reorganization to ensure all strategic queries are addressed comprehensively.

    How should internal linking recommendations be applied?

    Internal linking suggestions are based on semantic similarity between sections of different pages. High-similarity links are ideal for guiding users to complementary content, enhancing SEO value, and strengthening the topical structure. Medium-similarity links provide additional opportunities for contextual connections, while low-similarity links may be considered selectively. Implementing these links improves discoverability, reduces orphan pages, and creates a coherent content network that reinforces authority on key topics.

    What do semantic gaps indicate and how should they be addressed?

    Semantic gaps identify queries that are not fully covered by existing content. Covered queries indicate strong content alignment, whereas gaps highlight opportunities for new content creation or expansion of existing sections. Addressing these gaps ensures the website comprehensively meets user intent, strengthens topical authority, and minimizes missed opportunities for traffic and engagement. Prioritizing content development based on gap analysis maximizes impact.

    How can threshold score bins (High/Medium/Low) help prioritize actions?

    Threshold bins simplify understanding of the quantitative similarity scores:

    • High bin: Sections already perform well; these may only require minor updates to maintain relevance.
    • Medium bin: Sections need refinement or supplementation to fully satisfy queries.
    • Low bin: Sections require significant content updates or new sections to address the query effectively.

    This approach allows systematic prioritization, focusing efforts where they will deliver the greatest improvement in content coverage and alignment.

    How should the visualization insights be interpreted for actionable decisions?

    Visualizations summarize key aspects of content relevance, coverage, and linking potential:

    • Query–URL heatmap: Quickly identifies which pages best satisfy each query, highlighting underperforming areas.
    • Section relevance distribution plots: Show which pages or queries have the largest number of medium and low relevance sections, guiding content optimization.
    • Internal linking distribution: Helps prioritize pages that need linking adjustments to strengthen semantic connections.
    • Semantic gap summary: Clearly indicates where new content creation or restructuring is necessary.

    Together, these visuals support evidence-based decisions for content optimization, internal linking, and gap filling.

    How can these results benefit website strategy?

    The analysis provides a roadmap for enhancing content quality and visibility:

    • Identify strong-performing content to leverage or repurpose.
    • Highlight areas requiring content enrichment or new creation to close gaps.
    • Optimize internal linking to reinforce topic authority and guide user navigation.
    • Make data-driven decisions for SEO improvements, content expansion, and strategic topic coverage.

    What is the practical approach to implementing these insights?

    A stepwise approach is recommended:

    • Start by enhancing medium and low relevance sections based on the threshold bins.
    • Implement high-priority internal linking suggestions to strengthen site structure.
    • Address semantic gaps with targeted content creation or augmentation.
    • Use visualization trends to continuously monitor coverage and alignment across all queries and pages.

    Final Thoughts

    The analysis leverages Gemma embeddings to interpret complex discourse structures across multiple long-form documents, providing a detailed mapping between queries and relevant content sections. Using these embeddings, the system captures semantic nuances that go beyond keyword matching, allowing precise identification of how well each section aligns with specific user or SEO-focused queries.

    Internal linking recommendations, generated through Gemma-based similarity scoring, enhance the structural coherence of content, ensuring related topics are connected and easily discoverable. Semantic gaps highlight areas where queries or topics are underrepresented, providing actionable insights for content enrichment and expansion.

    Relevance thresholds categorized into high, medium, and low scores allow for prioritization of actions, guiding content refinement efficiently. Visualization modules further support understanding of coverage, alignment, and linking potential, making complex results actionable and intuitive. By showcasing Gemma embeddings in this workflow, the project demonstrates how advanced semantic representations can improve query–content alignment, detect topical gaps, and enhance internal linking strategies. This approach ensures systematic evaluation of long-form content, enabling data-driven content strategy and optimization at scale.


