This project focuses on enhancing the alignment between user search queries and webpage content through advanced semantic matching techniques powered by the Granite model. The objective is to improve content relevance, identify optimization opportunities, and support SEO strategies by providing actionable insights derived from data-driven analyses.
The approach involves two primary components:
- Embedding-Based Retrieval: Webpage sections are transformed into dense vector representations using the Granite embedding model. User queries are similarly embedded, and cosine similarity is calculated to identify semantically relevant content sections.
- Reranking with a Relevance Model: The top-k sections retrieved are further ranked using the Granite reranker model, which assigns scores indicating the relevance of each section to the query.
The system processes multiple URLs and queries, generating deliverables such as average embedding scores, intent coverage metrics, and ranked lists of top sections per query. These outputs are designed to assist in evaluating content effectiveness and guiding optimization efforts.
From a technical perspective, the system comprises modular functions for embedding generation, reranking, deliverable computation, and visualization. The results are presented with clear, structured outputs and visualizations to facilitate informed decision-making.
Project Purpose
The project aims to improve the accuracy and efficiency of search query alignment with webpage content, enabling a precise assessment of content relevance across complex, long-form documents. By leveraging advanced embedding and reranking techniques, it provides quantitative insights into how well webpage sections satisfy user search intent, highlights content gaps, and identifies areas for optimization.
The system supports detailed analysis of both individual sections and the overall page, producing structured deliverables such as relevance scores, intent coverage metrics, and ranked lists of content segments. These outputs enable the identification of high-performing content, potential redundancies, and sections that may require enhancement to improve SEO performance.
The inclusion of the Granite embedding and reranker models ensures that semantic understanding is state-of-the-art, allowing for sophisticated matching beyond simple keyword overlap. The project, therefore, not only demonstrates technical capabilities but also delivers practical, actionable insights for SEO strategy refinement.
Project’s Key Topics: Explanation and Understanding
Granite Model Overview
The Granite family comprises enterprise-grade foundation models developed by IBM, designed to handle a wide array of natural language processing (NLP) tasks with high efficiency and accuracy. Recent Granite language models use a hybrid architecture that combines elements of Mamba-2 and Transformer models, allowing them to process long-context inputs effectively while maintaining performance across various domains; the retrieval models used in this project (the embedding model and reranker) are ModernBERT-based encoders from Granite's retrieval family.
Key characteristics of the Granite model include:
- Hybrid Architecture: Integrates Mamba-2 blocks with Transformer layers to optimize memory usage and computational efficiency.
- Extended Context Length: Supports context lengths up to 8,192 tokens, enabling the processing of lengthy documents without significant performance degradation.
- Enterprise-Grade Performance: Achieves state-of-the-art results on multiple NLP benchmarks, including BEIR, MIRACL, and MLDR, demonstrating its capability in real-world applications.
The Granite model’s design emphasizes scalability, adaptability, and robustness, making it suitable for complex tasks such as semantic search, content alignment, and information retrieval in SEO contexts.
Embedding-Based Retrieval
Embedding-based retrieval is a technique that transforms text into dense vector representations, capturing semantic information that traditional keyword-based methods may overlook. This approach involves:
- Text Representation: Converting both queries and content into fixed-size vectors using embedding models.
- Similarity Measurement: Calculating the cosine similarity between query and content vectors to assess relevance.
- Ranking: Sorting content based on similarity scores to identify the most relevant sections.
The Granite embedding model, specifically the granite-embedding-english-r2, is utilized in this project to generate high-quality embeddings. This model is based on the ModernBERT architecture and has been fine-tuned for enterprise-scale retrieval tasks, offering improved performance over its predecessors.
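To make this flow concrete, here is a minimal, self-contained sketch of embedding-based retrieval with the sentence_transformers API. The model name is the one used in this project; the sample sections and query are illustrative only.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the Granite embedding model (downloaded from Hugging Face on first use).
model = SentenceTransformer("ibm-granite/granite-embedding-english-r2")

sections = [
    "Use the Link HTTP header to declare a canonical URL for PDFs.",
    "Our newsletter covers weekly SEO tips and industry news.",
]
query = "canonical URLs for non-HTML documents"

# normalize_embeddings=True yields unit vectors, so dot product == cosine similarity.
section_vecs = model.encode(sections, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)

scores = np.dot(query_vec, section_vecs.T)[0]  # shape: (num_sections,)
for idx in np.argsort(scores)[::-1]:           # best match first
    print(f"{scores[idx]:.3f}  {sections[idx]}")
```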
Reranking with Cross-Encoder
Reranking is a process that refines the initial retrieval results by re-evaluating the relevance of content using a more sophisticated model. The steps involved are:
- Initial Retrieval: Using embedding-based retrieval to fetch top-k candidate sections.
- Cross-Encoder Evaluation: A cross-encoder model, such as the Granite reranker, is employed to jointly encode the query and each candidate section, producing a relevance score.
- Reordering: Candidates are re-ranked based on their relevance scores to present the most pertinent content.
The Granite reranker model leverages a list-wise ranking objective (PListMLE) to optimize the ordering of content, ensuring that the most relevant sections are prioritized.
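A minimal sketch of the reranking stage, using the CrossEncoder interface from sentence_transformers; the candidate texts are illustrative only.

```python
from sentence_transformers import CrossEncoder

# Load the Granite reranker: a cross-encoder that scores query-passage pairs jointly.
reranker = CrossEncoder("ibm-granite/granite-embedding-reranker-english-r2")

query = "canonical URLs for non-HTML documents"
candidates = [
    "Use the Link HTTP header to declare a canonical URL for PDFs.",
    "Our newsletter covers weekly SEO tips and industry news.",
]

# rank() scores every (query, candidate) pair and returns them best-first.
for hit in reranker.rank(query, candidates, return_documents=True):
    print(f"{hit['score']:.3f}  {hit['text']}")
```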
Section-Level Analysis
Section-level analysis involves examining individual segments of content to assess their relevance and alignment with user queries. This process includes:
- Segmentation: Dividing content into logical sections based on structure or semantics.
- Embedding Generation: Creating embeddings for each section using the Granite embedding model.
- Relevance Scoring: Applying the reranker model to evaluate the relevance of each section to the query.
- Ranking: Ordering sections based on their relevance scores to identify the top-performing segments.
This detailed analysis allows for a granular understanding of how well each part of the content addresses user intent, facilitating targeted optimization efforts.
Q&A: Project Value and Importance
What makes this project relevant for SEO optimization?
This project leverages advanced embedding and reranking techniques to align webpage content with search queries. It enables precise identification of the most relevant sections within content, ensuring that SEO strategies can target high-value areas efficiently. By focusing on content-level alignment rather than just page-level analysis, the project addresses nuanced SEO challenges such as partial content relevance, semantic gaps, and topic coverage.
Why is the Granite model used in this project?
The Granite model provides state-of-the-art embeddings optimized for semantic understanding in English. It excels in capturing contextual similarity between queries and webpage content, which is crucial for SEO where exact keyword matches often miss semantic intent. Additionally, Granite’s built-in reranking capabilities allow sections to be scored based on relevance, further improving the precision of content-query alignment.
What is the advantage of combining embeddings with reranking?
Embeddings provide a first-pass semantic similarity measure between queries and content sections. However, embeddings alone cannot fully capture nuanced relevance differences for high-stakes SEO decisions. The reranking step evaluates top candidate sections using a fine-grained model, refining the order of relevance. This two-step approach ensures both efficiency (through embeddings) and accuracy (through reranking), balancing speed and quality in practical SEO analysis.
How does this project improve the understanding of content alignment?
By segmenting pages into structured sections and scoring each section for semantic alignment with queries, the project identifies precise areas of content strength and potential gaps. This enables strategic SEO interventions such as improving underperforming sections, optimizing content placement, and ensuring comprehensive coverage of key topics.
What are the key features of this approach?
- Granite embeddings: Capture semantic similarity with high precision.
- Granite reranker: Refines the ranking of top candidate sections for improved relevance.
- Section-level granularity: Supports fine-grained analysis beyond page-level insights.
- Scalable multi-query analysis: Handles multiple queries efficiently across multiple URLs.
- SEO-focused design: Tailored to address practical content optimization and coverage gaps.
Why is this approach important compared to traditional keyword-based analysis?
Traditional keyword-based SEO focuses on exact keyword matches, which often misses semantic intent and context. This project uses embeddings and reranking to understand meaning, not just word presence. It can capture latent relevance, identify semantically related content, and prioritize sections that are contextually most important, making SEO strategies more precise and data-driven.
How does this enhance decision-making in SEO strategy?
This methodology provides actionable intelligence at the content section level, helping prioritize optimization efforts based on semantic relevance and potential impact. Decisions such as content rewrites, internal linking improvements, and query targeting can be guided by detailed semantic scores rather than generic page-level metrics, increasing efficiency and ROI of SEO actions.
Libraries Used
time
The time library in Python provides functions for measuring time, pausing execution, and tracking the duration of operations. It allows precise measurement of how long specific tasks take, which is crucial for optimizing performance in data processing pipelines.
In this project, time is used to measure the duration of page fetching, embedding computation, and reranking operations. Tracking execution time ensures efficient handling of multiple URLs and queries, enabling performance benchmarking and optimization for larger datasets.
re
The re library provides support for regular expressions in Python, enabling pattern matching, text search, and string manipulation. It is essential for extracting structured information from unstructured text, cleaning HTML content, and standardizing URLs or section titles.
Within this project, re is used to clean webpage content, remove unnecessary HTML tags or prefixes from URLs, and assist in trimming strings for plotting labels. This ensures consistent processing of text for embeddings, reranking, and visualization, reducing errors from inconsistent formatting.
html (as _html)
The html library allows safe handling of HTML-encoded characters and strings. It provides functions to escape or unescape HTML entities, which is important when processing web page content that may contain special characters or encoding.
In this project, _html is used to clean the fetched content, converting HTML-encoded characters to standard text. This ensures accurate embeddings and reranking, as semantic models rely on clean, human-readable text for proper context understanding.
hashlib
The hashlib library offers cryptographic hash functions, such as MD5 and SHA, to generate unique identifiers from strings. Hashing is useful for caching, deduplication, and ensuring consistent IDs for content sections.
Here, hashlib is used to generate unique identifiers for each webpage section. This allows storing embeddings, rerank scores, and deliverables in a structured format, avoiding collisions and ensuring traceable, reproducible processing across multiple pages and queries.
unicodedata
The unicodedata library provides access to Unicode character properties and normalization. It ensures that text with accents, diacritics, or special symbols is processed in a consistent manner.
In this project, unicodedata is used to normalize webpage content before computing embeddings. This step ensures that semantically identical text with different Unicode representations is treated uniformly, preventing misalignment in semantic similarity calculations.
logging
The logging library is used for structured logging of information, warnings, and errors during program execution. Logging helps monitor the pipeline, identify issues, and provide transparent feedback during long-running processes.
Here, logging is configured to debug level to capture detailed information about content fetching, embedding, and reranking. It aids in diagnosing issues in multi-URL processing, tracking execution flow, and ensuring reliability of the pipeline.
warnings
The warnings library is used to control the display of runtime warnings generated by other libraries or custom code. Suppressing irrelevant warnings ensures clarity in logs while highlighting only important messages.
In this project, warnings is employed to suppress minor warnings from libraries like torch and transformers that do not impact execution. This keeps the log output clean, focusing attention on critical performance or data issues.
requests
requests is a widely used HTTP library for Python that enables sending HTTP/HTTPS requests and handling responses easily. It simplifies fetching web content, managing headers, cookies, and handling timeouts.
This project uses requests to fetch webpage content from multiple URLs. It provides robust error handling for network requests and ensures reliable retrieval of HTML content for further processing and semantic analysis.
typing
The typing library provides type hints for Python functions and variables, improving code readability, maintainability, and reducing runtime errors. Type hints help document expected input and output types.
In this project, typing is used to annotate function parameters and return values for clarity, particularly for complex data structures like lists of dictionaries representing pages, sections, queries, and embeddings. This ensures that the code is maintainable and understandable in a production setting.
bs4 (BeautifulSoup)
BeautifulSoup is a library for parsing HTML and XML documents. It allows easy navigation, searching, and modification of the parse tree, which is essential for extracting structured content from web pages.
Here, BeautifulSoup is used to parse HTML content of webpages, extract sections, headings, and paragraph text. It enables precise segmentation of pages into subsections, which is crucial for computing embeddings and reranking relevance for SEO analysis.
numpy
NumPy is a fundamental library for numerical computing in Python, offering efficient array operations, linear algebra functions, and statistical utilities. It is widely used in data science and machine learning workflows.
In this project, NumPy is used to store and manipulate embeddings, compute similarity scores, and handle numeric operations like averages and thresholds. It ensures high performance when dealing with large embedding matrices across multiple pages and queries.
sentence_transformers
The sentence_transformers library provides pre-trained transformer models for computing sentence and paragraph embeddings. It enables semantic similarity computation and retrieval tasks efficiently.
In this project, SentenceTransformer generates embeddings for webpage sections, while CrossEncoder (Granite reranker) refines relevance scores. util provides functions like cosine similarity to compare embeddings, forming the backbone of semantic content alignment for SEO.
torch
PyTorch is a deep learning framework providing tensor computations, GPU acceleration, and neural network modules. It powers modern NLP and embedding models efficiently.
In this project, torch underlies the sentence_transformers and Granite reranker models, enabling fast GPU-based computation of embeddings and reranking scores. It ensures scalable processing for multiple URLs and queries in near real-time.
transformers (logging)
The transformers library provides state-of-the-art NLP models and tools. Its logging utility allows control of verbosity and progress bars for cleaner output during model inference.
Here, transformers.utils.logging is used to suppress unnecessary warnings and progress output while running transformer models. This keeps logs clean and focused on critical debug information for embedding and reranking operations.
pandas
Pandas is a data manipulation library providing DataFrame structures for efficient tabular data processing. It is widely used for analysis, visualization, and data organization.
In this project, Pandas is used to organize embeddings, rerank scores, and section metadata into structured DataFrames. This facilitates visualization, filtering, aggregation, and plotting for clear interpretation of SEO content analysis.
matplotlib.pyplot
Matplotlib is a core plotting library for Python that enables creation of static, high-quality visualizations. It offers fine-grained control over plots and figure aesthetics.
Here, matplotlib.pyplot is used to create line plots, area plots, and boxplots for visualizing embeddings, rerank scores, and section distributions. It allows detailed inspection of content relevance patterns across queries and URLs.
seaborn
Seaborn is a statistical data visualization library built on top of Matplotlib. It provides aesthetically pleasing plots with built-in support for grouping, color palettes, and statistical annotations.
In this project, Seaborn is used for advanced visualizations like boxplots and line plots of rerank scores and distributions. It enhances readability and interpretability of content alignment insights, making results actionable and client-friendly.
Function fetch_html
Overview
The fetch_html function is responsible for retrieving the raw HTML content from a given URL. It implements a robust approach that handles common encoding issues, retries multiple encodings, and introduces a polite delay between requests. This ensures that fetched content is usable even when web pages contain uncommon or inconsistent encodings.
This function is critical in this project as it forms the foundation for downstream processing. Reliable HTML retrieval is essential for accurate content extraction, semantic embedding, and relevance scoring.
Key Code Explanations
· time.sleep(delay): Introduces a pause before making the HTTP request, following polite crawling practices.
· requests.get(url, timeout=request_timeout, headers={…}): Sends an HTTP GET request with a custom User-Agent and timeout to prevent blocking or rate-limiting.
· encodings_to_try = [response.apparent_encoding, 'utf-8', 'iso-8859-1', 'cp1252']: Ensures the function attempts multiple encodings to decode the page text correctly.
· if len(text.strip()) > 50: return text: Confirms that retrieved content is meaningful before returning it.
· logging.warning / logging.error: Captures warnings or errors during fetching for monitoring and debugging.
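Putting these pieces together, a plausible reconstruction of fetch_html is sketched below; the User-Agent string and the default timeout and delay values are assumptions.

```python
import time
import logging
import requests

def fetch_html(url: str, request_timeout: int = 10, delay: float = 1.0) -> str:
    """Fetch raw HTML, trying several encodings; return "" on failure."""
    time.sleep(delay)  # polite crawling: pause before each request
    try:
        response = requests.get(
            url,
            timeout=request_timeout,
            headers={"User-Agent": "Mozilla/5.0 (compatible; SEO-Analyzer/1.0)"},
        )
        response.raise_for_status()
        # Try the detected encoding first, then common fallbacks.
        for enc in [response.apparent_encoding, "utf-8", "iso-8859-1", "cp1252"]:
            if not enc:
                continue
            try:
                text = response.content.decode(enc)
            except (UnicodeDecodeError, LookupError):
                continue
            if len(text.strip()) > 50:  # sanity check: content is non-trivial
                return text
        logging.warning("No usable encoding found for %s", url)
    except requests.RequestException as exc:
        logging.error("Failed to fetch %s: %s", url, exc)
    return ""
```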
Function clean_html
Overview
The clean_html function processes raw HTML content and removes unwanted tags that do not contribute to meaningful page content. Tags like <script>, <style>, <nav>, and <footer> are eliminated to retain only relevant text for semantic analysis.
This cleaning step is essential because raw HTML often contains noise that can distort embeddings and content similarity metrics.
Key Code Explanations
· soup = BeautifulSoup(html_content, "lxml"): Parses the HTML content into a structured object for easy traversal and manipulation.
· for tag in soup([…]): tag.decompose(): Iterates over tags considered non-essential and removes them from the parse tree to produce a cleaner document structure.
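A minimal sketch of clean_html consistent with the description above; the exact tag list beyond <script>, <style>, <nav>, and <footer> is an assumption.

```python
from bs4 import BeautifulSoup

def clean_html(html_content: str) -> BeautifulSoup:
    """Parse HTML and strip tags that carry no meaningful page content."""
    soup = BeautifulSoup(html_content, "lxml")
    # soup([...]) is shorthand for soup.find_all([...]).
    for tag in soup(["script", "style", "nav", "footer", "header", "form", "aside"]):
        tag.decompose()  # remove the tag and its entire subtree from the parse tree
    return soup
```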
Function _clean_text
Overview
The _clean_text function normalizes and cleans extracted inline text. It converts HTML entities to characters, applies Unicode normalization, and collapses excess whitespace.
Accurate text normalization ensures consistent input for embeddings and semantic similarity comparisons, preventing mismatches due to trivial formatting differences.
Key Code Explanations
· _html.unescape(text): Converts HTML entities to standard characters.
· unicodedata.normalize("NFKC", text): Standardizes Unicode representations to avoid inconsistencies.
· re.sub(r"\s+", " ", text): Collapses multiple consecutive whitespace characters into a single space.
· text.strip(): Removes leading and trailing whitespace for clean, uniform text.
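Since every step of _clean_text is spelled out above, the full helper is short; this sketch simply assembles those four operations.

```python
import html as _html
import re
import unicodedata

def _clean_text(text: str) -> str:
    """Normalize inline text: entities, Unicode form, and whitespace."""
    text = _html.unescape(text)                 # &amp; -> &, &quot; -> ", etc.
    text = unicodedata.normalize("NFKC", text)  # unify Unicode representations
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()
```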
Function extract_structured_blocks
Overview
The extract_structured_blocks function extracts structured content from a webpage, maintaining a hierarchical structure: sections -> subsections -> content blocks. Section headers (H1/H2) are treated as top-level sections, while H3/H4 define subsections. Paragraphs, list items, and blockquotes are collected as content blocks.
This function is crucial for segmenting web pages into analyzable units, which are later embedded and reranked for SEO relevance.
Key Code Explanations
· h1_tags = soup.find_all("h1") and page_title = _clean_text(…): Determines the main page title from H1 tags or the HTML <title>.
· for el in soup.find_all([…]): Iterates through all relevant tags to construct sections, subsections, and blocks.
· if len(text) >= min_block_chars: Ensures only meaningful content is included in the blocks.
· subsection["blocks"].append({…}): Stores each content block with block_id, text, tag type, and heading chain for hierarchical context.
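A simplified sketch of extract_structured_blocks under the hierarchy described above, assuming the _clean_text helper sketched earlier; the real implementation likely handles additional edge cases (e.g., content appearing before the first heading).

```python
def extract_structured_blocks(soup, min_block_chars: int = 40):
    """Group page content into sections (H1/H2) and subsections (H3/H4)."""
    h1_tags = soup.find_all("h1")
    if h1_tags:
        page_title = _clean_text(h1_tags[0].get_text())
    else:
        page_title = _clean_text(soup.title.get_text()) if soup.title else ""

    sections, section, subsection = [], None, None
    for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "li", "blockquote"]):
        text = _clean_text(el.get_text())
        if el.name in ("h1", "h2"):                # new top-level section
            subsection = {"title": text, "blocks": []}
            section = {"title": text, "subsections": [subsection]}
            sections.append(section)
        elif el.name in ("h3", "h4") and section:  # new subsection under current section
            subsection = {"title": text, "blocks": []}
            section["subsections"].append(subsection)
        elif subsection and len(text) >= min_block_chars:
            subsection["blocks"].append({
                "block_id": len(subsection["blocks"]),
                "text": text,
                "tag": el.name,
                "heading_chain": [section["title"], subsection["title"]],
            })
    return sections, page_title
```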
Function extract_page
Overview
The extract_page function acts as a top-level wrapper for full-page extraction. It combines HTML fetching, cleaning, and structured block extraction into a single pipeline, returning a structured dictionary containing the page URL, title, and sections.
This abstraction simplifies downstream workflows, providing a consistent structure for embedding, reranking, and visualization.
Key Code Explanations
· html_content = fetch_html(url, request_timeout, delay): Fetches the raw HTML using the robust retrieval function.
· soup = clean_html(html_content): Cleans the HTML to remove noise before extraction.
· sections, page_title = extract_structured_blocks(soup, min_block_chars): Segments the cleaned HTML into sections, subsections, and blocks.
· return {"url": url, "title": page_title, "sections": sections}: Provides a structured dictionary containing all extracted information, ready for embedding and analysis.
· logging.error(…) and note: "Parsing failed": Logs and returns informative errors when extraction fails.
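A minimal sketch of the extract_page wrapper, assuming the three helpers sketched earlier; the default parameter values and the exact shape of the failure dictionary are assumptions.

```python
import logging

def extract_page(url: str, request_timeout: int = 10, delay: float = 1.0,
                 min_block_chars: int = 40) -> dict:
    """Fetch, clean, and segment a page into a structured dictionary."""
    try:
        html_content = fetch_html(url, request_timeout, delay)
        if not html_content:
            return {"url": url, "title": "", "sections": [], "note": "Fetch failed"}
        soup = clean_html(html_content)
        sections, page_title = extract_structured_blocks(soup, min_block_chars)
        return {"url": url, "title": page_title, "sections": sections}
    except Exception as exc:
        logging.error("Extraction failed for %s: %s", url, exc)
        return {"url": url, "title": "", "sections": [], "note": "Parsing failed"}
```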
Function preprocess_text
Overview
The preprocess_text function cleans and normalizes raw text before it is sent to embedding or reranking models. It removes HTML entities, normalizes Unicode characters, applies common character substitutions, filters boilerplate content, and enforces token/word count constraints.
This preprocessing ensures that only meaningful, high-quality text is embedded or used for semantic ranking, reducing noise from repetitive, short, or promotional content that could skew results.
Key Code Explanations
· _html.unescape(text) and unicodedata.normalize("NFKC", text): Convert HTML entities and normalize Unicode characters to standard forms.
· substitutions dictionary: Handles common text artifacts like non-breaking spaces, special quotation marks, and bullets.
· re.sub(r"http\S+|www\.\S+", "", text): Removes URLs to prevent embedding noise.
· boilerplate_terms: Filters common non-informative phrases such as "read more", "subscribe", and "privacy policy". Extra terms can be passed via boilerplate_extra.
· Word count and diversity checks (min_word_count, max_token_length, len(set(words)) < 4): Ensures text is neither too short nor repetitive.
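A condensed sketch of preprocess_text covering the filters named above; the substitution table, boilerplate list, and default thresholds shown here are assumptions.

```python
import html as _html
import re
import unicodedata

def preprocess_text(text, min_word_count=5, max_token_length=512, boilerplate_extra=None):
    """Clean one content block; return "" if it fails the quality filters."""
    text = _html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    # Common artifact substitutions: non-breaking space, smart quotes, bullet.
    for src, dst in {"\u00a0": " ", "\u201c": '"', "\u201d": '"', "\u2022": "-"}.items():
        text = text.replace(src, dst)
    text = re.sub(r"http\S+|www\.\S+", "", text)  # URLs add embedding noise
    text = re.sub(r"\s+", " ", text).strip()

    boilerplate_terms = ["read more", "subscribe", "privacy policy"]
    boilerplate_terms += list(boilerplate_extra or [])
    if any(term in text.lower() for term in boilerplate_terms):
        return ""

    words = text.split()
    # Reject blocks that are too short, too long, or too repetitive to embed usefully.
    if not (min_word_count <= len(words) <= max_token_length) or len(set(words)) < 4:
        return ""
    return text
```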
Function chunk_text
Overview
The chunk_text function splits long text into smaller, overlapping chunks. This is necessary because embedding models have token/character limits, and overlapping ensures context is preserved across adjacent chunks.
By producing manageable chunks, this function enables consistent embeddings for long-form content and prevents truncation of important semantic information.
Key Code Explanations
· if not text or len(text) <= max_chars: return [text]: Returns the text as-is if it fits within the size limit.
· for word in words: loop: Iteratively accumulates words until the character limit is reached.
· current = current[-overlap:] if overlap else []: Maintains overlap between chunks to preserve contextual continuity.
· chunks.append(" ".join(current)): Finalizes each chunk for downstream embedding.
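A minimal sketch of chunk_text consistent with the lines quoted above; note that overlap is measured in words (current[-overlap:]), and the default limits are assumptions.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 30):
    """Split text into word-boundary chunks of at most ~max_chars, with overlap."""
    if not text or len(text) <= max_chars:
        return [text]
    chunks, current, current_len = [], [], 0
    for word in text.split():
        if current_len + len(word) + 1 > max_chars and current:
            chunks.append(" ".join(current))
            # Keep the last `overlap` words so adjacent chunks share context.
            current = current[-overlap:] if overlap else []
            current_len = sum(len(w) + 1 for w in current)
        current.append(word)
        current_len += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```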
Function preprocess_page
Overview
The preprocess_page function applies text preprocessing and chunking at the subsection level of a structured page. It aggregates cleaned blocks into a merged_text field and generates unique IDs for each chunk, preparing the page content for embedding and semantic ranking.
This function is essential for structuring web page content hierarchically while ensuring that each subsection is represented by meaningful, manageable text for the Granite model.
Key Code Explanations
· Iterates through sections -> subsections -> blocks to process each text block.
· cleaned = preprocess_text(…): Cleans each block before merging. Blocks failing filters are skipped.
· merged_text = " ".join(merged_pieces): Combines cleaned blocks to form subsection-level text.
· chunks = chunk_text(merged_text, max_chars, overlap): Splits long merged text into smaller overlapping segments for embedding.
· sub_id = hashlib.md5(…): Generates a unique identifier for each chunk using URL, section/subsection titles, and chunk index, ensuring traceability and uniqueness.
· Returns a dictionary with the page URL, title, and all cleaned/chunked subsections ready for embedding.
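A hypothetical illustration of the chunk-ID scheme: the exact fields and their order inside the hash input are assumptions, but any stable combination of URL, titles, and chunk index yields the same traceability.

```python
import hashlib

# Hypothetical field order; any stable combination gives reproducible IDs.
url = "https://example.com/guide"
section_title, subsection_title, chunk_index = "HTTP Headers", "Canonical Tags", 0

raw = f"{url}|{section_title}|{subsection_title}|{chunk_index}"
sub_id = hashlib.md5(raw.encode("utf-8")).hexdigest()
print(sub_id)  # deterministic across runs, unique per chunk
```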
Function load_embedding_model
Overview
The load_embedding_model function initializes a SentenceTransformer model for generating embeddings, specifically the Granite model in this project. It handles device selection automatically, choosing GPU if available or defaulting to CPU, ensuring efficient model execution.
This function is essential for setting up the embedding pipeline and ensuring that text chunks generated from web pages are encoded into meaningful vector representations suitable for semantic similarity and ranking.
Key Code Explanations
· device = "cuda" if torch.cuda.is_available() else "cpu": Automatically detects GPU availability for faster computation.
· SentenceTransformer(model_name, device=device): Loads the specified embedding model on the selected device. In this project, the Granite embedding model (ibm-granite/granite-embedding-english-r2) is used to generate high-quality semantic vectors for SEO content.
This function abstracts model initialization and device management, making the embedding step modular and ready for integration into the full content processing pipeline.
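Given the two lines quoted above, the whole loader is essentially:

```python
import torch
from sentence_transformers import SentenceTransformer

def load_embedding_model(model_name="ibm-granite/granite-embedding-english-r2"):
    """Load the embedding model on GPU when available, otherwise CPU."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return SentenceTransformer(model_name, device=device)
```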
Model: ibm-granite/granite-embedding-english-r2
Overview
The ibm-granite/granite-embedding-english-r2 model is a high-capacity transformer-based embedding model designed to convert textual content into dense numerical vectors, also known as embeddings. These embeddings capture semantic meaning, context, and relationships between words, phrases, and sentences. The model is optimized for generating embeddings suitable for semantic search, content matching, and ranking tasks, making it highly relevant for SEO-oriented analysis of webpage content.
Architecture and Internal Structure
The model is built on the SentenceTransformer framework, which wraps a transformer backbone and pooling layer to produce fixed-size embeddings from variable-length text. Key components include:
1. Transformer Backbone (ModernBertModel):
- Processes input text using self-attention mechanisms, capturing both local and long-range dependencies between words.
- Supports long sequences (up to 8,192 tokens), which allows capturing the context of long-form content without truncation.
- Generates contextualized token embeddings that encode semantic information at the word level.
2. Pooling Layer:
- Aggregates token-level embeddings from the transformer into a fixed-size vector suitable for downstream tasks.
- In this model, pooling is performed using the [CLS] token (pooling_mode_cls_token=True), a standard approach to summarizing the meaning of the entire sequence.
- Other pooling modes, such as mean or max over tokens, are disabled, ensuring the embedding represents the overall semantic intent of the text rather than an average of its subparts.
- Includes prompt embeddings (include_prompt=True) when prompt-based inputs are used, allowing some adaptability for task-specific contexts.
3. Embedding Dimension:
- Produces vectors of 768 dimensions, balancing representational power with computational efficiency.
How It Works
The model converts textual input into semantic vectors in a two-step process:
1. Text Encoding:
- Input text is tokenized and passed through the transformer.
- Contextual embeddings are generated for each token, where each embedding captures both the token’s meaning and its context within the sentence or document.
2. Vector Pooling:
- The [CLS] token embedding is extracted to produce a single vector representing the entire text segment.
- The resulting 768-dimensional vector is normalized for downstream similarity computations (e.g., cosine similarity).
Usage in This Project
In this project, the Granite embedding model is used to:
· Generate embeddings for all content subsections: Each section of a webpage is converted into a dense vector representation, capturing its semantic meaning.
· Generate embeddings for queries: SEO search queries are embedded in the same vector space as page content.
· Compute semantic similarity: Using cosine similarity, the model identifies which subsections are most relevant to each query.
· Support ranking and content alignment: The embedding scores are later used as input for reranking and content coverage calculations.
Why This Model Was Chosen
· High-quality semantic embeddings: The model provides precise semantic representations that improve content-query alignment.
· Handles long sequences: With a max sequence length of 8192, it can encode entire subsections or long-form content without losing context.
· Pretrained for English: Optimized for English web content, which matches the project’s domain.
· Integration with SentenceTransformer: Simplifies embedding generation with batch processing and normalization, making it practical for large-scale SEO content analysis.
Function generate_embeddings
Overview
The generate_embeddings function transforms a list of text strings into vector representations using a SentenceTransformer model. Each text is encoded into a fixed-size numeric vector that captures semantic meaning. Normalized embeddings (unit vectors) are returned to facilitate cosine similarity calculations, which are widely used for measuring semantic similarity between content and queries.
This function is central to the content relevance pipeline, as it converts preprocessed text chunks into a form suitable for similarity scoring and ranking.
Key Code Explanations
· model.encode(texts, batch_size=batch_size, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=True):
o batch_size=batch_size: Efficiently processes texts in batches to manage memory usage.
o show_progress_bar=False: Suppresses the progress bar for cleaner logging in production or notebooks.
o convert_to_numpy=True: Returns embeddings as a NumPy array for downstream computations.
o normalize_embeddings=True: Converts embeddings to unit vectors, ensuring that cosine similarity values remain consistent and interpretable.
· model.get_sentence_embedding_dimension(): Ensures the fallback zero array matches the dimensionality of the embedding space.
This function standardizes the embedding process and ensures robustness, allowing the pipeline to continue even if individual texts cause encoding issues.
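A sketch of generate_embeddings based on the calls described above; the exact exception handling is an assumption, but the zero-array fallback matches the description.

```python
import numpy as np

def generate_embeddings(texts, model, batch_size=32):
    """Encode texts into L2-normalized vectors; fall back to zeros on failure."""
    if not texts:
        return np.zeros((0, model.get_sentence_embedding_dimension()))
    try:
        return model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=False,
            convert_to_numpy=True,
            normalize_embeddings=True,  # unit vectors: dot product == cosine
        )
    except Exception:
        # Keep the pipeline alive: zero vectors of the correct dimensionality.
        return np.zeros((len(texts), model.get_sentence_embedding_dimension()))
```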
Function compute_similarity
Overview
The compute_similarity function calculates the cosine similarity between query embeddings and content embeddings. Cosine similarity measures the semantic closeness of two vectors, producing a value between -1 and 1. In this context, higher similarity values indicate that a content section is more relevant to a given query. The function returns a 2D array where each row corresponds to a query and each column corresponds to a content section.
This step is critical for identifying which parts of a webpage are most relevant to specific queries, forming the foundation for ranking and SEO-focused analysis.
Key Code Explanations
· np.dot(query_embeddings, content_embeddings.T):
o Computes the dot product between each query vector and each content vector.
o Since embeddings are normalized unit vectors (from generate_embeddings), the dot product is equivalent to cosine similarity.
o Produces a 2D array of shape (num_queries, num_sections), enabling fast vectorized computation.
· Output can be directly used for ranking or thresholding, forming the basis of subsequent scoring logic in the pipeline.
This function is a lightweight, vectorized way to quantify relevance between queries and content, enabling both embedding-based ranking and downstream reranking operations.
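The function itself reduces to a single vectorized operation:

```python
import numpy as np

def compute_similarity(query_embeddings, content_embeddings):
    """Cosine similarity matrix of shape (num_queries, num_sections).

    Both inputs are assumed L2-normalized (see generate_embeddings), so the
    dot product equals cosine similarity.
    """
    return np.dot(query_embeddings, content_embeddings.T)
```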
Function embed_and_retrieve
Overview
The embed_and_retrieve function generates embeddings for both the page subsections and the input queries, then retrieves the top relevant subsections for each query based on embedding similarity. It integrates seamlessly with the pipeline for embedding-based retrieval, storing relevance scores for each subsection and producing structured top-k results for downstream reranking or analysis.
The function serves as a core step in semantic SEO alignment, enabling the system to quantify how well page content matches user queries at a granular, subsection level.
Key Code Explanations
· subsection_embeddings = generate_embeddings(subsection_texts, embedding_model)
- Converts all subsection merged_text into dense vector embeddings.
- Embeddings are normalized unit vectors, so dot product is equivalent to cosine similarity.
· query_embeddings = generate_embeddings(queries, embedding_model)
- Generates embeddings for all input queries. Each query is mapped to a vector in the same embedding space as subsections.
· sim_matrix = compute_similarity(query_embeddings, subsection_embeddings)
- Computes cosine similarity between each query and every subsection in a vectorized manner.
- The resulting matrix has shape (num_queries, num_subsections).
· subsection["query_results"][query]["embedding_score"] = float(sim_matrix[q_idx, sub_idx])
- Stores embedding score for every query-subsection pair. This allows analysis or reranking without recomputation.
· topk_results construction:
- Uses np.argsort(…)[::-1][:top_k] to select the most relevant subsections per query.
- Maintains both the subsection ID (an MD5-based string) and its index (int) for efficient downstream access.
This function ensures that semantic retrieval is robust, efficient, and structured for practical use in SEO content analysis pipelines.
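A hypothetical, self-contained illustration of the top-k selection step with a toy similarity matrix:

```python
import numpy as np

# Toy similarity matrix: 2 queries x 3 subsections.
sim_matrix = np.array([[0.81, 0.42, 0.77],
                       [0.35, 0.90, 0.55]])
top_k = 2
for q_idx, row in enumerate(sim_matrix):
    top_idx = np.argsort(row)[::-1][:top_k]  # best subsections for this query
    print(q_idx, [(int(i), float(row[i])) for i in top_idx])
# 0 [(0, 0.81), (2, 0.77)]
# 1 [(1, 0.9), (2, 0.55)]
```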
Function load_reranker_model
Overview
The load_reranker_model function initializes a CrossEncoder model specifically designed for reranking tasks in semantic SEO pipelines. The model evaluates the relevance of query–subsection pairs with high precision, refining the initial embedding-based ranking. Device handling is incorporated to automatically utilize GPU if available, which ensures efficient inference for large-scale content.
This function is crucial for enhancing the retrieval pipeline by moving beyond raw embedding similarity and producing more fine-grained relevance scores for subsections.
Key Code Explanations
· device = "cuda" if torch.cuda.is_available() else "cpu"
- Automatically selects GPU when available, otherwise defaults to CPU.
- Ensures optimal performance for inference-intensive reranking operations.
· model = CrossEncoder(model_name, device=device)
- Loads the CrossEncoder model from Hugging Face.
- Accepts query–subsection pairs as input and outputs a scalar relevance score per pair.
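Combining the two lines above, the loader is:

```python
import torch
from sentence_transformers import CrossEncoder

def load_reranker_model(model_name="ibm-granite/granite-embedding-reranker-english-r2"):
    """Load the cross-encoder reranker on GPU when available, otherwise CPU."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return CrossEncoder(model_name, device=device)
```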
Model: ibm-granite/granite-embedding-reranker-english-r2
Overview
The ibm-granite/granite-embedding-reranker-english-r2 is a transformer-based cross-encoder model designed for fine-grained ranking of text pairs. Unlike embedding models that encode queries and content independently, cross-encoders take a query and a candidate text together and produce a single relevance score. This approach allows the model to capture intricate interactions between query and content, leading to highly precise ranking decisions. It is optimized for semantic search, content prioritization, and query-to-content alignment tasks.
How It Works
1. Text Pair Input: The model accepts a query and a candidate content subsection as a single concatenated input, separated by a special token.
2. Contextual Encoding: The combined input passes through the transformer backbone, which uses self-attention to model relationships both within and across the query-content pair.
3. Relevance Scoring: The [CLS] token or pooled representation from the sequence is passed through a classification head that produces a score indicating the relevance of the content to the query.
4. Sigmoid Activation: The final score is normalized between 0 and 1 using a sigmoid activation, making it interpretable as a probability of relevance.
Model Structure
· Transformer Backbone (ModernBertModel): Processes the combined text with 22 layers of self-attention, rotary embeddings, and MLP blocks. Each layer captures complex interactions between query and content tokens.
· Prediction Head (ModernBertPredictionHead): Maps the pooled sequence embedding to a single scalar relevance score.
· Activation: Sigmoid to normalize output between 0 and 1.
· Despite the complex internal architecture, the model is used directly through a CrossEncoder interface for ranking purposes.
Usage in This Project
· Reranking Top-k Sections: After embedding-based retrieval identifies candidate subsections for each query, the reranker evaluates these candidates jointly with the query to produce precise relevance scores.
· Normalization & Ranking: Scores from the reranker are normalized and used to sort the candidate sections, improving alignment with the user’s search intent.
· Enhancing Semantic Precision: The cross-encoder captures token-level interactions missed by independent embedding comparison, leading to more accurate identification of the most relevant content subsections.
Why This Model Was Chosen
· Fine-grained ranking: Cross-encoder architecture ensures accurate relevance scoring beyond simple semantic similarity.
· Domain suitability: Pretrained on English web content, making it ideal for SEO content and query matching.
· Integration with embedding pipeline: Works seamlessly with embedding-based top-k retrieval to produce a two-stage ranking system (embedding → reranker), balancing efficiency with precision.
· Normalized outputs: Sigmoid scores provide consistent, interpretable ranking for thresholding and deliverable computation.
Function rerank_topk
Overview
The rerank_topk function refines the ranking of content subsections for each query using the Granite reranker model. It takes the top-k candidates previously identified by embedding similarity and applies a fine-grained relevance scoring, producing rerank scores and rank positions for each query–subsection pair. This stage ensures that highly relevant subsections are prioritized in the retrieval results, improving alignment between queries and content sections.
The function integrates embedding-based retrieval with cross-encoder reranking, maintaining scalability while producing precise subsection-level relevance rankings.
Key Code Explanations
· topk_results = embedding_data.get("topk_results", [])
- Retrieves top-k candidates per query generated by the embedding similarity step.
- Ensures only the most promising subsections are reranked, optimizing computational efficiency.
· passages.append(subsection.get("merged_text", ""))
- Prepares the textual input for the reranker. Each passage corresponds to a candidate subsection’s merged text.
· results = reranker_model.rank(query, passages, return_documents=True)
- Invokes the Granite reranker to score each passage for the current query.
- Returns results with corpus_id to align reranker scores with the original candidates.
· norm_scores = [(s - min_s) / (max_s - min_s) for s in scores]
- Normalizes rerank scores to a 0–1 range per query to facilitate consistent comparison across different queries.
· order = sorted(range(len(norm_scores)), key=lambda i: norm_scores[i], reverse=True)
- Determines the rank order based on normalized scores, assigning the highest score a rank of 1.
· subsection["query_results"][query]["rerank_score"] = float(norm_scores[idx])
- Stores the normalized rerank score in the subsection’s query results dictionary.
- Subsequent analysis or visualization can directly use this data for insights.
· subsection["query_results"][query]["rerank_rank"] = int(rank_pos)
- Assigns a final rank to each subsection for the given query, enabling comparison between embedding-based ranking and reranker output.
This function bridges the coarse similarity scoring from embeddings with the precise reranker evaluation, producing actionable ranking data for SEO content assessment and query alignment.
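A hypothetical illustration of the per-query normalization and ranking logic with toy scores; the guard for a zero score range (all candidates scored identically) is an added assumption.

```python
# Toy raw reranker scores for one query's top-k candidates.
scores = [2.4, 0.7, 1.9]

min_s, max_s = min(scores), max(scores)
# Guard against division by zero when all scores are equal -- an assumption.
norm_scores = [(s - min_s) / (max_s - min_s) if max_s > min_s else 1.0 for s in scores]

# Highest normalized score receives rank 1.
order = sorted(range(len(norm_scores)), key=lambda i: norm_scores[i], reverse=True)
ranks = {idx: pos + 1 for pos, idx in enumerate(order)}
print(norm_scores)  # [1.0, 0.0, 0.705...]
print(ranks)        # {0: 1, 2: 2, 1: 3}
```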
Function compute_page_deliverables
Overview
The compute_page_deliverables function generates structured, user-facing deliverables from the embedding and reranker results. For each query, it calculates key metrics such as average embedding score, intent coverage, and relevance gap, and identifies the top subsections based on rerank scores. These deliverables provide actionable insights on which sections best align with the queries and indicate potential gaps in content coverage.
This function applies thresholds to filter out low-relevance sections and organizes the results in a format suitable for visualizations or client reports.
Key Code Explanations
· query_subsections = [sub for sub in page.get("subsections", []) if query in sub.get("query_results", {})]
- Filters subsections to include only those that have results for the current query.
- Ensures that metrics and top sections are computed only on relevant content.
· valid_scores = [s for s in embedding_scores if s >= embedding_threshold]
- Applies the embedding threshold to exclude low-relevance subsections.
- This focuses the deliverables on high-quality, relevant content for the query.
· avg_embedding_score = float(np.mean(valid_scores)) if valid_scores else 0.0
- Calculates the mean embedding score among subsections meeting the threshold, providing a summary relevance metric per query.
· coverage = round(len(valid_scores) / len(embedding_scores) * 100, 2) if embedding_scores else 0.0
- Computes the proportion of subsections above the threshold as a percentage, representing how well the page covers the query intent.
· top_sections = sorted([…], key=lambda x: (x["rerank_score"] if x["rerank_score"] is not None else 0.0), reverse=True)[:top_k]
- Selects the top-k subsections based on rerank score.
- Ensures that the most relevant and highest-ranked content is highlighted for each query.
- Each top section includes the subsection title, section title, preview text, embedding score, rerank score, and rerank rank.
· relevance_gap = coverage < relevance_threshold
- Flags queries with low content coverage relative to the specified relevance threshold, highlighting potential gaps for content optimization.
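A hypothetical illustration of the per-query metric computations with toy scores; the threshold values are assumptions.

```python
import numpy as np

# Toy embedding scores for one query's subsections; thresholds are assumptions.
embedding_scores = [0.82, 0.76, 0.41, 0.88]
embedding_threshold, relevance_threshold = 0.65, 50.0

valid_scores = [s for s in embedding_scores if s >= embedding_threshold]
avg_embedding_score = float(np.mean(valid_scores)) if valid_scores else 0.0
coverage = round(len(valid_scores) / len(embedding_scores) * 100, 2) if embedding_scores else 0.0
relevance_gap = coverage < relevance_threshold  # flag thin coverage for this query

print(avg_embedding_score, coverage, relevance_gap)  # 0.82 75.0 False
```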
Function display_results
Overview
The display_results function provides a structured, human-readable display of SEO deliverables for a list of pages. It prints page URL, title, and per-query metrics including average embedding score, intent coverage, and relevance gap flags. Additionally, it shows the top sections per query, including section hierarchy, preview text, embedding score, and reranker information. This function is designed for quick inspection of the results without generating plots or modifying underlying data.
Result Analysis and Explanation
The analysis focuses on the page “Handling Different Document URLs Using HTTP Headers Guide”, evaluated against two queries. Each query demonstrates how Granite-powered embeddings and reranker models identify the most relevant content.
Query 1: “How to handle different document URLs”
Average Embedding Score: 0.790
- Indicates strong semantic alignment between the query and the content text. Values above 0.75 are considered high, showing that the content closely matches the query’s intent.
Intent Coverage: 100.0%
- All key content segments relevant to the query are captured, showing no missing coverage.
Relevance Gap Detected: No
- There are no significant content gaps.
Top 3 Sections (based on preview text):
1. Preview: “Replace with your actual canonical URL. This directive ensures that all duplicate or similar PDFs reference the preferr…”
- Embedding Score: 0.846 — Very strong semantic similarity.
- Rerank Score: 1.000 — Highest confidence by the reranker.
- Interpretation: This content closely matches the query and should be considered highly relevant for user guidance.
2. Preview: “Ensures that different versions of a webpage are served correctly based on device type or language. Helps prevent duplic…”
- Embedding Score: 0.833 — High semantic similarity.
- Rerank Score: 1.000 — Confirms strong relevance.
- Interpretation: Also highly relevant; embedding score slightly lower but still excellent.
3. Preview: “Works for non-HTML content (PDFs, images, videos) where traditional HTML canonical tags can’t be used. Prevents dupli…”
- Embedding Score: 0.831
- Rerank Score: 1.000
- Interpretation: Relevant, though embedding slightly lower than the top two; reranker maintains top relevance.
Score Interpretation:
- Good: Embedding scores above 0.8 are strong; reranker score of 1.0 confirms the content is highly aligned with the query.
- Average: Scores around 0.75–0.8 indicate decent alignment but may need minor validation.
- Bad: Scores below 0.6 would indicate content is likely unrelated.
Query 2: “Using HTTP headers for PDFs and images”
Average Embedding Score: 0.810
- Slightly higher than Query 1, indicating even stronger alignment.
Intent Coverage: 100.0%
- All relevant content is included.
Relevance Gap Detected: No
Top 3 Sections (based on preview text):
1. Preview: “PDF, Image, and Video SEO: Proper use of headers ensures these non-HTML resources are indexed correctly and efficiently…”
- Embedding Score: 0.924 — Excellent semantic match.
- Rerank Score: 1.000 — Top confidence.
- Interpretation: Extremely strong relevance; this text should be prioritized.
2. Preview: “Vary Header: Helps search engines recognize dynamic content and prevent cloaking issues. Canonical Tags in Headers: Used…”
- Embedding Score: 0.869
- Rerank Score: 1.000
- Interpretation: Highly relevant; embedding score shows strong similarity.
3. Preview: “Always test your headers using browser Developer Tools, cURL, or online header checkers. Avoid excessive redirects and e…”
- Embedding Score: 0.863
- Rerank Score: 1.000
- Interpretation: Very strong alignment; suitable for direct guidance or recommendations.
Score Interpretation:
- Embedding scores above 0.85 indicate exceptionally strong alignment with the query text.
- Scores between 0.75–0.85 are good and show reliable relevance.
- Reranker consistently confirms the strongest content, making it easy to identify high-priority sections.
Overall Score Insights
1. Embedding Scores:
- High scores (>0.8) indicate semantic alignment.
- Scores below 0.7 suggest weaker relevance, which may warrant reviewing content for gaps.
2. Reranker Scores:
- A normalized reranker score of 1.0 signals the highest confidence in relevance.
- Scores between 0.5–0.9 indicate partial alignment; scores below 0.5 suggest low relevance.
3. Interpretation:
- Combining embedding and reranker scores provides a dual perspective:
  - Embeddings capture broad semantic similarity.
  - Reranker confirms fine-grained relevance among top candidates.
- This combination ensures accurate identification of actionable content text and highlights potential content gaps for review.
Result Analysis and Explanation
Score Threshold Bins
Embedding scores indicate the semantic similarity between queries and webpage content. For practical interpretation and actionable decision-making, scores are categorized into three bins:
- Poor (<0.70): Content demonstrates weak alignment with the query intent. Sections with scores in this range may fail to satisfy user search intent and require content improvement or re-alignment.
- Moderate (0.70 – 0.85): Content shows partial alignment with the query. These sections may require additional enrichment, clarification, or restructuring to fully cover the intent.
- Strong (>0.85): Content is highly aligned with the query. These sections are performing well in terms of semantic coverage and require minimal adjustments.
This binning approach allows quick identification of underperforming sections and prioritization of optimization efforts.
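A trivial helper expressing these bins in code:

```python
def score_bin(score: float) -> str:
    """Map an embedding score to the Poor / Moderate / Strong bins above."""
    if score > 0.85:
        return "Strong"
    if score >= 0.70:
        return "Moderate"
    return "Poor"

for s in (0.62, 0.78, 0.91):
    print(s, score_bin(s))  # Poor, Moderate, Strong
```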
Embedding Score Analysis
Embedding scores provide a quantitative measure of how well page content matches the target queries:
- Distribution Insights: High concentration of strong scores indicates robust semantic coverage across sections and queries. Moderate scores highlight content with potential gaps, while poor scores flag areas requiring intervention.
- Comparative Analysis: Differences in score distributions across URLs reveal variations in query alignment and content effectiveness. Pages with consistently low or moderate scores may need a thorough content audit.
- Actionable Interpretation: Sections with poor scores can be targeted for content expansion, addition of relevant keywords, or restructuring to better match the intended search queries.
Intent Coverage Analysis
Intent coverage measures the proportion of queries for which the page contains sections above the embedding threshold:
- High Intent Coverage: Indicates the majority of queries are well-represented in the page content. Pages with high coverage are likely satisfying diverse user intents effectively.
- Low Intent Coverage: Suggests gaps in content for certain queries. These gaps can be addressed by adding new sections, refining existing content, or restructuring information flow.
- Practical Implications: Monitoring intent coverage across multiple URLs helps prioritize pages for optimization and ensures comprehensive query alignment.
Section Distribution Analysis
Analyzing the distribution of sections exceeding the embedding threshold offers insight into content structure:
- Section-Level Performance: Sections are ranked by embedding scores to identify which parts of the content are most relevant to each query.
- Coverage Patterns: Multiple high-scoring sections indicate content redundancy or thorough coverage, while few sections above threshold may indicate sparse content.
- Optimization Strategy: Observing which sections consistently score below thresholds enables targeted updates, ensuring that key topics are addressed adequately.
Rerank Score Analysis
Reranker scores provide an additional layer of relevance assessment, reflecting the relative importance of sections for each query:
- Score Alignment: Sections with high embedding and reranker scores confirm strong content-query relevance. Discrepancies between embedding and reranker rankings can highlight sections that semantically match the query but may not be contextually prioritized.
- Rank Interpretation: Lower rerank rank values indicate higher priority sections. These are the sections that should be emphasized or preserved as core content.
- Strategic Use: Reranker scores guide internal linking, navigation, or featured snippet targeting, ensuring that top content is surfaced effectively.
Visualization Insights
Four visualizations provide actionable insights:
1. Average Embedding Score per Query (Barplot)
- Purpose: Compares the average semantic alignment of each query across all URLs.
- Interpretation: Higher bars represent strong overall query coverage; shorter bars indicate queries requiring content enhancement.
- Actionable Insight: Queries with moderate or poor scores should be prioritized for content improvement or restructuring.
2. Intent Coverage per Query (Barplot)
- Purpose: Illustrates how well each page addresses the target queries.
- Interpretation: Pages with full coverage show strong semantic representation; partial coverage indicates missing or weakly aligned sections.
- Actionable Insight: Low coverage queries signal content gaps and highlight areas to expand or enrich.
3. Section Distribution Above Embedding Threshold (Line Plot)
- Purpose: Tracks embedding scores of sections exceeding the threshold across page rankings.
- Interpretation: Steeper lines with higher values denote multiple high-performing sections; flatter lines indicate sparse content coverage.
- Actionable Insight: Patterns of low or inconsistent scores suggest a need to reorganize content or add sections for comprehensive query coverage.
4. Rerank Rank Distribution per Query (Boxplot)
- Purpose: Visualizes the spread of reranker rankings for sections grouped by page.
- Interpretation: Narrow boxplots centered near rank 1 indicate consistent prioritization of relevant sections; wide or high-rank distributions indicate inconsistencies in section relevance.
- Actionable Insight: Pages with inconsistent reranker distributions should be reviewed to ensure top sections are clearly highlighted and prioritized.
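As an illustration of the first plot, here is a sketch using the two average scores reported earlier; the figure size, reference line, and styling are assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Scores taken from the analysis above; layout details are illustrative.
df = pd.DataFrame({
    "query": ["How to handle different document URLs",
              "Using HTTP headers for PDFs and images"],
    "avg_embedding_score": [0.790, 0.810],
})

plt.figure(figsize=(8, 4))
sns.barplot(data=df, x="query", y="avg_embedding_score")
plt.axhline(0.70, linestyle="--", color="gray", label="Poor / Moderate boundary")
plt.ylabel("Average embedding score")
plt.xlabel("")
plt.legend()
plt.tight_layout()
plt.show()
```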
Actionable Recommendations
Based on the analysis of embedding scores, intent coverage, section distribution, and reranker rankings:
- Content Enhancement: Focus on sections with poor or moderate embedding scores to improve semantic alignment with target queries.
- Query Coverage Expansion: Identify queries with low intent coverage and create additional sections or expand existing content.
- Section Prioritization: Leverage reranker scores to optimize internal linking, headings, and featured content placement.
- Regular Monitoring: Continuously track embedding and reranker scores across pages to maintain high-quality query alignment and address emerging content gaps.
Q&A: Understanding Project Value and Importance
What is the primary value of this project for content optimization?
This project identifies the semantic alignment between website content and target search queries. By quantifying embedding and reranker scores, it highlights which sections of content strongly satisfy user intent and which require enhancement. The actionable output allows prioritizing content updates, improving relevance, and increasing the likelihood of higher search visibility.
How do embedding scores help in understanding content performance?
Embedding scores measure the semantic similarity between a query and content sections:
- Strong scores (>0.85) indicate sections are highly relevant and require minimal updates.
- Moderate scores (0.70–0.85) identify content that partially addresses the query and can be enriched for better coverage.
- Poor scores (<0.70) signal sections that are not aligned with the query and need substantial improvement. This allows targeted content optimization rather than blanket changes.
Why are reranker scores important alongside embeddings?
Reranker scores provide a relative importance ranking of content sections for a given query. Even if multiple sections have high embeddings, reranker scores prioritize the most contextually relevant sections for user intent. This ensures that the most valuable content is surfaced first, guiding internal linking, headings, and featured snippet strategy.
How does intent coverage analysis inform content strategy?
Intent coverage shows what percentage of queries are adequately addressed by a page.
- High coverage means most queries are well-represented and the page is broadly optimized.
- Low coverage indicates gaps where content is missing or underperforming. Focusing on low-coverage areas ensures that no search intent is left unmet, improving overall page performance and user satisfaction.
How do the visualization plots support decision-making?
Visualizations translate complex data into actionable insights:
- Avg embedding score per query (barplot): Quickly identifies which queries have weak coverage across pages.
- Intent coverage per query (barplot): Highlights queries with missing or underrepresented content.
- Section distribution above threshold (line plot): Shows the depth of content coverage for each query and page.
- Rerank rank distribution (boxplot): Reveals how consistently top sections align with queries, guiding prioritization. Together, these plots enable focused, data-driven content strategy decisions.
How can this project directly impact SEO outcomes?
By pinpointing sections that are poorly aligned with queries, the project enables targeted improvements, which can:
- Increase content relevance for search engines.
- Improve user engagement through better query satisfaction.
- Enhance internal linking and content flow using reranker insights.
- Reduce the risk of content gaps, ensuring all important queries are addressed.
What practical steps should be taken based on the results?
Recommended actions include:
- Update or expand sections with embedding scores below 0.70.
- Enrich sections with moderate scores (0.70–0.85) to strengthen query coverage.
- Use reranker rankings to prioritize top-performing sections in navigation and internal linking.
- Monitor intent coverage and maintain a high proportion of queries above the threshold to prevent gaps.
Final Thoughts
This project successfully evaluated the semantic alignment of website content against target search queries using advanced embedding and reranker models, specifically leveraging the IBM Granite models for both embedding and cross-encoder ranking. The implementation demonstrated how normalized embeddings and reranker scores can be combined to provide actionable insights into content relevance and intent coverage.
The analysis highlights areas where content is strongly aligned with user queries, sections that require enhancement, and potential gaps in coverage. By integrating Granite embeddings, the project ensured high-quality semantic representations, allowing precise identification of relevance levels and improving confidence in the results. The reranker model further refined the prioritization of content, guiding which sections should be emphasized for maximum SEO impact.
Overall, the project provides a clear, data-driven framework for content optimization, enabling targeted improvements to meet search intent, enhance user engagement, and strengthen search performance. The use of Granite models ensures robust, real-world applicability, making the outputs reliable for actionable SEO decisions.