Disentangled Representations in LLMs for Focused Retrieval


This project implements a representation learning pipeline using large language models (LLMs) to generate disentangled embeddings that separate out the core aspects of content meaning—namely, topic, tone, and semantic content. These representations are then used to compute query-aware relevance scores across blocks of web content for high-precision information retrieval in SEO applications.

The pipeline processes one or more webpages, extracts content blocks, computes embeddings using pre-trained models, performs topic clustering, classifies tone, encodes query vectors, and finally ranks the blocks based on multi-dimensional similarity. The final output includes relevance-ranked content blocks along with topic and tone insights. Client-facing visualizations and explanations make results actionable and interpretable.

Project Purpose

The goal of this project is to enable more focused, high-quality content retrieval from large collections of web data, where relevance depends not only on keywords but on deeper meaning dimensions such as tone and topic coherence.

In modern SEO, understanding how something is said (tone) and what it is about (topic) is as important as the semantic content itself. This project empowers SEO professionals to:

Detect the most relevant content blocks from long or multiple documents
Analyze tone to match target audience expectations
Compare content tone-topic alignment across competitors
Improve targeting for content curation, optimization, or generation

By disentangling key dimensions of meaning and using them to score relevance, the project moves beyond traditional keyword or embedding-based retrieval into a more human-aligned and semantically aware framework.

Project’s Key Topics Explanation and Understanding

The project revolves around separating and analyzing distinct aspects of webpage content—such as topic, tone, and meaning—to improve content discovery and retrieval in SEO contexts. Each topic below represents a foundational concept that contributes to the effectiveness of this approach.

Disentangled Representations in Content Retrieval

Disentangled representation refers to separating different semantic attributes of text into independent components. This allows each block of text to be understood not just in terms of what it says, but how it says it and what larger theme it belongs to. In this project, three aspects are distinctly represented:

Semantic meaning (content): What the text is about at a sentence or paragraph level.
Tone or communication style: How the message is conveyed, e.g., whether it is promotional, informative, or advisory.
Topic context: The broader thematic group or subject matter the text belongs to.

This separation ensures higher accuracy in content understanding and alignment with user intent.

Topic Clustering for Thematic Understanding

Topic clustering groups similar content blocks into coherent thematic categories. This allows the system to recognize when multiple parts of a page—or multiple pages—talk about the same subject. A topic cluster can represent ideas such as “technical setup instructions”, “pricing details”, or “SEO best practices”.

This technique improves content structuring, enables topical summaries, and helps prioritize content that aligns thematically with user queries.

Tone Classification for Communication Analysis

Tone classification identifies the stylistic and emotional intent behind content. This includes tones such as:

Informative, Persuasive, Promotional, Neutral, Advisory, Confident, Conversational

Understanding tone helps match content style with user expectations. For example, a query seeking professional advice may prefer an “advisory” or “confident” tone, while product promotion benefits from “persuasive” or “promotional” styles.

This level of analysis supports editorial decisions, competitive comparisons, and alignment with brand voice.

Block-level Text Segmentation

Rather than evaluating entire web pages, this project processes content at the block level—treating headings, paragraphs, and list items as independent semantic units. This enables precise extraction and scoring, identifying the most relevant parts of a document for specific queries or topics.

Such granularity enhances the ability to retrieve relevant, high-quality responses even from lengthy or mixed-topic documents.
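To make the idea of block-level units concrete, here is a minimal sketch that splits raw HTML into heading, paragraph, and list-item blocks using only the standard library. The class name BlockSegmenter, the function segment_blocks, and the chosen tag sets are assumptions for illustration; the project itself relies on trafilatura and BeautifulSoup for extraction.

```python
from html.parser import HTMLParser

# Tags treated as independent content blocks (an assumed set for illustration).
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "p", "li"}
# Tags whose text is never treated as content.
SKIP_TAGS = {"script", "style"}


class BlockSegmenter(HTMLParser):
    """Collects the text of each heading, paragraph, and list item as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished (tag, text) pairs
        self._tag = None      # block tag currently open, if any
        self._parts = []      # text fragments of the open block
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()     # a new block closes any block still open
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._tag and not self._skip_depth:
            self._parts.append(data)

    def _flush(self):
        # Join fragments and collapse runs of whitespace.
        text = " ".join("".join(self._parts).split())
        if self._tag and text:
            self.blocks.append((self._tag, text))
        self._tag, self._parts = None, []


def segment_blocks(html_text):
    """Return a list of (tag, text) blocks in document order."""
    parser = BlockSegmenter()
    parser.feed(html_text)
    parser.close()
    parser._flush()
    return parser.blocks
```

Each returned pair can then be scored on its own, which is what allows a single strong paragraph to surface from an otherwise off-topic page.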

Vector-Based Semantic Matching

Each aspect of content (meaning, tone, topic) is represented as a vector in a high-dimensional space. This allows for similarity-based comparison between user queries and content blocks. Matching occurs on multiple dimensions simultaneously:

Semantic similarity assesses direct relevance.
Tone similarity captures stylistic alignment.
Topic similarity ensures thematic consistency.

This layered matching model supports more meaningful and accurate retrieval than keyword-based systems.

Relevance Scoring Using Multi-Aspect Alignment

Relevance scoring is based on comparing the vector representations of content and queries across all dimensions. Each comparison contributes to a final relevance score that reflects overall alignment in meaning, tone, and topic.

This scoring mechanism ensures the surfaced content is not only topically relevant but also stylistically and semantically appropriate, offering a more intelligent and user-aligned retrieval process.
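One way to read the multi-aspect scoring described above is as a weighted sum of per-aspect cosine similarities. The sketch below assumes that form; the weights, the dictionary keys ("content", "topic", "tone"), and the function names are hypothetical and are not taken from the project code. It uses plain Python so it runs without numpy or scikit-learn.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical aspect weights; the source does not state how aspects are combined.
ASPECT_WEIGHTS = {"content": 0.6, "topic": 0.25, "tone": 0.15}


def relevance_score(query, block, weights=ASPECT_WEIGHTS):
    """Weighted sum of the per-aspect similarities between a query and a block."""
    return sum(w * cosine(query[aspect], block[aspect]) for aspect, w in weights.items())


def rank_blocks(query, blocks, weights=ASPECT_WEIGHTS):
    """Return (score, text) pairs sorted from most to least relevant."""
    scored = [(relevance_score(query, b, weights), b["text"]) for b in blocks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Keeping the weights in one dictionary makes it easy to test how much tone or topic alignment should influence the final ranking for a given client.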

Q&A: Understanding the Project’s Value and Importance

What does this project do, and why is it important for SEO?

This project analyzes webpage content to detect and separate distinct aspects of meaning—primarily topic and tone. By disentangling these components, the system enables precise relevance scoring of page content against user queries. This directly supports SEO efforts by revealing how well a page’s content aligns with a specific intent, topic, and communication style.

How does this project help improve page-level content relevance?

The system identifies which sections of a page are most aligned with a given query based on both topical focus and tone. This allows SEO professionals to evaluate whether a page has the right depth, coverage, and tone balance for each target intent. It enables pinpointed optimization of specific content blocks without needing to revise entire pages blindly.

How does this system help in identifying the most relevant content?

The system uses a dual-representation of content — one vector capturing the topic, another the tone — and calculates a query-specific relevance score using semantic and stylistic alignment. This means blocks are not only checked for lexical matches, but also for conceptual and intent-based relevance. This helps detect hidden gaps, such as when content covers the right topic but in the wrong voice or with misaligned emphasis, helping optimize for both user intent and SERP expectations.

What is the benefit of separating topic and tone in content blocks?

Separating topic and tone allows for more granular interpretation of a content block’s purpose. While topic identifies what the content is about (e.g., product features, how-to guides), tone captures how the message is conveyed (e.g., persuasive, advisory). This disentangled representation helps identify mismatches — for example, when an informative topic is written with a promotional tone, which may reduce trust in informational search contexts. This separation helps SEO teams make precise improvements without rewriting entire pages blindly.

How actionable are the results for SEO content teams?

Each content block is scored with relevance, labeled with a topic, and classified by tone. These results are structured for direct action: optimizing or replacing low-performing blocks, adjusting tone where mismatch exists, or identifying content gaps relative to intent. Visualizations further support strategic review and client reporting.

What types of SEO scenarios benefit most from this analysis?

Content audits for high-intent keywords.
Tone alignment for advisory, YMYL, or product content.
Competitive analysis of top-ranking pages.
Topic coverage evaluations for pillar or cluster pages.
Identifying cannibalization across multi-page websites.

Libraries and Dependencies Used

This project integrates a carefully chosen set of libraries across various domains—web content processing, NLP modeling, clustering, embedding analysis, and visualization. Below is a detailed overview of each core library used, with justification grounded in real-world usage within the SEO domain.

requests

The requests library is a well-established tool for making HTTP requests and retrieving online content. It provides a simple API to handle GET and POST calls and supports headers, timeouts, and error handling mechanisms.

In this project, requests is used to fetch the raw HTML content of webpages from the provided URLs. It acts as the entry point to the pipeline by enabling seamless access to the underlying document structure of any SEO-relevant webpage.

trafilatura

trafilatura is a state-of-the-art content extraction tool designed to remove clutter and boilerplate code from webpages. It leverages advanced heuristics to extract main article-like content from HTML documents.

This project uses trafilatura as the primary method for block-level content extraction, ensuring that only meaningful, SEO-optimized content is retained from web pages. Its precision in identifying central text makes it suitable for downstream NLP processing.

BeautifulSoup (from bs4)

BeautifulSoup is a Python library that enables parsing and navigation of HTML and XML documents. It simplifies locating and extracting tags, attributes, and structured elements from webpage markup.

Here, it is used as a fallback mechanism when trafilatura fails. The pipeline uses BeautifulSoup to manually parse and extract visible text blocks while discarding script tags, comments, and non-informational elements. This dual-extraction approach ensures higher robustness in real-world, messy web content.

re, html, unicodedata

These are built-in Python modules used for text normalization. re enables regular expression-based cleaning, html decodes HTML entities (e.g., &nbsp;, &amp;), and unicodedata standardizes text using Unicode conventions.

These modules are essential in cleaning and sanitizing extracted blocks. Without them, downstream models could misinterpret encoded characters or unnecessary whitespace, affecting tone classification and embedding generation. They help maintain a clean textual input for more accurate NLP analysis.

string

The string module provides access to common character groups such as punctuation and ASCII letters.

It is used primarily to remove punctuation during keyword filtering and prepare text for topic modeling and KeyBERT keyword extraction, ensuring a cleaner, more signal-rich textual representation of SEO content blocks.

defaultdict, Counter (from collections)

These are part of Python’s standard collections module. defaultdict allows automatic default values in dictionaries, while Counter enables frequency analysis of elements.

They are used extensively to accumulate tone distributions, relevance scores, and topic statistics per query and per URL. Their inclusion simplifies grouping and summarization of results in an efficient and readable manner.
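As a rough illustration of this kind of grouping, the sketch below tallies tone labels and averages relevance per URL. The record keys ("url", "tone_label", "relevance") and the function name summarize_by_url are assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarize_by_url(blocks):
    """Group scored blocks by URL and summarize tone counts and mean relevance."""
    tone_counts = defaultdict(Counter)  # url -> Counter of tone labels
    scores = defaultdict(list)          # url -> list of relevance scores
    for block in blocks:
        tone_counts[block["url"]][block["tone_label"]] += 1
        scores[block["url"]].append(block["relevance"])
    return {
        url: {
            "tone_distribution": dict(tone_counts[url]),
            "mean_relevance": sum(scores[url]) / len(scores[url]),
        }
        for url in tone_counts
    }
```

Because defaultdict creates the inner Counter and list on first access, the loop needs no explicit checks for unseen URLs.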

typing (List, Dict, Any, Optional)

These constructs enable type annotations in Python functions and data structures. Type hints make the code more maintainable, predictable, and less error-prone.

All core pipeline functions are annotated with type hints from the typing module, so the expected inputs and outputs are explicit. This supports better integration and debugging in production deployments.

numpy

numpy is the foundational library for scientific computing in Python. It offers efficient arrays, matrix operations, and mathematical utilities.

In this project, numpy is used for vector operations, including the averaging of topic vectors, computing cosine similarity between query and block embeddings, and calculating relevance score statistics across blocks and topics. It supports fast and memory-efficient matrix manipulation required in large-scale SEO analysis.

SentenceTransformer (from sentence_transformers)

SentenceTransformer provides pretrained models optimized for producing high-quality semantic sentence embeddings. It wraps models like BERT, RoBERTa, etc., for fast and scalable use.

We use it to encode both content blocks and user queries into fixed-length dense vectors that capture their meaning. This embedding space becomes the foundation for computing semantic relevance and matching SEO blocks to user intent, replacing traditional keyword-based matching with modern contextual understanding.

pipeline (from transformers)

The pipeline utility from Hugging Face simplifies access to pretrained models for tasks like sentiment classification, text generation, and more.

Here, it powers the tone classification module, which analyzes each block of content for its communicative style (e.g., informative, promotional). The classifier is loaded through the zero-shot-classification task with FacebookAI/roberta-large-mnli, so custom tone labels can be supplied at inference time.

logging (from transformers.utils)

This utility is used to control the verbosity of Hugging Face transformers. In production or client-facing environments, uncontrolled logging can clutter outputs.

We use it to disable warnings and progress bars, ensuring a clean, professional runtime environment during inference.

BERTopic

BERTopic is an advanced unsupervised topic modeling technique that clusters documents based on semantic embeddings and refines topics using class-based TF-IDF.

This project uses BERTopic to cluster content blocks into interpretable topics, enabling topic-aware content analysis and matching user queries to topic-relevant blocks. It also helps aggregate scores at the topic level for enhanced interpretability.

PCA (from sklearn.decomposition)

Principal Component Analysis (PCA) reduces high-dimensional vectors to lower dimensions while preserving important variance.

Here, it compresses dense block embeddings into low-dimensional topic vectors, making them more efficient and noise-tolerant for topic scoring and similarity calculations.

ENGLISH_STOP_WORDS (from sklearn.feature_extraction.text)

This is a predefined list of common English stopwords (e.g., “the”, “and”, “is”) that are typically removed to focus on meaningful content.

The stopword list is used in keyword filtering and topic labeling, ensuring that noisy or generic terms do not skew relevance scores or cluster interpretations.

cosine_similarity (from sklearn.metrics.pairwise)

This function computes the cosine similarity between two sets of vectors, a common measure in semantic matching.

It is the core scoring mechanism to measure how relevant each content block is to the user query, enabling precise ranking of content across pages for SEO use cases.

KeyBERT

KeyBERT is a keyword extraction technique built on transformer embeddings. It identifies the most representative keywords from a block of text.

In this pipeline, it is used to assign interpretable labels to discovered topics, enhancing the readability and actionability of topic clusters in client reports.

matplotlib.pyplot, seaborn

These are the industry-standard libraries for producing high-quality plots and charts in Python.

They are used to visualize tone distributions, top relevant content blocks, and topic-level heatmaps, giving clients an intuitive understanding of content alignment, tone balance, and query relevance across domains.


Function: extract_blocks() — Extract Content Blocks from a Webpage

Summary:

The extract_blocks() function is responsible for extracting readable content from a given webpage URL. It plays a critical role in the pipeline by ensuring that only clean and semantically meaningful textual blocks are passed on for downstream analysis like embedding generation, topic classification, or tone prediction.

This function uses a two-stage strategy:

It first attempts to extract structured text using trafilatura, which is optimized for web article extraction. If that fails, it falls back to manual extraction using BeautifulSoup, applying a series of filters and de-duplication strategies to retrieve relevant visible content.

This dual-layered extraction ensures robustness across a wide variety of page structures, which is essential in real-world SEO projects dealing with unpredictable webpage formatting.

Highlighted Code Explanation:

· Trafilatura as Primary Extractor:

result = trafilatura.extract(downloaded, include_comments=False, include_tables=False, …)

Trafilatura parses the page content into structured text, stripping comments, formatting, and tables for a cleaner output. This increases the relevance of extracted data for semantic processing.

· Parsing and Filtering Text Blocks:

paragraphs = parsed.get("text", "").split("\n")
blocks = [p.strip() for p in paragraphs if len(p.strip().split()) >= min_word_count]

Only blocks with a minimum number of words are retained, ensuring short noisy fragments are excluded. This step is vital for downstream model performance.

· Fallback to BeautifulSoup with Cleaning Logic:

for tag in ['script', 'style', …]:
    tag.decompose()

Non-content tags are removed to avoid boilerplate noise. This includes JavaScript, styling, navigation, forms, and other UI components.

· Duplicate Removal and Block Construction:

digest = hash(text.lower())
if digest not in seen:
    blocks.append(text)

Hash-based deduplication prevents repeated blocks, improving the quality and uniqueness of the extracted content, which is crucial for embedding consistency.
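The word-count filter and the deduplication step can be combined in a few lines. This is a minimal sketch, not the project's implementation: the function name dedupe_blocks and the default threshold are assumptions, and hashlib.sha1 is swapped in for the built-in hash() shown in the excerpt because Python salts string hashes per process, so hashlib gives digests that stay stable across runs.

```python
import hashlib


def dedupe_blocks(texts, min_word_count=5):
    """Keep the first occurrence of each block that has enough words."""
    seen = set()   # digests of blocks already kept
    blocks = []
    for raw in texts:
        text = raw.strip()
        if len(text.split()) < min_word_count:
            continue  # too short to carry useful signal
        # Case-insensitive digest so "Hello World" and "hello world" collide.
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            blocks.append(text)
    return blocks
```

Note that the digest must be added to the seen set after a block is kept; without that step every repeat would pass the membership check.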

Function: clean_and_filter_blocks() — Clean and Filter Extracted Text Blocks

Summary:

The clean_and_filter_blocks() function processes the raw extracted text blocks from a webpage and converts them into structured, clean, and relevant records. It plays a crucial role in sanitizing and normalizing content for downstream semantic analysis by removing visual clutter, boilerplate noise, formatting artifacts, and unnecessary links or symbols.

This step ensures high signal-to-noise ratio in the input data, which significantly improves the accuracy of embedding models, topic clustering, and tone classification. It also associates each cleaned block with its source URL, maintaining traceability for result presentation.

Highlighted Code Explanation:

· Boilerplate and Noise Filtering:

boilerplate = re.compile(r'\b(?:read more|click here|subscribe|…)\b', re.IGNORECASE)

This regex targets common boilerplate phrases frequently found in footer, header, or promotional sections. Removing them ensures the extracted blocks remain focused on primary content.

· Text Cleaning Pipeline:

text = html.unescape(text)
text = unicodedata.normalize("NFKC", text)

HTML entities are decoded, and Unicode is normalized for consistent token representations, which improves semantic vectorization and prevents encoding anomalies.

· Numbering, Bullets, and URL Removal:

text = url_pattern.sub('', text)
text = bullet_pattern.sub('', text)
text = numbered_pattern.sub('', text)

This removes structural artifacts like bullet symbols, steps, section numbers, and inline URLs. The result is a more natural, plain-text version suitable for modeling.

· Final Filtering and Structuring:

if len(cleaned_text.split()) >= min_word_count:
    cleaned.append({"text": cleaned_text, "url": url})

Short or noisy blocks are filtered out based on a minimum word count. Each valid block is stored as a dictionary with both content and its origin URL, enabling backtracking and attribution in the final output.
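Put together, the cleaning steps above might look roughly like the sketch below. All four regular expressions are illustrative assumptions (the original patterns are elided in the excerpts), as are the helper names clean_block and clean_and_filter and the choice to strip boilerplate phrases rather than drop the whole block.

```python
import html
import re
import unicodedata

# Illustrative patterns only; the project's actual expressions are not shown in full.
BOILERPLATE = re.compile(r"\b(?:read more|click here|subscribe)\b", re.IGNORECASE)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
BULLET_PATTERN = re.compile(r"^\s*[•·\-\*]\s+")
NUMBERED_PATTERN = re.compile(r"^\s*\d+[.)]\s+")


def clean_block(text):
    """Decode entities, normalize Unicode, and strip structural artifacts."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # e.g. non-breaking space -> space
    text = URL_PATTERN.sub("", text)
    text = BULLET_PATTERN.sub("", text)
    text = NUMBERED_PATTERN.sub("", text)
    text = BOILERPLATE.sub("", text)
    return " ".join(text.split())               # collapse leftover whitespace


def clean_and_filter(blocks, url, min_word_count=5):
    """Return cleaned blocks as {"text", "url"} records, dropping short ones."""
    cleaned = []
    for raw in blocks:
        cleaned_text = clean_block(raw)
        if len(cleaned_text.split()) >= min_word_count:
            cleaned.append({"text": cleaned_text, "url": url})
    return cleaned
```

Running the word-count check after cleaning matters: a block that consists mostly of a URL or a call-to-action phrase shrinks below the threshold and is discarded.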

Function: load_embedding_model() — Load Pretrained Embedding Model

Summary:

The function loads a pretrained SentenceTransformer model for generating semantic embeddings of text blocks and queries. This embedding model serves as the backbone for various components in the project, including content clustering, similarity scoring, and topic representation.

The function currently uses the model “all-MiniLM-L6-v2” — a lightweight yet high-performing transformer known for producing dense vector representations that preserve semantic meaning. This model strikes a balance between accuracy and efficiency, making it ideal for production-scale SEO applications involving hundreds or thousands of content blocks.

Highlighted Code Explanation:

· Model Loading from Hugging Face:

topic_model = SentenceTransformer(model_name)

This line initializes a SentenceTransformer using the specified model name (default: “all-MiniLM-L6-v2”). This model is trained on semantic similarity tasks and outputs dense embeddings ideal for clustering, search, and ranking.

Model Details and Explanation

This project uses a SentenceTransformer-based embedding model to convert content blocks and queries into dense semantic vectors. The model serves as the foundation for all similarity-driven operations, including topic clustering, tone-topic disentanglement, and query scoring. Below is a breakdown of its architecture, processing strategy, and relevance to SEO analysis tasks.

Model Overview: all-MiniLM-L6-v2

Source: Hugging Face’s SentenceTransformers library
Size: Approximately 22 million parameters
Embedding Dimension: 384
Performance: Achieves a strong trade-off between speed and accuracy on tasks like semantic search, clustering, and content comparison.

This model is pretrained on a large corpus using a contrastive objective to bring semantically similar sentence pairs closer in vector space. It’s ideal for downstream SEO applications due to its real-time inference capability and domain-agnostic robustness.

Architecture Breakdown

Transformer Encoder

Backbone: BERT-based transformer model
Max Length: 256 tokens per input
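Preprocessing: The SentenceTransformer wrapper reports do_lower_case = False, but the underlying MiniLM tokenizer is uncased, so input text is lowercased during tokenization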

The transformer converts input tokens into contextualized word-level embeddings. Unlike vanilla BERT used for classification, this setup focuses on text similarity and dense retrieval tasks.

Pooling Layer

Mode: pooling_mode_mean_tokens=True

Instead of using only the [CLS] token, the model takes the mean of all token embeddings to produce a sentence-level vector. This significantly improves performance in semantic matching tasks, especially with real-world noisy content like web pages.

Normalization Layer

The final sentence embedding is L2-normalized to ensure uniform scaling during cosine similarity calculations — a crucial step for reliable distance-based clustering and ranking.
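The pooling and normalization layers can be illustrated without loading the model. The sketch below is a simplified stand-in written in plain Python: the function names mean_pool and l2_normalize are assumptions, and the token vectors in the usage note are made-up numbers rather than real model outputs.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of real tokens (mask == 1), ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)
```

For example, pooling the token vectors [1, 2] and [3, 4] while masking out a padding vector gives [2, 3]. Once embeddings are unit length, the cosine similarity of two embeddings equals their dot product, which is why normalization simplifies the later ranking step.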

Why This Model for SEO Applications?

This model was specifically selected for its suitability in analyzing noisy, diverse web content at scale:

Semantic Granularity: Captures subtle meaning variations across blocks — essential for detecting tone, intent, or topical divergence in long-form content.
Efficiency: Offers sub-second inference per document, enabling real-time or batch-scale processing of large URL sets.
Transferability: Generalizes well across content verticals (e.g., tech, finance, health), reducing the need for domain-specific fine-tuning.
Compatibility: Outputs embeddings compatible with downstream tasks like BERTopic clustering, cosine similarity, and PCA-based dimensionality reduction.

Function: load_classifier_pipeline()

Summary

This function initializes the zero-shot tone classification pipeline, which plays a central role in disentangling the tone aspect from the content blocks. Rather than using a fixed-label sentiment model, this approach enables flexible, label-guided classification where domain-specific tones (e.g., informative, persuasive, promotional) can be assessed without fine-tuning.

The resulting classifier is used to assign a tone label and tone vector to each content block, allowing separation of content meaning (topics) from delivery style (tone) — a core capability in this project’s disentangled representation pipeline.

Highlighted Code Explanation:

Zero-Shot Classifier Initialization

tone_classifier = pipeline('zero-shot-classification', model=model_name)

Pipeline Used: zero-shot-classification from Hugging Face Transformers.
Model: Default is “FacebookAI/roberta-large-mnli”, a robust model trained on Natural Language Inference (NLI) tasks.
Why Zero-Shot? This approach allows passing custom tone labels at inference time, offering flexibility to support multiple SEO tone categories (e.g., confident, advisory, conversational) without retraining the model.

Tone Classification Model: FacebookAI/roberta-large-mnli

The project uses the FacebookAI/roberta-large-mnli model for tone classification through a zero-shot inference pipeline. This model is designed to classify text into custom tone categories without requiring task-specific fine-tuning, making it a practical and scalable choice for real-world SEO content analysis.

Model Overview

· RoBERTa (Robustly Optimized BERT Pretraining Approach) is a transformer-based language model developed by Facebook AI. It builds on the architecture of BERT but introduces key improvements, such as training on more data for longer durations and removing next sentence prediction (NSP) during pretraining.

· The version used here — roberta-large-mnli — is specifically fine-tuned on the Multi-Genre Natural Language Inference (MNLI) dataset. This makes it ideal for zero-shot classification, where the model predicts the relationship between input text and custom labels expressed as hypotheses.

Why Used in This Project

· Zero-Shot Flexibility: The model can evaluate whether a block of content expresses a tone like informative, promotional, or advisory, even though it was not originally trained for tone analysis. This is done by reframing tone classification as a natural language inference task.

· No Fine-Tuning Needed: Since tones are not fixed labels in the dataset, a zero-shot approach allows on-the-fly labeling without requiring retraining on tone-specific data.

· SEO-Driven Tone Categories: The project uses customized tone labels that are meaningful in the SEO and marketing domain. The model adapts to these categories directly via prompt-based zero-shot evaluation.

Tone Label Strategy

The tone classification process operates by scoring each block against a set of predefined tone labels.
The label with the highest entailment score is assigned as the block’s tone label.
A full tone distribution vector is also extracted to represent confidence scores across all tone categories — useful for downstream scoring and visualization.

Model Architecture

While the inner workings of RoBERTa are abstracted during inference, here’s a high-level view of how the zero-shot setup operates:

Input: Content block + hypothesis (e.g., “This text is promotional”).
Output: Probabilities for entailment, neutral, and contradiction.
Classification: The highest entailment score among candidate tone labels determines the final tone.

This architecture allows the system to handle open-ended, domain-specific tone classification robustly, which is essential for modeling nuanced stylistic variations in SEO content.
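The input, output, and classification steps above can be sketched with hand-written logits in place of a real model call. Everything here is an assumption made for illustration: the helper names, the (contradiction, neutral, entailment) ordering of each logit tuple, and the scoring rule, which follows the common multi-label recipe of taking a softmax over the contradiction and entailment logits for each label independently.

```python
import math

# Hypothesis wording taken from the example above; the pipeline's own template may differ.
HYPOTHESIS_TEMPLATE = "This text is {}."


def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate tone label into an NLI hypothesis sentence."""
    return {label: template.format(label) for label in labels}


def softmax(values):
    """Numerically stable softmax over a list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_scores(nli_logits):
    """nli_logits maps label -> (contradiction, neutral, entailment) logits.

    Each label is scored on its own: softmax over contradiction and
    entailment, keeping the entailment probability.
    """
    scores = {}
    for label, (contradiction, _neutral, entailment) in nli_logits.items():
        scores[label] = softmax([contradiction, entailment])[1]
    return scores


def pick_tone(scores):
    """Return the label with the highest entailment-based score."""
    return max(scores, key=scores.get)
```

Because each label is scored independently in this setup, the scores do not sum to 1, so a block can rate highly as both "informative" and "confident" at the same time.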

Function: generate_disentangled_representations() — Create Disentangled Topic and Tone Representations

Summary:

The generate_disentangled_representations() function generates two separate yet complementary representations for each content block — a content embedding for topic modeling and a tone label/vector for stylistic characterization. This dual representation is central to the project’s goal of disentangling different semantic aspects of content for focused retrieval, tone-aware ranking, and topic analysis.

It processes raw block data and enriches each with semantic vectors and tone classification results, enabling precise downstream computations for SEO applications such as clustering, retrieval, and relevance scoring.

Highlighted Code Explanation:

· Tone Classification Using Zero-Shot Model:

tone = tone_classifier(block['text'], candidate_labels=all_possible_tones, multi_label=True)

This line uses a zero-shot classification pipeline to assign tone probabilities across a fixed set of SEO-relevant tone labels (e.g., “informative”, “promotional”). The multi_label=True flag allows the model to consider multiple tones simultaneously, which is important for nuanced content styles.

· Tone Output Formatting:

tone_label = tone['labels'][0]
tone_vector = [tone['scores'][tone['labels'].index(label)] for label in all_possible_tones]

The tone label is selected as the highest scoring class, and a tone vector is constructed in a fixed order to preserve alignment with expected categories. This vector is essential for tone-aware relevance scoring.
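The reordering step is easy to get wrong, because the classifier returns labels sorted by score rather than in the order they were supplied. The sketch below shows the idea on a mocked result dictionary; the function name format_tone_output and the sample scores are assumptions, and the label list reuses the tone categories named earlier in this document.

```python
# Fixed label order; every tone vector in the pipeline must follow it.
ALL_POSSIBLE_TONES = [
    "informative", "persuasive", "promotional", "neutral",
    "advisory", "confident", "conversational",
]


def format_tone_output(result, all_possible_tones=ALL_POSSIBLE_TONES):
    """Return (top label, scores reordered to match all_possible_tones).

    `result` is expected to hold parallel "labels" and "scores" lists,
    sorted from highest to lowest score.
    """
    score_by_label = dict(zip(result["labels"], result["scores"]))
    tone_label = result["labels"][0]
    tone_vector = [score_by_label[label] for label in all_possible_tones]
    return tone_label, tone_vector
```

Building a label-to-score dictionary once avoids calling list.index() for every label, though for seven labels the difference is negligible. With a fixed order, position 0 of every tone vector always means "informative", so vectors from different blocks can be compared directly.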

· Semantic Embedding Computation:

content_vector = embedding_model.encode(block['text'], show_progress_bar=False, convert_to_numpy=True)

This line encodes the block text into a dense content embedding that represents its topical meaning. These vectors are used in topic clustering, similarity scoring, and semantic relevance evaluation.

Function: load_topic_model() — Load BERTopic for Topic Clustering

Summary:

The load_topic_model() function initializes a new instance of the BERTopic model, which is used in this project to identify coherent topic clusters from content block embeddings. BERTopic is a topic modeling library that combines transformer-based embeddings with clustering algorithms (like HDBSCAN) and class-based TF-IDF to generate interpretable topics.

In this pipeline, BERTopic plays a central role in organizing content by semantic themes, enabling topic-aware summarization, relevance scoring, and visual content analysis for SEO use cases.

Highlighted Code Explanation:

· BERTopic Initialization:

return BERTopic(embedding_model=None, verbose=False)

o embedding_model=None indicates that the model will not internally compute embeddings. Instead, external embeddings (already generated by SentenceTransformer) will be passed during fitting. This is critical because we use disentangled content embeddings for better control and consistency throughout the pipeline.

o verbose=False disables logging output from BERTopic, keeping the pipeline clean for large-scale, batch processing in real-world client scenarios.

Model Explanation: BERTopic for Topic Clustering

Overview of BERTopic

BERTopic is an unsupervised topic modeling technique that builds on modern transformer-based embeddings. It allows grouping semantically similar content into distinct clusters (topics) without requiring labeled data. Unlike traditional models like LDA, BERTopic supports rich language understanding through contextual embeddings and dynamic topic management.

In this project, BERTopic plays a core role in identifying semantic topic groups across web page blocks. It uses precomputed dense embeddings (from SentenceTransformer) and applies dimensionality reduction followed by clustering, helping us disentangle topic-related structure from tone or surface-level text.

How BERTopic Works Internally

The typical BERTopic workflow includes the following major stages:

· Embedding: In our project, precomputed sentence embeddings are generated using a separate SentenceTransformer. These embeddings are passed to BERTopic manually by setting embedding_model=None, giving us full control and performance optimization.

· Dimensionality Reduction: BERTopic applies UMAP or other methods to reduce embedding dimensionality. However, in this project, we manually apply PCA, which allows us to fine-tune clustering stability and performance.

· Clustering: The reduced vectors are clustered using HDBSCAN, a density-based clustering algorithm that can automatically determine the number of clusters and handle noise (i.e., outlier blocks).

· Topic Representation: For each topic cluster, BERTopic extracts representative keywords using class-based TF-IDF. In our case, we enhance this process with KeyBERT, enabling interpretable and query-aligned topic labels.

Why BERTopic Was Used in This Project

This project requires modular, interpretable, and context-aware topic clustering over real-world web content. BERTopic is chosen because:

It works directly with dense, semantic vectors, unlike LDA or NMF.
It scales well with thousands of page-level content blocks across domains.
It produces interpretable keywords per topic, helping clients understand topical structure.
Its clustering is adaptive — automatically ignoring noise or poorly clustered items.
It fits the disentangled representation goal by handling topics independently from tone (which is classified separately).

Architecture & Customization in This Project

In the current pipeline, BERTopic is customized and modularized:

embedding_model=None allows injecting external vectors from SentenceTransformer.
PCA is applied before topic modeling to control vector density and noise.
Topic labeling is post-processed using KeyBERT for clear output.
BERTopic’s output is enriched with tone distribution stats, enhancing interpretability.

This flexible setup aligns perfectly with client goals, making BERTopic both a semantic engine and a visual tool to explore how topics span across different URLs and tones.

Practical Benefits in the Pipeline

Improved Cluster Quality: Because tone is classified separately from topic, the BERTopic clusters reflect actual content themes rather than sentiment or tone.
Modular Control: Allows experimentation with different embedding models, clustering techniques, or labeling logic without modifying the entire architecture.
SEO and Content Strategy Insights: The final clusters can help clients identify which content areas are strong, which need refinement, and how topic distribution aligns with user queries.

Function: cluster_by_topic() — Assign Topic Clusters and Generate Topic-Level Representations

Summary:

The cluster_by_topic() function performs unsupervised topic clustering of content blocks using the BERTopic model, following dimensionality reduction through PCA. Each block is assigned a topic_id, and the function additionally computes tone label distribution per topic and the mean embedding vector per topic. These outputs are central to enabling topic-specific search relevance, tone profiling, and visualization.

The function operates on block-level data enriched with content embeddings and tone labels, delivering the core topic structure that drives the project’s disentangled representation goal.

Highlighted Code Explanation:

PCA-Based Dimensionality Reduction

This step reduces the dimensionality of content embeddings using PCA to improve clustering stability and runtime efficiency. The reduced embeddings preserve semantic structure but avoid overfitting and noise issues typical with high-dimensional vectors.
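As a rough illustration of this step, the sketch below implements PCA with NumPy's SVD. It is not the project's code, which more likely relies on a library PCA class; the helper name pca_reduce and the random stand-in embeddings are assumptions. The default of 25 components mirrors the n_components default mentioned for this pipeline.

```python
import numpy as np

def pca_reduce(embeddings, n_components=25):
    """Project row vectors onto their top principal components (hypothetical helper)."""
    X = np.asarray(embeddings, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

rng = np.random.default_rng(0)
block_embeddings = rng.normal(size=(40, 384))  # stand-in for real block embeddings
reduced, pca_mean, pca_components = pca_reduce(block_embeddings)

# A query vector must be projected with the same mean and components,
# which is why the fitted reducer is kept for later query projection.
query_vector = rng.normal(size=384)
query_reduced = (query_vector - pca_mean) @ pca_components.T
```

Keeping the fitted mean and components together is the reason the pipeline returns the PCA object: blocks and queries only become comparable when both pass through the same projection.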

BERTopic Clustering

Here, BERTopic is applied on the reduced vectors to identify topic clusters. The fit() method trains the clustering model using both embeddings and their associated texts, and transform() assigns a topic_id to each block.

Block Annotation with Topic ID

Each content block in the dataset is updated with the assigned topic cluster ID. This enables future filtering, grouping, or matching based on topic membership.

Compute Topic-Level Tone Distribution

This loop counts how often each tone label (e.g., “informative”, “promotional”) appears within each topic cluster. The resulting statistics enable clients to analyze tone dominance per topic, useful for SEO or content audits.
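A minimal sketch of such a counting loop is shown below. The block layout (dictionaries with topic_id and tone_label keys) and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def tone_distribution_per_topic(blocks):
    """Count tone labels inside each topic cluster (hypothetical helper)."""
    distribution = defaultdict(Counter)
    for block in blocks:
        # Blocks without a tone label are counted under "unknown".
        distribution[block["topic_id"]][block.get("tone_label", "unknown")] += 1
    return {topic_id: dict(counts) for topic_id, counts in distribution.items()}

blocks = [
    {"topic_id": 0, "tone_label": "informative"},
    {"topic_id": 0, "tone_label": "promotional"},
    {"topic_id": 0, "tone_label": "informative"},
    {"topic_id": 1, "tone_label": "promotional"},
]
tone_stats = tone_distribution_per_topic(blocks)
print(tone_stats)
```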

Compute Topic-Level Mean Embeddings

Each topic is represented by the average of all block embeddings it contains. These mean vectors serve as topic centroids, which are later used to match user queries to topics based on semantic similarity.
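The averaging can be sketched in a few lines of plain Python. The embedding key and the function name are assumptions, and whether the original code also builds a centroid for the outlier topic (-1) is not stated, so this sketch simply averages every topic it sees.

```python
from collections import defaultdict

def topic_centroids(blocks):
    """Average the embeddings of all blocks that share a topic_id (hypothetical helper)."""
    grouped = defaultdict(list)
    for block in blocks:
        grouped[block["topic_id"]].append(block["embedding"])
    centroids = {}
    for topic_id, vectors in grouped.items():
        # zip(*vectors) walks the vectors one dimension at a time.
        centroids[topic_id] = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return centroids

blocks = [
    {"topic_id": 0, "embedding": [1.0, 0.0]},
    {"topic_id": 0, "embedding": [3.0, 2.0]},
    {"topic_id": 1, "embedding": [0.0, 4.0]},
]
centroids = topic_centroids(blocks)
print(centroids)
```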

Function: assign_topic_labels() — Generate Human-Readable Topic Labels

Summary:

The assign_topic_labels() function assigns clean, interpretable string labels to each topic cluster identified by BERTopic. It primarily uses KeyBERT, a keyword extraction model based on BERT embeddings, to extract meaningful phrases from the content blocks belonging to each topic. If KeyBERT fails or returns low-quality output, the function falls back to using top words directly from the BERTopic model’s vocabulary, after removing stopwords.

This step is essential for making topic clusters intelligible and client-facing. These labels can be used in visualizations, summaries, dashboards, and UI displays to describe each discovered topic.

Highlighted Code Explanation:

KeyBERT Initialization and Topic-to-Text Grouping

keybert_model = KeyBERT() … topic_to_docs.setdefault(topic, []).append(doc["text"])

The function initializes a KeyBERT model to handle phrase extraction. Then, it groups all input documents by their topic_id, skipping outliers labeled as -1. Grouping is required to create a unified corpus per topic for keyword extraction.
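The grouping part can be sketched on its own, without loading KeyBERT. The document layout (dictionaries with topic_id and text keys) and the function name are assumptions for illustration.

```python
def group_texts_by_topic(docs):
    """Collect block texts per topic_id, skipping the outlier topic -1 (hypothetical helper)."""
    topic_to_docs = {}
    for doc in docs:
        topic = doc["topic_id"]
        if topic == -1:
            continue  # outlier blocks do not contribute to any topic corpus
        topic_to_docs.setdefault(topic, []).append(doc["text"])
    return topic_to_docs

docs = [
    {"topic_id": 0, "text": "Technical audits find crawl issues."},
    {"topic_id": -1, "text": "Contact us today."},
    {"topic_id": 0, "text": "Audits improve search visibility."},
    {"topic_id": 1, "text": "Keyword clustering groups related terms."},
]
topic_to_docs = group_texts_by_topic(docs)
# Joining the texts gives one corpus per topic for keyphrase extraction.
merged = {topic: " ".join(texts) for topic, texts in topic_to_docs.items()}
```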

Primary Label Generation Using KeyBERT

For each topic, the merged text corpus is processed using KeyBERT to extract the most relevant keyphrase (1 to 3 words). If the extracted phrase is valid — not too short, empty, or a common stopword — it is assigned as the topic’s label.

Fallback: Extract Top-N Words from BERTopic Model

If KeyBERT fails or returns noisy results, the fallback method uses get_topic() from BERTopic, which returns top-ranked words per topic based on TF-IDF or class-based term weighting. After removing stopwords and filtering for word length, a label is formed using the top n terms (default: 3).

This ensures every topic receives a usable label, even if KeyBERT fails due to poor text quality or short inputs.
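The two-step labeling logic might look roughly like the sketch below. It takes the KeyBERT phrase and the (word, weight) pairs as plain inputs so that it runs without either library; the tiny stopword set, the length thresholds, and all function names are assumptions rather than the project's actual values.

```python
# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "and", "for", "with", "your", "that", "this", "are"}

def is_valid_label(phrase, min_chars=3):
    """Reject empty, very short, or stopword-only phrases."""
    phrase = (phrase or "").strip().lower()
    return len(phrase) >= min_chars and phrase not in STOPWORDS

def fallback_label(topic_words, top_n=3, min_word_len=3):
    """Build a label from (word, weight) pairs such as those returned by get_topic()."""
    kept = [
        word for word, _ in (topic_words or [])
        if len(word) >= min_word_len and word.lower() not in STOPWORDS
    ]
    return " ".join(kept[:top_n])

def choose_label(keybert_phrase, topic_words):
    """Prefer the extracted keyphrase; otherwise fall back to top topic words."""
    if is_valid_label(keybert_phrase):
        return keybert_phrase.strip()
    return fallback_label(topic_words)

topic_words = [("the", 0.09), ("seo", 0.08), ("audit", 0.07),
               ("of", 0.05), ("technical", 0.04), ("site", 0.03)]
print(choose_label("advanced seo audit", topic_words))
print(choose_label("", topic_words))
```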

Function: topic_clustering_pipeline() — Complete Topic Modeling Workflow

Summary:

The topic_clustering_pipeline() function encapsulates the full topic discovery workflow in this project. It performs clustering of content blocks using BERTopic, applies dimensionality reduction via PCA, and assigns interpretable labels to each topic using a dual strategy (KeyBERT + fallback). This modular pipeline returns enriched metadata, allowing downstream systems to understand and visualize topic-specific groupings across content.

Highlighted Code Explanation:

Clustering via cluster_by_topic()

clustering_result, pca_reducer, topic_model_fit = cluster_by_topic(…)

This step reduces the dimensionality of input vectors (default n_components=25) using PCA and applies the BERTopic model to discover semantically grouped clusters. It outputs:

clustering_result: Contains topic IDs assigned per block
pca_reducer: Fitted PCA object for dimensionality reduction (used later for query projection)
topic_model_fit: Trained BERTopic instance

Assigning Human-Readable Labels

topic_labels = assign_topic_labels(clustered_data, topic_model, topic_ids)

Uses a hybrid label generation strategy:

Tries KeyBERT to extract top keyphrases from grouped blocks.
Falls back to get_topic() words from BERTopic with stopword filtering.

The labels are added to the result for display or downstream usage.

This pipeline is central to the disentangled representation framework. It enables:

Topic-aware retrieval, where only semantically aligned blocks are considered.
Coherent document clustering, improving interpretability for clients.
Cross-document comparison, by grouping multiple URLs into labeled topic spaces.
Support for query-to-topic mapping, by using the returned PCA and topic model downstream.

Function: encode_query_representation() — Disentangled Query Encoding

Summary:

The encode_query_representation() function creates a multi-aspect representation of a user query by encoding it across three disentangled dimensions:

Semantic content via embeddings

Tone style via a zero-shot tone classifier

Topic relevance via BERTopic and precomputed topic embeddings

This unified representation allows the query to be aligned precisely with semantically and tonally compatible document clusters, powering focused retrieval and summarization.

Highlighted Code Explanation:

Content Embedding

content_vector = embedding_model.encode([query], show_progress_bar=False)[0]

The core semantic meaning of the query is captured using a pre-loaded sentence transformer. This high-dimensional embedding forms the base for similarity computation against content blocks and topic centroids.

Tone Prediction using Zero-Shot Classification

The function applies a tone classifier (e.g., HuggingFace zero-shot pipeline) to predict the dominant tone of the query. It returns both:

tone_label: The most probable tone (e.g., “confident”, “advisory”)
tone_vector: Full distribution over tone categories

This enables tone-aware matching between the query and document clusters.

If tone classification fails due to input or model error, the result defaults to “unknown” and a zero vector.
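The sketch below shows one way to turn a zero-shot result into a label plus a fixed-order vector, with the fallback described above. It assumes the classifier is called as classifier(text, candidate_labels=...) and returns a dictionary with "labels" and "scores", as the Hugging Face zero-shot pipeline does; the stub classifiers, the tone list, and the function name are hypothetical stand-ins so the example runs without a model.

```python
def predict_tone(classifier, query, tone_labels):
    """Return (tone_label, tone_vector); fall back to ("unknown", zeros) on failure."""
    try:
        result = classifier(query, candidate_labels=tone_labels)
        # Re-align the returned scores to the fixed order of tone_labels so
        # that tone vectors from different texts are directly comparable.
        score_by_label = dict(zip(result["labels"], result["scores"]))
        tone_vector = [float(score_by_label.get(label, 0.0)) for label in tone_labels]
        tone_label = max(score_by_label, key=score_by_label.get)
        return tone_label, tone_vector
    except Exception:
        return "unknown", [0.0] * len(tone_labels)

# Hypothetical stand-ins for a real zero-shot classifier.
def stub_classifier(text, candidate_labels):
    return {"sequence": text,
            "labels": ["advisory", "informative", "promotional"],
            "scores": [0.6, 0.3, 0.1]}

def broken_classifier(text, candidate_labels):
    raise RuntimeError("model unavailable")

TONES = ["informative", "promotional", "advisory"]
label, vector = predict_tone(stub_classifier, "How to manage SEO", TONES)
failed = predict_tone(broken_classifier, "How to manage SEO", TONES)
```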

Topic Assignment via BERTopic

If both the pca_reducer and topic_model are provided, the function projects the query into reduced space and assigns it to the closest BERTopic cluster. This identifies the most semantically relevant topic for downstream filtering.

Topic Embedding Lookup (via Cosine Similarity)

score = cosine_similarity(content_vector.reshape(1, -1), centroid.reshape(1, -1))[0][0]

Using precomputed topic_embeddings (centroids from block clusters), the query is scored against each topic, and the most similar topic vector is selected as its topic_vector. This allows for soft topic alignment — even when BERTopic fails to assign cleanly.
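A minimal sketch of this lookup follows. It uses a small pure-Python cosine function in place of the cosine_similarity call shown above so that it runs without extra dependencies; the function names and the toy centroids are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_topic(content_vector, topic_embeddings):
    """Return (topic_id, centroid, score) for the most similar topic centroid."""
    best = (None, None, float("-inf"))
    for topic_id, centroid in topic_embeddings.items():
        score = cosine(content_vector, centroid)
        if score > best[2]:
            best = (topic_id, centroid, score)
    return best

topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # toy centroids
topic_id, topic_vector, score = closest_topic([0.9, 0.1], topic_embeddings)
```

Because every centroid receives a score, this lookup always yields a best match when at least one centroid exists, which is what makes it usable as a soft alternative to a hard cluster assignment.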

Function: compute_relevance_scores() — Multi-Dimensional Relevance Scoring

Summary:

The compute_relevance_scores() function assigns a numeric relevance score to each document block by comparing it with the query across three disentangled dimensions:

Content relevance (semantic similarity)
Topic compatibility (cluster alignment)
Tone alignment (stylistic compatibility)

This composite scoring approach ensures results are not only topically relevant but also aligned with the user’s intent and tone, enabling high-precision ranking of document segments.

Highlighted Code Explanation:

Input Structure

query_repr: Dict[str, Any]
clustered_results: Dict[str, Any]

· query_repr contains the encoded query (content, topic, tone vectors)

· clustered_results includes block-level data (clustered_data), along with:

topic_labels (label dictionary)
topic_embeddings (centroid vectors for each topic)

Similarity Computation

For each block in clustered_data:

Content Similarity is computed between the query and block content vectors
Topic Similarity is computed between the query and block topic centroids
Tone Similarity compares the tone distribution vectors

cosine_similarity([query_vector], [block_vector])[0][0]

If any vector is missing or improperly typed, similarity defaults to 0.0 to ensure robustness.

Weighted Scoring

The final score is computed using the configurable weight scheme:

This produces a single float value, typically between 0 and 1 when the weights sum to 1, representing how relevant a block is to the user’s query: not just by meaning, but also by tone and topic context.
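The weighting expression itself does not appear in the text here, so the sketch below is only one plausible reading: a weighted sum of the three similarities, with any missing or malformed vector contributing 0.0. The weights (0.6, 0.2, 0.2), the helper names, and the dictionary keys are assumptions for illustration, not the project's configuration. For simplicity the block carries its own topic_vector, whereas the pipeline looks the centroid up from topic_embeddings.

```python
import math

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 for missing, mismatched, or zero vectors."""
    try:
        if a is None or b is None or len(a) != len(b) or len(a) == 0:
            return 0.0
        dot = sum(float(x) * float(y) for x, y in zip(a, b))
        norm = (math.sqrt(sum(float(x) ** 2 for x in a))
                * math.sqrt(sum(float(y) ** 2 for y in b)))
        return dot / norm if norm else 0.0
    except (TypeError, ValueError):
        return 0.0

# Hypothetical weights; the real scheme is configurable and not shown in the text.
WEIGHTS = {"content": 0.6, "topic": 0.2, "tone": 0.2}

def relevance_score(query_repr, block, weights=WEIGHTS):
    """Weighted sum of content, topic, and tone similarity for one block."""
    sims = {
        name: safe_cosine(query_repr.get(f"{name}_vector"), block.get(f"{name}_vector"))
        for name in weights
    }
    return sum(weights[name] * sims[name] for name in weights)

query_repr = {"content_vector": [1.0, 0.0], "topic_vector": [1.0, 0.0],
              "tone_vector": [0.0, 1.0]}
block = {"content_vector": [1.0, 0.0], "topic_vector": [1.0, 0.0],
         "tone_vector": None}  # missing tone vector contributes 0.0
score = relevance_score(query_repr, block)
```

With weights that sum to 1 and cosine similarities bounded by 1, the result cannot exceed 1; it only drops below 0 if the similarities themselves are negative.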

Function: display_scored_results() — Human-Readable Output Display

The display_scored_results() function renders the top-N most relevant content blocks in a human-readable format for each (query, URL) pair. It presents:

The query string and source URL
The top N blocks sorted by relevance score
For each block: the score, tone label, topic label, and truncated text

This is intended purely for console-based inspection or debugging and is not part of the downstream processing or result packaging. It helps quickly verify the effectiveness of scoring and alignment during experimentation or reporting.
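A console display of this kind can be sketched as follows. The function name, the block keys, the truncation length, and the placeholder data are all assumptions; the original function's exact output format is not shown.

```python
def format_top_blocks(query, url, blocks, top_n=3, max_chars=80):
    """Return printable lines for the top-N blocks of one (query, URL) pair."""
    ranked = sorted(blocks, key=lambda b: b["score"], reverse=True)[:top_n]
    lines = [f"Query: {query}", f"URL: {url}"]
    for rank, block in enumerate(ranked, start=1):
        text = block["text"]
        if len(text) > max_chars:
            text = text[:max_chars].rstrip() + "..."  # truncate long blocks
        lines.append(
            f"{rank}. score={block['score']:.2f} | tone={block['tone_label']} "
            f"| topic={block['topic_label']} | {text}"
        )
    return lines

# Placeholder data, not results from the project.
sample_blocks = [
    {"score": 0.41, "tone_label": "promotional", "topic_label": "topic a",
     "text": "Short promotional text."},
    {"score": 0.78, "tone_label": "informative", "topic_label": "topic a",
     "text": "x" * 120},
    {"score": 0.63, "tone_label": "advisory", "topic_label": "topic b",
     "text": "Some advisory text."},
]
lines = format_top_blocks("example query", "https://example.com/page",
                          sample_blocks, top_n=2)
for line in lines:
    print(line)
```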

Function: visualize_disentangled_results() — Visual Output for Query-Content Alignment

The visualize_disentangled_results() function generates practical, informative visualizations from scored results for client interpretation. It produces three key plot types:

· Block-level Relevance Scores per Query and URL — Compares how well top blocks align with different queries for each source page.

· Tone Distribution per Query — Shows how the tone of retrieved blocks varies, aiding in tone-targeted content strategy.

· Topic vs Average Relevance Heatmap — Highlights which topics yield the most relevant results, helping refine topic focus.

These plots provide clear, digestible insights to non-technical stakeholders and help validate that content blocks retrieved by the model align with both semantic and tone-based client expectations.
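The plotting code is not reproduced here, but the aggregation behind the topic versus average relevance heatmap can be sketched without any plotting library. The function name, the block keys, and the sample values are assumptions; the resulting dictionary is the kind of table a heatmap would then render.

```python
from collections import defaultdict

def mean_relevance_by_topic(scored_blocks):
    """Average block scores for each (topic_label, query) cell of the heatmap."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for block in scored_blocks:
        key = (block["topic_label"], block["query"])
        sums[key] += block["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Placeholder scores, not results from the project.
scored_blocks = [
    {"topic_label": "topic a", "query": "q1", "score": 0.5},
    {"topic_label": "topic a", "query": "q1", "score": 1.0},
    {"topic_label": "topic b", "query": "q1", "score": 0.25},
]
heatmap_cells = mean_relevance_by_topic(scored_blocks)
print(heatmap_cells)
```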

Result Analysis and Discussion: Focused Insights from Query-Specific Evaluation

In response to the query “How to manage SEO”, the evaluation reveals highly focused, actionable content that addresses advanced SEO strategies in a practical and contextually rich manner. The top-scoring blocks from the analyzed URL match the query closely on three essential dimensions (relevance, tone, and topical alignment), offering a solid foundation for performance optimization.

Depth of Strategic Insight

The content surfaces multiple high-value insights tailored for decision-makers and SEO professionals. One of the highest-scoring blocks emphasizes the importance of customized SEO strategies, stating that “There’s no one-size-fits-all approach…” — directly addressing the nuanced and evolving needs of businesses operating in diverse digital environments. This aligns well with user intent focused on management, not just execution, of SEO initiatives.

Other blocks reinforce the criticality of technical SEO audits, visibility improvements, and scalable strategies — showing a clear orientation toward operational excellence. This strategic framing is particularly important when targeting organizations seeking robust frameworks for sustainable digital visibility.

Tone and Message Consistency

The tone distribution — a blend of informative and promotional — is well balanced. Informative blocks provide substance and technical clarity (e.g., site audits, performance metrics), while promotional elements highlight measurable outcomes, such as “tenfold improvement in search visibility.” This dual-tone mix is ideal for users at various stages of the decision journey: those researching SEO frameworks and those evaluating service offerings.

Topic Cohesion and Clarity

All content blocks were categorized under the unified topic label: advanced SEO audit. This level of topical cohesion not only improves the semantic alignment of the page but also ensures high continuity for search engines and human readers alike. For content teams, this indicates a strong signal that the page remains focused and does not drift into unrelated messaging — a valuable trait for domain authority and query relevance.

Practical Takeaway

The results reflect a high level of optimization and message clarity for the query. However, they also reveal a performance ceiling: all top blocks cluster tightly in the 0.77–0.79 score range, which suggests consistency but limited diversification in message targeting. Future enhancements could introduce more nuanced subtopics (e.g., reporting, automation in SEO management) to further enrich the content spectrum.

For businesses and content teams, this type of granular analysis provides a diagnostic lens to validate content effectiveness, reinforce messaging strategy, and identify expansion opportunities — all with measurable, interpretable outcomes.

Result Analysis and Explanation

Understanding Topic Alignment Across Content Blocks

The analysis demonstrates a strong alignment between content topics and the thematic focus of the user queries. For queries centered around improving SEO for dynamic URLs, the retrieved content consistently gravitated toward topics such as SEO performance improvement, indexing efficiency, and canonicalization practices. These alignments reflect a high degree of contextual match between the underlying page content and the intent behind optimization-focused search queries.

For more tool-specific queries, such as the advantages of using SEO Tool Lab for keyword clustering, the retrieval process successfully identified segments that promote or describe the functionalities, automation benefits, and strategic applications of such tools. This indicates a useful disentanglement of topical relevance — distinguishing technical implementation advice from tool-centric value propositions, even when both are embedded in a single source.

This level of topical clarity helps digital strategists identify exactly which portions of their content address specific audience intents, enabling precision updates without reworking entire pages.

Tone Distribution: Practicality Meets Promotional Strategy

Tone analysis reveals a well-balanced spread between informative, promotional, and advisory tones across blocks retrieved for each query. Informative tones were more dominant for queries seeking general SEO improvement advice, whereas promotional tones appeared prominently when the intent was tool or service-oriented.

This tonal distinction is practically valuable. For example, promotional content can drive conversions for queries related to product benefits, while informative content supports educational or guidance-based queries. The presence of both tones, strategically mapped to different queries, reflects positively on content adaptability and audience targeting.

Moreover, the presence of advisory tones in optimization-related queries adds a layer of consultative value to the content. This strengthens the perception of authority and expertise when addressing best practices or strategic implementation advice.

Scoring Pattern and Relevance Thresholds

To interpret relevance scores meaningfully, it’s important to consider score distributions across varying contexts and queries. Based on this and similar results, the following generalized relevance score bins can help interpret block effectiveness:

· Score > 0.80: Exceptional Relevance. Indicates highly focused and query-specific content. These blocks often contain direct answers, strategic recommendations, or high-conversion messaging.

· Score 0.70 – 0.80: Strong Relevance. These blocks are highly aligned in context and value. They deliver useful, actionable information or highlight key functionalities. Ideal for repurposing or highlighting in SEO metadata.

· Score 0.60 – 0.70: Moderate Relevance. Content in this range typically supports the query in a broader sense. While not directly answering the query, these blocks contribute valuable background, framing, or secondary insight.

· Score 0.50 – 0.60: Partial or Supporting Relevance. These blocks touch on related themes but may lack specificity. Useful for discovery or internal linking but not ideal for standalone response snippets.

· Score < 0.50: Low or Peripheral Relevance. Blocks in this range may introduce general content or diverge from the query topic. Often not optimal for surfacing as answers or featured blocks.
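The bins above can be expressed as a small lookup, sketched below. The function name is hypothetical, and because the text does not say which bin a boundary value such as exactly 0.80 or 0.70 belongs to, the sketch assumes each boundary falls into the bin that starts at that value (so 0.80 counts as Strong, not Exceptional).

```python
def relevance_bin(score):
    """Map a relevance score to its interpretation bin (boundary handling is an assumption)."""
    if score > 0.80:
        return "Exceptional Relevance"
    if score >= 0.70:
        return "Strong Relevance"
    if score >= 0.60:
        return "Moderate Relevance"
    if score >= 0.50:
        return "Partial or Supporting Relevance"
    return "Low or Peripheral Relevance"

for example in (0.85, 0.78, 0.65, 0.55, 0.30):
    print(f"{example:.2f} -> {relevance_bin(example)}")
```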

In the current result set, a significant number of blocks fall within the strong to exceptional range, showing a healthy level of content-query alignment, especially for tool-related queries. This reinforces the notion that certain pages are better optimized — topically and tonally — for specific types of audience intent.

Differentiation Across Pages and Query Types

Notably, the result landscape reveals a practical separation in how different URLs serve different queries. For instance, pages oriented toward technical SEO practices tend to produce high-scoring results for implementation-focused queries. In contrast, pages structured around tools and digital offerings exhibit stronger performance for brand- or solution-oriented queries.

This differentiation suggests that both content types — tactical guides and product narratives — can coexist effectively across a site as long as the content architecture supports clear segmentation. From a client perspective, this validates the benefit of query-specific content blocks being indexed and scored independently, rather than relying on broad page-level relevance.

Visual Insights That Support Strategic Planning

The underlying visual analysis of this result set provides clear and actionable takeaways:

· Block-level relevance distribution highlights which specific segments of a page drive search value. This allows teams to double down on high-performing blocks and revise or reposition weaker ones.

· Tone distribution per query helps assess whether the tonal strategy aligns with the intent of different audience segments. It becomes easier to identify if a query is being underserved tonally.

· Topic vs. average relevance heatmaps guide content rebalancing efforts. If a topic consistently underperforms in relevance despite being present, it may signal outdated, vague, or misaligned messaging that warrants content refinement.

These visual patterns collectively support a more nuanced optimization strategy — one that is both intent-driven and block-specific.

Q&A Section: Practical Takeaways from the Analysis

How does this analysis help us improve SEO content on a granular level?

This project enables content evaluation at the block level rather than treating entire pages as monolithic entities. By assigning relevance scores to specific content segments based on a query, you can pinpoint which blocks contribute most to search performance.

For example, for queries like “how to improve SEO for dynamic URL structures”, blocks that discussed canonicalization and indexing efficiency scored highly. This reveals which exact phrases and topics should be retained, enhanced, or repurposed into featured snippets, metadata, or new articles.

Actionable takeaway: Instead of broadly updating an entire page, selectively refine high-impact blocks, repurpose underperforming ones, and ensure top blocks are positioned prominently in the page layout or internal linking structure.

Can we determine if the content tone matches user intent?

Yes — tone classification is integrated into every block’s evaluation, allowing you to match tone with query intent. Informative and advisory tones align better with queries seeking knowledge or solutions, whereas promotional tones work well for tool-specific or product-oriented queries.

In the current analysis, for tool-related queries like “benefits of using SEO tool lab for keyword clustering”, the highest-scoring blocks had a promotional tone, reflecting clear intent-to-sell alignment. Conversely, technical SEO queries leaned more toward informative tones, suggesting a better fit for guidance content.

Actionable takeaway: Use tone mapping to realign mismatched content. For instance, if an educational query surfaces only promotional content, it may be a signal to develop new supporting content or reframe the tone of existing blocks.

What do the relevance scores really tell us — and how should we act on them?

Relevance scores measure how well a block responds to a given query. Blocks are bucketed into generalized score bins, which indicate different action strategies:

Scores > 0.80 are content assets, ideal for highlighting in FAQs, schema markup, or snippets.
Scores between 0.60 and 0.80 are contextually strong but can be enhanced for clarity or tone.
Scores < 0.60 may require significant revision or removal if they don’t support any key query clusters.

In this analysis, most top-scoring blocks (up to 0.81) provided clear value tied to both technical SEO practices and tool benefits — showing that current messaging is on point, but with room to elevate supporting blocks for broader coverage.

Actionable takeaway: Use these scores as a prioritization map for content revision. Focus on boosting blocks in the 0.65–0.72 range to push them into high-performing territory.

How does this help us manage content across different URLs?

The analysis highlights how different pages respond to different types of queries. For instance, the SEO Tool Lab page performed significantly better for tool-related queries, while the Dynamic URL page dominated technical SEO queries.

This natural separation of relevance signals that content is already semantically focused — a sign of strong content architecture. However, it also reveals gaps: certain queries yield moderately relevant content from multiple pages, suggesting potential overlap or missed targeting.

Actionable takeaway: Use these cross-URL comparisons to decide which pages should own which query clusters. Then, optimize each page for its primary query intent while consolidating or reassigning overlapping content to avoid dilution.

How does the topic modeling support focused SEO improvement?

The integrated topic modeling ensures that every content block is tagged with a latent topic — even before scoring. This allows you to track which topics tend to perform well for which types of queries. For example, “seo performance improves” emerged as a consistent top topic across both pages for optimization-related queries.

This helps decouple general keywords from thematic content. Even if two blocks use similar terms, the underlying topic helps differentiate between strategic intent — such as technical efficiency vs. product promotion.

Actionable takeaway: Use topic-level insights to plan content clusters or hubs. If a topic consistently scores highly, create a pillar page. If a topic appears often but scores poorly, refine the framing or context.

Final Thoughts

This analysis demonstrates a precise, layered approach to evaluating SEO content — not just at the page level, but down to the individual content block. By disentangling the relevance, tone, and topic of each section in response to specific queries, it becomes possible to execute SEO strategies with surgical accuracy.

The findings confirm that high-performing content often aligns not only in terms of keywords but also in tone and topical focus. Moreover, the ability to compare relevance across multiple URLs for the same query unlocks new dimensions of competitive content alignment and internal content governance.

The strength of this approach lies in its adaptability. It scales across domains, supports various intent types, and offers actionable insight into both existing strengths and areas needing improvement. By operationalizing this analysis, SEO teams can confidently make high-impact decisions — whether optimizing current pages, realigning tone, or planning future content investments.