KL Divergence for Topic Modeling: A metric that measures the divergence between two probability distributions

    This project implements KL Divergence to enhance the interpretability of topic modeling applied to webpage content. Each webpage is segmented into content sections, and a topic model maps these sections into a structured topic space. Simultaneously, search queries are represented in the same topic space, generating a query-topic distribution. By comparing these distributions, KL Divergence quantifies the alignment between webpage content and intended search intent.

    This approach enables the identification of sections that are strongly aligned with the target topics, as well as sections that diverge or underperform in terms of thematic relevance. Beyond individual sections, aggregated page-level divergence provides an overall measure of content alignment.

    The methodology integrates advanced natural language processing techniques to maintain semantic fidelity while offering interpretable, quantitative insights. It supports the evaluation of content coverage, relevance, and potential gaps, providing a robust framework for optimizing content strategy through topic-consistency analysis.

    Project Purpose

    The purpose of this project is to provide a systematic and quantitative approach to measure content alignment using KL Divergence within a topic modeling framework. In structured content environments, different sections of a page may vary in their relevance to specific topics or search queries. Evaluating this alignment at both the section and page levels allows for precise identification of content strengths and weaknesses.

    By applying KL Divergence to the probability distributions of topics in webpage sections versus query or target topic distributions, the methodology highlights sections that are strongly aligned, as well as those that deviate from the intended thematic focus. This supports content assessment, optimization, and strategic prioritization by indicating which areas require reinforcement, reorganization, or refinement.

    Additionally, the framework facilitates a comparative evaluation across multiple pages, enabling the detection of patterns, topic gaps, and potential inconsistencies in content strategy. The integration of interpretable NLP techniques ensures that the alignment scores are actionable and meaningful, bridging the gap between complex semantic modeling and practical content insights.

    Project’s Key Topics Explanation and Understanding

    This project leverages KL Divergence in combination with Topic Modeling to quantitatively assess content alignment with intended topics. A deeper understanding of the underlying concepts provides clarity on methodology, results, and practical interpretation.

    Kullback–Leibler (KL) Divergence

    KL Divergence is a fundamental concept in information theory and statistics. It measures how much an observed probability distribution (P) diverges from a reference or target distribution (Q). It is often loosely described as the "distance" from Q to P, although it is not a true distance metric. Mathematically, it is defined as:

    D_KL(P || Q) = Σ_i P(i) · log( P(i) / Q(i) )

    Key properties include:

    • Non-Symmetry: KL Divergence is directional. D_KL(P || Q) ≠ D_KL(Q || P), meaning it measures how well Q approximates P, but not vice versa.
    • Non-Negativity: The divergence is always ≥ 0. A value of 0 indicates identical distributions.
    • Interpretability: Smaller values indicate closer alignment between the distributions; larger values signal greater deviation.
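    The definition above can be sketched in a few lines of NumPy. The smoothing constant `eps` is an assumption added here to keep the logarithm finite when either distribution assigns zero probability to a topic:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions over the same topics."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()  # renormalize after smoothing
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Identical distributions diverge by (approximately) zero
print(kl_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))
```

    Note that swapping the two arguments generally produces a different value, which is the directionality property described above.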

    Applications in this project:

    • Semantic Alignment: Each content section’s topic distribution (P) is compared to the target query or topic profile (Q) to evaluate alignment.
    • Quantitative Scoring: Produces a measurable score representing how closely a section adheres to the intended semantic focus.
    • Actionable Insights: Highlights areas of content misalignment, supporting optimization or restructuring.

    Topic Modeling

    Topic Modeling is an unsupervised NLP technique that identifies latent semantic structures within a corpus of text. Topics are represented by distributions over words, and each document or content section is expressed as a distribution over topics.

    Key concepts:

    • Document-Topic Probability: Indicates the degree to which a section is associated with each topic.
    • Topic Representation: Each topic is summarized using representative keywords and can be used to interpret the dominant themes in content.
    • Granularity: Applied at the section level, allowing fine-grained semantic analysis.

    Integration with KL Divergence:

    • KL Divergence is applied to the section-level topic distributions to compare them against a target query distribution.
    • This enables precise measurement of content alignment at both section and page levels.

    Section-Level Content Alignment

    By combining topic modeling and KL Divergence, the project achieves section-wise semantic scoring:

    • Sections are ranked by alignment score, revealing areas of strong thematic relevance or potential gaps.
    • Aggregating section scores provides page-level alignment, quantifying the overall focus on intended topics.
    • Comparative analysis across multiple pages allows identification of consistency, redundancy, or content drift.
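    As a toy illustration of this ranking and aggregation (the section names and scores below are hypothetical, not taken from the project), section-level KL scores can be sorted and averaged into a page-level score:

```python
# Hypothetical section-level KL divergence scores against a target query distribution
section_scores = {
    "Introduction": 0.12,
    "Pricing": 1.45,
    "Features": 0.30,
}

# Rank sections: lower divergence = stronger alignment
ranked = sorted(section_scores.items(), key=lambda kv: kv[1])

# Simple page-level aggregate: mean section divergence
page_score = sum(section_scores.values()) / len(section_scores)

print(ranked[0][0])        # best-aligned section
print(round(page_score, 3))
```

    The mean is only one possible aggregate; a weighted average (e.g., by section length) would be a natural alternative.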

    Distribution Analysis

    Beyond individual scores, the project employs distribution-based visualization:

    • Histograms and KDE Plots: Display the distribution of alignment scores across sections or pages.
    • Pattern Recognition: Detects areas where content consistently underperforms or aligns well with the target topics.
    • Cross-Page Comparisons: Highlights relative alignment and facilitates prioritization for optimization efforts.

    Practical Significance

    The combination of KL Divergence and Topic Modeling provides:

    • A quantitative, interpretable metric for content alignment.
    • Insights at multiple levels: section, page, and multi-page comparisons.
    • A framework to identify topic drift, content gaps, and optimization opportunities.

    This comprehensive approach transforms complex semantic structures into actionable, measurable insights, providing a robust foundation for content evaluation and refinement.

    Q&A: Project Value and Importance

    What is KL Divergence and why is it used in this project?

    KL Divergence, or Kullback-Leibler Divergence, is a statistical measure of how one probability distribution differs from another. It quantifies the information lost when one distribution is used to approximate another. In this project, KL Divergence is used to compare topic distributions between a query set and the content sections of a webpage. This allows for an objective measurement of how closely the content aligns with the intended search or informational intent. Unlike simple keyword matching, KL Divergence captures the probabilistic relationship between topics, ensuring a more nuanced understanding of semantic alignment.

    What is the purpose of measuring KL Divergence in content analysis?

    KL Divergence provides a quantitative measure of how closely content aligns with a target topic or query intent. By comparing the topic distribution of each section to the intended query distribution, it identifies areas where the content strongly represents the topic and areas where it deviates. This allows precise, numerical assessment of thematic consistency across sections and pages, which is especially valuable for evaluating technical content, blogs, and long-form articles where multiple subtopics exist. Without such a metric, alignment evaluation would rely on subjective judgment or manual review.

    How does topic modeling support KL Divergence in this project?

    Topic modeling extracts the underlying semantic structure from content. Each section is represented as a probability distribution over multiple topics, reflecting the different themes discussed. KL Divergence then measures the difference between this distribution and the query-focused topic distribution. The combination allows for:

    • Section-level semantic alignment scoring.
    • Identification of content gaps or sections that drift from the intended topic.
    • Quantitative insights into overall page alignment. Essentially, topic modeling provides the semantic map, and KL Divergence measures how well content navigates that map toward the target query intent.

    What SEO benefits can be realized by using this project?

    Using KL Divergence and topic modeling delivers several practical SEO advantages:

    • Improved Content Focus: Sections misaligned with target topics can be optimized or rewritten, ensuring the content is concentrated on relevant queries.
    • Enhanced Page Authority: By aligning all sections with the main topic, search engines can more confidently interpret page relevance, improving ranking potential.
    • Detection of Topic Drift: Sections that introduce unrelated topics or redundant content can be flagged and refined, minimizing the risk of diluting page relevance.
    • Structured Content Strategy: Provides actionable insight into content gaps and strong-performing sections, enabling strategic planning for internal linking, content expansion, or canonical structuring.
    • Multi-Page Benchmarking: Comparing KL divergence scores across multiple pages highlights areas for improvement and establishes performance benchmarks for future content.

    How does this methodology contribute to strategic insights in content planning?

    By quantifying the alignment between content sections and query-driven topic distributions, this methodology provides a clear picture of which sections reinforce key topics and which diverge. This insight supports content refinement, optimization of topical coverage, and better structuring of long-form content. It serves as a foundation for making informed decisions about content expansion, restructuring, or consolidation, ensuring content fulfills its informational and search relevance goals.

    Libraries Used

    requests

    The requests library is a Python HTTP library used for sending HTTP requests in a simple and human-readable way. It abstracts the complexities of network communication, allowing retrieval of web page content through GET or POST methods, handling headers, cookies, and status codes with minimal code.

    In this project, requests is used to fetch webpage content from URLs. This is the first step in processing each page for topic modeling and KL Divergence computation, enabling extraction of structured text blocks for further analysis.

    time

    The time module provides various time-related functions in Python, including pausing execution (sleep), measuring time intervals, and obtaining timestamps.

    time is primarily used to manage delays between HTTP requests when fetching multiple pages to avoid server overload or rate limiting. It also helps in logging execution times for different processing stages in the pipeline.

    logging

    The logging library provides a flexible framework for emitting log messages from Python programs. It supports different severity levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) and can output logs to consoles, files, or other handlers.

    Logging is used throughout the pipeline to track the status of page extraction, section processing, topic modeling, and KL divergence computations. It allows monitoring of potential failures, warnings, and overall progress, making debugging and pipeline transparency easier.

    re

    The re module provides support for regular expressions in Python, allowing complex pattern matching and text manipulation operations.

    Regular expressions are used to clean and preprocess webpage text, such as removing unwanted characters, HTML tags, or patterns that could interfere with embedding generation or topic modeling.

    html

    The html library provides utilities for manipulating HTML data, including escaping or unescaping HTML entities.

    It is used to decode HTML entities within text extracted from web pages to ensure the text is human-readable and clean before feeding it into preprocessing and topic modeling pipelines.

    unicodedata

    The unicodedata module provides access to the Unicode Character Database, allowing normalization and classification of Unicode text.

    It ensures consistent handling of Unicode characters across different web pages, especially for non-ASCII characters, which improves the quality of embeddings and topic extraction.

    BeautifulSoup from bs4

    BeautifulSoup is a Python library for parsing HTML and XML documents. It allows traversal, search, and modification of the parse tree using an intuitive interface.

    BeautifulSoup is used to extract structured sections from HTML pages. This includes headings, subheadings, paragraphs, and other content blocks, forming the base for section-wise topic modeling and KL divergence analysis.

    typing (Optional, List)

    The typing module provides type hints for Python code, enabling optional type annotations to improve code readability, maintainability, and static analysis.

    Type hints such as Optional and List are used in function signatures to indicate expected input and output types, making the code more robust and easier to understand for developers and reviewers.

    numpy (np)

    NumPy is a fundamental library for numerical computing in Python. It provides support for high-performance multidimensional arrays, matrix operations, and mathematical functions.

    NumPy is used for all numerical computations, including storing embeddings, normalizing topic distributions, computing KL divergence, and performing vectorized operations efficiently on arrays representing section-topic or query-topic distributions.

    BERTopic

    BERTopic is a topic modeling library that leverages transformer-based embeddings and clustering to identify interpretable topics from text. It produces topic distributions for documents and allows visualization of topic representations.

    BERTopic is used to extract topics from webpage sections and assign topic probabilities to each section. These topic distributions are then used in conjunction with KL divergence to measure content alignment with query topics.

    SentenceTransformer

    SentenceTransformer is a library that provides pre-trained transformer models for generating semantically meaningful sentence embeddings suitable for similarity tasks.

    It generates dense vector embeddings for queries and webpage sections. These embeddings enable similarity calculations and fallback topic approximation when BERTopic’s transform function cannot directly produce topic distributions.

    torch

    PyTorch is a deep learning framework for tensor computations, model building, and GPU acceleration.

    PyTorch is used under the hood by transformer models, including SentenceTransformer and BERTopic embeddings. It enables efficient computation of embeddings for large sections of text.

    transformers.utils

    The transformers.utils module provides configuration and logging control for Hugging Face transformer models.

    Logging verbosity is suppressed to prevent excessive console output when generating embeddings or running transformer-based models. This ensures cleaner notebook output during execution.

    cosine_similarity from sklearn.metrics.pairwise

    cosine_similarity computes the cosine similarity between vectors, a measure of orientation similarity regardless of magnitude.

    It is used to measure similarity between query embeddings and section embeddings, particularly in fallback scenarios for approximating topic distributions when direct transformation is not feasible.

    KMeans from sklearn.cluster

    KMeans is a clustering algorithm that partitions data points into k clusters based on distance metrics, typically Euclidean distance.

    KMeans is used internally by BERTopic to cluster embeddings and define topic groups. It allows the creation of discrete topic clusters from continuous embedding spaces.

    TfidfVectorizer from sklearn.feature_extraction.text

    TfidfVectorizer converts a collection of text documents into a matrix of TF-IDF features, which highlight the importance of words relative to the corpus.

    It is used by BERTopic to create representative topic vectors and to extract the most significant terms for each topic. These terms form the summary and labeling of topics in the final results.

    matplotlib.pyplot (plt)

    Matplotlib is a widely-used Python plotting library that provides an object-oriented API for creating static, animated, and interactive visualizations.

    Matplotlib is used to generate all visualizations in the project, including page-level KL divergence bars, section alignment bars, topic distribution pie charts, and histograms of alignment scores.

    seaborn (sns)

    Seaborn is a Python data visualization library based on Matplotlib. It provides high-level interfaces for drawing attractive and informative statistical graphics.

    Seaborn is used to plot smooth density curves (KDE) for multi-page alignment distributions, making it easier to visually compare alignment across multiple URLs and better understand the underlying patterns in topic alignment.

    Function extract_structured_blocks

    Overview

    The extract_structured_blocks function is designed to extract meaningful textual content from a webpage and organize it into structured sections. Each section typically contains a hierarchical context: a main heading (H2), a subheading (H3), and the associated text blocks. The function supports multiple fallback strategies to ensure content is captured even if the HTML structure is irregular. Additionally, it splits long text into manageable blocks and filters out very short or irrelevant sections. The function returns a dictionary containing the URL and a list of extracted sections, each with heading, subheading, and text.

    This approach allows downstream processing, such as topic modeling and KL divergence computation, to operate on well-defined content blocks rather than raw, unstructured HTML. By handling hierarchical tags, splitting large text, and providing fallbacks, it ensures robust coverage across diverse webpage structures.

    Key Code Explanations

    • Webpage Fetching and Error Handling

    This block fetches the raw HTML content from the given URL. It includes a timeout and a user-agent header to prevent request blocking. raise_for_status() ensures that HTTP errors are caught and logged, enabling graceful handling of failed requests.
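    A minimal sketch of such a fetch step, assuming a generic user-agent string and a 10-second timeout (the project's exact headers and timeout values are not shown here):

```python
import logging

import requests

def fetch_html(url, timeout=10):
    """Fetch raw HTML; return None on any request failure."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; content-audit-bot)"}
    try:
        resp = requests.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()  # surface 4xx/5xx responses as exceptions
        return resp.text
    except requests.RequestException as exc:
        logging.error("Failed to fetch %s: %s", url, exc)
        return None

# An invalid URL fails fast and is handled gracefully
print(fetch_html("not-a-valid-url"))
```

    Catching `requests.RequestException` covers timeouts, connection errors, and HTTP status errors in one branch, so a single failed page does not halt the pipeline.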

    • HTML Cleaning

    Removes non-content elements such as scripts, styles, navigation bars, and sidebars. This cleaning step reduces noise and focuses extraction on meaningful textual sections.

    • Text Cleaning and Splitting

    _clean normalizes whitespace for better text consistency. _split_text handles long paragraphs by splitting them into smaller blocks, ensuring each section stays within the defined max_block_chars. This is critical for topic modeling, as very long blocks can distort embeddings.
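    A simplified version of that splitting logic, assuming sentence-agnostic splitting on word boundaries (the project's actual `_split_text` may split on sentence punctuation instead):

```python
def split_text(text, max_block_chars=200):
    """Split long text into blocks of at most max_block_chars, on word boundaries."""
    words = text.split()
    blocks, current = [], ""
    for word in words:
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_block_chars:
            current = candidate
        else:
            if current:
                blocks.append(current)
            current = word
    if current:
        blocks.append(current)
    return blocks

blocks = split_text("word " * 100, max_block_chars=50)
print(len(blocks), max(len(b) for b in blocks))
```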

    • Hierarchical Extraction

    The hierarchical strategy assigns context to each block by tracking H2 and H3 tags. Text blocks inherit the headings, which allows subsequent topic modeling to consider structural hierarchy, improving the semantic understanding of sections.

    • Fallback Strategies

    If the hierarchical method fails to produce blocks (e.g., the page lacks proper headings), the function falls back to section-based or paragraph-based extraction. This ensures robust extraction even for pages with inconsistent HTML structures.

    Function preprocess_text

    Overview

    The preprocess_text function is responsible for cleaning and normalizing extracted webpage content before downstream NLP tasks such as topic modeling or similarity computations. It removes boilerplate phrases, common navigational text, legal disclaimers, URLs, and unnecessary Unicode characters. Additionally, it standardizes quotation marks, dashes, and whitespace. This ensures that only meaningful content is retained for semantic analysis, improving both model accuracy and computational efficiency.

    Key Code Explanations

    • Boilerplate and Extra Patterns

    Defines common non-informative phrases to remove from text. Additional custom phrases can be provided via boilerplate_extra. Using a compiled regex ensures efficient and case-insensitive removal across the text.
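    A sketch of that pattern-removal step; the phrases listed are illustrative placeholders, not the project's actual boilerplate list:

```python
import re

# Illustrative boilerplate phrases; the project's real list is longer
BOILERPLATE = ["read more", "click here", "all rights reserved"]

def remove_boilerplate(text, extra=None):
    phrases = BOILERPLATE + list(extra or [])
    # One compiled, case-insensitive pattern covering all phrases
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    return pattern.sub(" ", text)

cleaned = remove_boilerplate("Great tool. Click Here to learn more. All Rights Reserved.")
print(cleaned)
```

    Compiling the alternation once and substituting with a space (rather than an empty string) avoids accidentally fusing adjacent words.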

    • URL Removal

    Removes URLs to avoid irrelevant tokens in the NLP pipeline, which could distort embeddings or topic assignments.

    • Unicode Normalization and Character Substitution

    This step ensures consistent encoding, replaces typographical quotes and dashes with standard equivalents, and removes invisible characters. This is essential for models like BERTopic or sentence transformers, which are sensitive to inconsistent characters.

    • Whitespace Cleaning

    text = re.sub(r"\s+", " ", text).strip()

    Reduces multiple spaces to single spaces and trims leading/trailing whitespace. This prevents noisy tokens and improves model tokenization accuracy.

    Function preprocess_page

    Overview

    The preprocess_page function applies the preprocess_text function to every section of a given page. By iterating over each extracted content block, it ensures that all sections are consistently cleaned and normalized. The function returns the same page structure but with preprocessed text, making it ready for topic modeling and KL divergence computation.

    Key Code Explanations

    • Section-wise Preprocessing

    Iterates over each section and applies text preprocessing individually. This preserves the hierarchical structure (heading, subheading) while ensuring that each text block is clean and semantically meaningful.
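    The iteration itself is straightforward. A sketch, assuming sections are dictionaries with a `text` key and a `preprocess_text` function like the one described above (stubbed here as simple whitespace cleanup):

```python
import re

def preprocess_text(text):
    # Stand-in for the project's full cleaning pipeline
    return re.sub(r"\s+", " ", text).strip()

def preprocess_page(page):
    """Apply preprocess_text to every section, preserving headings and structure."""
    for section in page.get("sections", []):
        section["text"] = preprocess_text(section.get("text", ""))
    return page

page = {"url": "https://example.com", "sections": [
    {"heading": "Intro", "subheading": None, "text": "  Some   raw\n text  "},
]}
print(preprocess_page(page)["sections"][0]["text"])
```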

    Function load_topic_model

    Overview

    The load_topic_model function initializes a BERTopic model with a specified embedding model from Hugging Face. BERTopic combines transformer-based embeddings with clustering to identify coherent topics within text data. By leveraging pre-trained sentence embeddings, the model can generate semantically meaningful topics without requiring manual labeling. This function ensures that the topic model is ready for use in downstream tasks such as computing section-level topic distributions, aligning content with queries, and calculating KL divergence scores.

    Key Code Explanations

    • Device Selection

    device = "cuda" if torch.cuda.is_available() else "cpu"

    Automatically detects if GPU acceleration is available for PyTorch. Using cuda significantly speeds up embedding computations, especially for large volumes of text. If a GPU is not available, the function defaults to CPU, ensuring compatibility across environments.

    • Embedding Model Initialization

    embedder = SentenceTransformer(model_name, device=device)

    Creates a SentenceTransformer embedding model. This transformer model encodes text into high-dimensional vectors representing semantic meaning, which BERTopic uses for clustering and topic extraction. The choice of model (all-mpnet-base-v2 by default) balances performance and accuracy for general-purpose semantic tasks.

    • BERTopic Model Setup

    topic_model = BERTopic(
        embedding_model=embedder,
        calculate_probabilities=True,
        verbose=False,
    )

    Initializes BERTopic with the embedding model. calculate_probabilities=True enables the model to generate topic probabilities for each document, which is essential for computing KL divergence between section topics and query topics. Suppressing verbose logging keeps the output clean during batch processing.

    Function _safe_softmax_row

    Overview

    The _safe_softmax_row function computes a softmax transformation over an input 1D array. Softmax is critical for converting raw scores or similarities into normalized probabilities. In this project, it is used to convert section-to-topic similarity scores into probabilistic topic assignments, allowing each section to have a weighted association with multiple topics. This ensures the KL divergence calculation accurately reflects the probability distributions of section-level topic assignments. The function also ensures numerical stability, preventing overflow or underflow issues that can occur when exponentiating very large or very small numbers.

    Key Code Explanations

    • x = x - np.max(x)

    Subtracts the maximum value in the array to stabilize exponentiation. Without this, large values could lead to inf after np.exp(x).

    • ex = np.exp(x)

    Computes the exponentials of the stabilized values.

    • s = ex.sum()

    Sums the exponentials to normalize into probabilities.

    • ex / s if s > 0 else np.ones_like(ex) / len(ex)

    Normalizes the exponentials so that they sum to 1. If the sum is zero (possible in edge cases), returns a uniform probability distribution to avoid errors.

    This function ensures that downstream mapping of probabilities is always valid, which directly impacts the reliability of KL divergence scores.
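    Putting the lines above together, a minimal sketch of the full function:

```python
import numpy as np

def _safe_softmax_row(x):
    """Numerically stable softmax over a 1D array of scores."""
    x = np.asarray(x, dtype=np.float64)
    x = x - np.max(x)          # shift so the largest exponent is 0
    ex = np.exp(x)
    s = ex.sum()
    # Uniform fallback if the sum degenerates to zero
    return ex / s if s > 0 else np.ones_like(ex) / len(ex)

# Large scores that would overflow a naive softmax stay finite here
probs = _safe_softmax_row([1000.0, 1001.0, 999.0])
print(probs.sum())
```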

    Function _get_topic_id_order

    Overview

    This function retrieves the ordered list of topic IDs from a BERTopic model. Consistent topic ordering is essential for aligning the columns of the probability matrix with the corresponding topic IDs. Without a fixed order, downstream mapping of probabilities to topics could be mismatched, leading to inaccurate KL divergence calculations.

    Key Code Explanations

    • info = topic_model.get_topic_info()

    Fetches metadata about all topics, including their IDs and sizes.

    • return list(info["Topic"].values)

    Extracts the topic IDs in order from the model's internal summary.

    • Fallback: If the primary method fails (e.g., incompatible BERTopic version), topic_model.get_topics().keys() retrieves topic IDs from the raw topic dictionary. This ensures robustness and guarantees that even in non-ideal scenarios, the probability mapping remains consistent.

    Function _coerce_probs_matrix

    Overview

    The _coerce_probs_matrix function standardizes BERTopic’s probability output into a 2D NumPy array of shape (n_docs, n_topics). BERTopic sometimes returns 1D arrays for single-document cases or higher-dimensional arrays for batch outputs. This function ensures that all outputs conform to a consistent format, which is essential for mapping topics to sections and calculating KL divergence without errors.

    Key Code Explanations

    • Checks the dimensionality of the input (arr.ndim) to handle:

        • A 1D array of length n_docs (ambiguous case)
        • A 1D array representing a single document
        • An already 2D array
        • Higher-dimensional arrays

    • Attempts to reshape higher-dimensional arrays to (n_docs, -1) to standardize the input.

    • Returns None if coercion fails, allowing the pipeline to fall back safely to approximate distributions or KMeans. This design choice ensures the pipeline is robust to unexpected BERTopic outputs.

    Function _kmeans_fallback_topics

    Overview

    The _kmeans_fallback_topics function provides a fallback topic generation mechanism when BERTopic cannot be reliably applied (e.g., too few documents or failed embeddings). It clusters document embeddings using KMeans and assigns soft probabilistic topic distributions. This ensures that every section still receives a topic assignment, maintaining the integrity of KL divergence calculation for topic alignment analysis. It also computes representative terms for each cluster, providing interpretable topic summaries for sections.

    Key Code Explanations

    • if n_clusters is None: n_clusters = max(1, min(n_docs, max(2, n_docs // 2)))

    Determines the number of clusters automatically based on document count, ensuring reasonable cluster granularity for small or large pages.

    • if embeddings is None or embeddings.shape[0] != n_docs:

    Builds embeddings using TF-IDF if precomputed embeddings are unavailable. TF-IDF is used as a lightweight fallback to capture semantic content without SentenceTransformer embeddings.

    • kmeans = KMeans(n_clusters=n_clusters, random_state=42)

    Initializes KMeans with a deterministic random state for reproducibility.

    • labels = kmeans.fit_predict(embeddings)

    Assigns each document to a cluster.

    • sims = cosine_similarity(embeddings, centroids)

    Computes the similarity of each document to all cluster centroids.

    • probs_rows = np.vstack([_safe_softmax_row(r) for r in sims])

    Converts similarities into a softmax probability distribution for each document.

    • Cluster summaries: top_idx = np.argsort(cluster_tfidf)[-top_n_terms:][::-1]

    Extracts the top terms for each cluster based on TF-IDF, providing interpretable keywords for each topic.

    • topics_list = [int(l) for l in labels]

    Stores the cluster assignments as integer topic IDs.

    This function ensures robustness by providing interpretable topic assignments even in low-data scenarios, preserving the downstream analysis for KL divergence.
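    The core of the fallback, turning centroid similarities into soft topic probabilities, can be sketched with NumPy alone (KMeans fitting and the TF-IDF term extraction are omitted; the embeddings and centroids below are hypothetical toy values):

```python
import numpy as np

def _safe_softmax_row(x):
    """Numerically stable softmax over a 1D array."""
    x = np.asarray(x, dtype=np.float64) - np.max(x)
    ex = np.exp(x)
    s = ex.sum()
    return ex / s if s > 0 else np.ones_like(ex) / len(ex)

def soft_cluster_probs(embeddings, centroids):
    """Cosine similarity of each document to each centroid, softmaxed per row."""
    a = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    b = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = a @ b.T                                  # shape (n_docs, n_clusters)
    return np.vstack([_safe_softmax_row(r) for r in sims])

emb = np.array([[1.0, 0.0], [0.0, 1.0]])            # two toy document embeddings
cents = np.array([[1.0, 0.1], [0.1, 1.0]])          # two toy cluster centroids
probs = soft_cluster_probs(emb, cents)
print(probs.shape)
```

    Each row sums to 1, so every document receives a valid distribution over cluster topics even when it sits between centroids.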

    Function _build_doc_topic_maps_from_matrix

    Overview

    This function maps the per-document topic probability matrix to structured section-level outputs. Each section receives a topic_probs dictionary, the assigned topic_id, and a confidence score. It also calculates an explicit outlier probability (-1 topic), ensuring all probability distributions sum to 1. This mapping is crucial for accurate KL divergence calculations because it aligns each section with a probabilistic topic distribution.

    Key Code Explanations

    • probs_matrix = np.asarray(probs_matrix, dtype=np.float64)

    Ensures the probability matrix is a NumPy array with a uniform numeric type for stability.

    • col_topic_ids = list(topic_order)[:n_cols]

    Maps matrix columns to topic IDs. If topic_order is not provided, a fallback is generated from the BERTopic model.

    • mapping[int(tid)] = float(row[j] / non_outlier_sum if non_outlier_sum > 0 else 0.0)

    Normalizes each topic probability for a section and ensures numerical stability.

    • outlier_prob = max(0.0, 1.0 - p_sum)

    Explicitly assigns leftover probability to the outlier (-1) topic.

    • assigned = topics_list[i] if i < len(topics_list) else None

    Determines the most likely topic for the section, falling back to the highest-probability topic if missing.

    • results.append({"topic_probs": mapping, "topic_id": int(assigned), "confidence": confidence})

    Constructs the final output dictionary for each section, maintaining a structured, probabilistic topic mapping.

    Function _map_results_to_sections

    Overview

    This function integrates topic modeling results back into the original page structure. For each section, it adds topic_probs, topic_id, and confidence, enriching sections with topic distribution data. It ensures that the mapping respects the order of non-empty section texts and aligns properly with the probability matrix. This function is essential to maintain the connection between raw content and topic assignments for KL divergence computation.

    Key Code Explanations

    ·         docs = [s.get("text", "").strip() for s in sections if s.get("text", "").strip()]

    Extracts only non-empty section texts for topic processing.

    ·         probs_matrix = np.asarray(probs_matrix, dtype=np.float64)

    Ensures consistent numeric type for all calculations.

    ·         doc_results = _build_doc_topic_maps_from_matrix(topic_model, docs, probs_matrix, topics_list, topic_order=topic_order)

    Generates per-document structured mappings with probability distributions.

    ·         di = 0; for s in sections: …

    Iterates through original sections and injects topic results, preserving empty sections with None values for completeness.
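    A minimal sketch of this injection loop, assuming each section is a dict with a `text` key and `doc_results` comes from the per-document mapping step (simplified from the real function):

    ```python
    def map_results_to_sections(sections, doc_results):
        """Sketch: write per-document topic results back onto the original
        sections in order, skipping sections whose text is empty."""
        di = 0
        for s in sections:
            if s.get("text", "").strip():
                res = doc_results[di]
                s["topic_probs"] = res["topic_probs"]
                s["topic_id"] = res["topic_id"]
                s["confidence"] = res["confidence"]
                di += 1
            else:
                # Empty sections keep None placeholders for completeness.
                s["topic_probs"] = None
                s["topic_id"] = None
                s["confidence"] = None
        return sections
    ```

    Because only non-empty texts were sent to the topic model, the counter `di` keeps the result rows aligned with the right sections even when empty sections sit between them.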

    Function generate_topics

    Overview

    The generate_topics function orchestrates single-page topic modeling. It integrates BERTopic with fallback mechanisms to KMeans for robust handling of small or challenging datasets. Each section receives probabilistic topic assignments (topic_probs), an assigned topic_id, and a confidence score. Page-level summaries are also generated. This function is the core of topic extraction, enabling KL divergence to measure divergence between query-intent and section-level topic distributions accurately.

    Key Code Explanations

    ·         embedder = getattr(topic_model, "embedding_model", None)

    Attempts to use a SentenceTransformer embedder if available for semantic embeddings.

    ·         if n_docs < min_docs_for_bertopic:

    Uses KMeans fallback for small datasets, ensuring stable topic generation.

    ·         topics, probs = topic_model.fit_transform(docs, embeddings=doc_embs)

    Fits BERTopic to section texts and retrieves probability distributions.

    ·         probs_matrix_with_outlier = np.vstack(adjusted_rows)

    Adds explicit outlier probability for each section to maintain a complete probability distribution.

    ·         doc_results = _build_doc_topic_maps_from_matrix(topic_model, docs, probs_matrix_with_outlier, topics, topic_order=final_topic_order)

    Produces structured per-section topic mappings, used later for KL divergence.

    ·         page["summary"] = topic_model.get_topic_info().to_dict("records")

    Adds page-level topic summary for reference and visualization.

    Function generate_query_topic_distribution

    Overview

    The generate_query_topic_distribution function computes an aggregated topic distribution for a set of user queries, aligning them with a pre-trained BERTopic model. The output is a dictionary mapping topic IDs to probabilities, including an explicit outlier topic (-1) to account for unaligned content.

    This function uses a preferred path of transforming queries through the BERTopic model (transform or approximate_distribution) to ensure queries are embedded in the same topic space as page sections. If this fails (e.g., invalid shapes or small datasets), a fallback mechanism computes query embeddings, calculates cosine similarity against topic embeddings, applies softmax, and normalizes probabilities. The function optionally supports query expansion for very short queries to enhance topic coverage.

    This robust approach ensures that queries always have a valid topic distribution, which can be directly used for KL divergence calculations or content-query alignment analyses.

    Key Code Explanations

    • Query expansion (optional)

    This allows enriching short queries to improve topic alignment. Any failure in expansion is logged but does not break the pipeline.

    • Query embeddings using model embedder

    If the BERTopic model has a SentenceTransformer embedder, the queries are encoded into embeddings for consistent topic space alignment. Numeric coercion ensures stability in downstream computations.

    • Transforming queries to topic space

    topics_out, probs = topic_model.transform(queries, embeddings=q_embs)

    Uses the BERTopic transform method to obtain per-query topic assignments and probabilities. If embeddings are not available, transform is called without them. This is the preferred path for high-quality topic alignment.

    • Coercing probability matrix

    probs_matrix = _coerce_probs_matrix(probs, n_q)

    Ensures the output probabilities are in a consistent 2D NumPy array. This step handles cases where BERTopic returns irregular shapes (e.g., single vectors).

    • Outlier topic computation

    outlier_prob = max(0.0, 1.0 - r_sum)
    row_with_outlier = np.concatenate([r, np.array([outlier_prob], dtype=np.float64)])
    row_with_outlier = row_with_outlier + EPS
    row_with_outlier = row_with_outlier / row_with_outlier.sum()

    Explicitly calculates the probability of the outlier topic (-1) as the remainder after summing non-outlier probabilities. Normalization ensures the sum of all probabilities equals 1.

    • Fallback to embedding-centroid similarity

    sims = cosine_similarity(q_embs, topic_embs)
    probs_rows = np.vstack([_safe_softmax_row(r) for r in sims])

    If transform or approximate_distribution fails, cosine similarity between query embeddings and topic embeddings is used, followed by a softmax conversion to probabilities. This ensures a reliable distribution even when BERTopic methods fail.
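    The softmax step in the fallback can be illustrated with a small numerically stable implementation; this is a sketch of what a helper like _safe_softmax_row typically does, and the exact project code may differ:

    ```python
    import numpy as np

    def safe_softmax_row(row):
        """Numerically stable softmax: subtracting the row maximum before
        exponentiating prevents overflow for large similarity values."""
        row = np.asarray(row, dtype=np.float64)
        shifted = row - row.max()
        exps = np.exp(shifted)
        return exps / exps.sum()
    ```

    Applied to a row of cosine similarities, this yields a valid probability vector in which higher similarities receive proportionally higher probabilities.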

    • Aggregating across queries

    topic_sums = probs_with_outlier.sum(axis=0)
    topic_dist = {int(final_topic_order[i]): float(topic_sums[i] / total) for i in range(len(topic_sums))}

    Sums per-topic probabilities across all queries and normalizes to produce a final aggregated distribution over topic IDs, including the outlier. This output is suitable for KL divergence or page-query relevance calculations.
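    The aggregation step can be sketched as follows, assuming a per-query probability matrix that already includes the outlier column (the helper name is hypothetical):

    ```python
    import numpy as np

    def aggregate_query_distribution(probs_with_outlier, final_topic_order):
        """Sketch: sum per-query probabilities column-wise and renormalize
        into a single topic-ID -> probability dictionary."""
        probs_with_outlier = np.asarray(probs_with_outlier, dtype=np.float64)
        topic_sums = probs_with_outlier.sum(axis=0)
        total = float(topic_sums.sum())
        return {int(final_topic_order[i]): float(topic_sums[i] / total)
                for i in range(len(topic_sums))}
    ```

    For two queries with rows `[0.6, 0.3, 0.1]` and `[0.2, 0.5, 0.3]` over topics `[0, 1, -1]`, the aggregated distribution averages the columns, giving topic 0 a probability of 0.4.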

    Function compute_kl_divergence

    Overview

    The compute_kl_divergence function calculates the Kullback-Leibler (KL) divergence between two probability distributions, P (source) and Q (target). KL divergence is a fundamental concept in information theory that measures how one probability distribution diverges from another.

    In this project, P typically represents the topic distribution of a page section, while Q represents the aggregated topic distribution of client queries. By computing the KL divergence between sections and queries, the function quantifies how well a section aligns with the user’s search intent. Sections with lower KL divergence are more aligned with query topics, while higher KL divergence indicates potential gaps or misalignment in content relevance.

    Key Code Explanations

    • Conversion to NumPy arrays and type enforcement

    p = np.array(p_dist, dtype=np.float64)

    q = np.array(q_dist, dtype=np.float64)

    Ensures that input distributions are numerical arrays of type float64, which is critical for accurate mathematical operations and stability in downstream calculations.

    • Epsilon addition to avoid log(0) and division by zero

    p = p + epsilon

    q = q + epsilon

    Adding a small constant epsilon prevents undefined operations when any probability value is zero. This is especially important in sparse distributions where some topics may have zero probability.

    • Normalization to valid probability distributions

    p = p / p.sum()

    q = q / q.sum()

    Ensures that both p and q sum to 1, satisfying the requirements for proper KL divergence computation. This step also mitigates inconsistencies due to numerical errors or unnormalized input.

    • KL divergence computation

    kl_div = np.sum(p * np.log(p / q))

    Applies the standard formula for KL divergence:

    D_KL(P || Q) = Σ_i P(i) · log( P(i) / Q(i) )

    This measures the “distance” or information loss when using Q to approximate P. In the SEO context, this quantifies how well a content section’s topics cover the query’s intent.
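    Putting the snippets above together, a minimal runnable version of the function might look like this (the project's actual implementation may add input validation):

    ```python
    import numpy as np

    def compute_kl_divergence(p_dist, q_dist, epsilon=1e-12):
        """KL divergence D_KL(P || Q), following the steps described above.
        Epsilon smoothing and renormalization keep the computation defined
        for sparse or unnormalized inputs."""
        p = np.array(p_dist, dtype=np.float64) + epsilon
        q = np.array(q_dist, dtype=np.float64) + epsilon
        p = p / p.sum()   # normalize to a valid probability distribution
        q = q / q.sum()
        return float(np.sum(p * np.log(p / q)))
    ```

    Identical distributions yield a divergence of 0, while distributions that concentrate mass on different topics produce strictly positive values.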

    Function compute_section_kl

    Overview

    The compute_section_kl function evaluates how well each content section aligns with the client’s query intent by computing KL divergence between each section’s topic distribution and the query topic distribution.

    In practical terms, this function takes the topic probabilities for each section (generated by BERTopic or fallback KMeans) and compares them to the aggregated query topic distribution. The resulting KL scores quantify content relevance: lower scores indicate higher alignment with the user’s search intent. The function also supports aggregating section-level KL divergences into a page-level KL score by averaging across all sections, giving clients a quick, high-level measure of overall intent coverage.

    Key Code Explanations

    • Check for valid sections

    Ensures that the page contains sections. If not, KL computation is skipped and a warning is logged. This prevents runtime errors in downstream calculations.

    • Check for empty query distribution

    Validates the presence of query topic probabilities. If the distribution is empty, section KL scores are set to None and processing stops. This avoids meaningless divergence calculations.

    • Align section and query topic vectors

    Collects all topic IDs from both sections and query distribution, ensuring the vectors are aligned. This guarantees that each probability array corresponds to the same set of topics, which is critical for accurate KL divergence calculation.

    • Compute section-level KL divergence

    For each section:

    1. Extracts the topic probability vector (p_arr) aligned to all_topic_ids.
    2. Calls compute_kl_divergence to calculate divergence with the query vector (q_arr).
    3. Stores the result in sec[“kl_divergence”]. This ensures per-section relevance scoring, giving clients detailed insights into which sections are most aligned with search intent.
    • Aggregate KL divergence across sections

    When aggregate=True, calculates the mean of section-level KL scores to provide a single page-level KL metric. This allows clients to quickly gauge overall content alignment at the page level.
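    The alignment-and-scoring loop can be sketched as follows, assuming each section carries a `topic_probs` dict and the query distribution maps topic IDs to probabilities (simplified; the real function also logs warnings and handles empty inputs):

    ```python
    import numpy as np

    def compute_section_kl(sections, query_dist, aggregate=True, epsilon=1e-12):
        """Sketch: build a shared topic-ID axis, score each section's topic
        distribution against the query distribution, optionally average."""
        # Union of topic IDs from every section and the query distribution,
        # so p and q vectors index the same topics.
        all_ids = sorted({t for s in sections for t in s["topic_probs"]} | set(query_dist))
        q = np.array([query_dist.get(t, 0.0) for t in all_ids]) + epsilon
        q = q / q.sum()
        scores = []
        for s in sections:
            p = np.array([s["topic_probs"].get(t, 0.0) for t in all_ids]) + epsilon
            p = p / p.sum()
            s["kl_divergence"] = float(np.sum(p * np.log(p / q)))
            scores.append(s["kl_divergence"])
        return float(np.mean(scores)) if aggregate and scores else None
    ```

    A section whose distribution matches the query closely receives a near-zero score, while a section concentrated on other topics scores substantially higher; the returned mean serves as the page-level metric.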

    Function display_topic_alignment_results

    Overview

    The display_topic_alignment_results function provides a client-facing visualization of topic alignment between page sections and query intent. It summarizes:

    • Page-level alignment scores (via KL divergence).
    • Top performing sections (best aligned to query topics).
    • Sections that are underperforming or misaligned (likely candidates for optimization).
    • Overall page topic summary including representative terms for each topic.

    This function is non-destructive; it only reads the results from previous computations and prints them in a structured, interpretable format, making it ideal for reporting or client presentations.

    Result Analysis and Explanation

    This section interprets the results from the KL Divergence-based topic alignment analysis. The focus is on understanding how well the content of a page aligns with query intents, identifying strong-performing sections, and highlighting areas of misalignment.

    Overall Page-Level Alignment Score

    Score: 0.5920

    The overall alignment score represents the average KL divergence between the topic distribution of all page sections and the query topic distribution.

    • Lower values indicate stronger alignment, meaning the page’s sections closely match the query topics.
    • Higher values indicate divergence, showing sections may not reflect the intended query focus.

    Interpretation of thresholds:

    • 0.0 – 0.3: Excellent alignment; the content reflects the query topics very closely.
    • 0.3 – 0.6: Moderate alignment; some sections may require adjustment for better focus.
    • 0.6 – 1.0: Noticeable divergence; content may need restructuring or enhancement.
    • Greater than 1.0: High divergence; sections are largely misaligned with the query topics.

    For this page, a score of 0.5920 indicates moderate alignment, suggesting that some sections align well while others show divergence that could be addressed.

    Best-Aligned Sections

    Sections with the lowest KL divergence are considered best-aligned and closely match the query topic distribution.

    ·         Section: “This ensures that is indexed instead of alternate video versions.” | Alignment Score: 0.2511 | Topic ID: 0 | Topic Confidence: 0.5927

    ·         Section: “This instructs search engines to treat as the primary document.” | Alignment Score: 0.2528 | Topic ID: 0 | Topic Confidence: 0.6002

    ·         Section: “Under the Headers section, check for the Link header and verify the canonical tag is correctly assigned.” | Alignment Score: 0.2877 | Topic ID: 0 | Topic Confidence: 0.6826

    Interpretation:

    • Low KL divergence indicates strong alignment with query topics.
    • Topic confidence represents the certainty of the section’s assigned topic. Higher confidence confirms reliable alignment.

    These sections can serve as models for extending or reinforcing content structure and topic coverage.

    Least-Aligned Sections

    Sections with the highest KL divergence show misalignment and are candidates for content improvement.

    ·         Section: “Observe search rankings and ensure that the duplicate versions do not compete with the canonical file.” | Alignment Score: 1.0216 | Topic ID: 1 | Topic Confidence: 1.0000

    ·         Section: “Monitor Performance in Search Results Observe search rankings and ensure that the duplicate versions do not compete with …” | Alignment Score: 1.0216 | Topic ID: 1 | Topic Confidence: 1.0000

    ·         Section: “If Google indexes the duplicate version instead of the canonical one, re-evaluate your implementation.” | Alignment Score: 1.0216 | Topic ID: 1 | Topic Confidence: 1.0000

    Interpretation:

    • High KL divergence indicates these sections diverge significantly from the query topic distribution.
    • High topic confidence confirms that the assigned topic is clear, but the content is not aligned with the query focus.

    These sections are suitable for review, restructuring, or content revision to improve topic alignment.

    Page Topic Summary

    ·         Topic -1 | Count: 1 | Name: -1_index_ensures_preferred_engines | Top Terms: index, ensures, preferred, engines, this. Represents outlier content that does not clearly belong to any primary topic.

    ·         Topic 0 | Count: 41 | Name: 0_the_headers_and_http | Top Terms: the, headers, and, http, for. Dominates the page content; sections under this topic show strong alignment with query topics.

    ·         Topic 1 | Count: 11 | Name: 1_to_search_and_duplicate | Top Terms: to, search, and, duplicate, be. Sections under this topic show high divergence, indicating a need for content improvement or restructuring.

    The distribution provides insight into which topics dominate the page, which align well with the queries, and which require focus for optimization.

    Key Interpretations and Actions

    • Moderate overall alignment indicates a mix of well-aligned and misaligned sections.
    • Sections with low KL divergence can guide content reinforcement strategies.
    • Sections with high KL divergence highlight areas for content revision or restructuring.
    • Topic coverage distribution shows which topics dominate and which require adjustments to improve query relevance.
    • KL divergence provides a quantitative measure of alignment with query topics, enabling a systematic approach to content optimization.

    This analysis translates KL divergence scores into actionable insights, allowing content focus to be adjusted in a measurable way for improved topic alignment.

    Result Analysis and Explanation: KL Divergence for Topic Modeling

    This section provides a detailed interpretation of the results produced by KL divergence-based topic alignment analysis. It covers multiple URLs and queries in a generalized form, ensuring that the explanation applies regardless of specific URLs, scores, or queries. The discussion focuses on understanding the alignment metrics, section performance, topic distribution, and actionable insights derived from the results.

    Overall Page-Level Alignment

    The overall alignment score for a page is calculated as the average KL divergence across all sections when compared to the query topic distribution. Lower scores indicate stronger alignment between the content and the query topics, while higher scores suggest potential misalignment.

    Score Threshold Interpretation:

    • 0.0 – 0.5 (Excellent Alignment): Sections are highly consistent with query intent; content is strongly aligned with targeted topics.
    • 0.5 – 1.0 (Moderate Alignment): Sections are generally aligned but may contain areas needing improvement or refinement.
    • 1.0 – 2.0 (Weak Alignment): Sections show noticeable divergence from query topics; attention is required to optimize content structure and coverage.
    • Above 2.0 (Poor Alignment): Significant misalignment between content and query topics; content restructuring, rewrites, or deeper topic focus is recommended.

    This threshold framework allows a clear interpretation of the page’s overall alignment and can guide prioritization of content optimization efforts.

    Section-Level Alignment

    Each page section is assigned a KL divergence score relative to the query topic distribution, providing granular insight into which sections perform well or poorly.

    Best-Aligned Sections:

    • Sections with low KL scores are considered strong performers.
    • These sections reflect content that closely matches the target query intent and topic distribution.
    • Maintaining, updating, or expanding these sections can reinforce content authority.

    Least-Aligned Sections:

    • Sections with high KL scores indicate misalignment.
    • These sections may contain irrelevant information, overlapping topics, or structural issues.
    • Optimizing these sections may involve reassigning topic focus, rewriting content, or redistributing information for better alignment with query intent.

    Confidence Consideration:

    • Each section includes a confidence score representing topic assignment reliability. Sections with low confidence or outlier topic IDs should be interpreted with caution and may require verification.

    Topic Distribution Analysis

    The page topic summary shows how content is distributed across the modeled topics. This provides insight into content balance and potential gaps.

    Key Points:

    • Dominant topics appear as larger slices in the distribution, indicating high content coverage in these areas.
    • Less-represented topics indicate gaps where content may not fully address specific query aspects.
    • Outlier or miscellaneous topics (often labeled with negative IDs) may signal sections that do not fit well into the main topic clusters and may require review.

    Understanding the topic distribution aids in ensuring comprehensive coverage and minimizing unintended gaps or overemphasis in certain areas.

    Interpretation of KL Divergence Metrics

    KL Divergence quantifies the divergence between the probability distribution of query topics and the probability distribution of section topics. Important points for interpretation:

    • Lower KL divergence reflects strong alignment; content is highly relevant to queries.
    • Higher KL divergence reflects weaker alignment; content may be off-topic or insufficiently aligned.
    • Aggregating section-level KL scores provides a page-level alignment score, summarizing overall performance.
    • KL divergence is asymmetric, meaning the direction of comparison matters (section vs query). In practice, the divergence from content to query distribution highlights areas where content does not meet the intended topic coverage.

    Score-Based Binning for Actionable Insights

    To facilitate action-oriented analysis, alignment scores can be categorized into bins:

    • Excellent (0.0 – 0.5): Maintain and reinforce these sections. High alignment indicates effective topic coverage.
    • Good (0.5 – 1.0): Review these sections for minor improvements or additional context to increase alignment.
    • Moderate (1.0 – 2.0): Evaluate content structure and topic focus; consider partial rewrites or reorganization.
    • Poor (>2.0): Immediate action recommended; sections may need full rewrite, reorganization, or targeted content updates.

    This binning framework helps prioritize content optimization efforts and track alignment improvement over time.
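    These bins can be expressed as a small helper that follows the thresholds above:

    ```python
    def alignment_bin(score):
        """Map a page- or section-level KL score to the action bins above."""
        if score < 0.5:
            return "Excellent"
        if score < 1.0:
            return "Good"
        if score <= 2.0:
            return "Moderate"
        return "Poor"
    ```

    Applying this to section-level scores makes it straightforward to count how many sections fall into each bin and to track migration between bins after content updates.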

    Visualization Interpretation

    Visualization plays a critical role in understanding KL divergence results, offering multiple perspectives:

    Page-Level Alignment Plot

    • Bar chart showing average KL divergence per page.
    • Lower bars correspond to stronger alignment.
    • Provides a quick comparative view across multiple pages.

    Multi-URL Alignment Distribution

    • KDE curves display the distribution of section-level alignment scores across multiple pages.
    • Left-shifted curves indicate stronger alignment; right-shifted curves indicate weaker alignment.
    • Highlights consistency and variability in section performance across pages.

    Page Topic Distribution Pie Chart

    • Pie chart representing topic coverage of a page.
    • Larger slices indicate dominant topics; smaller slices reveal underrepresented areas.
    • Top terms for each topic help connect KL scores to actual content semantics.

    Section Alignment Histogram

    • Histogram of section-level KL divergence for a single page.
    • Peaks on the left indicate well-aligned sections; peaks on the right indicate misaligned sections.
    • Helps identify the proportion of sections performing at various alignment levels.

    Section-Level Alignment Comparison

    • Horizontal bar chart comparing best and worst-aligned sections within a page.
    • Best-aligned sections in green highlight strong content; worst-aligned in red indicate areas needing attention.
    • Numeric KL scores provide precise reference for content evaluation.

    Summary

    The analysis of KL divergence provides a quantitative and interpretable measure of content-topic alignment. Key takeaways include:

    • Page-level and section-level scores identify strong performers and areas needing optimization.
    • Topic distribution highlights coverage gaps and overrepresented areas.
    • Visualization supports rapid assessment and prioritization.
    • Score thresholds and bins facilitate actionable decisions for content refinement and alignment with query intent.

    This structured approach ensures a clear understanding of alignment performance and supports data-driven content optimization and topic coverage improvement.

    Q&A Section: Insights and Actions Based on KL Divergence Analysis

    Which sections demonstrate the strongest alignment with query topics, and how should this inform content strategy?

    Sections with the lowest KL divergence scores indicate content that closely matches the target query topics. These are high-performing sections where the topic coverage aligns well with search intent. In practice, these sections can be leveraged as models for similar content development:

    • Expand related subtopics using the language and structure of these sections.
    • Use them as internal linking hubs to reinforce topical authority.
    • Maintain or slightly update these sections to preserve alignment while expanding coverage.

    Interpretation from visualization: Bar charts of section KL scores and the section alignment plots highlight the best-performing sections clearly, enabling identification of strong content anchors.

    Which sections are least aligned, and what actions are recommended for optimization?

    Sections with high KL divergence are misaligned with the query topics. These indicate gaps where the content does not address the expected themes or user intent. Actions include:

    • Revising content to better match query intent and related subtopics.
    • Breaking long sections into smaller, focused segments to target specific topics.
    • Adding relevant keywords or semantic terms derived from topic modeling to improve coverage.

    Interpretation from visualization: Histograms of section-level KL divergence and the section alignment plots make it easy to identify outlier sections requiring attention.

    How can the overall page alignment score inform prioritization of optimization efforts across multiple pages?

    The page-level KL divergence aggregates alignment across all sections. Pages with lower scores are closer to the intended topic coverage, while higher scores indicate misalignment at the page level. Actions based on these scores:

    • Prioritize optimization on high KL divergence pages to maximize ROI from content updates.
    • Benchmark page performance over time by tracking changes in the KL divergence after content modifications.
    • Use page-level scores to compare multiple pages targeting similar queries and identify which need restructuring or refinement.

    Interpretation from visualization: The page-level bar chart of KL divergence scores across multiple URLs visually identifies pages that are strong performers versus those needing intervention.

    How can topic distribution analysis guide content coverage and expansion strategies?

    Topic distribution summaries show how evenly content covers all relevant topics for a page. Observations:

    • Dominance of a single topic may indicate overemphasis, potentially leaving gaps in related queries.
    • Low-frequency topics might represent missed opportunities for coverage and should be expanded with additional content.
    • Topic outliers (-1) represent content not captured by known topics and may need refinement or reclassification.

    Interpretation from visualization: Pie charts of topic distribution per page reveal the proportional coverage of topics, helping identify areas for expansion and balance across content.

    How can the alignment score distributions across sections and pages inform overall content quality assessment?

    The distribution of KL divergence scores, visualized via histograms or KDE plots, provides a view of consistency and spread of alignment:

    • Narrow distributions skewed towards low scores indicate uniform, high-quality content.
    • Wide or right-skewed distributions signal uneven coverage, highlighting sections requiring targeted updates.
    • Multi-page KDE comparisons reveal which pages consistently underperform relative to others.

    Actions derived from these insights:

    • Focus updates on sections in the tail of the distribution to improve consistency.
    • Use the overall distribution patterns to guide editorial strategies, ensuring all topics are sufficiently addressed.
    • Monitor changes in distributions after optimization to validate improvements.

    How can the page-level KL divergence bar chart be used to prioritize pages for content review?

    The bar chart shows overall alignment for each page. Lower bars indicate better alignment, higher bars indicate misalignment. Pages with high KL divergence should be reviewed first to identify sections needing content updates, restructuring, or optimization. This enables efficient allocation of optimization efforts across multiple pages.

    How does the multi-page alignment score distribution (KDE plot) help identify gaps or inconsistencies?

    Each curve represents the distribution of section-level alignment scores for a page. Left-shifted curves indicate strong alignment; right-shifted curves highlight sections that diverge from query topics. Comparing distributions helps detect pages with systemic misalignment and prioritize sections that fall in the high-score tail for targeted improvements.

    How should the page topic distribution pie chart be interpreted for content planning?

    Each slice represents a topic’s proportion across the page. Large slices indicate content concentration on specific topics, small slices highlight underrepresented topics. This chart helps identify gaps or overemphasis, guiding the addition of new sections for low-represented topics or adjusting existing content for better topic balance.

    What insights does the section-level KL divergence histogram provide for content optimization?

    The histogram shows the distribution of KL scores for sections within a page. Low-score bins indicate well-aligned sections, high-score bins indicate poorly aligned sections. Sections in high-score bins should be prioritized for rewriting, content restructuring, or semantic enrichment to improve overall page-topic alignment.

    How can the section alignment horizontal bar chart guide micro-level content improvements?

    Green bars represent best-aligned sections, red bars show worst-aligned sections, with exact KL scores displayed. This chart identifies individual sections requiring attention, allowing direct interventions such as rewriting, content expansion, or topic-focused adjustments. Best-aligned sections can serve as structural or semantic models for weaker sections.

    Final Thoughts

    The analysis demonstrates how KL Divergence effectively quantifies the alignment between content sections and target topic distributions. Sections with low divergence indicate strong semantic consistency with intended topics, while higher divergence highlights areas that differ from the expected topic distribution. Aggregating these measures provides an overall alignment score at the page level, allowing a clear assessment of content focus and relevance.

    Topic-level distributions and alignment metrics enable informed decisions regarding content balance, structural improvements, and semantic coverage. Best-aligned sections can guide the development of consistent content, while insights from less-aligned sections help refine and optimize topic representation across the page.

    Visualization tools complement the analysis by providing intuitive, actionable views of alignment patterns, topic distributions, and section-level performance. Together, these metrics and visualizations provide a structured framework for interpreting, understanding, and optimizing content alignment in a quantifiable, data-driven manner.

    Overall, KL Divergence serves as a reliable metric to measure divergence between probability distributions in topic modeling, translating complex semantic relationships into actionable insights for content evaluation and enhancement.



    Tuhin Banik

    Thatware | Founder & CEO

    Tuhin is recognized across the globe for his vision of revolutionizing the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, along with the India Business Awards and the India Technology Award; he has been named among the Top 100 influential tech leaders by Analytics Insights and a Clutch Global Frontrunner in digital marketing, founded the fastest-growing company in Asia according to The CEO Magazine, and is a TEDx and BrightonSEO speaker.

