Cohesion-Based Text Segmentation: Analyzes text cohesion to detect topic shifts within pages, refining search results


    Cohesion-Based Text Segmentation is an advanced SEO-focused solution designed to analyze webpage content and intelligently segment it based on shifts in topical coherence. Instead of treating a page as a flat block of text, this approach breaks the content into meaningfully distinct sections by detecting changes in semantic similarity across content blocks.


    The segmentation is further enriched by optionally incorporating user queries to measure how relevant each segment is to specific search intents. This dual-layered analysis—content cohesion and query alignment—enables businesses to understand which parts of their content support visibility for key queries and which segments may be off-topic, diluted, or in need of restructuring.

    Built as a modular, production-grade pipeline, the project supports:

    • Batch processing of multiple URLs
    • Flexible query-aware or query-agnostic modes
    • CSV exports and intuitive visualizations to support actionable insights

    Ultimately, this system equips SEO teams and digital strategists with a deeper understanding of how content structure affects user relevance, search engine crawling, and ranking precision.

    Project Purpose

    Traditional SEO audits often overlook the internal structure and coherence of page content, focusing primarily on surface-level factors like keyword presence, metadata, or backlinks. However, search engines increasingly prioritize content that is contextually consistent, semantically segmented, and aligned with user intent.

    This project was developed to fill that gap.

    The primary purpose of Cohesion-Based Text Segmentation is to:

    • Detect topic shifts within a webpage by analyzing the semantic cohesion between content blocks.
    • Isolate distinct topical segments, allowing SEOs to treat each segment as an individual unit for optimization, targeting, or restructuring.
    • Measure segment-level relevance to a given query, helping to identify which parts of the page serve the user’s intent and which dilute the content’s focus.
    • Support strategic decision-making for SEO improvement through clear visualizations, segment previews, and exportable CSV data.

    This segmentation-driven perspective supports smarter on-page optimization strategies, improved content targeting, and enhanced user experience—all of which translate into better rankings and higher engagement.


    Project’s Key Topics Explanation and Understanding

    This section explains the core technical and conceptual components that underpin the project “Cohesion-Based Text Segmentation: Analyzes text cohesion to detect topic shifts within pages, refining search results.” Each concept is directly tied to the project title and plays a central role in how the system operates to improve search alignment and user experience.

    Understanding Text Cohesion

    Text cohesion refers to how well different parts of a document stick together in terms of meaning, structure, and semantic flow. A cohesive passage exhibits continuity—its blocks or sentences are semantically related and smoothly connected.

    Cohesion is used as the foundational signal for detecting boundaries within a web page. When cohesion between adjacent blocks drops significantly, it may indicate a shift in topic. These low-cohesion points are flagged as potential segment boundaries.

    How It’s Measured: In this project, cohesion is quantified using semantic similarity between consecutive content blocks, computed via embedding vectors. Lower similarity implies weaker cohesion, signaling a possible topic shift.
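    To make this concrete, here is a minimal sketch of that measurement, assuming the all-MiniLM-L6-v2 model used later in this write-up (the block texts are placeholders):

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Two adjacent content blocks (placeholder text)
    block_a = "Our technical SEO audit covers crawlability and indexing."
    block_b = "Pricing plans start at a fixed monthly retainer."

    # Embed both blocks and compare; a low score suggests a topic shift
    emb = model.encode([block_a, block_b], normalize_embeddings=True)
    cohesion = float(cosine_similarity([emb[0]], [emb[1]])[0][0])
    print(f"Cohesion between adjacent blocks: {cohesion:.3f}")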

    Topic Shift Detection

    Topic shift detection identifies points in a text where the subject or context meaningfully changes. This change can happen between paragraphs, sections, or even within complex blocks of content.

    Detecting topic shifts enables the system to split a page into meaningful, topic-consistent segments. This makes it easier to:

    • Index and retrieve specific portions of content,
    • Improve query-based ranking,
    • Eliminate noisy or irrelevant text from influencing search alignment.

    Detection Strategy: The project uses two primary methods to detect topic shifts:

    • Cohesion Drop: A low similarity between a block and the next block suggests a possible shift.
    • Topic Embedding Divergence (Optional): If enabled, topic representations of adjacent segments are compared to confirm a substantial thematic deviation.

    Segment-Based Page Representation

    Rather than treat the entire page as a monolithic unit, the system splits it into logical segments, each representing a consistent topic or subtopic.

    This segmentation allows for:

    • Fine-grained matching of page segments to user queries,
    • Reduction of irrelevant content from affecting search rankings,
    • More informative snippet generation in SERPs.

    Construction of Segments: Segments are defined as contiguous groups of blocks with high internal cohesion and no topic shifts in between. Each segment is stored with metadata such as:

    • Number of blocks
    • Whether it starts at a detected topic shift
    • Similarity with neighboring segments

    Query-Aware Relevance Scoring

    Query-aware relevance scoring assesses how relevant each segment is to a given user query. It allows the system to prioritize segments that are not only topically consistent but also aligned with search intent.

    After segmentation, each segment is scored against the user query using vector similarity methods (e.g., cosine similarity between query embedding and segment embedding). This ensures:

    • Segments most relevant to the query are highlighted,
    • Result summaries become more accurate and intent-focused,
    • Irrelevant segments are filtered out, even if they’re part of the same page.

    Q&A Section for Understanding Project Value

    This section answers key client-facing questions that explain the practical value of the project “Cohesion-Based Text Segmentation: Analyzes text cohesion to detect topic shifts within pages, refining search results.” Each Q&A is crafted to help clients clearly understand why this solution matters for their SEO performance and how it supports measurable improvements.

    Why is segmenting a webpage by topic important for SEO?

    Traditional search engines evaluate entire pages as a single unit, which means that irrelevant or off-topic content can dilute the relevance of high-value sections. When a page covers multiple topics (e.g., combining informational content with promotional offers or FAQs), this can reduce ranking accuracy for specific queries.

    By segmenting the page based on topic shifts and cohesion, we isolate self-contained, topically aligned sections. This enables:

    • Better alignment with user intent: Only the most relevant segments are matched to search queries.
    • Improved snippet generation: Richer, context-specific excerpts appear in SERPs.
    • Reduced content dilution: Non-relevant sections no longer affect ranking calculations.

    This leads to higher visibility, better user engagement, and improved conversion rates from organic search.

    How does detecting topic shifts improve the search performance of my pages?

    When topic shifts are detected, the system understands where one idea ends and another begins. This allows the system to:

    • Target the most relevant content in your page for a given search query.
    • Avoid showing mixed or fragmented content in search results.
    • Refine internal linking strategies by linking to specific segments instead of entire pages.

    As a result, users find exactly what they’re looking for, faster—reducing bounce rates and improving overall engagement.

    Does this system work with existing content or do I need to rewrite my pages?

    This solution works entirely on your existing content. It analyzes your pages as they are and automatically segments them based on natural shifts in cohesion and meaning. There is no need to manually rewrite or reorganize content.

    However, the insights generated from segmentation can help you optimize future content creation, especially by identifying sections that:

    • Frequently cause topic shifts,
    • Are off-topic or redundant,
    • Could be turned into standalone landing pages.

    How is this different from keyword-based or rule-based content splitting?

    Unlike rule-based systems that split based on headings or specific phrases, this system uses semantic cohesion and topic embeddings—powered by advanced language models—to understand meaning at a deeper level.

    This means:

    • It adapts to any writing style or content structure.
    • It captures implicit topic boundaries even without headings.
    • It is language- and context-aware, offering a far more intelligent segmentation.

    This results in far more accurate and reliable content analysis than traditional keyword-based heuristics.

    How does segment-level scoring benefit SEO strategy?

    Segment-level scoring allows you to see which parts of your content best match target queries, rather than relying on whole-page relevance. This enables:

    • Prioritized optimization: Focus your SEO efforts on underperforming segments.
    • Targeted internal linking: Link from anchor text to high-scoring segments.
    • Conversion-focused layout: Reorganize your page to surface the most valuable segments earlier.

    Ultimately, this improves both search engine understanding and user experience, leading to better organic performance.

    Libraries Used

    requests

    The requests library is a widely used HTTP library in Python that allows for sending HTTP/1.1 requests easily and intuitively. It simplifies tasks such as fetching data from webpages, handling headers, authentication, and session management.

    In this project, requests is used to retrieve the HTML content from client webpages. This serves as the initial step in the pipeline, enabling us to analyze the live content structure directly from the source URLs.

    bs4 (BeautifulSoup) and Comment

    BeautifulSoup is a powerful Python library used for parsing HTML and XML documents. It enables clean and flexible traversal, searching, and modification of the parse tree (i.e., the document structure).

    In this project, BeautifulSoup (along with Comment) is used to extract text blocks from webpage HTML while removing unnecessary elements like scripts, styles, and comments. This ensures that only meaningful user-visible content is passed downstream for processing.

    re

    The re module is Python’s built-in library for regular expressions. It provides tools for string matching, searching, and pattern-based text substitution.

    In this project, re is used at multiple stages to clean text content, remove non-content patterns (e.g., URLs, symbols), and detect common text structures such as runs of spaces or leftover HTML tag patterns.

    logging

    The logging module in Python provides a flexible framework for emitting log messages from programs. It supports various levels of severity (e.g., DEBUG, INFO, WARNING, ERROR) and configurable output formats.

    In this project, logging is used to track information and debugging messages throughout the content extraction and segmentation pipeline. This helps in identifying issues during live runs and maintaining a professional-grade development environment.

    html

    The html library (from Python’s standard library) provides utilities for handling HTML entities, such as escaping or unescaping special characters.

    In this project, html is used to unescape HTML character codes in the extracted content. This ensures that the content is human-readable and semantically accurate before vector embedding and processing.

    unicodedata

    unicodedata is a built-in Python module used to work with Unicode character properties. It helps in standardizing text by normalizing characters.

    In this project, unicodedata is applied to normalize all extracted textual data to a consistent form (e.g., handling accented characters), which is critical for ensuring vector consistency and reliable downstream text analysis.

    numpy

    numpy is a foundational library for numerical computing in Python. It provides efficient array operations, matrix manipulation, and vectorized computation.

    In this project, numpy is used to manage and operate on vector embeddings of content blocks and queries, as well as during similarity scoring, PCA projection, and topic vector computations.

    sentence_transformers (SentenceTransformer)

    The sentence-transformers library offers pre-trained transformer-based models for generating semantically meaningful sentence and paragraph embeddings.

    In this project, SentenceTransformer is used to convert cleaned content blocks and queries into vector embeddings that capture their semantic meaning. These embeddings are the foundation for similarity scoring and topic segmentation.

    transformers.utils

    transformers.utils is part of the Hugging Face Transformers library, providing utility functions to control logging behavior, display outputs, and manage internal settings.

    In this project, utils.logging.set_verbosity_error() and disable_progress_bar() are used to suppress unwanted output and warnings during model usage. This keeps the console output clean and focused during batch processing.

    sklearn.metrics.pairwise (cosine_similarity)

    This module from scikit-learn provides utilities for calculating similarity or distance between vectors. cosine_similarity measures angular similarity, which is commonly used for comparing text embeddings.

    In our pipeline, cosine_similarity is used to score how closely each content block matches the query. It’s a key component in determining block-level relevance and helps rank the segmented outputs.

    keybert (KeyBERT)

    KeyBERT is a lightweight wrapper around BERT-based embeddings to extract keywords from text based on semantic similarity rather than frequency.

    In this project, KeyBERT is applied to extract meaningful topic labels for each content segment or cluster. These labels are client-friendly and help explain what each topic shift is about within the segmented content.

    typing (List, Dict, Optional)

    The typing module provides type hinting capabilities in Python, allowing developers to annotate variables, function parameters, and return types with expected data types.

    These types are used throughout the codebase to enforce function-level type safety, improve code readability, and enable better auto-completion and debugging support during development.

    csv

    The csv module is part of Python’s standard library and provides tools to read from and write to CSV (Comma-Separated Values) files.

    Although optional, csv can be used in this project to log processed output, export segment scores, or save analysis results for external inspection or report sharing.

    matplotlib.pyplot

    matplotlib.pyplot is a widely used Python plotting library for creating static, interactive, and animated visualizations.

    In this project, matplotlib.pyplot is used to generate bar plots, line charts, and distribution visuals that compare segment-level and page-level relevance. These visualizations are client-facing and offer actionable insights for website optimization.

    seaborn

    seaborn builds on top of matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

    In this project, seaborn is used to plot box plots and distribution charts with enhanced visual appeal and better default styling, making the visualizations easier to interpret for non-technical stakeholders.

    collections.defaultdict

    defaultdict is a subclass of Python’s built-in dict that provides default values for missing keys automatically.

    In this project, defaultdict is used to accumulate grouped data such as per-URL segment statistics, tone distributions, and topic counts. It simplifies the data aggregation logic, especially when iterating across nested structures.

    Function: extract_text_blocks

    Overview

    The extract_text_blocks() function is responsible for extracting high-quality, readable, and deduplicated content blocks from a given web page URL. Its core objective is to isolate only user-visible and semantically valuable content, eliminating noise such as scripts, styles, forms, navigation bars, or hidden elements.

    This function supports real-world SEO use cases where only meaningful content (such as paragraphs, headings, and list items) should be analyzed for further processing like embedding, segmentation, and scoring. It includes robust handling for HTTP errors, decoding inconsistencies, and content duplication.

    Key Code Highlights

    HTTP Request with Proper Headers

    Here, a desktop browser-like User-Agent is used to avoid being blocked or served incomplete content by the server. This helps mimic a real user and ensures maximum compatibility with SEO-targeted pages.

    Robust HTML Decoding Fallback

    This block ensures that even if response.text fails due to encoding issues, a fallback is attempted using response.content.decode(). It’s especially important when crawling diverse web pages with varying encodings.

    Cleaning Unwanted Elements from the DOM

    for tag in soup(["script", "style", "noscript", ...]): tag.decompose()

    This loop aggressively removes non-content elements such as scripts, styles, headers, navbars, and sidebars. These elements are common in SEO pages but do not add value to content analysis and can mislead downstream models.

    Removing Comments and Hidden Tags

    HTML comments and hidden content (via inline display:none or hidden attributes) are removed to avoid indexing or embedding irrelevant or misleading content.

    Filtering by Content Tag Types

    allowed_tags = ['p', 'li', 'blockquote', 'h1', 'h2', 'h3', 'h4']

    The function only considers tags that typically hold readable and meaningful content. This targets core SEO-relevant blocks while skipping layout or decorative tags.

    Block Quality Filtering

    Content blocks must meet a minimum word threshold (min_word_count) and must not be dominated by non-ASCII (often non-English or special symbol-heavy) characters. This helps ensure relevance and readability for English-language SEO projects.

    Deduplication Using Hashing

    The hash of the lowercase version of each block is used to detect and skip duplicate content. This prevents the same block from being analyzed multiple times, especially in pages with repeated sections (like repeated CTAs or boilerplate).
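    Putting these highlights together, the extraction flow looks roughly like the following sketch. This is not the project's exact implementation; the header string, tag lists, and defaults are illustrative:

    import requests
    from bs4 import BeautifulSoup, Comment

    def extract_text_blocks(url: str, min_word_count: int = 5) -> list:
        # Browser-like User-Agent to avoid being blocked or served stripped-down HTML
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()

        # Fallback decoding in case response.text trips on the declared encoding
        try:
            html_text = response.text
        except UnicodeDecodeError:
            html_text = response.content.decode("utf-8", errors="replace")

        soup = BeautifulSoup(html_text, "html.parser")

        # Aggressively remove non-content elements
        for tag in soup(["script", "style", "noscript", "header", "footer", "nav", "aside", "form"]):
            tag.decompose()

        # Strip HTML comments and hidden elements
        for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
            comment.extract()
        for hidden in soup.select('[style*="display:none"], [hidden]'):
            hidden.decompose()

        # Keep only tags that typically carry readable content
        allowed_tags = ["p", "li", "blockquote", "h1", "h2", "h3", "h4"]
        blocks, seen_hashes = [], set()
        for tag in soup.find_all(allowed_tags):
            text = tag.get_text(" ", strip=True)
            if len(text.split()) < min_word_count:
                continue  # drop trivial blocks
            if sum(ord(c) > 127 for c in text) > 0.3 * len(text):
                continue  # drop blocks dominated by non-ASCII characters
            block_hash = hash(text.lower())  # deduplicate repeated CTAs/boilerplate
            if block_hash in seen_hashes:
                continue
            seen_hashes.add(block_hash)
            blocks.append({"block_id": len(blocks), "text": text})
        return blocks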

    Function: preprocess_blocks

    Overview

    The preprocess_blocks() function performs rigorous cleaning and filtering of raw textual content blocks extracted from a web page. Its goal is to prepare high-quality, noise-free inputs for embedding models by applying a sequence of transformations that remove boilerplate language, links, formatting markers, and redundant characters.

    This function supports content standardization for SEO tasks such as topic modeling, relevance scoring, or segmentation — where the quality and consistency of input text directly affect the reliability of downstream models.

    It maintains the original block IDs, ensuring traceability from raw extraction through to final output, which is important for interpretability and URL-block level alignment in SEO analysis.

    Key Code Highlights

    Regex Pattern Setup for Noise Removal

    A comprehensive set of regular expressions is compiled to identify and remove:

    • Common boilerplate SEO phrases (e.g., legal links or generic CTAs)
    • Embedded URLs
    • List-style formatting (bullets, numbers, roman numerals)

    This helps strip away non-informative scaffolding that would otherwise pollute semantic representations.

    Text Substitution for Special Characters

    Special typography symbols (curly quotes, dashes, invisible spaces) are normalized into simpler equivalents. This helps prevent token mismatch during embedding and improves readability in output.

    Cleaning Pipeline: clean_text() Inner Function

    This function handles:

    • Decoding HTML entities such as &amp; into &
    • Unicode normalization for accented or unusual characters
    • Regex-based stripping of URLs, bullets, and unwanted prefixes
    • Final whitespace compression and trimming

    By encapsulating this logic, the code keeps cleaning steps modular and maintainable.

    Block Filtering and Output Structuring

    Only blocks with a minimum word count (min_word_count, default 5) are retained. This ensures that trivial or malformed lines are excluded, preserving only semantically valuable content.

    Output is returned as a list of dictionaries with block_id and cleaned text, ready for embedding.
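    A condensed sketch of this cleaning flow is shown below; the regex patterns are illustrative stand-ins for the project's more comprehensive set:

    import html
    import re
    import unicodedata

    # Illustrative noise patterns; the real pipeline compiles a larger set
    URL_PATTERN = re.compile(r"https?://\S+")
    BULLET_PATTERN = re.compile(r"^\s*(?:[-•*]|\d+[.)]|[ivxIVX]+[.)])\s+")
    BOILERPLATE_PATTERN = re.compile(r"(?i)\b(read more|privacy policy|terms of service)\b")

    def preprocess_blocks(blocks: list, min_word_count: int = 5) -> list:
        def clean_text(text: str) -> str:
            text = html.unescape(text)                  # decode entities like &amp;
            text = unicodedata.normalize("NFKC", text)  # standardize Unicode forms
            text = URL_PATTERN.sub("", text)            # strip embedded URLs
            text = BULLET_PATTERN.sub("", text)         # strip list-style prefixes
            text = BOILERPLATE_PATTERN.sub("", text)    # strip boilerplate phrases
            # Normalize special typography into simpler equivalents
            text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2014", "-")
            return re.sub(r"\s+", " ", text).strip()    # compress whitespace

        cleaned = []
        for block in blocks:
            text = clean_text(block["text"])
            if len(text.split()) >= min_word_count:     # keep only substantive blocks
                cleaned.append({"block_id": block["block_id"], "text": text})
        return cleaned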

    Function: load_embedding_model

    Overview

    The load_embedding_model() function is responsible for loading a Sentence-BERT model used to convert cleaned textual content blocks into dense vector embeddings. These embeddings capture semantic meaning in a way that allows for measuring similarity, performing clustering, and powering downstream tasks like topic segmentation and relevance scoring.

    This function abstracts the model-loading step to allow configurable model selection (defaulting to “all-MiniLM-L6-v2”), enabling flexibility for experiments or upgrades. It’s used consistently across the pipeline wherever semantic representation of text is needed.

    Key Code Highlights

    SentenceTransformer Model Loading

    sbert_model = SentenceTransformer(model_name)

    The function utilizes the SentenceTransformer class from the sentence-transformers library, a wrapper around pretrained Transformer models optimized for sentence- and paragraph-level embeddings.

    The default model, “all-MiniLM-L6-v2”, is:

    • Compact and efficient (suitable for large-scale crawling or client-side deployments)
    • Trained to produce embeddings that perform well on semantic similarity and clustering
    • Frequently used in SEO contexts where embedding both content blocks and queries is needed for relevance and semantic alignment
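    As a usage sketch (the wrapper simply defers to SentenceTransformer; the sample text is a placeholder):

    from sentence_transformers import SentenceTransformer

    def load_embedding_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
        # Configurable model selection, defaulting to the compact MiniLM variant
        return SentenceTransformer(model_name)

    model = load_embedding_model()
    vector = model.encode("advanced seo services", normalize_embeddings=True)
    print(vector.shape)  # (384,) for all-MiniLM-L6-v2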

    Function: generate_block_embeddings

    Overview

    The generate_block_embeddings() function takes in a list of content blocks (typically parsed and cleaned text segments from a web page) and a preloaded Sentence-BERT model. It transforms each block’s textual content into a high-dimensional semantic embedding — a numerical vector representing its meaning.

    These embeddings are crucial for the project’s downstream tasks:

    • Measuring similarity between adjacent blocks (for topic shift detection)
    • Comparing blocks with user queries (for query relevance scoring)
    • Clustering or segmenting blocks (for semantic topic grouping)

    This step is foundational — it transforms raw text into a mathematically tractable form for all later stages.

    Key Code Highlights

    Text Extraction

    texts = [block["text"] for block in blocks]

    • Extracts only the textual content from each block.
    • These texts are sent to the Sentence-BERT model in batch mode for efficient embedding generation.

    Embedding Generation

    embeddings = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

    • model.encode() is called on the list of block texts.
    • convert_to_numpy=True: returns embeddings as NumPy arrays for compatibility with similarity functions (e.g., cosine similarity).
    • normalize_embeddings=True: ensures embeddings are L2-normalized, which improves cosine similarity accuracy and ensures stability when comparing across varying content.

    This generates one embedding per content block — each embedding being a fixed-size vector (typically 384 dimensions for “all-MiniLM-L6-v2”).
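    Assembled into a function, the step looks roughly like this sketch (the sample blocks are placeholders):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    def generate_block_embeddings(blocks: list, model) -> np.ndarray:
        texts = [block["text"] for block in blocks]
        # Batch-encode all blocks; normalized NumPy output suits cosine similarity
        return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sample_blocks = [{"text": "Advanced SEO combines AI and semantic analysis."},
                     {"text": "Contact our team for a free consultation."}]
    embeddings = generate_block_embeddings(sample_blocks, model)
    print(embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2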

    Function: score_block_cohesion

    Overview

    The score_block_cohesion() function measures the semantic similarity between adjacent content blocks using cosine similarity. It computes how closely two neighboring blocks are related in terms of meaning, which is fundamental for detecting topic shifts within a webpage.

    In the context of this SEO project, cohesion scores help:

    • Segment content based on semantic continuity
    • Highlight where topic transitions occur
    • Support refined retrieval by separating unrelated or loosely connected sections

    Key Code Highlights

    Iterate Over Adjacent Block Pairs

    • Loop through every pair of adjacent content blocks.
    • Extract the pre-computed embeddings for each pair (from SBERT).

    Compute Cosine Similarity

    • Measures semantic cohesion using cosine similarity.
    • Returns a score between -1 and 1, with values closer to 1 indicating strong similarity (high cohesion).
    • Wrapped in float() to ensure it’s JSON-serializable.
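    A minimal sketch of this pairwise scoring (the signature is illustrative; embeddings is the array produced in the previous step):

    from sklearn.metrics.pairwise import cosine_similarity

    def score_block_cohesion(embeddings) -> list:
        scores = []
        # Compare each block with its immediate successor
        for i in range(len(embeddings) - 1):
            sim = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
            scores.append(float(sim))  # float() keeps scores JSON-serializable
        return scores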

    Function: segment_blocks_by_cohesion

    Overview

    The segment_blocks_by_cohesion() function divides a sequence of content blocks into semantically cohesive segments based on pre-computed pairwise cohesion scores. This segmentation identifies where topic shifts naturally occur within a webpage and splits content accordingly. It is a central part of the project’s goal: refining content understanding and improving search result alignment by analyzing cohesion.

    This approach helps:

    • Isolate distinct subtopics within a page.
    • Improve query-to-content relevance by mapping queries to cleaner topic-aligned segments.
    • Simplify content audits for SEO optimization, e.g., spotting where to split or merge content.

    Key Code Highlights

    Iterate Through Cohesion Scores

    • Compares each pair’s cohesion score to a threshold.
    • If the cohesion is above the threshold, the next block is considered part of the same topic.
    • If the cohesion drops below the threshold, it marks a topic boundary and starts a new segment.

    Add Final Segment

    • Ensures the last segment is captured, even if it doesn’t hit a boundary.
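    A compact sketch of this boundary logic, assuming an illustrative threshold (the project's exact default is not shown here):

    def segment_blocks_by_cohesion(blocks: list, cohesion_scores: list,
                                   threshold: float = 0.5) -> list:
        if not blocks:
            return []
        segments, current = [], [blocks[0]]
        for i, score in enumerate(cohesion_scores):
            if score >= threshold:
                current.append(blocks[i + 1])  # cohesion holds: same topic continues
            else:
                segments.append(current)       # cohesion drop: close the segment
                current = [blocks[i + 1]]      # next block starts a new topic
        segments.append(current)               # capture the final segment
        return segments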

    Function: generate_segment_representations

    Overview

    The generate_segment_representations() function transforms each segment — a group of semantically cohesive content blocks — into a single, unified representation. This is done by merging the block texts and computing a single embedding for the entire segment. The output is a list of enriched segment dictionaries containing both metadata and vector representation. These segment-level embeddings support further analysis, such as relevance scoring or segment-level clustering.

    This function is crucial in:

    • Providing a semantic summary of each segment.
    • Supporting block-to-segment-level aggregation in search or content analytics.
    • Enabling efficient comparison between segments and queries.

    Key Code Highlights

    Initialize Output Structure

    segment_representations = []

    • Prepares a list to collect processed segment data.

    Iterate Over Segments and Generate Representations

    • Concatenates all block texts within a segment into one continuous string.
    • Encodes this combined text using the SBERT model to obtain a normalized segment embedding.

    Prepare Segment Metadata

    urls = list({block.get('url', 'unknown') for block in seg_blocks})

    • Extracts the distinct source URLs from the segment’s blocks (the result is not yet stored in the output, but this prepares for future improvements).
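    A sketch of the aggregation step; the field names mirror those referenced elsewhere in this write-up (segment_text, embedding, source_blocks), though the exact schema is illustrative:

    def generate_segment_representations(segments: list, model) -> list:
        segment_representations = []
        for seg_blocks in segments:
            # Merge all block texts into one string for a segment-level embedding
            segment_text = " ".join(block["text"] for block in seg_blocks)
            embedding = model.encode(segment_text, convert_to_numpy=True,
                                     normalize_embeddings=True)
            segment_representations.append({
                "segment_text": segment_text,
                "num_blocks": len(seg_blocks),
                "source_blocks": seg_blocks,
                "embedding": embedding,
            })
        return segment_representations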

    Function: get_extractor_model

    Overview

    The get_extractor_model() function initializes and returns an instance of the KeyBERT model, which is used for keyword extraction based on semantic similarity. KeyBERT leverages transformer-based sentence embeddings (typically from BERT or Sentence-BERT) to identify the most representative keywords or phrases from a given text.

    This model becomes a fundamental component when we need to extract concise topic descriptors from larger segments of text, such as content blocks or segment-level summaries.

    Key Code Highlights

    Model Initialization

    model = KeyBERT()

    • This line initializes the KeyBERT model using its default settings, which typically uses all-MiniLM-L6-v2 under the hood if not explicitly set.
    • The model computes similarity between the full text embedding and phrase embeddings extracted from that same text to rank keyword candidates.

    Function: extract_keywords_for_segments

    Overview

    The extract_keywords_for_segments() function enriches each segmented block of content with a set of top keywords or phrases using the KeyBERT model. These keywords serve as compact representations of each segment’s topical focus, which is helpful in identifying and labeling topic shifts, enhancing visual interpretation, and supporting query-focused summarization.

    The function supports optional customization, such as the number of keywords and the n-gram range (min/max phrase length), and uses a MaxSum similarity strategy for improved diversity in keyword extraction.

    Key Code Highlights

    Iterate Over Segments and Extract Keywords

    • For each segment, the combined segment text is passed to extract_keywords().

    • Extraction uses:

    • keyphrase_ngram_range: to control phrase length (e.g., 2-4 words).
    • stop_words='english': filters common non-informative words.
    • top_n: limits number of returned keywords.
    • use_maxsum=True: ensures selected keywords are diverse and less redundant.
    • nr_candidates=20: expands the candidate pool before applying selection.

    Attach Extracted Keywords to Each Segment

    seg["keywords"] = [kw for kw, _ in keywords]

    • Only the keyword text is stored (not the similarity score).
    • Each segment dictionary is updated with a “keywords” field.
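    A sketch of the extraction loop using KeyBERT's documented parameters (the defaults shown are illustrative):

    from keybert import KeyBERT

    def extract_keywords_for_segments(segments: list, kw_model: KeyBERT,
                                      top_n: int = 5,
                                      ngram_range: tuple = (2, 4)) -> list:
        for seg in segments:
            keywords = kw_model.extract_keywords(
                seg["segment_text"],
                keyphrase_ngram_range=ngram_range,  # min/max phrase length
                stop_words="english",               # drop non-informative words
                top_n=top_n,                        # limit returned keywords
                use_maxsum=True,                    # favor diverse, less redundant phrases
                nr_candidates=20,                   # candidate pool before selection
            )
            seg["keywords"] = [kw for kw, _ in keywords]  # keep text, drop scores
        return segments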

    Function: generate_embeddings

    Overview

    The generate_embeddings() function produces a dense vector (embedding) for any given input text using a pre-loaded embedding model such as Sentence-BERT. These embeddings are used throughout the project to compare semantic similarity between content blocks, segments, and queries.

    The function ensures normalized vector output, which is essential for accurate cosine similarity comparison in downstream steps such as cohesion scoring or relevance ranking.

    Key Code Highlights

    Input Validation and Embedding Generation

    • Checks if both text and embedding model are provided.
    • Uses the model’s .encode() method to convert the input string into a high-dimensional numpy vector.
    • convert_to_numpy=True ensures compatibility with downstream NumPy-based operations.
    • normalize_embeddings=True scales the vector to unit length, which is crucial for cosine similarity to function as intended.
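    In sketch form, with the validation behavior as described (raising on missing inputs is an assumption):

    import numpy as np

    def generate_embeddings(text: str, model) -> np.ndarray:
        if not text or model is None:
            raise ValueError("Both text and an embedding model are required.")
        # Unit-length output so cosine similarity behaves as intended downstream
        return model.encode(text, convert_to_numpy=True, normalize_embeddings=True)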

    Function: score_query_segment_relevance

    Overview

    The score_query_segment_relevance() function calculates how semantically relevant each content segment is with respect to a given query. It uses cosine similarity between the segment embeddings and the query embedding to generate a query_relevance_score for each segment.

    This score helps identify which content segments are most aligned with the user’s search intent, allowing more accurate filtering and ranking of segments in the final results. The function also supports filtering with a minimum relevance threshold and optionally returning only the top-k most relevant segments.

    Key Code Highlights

    Relevance Score Computation

    score = float(cosine_similarity([query_embedding], [seg["embedding"]])[0][0])

    • Calculates cosine similarity between the query vector and each segment’s embedding.
    • cosine_similarity returns a matrix, from which the scalar similarity is extracted and cast to float.
    • The score represents how closely the segment matches the query on a semantic level.

    Filtering by Threshold and Collecting Results

    • Segments with scores below the specified min_score_threshold are excluded.
    • Matching segments are copied and updated with a query_relevance_score key.
    • The score is rounded for cleaner presentation in result summaries.

    Sorting and Optional Top-K Selection

    • Segments are sorted in descending order by relevance score.
    • If top_k is specified, only the top-k segments are retained.
    • This ensures that only the most valuable and relevant segments are returned for final use in client-facing results or visualizations.
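    A sketch of the scoring-and-filtering flow; the default threshold here is illustrative (chosen to match the relevance bins discussed later):

    from sklearn.metrics.pairwise import cosine_similarity

    def score_query_segment_relevance(segments: list, query_embedding,
                                      min_score_threshold: float = 0.4,
                                      top_k: int = None) -> list:
        results = []
        for seg in segments:
            score = float(cosine_similarity([query_embedding], [seg["embedding"]])[0][0])
            if score < min_score_threshold:
                continue                        # drop segments below the threshold
            scored = dict(seg)                  # copy, then attach the score
            scored["query_relevance_score"] = round(score, 4)
            results.append(scored)
        # Highest-relevance segments first; optionally keep only the top-k
        results.sort(key=lambda s: s["query_relevance_score"], reverse=True)
        return results[:top_k] if top_k else results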

    Function: compute_segment_similarities

    Overview

    The compute_segment_similarities() function measures semantic cohesion between adjacent segments by computing cosine similarity between their text embeddings. This comparison reveals how smoothly the content flows from one segment to another in terms of meaning, allowing for the identification of major topic shifts across a document.

    The function utilizes Sentence-BERT or similar models to embed each segment’s combined text and then calculates similarity between consecutive embeddings. This step is particularly useful when analyzing or visualizing how different sections relate to one another.

    Key Code Highlights

    Embedding Segment Texts

    embeddings = [generate_embeddings(seg["segment_text"], model) for seg in segments]

    • Uses a pre-defined utility generate_embeddings() to compute normalized vector representations of each segment’s combined text.
    • The embeddings are crucial for meaningful similarity calculations and topic cohesion scoring between segments.
    • The same model is used for all embeddings to maintain consistency.

    Calculating Cosine Similarities Between Consecutive Segments

    • Iterates over adjacent pairs of segment embeddings.
    • Computes cosine similarity for each pair and stores it in the similarities list.
    • The result is a series of similarity scores that reflect how semantically close each segment is to the next.

    This function enables post-segmentation diagnostics and supports better display formatting or visualization of topic transitions in long-form SEO content.
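    Combining the pieces above, a sketch (reusing the generate_embeddings() helper sketched earlier):

    from sklearn.metrics.pairwise import cosine_similarity

    def compute_segment_similarities(segments: list, model) -> list:
        embeddings = [generate_embeddings(seg["segment_text"], model) for seg in segments]
        similarities = []
        # Compare each segment with the one that follows it
        for i in range(len(embeddings) - 1):
            sim = float(cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0])
            similarities.append(sim)
        return similarities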

    Function: format_segment_output

    Overview

    The format_segment_output() function prepares segmentation results into a structured and interpretable format suitable for client reporting, visualizations, or downstream processing. It consolidates metadata such as text previews, block ranges, topic keywords, relevance scores, and topic shift indicators into a simplified, client-friendly dictionary per segment.

    This function enhances the usability of segment data by embedding logical cues like topic shift detection and block continuity, allowing for effective storytelling in SEO reports.

    Key Code Highlights

    Core Metadata Extraction and Text Previewing

    • Limits the segment text preview to a configurable number of characters for concise display.
    • Ensures the reported block count reflects either the provided value or the actual length of source_blocks.

    Block Range Formatting for Client Interpretation

    • Extracts and sorts block IDs from each segment to compute a human-readable range.
    • If blocks are sequential, uses a dash (e.g., 4–7), otherwise lists them explicitly (e.g., 2, 4, 6).
    • This helps clients see which parts of their content are grouped together.
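    The range formatting can be sketched as a small helper (the function name is hypothetical):

    def format_block_range(block_ids: list) -> str:
        ids = sorted(block_ids)
        # Sequential IDs collapse to a dash range; gaps are listed explicitly
        if ids == list(range(ids[0], ids[-1] + 1)):
            return f"{ids[0]}-{ids[-1]}"
        return ", ".join(str(i) for i in ids)

    print(format_block_range([4, 5, 6, 7]))  # "4-7"
    print(format_block_range([2, 4, 6]))     # "2, 4, 6"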

    Topic Shift Detection Using Similarity Thresholds

    • Compares the current segment to the next one using precomputed cosine similarity scores.
    • Flags a segment boundary as a topic shift when similarity falls below a user-defined threshold (default 0.75).
    • Supports diagnostic views that explain why segments were split.

    Function: display_segment_results

    The display_segment_results() function presents the segmented output from a web page in a clear, structured, and client-friendly format. It highlights each segment’s block range, preview text, key topic phrases, and optional query relevance scores. Most importantly, it visually marks where topic shifts occur based on content cohesion, helping clients understand how their page structure aligns with coherent topic boundaries.

    This display function enhances interpretability by summarizing key information per segment and making the content segmentation actionable for audits, SEO restructuring, and optimization planning.

    Result Analysis and Explanation

    Project Context

    This analysis demonstrates how the page content at ThatWare’s Advanced SEO Services is segmented into distinct topical sections based on internal cohesion and semantic flow. The goal is to evaluate how well the site supports the query: “how to improve SEO with advanced tools”, and to identify which parts of the page are most valuable, actionable, and relevant to that query.

    The system generated 40 topic-based segments, from which the top 5 most relevant segments were selected based on a combination of semantic alignment (cosine similarity with query embedding) and cohesion structure (topic shift detection across segments).

    Key Observations from Top Segments

    Segment #22 — Core Strategic Insights

    • Block Range: 66–77 (12 blocks)

    • Topic Cohesion: High continuity (Similarity with next: 0.85)

    • Key Themes:

    • Strategic execution of advanced SEO
    • Competitive advantage through combined techniques
    • Focused increase in traffic using sophisticated tools

    Interpretation: This is the strongest segment for the query. It highlights direct benefits of using advanced SEO strategies, explicitly referring to increased traffic, execution effectiveness, and service focus. The consistent tone and high internal cohesion make it a primary content block for user queries around actionable SEO improvements using tools.

    Segment #19 — Value Proposition and Service Reliability

    • Block Range: 49–52 (4 blocks)

    • Topic Shift Detected (Similarity with next: 0.73)

    • Key Themes:

    • Long-term impact of advanced SEO
    • Positioning ThatWare as an industry expert
    • Importance of ongoing investment and trust in services

    Interpretation: This segment communicates the value and expectation-setting around advanced SEO tools. Although shorter than the others, it reinforces ThatWare’s positioning and conveys why investing in structured strategies pays off over time—an essential concern for decision-makers evaluating SEO vendors.

    Segment #24 — Dedicated Support as a Differentiator

    • Block Range: 79–84 (6 blocks)

    • Topic Cohesion: Moderate-to-strong (Similarity: 0.76)

    • Key Themes:

    • Access to support and expertise
    • ThatWare’s team integration model
    • Making advanced SEO accessible and affordable

    Interpretation: This segment enhances trust and reliability by emphasizing hands-on support. For clients looking to improve SEO using tools, the availability of a committed team adds reassurance that tools won’t be left to clients to figure out alone. The mention of affordability also speaks to ROI concerns.

    Segment #12 — Technical SEO Execution Focus

    • Block Range: 31–38 (8 blocks)

    • Topic Shift Detected (Similarity with next: 0.74)

    • Key Themes:

    • Assignment of technical SEO experts
    • Role of audits and proactive assistance
    • Importance of a systematic SEO foundation

    Interpretation: This section drills down into operational excellence. For queries about improving SEO with tools, this segment highlights how technical audits and expert assignments turn strategy into execution. The personalized nature of the service connects well with businesses seeking structured onboarding.

    Segment #3 — Framing the Concept of Advanced SEO

    • Block Range: 2–21 (20 blocks)

    • Topic Shift Detected (Similarity with next: 0.73)

    • Key Themes:

    • Introduction to advanced SEO philosophy
    • Complexity and depth of modern SEO practices
    • Starting point for clients unfamiliar with technical terms

    Interpretation: Though introductory in tone, this segment lays a critical conceptual foundation. It sets expectations and educates users about what “advanced SEO” actually involves. This is essential for framing later high-value content and ensuring clients fully understand what they are investing in.

    Overall Patterns and Implications

    • High-Value Clusters Identified: Segments #22, #24, and #12 form a strong mid-to-bottom funnel cluster—speaking directly to implementation, benefits, and support. These are most valuable for clients actively looking to improve SEO outcomes.

    • Effective Messaging Progression: The content flows from explanation (Segment #3) to credibility building (Segment #19), to action-oriented sections (Segment #22) and support assurance (Segment #24). This mirrors a user journey from awareness to conversion.

    • Clear Topic Shifts Enable Auditing: Identifying where topic shifts occur (e.g., Segments #3, #12, #19) helps pinpoint where new sections might need clearer visual or structural separation. This can improve UX and better guide the reader through the content.

    • Strong Relevance to Query: All selected segments align semantically with the query, though Segment #22 most directly answers “how to improve SEO with advanced tools.” This indicates that the page has pockets of very strong alignment but could benefit from highlighting or surfacing those insights more clearly.

    Result Analysis and Explanation

    This section outlines how the content on multiple web pages was segmented, analyzed for topic coherence, and scored for relevance against a specific SEO-focused query. The analysis provides actionable insights into how well different sections of the page align with user intent and how clearly content transitions from one topic to another.

    Segment Identification and Topic Cohesion

    Each webpage was parsed into coherent textual segments based on underlying content cohesion patterns. These segments represent shifts or continuities in the discussion, which are important for identifying the structural quality of content. Content that shifts too frequently or lacks clear topical structure can harm both user engagement and search engine interpretation.

    On average, each URL yielded between 40 and 250 segments, depending on page length and topical variance. Pages with more detailed content naturally generated a higher number of segments, especially when covering multiple SEO concepts or tools.

    Query Relevance Scoring

    Each segment was evaluated for its semantic relevance to the target query using contextual embeddings. The relevance scores range from 0 (not relevant) to 1 (highly relevant). For interpretability, the scores were grouped into the following bins:

    • Highly Relevant: Scores > 0.5
    • Moderately Relevant: Scores between 0.40 and 0.50
    • Low Relevance: Scores < 0.40

    Most segments across the URLs fell into the Moderately Relevant range, suggesting that while the pages addressed related concepts, only a limited number of sections were tightly aligned with the specific query. This provides an opportunity to either restructure existing content or insert targeted sections that directly address high-intent user queries.

    Topic Shift Detection

    To understand structural flow, the analysis also computed semantic similarity between consecutive segments. A low similarity score between two adjacent segments indicates a potential topic shift, while a high score suggests continuity. This helps identify abrupt transitions that may confuse readers or dilute thematic focus.

    Topic shifts were flagged when the cosine similarity dropped below a context-aware threshold, typically around 0.5 to 0.6. A healthy balance of continuity and occasional shifts (when transitioning to a new sub-topic) is desirable. Excessive or unintentional shifts may suggest disorganized content flow.

    Key Visualization Insights

    The following plots were generated to provide a high-level overview and support content improvement decisions:

    • Total Segments per URL
      • Indicates page complexity and content density. A high number of segments may require better organization or summarization.
    • Average Query Relevance Score per URL
      • Shows how well each page addresses the target query overall. This helps compare performance across URLs and identify the strongest content.
    • Detected Topic Shifts per URL
      • Reflects how often content transitions between topics. Pages with frequent shifts may need clearer headings, better grouping, or content smoothing.
    • Similarity with Next Segment Distribution
      • Visualizes segment-to-segment cohesion across all pages. A skewed distribution toward low similarity may suggest fragmented or loosely connected content.
    • Query Relevance Score Distribution (All URLs)
      • Helps assess whether most content is directly addressing the query or diverging into peripheral areas. A healthy distribution should lean toward higher scores, especially in key sections.

    Practical Interpretation for SEO Strategy

    • Content Structuring: Pages with high segment counts and frequent topic shifts may benefit from improved subheadings, internal linking, or reordering of paragraphs to preserve topical cohesion.
    • Targeted Optimization: Segments with low relevance scores present prime opportunities for inserting query-aligned keywords, refining explanations, or expanding on tools/metrics discussed superficially.
    • Content Prioritization: Pages with higher average relevance scores can be prioritized for promotion, while others may need content refresh cycles or targeted optimization strategies.

    What does segmenting the content reveal about my page?

    Segmenting your page using cohesion-based analysis breaks it into logically connected content blocks. This allows us to pinpoint where your content naturally shifts in topic and how well it flows overall. For example, if your page discusses technical SEO, content strategy, and analytics, we identify where one ends and the next begins. This is valuable because a well-segmented structure improves readability, user engagement, and the way search engines interpret content relevance — all of which impact SEO performance.

    How do I interpret the query relevance scores assigned to content segments?

    Each segment is scored based on how closely it aligns with the target query (e.g., “how to improve SEO with advanced tools”). These scores range from 0 (not relevant) to 1 (highly relevant) and are binned into:

    • Highly Relevant (> 0.5): Content directly answers or supports the query.
    • Moderately Relevant (0.40–0.50): Related content that touches the topic but lacks direct alignment.
    • Low Relevance (< 0.40): Generic or off-topic content from the query’s perspective.

    Segments with high relevance scores are where your SEO value is strongest. Others may need adjustment—e.g., rephrasing, deeper explanation, or keyword refinement—to increase alignment and drive better performance.

    Why are topic shifts important for SEO content?

    Frequent or unintentional topic shifts make content harder to follow, which can hurt user engagement and increase bounce rates. Our analysis uses semantic similarity to detect these shifts. If segments frequently jump between unrelated topics, it may signal that content organization needs improvement. On the other hand, a smooth progression of ideas keeps readers engaged and helps search engines clearly map your content to user queries.

    How do I know if my content is too fragmented?

    We analyze how many segments are generated and how often topic shifts occur. A page with many short segments and frequent shifts might be overly fragmented, signaling a lack of cohesive structure. This fragmentation can confuse readers and dilute the content’s SEO focus. By reviewing the “Total Segments per URL” and “Detected Topic Shifts” visualizations, we can assess if consolidation, restructuring, or clearer subheadings are needed.

    What improvements should I make based on low-relevance segments?

    Segments with low query relevance are ideal targets for optimization. These areas can be updated to better reflect the query’s intent—e.g., by adding examples, including tools, or introducing clearer connections to the main topic. This ensures more of your content actively contributes to your SEO goals. You can also repurpose low-value sections into new content or supplement them with targeted FAQs or internal links.

    Can this analysis help me prioritize pages for optimization?

    Yes. Our analysis compares relevance and cohesion metrics across all your pages. Pages with low average query relevance scores or high topic fragmentation are clear candidates for targeted updates. Meanwhile, high-performing pages can be prioritized for promotion, backlinking, or structured data enhancements. The “Average Query Relevance Score per URL” plot is especially helpful in setting content priorities across your site.

    How does this project go beyond standard SEO audits?

    Unlike traditional audits that only highlight missing tags or keyword counts, this project provides semantic-level insights. It tells you not just what is missing, but where and why — using advanced NLP techniques to detect topic coherence, semantic shifts, and alignment with user intent. This allows for surgical precision in content improvements, giving your SEO strategy a competitive edge.

    Final Thoughts

    This project delivers a deeper, more intelligent understanding of how your web content performs in relation to user intent and topical structure. By applying cohesion-based segmentation and query-aware relevance scoring, we move beyond surface-level SEO metrics and into the semantics of your content — identifying where it aligns, where it diverges, and how its structure either supports or hinders discoverability.

    The insights uncovered through segment-level analysis, topic shift detection, and relevance scoring provide a clear, actionable roadmap for content refinement. With visual breakdowns across key metrics like topic continuity and alignment with target queries, this approach empowers strategic SEO decisions that are both scalable and focused on impact.

    Overall, this analysis not only highlights opportunities for improving individual content pieces but also lays the groundwork for a more cohesive and intent-aligned content architecture across your site.


    Tuhin Banik

    Thatware | Founder & CEO

    Tuhin is recognized across the globe for his vision of revolutionizing the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, won the India Business Awards and the India Technology Award, was named among the Top 100 influential tech leaders by Analytics Insights and a Clutch Global Frontrunner in digital marketing, founded the fastest-growing company in Asia according to The CEO Magazine, and is a TEDx and BrightonSEO speaker.
