ELMo (Embeddings from Language Models) : Provides deep contextualized word representations – Next Gen SEO with Hyper-Intelligence


    Project Summary

    This project applies ELMo (Embeddings from Language Models) to analyze and optimize web content through deep contextual understanding. The focus is on two key areas: website similarity detection and keyword match evaluation. By leveraging ELMo’s ability to capture word meanings in context, the project aims to provide more accurate insights into content overlap and keyword relevance, supporting SEO strategies with meaningful language-based analysis.

    ELMo (Embeddings from Language Models)

    Purpose of the Project

    The purpose of this project is to improve SEO-related decision-making by analyzing website content through contextual language understanding. Traditional keyword or phrase matching often misses the deeper meaning and relevance between different pieces of content. This project uses ELMo embeddings to overcome that limitation by understanding how words are used in different contexts across web pages.

    By focusing on two specific tasks — identifying similarities between web pages and evaluating keyword match quality — the project provides insights that help in detecting redundant content, optimizing keyword use, and improving overall content strategy.

    What is ELMo?

    ELMo stands for Embeddings from Language Models. It is a deep learning model developed by researchers at the Allen Institute for AI. ELMo is designed to give computers a deeper understanding of natural language, helping machines grasp not only the meaning of individual words, but also how the meaning changes based on context. Traditional methods treated every word the same, no matter where it appeared. ELMo, on the other hand, understands that the meaning of a word depends on the sentence it is used in.

    For example:

    • In the sentence “She sat by the river bank,” the word “bank” refers to land near a river.
    • In the sentence “He deposited cash into the bank,” the same word “bank” refers to a financial institution.

    ELMo can tell the difference because it looks at the full sentence before deciding what the word means. This makes it far more accurate for understanding text on websites.

    How ELMo Works

    Unlike traditional models that assign a fixed meaning to each word (regardless of where or how it is used), ELMo uses a bidirectional language model. This means:

    • It reads the sentence from left to right and from right to left.
    • It captures how the meaning of each word is influenced by the entire sentence.

    ELMo is built using a type of deep learning model known as a bi-directional LSTM (Long Short-Term Memory network). Here’s how it works at a high level:

    1. Input Layer: ELMo starts by breaking each word into smaller parts and encoding them numerically.
    2. Contextual Encoding: It then processes these word pieces with deep bidirectional LSTMs that read the full sentence. This yields three layers of representation for every word:
      • A base, character-level word representation
      • A representation from the first bidirectional LSTM layer (combining the forward, left-to-right pass with the backward, right-to-left pass)
      • A representation from the second bidirectional LSTM layer
    3. Final Output: These layers are combined to generate a contextual embedding — a high-dimensional vector that reflects the meaning of the word in its specific sentence.

    Because ELMo looks at the surrounding words before forming an understanding of each one, it can detect subtle differences in meaning — making it ideal for comparing complex web content.

    What Are Word and Sentence Embeddings?

    Before a machine learning model can understand language, the language must be translated into numbers. This translation is done through embeddings.

    Word Embeddings

    A word embedding is a numerical representation of a word’s meaning. Think of it as a location on a map, where words with similar meanings are located near each other.

    For example:

    • The word “SEO” will be close to “search engine optimization” or “rankings” in this numerical space.
    • It will be far from unrelated words like “banana” or “snowfall”.

    Sentence and Document Embeddings

    Since we’re working with entire web pages, not just individual words, we need to convert sentences and full documents into embeddings.

    To do this:

    • Each sentence is passed through ELMo to get word-level embeddings.
    • These are then combined (pooled) using techniques like:
      • Mean Pooling: Averages all the word vectors to summarize the overall sentence meaning.
      • Max Pooling: Selects the strongest signals from any part of the sentence, highlighting the most important aspects.
    • These sentence embeddings are stacked to represent an entire page’s content.

    The result is a dense vector representation that captures both the general topic and the specific context of the web page.
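
    To make the pooling step concrete, here is a minimal NumPy sketch. The token matrices are randomly generated stand-ins for real ELMo output, and pool_sentence is a hypothetical helper named for illustration:

    import numpy as np

    def pool_sentence(token_embeddings, mode="mean"):
        """Collapse a (num_tokens, 1024) ELMo matrix into one 1024-d sentence vector."""
        if mode == "mean":
            return token_embeddings.mean(axis=0)  # average meaning of the sentence
        return token_embeddings.max(axis=0)       # strongest per-dimension signals

    # Stand-ins for the token matrices of three sentences on a page.
    sentence_matrices = [np.random.rand(n, 1024) for n in (8, 15, 11)]

    # One pooled vector per sentence, stacked into a page-level matrix.
    page_embedding = np.vstack([pool_sentence(m, "mean") for m in sentence_matrices])
    print(page_embedding.shape)  # (3, 1024)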

    What is Contextual Embedding?

    One of the most important strengths of ELMo is that it produces contextual embeddings. This means the same word will have different vector representations depending on its context.

    Examples:

    • “The bank of the river is beautiful.” -> “bank” relates to nature.
    • “The bank approved the loan.” -> “bank” relates to finance.

    This ability to distinguish context makes ELMo highly accurate for tasks like website content comparison, where keyword repetition alone may not reflect true similarity.
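
    A minimal sketch of this effect, using the same TensorFlow Hub module loaded later in this write-up (the sentence wording and token positions are illustrative):

    import tensorflow as tf
    import tensorflow_hub as hub
    from sklearn.metrics.pairwise import cosine_similarity

    elmo = hub.load("https://tfhub.dev/google/elmo/3")

    # "bank" is the sixth whitespace-separated token in both sentences.
    sentences = [
        "she sat by the river bank",
        "he deposited cash into the bank",
    ]
    out = elmo.signatures["default"](tf.constant(sentences))["elmo"].numpy()

    bank_nature = out[0, 5]   # "bank" near "river"
    bank_finance = out[1, 5]  # "bank" near "deposited cash"

    # A static embedding would make these two vectors identical; ELMo does not.
    score = cosine_similarity([bank_nature], [bank_finance])[0, 0]
    print(f"similarity between the two 'bank' vectors: {score:.3f}")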

    How Web Page Similarity is Calculated

    Once the web pages are converted into embeddings, the next step is to compare them to determine how similar the content is. This is done through several methods:

    1. Cosine Similarity: A standard technique that measures the cosine of the angle between two vectors. A value closer to 1 means high similarity; a value closer to 0 means dissimilar content. It checks whether two vectors (representing web pages) point in the same direction.
    2. Similarity Matrix: Instead of comparing the two pages only as wholes, each sentence from one page is compared with each sentence from the other. This creates a matrix of scores showing how every sentence in one document aligns with the sentences in the other.
    3. Enhanced Scoring Strategy: To make the similarity calculation more reliable, multiple strategies are combined:
      • Max-over-row/column alignment: Captures the best sentence-to-sentence matches.
      • Diagonal similarity: Useful when documents are structured similarly.
      • Median similarity: Adds robustness by ignoring extreme highs and lows.

    These scores are averaged into a final similarity score, providing a more accurate picture of how related the two pages are in terms of meaning.
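
    As a sketch of the first two steps, the sentence-to-sentence matrix and the max-over-row/column alignments can be computed as follows (the embedding matrices here are random placeholders):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder (num_sentences, 1024) matrices of pooled sentence embeddings.
    page_a = np.random.rand(5, 1024)
    page_b = np.random.rand(7, 1024)

    # Entry [i, j] compares sentence i of page A with sentence j of page B.
    sim_matrix = cosine_similarity(page_a, page_b)

    # Best match for each sentence of A in B, and vice versa.
    max_over_rows = sim_matrix.max(axis=1).mean()
    max_over_cols = sim_matrix.max(axis=0).mean()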

    Why is it important to understand content context and not just keywords?

    Search engines no longer match queries with keywords alone. They try to understand what the user actually wants — the intent behind the query. This project mimics that behavior using contextual analysis. It ensures that content:

    • Matches real user intent
    • Uses keywords in the right context
    • Communicates clearly to both users and search engines

    Why This Matters for SEO

    Accurate content understanding is essential for search engine optimization. Using ELMo and similarity scoring helps address several SEO challenges:

    • Duplicate Detection: Identify pages with overlapping content.
    • Topic Coverage: See which topics are well-covered and where content gaps exist.
    • Keyword Context Relevance: Check if the content genuinely reflects the keyword’s intent.
    • Content Differentiation: Ensure pages target different user queries or buyer intents, avoiding cannibalization.

    This makes the technology useful for both content audits and strategic planning across websites.

    What is the significance of this project in the SEO domain?

    This project introduces a modern, intelligent way of analyzing web content. Instead of relying on surface-level keywords, it looks at the true meaning of content, using advanced language models like ELMo. This reflects how search engines like Google now interpret content — based on intent and context, not just exact word matches. As search engine algorithms become smarter, the tools used for SEO must evolve. This project helps bridge that gap.

    How does this project benefit website owners practically?

    Website owners gain a clear, measurable understanding of:

    • Which pages are too similar, potentially hurting SEO performance due to duplicate or overlapping content.
    • Which keywords are truly relevant to the content, helping prioritize high-quality optimization.
    • Where content gaps exist, revealing areas where new content can be created to target untapped search opportunities.
    • How each page compares to others internally or against competitors, improving overall content strategy.

    TensorFlow and TensorFlow Hub (tensorflow, tensorflow_hub)

    • Purpose: Used to load and run the ELMo language model, which is responsible for understanding the meaning of text in context.
    • Why it matters: ELMo is a pre-trained deep learning model that requires a robust framework to run. TensorFlow provides the foundation for handling these computations efficiently.
    • How it helps: Enables the system to generate contextual embeddings, which are numerical representations of sentences or words based on their actual meaning.

    NumPy (numpy)

    • Purpose: Used for mathematical operations and array manipulation.
    • Why it matters: Much of the content comparison involves large numerical matrices (embeddings). NumPy helps manage and compute over these efficiently.
    • How it helps: Makes it easy to perform operations like averaging, reshaping, or stacking vectors during similarity analysis.

    BeautifulSoup (bs4) and Requests (requests)

    • Purpose: These libraries are used together to extract content from web pages.
      • requests retrieves the HTML content of a page.
      • BeautifulSoup parses that HTML to extract only the meaningful elements like headings, paragraphs, and list items.
    • Why it matters: Raw web pages include ads, navigation menus, and other non-essential content. These tools help filter out the noise and focus on the actual written content.
    • How it helps: Allows for clean and structured extraction of text that is relevant for SEO evaluation.

    Scikit-learn (sklearn.metrics.pairwise.cosine_similarity)

    • Purpose: Calculates cosine similarity, a common method for comparing two sets of text based on their meaning.
    • Why it matters: Cosine similarity provides a numerical score representing how similar or different two pieces of content are.
    • How it helps: Enables the project to compare content from different URLs or keywords and generate similarity scores that guide SEO decisions.

    NLTK (nltk, sent_tokenize)

    • Purpose: NLTK (Natural Language Toolkit) is used for text processing, specifically for sentence tokenization — breaking down long paragraphs into individual sentences.
    • Why it matters: Sentence-level processing ensures finer-grained and more accurate analysis when generating contextual embeddings.
    • How it helps: Makes the embedding process more precise by feeding smaller, meaningful units of text to the language model.

    NLTK Downloads (nltk.download('punkt'), nltk.download('punkt_tab'))

    • Purpose: Downloads required resources for sentence tokenization to work.
    • Why it matters: These models are pre-trained patterns that allow NLTK to understand where sentences begin and end.
    • How it helps: Ensures smooth and accurate sentence splitting during text processing.
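
    Taken together, the setup for all of the libraries above amounts to a few imports and two one-time downloads:

    import tensorflow as tf            # runs the ELMo computations
    import tensorflow_hub as hub       # loads the pre-trained ELMo model
    import numpy as np                 # array math over embeddings
    import requests                    # fetches page HTML
    from bs4 import BeautifulSoup      # parses and filters the HTML
    from sklearn.metrics.pairwise import cosine_similarity
    import nltk
    from nltk.tokenize import sent_tokenize

    nltk.download("punkt")
    nltk.download("punkt_tab")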

    Function: extract_text(url)

    Purpose

    Extracts clean, meaningful content from a webpage by removing unnecessary HTML elements like scripts, menus, and popups. Focuses only on the text that’s relevant for SEO analysis, such as headings, paragraphs, and lists.

    How It Works

    Fetches Webpage Content

    response = requests.get(url, timeout=10)

    • Sends a request to the URL and retrieves its HTML using requests.
    • Sets a timeout to avoid delays from slow websites.

    Parses HTML Structure

    soup = BeautifulSoup(response.text, "html.parser")

    • Uses BeautifulSoup to convert raw HTML into a searchable structure.

    Removes Irrelevant Elements

    • Deletes non-content tags like <script>, <style>, <nav>, and others that don’t add value for content analysis.
    • Also removes divs with common layout-related classes like footer, navbar, or popup.

    Extracts Core Content

    • Keeps only important tags: <p>, <h1>, <h2>, <h3>, <li>.
    • Joins the text from these elements into a single clean string.

    Why It’s Important

    • Ensures only relevant on-page content is analyzed.
    • Removes noise that could distort SEO metrics.
    • Helps deliver accurate similarity scores and keyword insights.
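
    Putting these steps together, a sketch of extract_text might look like this (the exact tag and class lists are assumptions based on the description above, not the project's verbatim code):

    def extract_text(url):
        """Fetch a page and keep only the SEO-relevant text."""
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")

        # Delete non-content tags.
        for tag in soup(["script", "style", "nav", "header", "footer", "form", "aside"]):
            tag.decompose()

        # Delete divs with common layout-related class names.
        for div in soup.find_all("div"):
            if div.decomposed:
                continue  # already removed inside a deleted parent
            classes = " ".join(div.get("class") or []).lower()
            if any(key in classes for key in ("footer", "navbar", "popup")):
                div.decompose()

        # Keep only the core content tags and join their text.
        parts = [el.get_text(" ", strip=True)
                 for el in soup.find_all(["p", "h1", "h2", "h3", "li"])]
        return " ".join(p for p in parts if p)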

    Function: preprocess_text(raw_text)

    Purpose

    Cleans and filters the raw webpage content to prepare it for accurate embedding and similarity analysis. Helps focus only on meaningful and information-rich sentences.

    How It Works

    Converts Text to Lowercase

    • Normalizes the text to avoid case-related mismatches.

    Splits into Sentences

    • Uses nltk’s sent_tokenize to break down the content into real, natural-language sentences.

    Filters Out Irrelevant Sentences

    • Removes sentences that are too short (fewer than 6 words), which are often uninformative.
    • Excludes sentences containing common marketing phrases, cookie consent messages, or call-to-actions like:
      • “Get started”, “Sign up”, “Cookie”, “Contact us”, “Learn more”, etc.
    • These are typically not useful for analyzing page topics or comparing core content.

    Returns Clean Content

    • Outputs a list of filtered, well-formed sentences ready for embedding and further analysis.

    Why It’s Important

    • Eliminates filler and boilerplate text.
    • Ensures that embeddings reflect real content, not marketing fluff.
    • Improves the quality of similarity scoring and keyword context analysis.
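
    A compact sketch of preprocess_text, with the boilerplate phrase list shown as an assumed, extendable sample:

    STOP_PHRASES = ("get started", "sign up", "cookie", "contact us", "learn more")

    def preprocess_text(raw_text):
        """Lowercase, sentence-split, and filter out short or boilerplate sentences."""
        cleaned = []
        for sentence in sent_tokenize(raw_text.lower()):
            if len(sentence.split()) < 6:  # too short to be informative
                continue
            if any(phrase in sentence for phrase in STOP_PHRASES):
                continue  # marketing or cookie-consent boilerplate
            cleaned.append(sentence)
        return cleaned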

    Loading the ELMo Model

    Purpose

    Loads the pre-trained ELMo (Embeddings from Language Models) model from TensorFlow Hub, making it ready for use in extracting deep, context-aware word embeddings from text.

    Code Breakdown

    elmo = hub.load("https://tfhub.dev/google/elmo/3")

    • hub.load(…): Downloads and loads the model directly from TensorFlow Hub.
    • URL: “https://tfhub.dev/google/elmo/3” points to Google’s official ELMo model (version 3).
    • Model Type: This version of ELMo returns contextual word representations for each token (word) in a sentence, useful for tasks like content similarity, semantic comparison, or keyword matching.

    elmo

    • Output: <tensorflow.python.trackable.autotrackable.AutoTrackable at 0x7fafd6566350>
    • This confirms that the model is successfully loaded and ready to generate embeddings.

    Unlike traditional word vectors (like Word2Vec or GloVe), ELMo dynamically understands words based on their context in the sentence.

    It captures both the syntax and semantic meaning of content, which is crucial for analyzing web pages, understanding keyword relevance, and identifying similarities or gaps in content.

    Function get_elmo_embeddings

    Purpose

    Generates contextual word embeddings using ELMo for a list of sentences. These embeddings help capture the meaning and context of each word for accurate content comparison.

    How It Works

    if isinstance(sentences, str):
        sentences = [sentences]

    • Ensures consistent handling whether input is a single string or a list.

    elmo_output = elmo.signatures["default"](tf.constant(batch))["elmo"]
    batch_embeddings = elmo_output.numpy()

    • Uses the ELMo model to generate token-level embeddings (1024-dimensional vectors per word).

    all_embeddings.append(sent_emb)

    • Stores embeddings for each sentence.

    full_text_embedding = np.concatenate(all_embeddings, axis=0)

    • Combines all word-level vectors into one array representing the full text.

    Why It’s Useful

    Provides a deep understanding of content meaning, enabling accurate similarity scoring between different webpages.
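
    Consolidating the fragments above, the whole function might read as follows; the batch size and the padding-trimming detail are assumptions, since ELMo pads every sentence in a batch to the same token length:

    def get_elmo_embeddings(sentences, batch_size=8):
        """Return a (num_tokens, 1024) matrix of contextual vectors for the input text."""
        if isinstance(sentences, str):
            sentences = [sentences]

        all_embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            # Token-level embeddings: shape (batch, max_tokens, 1024).
            elmo_output = elmo.signatures["default"](tf.constant(batch))["elmo"]
            batch_embeddings = elmo_output.numpy()
            for sent, sent_emb in zip(batch, batch_embeddings):
                n_tokens = len(sent.split())
                all_embeddings.append(sent_emb[:n_tokens])  # drop padded rows

        # One array of word vectors representing the full text.
        return np.concatenate(all_embeddings, axis=0)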

    ELMo Embedding Extraction (Text Representation)

    Purpose

    Transforms the processed text from each webpage into numerical vectors using the ELMo model. These vectors capture the meaning of each word based on its context.

    Code Overview

    text1_embedding = get_elmo_embeddings(text1)
    text2_embedding = get_elmo_embeddings(text2)

    • text1 and text2: lists of cleaned sentences from the two webpages.
    • get_elmo_embeddings(): converts each word in the text into a 1024-dimensional vector, taking its usage in context into account.

    Output Example

    The output is a 2D NumPy array where:

    • Each row is a token (word) in the text.
    • Each column is a value from the 1024-dimensional embedding.
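
    A quick way to inspect the array (the shape shown is illustrative):

    text1_embedding = get_elmo_embeddings(text1)
    print(text1_embedding.shape)    # e.g. (842, 1024): one row per token
    print(text1_embedding[0][:5])   # first five of the 1024 features for token 0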

    What Are These Numbers?

    Each row in the array corresponds to a word or token in the webpage text. Each column (there are 1024 in total) represents a numerical feature that captures a specific aspect of the word’s meaning in context. This is known as a 1024-dimensional embedding.

    These numbers are not randomly assigned — they are learned features that allow the model to represent complex language patterns like:

    • The difference between “bank” as a financial institution and “bank” as a riverbank.
    • How the meaning of “marketing” changes when used near “SEO”, “ads”, or “email”.
    • Relationships between ideas such as “traffic growth”, “visibility”, and “search ranking”.

    Why This Matters

    These embeddings are the foundation for comparing two texts. They carry context-aware representations of each word, which helps in capturing deeper semantic similarity beyond basic word matching.

    Similarity Score Calculation

    Purpose

    Calculates a numerical similarity score between two web pages by comparing their ELMo embeddings, which represent the contextual meaning of words and sentences.

    How It Works

    Cosine Similarity Matrix

    • Compares each token from one page to every token in the other.
    • Forms a matrix that captures how semantically close each pair of words is.

    Multiple Aggregation Techniques

    • Diagonal Mean: Measures aligned content similarity (e.g., sentence-to-sentence).
    • Max Over Rows & Columns: Captures the strongest matching parts.
    • Mean & Median: Represent overall similarity across the content.

    Weighted Combination

    • Final score combines all components with specific weights to emphasize strong alignment while maintaining overall balance.
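
    A sketch of this combination is below; the weights are illustrative assumptions that emphasize strong alignment, not the project's exact values:

    def similarity_score(emb_a, emb_b):
        """Blend several views of the cosine-similarity matrix into one score."""
        sim = cosine_similarity(emb_a, emb_b)

        n = min(sim.shape)
        diagonal_mean = np.diag(sim[:n, :n]).mean()  # aligned content
        max_rows = sim.max(axis=1).mean()            # strongest matches, A to B
        max_cols = sim.max(axis=0).mean()            # strongest matches, B to A
        overall_mean = sim.mean()
        overall_median = np.median(sim)

        return float(0.3 * max_rows + 0.3 * max_cols + 0.2 * diagonal_mean
                     + 0.1 * overall_mean + 0.1 * overall_median)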

    Interpretation

    • 0.9 and higher -> Very similar (e.g., duplicated or strongly related pages)
    • between 0.5 and 0.8 -> Moderately similar
    • less than 0.5 -> Mostly different content

    This helps in identifying:

    • Content redundancy
    • Competitive content overlap
    • Gaps in topic coverage

    Page Similarity

    In this section, the content similarity between various web pages on the same or different websites is assessed. The purpose of this comparison is to measure how closely the content of two pages aligns in terms of contextual meaning, which can offer valuable insights into content overlap, optimization opportunities, and page categorization for SEO.

    How the Similarity Score Works:

    • Similarity scores range from 0 to 1, where:
      • Higher scores indicate greater similarity in content between two pages, meaning the topics, language, and themes are closely related.
      • Lower scores suggest that the content on the two pages is less similar and may focus on different themes or target different audiences.

    Interpreting the Scores:

    • Scores above 0.85: A very high similarity score indicates that the two pages are almost identical in content, suggesting they may target the same audience or serve similar purposes. For example, a score of 0.9111 when comparing a page to itself confirms near-perfect similarity.
    • Scores around 0.55 to 0.80: This range indicates moderate similarity, meaning the pages share some common topics or themes but may focus on different aspects. For example, a score of 0.5844 between a “reseller SEO services” page and a “business intelligence services” page shows fair similarity in content, likely sharing some overlapping terms or concepts but also significant differences.
    • Scores below 0.55: Scores below this threshold indicate low similarity, meaning the pages focus on largely different topics or target different keywords, with limited content overlap. Scores sitting just at the edge of this range, such as the 0.5790 between “reseller SEO services” and “branding press release services”, indicate only mild similarity; the pages likely address distinct aspects of SEO services or related fields.

    Example Breakdown:

    • High Similarity (around 0.85 and above): A high similarity score (e.g., 0.9111) reflects near-identical content, typically when comparing a page to itself or with a very closely related topic.
    • Moderate Similarity (0.55 to 0.80): Scores within this range, such as 0.5844 or 0.5790, suggest that two pages share some common themes or language but differ in the specifics of the content. These pages may cater to related topics or complementary services.
    • Low Similarity (below 0.55): Low similarity scores indicate a notable difference between the two pages’ focus areas. For instance, comparing “business intelligence services” with “branding press release services” would yield a low score, suggesting the pages focus on distinct topics and may require separate optimization strategies.

    Practical Insight:

    The page similarity scores provide valuable insights into the degree of overlap between different pages on a website.

    • Pages with high similarity could benefit from more specific or unique content to avoid potential duplicate content issues.
    • Pages with moderate similarity might indicate an opportunity for better differentiation or clearer categorization of the topics for improved user experience and SEO.
    • Pages with low similarity likely represent distinct content areas, and it would be beneficial to ensure these pages are optimized for separate keyword strategies to avoid content cannibalization.

    This approach helps website owners and SEO managers ensure that content is well-organized, unique, and relevant to their target audience, optimizing their SEO strategy for better visibility and ranking.

    How does the page-to-page similarity score help in content planning?

    This score shows how similar the content is between two web pages, based on the meaning of the words, not just exact matches. It’s a valuable tool for identifying:

    • Content duplication: If two pages have very high similarity scores, they may be unintentionally competing for the same rankings (keyword cannibalization).
    • Content clustering opportunities: Moderate similarity might suggest two pages could be linked together in a hub-and-spoke strategy.
    • Gap detection: Low similarity among pages that are supposed to serve related purposes might mean one or both need content improvements to align them more closely.

    With this, clients can make informed decisions like:

    • Merging or redirecting similar pages.
    • Enhancing one page to become the “pillar” piece and reworking the others as supportive content.
    • Identifying weak or redundant content during audits.

    This keeps the website lean, purposeful, and search-engine friendly.

    Keyword Context Relevance

    In this project, keyword context relevance is assessed by comparing how well the semantic meaning of a keyword matches the content of a webpage. This is achieved by converting both the content and the keyword into deep contextual embeddings using ELMo, which captures the nuanced meanings of words based on their surrounding context.

    This section of the project assesses how well a particular keyword fits semantically and contextually within a webpage’s content. Unlike traditional methods that simply check for keyword appearances, this approach evaluates whether the content genuinely discusses the topic represented by the keyword.
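
    Reusing the sketches above, a minimal version of this check could mean-pool the keyword and page embeddings into single vectors and compare them with cosine similarity:

    def keyword_relevance(keyword, page_sentences):
        """Score how well a keyword fits a page's content (a sketch, not the exact code)."""
        keyword_vec = get_elmo_embeddings(keyword).mean(axis=0, keepdims=True)
        page_vec = get_elmo_embeddings(page_sentences).mean(axis=0, keepdims=True)
        return float(cosine_similarity(keyword_vec, page_vec)[0, 0])

    # Hypothetical usage:
    # score = keyword_relevance("seo services", preprocess_text(extract_text(url)))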

    What the Score Means:

    • The score ranges from 0 to 1, where:

    • Higher scores indicate stronger contextual alignment between the keyword and the page content.
    • Lower scores suggest weaker relevance, meaning the keyword is less connected to the content or might not be the central focus of the page.

    Interpreting the Scores:

    • Moderate scores (0.35–0.45): A score within this range suggests that the keyword is relevantly integrated into the page, but might not be the primary topic. The content is somewhat aligned with the keyword.
    • Lower scores (below 0.35): When the score is on the lower end, it typically indicates that the content may not be strongly aligned with the keyword. This suggests potential gaps in content optimization, where the keyword could be more deeply integrated or the focus of the page could be better tailored to the keyword.
    • Higher scores (0.45 and above): These scores indicate a strong contextual match, suggesting that the keyword is a central part of the content. The page is well-optimized for the keyword and the content strongly supports the keyword’s meaning and intent.

    Example Breakdown:

    • Moderate scores (0.35–0.45): If a keyword such as “SEO services” shows a score within this range, it suggests that the content includes the keyword but may not be fully optimized for it. It’s a fairly relevant match.
    • Lower scores (below 0.35): For a webpage focusing on “SEO” but not well aligned with the keyword, the score might be lower. This indicates the content could benefit from improvements to better align with the keyword.
    • Higher scores (above 0.45): A high score here would indicate that the keyword is highly relevant and closely matched with the content, suggesting the page is well-optimized for that particular keyword.

    Practical Insight:

    The purpose of this analysis is to highlight how closely a specific keyword is related to the main content of the page. By reviewing these scores:

    • Moderate to high scores typically indicate that the keyword is well-integrated into the content, possibly needing minor adjustments for even stronger relevance.
    • Lower scores signal the need for content optimization, ensuring the keyword is more naturally embedded in the content to boost its SEO performance.

    How do the keyword relevance scores help us?

    Keyword relevance scores indicate how well each page’s content aligns contextually with specific SEO keywords. A higher score means the content naturally reflects the intent and vocabulary associated with that keyword — signaling that the page is already well-optimized or at least on the right track.

    If a page scores consistently low across important keywords, that’s a red flag. It means:

    • The keyword may not appear naturally in the content.
    • The surrounding sentences may lack the contextual framing search engines expect.
    • The page might be targeting the wrong topic or audience altogether.

    This allows clients to take specific next steps:

    • Refine content to include semantically related phrases.
    • Reposition the content to target better-aligned keywords.
    • Create new content for keywords that don’t fit the current structure.

    Ultimately, it provides a content performance snapshot at the keyword level, helping prioritize where effort is most needed.

    What should we do after getting these scores?

    Once the keyword and similarity scores are available, they become a diagnostic tool for ongoing content strategy. Here’s a structured way to act on them:

    1. Audit Content:
      • Use similarity scores to identify pages that are too alike, and consider merging, de-duplicating, or differentiating them.
      • Use keyword scores to find content that is underperforming for its intended SEO goals.
    2. Prioritize Optimization:
      • Pages with mid-range scores (e.g., moderate similarity or keyword relevance) are often low-hanging fruit; small changes can result in noticeable ranking improvements.
    3. Strategically Expand Content:
      • For keywords with low scores across all pages, consider creating new, targeted content rather than forcing them into unrelated existing pages.
    4. Guide Internal Linking:
      • Similarity scores can guide smarter internal link structures; link pages that are semantically close to strengthen SEO authority on that topic.
    5. Content Calendar Planning:
      • This analysis feeds directly into content planning: what to improve, what to retire, and what to build next. It helps reduce guesswork.

    In short, the scores allow the client to go from “we need to improve SEO” to specific, measurable actions that align with both search intent and user experience.

    Final Thoughts

    This analysis bridges advanced language modeling with practical SEO insights, helping transform raw website content into actionable intelligence. By using ELMo-based semantic embeddings, the project surfaces how closely each page aligns with target keywords and how similar different pages are in meaning and focus.

    The keyword-to-page relevance scores reveal how well individual pages are optimized for specific search intents. The page-to-page similarity scores uncover potential overlaps, content gaps, or opportunities for stronger internal linking. Together, these scores provide a multi-dimensional understanding of your content’s current strengths and weaknesses.

    More importantly, the value of this analysis lies not only in the data itself, but in what it enables next. Whether refining content for clarity, reassigning keyword focus, creating new pages, or resolving internal competition — these insights offer a clear, measurable path toward improving search visibility and user experience.

    As your site continues to grow, revisiting this process periodically will help ensure that your content remains relevant, targeted, and strategically structured for both users and search engines.


    Tuhin Banik

    Thatware | Founder & CEO

    Tuhin is recognized across the globe for his vision to revolutionize the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, as well as the India Business Awards and the India Technology Award; was named among the top 100 influential tech leaders by Analytics Insights and a Clutch Global Front Runner in digital marketing; founded the fastest-growing company in Asia according to The CEO Magazine; and is a TEDx and BrightonSEO speaker.

