ALBERT (A Lite BERT): A smaller, more efficient variant of BERT

    This project demonstrates how a powerful language model called ALBERT (A Lite BERT) can be used to detect content similarity across multiple websites, which is highly valuable for SEO-focused tasks. The goal of the project is to identify how similar the textual content is between different web pages, especially across multiple websites of the same company or among competitors.

    Purpose of the Project

    The primary purpose of this project is to measure the degree of similarity between web pages using a pre-trained ALBERT model without requiring custom training or complex configurations. This approach enables scalable and fast evaluation of content across multiple domains. It helps in making informed decisions about content uniqueness, duplication, or overlap between web pages — either within the same organization or among competitors.

    By using ALBERT embeddings and cosine similarity, this project simplifies a complex Natural Language Processing task into an accessible and insightful tool for clients focused on maintaining strong SEO health and content quality.

    Model Used: ALBERT (A Lite BERT)

    This project uses a powerful natural language processing (NLP) model called ALBERT, which stands for A Lite BERT. ALBERT is a smaller and more efficient version of BERT (Bidirectional Encoder Representations from Transformers), one of the most influential language models in modern AI.

    What is ALBERT?

    ALBERT is a pre-trained language model designed to understand and process human language. It has been trained on a massive amount of English text from books, articles, and websites. Its main job is to convert text into numerical representations (called embeddings) that capture the meaning and context of the text.

    The version used in this project is:

    paraphrase-albert-small-v2, a smaller, fine-tuned model that specializes in detecting whether two pieces of text are semantically similar — i.e., they mean the same thing, even if the words are different.

    How ALBERT Works

    Here’s a simplified breakdown of how ALBERT processes text:

    • Tokenization: First, the input text is split into smaller pieces called “tokens” (like words or sub-words).
    • Embedding: ALBERT converts those tokens into high-dimensional numeric vectors.
    • Contextual Understanding: Using a Transformer architecture, ALBERT looks at the entire sentence at once and understands the meaning of each word in context.
    • Output: The final output is an embedding — a smart vector that represents the full meaning of the input sentence or paragraph.

    These embeddings are used to compare how similar two different texts are, regardless of the specific words used.

    What is Similarity in This Project?

    In this project, similarity refers to how semantically close two pieces of text are — meaning, how similar their meanings are, even if the wording is different.

    For example:

    “Buy affordable used cars in London”

    and

    “Get budget second-hand vehicles in London”

    These are different in words, but very similar in meaning.

    That’s where semantic similarity comes in — and ALBERT helps us measure this.

    How Similarity is Calculated

    Once ALBERT converts each piece of text into a vector embedding (a numerical representation of its meaning), these embeddings are compared using a mathematical formula to produce a similarity score.

    In this project, the Cosine Similarity method is used.

    Cosine Similarity

    Cosine Similarity measures the angle between two vectors (i.e., the text embeddings) in a high-dimensional space.

    If the angle is small (the vectors are close), the similarity is high and the texts are very similar.

    If the angle is large, the similarity is low and the texts are different.

    The formula is:

    Cosine Similarity = (A · B) / (||A|| × ||B||)

    Where:

    • A ⋅ B = Dot product of the two vectors
    • ||A|| and ||B|| = Norm (or length) of each vector

    The result is a number between -1 and 1, but in most NLP use cases it falls between 0 and 1, where:

    • 1 = Identical in meaning
    • 0.8 – 1 = Highly similar
    • 0.5 – 0.8 = Somewhat similar
    • Below 0.5 = Low similarity
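
    As a toy illustration of this formula (the vectors below are made up for clarity; the project's real embeddings have 768 dimensions), the score can be computed directly from the dot product and the vector norms:

    import math

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))        # A . B
        norm_a = math.sqrt(sum(x * x for x in a))     # ||A||
        norm_b = math.sqrt(sum(x * x for x in b))     # ||B||
        return dot / (norm_a * norm_b)

    print(cosine_similarity([1, 2, 3], [2, 4, 6]))    # 1.0: vectors point in the same direction
    print(cosine_similarity([1, 0, 0], [0, 1, 0]))    # 0.0: unrelated directions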

    Libraries Used in the Project

    requests

    Purpose: Makes HTTP requests to websites and fetches their HTML content.

    Used to load web pages by URL.

    It acts like a browser requesting the website’s data.

    Used for: Accessing raw webpage content

    re (Regular Expressions)

    Purpose: Used for pattern matching and text cleaning.

    Helps in removing unwanted characters from text (like punctuation, symbols, etc.).

    Allows us to clean and simplify the website content before further processing.

    Used for: Text preprocessing and cleaning

    BeautifulSoup from bs4

    Purpose: Parses HTML and extracts readable text.

    Converts raw HTML into usable content by pulling out paragraphs (<p> tags).

    Helps ignore navigation menus, buttons, and other non-text elements.

    Used for: Extracting clean text from websites

    sentence_transformers

    Purpose: Makes it easy to use pre-trained transformer models (like ALBERT) for sentence and text embeddings.

    Used SentenceTransformer to load a fine-tuned ALBERT model.

    It converts entire blocks of text into numerical embeddings (vectors) that represent their meaning.

    Used for: Generating semantic embeddings of content

    nltk (Natural Language Toolkit)

    Purpose: A toolkit for working with human language data (text).

    Used stopwords from nltk.corpus.

    Stopwords are common words (like “and”, “the”, “in”) that are often removed because they don’t carry meaningful content.

    Used for: Cleaning the text by removing non-informative words

    torch (PyTorch)

    Purpose: A deep learning framework used to handle tensors and math operations.

    ALBERT embeddings are in the form of PyTorch tensors.

    Used torch.nn.functional.cosine_similarity() to calculate the similarity between embeddings.

    Used for: Handling tensor data and computing similarity

    These lines are part of the text preprocessing stage (a short sketch of them follows the list below). They help in removing stopwords, which are very common words in English that usually do not add meaningful value to text analysis — such as “the”, “is”, “at”, “which”, and “on”.

    • nltk.download(‘stopwords’)
      • This command ensures that the NLTK stopword list is downloaded.
      • NLTK comes with built-in lists of stopwords for different languages.
      • These lists are not bundled by default, so they must be downloaded explicitly.
    • stop_words = set(stopwords.words(‘english’))
      • This creates a Python set containing all English stopwords.
      • A set is used because it provides faster lookups when filtering words.
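
    Put together, the two lines above look roughly like this (a minimal sketch; it assumes nltk is installed):

    import nltk
    from nltk.corpus import stopwords

    nltk.download('stopwords')                      # fetch NLTK's built-in stopword lists
    stop_words = set(stopwords.words('english'))    # a set allows fast lookups while filtering

    print('the' in stop_words)                      # True: "the" carries no topical meaning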

    This step performs web scraping to collect raw text content from a webpage.

    First, a website URL is defined with url = “https://thatware.co/best-backlink-opportunities-identification-using-cuckoo-algorithm/”. To avoid being blocked by the server, a custom browser-like header is set using headers = {‘User-Agent’: ‘Mozilla/5.0’}. This header helps mimic a real user browsing the page.

    An HTTP GET request is then sent using response = requests.get(url, headers=headers). If the status code of the response is “200”, which means the page was successfully retrieved, the HTML content is parsed using BeautifulSoup: soup = BeautifulSoup(response.text, ‘html.parser’).

    Next, all paragraph elements (<p>) are collected from the parsed HTML using paragraphs = soup.find_all(‘p’). The visible text inside each paragraph is extracted and combined using ‘ ‘.join([p.get_text() for p in paragraphs]), resulting in a single string of raw text from the page.

    If the status code is not “200”, the code sets raw_text = “” and prints a message: “Failed to fetch the webpage.”.
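
    The scraping step described above can be sketched as follows (the URL is the one quoted in the write-up; the exact script may differ slightly from the project's code):

    import requests
    from bs4 import BeautifulSoup

    url = "https://thatware.co/best-backlink-opportunities-identification-using-cuckoo-algorithm/"
    headers = {'User-Agent': 'Mozilla/5.0'}          # mimic a real browser to avoid being blocked

    response = requests.get(url, headers=headers)
    if response.status_code == 200:                  # 200 means the page was retrieved successfully
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')              # visible content usually lives in <p> tags
        raw_text = ' '.join([p.get_text() for p in paragraphs])
    else:
        raw_text = ""
        print("Failed to fetch the webpage.")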

    This scraped text becomes the foundation for all further processing, such as cleaning, embedding, and comparing content. High-quality extraction ensures meaningful analysis during similarity detection.

    Preprocessing the Text

    raw_text = raw_text.lower() The entire text is converted to lowercase to ensure consistency. This helps in treating words like “SEO” and “seo” as the same during further processing.

    raw_text = re.sub(r'[^a-z0-9\s]', '', raw_text) A regular expression is used to remove all special characters, punctuation, and symbols from the text. Only lowercase letters, numbers, and spaces are retained.

    words = raw_text.split() The cleaned string is split into individual words based on spaces. This step transforms the text into a list of words for easier filtering.

    cleaned_text = ' '.join([word for word in words if word not in stop_words]) Common English stopwords (like “the”, “and”, “is”, etc.) are removed from the text. These words do not carry much meaning in analysis and are filtered out using NLTK’s stopword list. The remaining words are joined back into a single string.
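
    In code, the four preprocessing steps above look roughly like this (continuing from the raw_text and stop_words defined earlier):

    import re

    raw_text = raw_text.lower()                                   # 1. normalize case
    raw_text = re.sub(r'[^a-z0-9\s]', '', raw_text)               # 2. keep only letters, digits, and spaces
    words = raw_text.split()                                      # 3. split into individual words
    cleaned_text = ' '.join([w for w in words if w not in stop_words])   # 4. drop stopwords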

    This preprocessing ensures that the text is clean, normalized, and free from noise, making it more meaningful for the embedding model to process in the next step.

    Load ALBERT-based Sentence Transformer Model

    model = SentenceTransformer(‘sentence-transformers/paraphrase-albert-small-v2’) This line loads a pre-trained SentenceTransformer model using ALBERT, specifically fine-tuned for understanding sentence meaning. It converts text into numerical vectors (embeddings) that represent the semantic content of the text.

    Model Breakdown

    • Transformer (AlbertModel): Processes input text and captures contextual meaning.
    • Pooling Layer: Averages token embeddings into one fixed-size sentence vector.
    • word_embedding_dimension: 768: Each sentence is converted into a 768-length vector.
    • pooling_mode_mean_tokens: True: Token vectors are averaged to get a clean sentence representation.

    embedding = model.encode(cleaned_text, convert_to_tensor=True) This line feeds the cleaned webpage content into the ALBERT-based model. The model processes the entire text and returns a sentence embedding — a numerical representation of the overall meaning of the input.
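
    A minimal sketch of the loading and encoding steps (it assumes sentence-transformers is installed and cleaned_text holds the preprocessed page content):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('sentence-transformers/paraphrase-albert-small-v2')
    embedding = model.encode(cleaned_text, convert_to_tensor=True)   # one 768-dimensional PyTorch tensor
    print(embedding.shape)                                           # torch.Size([768])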

    What is an Embedding?

    An embedding is a high-dimensional vector (here, 768 dimensions) that captures the semantic essence of the input text. Similar texts will generate similar vectors.

    This step is crucial because these vectors are what will be used in the next phase to calculate text similarity between different webpages.

    This line of code calculates the cosine similarity between the text embedding and itself. Since the comparison is done using the same embedding, the score is always very close to 1.0, which indicates perfect similarity.

    Why Similarity Is 1?

    Because the embedding is compared with itself (both embeddings come from the same website), the result is exactly (or very close to) 1.0, meaning 100% similar. This acts as a baseline to confirm that the process is working correctly.
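
    A sketch of this baseline check, using torch.nn.functional.cosine_similarity as mentioned earlier (the exact call in the project may differ):

    import torch.nn.functional as F

    # Compare the embedding with itself; the function expects batched tensors,
    # so a leading dimension is added with unsqueeze(0).
    score = F.cosine_similarity(embedding.unsqueeze(0), embedding.unsqueeze(0)).item()
    print(f"Similarity Score: {score:.4f}")          # ~1.0000, confirming the pipeline works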

    This function is designed to clean raw web page text before sending it into the model. The process is called text preprocessing, which improves accuracy and consistency when comparing website content.

    text.lower() Converts all letters to lowercase so that “Car” and “car” are treated the same.

    re.sub(r'[^a-z0-9\s]', '', text) Removes anything that isn’t a letter, number, or space—like punctuation and special symbols.

    text.split() Breaks the entire text into individual words (tokens).

    [word for word in words if word not in stop_words] Removes common words like “the”, “is”, “and” that do not contribute meaningfully to the comparison. These are called stopwords and are ignored to focus on important terms.

    return ' '.join(words) Joins the cleaned words back together into a single string for model input.

    This standardized and simplified form of the text ensures better and more reliable results when calculating similarity.
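
    A minimal sketch of this cleaning function, following the steps just described:

    import re

    def preprocess_text(text):
        text = text.lower()                                         # case-fold everything
        text = re.sub(r'[^a-z0-9\s]', '', text)                     # strip punctuation and symbols
        words = text.split()                                        # break into tokens
        words = [word for word in words if word not in stop_words]  # remove stopwords
        return ' '.join(words)                                      # rebuild a single clean string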

    This function is used to automatically visit a webpage and extract its written content, specifically for SEO comparison.

    headers = {‘User-Agent’: ‘Mozilla/5.0’} Imitates a browser visit to avoid being blocked by the website.

    requests.get(url, headers=headers) Sends a request to open the webpage.

    if response.status_code != 200: If the page doesn’t open successfully (e.g., error 404), the function returns None and skips processing.

    soup = BeautifulSoup(response.text, ‘html.parser’) Uses BeautifulSoup, a web scraping tool, to understand the page structure and extract readable text.

    paragraphs = soup.find_all('p') Gathers all paragraph elements from the webpage (most readable content is inside <p> tags).

    ' '.join([p.get_text() for p in paragraphs]) Extracts and joins the paragraph text into one long string.

    return preprocess_text(text_content) Sends the extracted text through the previously defined cleaning function to make it suitable for comparison.

    This function ensures that only the meaningful and clean content of a website is collected for analysis, excluding menus, ads, and code.
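
    The scraping helper described above can be sketched like this (it reuses preprocess_text from the previous step):

    import requests
    from bs4 import BeautifulSoup

    def scrape_website(url):
        headers = {'User-Agent': 'Mozilla/5.0'}      # imitate a browser visit
        response = requests.get(url, headers=headers)
        if response.status_code != 200:              # page failed to load (e.g. 404)
            return None
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')              # readable content sits in <p> tags
        text_content = ' '.join([p.get_text() for p in paragraphs])
        return preprocess_text(text_content)         # clean the text before comparison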

    This function is responsible for converting the cleaned text into a numerical format that a machine learning model can understand and compare.

    Text Input: The input is the plain cleaned text extracted and processed from a website.

    model.encode(…) This uses the ALBERT-based transformer model to turn the text into an embedding—a numerical representation that captures the meaning of the entire content.

    These embeddings are like fingerprints of the text, where similar meanings lead to similar values.

    convert_to_tensor=True The result is returned as a PyTorch tensor, which is a format that allows mathematical operations like cosine similarity.

    The main goal of this step is to prepare the text so it can be compared mathematically. This way, the content of two different websites can be compared not just by the words used, but by the overall meaning.
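
    A one-line sketch of this embedding helper, using the model loaded earlier:

    def generate_embedding(text):
        # Returns a 768-dimensional PyTorch tensor representing the text's overall meaning.
        return model.encode(text, convert_to_tensor=True)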

    This function measures how similar two pieces of text are by comparing their embeddings using a method called cosine similarity.

    How it works:

    vec1 and vec2 are the numerical representations (embeddings) of two different texts.

    Dot Product (torch.dot) finds how aligned the two vectors are.

    Norm (torch.norm) calculates the size (or magnitude) of each vector.

    The formula divides the dot product by the product of both magnitudes, resulting in a similarity score between -1 and 1.

    • A score close to 1 means the two texts are very similar in meaning.
    • A score close to 0 means the texts are unrelated.
    • A score below 0 means they are opposite in meaning (rare in this context).

    This calculation helps quantify how closely the content of two websites aligns, even if the words used are different.
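
    A sketch of this similarity helper, computing the dot product and norms directly with PyTorch:

    import torch

    def cosine_similarity(vec1, vec2):
        # (vec1 . vec2) / (||vec1|| * ||vec2||), returned as a plain Python float
        return (torch.dot(vec1, vec2) / (torch.norm(vec1) * torch.norm(vec2))).item()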

    This function brings all the earlier pieces together to perform a full comparison between two websites. It handles the entire pipeline from extraction to evaluation:

    scrape_website(url) is used to extract and clean the text content from each of the two URLs.

    preprocess_text(text) ensures the extracted text is filtered and formatted correctly.

    generate_embedding(text) converts the cleaned text into embeddings using the ALBERT model.

    cosine_similarity(vec1, vec2) calculates how similar the two websites are in terms of their textual content.
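
    Wiring the helpers together, the full comparison can be sketched as below. The wrapper name and the second URL are illustrative placeholders, not taken from the project:

    def compare_websites(url1, url2):
        text1 = scrape_website(url1)                 # scrape and clean both pages
        text2 = scrape_website(url2)
        if text1 is None or text2 is None:           # bail out if either page failed to load
            return None
        vec1 = generate_embedding(text1)             # ALBERT embeddings
        vec2 = generate_embedding(text2)
        return cosine_similarity(vec1, vec2)         # semantic similarity score

    score = compare_websites(
        "https://thatware.co/best-backlink-opportunities-identification-using-cuckoo-algorithm/",
        "https://example.com/some-other-seo-article/",   # placeholder second URL
    )
    print("Similarity Score:", score)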

    The code compares the text content of two different websites.

    The calculated Similarity Score is 0.2888

    What This Means:

    The similarity score is approximately 0.29 on a scale from 0 to 1.

    A score of 1.0 means the websites are identical, while a score of 0.0 means they are completely unrelated.

    A score of 0.29 is relatively low, which suggests that the two websites talk about different topics or use very different language, even though both are within the SEO domain.

    This low score is expected, because one website discusses backlink opportunities, while the other discusses algorithm-inspired strategies.

    Why This Is Useful:

    This kind of comparison can help identify:

    • Duplicate content (if score is very high, like 0.95 or more).
    • Content uniqueness (if score is low).
    • Overlap in topics (moderate scores like 0.5–0.7).

    In this case, both websites are original in what they cover, with minimal overlap. This is valuable in SEO because unique content ranks better on search engines and avoids penalties for duplication.

    Analyzing Multiple Websites for Textual Similarity

    In this step, the project was expanded to handle and compare multiple websites at once. The process remains the same in terms of functionality—scraping the page, cleaning the content, converting it into a meaningful numerical format using the ALBERT model, and then calculating similarity between every possible pair of websites. However, the goal here is much broader than comparing just two sites. We’re now identifying relationships and differences across several web pages in a single batch. This allows us to assess whether some content may be overlapping or too similar, or if each page offers its own unique value to users and search engines.
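
    A sketch of this batch comparison: every pair in a list of URLs is scored. The URL list below is a placeholder (only the first address appears in the write-up), and the sketch assumes every page fetches successfully:

    from itertools import combinations

    urls = [
        "https://thatware.co/best-backlink-opportunities-identification-using-cuckoo-algorithm/",
        "https://example.com/swarm-intelligence-seo/",    # placeholder
        "https://example.com/gemini-audio-seo/",          # placeholder
    ]

    # Scrape, clean, and embed each page once, then compare every possible pair.
    embeddings = {url: generate_embedding(scrape_website(url)) for url in urls}

    for url_a, url_b in combinations(urls, 2):
        score = cosine_similarity(embeddings[url_a], embeddings[url_b])
        print(f"{url_a}\n  vs {url_b}\n  Similarity Score: {score:.4f}\n")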

    What Do the Similarity Scores Mean?

    Each pair of websites was compared using a similarity metric called cosine similarity, which calculates how close the meaning of two pieces of text are. The score ranges from 0 (completely different) to 1 (exactly the same). The results were as follows:

    • The first and second websites (Cuckoo Algorithm vs. Swarm Intelligence) had a similarity score of 0.2888. This is a low score, meaning that while both articles touch on AI in SEO, they cover it from very different angles. One is heavily focused on backlinks using algorithmic ranking systems, and the other on collective behavior models applied to broader SEO tasks.

    • The first and third websites (Cuckoo Algorithm vs. Gemini Audio) scored slightly higher at 0.3174. This small increase suggests a little more thematic overlap, possibly due to both pages talking about automation and innovation. However, the formats and objectives of the articles are still very different—one is about link-building, the other about multimedia SEO.

    • The second and third websites (Swarm Intelligence vs. Gemini Audio) scored 0.3117. Again, the similarity is mild. Both are exploring innovative ways to apply technology to SEO, but one is rooted in nature-inspired AI strategies, while the other is focused on transforming how users consume content.

    What It Means for Website Owners

    • These scores provide strong evidence that each article is unique and serves its own purpose within the SEO strategy. There is no content duplication, which is important for both user experience and Google’s search algorithm. From an SEO perspective, this means the content is well-structured and avoids cannibalization—where multiple pages compete for the same keyword or intent.

    • Another benefit is the possibility of identifying internal linking opportunities. Since the topics are connected at a conceptual level but distinct in their focus, they could be cross-linked strategically. For example, a backlink strategy article might link to the swarm intelligence post when discussing advanced AI models. This kind of intelligent internal linking not only improves SEO but also helps users navigate more intuitively through related content.

    • Lastly, this analysis gives confidence to the content team and stakeholders. It shows that even though all the articles are written around SEO and AI, they are well-separated in terms of intent, structure, and application. That’s a strong sign of a diversified content strategy, which is essential for long-term organic growth.

    Understanding Content Similarity Scores

    Low Similarity Scores (Below 0.40)

    • What the score indicates: A similarity score below 0.40 suggests that the compared pages contain largely different content. These pages likely focus on separate topics, use distinct terminology, and are written with different purposes in mind. There is minimal overlap in phrasing, sentence structure, and information presented.
    • Why this is a positive sign: Low similarity is generally desirable when comparing different service pages, product descriptions, or core sections of a website. It indicates that each page has a unique informational role, which helps in aligning with specific user intents. This differentiation supports stronger keyword targeting, reduces the risk of internal competition for search rankings, and enhances the overall user experience.
    • Impact on SEO: Search engines favor content that is unique and purposeful. When pages are clearly distinct, it becomes easier for algorithms to understand their relevance to different queries. This improves indexing accuracy and maximizes the visibility of each page in relevant search results.
    • Recommended approach: No major changes are necessary for pages with low similarity. However, ensure that even distinct pages maintain a consistent tone and branding style across the website.

    Moderate Similarity Scores (0.40 to 0.59)

    • What the score indicates: A score within this range suggests that there are partial similarities in content, structure, or vocabulary between pages. This is often due to shared brand messaging, repeated introductory phrases, or template-based formatting. While the pages are not identical, they may not be sufficiently differentiated to fully justify their separate existence in the eyes of search engines.
    • Why this requires attention: Moderate similarity can lead to reduced content effectiveness. If multiple pages discuss overlapping themes or use the same phrases without offering distinct value, they may fail to rank independently. In some cases, this might dilute topical authority or cause content to compete with itself in the search index.
    • Impact on SEO and user experience: From an SEO standpoint, moderate similarity increases the risk of keyword cannibalization—where multiple pages unintentionally compete for the same terms. From a user perspective, moderately similar pages can seem repetitive, which may discourage deeper exploration of the site.
    • Recommended approach: Audit the similar pages to determine where overlaps occur. Refine the content by:
      • Adding page-specific information or examples.
      • Rewriting shared sections to reflect unique angles.
      • Introducing distinct calls-to-action, headlines, or visual elements to clearly define each page’s role.

    High Similarity Scores (0.60 and Above)

    • What the score indicates: A score above 0.60 signifies significant duplication or near-identical content between pages. The pages may share entire paragraphs, lists, or descriptions, indicating that the content has been copied or adapted with minimal variation. In many cases, this means the pages are functionally redundant.
    • Why this is problematic: High similarity between multiple pages can lead to several issues:
      • Search engines may struggle to determine which page to rank, resulting in reduced visibility for both.
      • Duplicate content penalties may apply, especially if internal duplication is extensive or if the same content appears elsewhere on the web.
      • Users may be confused or frustrated, encountering similar information on multiple pages with different titles or purposes.
    • Impact on site performance: When multiple pages serve the same purpose or deliver nearly identical content, it reduces the overall clarity of the site architecture. This can affect bounce rates, lead generation, and page authority. In extreme cases, search engines might choose to only index one of the similar pages or ignore both entirely.
    • Recommended approach: Immediate action is required. Consider the following steps:
      • Revise or rewrite one or more of the pages to introduce original content, additional details, or differentiated messaging.
      • If the purpose of two pages is too similar, consider merging them into a single, more comprehensive page.
      • Update metadata, internal links, and headings to ensure they reflect the updated structure.

    What Should You Do?

    • Evaluate Content Purpose: Every page on a website should have a distinct goal. Whether it’s educating users, generating leads, or providing support—its content should reflect that objective clearly.
    • Avoid Repetition: While consistency in branding is important, repeating full sections of content across pages reduces uniqueness. This can be resolved by customizing intros, expanding details, or rewriting common elements in different ways.
    • Use Similarity Analysis Regularly: Regularly monitoring similarity scores across important landing pages can help detect issues early and maintain a healthy content structure over time.
    • Focus on Search Intent: Align each page with a specific search intent or keyword cluster. This ensures that even related topics are written in a way that targets different user queries and stages of the buyer journey.

    By applying these best practices, it becomes possible to maintain a clean, effective content architecture that is optimized for both search engines and real users.

    Why is content similarity analysis important for a website?

    Content similarity analysis helps identify duplicate or near-duplicate content across your website. High similarity between pages can negatively affect your search rankings, confuse users, and reduce your site’s effectiveness in targeting different keywords or user intents. Keeping content distinct and purposeful across pages supports better SEO performance and a smoother user experience.

    What’s considered an acceptable similarity score?

    Ideally, core content pages should have similarity scores well below 0.60. Scores under 0.40 are preferred for pages serving different functions (e.g., services vs. contact pages). While some overlap is normal (such as headers or footers), too much similarity in the main content should be avoided.

    What should be done if two pages have very similar content?

    Instead of removing a page outright, it is generally more effective to revise the content so each page serves a distinct and valuable purpose. Alternatively, if both pages cover the same topic and serve the same function, combining them into one comprehensive page may be beneficial. Content should only be removed if it adds no unique value and is not contributing to performance or user engagement.

    Does similar content always hurt SEO?

    Not always. Some level of similarity is expected, especially for branding or structural elements. However, when the main content body of two pages is nearly identical, it can lead to keyword cannibalization or reduced visibility. That’s why we focus only on meaningful similarity—where it counts.

    Final Thoughts

    Content similarity analysis is a foundational step in building a strong, search-optimized, and user-focused website. It ensures that each page has a clearly defined purpose and delivers unique value. High similarity between important pages can hinder performance, dilute keyword effectiveness, and lead to poor user engagement.

    By identifying and addressing these issues proactively:

    • Search engines better understand and index your content.
    • Users enjoy a more informative and meaningful site experience.
    • Your site structure becomes more strategic and easier to manage over time.

    Going forward, it is recommended to integrate regular similarity audits into your content strategy—especially when adding new pages or making significant updates. This will help maintain SEO performance and keep your website aligned with both business goals and user expectations.


    Tuhin Banik

    Thatware | Founder & CEO

    Tuhin is recognized across the globe for his vision to revolutionize the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, as well as the India Business Awards and the India Technology Award, was named among the Top 100 influential tech leaders by Analytics Insights and a Clutch Global front runner in digital marketing, founded the fastest-growing company in Asia according to The CEO Magazine, and is a TEDx and BrightonSEO speaker.

