The project, titled “Named Entity Recognition (NER) Enhanced Ranking — Extracts and ranks pages based on the prominence of named entities,” introduces a content-driven evaluation system that assesses webpages through the lens of named entity analysis. Named entities include identifiable real-world concepts such as people, organizations, locations, products, events, and more.
Rather than relying solely on surface-level SEO indicators like keyword frequency or meta tags, this system uses advanced language models to extract meaningful entities from webpage content. These entities are then analyzed for their variety, frequency, and contextual confidence to derive an entity-based score for each page.
The scoring logic is designed to reflect the semantic depth and topical richness of a webpage, making it a strong signal for ranking pages based on real informational value. This method allows for objective and scalable evaluation across different pages or websites.
Project Purpose
The primary purpose of this project is to help businesses:
- Identify high-value content that demonstrates topical authority through strong use of meaningful named entities.
- Evaluate and compare webpages based on entity prominence rather than surface SEO metrics.
- Uncover gaps where entity usage is weak, indicating areas where content could be improved for better authority and relevance.
- Visualize content quality through informative charts that show the type, density, and diversity of entities present on each page.
This approach provides a deeper, entity-level insight into how well a page covers important concepts in its domain — enabling better strategic decisions for content optimization, content auditing, and competitive benchmarking.
What is Named Entity Recognition (NER)?
Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique that identifies and classifies words or phrases in text into predefined categories. These categories represent real-world objects or abstract concepts, such as names of people, companies, locations, products, and more.
NER models scan through content and automatically highlight important elements that are critical to understanding the meaning and context of a page. This is similar to how a human reader picks out keywords or highlights in an article — except NER does this programmatically at scale.
What types of entities does NER commonly detect?
NER models typically detect a wide range of entities, including but not limited to the following popular categories:
- PERSON: Names of individuals (e.g., Elon Musk, Marie Curie)
- ORG (Organization): Names of companies, institutions, or teams (e.g., Google, Microsoft, World Health Organization)
- GPE (Geopolitical Entity): Countries, cities, or regions (e.g., India, New York, European Union)
- LOC (Location): Physical locations not covered under GPE (e.g., Himalayas, Pacific Ocean)
- PRODUCT: Consumer or industrial products (e.g., iPhone, Tesla Model 3)
The exact set of entities depends on the NER model used. In this project, a state-of-the-art model trained on high-quality, real-world datasets is used to extract accurate and meaningful entities.
Why are named entities important in content evaluation?
Named entities reflect the core informational elements of a page. They indicate what the content is really about — whether it’s a product review, a biography, a business overview, or a legal article.
When a page includes relevant and well-placed entities, it tends to:
- Be topically richer
- Show authority and depth
- Align more closely with user intent
By focusing on named entities, it becomes possible to evaluate content in a meaningful and scalable way, rather than relying on shallow signals like keyword repetition.
What makes NER suitable for analyzing web pages and online content?
NER is well-suited for analyzing digital content because:
- Web pages often contain structured and semi-structured text (e.g., product names, brands, locations).
- Entities appear repeatedly across different pages, allowing for pattern recognition and comparisons.
- It enables language-independent understanding, since many prominent entities are written the same way across languages (e.g., Google, Tesla, COVID-19).
This makes NER a powerful tool for content audits, topical relevance checks, and ranking decisions in SEO and content marketing.
How does NER differ from traditional keyword extraction or tagging?
NER looks for specific real-world references, whereas traditional keyword extraction methods often rely on frequency or statistical co-occurrence without knowing what the word represents.
For example:
- A keyword tool might highlight the word “apple” frequently in a document.
- NER uses context to determine whether “Apple” refers to the company (ORG) or a product line (PRODUCT) — and, when “apple” is just the fruit, typically does not tag it as an entity at all.
This context-aware nature of NER makes it smarter and more accurate, especially in long-form or technical content.
Libraries Used
requests
What it is: A widely used library for sending HTTP requests in Python.
Why it is used: This library is used to fetch raw HTML content from the URLs provided. It plays a key role in accessing the web pages that are later analyzed for named entities.
BeautifulSoup (from bs4)
What it is: A Python library for parsing HTML and XML documents.
Why it is used: After retrieving the raw HTML with requests, BeautifulSoup is used to extract clean and structured text from HTML tags like headings, paragraphs, and list items. This ensures the text passed to the NER model is meaningful and not cluttered by code, ads, or navigation.
re (Regular Expressions)
What it is: A built-in Python module for string matching and pattern searching.
Why it is used: Helps in cleaning and preprocessing the extracted text — such as removing non-text characters, unwanted line breaks, or HTML artifacts before sending it to the NER model.
collections.Counter and collections.defaultdict
What they are: Built-in Python tools for counting and grouping.
Why it is used:
Counter is used to count how often each named entity appears, supporting frequency-based analysis.
defaultdict helps in grouping entities and their metadata (e.g., their types, positions) efficiently while processing multiple blocks of text.
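As a minimal illustration of how these two tools divide the work (the entity list below is a made-up stand-in, not actual project output):

```python
from collections import Counter, defaultdict

# Hypothetical entities as (text, type) pairs -- illustrative only.
entities = [("Google", "ORG"), ("Tesla", "ORG"), ("Berlin", "GPE"), ("Google", "ORG")]

# Counter: how often each entity mention appears.
mention_counts = Counter(text for text, _ in entities)

# defaultdict: group entity texts under their type without key-existence checks.
by_type = defaultdict(list)
for text, etype in entities:
    by_type[etype].append(text)

print(mention_counts["Google"])  # 2
print(sorted(by_type["ORG"]))
```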
numpy
What it is: A core scientific computing library in Python, optimized for working with arrays and numerical operations.
Why it is used: Supports calculations like computing score metrics (e.g., average, ratios) during entity ranking. Ensures fast and efficient numerical handling.
pandas
What it is: A data analysis library used for working with structured data (like tables or spreadsheets).
Why it is used: Provides a clean structure to store and manipulate extracted entities and their scores. Allows easy transformation and sorting before visualizing or exporting the data.
matplotlib.pyplot and seaborn
What they are: Visualization libraries for Python.
matplotlib.pyplot provides foundational plotting tools.
seaborn builds on top of it for better aesthetics and convenience.
Why they are used: These libraries are used to create visual representations of named entity distributions across pages. The charts help clients visually understand which entities are prominent and how content varies by topic or type.
flair
What it is: A powerful natural language processing (NLP) library developed by Zalando Research.
Why it is used: This library provides access to pretrained NER models, including models trained on high-quality datasets like OntoNotes. In this project, flair is used to perform Named Entity Recognition — the core function that extracts real-world entities (e.g., people, organizations, locations) from the web page content.
Sentence (from flair.data): Helps convert raw text into a format that can be processed by the NER model.
SequenceTagger (from flair.models): Loads the pretrained model used for tagging named entities within the text.
Function: extract_text(url)
Function Purpose:
This function is responsible for retrieving and cleaning the main readable content from a given webpage. It prepares the text for Named Entity Recognition (NER) by removing clutter like ads, navigation bars, and code snippets — ensuring only meaningful content like headings and paragraphs is kept.
Fetch the Webpage
response = requests.get(url, timeout=10)
response.raise_for_status()
- This line connects to the web page and downloads its content.
- The timeout=10 ensures the request won’t hang for too long.
- If the page cannot be reached or returns an error (e.g., 404 Not Found), the function gracefully handles it by printing an error and returning an empty string.
Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
elements = soup.find_all(['p', 'h1', 'h2', 'h3'])
· BeautifulSoup is used to read the HTML and make sense of the structure of the web page.
· Only key content tags are selected:
- Paragraph text (<p>) and headings (<h1>, <h2>, <h3>)
· This excludes menus, scripts, and decorative elements that don’t contribute meaningful textual content.
Clean and Filter the Text
text = el.get_text(separator=' ', strip=True)
if len(text) >= 30:
    visible_text.append(text)
- Each selected element is converted into clean, plain text.
- Blocks shorter than 30 characters are skipped to avoid noise such as headings like “Read More” or “Contact Us”.
Return Final Output
return "\n".join(visible_text)
- The meaningful text blocks are joined into a single string, separated by line breaks.
- This final output is ready to be passed to the NER model for further analysis.
Why This Matters:
- This function ensures that only valuable content reaches the NER model.
- By avoiding noisy elements, it increases the accuracy of entity recognition, making the overall scoring and ranking system more reliable.
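Putting the steps above together, a minimal sketch of extract_text might look like the following. The HTML-cleaning step is pulled into a helper (clean_html, a name introduced here for illustration) so the parsing logic can be exercised on its own; the real project code may combine them:

```python
import requests
from bs4 import BeautifulSoup


def clean_html(html: str) -> str:
    """Extract meaningful text blocks from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    visible_text = []
    # Keep only content-bearing tags: paragraphs and top headings.
    for el in soup.find_all(["p", "h1", "h2", "h3"]):
        text = el.get_text(separator=" ", strip=True)
        if len(text) >= 30:  # skip short noise like "Read More"
            visible_text.append(text)
    return "\n".join(visible_text)


def extract_text(url: str) -> str:
    """Fetch a page and return only its meaningful visible text."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return ""
    return clean_html(response.text)
```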
Function: preprocess_text(raw_text)
Function Purpose:
This function is designed to clean and normalize the raw text extracted from a webpage before it is sent to the NER model. The goal is to reduce noise and make the text easier for the model to understand, improving the accuracy and consistency of entity recognition.
Step-by-Step Breakdown
Remove URLs
text = re.sub(r'\b(?:https?|www|ftp)\S+\b', '', raw_text)
- Removes links like https://example.com or www.website.com.
- URLs rarely contribute useful named entities and may distract the model.
Remove Special Characters
text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
- Strips out symbols, punctuation marks, and special characters.
- This focuses the text on actual words and meaningful sequences.
Remove Extra Spaces
text = re.sub(r'\s+', ' ', text)
- Replaces multiple spaces or irregular spacing with a single space.
- Improves text uniformity and model readability.
Trim Whitespace
text = text.strip()
- Removes any leading or trailing white spaces.
Remove Numbers
text = re.sub(r'\d+', '', text)
- Numbers (years, phone numbers, etc.) are removed unless relevant to the specific use case.
- This step is helpful in general content understanding where numbers often do not indicate named entities.
Convert to Lowercase
text = text.lower()
- Converts all text to lowercase.
- Helps treat variants like “Google” and “google” as the same term during counting and deduplication. (Note that cased NER models also use capitalization as a clue, so some pipelines run NER before lowercasing.)
Why This Matters:
- Preprocessing ensures that only clean, consistent, and structured text is passed to the NER model.
- This helps the model reduce confusion, identify entities more accurately, and avoid being misled by clutter or formatting.
- Without this step, results could include irrelevant or missed entities due to poor formatting or noise in the data.
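The cleaning steps above can be collected into one runnable sketch. The ordering here collapses whitespace after removing numbers so leftover gaps merge cleanly; the project’s actual code may order the steps slightly differently:

```python
import re


def preprocess_text(raw_text: str) -> str:
    """Normalize extracted page text before NER, following the steps above."""
    text = re.sub(r"\b(?:https?|www|ftp)\S+\b", "", raw_text)  # strip URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)                # strip symbols/punctuation
    text = re.sub(r"\d+", "", text)                            # drop bare numbers
    text = re.sub(r"\s+", " ", text)                           # collapse whitespace
    return text.strip().lower()                                # trim and lowercase
```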
Function: load_model()
Loads a pre-trained Named Entity Recognition (NER) model using the Flair library.
Why It’s Used:
Provides a ready-to-use model for identifying named entities in text, allowing for fast integration without custom training
Model Section: Flair NER (OntoNotes-Large)
The Named Entity Recognition (NER) functionality in this project is powered by the Flair NER model based on the OntoNotes 5.0 dataset. This model is one of the most comprehensive publicly available NER models and is integrated via the Flair NLP framework.
What Is This Model?
The backbone of this project is a Named Entity Recognition (NER) model called flair/ner-english-ontonotes-large. It is a transformer-based model built on XLM-RoBERTa and trained on the OntoNotes 5.0 dataset, a well-known benchmark in the NLP community for tasks involving linguistic annotations. This model is integrated through the Flair NLP framework and is specialized in identifying a wide range of named entities across diverse types of online text, including webpages, news articles, and social media.
What This Model Does
The model scans web content and identifies spans of text that correspond to specific real-world concepts such as people, companies, places, events, and more. For each recognized entity, it assigns one of 18 predefined categories. This allows the system to “understand” which named concepts are present on the page and what types of entities dominate the content.
The model works token by token, using the BIO (Beginning, Inside, Outside) tagging scheme to mark whether each token begins an entity (B), continues an entity (I), or lies outside of any entity (O). This enables multi-word entities like “New York City” or “International Business Machines” to be identified accurately.
Underlying Architecture of the Model
- The model uses XLM-RoBERTa, a large-scale transformer pretrained on multiple languages, known for its ability to capture contextual semantics.
- It processes text using self-attention mechanisms across 24 transformer layers.
- The model generates dense embeddings (vector representations) of words and uses them to predict entity labels.
- A linear classifier head at the end converts these embeddings into named entity labels based on learned patterns.
This architectural foundation allows the model to achieve state-of-the-art performance across multiple NER benchmarks and domains.
Why This Model Was Selected
- Broad Entity Coverage: Unlike simpler models that only detect names and places, this model identifies 18 different entity categories, enabling deeper content analysis.
- Contextual Precision: The use of transformers ensures that entity predictions take context into account, which is essential for resolving ambiguity.
- Domain Versatility: The model is trained on diverse data types (news, conversation, web data), making it suitable for analyzing a wide variety of website content.
- Robustness to Noise: It performs well even on noisy or informal content, which is common on modern web pages.
All Entity Classes of this model
The Flair OntoNotes model predicts one of the following 18 entity classes for each detected entity span:
- PERSON – Refers to names of people. Example: “Marie Curie”, “Elon Musk”. Useful for identifying influencers, authors, or public figures mentioned in content.
- NORP – Refers to nationalities, religious groups, or political affiliations. Example: “American”, “Buddhist”, “Democrats”.
- FAC – Short for facilities. Includes buildings, airports, highways, bridges, etc. Example: “Golden Gate Bridge”, “JFK Airport”.
- ORG – Denotes companies, institutions, or agencies. Example: “Google”, “Harvard University”.
- GPE – Stands for Geo-Political Entities. Includes countries, cities, and states. Example: “Germany”, “New York”, “California”.
- LOC – Covers non-GPE locations such as mountain ranges, bodies of water, and physical regions. Example: “Himalayas”, “Pacific Ocean”.
- PRODUCT – Consumer products or commercial goods. Example: “iPhone”, “Tesla Model 3”, “Microsoft Word”.
- EVENT – Named events like wars, festivals, or conferences. Example: “World War II”, “CES 2024”.
- WORK_OF_ART – Artistic creations such as books, songs, movies, and paintings. Example: “Inception”, “Mona Lisa”, “Hamlet”.
- LAW – Legal documents or named laws. Example: “GDPR”, “Civil Rights Act”.
- LANGUAGE – Human languages. Example: “English”, “Mandarin”, “Spanish”.
- DATE – Specific dates or date references. Example: “January 1st”, “2025”, “last Monday”.
- TIME – Time expressions. Example: “3 PM”, “midnight”, “an hour later”.
- PERCENT – Percentages. Example: “20%”, “nearly 5 percent”.
- MONEY – Monetary values. Example: “$100”, “two million euros”.
- QUANTITY – Measurements and amounts. Example: “3 kilometers”, “ten liters”.
- ORDINAL – Ordered numbers. Example: “first”, “third”, “10th place”.
- CARDINAL – Non-ordered numbers. Example: “three”, “100”, “forty-two”.
Strengths of This Entity Set
This rich label set allows the system to perform multidimensional content profiling. For instance, a page with high mentions of ORG, PRODUCT, and GPE entities might be commercial or geopolitical in nature. A page with high PERSON and EVENT entities may reflect news or biographical content. This granularity is essential for ranking and evaluating pages based on topical authority and semantic richness.
Function: extract_named_entities()
This function is the core NER extraction module of the project. It takes in a block of cleaned text and uses the Flair OntoNotes NER model to identify and return relevant named entities with associated metadata such as entity type and confidence score.
sentence = Sentence(text)
- What it does: Converts raw text into a format understood by the Flair NER model.
- Why it matters: Flair requires text to be wrapped inside a Sentence object to process it for entity recognition.
tagger.predict(sentence)
- What it does: Runs the pre-trained NER model on the input sentence and attaches entity predictions to it.
- Why it matters: This is the core prediction step where named entities are extracted by the model.
for entity in sentence.get_spans('ner'):
- What it does: Iterates through all identified named entity spans in the sentence.
- Why it matters: Flair groups recognized entities into spans like names, organizations, locations, etc. This loop retrieves each of them for processing.
label = entity.get_label('ner').value
- What it does: Retrieves the predicted label (entity type) such as PERSON, ORG, PRODUCT, etc.
- Why it matters: Helps understand what kind of entity was detected—this is the semantic category used for scoring and visualization.
score = entity.get_label('ner').score
- What it does: Gets the model’s confidence in the prediction (between 0 and 1).
- Why it matters: Allows filtering out low-confidence predictions to improve reliability.
if min_score_threshold and score < min_score_threshold: continue
- What it does: Skips entities that fall below the minimum confidence threshold.
- Why it matters: Filters out uncertain predictions, ensuring only reliable entities are used.
if exclude_keywords and any(re.search(pat, word.lower()) for pat in exclude_keywords): continue
- What it does: Skips entities that match any unwanted keyword or pattern (like “test”, “lorem”, etc.).
- Why it matters: Helps remove noisy or irrelevant terms that don’t add value to the ranking.
if word.lower() not in seen_words:
- What it does: Ensures that each entity is counted only once (case-insensitive).
- Why it matters: Prevents over-representation of repeated terms in the same text block.
entities.append({…})
- What it does: Stores valid entity data (word, label, confidence) into a structured list.
- Why it matters: This final list is used for scoring, ranking, and visualizations.
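The filtering logic above can be sketched independently of the model. Here the predictions list is a hand-made stand-in for Flair’s output (word, label, score triples), so the confidence-threshold, keyword-exclusion, and deduplication steps can be seen in isolation:

```python
import re


def filter_entities(predictions, min_score_threshold=0.8, exclude_keywords=None):
    """Apply the confidence, keyword, and dedup filters described above."""
    exclude_keywords = exclude_keywords or []
    entities, seen_words = [], set()
    for word, label, score in predictions:
        if min_score_threshold and score < min_score_threshold:
            continue  # drop low-confidence predictions
        if any(re.search(pat, word.lower()) for pat in exclude_keywords):
            continue  # drop noisy or irrelevant terms
        if word.lower() in seen_words:
            continue  # count each entity once, case-insensitively
        seen_words.add(word.lower())
        entities.append({"word": word, "entity_group": label, "score": score})
    return entities


# Mocked model output: one keeper, one duplicate, one excluded, one low-confidence.
mock = [("Google", "ORG", 0.99), ("google", "ORG", 0.97),
        ("lorem ipsum", "PERSON", 0.95), ("Berlin", "GPE", 0.40)]
print(filter_entities(mock, exclude_keywords=[r"lorem"]))
```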
Understanding the Named Entity Output
Each dictionary inside the output list represents a named entity detected from the cleaned text using the Flair NER model. The components are:
1. word
- The exact term or phrase from the text identified as an entity.
- Example: “forbes agency council”, “today”, “three”
2. entity_group
- The type or class of the entity, such as:
  - ORG: Organizations
  - DATE: Temporal references (dates, time expressions)
  - CARDINAL: Numbers (counts, quantities)
- These help categorize what the term represents in context.
3. score
- Confidence score (between 0 and 1) showing how certain the model is about the classification.
- Higher values (closer to 1) indicate greater certainty.
- Example: “forbes agency council” has a score of 0.9995, showing very high confidence.
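For reference, a result list with the shape described above might look like this (the first score is the one quoted; the other two values are made-up placeholders for illustration):

```python
# Illustrative output of the extraction step -- only the first score
# comes from the example above; the rest are placeholder values.
ner_results = [
    {"word": "forbes agency council", "entity_group": "ORG", "score": 0.9995},
    {"word": "today", "entity_group": "DATE", "score": 0.98},
    {"word": "three", "entity_group": "CARDINAL", "score": 0.97},
]
```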
Function: create_entity_profile
This function is responsible for creating a summary profile of named entities extracted from a web page. It analyzes the list of named entities and aggregates them based on their entity type (like ORG, PERSON, DATE, etc.). The output is a dictionary that shows how many times each entity type appears, helping to understand what kinds of named entities dominate the content.
This is useful for:
- Highlighting which entity categories are prominent on a page.
- Comparing entity distributions across different pages or websites.
- Providing high-level insights for visualizations or content audits.
Explanation
entity_count = Counter()
- Initializes a Counter object from the collections module.
- This will store counts of how many times each entity type appears.
for entity in ner_results:
- Iterates over each named entity returned by the NER model.
entity_group = entity['entity_group']
- Extracts the entity type label (e.g., ‘ORG’, ‘PERSON’) from the result.
entity_count[entity_group] += 1
- Increments the count for the corresponding entity type.
- For example, if ‘ORG’ is seen multiple times, it keeps a running total.
return dict(entity_count)
- Converts the Counter object into a standard Python dictionary for easier use later (e.g., in plots or scoring logic).
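Assembled from the lines above, the whole function is only a few lines; the sample input here is a minimal mock of the NER output:

```python
from collections import Counter


def create_entity_profile(ner_results):
    """Aggregate NER output into counts per entity type."""
    entity_count = Counter()
    for entity in ner_results:
        entity_count[entity["entity_group"]] += 1
    return dict(entity_count)


# Mock NER output: two organizations and one geopolitical entity.
sample = [{"entity_group": "ORG"}, {"entity_group": "ORG"}, {"entity_group": "GPE"}]
print(create_entity_profile(sample))  # {'ORG': 2, 'GPE': 1}
```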
Interpreting the Output
Each key in the dictionary is an entity type, and the value is the number of times that type appeared in the webpage’s text. For instance:
'ORG': 7
There are 7 occurrences of entities classified as organizations.
'GPE': 27
27 geopolitical entities (such as countries, cities, or states) were detected — this is a strong indicator that location-based content dominates the page.
'CARDINAL': 3
Three numerical expressions (like counts, rankings, etc.) were found.
'WORK_OF_ART': 1
There’s one mention of an artistic title or creative work.
Practical Significance
This profile helps answer questions like:
- What kind of content does this page focus on?
- Is the content more people-centric (PERSON) or organization-centric (ORG)?
- Does the content include many locations (GPE) or dates (DATE)?
- Are there creative assets or references (e.g., WORK_OF_ART)?
Such insights are valuable for SEO strategists, content auditors, and analysts evaluating whether a webpage aligns with topical authority or target audience relevance.
Function: calculate_entity_score
This function evaluates a web page based on its named entities using three key components:
- Diversity – Measures how many different types of entities appear (e.g., ORG, PERSON, GPE).
- Density – Measures how frequently entities appear in proportion to total word count.
- Confidence – Measures the average prediction score of all detected entities.
Each of these components contributes to a final score that reflects how entity-rich and semantically structured the page is — crucial for evaluating topical authority and content depth in SEO.
Explanations:
Entity Weighting
ENTITY_WEIGHTS = {…}
Each entity type is given a predefined weight based on its assumed semantic value. For example:
- ‘ORG’ and ‘PERSON’ have the highest weight (1.2) — indicating importance for authoritative content.
- ‘CARDINAL’ and ‘ORDINAL’ have the lowest (0.03, 0.05) — often add noise and are less semantically useful.
Weighted Diversity Score
unique_types = set(ent['entity_group'] for ent in entities)
- This line identifies how many distinct entity types are present.
diversity_score = sum(ENTITY_WEIGHTS.get(t, 0.0) for t in unique_types)
max_possible_diversity = sum(sorted(ENTITY_WEIGHTS.values(), reverse=True)[:len(unique_types)])
weighted_diversity_score = diversity_score / max_possible_diversity
- It normalizes the diversity score by comparing actual diversity against the ideal max diversity if the top-N most important types were all present.
Density Score
weighted_entity_count = sum(ENTITY_WEIGHTS.get(ent['entity_group'], 0) for ent in entities)
density_score = weighted_entity_count / (total_word_count / 1000)
- Computes how many weighted entities appear per 1000 words — a length-adjusted frequency.
final_density_score = np.log1p(density_score) / np.log1p(MAX_REASONABLE_DENSITY)
- Applies log-scaling to prevent overly frequent entities from dominating the score.
- Caps at an upper bound of 50 entities per 1000 words to avoid skew.
Confidence Score
avg_confidence = sum(ent['score'] for ent in entities) / len(entities)
- Averages the prediction confidence values of the NER model for all entities, indicating model certainty.
Final Combined Score
final_score = (0.4 * weighted_diversity_score) + (0.4 * final_density_score) + (0.2 * avg_confidence)
- The three components are weighted:
- Diversity (40%)
- Density (40%)
- Confidence (20%)
This balanced weighting ensures that the final score favors rich, frequent, and confidently identified entities.
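Combining the pieces above into one runnable sketch: the weights dictionary here is an illustrative subset (the text only specifies ORG/PERSON at 1.2 and CARDINAL/ORDINAL at 0.03/0.05 — the remaining values are assumptions), and the min(..., 1.0) clamp is an added safeguard for densities above the cap:

```python
import numpy as np

# Illustrative subset of weights; the full project table covers all 18 types.
ENTITY_WEIGHTS = {"ORG": 1.2, "PERSON": 1.2, "GPE": 1.0, "PRODUCT": 1.0,
                  "ORDINAL": 0.05, "CARDINAL": 0.03}
MAX_REASONABLE_DENSITY = 50  # cap: 50 weighted entities per 1000 words


def calculate_entity_score(entities, total_word_count):
    """Score a page on entity diversity, density, and model confidence."""
    if not entities or total_word_count == 0:
        return {"Diversity": 0.0, "Density": 0.0, "Confidence": 0.0, "Final": 0.0}

    # Diversity: weights of the distinct types present vs. the best possible set.
    unique_types = {ent["entity_group"] for ent in entities}
    diversity_score = sum(ENTITY_WEIGHTS.get(t, 0.0) for t in unique_types)
    max_possible = sum(sorted(ENTITY_WEIGHTS.values(), reverse=True)[:len(unique_types)])
    weighted_diversity = diversity_score / max_possible

    # Density: weighted entities per 1000 words, log-scaled against the cap.
    weighted_count = sum(ENTITY_WEIGHTS.get(ent["entity_group"], 0) for ent in entities)
    density = weighted_count / (total_word_count / 1000)
    final_density = min(np.log1p(density) / np.log1p(MAX_REASONABLE_DENSITY), 1.0)

    # Confidence: mean model certainty across all entities.
    avg_confidence = sum(ent["score"] for ent in entities) / len(entities)

    final = 0.4 * weighted_diversity + 0.4 * final_density + 0.2 * avg_confidence
    return {"Diversity": weighted_diversity, "Density": final_density,
            "Confidence": avg_confidence, "Final": final}
```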
Interpreting the Output
The goal of this code is to compute a final entity score that reflects the semantic richness of a web page’s content based on named entities.
Explanation:
total_words = len(cleaned_text.split())
- Counts the total number of words in the page content.
- This count is used to normalize entity frequency per 1000 words, helping adjust for document length.
score = calculate_entity_score(ner_results, total_words)
- Calls the previously defined calculate_entity_score function.
- Uses the list of named entities (ner_results) and the word count to generate: Entity Diversity Score, Entity Density Score, Confidence Score, Final Composite Score.
Output Explanation:
1. Diversity (0.764)
- Indicates moderate-to-high variety in the types of entities used.
- A score of 0.764 means that the content includes multiple important types (e.g., ORG, PERSON, GPE), but does not cover the full spectrum of top-weighted entities.
2. Density (0.65011)
- Reflects the number of weighted entities per 1000 words.
- A 0.65 score here means entities are well distributed through the content, suggesting good semantic saturation without redundancy.
3. Confidence (0.94981)
- Very high average confidence from the NER model in tagging — nearly 95% average probability.
- This indicates the entities were clearly and correctly identified by the model.
4. Final Score (0.75561)
- This weighted score combines all three factors, showing an overall strong entity profile for the page.
- A score above 0.75 generally implies the content is semantically rich, diverse, and confidently tagged — important for SEO and topical authority evaluation.
Function: visualize_entity_distribution
The visualize_entity_distribution function generates a bar chart showing the distribution of high-importance named entity types within the content of a specific URL. It filters out less significant types (based on a weight threshold) and displays only the most meaningful entity categories (like ORG, PERSON, GPE, etc.). Each bar reflects the frequency of a specific entity type, helping visualize how dominant or diverse the high-priority entities are in the content.
Explanation:
top_entities = {k for k, v in ENTITY_WEIGHTS.items() if v >= 0.8}
- Defines which entity types are considered important for visualization. Only entities with a weight ≥ 0.8 (e.g., ORG, PERSON, GPE, PRODUCT, etc.) are selected.
filtered_profile = {k: v for k, v in profile.items() if k in top_entities}
- Filters the input profile to retain only those entity types considered important. This ensures the chart remains focused on entities that meaningfully contribute to ranking.
if not filtered_profile: return
- If the filtered profile is empty (i.e., no high-weight entity types present), the function skips visualization for that URL.
df = pd.DataFrame({…}).sort_values(by="Count", ascending=False)
- Converts the filtered profile into a DataFrame for plotting, sorting the entity types by descending count for better readability in the chart.
sns.barplot(…)
- Generates the actual bar chart using Seaborn, plotting each high-weight entity type on the x-axis and their corresponding counts on the y-axis. Color is assigned by entity type.
plt.xticks(rotation=45)
- Rotates the x-axis labels for better legibility, especially when there are many types.
plt.title(…) and plt.tight_layout()
- Adds a title and ensures clean layout formatting without overlaps.
This visualization helps content analysts or SEO strategists quickly identify which entity types dominate a page, and whether the distribution aligns with the target domain — e.g., content dominated by ORG, PERSON, and GPE might indicate strong authority-related information.
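Under stated assumptions — the ENTITY_WEIGHTS subset below is illustrative, and plain matplotlib stands in for the seaborn call described above — the function’s core logic can be sketched as:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen for this sketch
import matplotlib.pyplot as plt

# Illustrative subset of the project's weight table.
ENTITY_WEIGHTS = {"ORG": 1.2, "PERSON": 1.2, "GPE": 1.0, "CARDINAL": 0.03}


def visualize_entity_distribution(profile, url):
    """Bar chart of high-weight entity type counts for one URL."""
    top_entities = {k for k, v in ENTITY_WEIGHTS.items() if v >= 0.8}
    filtered = {k: v for k, v in profile.items() if k in top_entities}
    if not filtered:
        return None  # nothing high-weight to plot

    # Sort descending by count for readability.
    items = sorted(filtered.items(), key=lambda kv: kv[1], reverse=True)
    fig, ax = plt.subplots()
    ax.bar([k for k, _ in items], [v for _, v in items])
    ax.set_title(f"Entity Type Distribution: {url}")
    plt.setp(ax.get_xticklabels(), rotation=45)
    fig.tight_layout()
    return fig


fig = visualize_entity_distribution({"GPE": 27, "ORG": 7, "CARDINAL": 3}, "example.com")
```

CARDINAL is filtered out here because its weight falls below the 0.8 threshold, so only two bars are drawn.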
Explanation of the Generated Visualization:
The function generates a bar chart titled “Entity Type Distribution”. On the x-axis, it shows the filtered entity types:
GPE, ORG, PERSON, EVENT, WORK_OF_ART.
On the y-axis, it displays their corresponding frequencies:
GPE: 27 occurrences — the tallest bar, clearly dominating the chart.
ORG: 7 occurrences — second highest bar, indicating a strong presence of organization-related entities.
PERSON: 1 occurrence
EVENT: 1 occurrence
WORK_OF_ART: 1 occurrence
The last three are shorter bars of equal height (count = 1), showing minimal presence.
This visualization quickly communicates that the geopolitical entities (GPE) are heavily represented, followed by ORG, while other high-weight categories are present in much smaller numbers. It helps assess if the content is overly focused on location-based mentions, potentially revealing a skew that may or may not align with content goals (e.g., global authority vs. local focus).
Function: visualize_score_breakdown()
The function visualize_score_breakdown is designed to create a bar plot that visually represents the breakdown of the entity score components for a particular URL. These components include diversity, density, and confidence scores. The scores are displayed as separate stacked bars, allowing a clear visual understanding of how each component contributes to the overall score for the URL. The function takes a list of score breakdowns and plots them using a bar chart.
Explanation:
DataFrame creation for plotting
df = pd.DataFrame({
    "Component": ["Diversity", "Density", "Confidence"],
    "Score": [score["Diversity"], score["Density"], score["Confidence"]]
})
- This line creates a Pandas DataFrame that holds the names of the score components (Diversity, Density, and Confidence) as well as their corresponding values. The score dictionary is passed to extract each component’s value.
Plotting the bar chart
sns.barplot(data=df, x="Component", y="Score", hue=df['Component'], legend=False)
- This line uses Seaborn’s barplot to plot the bar chart. The x axis represents the components (Diversity, Density, Confidence), and the y axis represents the corresponding scores. The hue parameter is used to differentiate the bars based on their component type.
Set plot limits and title
plt.ylim(0, 1)
plt.title("Score Breakdown")
- The y-axis limit is set between 0 and 1 so the scores fall within a standardized range, and the title “Score Breakdown” is added to describe the chart.
Tight layout and display
plt.tight_layout()
plt.show()
- plt.tight_layout() adjusts the layout so that plot components do not overlap, and plt.show() displays the plot.
Visualization Output:
The bar chart produced by this function contains three vertical bars, each corresponding to one of the score components:
- Diversity — the first bar will show a height of 0.764, representing the weighted variety of entity types detected in the content.
- Density — the second bar will rise to 0.65011, indicating the relative frequency of named entities across the text, normalized per 1000 words.
- Confidence — the third bar will nearly reach the top, at 0.94981, representing the model’s average certainty across all extracted entities.
All bars are plotted on a common vertical scale from 0 to 1, allowing for easy comparison across the three dimensions. The bars are color-coded by component and clearly labeled on the x-axis. The chart’s title is “Score Breakdown”, and there’s no legend since the hue coloring is self-explanatory.
This visualization gives an at-a-glance summary of how balanced and strong the named entity profile is for the URL — combining entity variety, frequency, and reliability.
Result Discussion — Multi-URL Named Entity Analysis
In this section, the entity-driven analysis was expanded from a single URL to a list of URLs, allowing a comparative evaluation of content quality and topical authority across multiple web pages. Each URL underwent named entity recognition (NER), followed by profiling, scoring, and visualization. This multi-URL analysis offers deeper insight into how consistently a website embeds meaningful, high-confidence named entities throughout its pages — a key factor in topical relevance and semantic richness.
Entity Diversity Across Pages
Each page was evaluated for its diversity of entity types, placing greater value on entities such as organizations (ORG), persons (PERSON), and locations (GPE). Pages with a wider range of these higher-weight categories reflect a richer informational context. Diversity, in this sense, isn’t merely about quantity but about variety and relevance of named entities, which often correlates with broader topic coverage and higher semantic specificity.
Pages with limited entity variety tend to be more narrow in scope, potentially missing opportunities to signal domain authority or topical depth to both users and search engines.
Entity Density as a Signal of Informational Density
Another dimension evaluated was entity density — the number of weighted entities normalized over 1,000 words. This metric helps distinguish between content that is densely packed with meaningful references versus pages that are sparse or overly generic. High-quality pages typically exhibit a balance: they mention enough prominent entities to anchor the content in reality without being overly repetitive or keyword-stuffed.
Entity density also indirectly reflects the information-to-noise ratio — valuable when assessing whether a page offers concrete, referential knowledge or leans heavily on filler content.
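A minimal sketch of such a density metric follows. The saturation cap used to squash the per-1,000-words rate into [0, 1] is an assumed parameter, not a value taken from the project.

```python
def entity_density(weighted_entity_count, word_count, cap=50.0):
    """Weighted entities per 1,000 words, scaled into [0, 1].

    cap is an assumed saturation point: pages at or above `cap`
    weighted entities per 1,000 words score the maximum of 1.0.
    """
    if word_count == 0:
        return 0.0
    per_thousand = weighted_entity_count / word_count * 1000
    return min(per_thousand / cap, 1.0)
```

The cap keeps keyword-stuffed pages from scoring arbitrarily high: beyond the saturation point, extra entity mentions add no further credit.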
Confidence as a Signal of Model Certainty
The confidence score, derived from the NER model’s internal certainty for each entity extraction, plays a vital role in assessing content clarity and precision. Higher confidence generally corresponds to well-structured, coherent text, where entities are clearly expressed and unambiguous. Pages with inconsistent sentence structure, ambiguous phrasing, or excessive jargon may lead to lower confidence levels, which in turn reduce the reliability of entity-based scoring.
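In sketch form, assuming each extracted entity carries a model-assigned probability (the dictionary shape below mirrors what transformer-based token-classification pipelines typically return, but is an assumption about this project's internals), the page-level confidence is simply the mean:

```python
def confidence_score(entities):
    """Average the model's per-entity confidence for a page.

    entities: a list of dicts such as
    {"text": "Google", "label": "ORG", "score": 0.97}.
    """
    if not entities:
        return 0.0
    return sum(e["score"] for e in entities) / len(entities)
```

A page whose entities are extracted at 0.9 and 1.0 certainty would thus score 0.95, while ambiguous text that yields low per-entity probabilities drags the average down.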
Visualization Interpretation
Two forms of visualization were used to support the interpretation:
- Score Breakdown Charts: These provide a side-by-side view of diversity, density, and confidence for each URL. Such visual breakdowns are useful for spotting imbalances, such as high density but low diversity, which might indicate over-optimization or narrow content focus.
- Entity Type Distribution Charts: These reveal the dominant entity types per page. For instance, a content piece heavily focused on GPE entities (locations) might reflect geographic targeting, while a page rich in ORG and PERSON entities could signify brand authority, expert contributions, or case studies.
Content Consistency and Strategic Opportunities
Analyzing multiple URLs together uncovers patterns — some pages may consistently perform well across all components, while others may lack either confidence or entity richness. These differences can guide content audits or SEO strategy improvements, such as:
- Enhancing underperforming pages with more precise references (e.g., adding organization names, expert contributors, or product mentions).
- Structuring content to improve readability and model confidence.
- Identifying content clusters that align well with high-authority signals through strong entity presence.
What does this project tell us about our website content?
This project analyzes your website content using Named Entity Recognition (NER) — a technique that identifies important real-world references such as organizations, locations, people, products, and dates. The goal is to measure how semantically rich, well-structured, and topically authoritative your content is. By scoring content based on entity diversity, density, and confidence, you receive an objective measure of how well your pages demonstrate depth, relevance, and clarity to both users and search engines.
In simpler terms, we’re asking: Are we talking about the right things, in a clear and credible way, and often enough to matter?
What does the confidence score mean, and why is it important?
The confidence score reflects how certain the NER model is that the text contains valid named entities. High confidence generally means your content is:
- Grammatically structured,
- Semantically clear,
- Free from excessive ambiguity or noise.
If your confidence score is low, consider:
- Rewriting unclear or fragmented sentences.
- Avoiding jargon, filler, or keyword-stuffed content.
- Breaking long, complex sentences into clearer, simpler structures.
Improving this helps both NLP models and human readers better understand your content — which contributes to better rankings and user trust.
How can we use the entity type distribution chart practically?
This chart shows which types of named entities dominate each page. For example, a page might heavily reference GPE (locations), while another focuses on ORG (organizations).
Use this to:
- Align entity focus with search intent: If a service page is targeting brand-related queries, ensure it includes organizations and product names.
- Balance content across clusters: Ensure your blog or resource center isn’t overly focused on only one entity type (like places), which could signal narrow topical coverage.
- Support internal linking strategy: Pages with strong ORG and PRODUCT references may deserve more internal links as topic anchors or cluster heads.
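The distribution behind such a chart is straightforward to derive: count the entity labels found on a page and normalize them into shares. A minimal sketch, assuming each entity is a dict with a "label" key:

```python
from collections import Counter


def entity_type_distribution(entities):
    """Return each entity label's share of a page, e.g. {"ORG": 0.5, ...}."""
    counts = Counter(e["label"] for e in entities)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

For example, a page with two ORG, one GPE, and one PERSON mention yields {"ORG": 0.5, "GPE": 0.25, "PERSON": 0.25}, which maps directly onto the bars of the distribution chart.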
What’s the first thing to do after reviewing the scores and entity breakdowns?
Begin with a content audit of each page that received a low or moderate score in either diversity, density, or confidence. Examine the actual content on those pages and assess whether it:
- References relevant brands, people, locations, or products,
- Uses clear and unambiguous language,
- Aligns with a focused topic that matches your SEO goals.
This audit helps determine whether the content is too shallow, vague, or disconnected from authoritative entities in your niche.
If a page has low confidence in entity extraction, what should be fixed?
Low confidence typically results from:
- Poor grammar, overly complex sentences, or fragmented structure.
- Excessive filler text, boilerplate phrasing, or vague writing.
- Lack of clarity in sentence roles — making it harder for models to detect proper names and types.
To fix this:
- Simplify sentence structure and improve readability.
- Replace vague terms with specific examples or references.
- Use consistent capitalization and punctuation to improve clarity.
These changes not only help the model but also improve user comprehension and trust.
What’s the recommended next step for content that already scores well?
Pages with high scores are strong candidates for:
- Featured snippets or link building — their entity-rich structure makes them suitable for outreach.
- Content repurposing — turn them into PDFs, videos, or case studies while preserving entity structure.
- Anchor content in cluster models — use these as authoritative sources and link other content toward them.
Maintaining and expanding these high-performing assets is key to long-term SEO gains.
Final Thoughts
In conclusion, this analysis offers valuable insights into how well your content aligns with the needs of search engines and the expectations of your target audience. By evaluating entity diversity, density, and confidence, you gain a deeper understanding of your content’s relevance, comprehensiveness, and credibility.
The scores and breakdowns highlight areas where your content excels and pinpoint opportunities for improvement, whether through enriching the diversity of referenced entities, increasing entity density, or enhancing clarity and confidence in the text. The entity distribution provides a clear view of how well your content represents key topics, helping you optimize both content and internal linking strategies.
Moving forward, this data-driven approach should guide your content refinement efforts, allowing you to stay ahead of SEO trends, improve topical authority, and ultimately drive more meaningful engagement with your audience. Regular analysis and content optimization based on these insights will help you achieve sustainable growth and stronger positioning in search results.
The next steps should focus on iterating and applying these insights to further enhance content quality, improve internal structure, and ensure your pages consistently meet the evolving standards of relevance and authority in your niche.
Thatware | Founder & CEO
Tuhin is recognized across the globe for his vision to revolutionize the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, received the India Business Awards and the India Technology Award, was named among the Top 100 influential tech leaders by Analytics Insights and a Clutch Global Front runner in digital marketing, founded the fastest growing company in Asia according to The CEO Magazine, and is a TEDx and BrightonSEO speaker.