The purpose of this project is to implement a Named Entity Recognition (NER) system that can be applied to SEO to identify relevant entities like Persons, Organizations, Locations, and Miscellaneous from publicly available websites. By leveraging the power of a pre-trained BERT model for NER, the system will help identify which entities are mentioned on the target website, and how it compares to competitors’ websites in terms of coverage. This project helps identify content gaps by showing which entities (relevant to a particular domain) are missing or underrepresented in a given website.
How It Will Help to Extract Content Gap:
Content gap analysis is crucial in SEO for identifying which areas of a website are missing information or not covering important topics. This NER-based system helps identify entities that are being mentioned on competitors’ websites but are absent or not properly covered on your website. By comparing the entities detected from different websites, we can highlight areas where your website can improve its content to compete more effectively. This analysis will help in targeting additional keywords and topics to improve search engine rankings.
How It Will Help Website Owners:
For website owners and SEO specialists, this tool provides a valuable insight into their content coverage compared to competitors. By automatically extracting named entities such as Person, Organization, Location, and Miscellaneous, clients can identify opportunities for optimizing their content and targeting underrepresented topics. This will help improve their search engine rankings, making their websites more visible and competitive. This system can also be used for regular content audits to ensure that important entities are being covered comprehensively.
Why is NER crucial for SEO?
Named Entity Recognition helps identify important keywords (entities) that are present in the content of websites. By recognizing entities such as names of people, companies, locations, etc., it helps understand how well the content is aligned with the search intent and identifies areas where the website is lacking or has room for improvement.
What kind of content can benefit from this analysis?
Websites with informational articles, blogs, product descriptions, and even news articles will benefit from this analysis. NER highlights the important entities mentioned within the text; ensuring that relevant and valuable entities are adequately covered can boost a website’s relevance and visibility in search engines.
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model designed for natural language processing (NLP) tasks. It was developed by Google and is one of the most advanced and powerful models for understanding the context of words in a sentence.
How BERT Works:
BERT is trained using a technique called masked language modeling. In this approach, a percentage of words in a sentence are randomly hidden, and the model is tasked with predicting these masked words based on the surrounding context. This bidirectional training allows BERT to have a deep understanding of the context of words before and after the masked word.
Unlike traditional models that read text left-to-right or right-to-left, BERT can read in both directions, allowing it to capture the full meaning of a sentence.
Applications of BERT:
BERT is widely used for various NLP tasks such as:
- Text Classification (e.g., sentiment analysis)
- Named Entity Recognition (NER) (e.g., identifying names of people, organizations, and locations)
- Question Answering
- Text Summarization
The flexibility and efficiency of BERT have made it a state-of-the-art model for NLP, and it is often fine-tuned for specific tasks like NER.
PyTorch
What is PyTorch?
PyTorch is an open-source machine learning library used for deep learning applications. It is particularly well-suited for developing neural networks and training them on large datasets. PyTorch provides an easy-to-use API for defining models, running backpropagation, and optimizing neural networks.
What is PyTorch doing in the project?
In this project, PyTorch is used to manage and run the pre-trained BERT model for token classification. Specifically, PyTorch is responsible for performing the forward pass through the model, where the input (text) is passed through the model to predict the named entities.
Why is it used in the project?
PyTorch provides an efficient and flexible way to work with neural networks, which makes it the ideal framework for running BERT. The pre-trained BERT model is based on PyTorch, and PyTorch provides seamless integration for fine-tuning models on custom tasks, including Named Entity Recognition (NER).
Transformers
What is Transformers?
Transformers is a popular library developed by Hugging Face that provides access to state-of-the-art NLP models like BERT, GPT, and others. It simplifies the process of using pre-trained models for various NLP tasks such as text classification, question answering, and NER.
What is Transformers doing in the project?
In this project, the transformers library is used to load the pre-trained BERT model (dslim/bert-large-NER) and its tokenizer. The tokenizer converts text into a format that the BERT model can understand (token IDs), and the model performs NER to detect entities in the text.
Why is it used in the project?
The transformers library makes it extremely easy to use BERT and other pre-trained models for NLP tasks. It abstracts much of the complexity involved in loading, tokenizing, and running models, allowing us to focus on using the model to extract meaningful entities.
Requests
What is Requests?
Requests is a simple and easy-to-use Python library for making HTTP requests. It allows you to send HTTP requests (e.g., GET, POST) to web servers to fetch data.
What is Requests doing in the project?
In this project, the requests library is used to scrape content from websites by sending HTTP GET requests and retrieving the raw HTML content of the page. The content is then parsed to extract text from <p> tags using BeautifulSoup.
Why is it used in the project?
requests makes it easy to fetch data from websites and retrieve the raw HTML content needed to perform NER. Without requests, it would be difficult to programmatically access web pages.
BeautifulSoup
What is BeautifulSoup?
BeautifulSoup is a Python library used for parsing HTML and XML documents. It provides tools to extract data from web pages in a structured way, making it easier to work with web scraped data.
What is BeautifulSoup doing in the project?
In this project, BeautifulSoup is used to parse the HTML content fetched by the requests library. It extracts the text from <p> tags, which typically contain the main body content of articles and web pages.
Why is it used in the project?
BeautifulSoup simplifies the task of navigating and extracting useful data from web pages, which is essential for scraping content for NER analysis. It allows us to focus on extracting text without worrying about the details of HTML parsing.
About BERT Model
In this section, a pre-trained BERT model fine-tuned for Named Entity Recognition (NER) is used. The selected model for this project is dslim/bert-large-NER, a large-scale BERT model fine-tuned on a general NER task.
Model Details
The dslim/bert-large-NER model is a large-scale pre-trained BERT model fine-tuned for NER tasks. It is available through the Hugging Face Hub and is specifically designed to identify Named Entities in text, such as Person, Organization, Location, and Miscellaneous.
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture that processes text in both directions—left to right and right to left—allowing for a deeper understanding of context. This ability to analyze both sides of a word is critical for NER, where the context of a word plays a significant role in its identification as a specific entity.
Dataset for Fine-Tuning NER Models
The CoNLL-2003 dataset is commonly used for fine-tuning NER models. This dataset contains labeled text in which entities such as Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC) are annotated.
Understanding ‘B’ and ‘I’ Labels in NER
In Named Entity Recognition, entities can consist of multiple tokens. To accommodate this, two types of labels are used: B and I.
- B- (Beginning): Indicates the first token of a multi-token entity.
- I- (Inside): Marks a token inside a multi-token entity but not the first token.
- Example: For “Elon Musk”:
- Elon → B-PER (Beginning of Person)
- Musk → I-PER (Inside Person)
- Example: For “New York”:
- New → B-LOC (Beginning of Location)
- York → I-LOC (Inside Location)
- O (Outside): Used for tokens that are not part of any named entity.
Classes in the Model:
The dslim/bert-large-NER model uses the following classes to categorize named entities:
- Person (PER): Includes names of individuals, whether real or fictional.
- Example: “Elon Musk” → B-PER, I-PER
- Organization (ORG): Refers to names of companies, institutions, or any other kind of organization.
- Example: “Google” → B-ORG
- Location (LOC): Represents geographical locations, such as cities, countries, and landmarks.
- Example: “New York City” → B-LOC, I-LOC, I-LOC
- Miscellaneous (MISC): Covers events, products, or other terms that do not belong to the other categories.
- Example: “Olympics” → B-MISC
Using the Model for NER
The dslim/bert-large-NER model is employed to identify named entities from web content. The process involves:
- Input Text: A piece of text extracted from a webpage or article.
- Tokenization: The text is tokenized into words or subwords.
- Entity Recognition: The model classifies each token into one of the predefined NER classes.
- Output: The tokens and their respective labels are displayed, facilitating the identification of entities within the text.
Function predict_ner(sentence)
Purpose:
This function is responsible for predicting Named Entities (NER) from a given sentence using the pre-trained BERT-based NER model. It identifies entities such as Person, Organization, Location, and Miscellaneous.
Steps:
Tokenization:
- The function uses the tokenizer to convert the input sentence into tokens that the model can process.
- tokenizer(sentence, return_tensors="pt", truncation=True, padding=True) tokenizes the sentence and prepares it for input into the model.
Model Inference:
- torch.no_grad() is used to prevent the computation of gradients, which is unnecessary during inference and reduces memory usage.
- The model is then used to predict the entities of the tokens.
- model(**tokens).logits performs the forward pass through the model and provides raw predictions.
Select Predicted Labels:
- The predicted labels are determined by selecting the label with the highest score for each token.
- torch.argmax(outputs, dim=2) returns the predicted label index for each token.
Token Conversion:
- The function converts the predicted token indices into human-readable words using tokenizer.convert_ids_to_tokens().
Merge Subwords:
- The function merges tokens that start with ## into the previous token. This is done to handle subword tokens that BERT uses for out-of-vocabulary words.
- If the token starts with ##, it is appended to the previous word.
Return Output:
- The function returns a list of tuples containing the merged tokens and their corresponding predicted labels.
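The subword-merging step described above can be sketched in isolation as plain Python. The function name and signature here are illustrative, not the project's actual code; it operates on the token strings and labels that the tokenizer and model produce.

```python
def merge_subwords(tokens, labels):
    """Merge WordPiece subtokens (those starting with '##') into full words.

    tokens: token strings, e.g. from tokenizer.convert_ids_to_tokens()
    labels: the predicted NER label for each token (same length as tokens)
    Returns a list of (word, label) tuples, keeping the first subtoken's label.
    """
    merged = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and merged:
            # Continuation of the previous word: append without the '##' marker.
            prev_word, prev_label = merged[-1]
            merged[-1] = (prev_word + token[2:], prev_label)
        else:
            merged.append((token, label))
    return merged
```

For example, BERT may tokenize “SpaceX” as `Space` + `##X`; the helper glues them back into a single `("SpaceX", "B-ORG")` pair.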
Testing predict_ner Function with a Sample Sentence
This test evaluates the model’s ability to recognize named entities in a simple sentence. The function predict_ner(sentence) is used to tokenize the input text, pass it through the model, and retrieve named entity predictions for each token.
Input text:
“Elon Musk is the CEO of Tesla and SpaceX, which are based in the United States.”
Explanation of the Output:
The model identifies named entities and classifies them into Person (PER), Organization (ORG), and Location (LOC) categories. Each token in the sentence receives a label:
Person (PER):
- “Elon” is labeled as B-PER (Beginning of a Person entity).
- “Musk” is labeled as I-PER (Inside a Person entity), meaning it continues from “Elon”.
Organization (ORG):
- “Tesla” is labeled as B-ORG (Beginning of an Organization entity).
- “SpaceX” is labeled as B-ORG (Beginning of another Organization entity).
Location (LOC):
- “United” is labeled as B-LOC (Beginning of a Location entity).
- “States” is labeled as I-LOC (Inside a Location entity), meaning it continues from “United” to form “United States”.
Outside (O):
Words like “is”, “the”, “CEO”, “which”, “are”, “based”, “in”, “,”, and “.” do not belong to any named entity class and are labeled as “O” (Outside any entity classification).
Key Observations:
- The model successfully identifies Elon Musk as a person.
- Tesla and SpaceX are correctly recognized as organizations.
- United States is correctly classified as a location.
- Non-entity words (such as “is”, “CEO”, “which”) are correctly labeled as “O” (outside any named entity classification).
- The model distinguishes between different entity types by assigning B- (Beginning) and I- (Inside) labels.
Analysis of Named Entity Recognition (NER) on Webpage Title
The extracted webpage title:
“Google Business Profile Updates 2025: New Features Revealed”
The model identified:
- Google → Organization (ORG)
- Business → Miscellaneous (MISC)
Recognizing entities within webpage titles helps in:
- SEO Optimization – Enhancing visibility in search engines
- Entity-Based Content Structuring – Organizing information effectively
- Improving Search Relevance – Aligning with user intent and search queries
This approach aids in understanding how search engines interpret key entities, allowing for more targeted content strategies.
Function extract_entities(entities)
Purpose:
This function processes the output from predict_ner and categorizes the detected entities into four main categories: Person, Organization, Location, and Miscellaneous. It also removes duplicates and merges tokenized words (e.g., merging “Elon” and “Musk” into “Elon Musk”).
Steps:
Categorize Entities:
- The function checks each tokenized word and its corresponding label (e.g., B-PER, I-PER) and places them in the appropriate category.
- If the token’s label indicates it is a Person, Organization, Location, or Miscellaneous, the word is added to the corresponding list.
Merge Tokens:
- After categorizing the words, the function merges any subword tokens. For example, if the tokens are [“Elon”, “Musk”], it will merge them into a single entity “Elon Musk”.
- list(dict.fromkeys(entity_dict[key])) removes any duplicate entities from each category to ensure only unique entities are returned.
Return Categorized Entities:
- The function returns a dictionary with entity types as keys and the corresponding list of entities as values.
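A minimal sketch of this categorization, merging, and deduplication logic, assuming predict_ner returns (word, label) tuples (the exact names are illustrative):

```python
def extract_entities(entities):
    """Group (word, label) pairs from predict_ner into four categories.

    A B- label starts a new entity; an I- label of the same type extends
    the previous one (e.g. "Elon" + "Musk" -> "Elon Musk").
    """
    label_map = {"PER": "Person", "ORG": "Organization",
                 "LOC": "Location", "MISC": "Miscellaneous"}
    entity_dict = {"Person": [], "Organization": [],
                   "Location": [], "Miscellaneous": []}
    for word, label in entities:
        if label.startswith("B-"):
            entity_dict[label_map[label[2:]]].append(word)
        elif label.startswith("I-"):
            key = label_map[label[2:]]
            if entity_dict[key]:
                entity_dict[key][-1] += " " + word  # extend the current entity
            else:
                entity_dict[key].append(word)  # stray I- without a B-
        # "O" labels are skipped entirely
    # list(dict.fromkeys(...)) removes duplicates while preserving order.
    return {key: list(dict.fromkeys(vals)) for key, vals in entity_dict.items()}
```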
Function format_entities_output(entities_dict)
Purpose:
This function formats the extracted entities into a human-readable string, making it easier to view the detected named entities for each category (e.g., Person, Organization).
Steps:
Format Output:
- The function iterates over each entity category (e.g., Person, Organization, etc.) and formats the entities into a readable string.
- If there are entities for a category, it joins them with commas. If no entities are found, it displays “No entities found”.
Return Formatted Output:
- The function joins all the entity strings and returns them as a single formatted string.
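The formatting step can be sketched in a few lines of plain Python (names illustrative), taking the dictionary produced by extract_entities:

```python
def format_entities_output(entities_dict):
    """Render the categorized entities as a human-readable, multi-line string."""
    lines = []
    for category, entities in entities_dict.items():
        if entities:
            # Join the entities in this category with commas.
            lines.append(f"{category}: {', '.join(entities)}")
        else:
            lines.append(f"{category}: No entities found")
    return "\n".join(lines)
```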
Refined NER Analysis with Entity Extraction and Formatting
Using the entity extraction function extract_entities and the formatting function format_entities_output, the refined output presents:
Organization: Google
Miscellaneous: Business
This structured extraction improves clarity and usability, making it easier to analyze key entities within a webpage.
Function fetch_content(url)
Purpose:
This function is responsible for scraping the content from a given URL and extracting only the text within the <p> tags of the webpage. This step isolates the main body of text from other HTML elements like scripts or styles.
Steps:
Send Request:
- The function sends a request to the specified URL using the requests library.
- requests.get(url) retrieves the HTML content of the page.
Parse HTML:
- The function uses BeautifulSoup to parse the raw HTML content and make it easier to extract the text.
- BeautifulSoup(response.text, 'html.parser') parses the HTML text.
Extract Text:
- After parsing the HTML, the function extracts the text from all <p> tags, which typically contain the main content of the webpage.
- soup.find_all('p') finds all the paragraph tags.
- The content from each <p> tag is then joined together to form the entire text body.
Return Text:
- The function returns the text found in the <p> tags.
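The same <p>-extraction idea can be sketched with only the standard library; the project itself uses requests and BeautifulSoup, which this mirrors without those dependencies (names and error handling are illustrative):

```python
from html.parser import HTMLParser
import urllib.request


class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> tag, mirroring soup.find_all('p')."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def fetch_content(url):
    """Fetch a page and return the text of all its <p> tags, joined together."""
    with urllib.request.urlopen(url) as response:  # plays the role of requests.get(url)
        html = response.read().decode("utf-8", errors="ignore")
    parser = ParagraphExtractor()
    parser.feed(html)
    return " ".join(p.strip() for p in parser.paragraphs)
```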
Entity Extraction Results for Webpage Content
The following results were extracted from the webpage content of https://thatware.co/regex-for-seo-guide/:
- Person: No entities found
- Organization: ThatWare, Google, Search, Console, Analytics
- Location: No entities found
- Miscellaneous: Regex, Regular, Expressions, SEO, Wildcards
Analysis of the Results:
Person:
No personal names or entities were detected. This is expected for a page focused on technical or business topics rather than human figures. The absence of personal names could indicate that the content is less focused on specific individuals, which is typical for informational or corporate content.
Organization:
The recognized organizations are ThatWare, Google, Search, Console, and Analytics.
- ThatWare is the name of the company, correctly recognized as an organization.
- Google is also correctly identified as an organization, as expected from the content related to SEO tools and services.
- Search, Console, and Analytics are also identified, though they might be treated as part of the Google ecosystem. Here, “Search” could refer to Google Search, Console could refer to Google Search Console, and Analytics is likely referring to Google Analytics. These terms are commonly associated with SEO-related tools and platforms.
The presence of these terms highlights the webpage’s focus on SEO-related tools and services. However, for clearer results, these terms could be categorized as parts of a single entity (e.g., “Google Search Console” and “Google Analytics”). Further refinement could be done to merge these related terms into one organization entity.
Location:
No geographical locations were detected. This is not surprising for a page focused on SEO tools and techniques, which typically don’t emphasize specific locations unless referring to local SEO or regional services.
Miscellaneous:
The recognized miscellaneous entities are Regex, Regular, Expressions, SEO, and Wildcards.
- Regex, Regular, Expressions are related to the core topic of the webpage, which is regex in the context of SEO. These are technical terms used in SEO practices for pattern matching and search engine optimization.
- SEO is a key term in the context of the page and is rightly classified as miscellaneous due to its broad and technical nature.
- Wildcards is another technical term used in SEO for pattern matching, also fitting under miscellaneous. These terms reflect the technical focus of the page on regular expressions and SEO. They are important for understanding the content and its relevance to users searching for SEO-related technical information.
Implications for SEO and Content Analysis:
Person and Location: The absence of person and location entities suggests that the content is more informational and focused on business and technical aspects, which is typical for SEO guides or tutorials.
Organization: The extraction of organizations like Google and ThatWare helps in identifying key players and companies associated with SEO. Recognizing Google Search Console and Google Analytics as part of the organization helps align the content with well-known SEO tools, which is important for visibility in the SEO domain.
Miscellaneous: The presence of terms like Regex, SEO, and Wildcards in the miscellaneous category highlights the technical depth of the content, which is useful for targeting users seeking technical SEO knowledge. These entities indicate the core subject matter and could be important for refining keyword strategies.
The entity extraction results provide useful insights into the content’s focus and key elements, such as the use of SEO-related tools (Google, Analytics) and technical terms (Regex, Wildcards). These findings can be leveraged to refine SEO strategies by focusing on these key terms and their relevance to the targeted audience.
Result Analysis: Report on Website Comparison
Introduction:
In this section, the comparison of multiple websites is carried out by extracting entities using a BERT-based Named Entity Recognition (NER) model. The analysis identifies the presence or absence of key entities such as organizations, locations, and miscellaneous terms. The goal is to highlight content gaps between the websites, offering valuable insights into areas where one website may cover terms that another does not.
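The gap detection itself reduces to a per-category set difference between two entity dictionaries. A minimal sketch (function name hypothetical), operating on the dictionaries produced by extract_entities:

```python
def find_content_gaps(own_entities, competitor_entities):
    """Return, per category, competitor entities missing from our own site.

    Both arguments are dicts like {"Organization": [...], "Location": [...]}.
    """
    gaps = {}
    for category, competitor_list in competitor_entities.items():
        own = set(own_entities.get(category, []))
        # Keep competitor entities we do not mention, preserving their order.
        missing = [entity for entity in competitor_list if entity not in own]
        if missing:
            gaps[category] = missing
    return gaps
```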
Detailed Entity Analysis of Websites:
- Comparing https://www.seotechexperts.com/seo-agency-india.html vs https://www.incrementors.com/seo-services/:
Entities for https://www.seotechexperts.com/seo-agency-india.html:
- Person: No entities found
- Organization: Amazon, SeoTechExperts
- Location: No entities found
- Miscellaneous: No entities found
Entities for https://www.incrementors.com/seo-services/:
- Person: No entities found
- Organization: Google
- Location: No entities found
- Miscellaneous: SEO, Ahrefs
Key Insights:
SeoTechExperts’ Branding: The SeoTechExperts website is strongly identified with its own brand, and it also mentions Amazon as an organization. The mention of Amazon may indicate that the website is referencing Amazon in the context of its services or using Amazon as an example for SEO strategies.
Incrementors’ Focus on Major SEO Tools: The Incrementors website mentions Google, a key player in the SEO and digital marketing field. It also references SEO and Ahrefs, which are important tools in the SEO industry.
Content Gaps:
Missing Entities on SeoTechExperts: SeoTechExperts does not mention Google, which could be a significant entity to include in content, especially when discussing SEO practices. This could be a content gap for targeting a broader SEO-related audience.
- Comparing https://www.seotechexperts.com/seo-agency-india.html vs https://www.techwebers.com/seo-services/:
Entities for https://www.seotechexperts.com/seo-agency-india.html:
- Person: No entities found
- Organization: Amazon, SeoTechExperts
- Location: No entities found
- Miscellaneous: No entities found
Entities for https://www.techwebers.com/seo-services/:
- Person: No entities found
- Organization: Tech, Webers, Google
- Location: No entities found
- Miscellaneous: No entities found
Key Insights:
SeoTechExperts’ and TechWebers’ Entity Mentions: Both websites mention their own brands, SeoTechExperts and TechWebers, in the Organization category. However, TechWebers also mentions Google as an important organization, which indicates that TechWebers might be discussing or integrating Google’s services, tools, or practices into their offerings.
Comparison of Branding: Both sites leverage their branding to establish their identity in the SEO field. However, SeoTechExperts could benefit from adding more recognized global entities like Google, as TechWebers does, to improve its relevance in SEO conversations.
Content Gaps:
SeoTechExperts’ Lack of Google Mention: SeoTechExperts could benefit from including Google as an organization, especially if they want to focus more on SEO strategies related to Google search algorithms, rankings, or tools.
- Comparing https://www.incrementors.com/seo-services/ vs https://www.techwebers.com/seo-services/:
Entities for https://www.incrementors.com/seo-services/:
- Person: No entities found
- Organization: Google
- Location: No entities found
- Miscellaneous: SEO, Ahrefs
Entities for https://www.techwebers.com/seo-services/:
- Person: No entities found
- Organization: Tech, Webers, Google
- Location: No entities found
- Miscellaneous: No entities found
Key Insights:
Shared Entity (Google): Both websites mention Google as a key organization, which indicates that both websites likely focus on SEO and digital marketing strategies aligned with Google’s tools and search engine algorithms.
TechWebers’ Additional Entities: TechWebers’ brand name appears among the detected organizations as the tokens Tech and Webers (the model splits the brand name in two), suggesting a stronger brand presence in its copy. This could differentiate its content from Incrementors, whose page does not surface its own brand name as an entity.
Content Gaps:
Content Gap for Incrementors: Incrementors could benefit from mentioning its own brand name more prominently in its content, as TechWebers does, particularly if it wishes to target audiences looking for SEO solutions tied to a specific brand.
What Does This Entity Analysis Mean for Website Owners?
· Importance of Identifying Key Entities: Named entities (such as company names, brands, and tools) are crucial for SEO. Including well-known entities such as Google or Amazon can improve search engine visibility, as they help search engines understand what your page is about and associate it with important industry players.
· Content Gaps: Websites that do not mention important brands or tools may miss opportunities to rank for relevant queries. For example, SeoTechExperts does not mention Google, which is a major player in the SEO field. This is a potential content gap that could be addressed to target more relevant search traffic.
· Improving SEO Visibility: By integrating relevant keywords and branded terms (like SEO, Ahrefs, or Google) into your website’s content, you are better positioning your site to appear in search results when users search for these terms. Additionally, consistently using your brand name and relevant industry terms can help you establish authority and relevance in your field.
Why Are Named Entities Important for SEO?
Named entities help search engines like Google understand the context of your content. When you include recognized organizations (like Google, Amazon, or Ahrefs) in your content, search engines can connect your page with these well-known brands or tools, which improves its relevance for users searching for these terms.
How Can Named Entity Recognition (NER) Help with SEO?
· NER for Content Optimization: By identifying the most relevant named entities in your content, you can optimize it for better search engine rankings. Including recognized names, especially those closely related to your business or service, makes your content more visible to users searching for related terms.
· Addressing Content Gaps: Identifying content gaps, like missing entities, helps you identify areas where your content could be enhanced. For instance, if your competitors are mentioning Google or SEO tools and your website is not, you might want to incorporate these terms to stay competitive.
How accurate is the model in recognizing entities across different industries and domains?
The Named Entity Recognition (NER) model used in this project is pre-trained on a broad set of textual data, allowing it to recognize a wide range of entities such as persons, organizations, locations, and miscellaneous terms. While the model is versatile and can handle diverse types of content, it is optimized for general content such as news and widely used terms. For industries like SEO, it performs well in identifying commonly mentioned entities, but it may not capture very niche or specialized terms specific to highly technical fields. This can occasionally result in missed or imprecise entity identification, particularly for industry-specific jargon or emerging brands.
How can the results from the model be used to improve SEO strategies?
The NER model identifies key entities across various content types, allowing insights into the types of companies, individuals, locations, and industry-specific terms that are most prominent in the content. By comparing these extracted entities across competitor websites, gaps in coverage can be identified, helping to refine SEO strategies. For instance, if the model detects that a competitor mentions a specific tool or organization frequently, but your website does not, this could indicate an opportunity to better target those terms and improve your content relevance. These insights can be used to tailor content, optimize for missing entities, and strengthen SEO performance.
Final Thoughts:
This entity-based analysis highlights the importance of including relevant and recognized named entities in your content. Websites that mention industry leaders like Google and important tools like Ahrefs can improve their search engine rankings by aligning with user search intent. Additionally, by identifying content gaps—such as missing organizations or tools—you can fine-tune your SEO strategy to ensure your website ranks for the most important terms in your industry.
This approach to content optimization can help improve search visibility, attract more targeted traffic, and ultimately contribute to the success of your SEO campaigns.
Thatware | Founder & CEO
Tuhin is recognized across the globe for his vision to revolutionize the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, as well as the India Business Awards and the India Technology Award; was named among the Top 100 influential tech leaders by Analytics Insight and a Clutch Global front runner in digital marketing; founded the fastest-growing company in Asia according to The CEO Magazine; and is a TEDx and BrightonSEO speaker.