F1 Score for NER: A Metric to Evaluate Precision and Recall in Named Entity Recognition Tasks


    This project focuses on implementing a Named Entity Recognition (NER) system and evaluating its performance using the F1 Score, a crucial metric in natural language processing (NLP). Named Entity Recognition is a key task in NLP that involves identifying and classifying proper names into predefined categories such as persons, organizations, locations, and more.

    F1 Score for NER

    The project involves training a transformer-based deep learning model to recognize named entities in text. The dataset used for this project is the Entity Annotated Corpus, a widely used benchmark dataset for NER tasks.

    A critical part of this project is evaluating the model’s performance. Instead of only using accuracy, which may not be the best metric for imbalanced datasets, we use the F1 Score, Precision, and Recall to get a more reliable assessment of how well the model identifies named entities.

    Additionally, this project incorporates techniques to handle class imbalances, such as weighted loss functions, to improve the model’s ability to recognize underrepresented entity types. The implementation is optimized to ensure faster training while maintaining high accuracy.

    Why Use a Custom Dataset for This Project?

    In this project, the goal is to evaluate how well a model can recognize different types of words or phrases in a given text. This evaluation is done using a measurement called F1 Score, which helps determine how accurately the model can detect and classify specific entities such as names, places, organizations, and other categories.

    To calculate the F1 Score correctly, the model needs to be tested on labeled data—data that has already been marked with the correct categories for each word or phrase. Without this labeled data, it would be impossible to check whether the model is making the right predictions.

    A model that is only used to generate predictions (without labeled data for comparison) cannot provide an F1 Score because there is no way to verify if its output is correct. This is why a custom dataset is necessary. The dataset used in this project contains thousands of sentences where each word is already tagged with its correct classification. This allows for a direct comparison between the model’s predictions and the actual correct answers, making it possible to calculate accuracy, precision, recall, and F1 Score.

    In summary, without a labeled dataset, the effectiveness of the model cannot be measured properly. This project requires a custom dataset to ensure an accurate evaluation of how well the model performs in recognizing named entities.

    Understanding the Model Performance Metric: F1 Score

    F1 Score: A Balanced Measure for Model Performance

    Imagine a machine learning model that is trained to detect different types of entities in text, such as company names, locations, and people. The goal of this model is to correctly identify and label words in a sentence while avoiding mistakes. But how do we measure whether the model is doing a good job?

    If the model detects many correct entities but also makes many mistakes, is it a good model? Or if the model detects only a few correct entities but never makes a mistake, is it better?

    This is where the F1 Score comes in—it helps strike a balance between these two extremes by considering two important factors: Precision and Recall.

    What is the F1 Score?

    The F1 Score is a way to measure a model’s overall effectiveness by combining two key metrics: Precision (how accurate the model’s predictions are) and Recall (how many correct answers the model found).

    • If a model is too cautious, it will only make predictions when it is very confident, leading to high Precision but low Recall.
    • If a model is too aggressive, it will make many predictions and find more correct answers, but it will also make more mistakes, leading to high Recall but low Precision.
    • The F1 Score balances both by ensuring the model is making enough correct predictions while minimizing errors.

    The formula for the F1 Score is:

    F1 = (2 × Precision × Recall) / (Precision + Recall)

    To fully understand this, let’s break down Precision and Recall with simple examples.

    Precision: How Many of the Model’s Predictions Were Correct?

    Imagine a spam detection system in an email inbox. The system scans emails and labels some of them as spam.

    If the system labels 10 emails as spam, but only 6 are actually spam, then the Precision tells us how many of the system’s spam predictions were correct.

    Precision = True Positive / (True Positive + False Positive)

    • True Positives (TP): The system correctly marked an email as spam.
    • False Positives (FP): The system mistakenly marked a normal email as spam.

    Example Calculation:

    If out of 10 predicted spam emails, 6 are actually spam and 4 are normal emails incorrectly labeled as spam, then:

    Precision = 6 / (6 + 4) = 6 / 10 = 0.6 (or 60%)

    A model with high Precision rarely makes mistakes, but it might miss some actual spam emails, which leads us to Recall.

    Recall: How Many of the Actual Entities Were Found by the Model?

    Now, let’s say there were 20 actual spam emails in the inbox, but the model only detected 6 of them.

    Recall tells us how many of the actual spam emails were correctly detected by the system.

    Recall = True Positive / (True Positive + False Negative)

    • False Negatives (FN): Emails that were actually spam, but the system failed to detect them.

    Example Calculation:

    If the inbox had 20 spam emails, but the system only found 6, then:

    Recall = 6 / (6 + 14) = 6 / 20 = 0.3 (or 30%)

    A model with high Recall detects most spam emails but might wrongly label many normal emails as spam.

    How F1 Score Balances Precision and Recall

    Now that we understand Precision and Recall, we can see why the F1 Score is important.

    If a model only focuses on Precision, it will make fewer mistakes but miss many correct answers.

    If a model only focuses on Recall, it will find more correct answers but make many mistakes.

    F1 Score ensures both are considered together.

    F1 Score for the spam detection example:

    • Precision = 0.6 (60%)
    • Recall = 0.3 (30%)

    F1 = (2 × 0.6 × 0.3) / (0.6 + 0.3) = (2 × 0.18) / 0.9 = 0.4 (or 40%)

    A higher F1 Score means the model is performing well in both aspects—it is identifying the correct entities while keeping mistakes low.
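    The same arithmetic can be checked in a few lines of Python; the counts below are the ones from the spam example above.

        # Spam-filter example from above: 6 true positives, 4 false positives, 14 false negatives.
        tp, fp, fn = 6, 4, 14

        precision = tp / (tp + fp)                            # 6 / 10 = 0.6
        recall = tp / (tp + fn)                               # 6 / 20 = 0.3
        f1 = 2 * precision * recall / (precision + recall)    # 0.36 / 0.9 = 0.4

        print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}")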

    Why Is the F1 Score Important for This Project?

    In Named Entity Recognition (NER), the goal is to correctly identify entities like names, locations, and organizations while avoiding mistakes.

    If the model is too cautious, it will only label words when it is very confident, leading to high Precision but missing many correct entities (low Recall).

    If the model is too aggressive, it will mark many words as entities, catching more correct answers but also making more mistakes (high Recall but low Precision).

    Since the purpose of this project is to measure how balanced the model’s predictions are, the F1 Score is the best metric to use because it considers both Precision and Recall together.

    This is why this project is focused on calculating the F1 Score to properly evaluate how well the model performs on Named Entity Recognition tasks.

    Libraries Used in the Project

    This project relies on several key Python libraries for data processing, model training, and evaluation.

    pandas – Data Handling and Processing

    What is pandas?

    pandas is a Python library used for handling and analyzing structured data. It allows easy manipulation of large datasets using DataFrames, which are similar to tables in a database or Excel.

    How does it work?

    pandas provides functions to load, clean, and transform data efficiently. It allows filtering rows, selecting columns, and performing operations like sorting and aggregating.

    Why is it used in this project?

    The dataset used in this project is stored in CSV format (ner_dataset.csv), and pandas is used to read this file into a DataFrame.

    It helps in extracting entity labels, counting occurrences of each label, and preparing the dataset for training.
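    A minimal sketch of this loading step (the file name matches the article; the latin-1 encoding is an assumption, since the Kaggle export of this corpus is commonly encoded that way):

        import pandas as pd

        # Load the Entity Annotated Corpus into a DataFrame.
        # encoding="latin1" is an assumption; adjust it if your copy of the file differs.
        df = pd.read_csv("ner_dataset.csv", encoding="latin1")

        print(df.shape)
        print(df["Tag"].value_counts())   # occurrences of each entity label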

    transformers – Pretrained Models for NLP

    What is transformers?

    The transformers library by Hugging Face provides state-of-the-art NLP models, including BERT, ALBERT, T5, and others. These models are pre-trained on large datasets and can be fine-tuned for specific tasks like Named Entity Recognition (NER).

    How does it work?

    It allows loading pre-trained transformer models and fine-tuning them on custom datasets for specific NLP tasks.

    Provides tokenization, model training, and inference functionalities.

    Why is it used in this project?

    The NER model is based on a BERT-based transformer, which is loaded from the transformers library.

    This allows the model to understand contextual meaning and accurately recognize named entities.

    torch – Deep Learning Framework

    What is torch?

    torch (PyTorch) is a deep learning framework used for building and training machine learning models.

    How does it work?

    Provides tools for creating neural networks, handling large datasets, and optimizing models using GPU acceleration.

    Supports automatic differentiation, which helps in training deep learning models.

    Why is it used in this project?

    The transformer-based NER model is trained using PyTorch’s deep learning framework.

    It enables efficient training, backpropagation, and optimization of the model.

    datasets – Efficient Dataset Loading and Management

    What is datasets?

    datasets is a library from Hugging Face that provides efficient tools for loading, processing, and managing large-scale datasets.

    How does it work?

    It enables seamless integration with machine learning models by handling data efficiently, supporting streaming, and allowing automatic tokenization.

    Why is it used in this project?

    The datasets library is used to convert the structured NER dataset into a format suitable for training the model.

    It enables easy batch processing and provides tools to work with labeled text data efficiently.

    evaluate – Model Performance Measurement

    What is evaluate?

    evaluate is a library from Hugging Face used to calculate standard evaluation metrics such as accuracy, precision, recall, and F1 Score for machine learning models.

    How does it work?

    The library provides predefined functions for computing various metrics.

    It ensures consistency in evaluating models across different projects.

    Why is it used in this project?

    Since this project focuses on evaluating F1 Score, Precision, and Recall, the evaluate library is used to compute these metrics.

    It helps in analyzing the model’s performance on the NER dataset.

    seqeval – NER-Specific Evaluation

    What is seqeval?

    seqeval is a specialized evaluation library designed for sequence labeling tasks like Named Entity Recognition (NER).

    How does it work?

    It calculates precision, recall, and F1 Score at the sequence level, ensuring a more accurate evaluation for NER models.

    Unlike general-purpose evaluation tools, seqeval understands the structure of B- (Beginning) and I- (Inside) entity tags.

    Why is it used in this project?

    Standard evaluation metrics may not work well for structured entity recognition.

    seqeval provides a more precise way to measure F1 Score, Precision, and Recall for named entities, making it an essential tool for this project.

    Dataset Details

    The dataset used for this project is the Entity Annotated Corpus sourced from Kaggle: Kaggle – Entity Annotated Corpus

    Dataset Overview and Structure

    The dataset used in this project is the Entity Annotated Corpus, available on Kaggle. This dataset is structured with the following columns:

    • Sentence #: Indicates the sentence number.
    • Word: The actual word in the sentence.
    • POS: Part of speech (not used in this project).
    • Tag: The NER label assigned to the word.

    It consists of manually labeled text data structured in a tabular format. Each row represents a single token (word), and each token is assigned an entity label.

    Dataset Structure: NER Tagging Format

    The dataset follows the BIO tagging scheme, which is commonly used for NER tasks:

    • B-XXX (Beginning) – marks the first token of a named entity of type XXX (e.g., “B-PER” for the first word of a person’s name).
    • I-XXX (Inside) – marks subsequent tokens of the same entity (e.g., “I-PER” for the second and later words of a person’s name).
    • O (Outside) – marks tokens that do not belong to any named entity.

    Entity Categories in the Dataset

    The dataset includes the following named entity categories:

    • Person (PER) – Names of individuals (e.g., “Elon Musk”).
    • Organization (ORG) – Companies, institutions, government agencies (e.g., “Tesla Inc.”).
    • Location (LOC) – Physical locations, including cities, countries, and landmarks (e.g., “New York”).
    • Geopolitical Entity (GPE) – Countries, states, or government-defined areas (e.g., “United States”).
    • Time (TIM) – Specific dates or time references (e.g., “January 2024”).
    • Events (EVE) – Names of historical or scheduled events (e.g., “World War II”).
    • Art (ART) – Titles of books, movies, or artworks (e.g., “Mona Lisa”).
    • Natural Phenomena (NAT) – Natural elements or disasters (e.g., “Hurricane Katrina”).

    These categories form the basis for training the NER model.

    The dataset contains 887,908 tokens (words) and has 17 distinct entity labels (including “B-” and “I-” variants for each category except “O”).

    Explanation:

    • The “Sentence #” column in the dataset is missing values for words that belong to the same sentence.
    • .fillna(method="ffill") fills these missing values by copying the previous sentence number down the column.
    • This ensures that every word is correctly assigned to a sentence.

    Purpose:

    ·         Named Entity Recognition (NER) models require complete sentence structures to understand context.

    ·         Forward-filling ensures that words are grouped into proper sentences, making data processing and training accurate.

    After this step, df.head() displays the first 5 rows to confirm the changes.
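    A minimal sketch of the forward-fill step described above (newer pandas versions prefer .ffill() over fillna(method="ffill"), but both behave the same way here):

        # Copy the previous sentence number down into the empty cells.
        df["Sentence #"] = df["Sentence #"].fillna(method="ffill")   # or: df["Sentence #"].ffill()

        df.head()   # confirm that every row now carries a sentence number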

    Explanation:

    • Removes any rows where the “Word” or “Tag” columns have missing values (NaN).
    • Ensures that every word in the dataset has a corresponding entity tag.

    Purpose:

    • Missing values can cause issues during model training, leading to inaccurate results.
    • Ensuring a clean dataset improves the reliability of Named Entity Recognition (NER) predictions.
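    The cleaning step just described can be sketched in one line, continuing from the earlier snippet:

        # Drop rows where either the word or its entity tag is missing.
        df = df.dropna(subset=["Word", "Tag"])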

    Explanation:

    • Groups the dataset by “Sentence #” so that all words and their corresponding entity tags are combined into lists for each sentence.

    Purpose:

    • The dataset is structured in a word-by-word format, where each row represents a single word.
    • To train a model effectively, sentences must be treated as complete units rather than isolated words.
    • This transformation helps in preparing the data for tokenization and further processing.
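    A minimal sketch of this grouping step, again assuming the df from the earlier snippets:

        # Collect every word and tag of a sentence into one row of lists.
        grouped = df.groupby("Sentence #").agg({"Word": list, "Tag": list})

        grouped.head()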

    Explanation:

    This code extracts the words and corresponding named entity labels from the grouped dataset. Since the data was previously grouped by sentence, sentences now contains lists of words for each sentence, and labels contains the associated entity tags.

    Purpose:

    By structuring the data in this way, it becomes easier to process for model training, ensuring that each sentence and its corresponding labels remain properly aligned.
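    In code, the extraction is just two list conversions over the grouped frame (the variable names are illustrative):

        sentences = grouped["Word"].tolist()   # list of word lists, one per sentence
        labels = grouped["Tag"].tolist()       # list of tag lists, aligned with the words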

    Explanation:

    This code splits the dataset into training and test sets using an 80-20 split. The train_test_split function randomly selects 80% of the data for training and reserves 20% for testing. The random_state=42 ensures reproducibility by making sure the split remains the same each time the code runs. After splitting, the data is converted into Dataset objects, which is a structured format optimized for working with machine learning models.

    Purpose:

    Dividing the data ensures that the model is trained on one portion while being evaluated on another, preventing overfitting and allowing for a fair assessment of its performance. Converting the data into Dataset objects makes it compatible with the Hugging Face framework, streamlining further processing.
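    A sketch of the split and conversion, assuming the sentences and labels lists from above; the column names "tokens" and "ner_tags" are illustrative choices, not names taken from the original code:

        from sklearn.model_selection import train_test_split
        from datasets import Dataset

        train_sents, test_sents, train_labels, test_labels = train_test_split(
            sentences, labels, test_size=0.2, random_state=42
        )

        train_data = Dataset.from_dict({"tokens": train_sents, "ner_tags": train_labels})
        test_data = Dataset.from_dict({"tokens": test_sents, "ner_tags": test_labels})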

    Explanation:

    This code creates a mapping between named entity labels and numerical IDs. First, all unique labels from the dataset are extracted and stored in unique_labels. Then, two dictionaries are created:

    • label2id assigns a unique integer ID to each label.
    • id2label performs the reverse mapping, converting numerical IDs back to their corresponding labels.

    Purpose:

    Since machine learning models work with numerical data, this mapping allows the text-based entity labels to be converted into numerical form for training and prediction. The reverse mapping (id2label) ensures that predicted numerical outputs can be translated back into meaningful entity labels.
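    A minimal sketch of the two mappings, built from the labels list used earlier:

        # Every distinct tag that appears anywhere in the label lists.
        unique_labels = sorted({tag for tag_list in labels for tag in tag_list})

        label2id = {label: i for i, label in enumerate(unique_labels)}
        id2label = {i: label for label, i in label2id.items()}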

    Explanation:

    The purpose of this line is to automatically load the appropriate preprocessing method for the selected model — in this case, a BERT-based model for Named Entity Recognition (NER). The AutoProcessor class is part of the Hugging Face Transformers library and serves as a unified interface for all model-related preprocessing tasks.

    The term “processor” here refers to a built-in tool from the Hugging Face Transformers library that automatically prepares text (or other data types) in a format that the model understands. It acts like a smart helper that knows exactly how to convert raw input into model-ready form — including splitting sentences into tokens (tokenization), converting them to numbers (input IDs), and adding other necessary information like attention masks.

    AutoProcessor is a general-purpose preprocessing tool. Depending on the type of model being used, it can automatically act as:

    • A tokenizer (for text-based models like BERT),
    • An image processor (for vision models),
    • Or a combination of both (for models that use both text and images).
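    Since bert-base-cased is a text-only model, the preprocessing object is in practice a tokenizer. A minimal sketch using AutoTokenizer, the text-specific counterpart of the AutoProcessor interface described above:

        from transformers import AutoTokenizer

        # A fast tokenizer is needed later for word_ids() during label alignment.
        tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")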

    Sentence Length Analysis

    Before tokenizing the dataset, it’s essential to analyze sentence lengths to determine an appropriate maximum sequence length. This step helps ensure that most sentences fit within the model’s input size while minimizing truncation or excessive padding.

    • The longest sentence in the dataset contains 104 words, while the shortest has just 1 word.
    • The average sentence length is approximately 21.88 words, meaning most sentences fall around this range.
    • The most frequently occurring sentence length (mode) is 20 words, suggesting that a sequence length around this value is common.
    • The standard deviation is 7.96, indicating some variation in sentence lengths.

    Considering these statistics, the chosen maximum sequence length of 32 is a balanced choice. It ensures that most sentences fit within the limit while reducing unnecessary truncation or padding.
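    A short sketch of how these statistics can be computed from the sentences list used earlier:

        import statistics

        lengths = [len(sentence) for sentence in sentences]

        print("longest:", max(lengths))                         # 104 in this dataset
        print("shortest:", min(lengths))                        # 1
        print("mean:", round(statistics.mean(lengths), 2))      # ~21.88
        print("mode:", statistics.mode(lengths))                # 20
        print("std dev:", round(statistics.stdev(lengths), 2))  # ~7.96

        MAX_LENGTH = 32   # chosen maximum sequence length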

    Explanation:

    This function takes a batch of sentences and their corresponding entity labels and performs the following steps:

    • Tokenization using BERT
      • Converts words into subword tokens using the BERT tokenizer (tokenizer).
      • Ensures a uniform sequence length using padding (max_length=32) and truncation for long sentences.
      • is_split_into_words=True ensures that the tokenizer treats the input as a list of words instead of a single string.
    • Aligning Entity Labels with Tokenized Words
      • The tokenizer may split words into multiple subwords, so labels need to be realigned accordingly.
      • word_ids(batch_index=i) retrieves the original word index for each token.
      • Special tokens (e.g., [CLS], [SEP]) and padding tokens are assigned a label of -100 so they are ignored during training.
    • Ensuring Labels Match Tokenized Sequences
      • If a sentence is shorter than max_length=32, its labels are padded with -100 to maintain consistency.
      • Labels are converted from text (B-PER, I-LOC, etc.) to numerical IDs using label2id.

    Finally, the tokenized inputs are returned, including the aligned labels.

    Purpose:

    • This function prepares the data for Named Entity Recognition (NER) by ensuring that each tokenized word retains the correct entity label.
    • Proper alignment of labels is critical since BERT tokenizes words into subwords, requiring adjustments in the label mappings.
    • The tokenized dataset is then passed to train_data.map() and test_data.map(), applying this function to all training and test sentences.
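    A sketch of such a tokenization-and-alignment function, written against the illustrative "tokens"/"ner_tags" column names used in the earlier snippets; one common convention, assumed here, is to label only the first subword of each word and mask the rest with -100:

        def tokenize_and_align_labels(batch):
            tokenized = tokenizer(
                batch["tokens"],
                is_split_into_words=True,   # input is already a list of words
                truncation=True,
                padding="max_length",
                max_length=32,
            )

            all_labels = []
            for i, word_labels in enumerate(batch["ner_tags"]):
                word_ids = tokenized.word_ids(batch_index=i)
                previous_word_idx = None
                label_ids = []
                for word_idx in word_ids:
                    if word_idx is None:
                        label_ids.append(-100)                           # [CLS], [SEP], padding
                    elif word_idx != previous_word_idx:
                        label_ids.append(label2id[word_labels[word_idx]])
                    else:
                        label_ids.append(-100)                           # later subwords of the same word
                    previous_word_idx = word_idx
                all_labels.append(label_ids)

            tokenized["labels"] = all_labels
            return tokenized

        tokenized_train = train_data.map(tokenize_and_align_labels, batched=True)
        tokenized_test = test_data.map(tokenize_and_align_labels, batched=True)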

    Explanation: Loading the Pretrained Model

    bert-base-cased is a pre-trained model from the BERT family developed by Google.

    “Base” means it has 12 layers and around 110 million parameters.

    “Cased” means it distinguishes between uppercase and lowercase letters, so “Apple” and “apple” are treated differently — important in tasks like NER where names and proper nouns matter.

    AutoModelForTokenClassification is a special Hugging Face class that sets up the model for token-level classification, which is required for Named Entity Recognition (NER). This allows the model to label each word (token) with a specific category like PER (person), ORG (organization), etc.

    from_pretrained() loads a model that has already learned the structure of the English language using massive public datasets. This saves time and resources because the model doesn’t need to be trained from scratch.

    num_labels=len(unique_labels) configures the model’s final output layer to match the number of entity classes in the dataset. Without this, the model wouldn’t know how many different types of entities it needs to predict.

    By using bert-base-cased with this setup, the project benefits from the deep language understanding of BERT, while tailoring it to the specific entity recognition task through fine-tuning.
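    A minimal sketch of the model setup described above (passing id2label and label2id is optional but keeps predictions human-readable). The class-weighted variant that later sections call weighted_model builds on a model like this one; since the weighting itself is not shown in the article, it is not reproduced here.

        from transformers import AutoModelForTokenClassification

        model = AutoModelForTokenClassification.from_pretrained(
            "bert-base-cased",
            num_labels=len(unique_labels),   # one output per entity tag
            id2label=id2label,
            label2id=label2id,
        )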

    Explanation: Computing F1 Score for NER

    The compute_metrics function evaluates model performance using precision, recall, and F1 Score, which are essential for NER tasks. It:

    • Extracts logits and true labels from the predictions.
    • Converts logits into predicted class indices and maps them to entity labels.
    • Uses seqeval to compute the F1 Score, focusing on complete entity sequences.

    Purpose

    • Ensures NER performance is measured meaningfully by evaluating entity sequences.
    • Aligns with the project goal of F1-score computation.
    • Helps track training improvements and refine the model.
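    A sketch of such a metric function using the evaluate/seqeval pairing described above; it follows the standard Hugging Face token-classification pattern and assumes the id2label mapping from earlier:

        import numpy as np
        import evaluate

        seqeval = evaluate.load("seqeval")

        def compute_metrics(p):
            logits, label_ids = p
            predictions = np.argmax(logits, axis=2)

            # Drop the -100 positions (special tokens / padding) and map IDs back to tag strings.
            true_predictions = [
                [id2label[pred] for pred, lab in zip(pred_row, lab_row) if lab != -100]
                for pred_row, lab_row in zip(predictions, label_ids)
            ]
            true_labels = [
                [id2label[lab] for pred, lab in zip(pred_row, lab_row) if lab != -100]
                for pred_row, lab_row in zip(predictions, label_ids)
            ]

            results = seqeval.compute(predictions=true_predictions, references=true_labels)
            return {
                "precision": results["overall_precision"],
                "recall": results["overall_recall"],
                "f1": results["overall_f1"],
                "accuracy": results["overall_accuracy"],
            }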

    Training Arguments

    The TrainingArguments define how the model is trained, evaluated, and optimized. These parameters control batch sizes, learning rate, evaluation strategy, and model checkpointing. The goal is to balance efficient training with robust evaluation to ensure the best performance for Named Entity Recognition (NER).

    Key Parameters

    • output_dir="./results": Saves trained models and checkpoints.
    • evaluation_strategy="epoch": Evaluates the model after each epoch.
    • save_strategy="epoch": Saves the model only at the end of each epoch to reduce storage use.
    • save_total_limit=2: Keeps only the two most recent model checkpoints.
    • per_device_train_batch_size=16 / per_device_eval_batch_size=16: Defines the batch size for training and evaluation.
    • dataloader_num_workers=2: Uses multiple CPU workers for faster data loading.
    • num_train_epochs=5: Trains for five epochs to ensure the model learns effectively.
    • weight_decay=0.01: Applies weight decay to prevent overfitting.
    • learning_rate=5e-5: Sets the learning rate for stable training.
    • logging_steps=100: Logs training updates every 100 steps.
    • load_best_model_at_end=True: Automatically loads the best-performing model after training.
    • metric_for_best_model="eval_loss": Uses validation loss to determine the best model.
    • greater_is_better=False: Since lower loss is better, this ensures the correct model is selected.
    • fp16=True: Enables mixed-precision training, reducing memory usage and speeding up computations.
    • gradient_accumulation_steps=2: Accumulates gradients for two steps before updating weights, effectively simulating a larger batch size.
    • logging_dir="./logs": Stores logs for tracking training progress.
    • report_to="none": Disables external logging services to keep training local.

    These settings optimize training for efficiency and stability while ensuring the best-performing model is selected. The output directory stores trained models, which may be useful for future reuse. However, since the primary goal is evaluating F1-score, saving all checkpoints is unnecessary—hence, only the latest two are retained.
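    Putting the listed parameters together, the configuration looks roughly like this:

        from transformers import TrainingArguments

        training_args = TrainingArguments(
            output_dir="./results",
            evaluation_strategy="epoch",
            save_strategy="epoch",
            save_total_limit=2,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            dataloader_num_workers=2,
            num_train_epochs=5,
            weight_decay=0.01,
            learning_rate=5e-5,
            logging_steps=100,
            load_best_model_at_end=True,
            metric_for_best_model="eval_loss",
            greater_is_better=False,
            fp16=True,                       # requires a CUDA GPU
            gradient_accumulation_steps=2,
            logging_dir="./logs",
            report_to="none",
        )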

    Data Collator for Token Classification

    The DataCollatorForTokenClassification is used to efficiently batch-process the tokenized input data before passing it to the model.

    This collator:

    • Handles padding dynamically: Ensures that all sequences in a batch have the same length by padding shorter ones, reducing unnecessary padding across all batches.
    • Aligns labels with tokenized inputs: Since word tokenization can split words into multiple tokens, this collator ensures that labels align correctly with the tokenized format.

    Purpose

    Using a data collator simplifies batch processing, making training more memory-efficient and ensuring proper alignment between input tokens and their corresponding labels. This improves model stability and performance.
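    A minimal sketch, reusing the tokenizer from earlier:

        from transformers import DataCollatorForTokenClassification

        data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)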

    Preparing Data for Training

    To train the model efficiently, the dataset needs to be batched and prepared for processing. A DataLoader is used to load the training data in small groups (batches) rather than processing the entire dataset at once. This makes training faster and helps utilize system memory efficiently.

    Additionally, the data is shuffled before training, which ensures that the model does not learn patterns based on the order of data points. The collate_fn parameter ensures that variable-length sentences are properly padded so they can be processed together.

    By setting up the DataLoader, the total number of training steps is also calculated. This helps in adjusting the learning process and scheduling optimizations during training.
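    A sketch of this step under the same assumptions as the earlier snippets; the raw text columns are dropped so the collator only sees model inputs:

        from torch.utils.data import DataLoader

        # Keep only the fields the model consumes (column names from the earlier sketches).
        train_for_loader = tokenized_train.remove_columns(["tokens", "ner_tags"])

        train_loader = DataLoader(
            train_for_loader,
            batch_size=16,
            shuffle=True,              # avoid order-dependent learning
            collate_fn=data_collator,  # pads each batch and keeps labels aligned
        )

        # Approximate total training steps (gradient accumulation halves the actual optimizer updates).
        num_training_steps = len(train_loader) * int(training_args.num_train_epochs)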

    Optimizer for Model Training

    An optimizer is a crucial component in training machine learning models, as it adjusts the model’s parameters to minimize errors. The AdamW optimizer is used here, which is an improved version of the Adam optimizer that includes weight decay to reduce overfitting.

    It updates the model’s parameters (weighted_model.parameters()) to improve performance.

    The learning rate (lr=5e-5) controls how much the model changes at each step.

    The weight decay (0.01) prevents excessive reliance on specific features, helping generalization.

    This optimizer ensures stable and efficient learning during training.
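    A minimal sketch (weighted_model is the class-weighted model the article refers to; its definition is not shown here):

        from torch.optim import AdamW

        optimizer = AdamW(
            weighted_model.parameters(),
            lr=5e-5,            # learning rate
            weight_decay=0.01,  # regularization against overfitting
        )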

    Learning Rate Scheduler

    A learning rate scheduler helps the model train more effectively by gradually reducing the learning rate over time. In this case, a linear scheduler is used, meaning the learning rate will steadily decrease as training progresses. The scheduler is applied to the optimizer to control how fast the model updates its weights.

    The learning rate controls how much the model’s parameters change during each training step. A higher learning rate makes the model learn faster but can lead to instability, while a lower learning rate ensures more stable learning but may take longer to achieve good results. By using a scheduler, the learning rate gradually decreases over time, allowing the model to make finer adjustments as training progresses. This helps improve accuracy and prevents the model from making drastic changes in later stages of training.

    At the start of training, the model uses the full learning rate. As the training continues, the scheduler gradually lowers it until it reaches zero. This prevents sudden large updates that could destabilize training and helps the model learn more smoothly.
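    A sketch of the linear scheduler; zero warmup steps is an assumption, not a value stated in the article:

        from transformers import get_scheduler

        lr_scheduler = get_scheduler(
            "linear",
            optimizer=optimizer,
            num_warmup_steps=0,                     # assumed; no warmup mentioned in the article
            num_training_steps=num_training_steps,  # computed from the DataLoader above
        )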

    Early Stopping and Its Role in Training

    Early stopping is a technique that helps prevent unnecessary training when the model stops improving. It monitors the evaluation loss, and if there is no significant improvement for a set number of evaluations, training stops automatically. This prevents overfitting and saves time.

    • The patience value determines how many evaluations to wait before stopping.
    • A small improvement threshold ensures the model doesn’t stop too soon if there are only minor fluctuations in performance.

    By implementing early stopping, training is more efficient, and the best-performing model is selected without over-training.
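    A sketch using the Trainer's built-in callback; the patience and threshold values are illustrative, not taken from the original code:

        from transformers import EarlyStoppingCallback

        early_stopping = EarlyStoppingCallback(
            early_stopping_patience=3,       # evaluations to wait without improvement (assumed)
            early_stopping_threshold=0.001,  # minimum change that counts as improvement (assumed)
        )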

    Setting Up the Trainer

    The Trainer is responsible for managing the entire training process, including training, evaluation, and optimization. It ensures the model learns effectively while tracking performance.

    • It utilizes the weighted model with adjusted class weights.
    • Training arguments define how training is executed, including batch sizes and learning rate.
    • The dataset is split into training and evaluation sets, ensuring the model learns from labeled data.
    • The tokenizer processes text, converting words into numerical input for the model.
    • The data collator ensures input batches are formatted correctly.
    • A function is included to compute evaluation metrics, aligning with the project’s goal of measuring F1 Score.
    • Optimizers and learning rate schedulers help fine-tune weight adjustments for better learning.
    • Early stopping is included as a callback to halt training if no improvement is observed.

    With this setup, the model is trained efficiently while tracking its performance at every stage.
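    A sketch that wires together the pieces from the previous sections, under the same assumptions:

        from transformers import Trainer

        trainer = Trainer(
            model=weighted_model,                  # class-weighted model described above
            args=training_args,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_test,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics,       # seqeval-based precision, recall, F1
            optimizers=(optimizer, lr_scheduler),
            callbacks=[early_stopping],
        )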

    Training the Model

    Calling trainer.train() initiates the training process, where the model learns from the dataset through multiple iterations (epochs). During training:

    • The input text is tokenized and fed into the model.
    • The model makes predictions and compares them to actual labels.
    • The loss function calculates the difference between predicted and true values.
    • The optimizer adjusts model weights to minimize errors.
    • Performance metrics, including F1 Score, are calculated to track improvements.

    The process continues for the specified number of epochs, ensuring the model refines its ability to recognize named entities.

    The evaluate() function:

    • Runs the trained model on the validation dataset (which the model hasn’t seen during training).
    • Collects predictions for each token in the dataset.
    • Compares these predictions with the true labels.
    • Calculates performance metrics such as:
      • Loss (how far the predictions are from the actual labels),
      • Precision (how accurate the predictions are),
      • Recall (how many correct predictions were found out of all relevant ones),
      • F1 Score (a balance between precision and recall),
      • Accuracy (the percentage of correctly predicted tokens).

    The trainer.evaluate() method is a key step that evaluates how well the model performs on validation data using multiple quality metrics — with special focus on the F1 Score, which is the main evaluation metric in this project.

    This helps in understanding the strengths and weaknesses of the model across different entity types and in making informed decisions about model improvements.
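    In sketch form, the two calls described above (with the compute_metrics sketched earlier, the evaluation dictionary carries the F1 Score alongside the other metrics):

        trainer.train()               # fine-tune for up to 5 epochs (early stopping may end sooner)

        metrics = trainer.evaluate()  # run on the held-out split
        print(metrics)                # eval_loss, eval_precision, eval_recall, eval_f1, eval_accuracy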

    Output Results Analysis

    Once the training was completed using a Named Entity Recognition (NER) model based on a pre-trained transformer, the model was evaluated using a validation dataset. The evaluation was performed using trainer.evaluate(), which returns a set of metrics measuring the quality of predictions.

    The key metric used is the F1 Score, a widely accepted measurement that balances two other metrics: Precision (how many predicted entities were correct) and Recall (how many actual entities were found). The F1 Score is crucial when both false positives and false negatives matter, especially in real-world applications like SEO content processing.

    Overall Evaluation Metrics

    • Precision (82.86%): This means that out of all the words the model predicted as named entities, about 83% were actually correct. This is a strong sign that the model is careful and doesn’t make many false positives.
    • Recall (83.53%): This indicates that out of all the actual named entities in the dataset, the model was able to correctly find about 84% of them. This shows the model is able to detect a good portion of the relevant entities.
    • F1 Score (83.19%): The F1 score balances Precision and Recall. A high value here suggests that the model is both accurate and consistent. This is a key metric for evaluating performance in tasks like NER.
    • Accuracy (96.18%): This is the percentage of all tokens (including non-entity tokens) that the model labeled correctly. While this number is high, it is less useful than F1 Score for NER, since most tokens are non-entities.

    Entity-wise Performance Breakdown

    • ART (Artistic Works)

    F1 Score: 0.1592

    Performance on this entity type is low, mostly because there are very few examples in the dataset. Artistic titles often vary in format and context, making them difficult to detect accurately.

    • EVE (Events)

    F1 Score: 0.2302

    Events are also underrepresented in the training data. The model struggles with correctly identifying event names, which may be confused with general nouns or titles.

    • GEO (Geographical Locations)

    F1 Score: 0.8781

    Very strong performance. Locations are typically written in standard, recognizable formats (like city and country names), making them easier for the model to detect.

    • GPE (Geopolitical Entities)

    F1 Score: 0.9486

    Excellent performance, likely because of high support in the data. Countries, states, and political regions follow clear naming conventions that help the model generalize well.

    • NAT (Natural Phenomena/Things)

    F1 Score: 0.3288

    Poor performance, due to limited data and more ambiguous terms. Natural entities like rivers, mountains, or biological items can overlap with other entity types, causing confusion.

    • ORG (Organizations)

    F1 Score: 0.7376

    A strong result, but not as high as GPE or GEO. Organization names often contain generic words (e.g., “Company,” “Group”) which can also appear in non-entity contexts.

    • PER (Person Names)

    F1 Score: 0.7910

    Good performance here. Person names tend to follow consistent patterns and are well-covered in the training data, resulting in high recall and precision.

    • TIM (Time Expressions)

    F1 Score: 0.8488

    Another strong category, as dates and times often follow fixed patterns (like “January 2023” or “10 AM”) that are easy for the model to learn.

    How does this model help in SEO?

    This NER model is especially useful in the following SEO contexts:

    • Internal Linking – Tagging people, places, and brands enables automated linking between related content.

    • Content Gap Analysis – Identifies missing or underused entities compared to competitor content.

    • Featured Snippet Optimization – Helps structure content around identified entities, increasing the chances of being shown in snippets.

    • Semantic SEO – Helps search engines understand content context more deeply through labeled entities.

    • Entity-Based Clustering – Assists in grouping content based on related names, locations, or organizations for topical authority.

    Why are some categories like ART and NAT performing poorly?

    These categories had very few labeled samples in the dataset. Deep learning models rely heavily on large amounts of data to detect patterns. With limited examples of ART (books, paintings, and other artistic works) or NAT (natural phenomena), the model didn’t get enough exposure to learn their characteristics well.

    Final Thoughts

    The model has shown very strong performance in recognizing commonly found named entities like locations, organizations, and people — all of which are highly relevant in SEO. While a few entity types need improvement, the overall results are accurate, consistent, and practical for real-world SEO applications.

    This project demonstrates how NER models can be applied beyond academic use — specifically in optimizing and analyzing web content for better visibility and structure in search engines.


    Tuhin Banik

    Thatware | Founder & CEO

    Tuhin is recognized across the globe for his vision of revolutionizing the digital transformation industry with the help of cutting-edge technology. He won bronze for India at the Stevie Awards USA, along with the India Business Awards and the India Technology Award; he has been named among the Top 100 influential tech leaders by Analytics Insights and a Clutch Global Frontrunner in digital marketing; his company was recognized by The CEO Magazine as one of the fastest-growing companies in Asia; and he is a TEDx and BrightonSEO speaker.

