This project delivers a fully functional open-domain question answering (ODQA) system specifically designed for SEO applications. The system is capable of retrieving contextually relevant information from multiple URLs across diverse domains and generating SEO-focused answers tailored to user-defined queries. By combining a high-performance semantic retriever with a carefully engineered generative answer engine, the solution addresses the need for accurate, contextually grounded responses rather than generic or surface-level information. This ensures that all answers are relevant, actionable, and directly tied to real-world SEO content.

The ODQA framework is built to handle batch processing of multiple client URLs. It automatically identifies the most pertinent content blocks, computes semantic similarity between these blocks and the user’s query, and generates clear, factual answers. Each generated response is fully traceable, with source URLs and confidence scores provided to maintain transparency and support verification. This traceability is critical in SEO applications, where decisions often require supporting evidence from reliable web sources.
This solution excels at answering both strategic and operational SEO questions. For technical queries, such as those related to canonical tags, HTTP headers, or schema markup, it extracts precise guidance directly from live web content. For analytical queries, including performance tracking or metric analysis, it consolidates insights from dashboards, case studies, and reporting tools. For strategic SEO questions, such as assessing tool-based success or competitive positioning, the system can synthesize information from multiple guides, tutorials, and industry resources. Its design ensures consistent performance across webpages with varied structures and domain focuses, making it invaluable for digital marketers, SEO teams, and technical stakeholders.
Project Purpose
The primary objective of this project is to develop a question-answering system capable of retrieving and generating accurate, SEO-relevant answers from a broad spectrum of client webpages, regardless of content type or structural complexity. Unlike traditional QA systems, which typically rely on static knowledge bases or narrowly scoped documents, this ODQA system operates in a true open-domain environment. Its sources include technical documentation, SEO guides, metric dashboards, tool-specific tutorials, and other web-based resources spanning multiple URLs.
In an SEO-driven context, answering broad, open-ended questions such as “What are the key metrics for SEO success?” or “How can non-HTML content be optimized?” demands more than simple keyword matching. The system achieves this by combining semantic passage retrieval with natural language generation (NLG). Semantic retrieval ensures that the most contextually relevant content is identified, while NLG generates coherent, concise, and precise answers. Each answer is directly tied to its source, enabling users to validate recommendations and maintain accountability in their decision-making processes.
Key Benefits
The ODQA system provides multiple advantages for SEO practitioners and clients:
- Consolidated Insights: The system aggregates information from multiple pages and sources into a single, actionable answer. This reduces the time and effort required to sift through vast amounts of content manually.
- Data-Driven Decision Making: By providing answers based on real web content, the system supports strategic and technical SEO decisions, from site audits to tool evaluation.
- Enhanced Efficiency: Users can save significant time and effort. Manual content review, research, and interpretation are minimized, freeing SEO teams to focus on strategy and execution.
- Support for Diverse Use Cases: Whether for content audits, competitive research, or automated SEO Q&A tools, the system adapts seamlessly to varied operational needs.
By aligning natural language understanding with practical SEO objectives, this ODQA solution establishes a robust foundation for answering high-value client queries across diverse content domains. It bridges the gap between unstructured web data and actionable insights, enabling organizations to make informed, evidence-backed SEO decisions with speed and accuracy.
In essence, the project transforms scattered web content into a centralized, intelligent SEO knowledge source, making it an indispensable tool for modern digital marketing teams. Its ability to process multiple URLs, provide traceable answers, and generate contextually precise responses ensures that SEO professionals can act confidently and strategically in an ever-evolving digital landscape.
Project’s Key Topics Explanation and Understanding
The project title — “Open-Domain Question Answering (ODQA): Retrieves answers from a broad array of domains with open-ended question handling” — encapsulates three fundamental capabilities of the system:
Open-Domain Question Answering (ODQA)
Open-Domain Question Answering (ODQA) is a paradigm in which a system can answer questions without relying on a fixed or pre-selected set of documents. Instead, it operates across large, unstructured, and diverse content sources that vary in domain, style, structure, and context, enabling dynamic, real-world knowledge retrieval.
In this project:
The system is designed to retrieve and answer questions using passages extracted from multiple client-provided webpages. Each page may represent a different SEO subdomain, such as technical SEO, content performance metrics, or tool usage. The open-domain setup allows the system to respond to the same question by aggregating information from multiple sources, without limiting the answers to a single topic or page.
Retrieves Answers from a Broad Array of Domains
This approach demonstrates the system’s semantic flexibility and retrieval breadth. Here, “domains” encompass both technical subjects—like analytics, HTTP headers, or content optimization—and diverse webpage origins. The system can extract and consolidate answers from multiple URLs, each with different editorial styles, layouts, and thematic focus. Key capabilities include:
- Semantic retriever: Identifies relevant passages across all provided pages, even when the phrasing varies significantly.
- Unified answer generator: Synthesizes content from multiple sources into a coherent, single response.
- Transparent source attribution: Ensures clients can trace answers back to specific domains or URLs.
This methodology guarantees comprehensive coverage of questions that cannot be answered by a single page or narrowly defined topic, ensuring a holistic and actionable output.
Open-Ended Question Handling
ODQA excels at handling open-ended questions, which require interpretation, summarization, and contextual understanding rather than a single factual retrieval. These include:
- Strategic queries: For example, “What defines SEO success for multimedia content?”
- Tactical queries: Such as, “How should canonical headers be configured for PDFs or images?”
- Insight-driven queries: Like, “How can tools be used to improve website performance?”
The system employs a generative language model to synthesize complete, context-aware answers from the retrieved passages. This avoids generic or template-based responses, providing replies that are concise, clear, and firmly grounded in real client content.
Q&A: Understanding the Project Value and Importance
What problem does this Open-Domain Question Answering (ODQA) system solve for SEO-focused businesses?
SEO teams often juggle extensive documentation, strategy guides, performance dashboards, technical implementation notes, and tool instructions. Locating precise answers to questions such as “How can PDFs be optimized for indexing?” or “Which metrics best indicate SEO success?” usually demands manually scanning multiple pages, consuming valuable time. This ODQA system addresses that challenge by enabling teams to pose open-ended, high-value questions and receive accurate, concise answers directly derived from their own content. By automating retrieval and summarization, it minimizes the effort spent on searching and interpreting information while ensuring responses are rooted in the organization’s actual resources—not generic or external advice. This approach streamlines workflows, enhances decision-making, and empowers SEO teams to act confidently based on reliable, context-specific insights, ultimately improving efficiency and reducing the risk of misinformation.
What kind of questions can this system handle?
Unlike traditional FAQ bots or basic on-site search tools that depend heavily on keyword matching, this ODQA system is designed to address broad, nuanced, and strategy-driven questions that demand deep contextual understanding. It is capable of interpreting intent rather than just words, making it suitable for high-level SEO and digital strategy discussions.
Examples of questions it can confidently handle include:
- “How should different URL structures be managed for effective SEO?”
- “Which performance metrics are most critical for SEO success?”
- “How can SEO tools be leveraged to track and improve ranking performance?”
Instead of returning fragmented results, the system analyzes the query, identifies semantically relevant content across multiple pages, and delivers a cohesive, human-like response. Each answer is tailored specifically to the client’s SEO ecosystem, ensuring relevance, clarity, and practical usability.
How does this project benefit website owners practically?
This project offers tangible, day-to-day value for website owners and SEO teams by transforming how information is accessed and applied. Key benefits include:
- Faster insights: Teams can obtain direct, actionable answers without manually navigating multiple pages or documents.
- Centralized intelligence: The system consolidates knowledge from various content sources, including strategy articles, tool documentation, and performance analysis posts.
- Improved decision-making: Outputs support informed SEO planning, campaign evaluations, and internal training initiatives using consistent and verifiable insights.
- Contextual accuracy: All answers are generated strictly from the website’s own content, preserving domain relevance, brand voice, and topical authority.
- Reduced content redundancy: By highlighting where answers already exist, website owners can avoid publishing overlapping or repetitive content, improving overall content efficiency.
How is this different from a search function or keyword-based FAQ engine?
Conventional search tools typically return links or isolated snippets, leaving users to interpret and connect the information themselves. In contrast, this ODQA system is built to synthesize complete answers. It:
- Understands the semantic intent behind each query.
- Uses an intelligent retriever model to locate contextually aligned passages, even when phrasing varies.
- Applies a generative language model to combine, summarize, and structure information into a comprehensive response.
- Displays source URLs along with relevance scores, ensuring transparency, credibility, and traceability.
By transforming dispersed content into clear, strategic answers, the system effectively bridges the gap between raw information and decision-ready insights—making it ideal for advanced SEO and strategic applications, not just basic lookup tasks.
Libraries Used
Requests
The requests library is one of the most widely adopted HTTP client libraries in Python, valued for its simplicity, readability, and reliability. It enables developers to send HTTP/1.1 requests with minimal code while supporting critical web interaction features such as GET and POST methods, custom headers, cookies, authentication, timeouts, redirects, and persistent sessions. Its intuitive API abstracts away low-level networking complexity, making it ideal for large-scale web data collection.
In this project, requests serves as the primary interface for retrieving raw HTML content from target webpage URLs. It ensures stable communication with external servers while handling exceptions such as connection failures, invalid responses, or timeout errors gracefully. Additionally, response validation and content-type verification help confirm that the fetched data is usable HTML rather than binary or malformed content. This reliability makes requests a foundational layer in the data ingestion pipeline, enabling consistent extraction across diverse domains and web structures.
bs4 (BeautifulSoup, Comment)
BeautifulSoup is a powerful Python library designed for parsing HTML and XML documents into a navigable, tree-like structure. It supports multiple parsing engines, including lxml, which allows for fast and fault-tolerant document traversal even when markup is poorly formed. The Comment class provides additional control by allowing identification and removal of HTML comments that do not contribute to visible or meaningful content.
Within this project, BeautifulSoup plays a critical role in transforming raw HTML into structured, semantically relevant data. It is used to selectively extract visible and meaningful elements such as paragraphs, headings, list items, and other textual nodes. At the same time, it filters out non-informative components including <script>, <style>, navigation bars, embedded widgets, and hidden elements. This focused parsing approach ensures that only high-value textual content is retained, which is essential for accurate semantic analysis and downstream language model processing.
Hashlib
The hashlib module is part of Python’s standard library and provides access to secure hashing algorithms such as MD5, SHA-1, and SHA-256. These algorithms generate fixed-length hash values that uniquely represent input data, making them useful for integrity checks, deduplication, and data validation.
In this project, hashlib is applied after text normalization to generate hash digests for individual content blocks. By comparing these hashes, the system can efficiently detect and remove duplicate or repeated text segments within the same webpage. This process improves content quality by preventing redundant information from entering the retrieval pipeline, ensuring that only unique and meaningful blocks contribute to semantic matching and generation.
Re
The re module provides full support for Perl-style regular expressions in Python, enabling complex pattern-based searching, matching, and substitution operations. It is particularly effective for cleaning, transforming, and standardizing unstructured text.
Here, regular expressions are heavily utilized during the preprocessing stage to remove unwanted artifacts such as boilerplate phrases, excessive whitespace, URLs, HTML remnants, bullet symbols, numbering patterns, and encoded noise. This systematic cleanup produces more uniform and readable text, which is critical for both embedding accuracy and language model comprehension. Well-preprocessed text significantly enhances the quality of semantic retrieval and reduces noise during inference.
Html
The html module offers utilities for escaping and unescaping HTML entities, such as converting encoded representations like &amp;amp;, &amp;lt;, or &amp;quot; back into their original characters (&, <, and ").
In this project, html.unescape() is used to transform encoded characters into their human-readable form before further processing. This ensures that the text presented to embedding models and generative systems reflects natural language rather than markup artifacts. Proper decoding improves semantic clarity, leading to better vector representations and more coherent generated responses.
Unicodedata
The unicodedata module enables consistent handling of Unicode characters across different languages and encoding standards. It supports normalization, classification, and comparison of Unicode text, which is essential when working with heterogeneous web data.
This project applies unicodedata.normalize("NFKC", text) to standardize all extracted content into a consistent Unicode format. Normalization reduces discrepancies caused by visually similar characters, mixed encodings, or special glyphs. This step minimizes tokenization errors and prevents character-level inconsistencies that could otherwise degrade embedding quality or model performance.
Torch
torch is the core library of PyTorch, a leading deep learning framework widely used for building, training, and deploying neural networks. It provides efficient tensor operations, automatic differentiation, GPU acceleration, and flexible model execution.
In this project, torch manages device allocation between CPU and GPU, processes tokenized inputs as tensors, and executes inference through the FLAN-T5 generative model. Its optimized backend ensures that generation tasks are both scalable and performant, whether running locally or in cloud-based environments. This allows the system to handle complex queries efficiently while maintaining responsiveness.
Transformers.utils.logging
This utility module from the Hugging Face Transformers ecosystem provides control over logging verbosity during model loading and execution. It allows developers to suppress warnings, progress bars, and informational messages.
Here, logging controls are applied to minimize unnecessary output in the Colab notebook environment. By suppressing verbose logs, the project maintains a clean and professional presentation, making outputs easier to review and interpret—especially for client-facing demonstrations or reports.
sentence_transformers.SentenceTransformer
SentenceTransformer is a high-level abstraction built on top of Transformer architectures, optimized for generating dense sentence embeddings. It enables efficient encoding of sentences and paragraphs into numerical vectors that capture semantic meaning rather than surface-level similarity.
In this system, SentenceTransformer is used to encode both user queries and extracted text blocks into vector embeddings. These embeddings power the semantic retrieval layer, allowing the system to identify relevant content even when the phrasing of the query differs significantly from the source material. This capability is central to delivering accurate, context-aware responses.
Numpy
numpy is the foundational library for numerical computing in Python, offering fast array operations, broadcasting, and matrix computations optimized in C.
Within the retrieval pipeline, numpy is used to store and manage embedding matrices, perform efficient indexing operations, and prepare data structures compatible with similarity search engines such as FAISS. Its performance and flexibility make it essential for handling high-dimensional embedding data at scale.
faiss
faiss (Facebook AI Similarity Search) is a high-performance library for similarity search and clustering of dense vectors. It is particularly useful for large-scale retrieval tasks involving millions of embeddings.
In this project, faiss is used to create an in-memory index of passage embeddings, enabling fast top-K retrieval for any given query. This vector search system dramatically improves the efficiency of selecting the most relevant contexts for answer generation.
transformers.T5ForConditionalGeneration & T5Tokenizer
These components from Hugging Face Transformers load the FLAN-T5 model for conditional sequence generation and handle tokenization of prompts. The tokenizer breaks text into tokens understood by the model, while the model generates output sequences based on context and instructions.
The project uses these tools to implement the final generative QA system. Given a query and a set of retrieved text blocks, the tokenizer prepares the input prompt, and the model produces the final, SEO-relevant answer in natural language.
IPython.display (display, HTML)
IPython.display provides tools to control how outputs are rendered in notebooks. display() and HTML() are used to inject custom formatting, layout, and styling into the output cells.
This is used to present the final answer and supporting sources in a client-friendly, visually clear format—particularly useful for professional reports and deliverables created in notebook environments.
Function: extract_content_blocks
Purpose Summary:
This function is responsible for retrieving, parsing, and extracting meaningful textual content from a given webpage URL. It removes all non-relevant or decorative HTML elements (like scripts, footers, and ads) and returns a list of cleaned, non-duplicate content blocks. These blocks serve as the core content units for later semantic retrieval and answer generation steps.
Key Logic and Explanation
· response = requests.get(url, headers=headers, timeout=timeout) The function initiates an HTTP request to fetch the HTML content of the page using a standard desktop user-agent header to avoid bot blocking.
· if "text/html" not in content_type.lower(): Ensures that only HTML pages are processed. Non-HTML documents like PDFs or images are ignored early to save resources.
· soup = BeautifulSoup(page_content, "lxml") Parses the fetched HTML using the lxml parser for fast and reliable DOM traversal.
· for tag in soup([…]): tag.decompose() Unwanted elements (scripts, navigation bars, styles, forms, etc.) are stripped from the document to isolate the core readable content.
· for el in soup.find_all(tag): Iterates through selected tags (paragraphs, list items, quotes, and headings) to extract human-readable sections of the page.
· if len(text.split()) < min_word_count: Filters out very short or trivial blocks, ensuring only meaningful content is retained for downstream processing.
· ascii_ratio = sum(ord(c) < 128 for c in text) / max(len(text), 1) Skips blocks with low ASCII content, which often indicate garbled text or language misencoding.
· digest = hash(norm_text) and if digest in seen_hashes: Deduplicates blocks by hashing the normalized, lowercased text, ensuring that duplicate content fragments are not repeated in the result.
Returns: A list of clean, readable text blocks extracted from the HTML body of the page, suitable for semantic encoding and context retrieval.
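The bullets above map onto a compact implementation. Below is a minimal, illustrative sketch of the function; the default thresholds (min_word_count, the 0.8 ASCII ratio) and the exact tag lists are assumptions, not the project's verbatim settings:

```python
import requests
from bs4 import BeautifulSoup

def extract_content_blocks(url, min_word_count=5, timeout=10):
    """Fetch a page and return cleaned, de-duplicated text blocks."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()

    # Process HTML pages only; skip PDFs, images, and other binary payloads early.
    content_type = response.headers.get("Content-Type", "")
    if "text/html" not in content_type.lower():
        return []

    soup = BeautifulSoup(response.text, "lxml")

    # Strip structural and decorative elements that carry no readable content.
    for tag in soup(["script", "style", "nav", "header", "footer", "form", "aside"]):
        tag.decompose()

    blocks, seen_hashes = [], set()
    for el in soup.find_all(["p", "li", "blockquote", "h1", "h2", "h3", "h4"]):
        text = el.get_text(" ", strip=True)
        if len(text.split()) < min_word_count:
            continue  # drop trivial fragments

        # Skip blocks dominated by non-ASCII characters (often mis-encoded text).
        ascii_ratio = sum(ord(c) < 128 for c in text) / max(len(text), 1)
        if ascii_ratio < 0.8:
            continue

        # De-duplicate on a hash of the normalized (lowercased) block.
        digest = hash(text.lower())
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        blocks.append(text)
    return blocks
```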
Function: preprocess_blocks
Purpose Summary:
This function is designed to sanitize and normalize raw text blocks extracted from webpages. It eliminates boilerplate phrases, URLs, special formatting artifacts, and other noise elements. The goal is to produce clean, standardized text that is optimal for embedding in vector space and for input to the answer generation model.
Key Logic and Explanation
· boilerplate = re.compile(…) Defines a regex pattern to detect and remove standard web phrases like “click here”, “privacy policy”, or “subscribe”. These elements are not useful in SEO-focused semantic retrieval or generation.
· url_pattern = re.compile(r'https?://\S+|www\.\S+') Matches and removes embedded links or URLs from the content, which can cause noise in embeddings and degrade generation quality.
· bullet_pattern, numbered_pattern, roman_pattern These patterns clean list formatting such as bullets, numeric steps, or Roman numeral headings that are common in web content outlines but are not meaningful for machine learning models.
· substitutions = { … } Maps special Unicode characters like curly quotes or long dashes to their plain ASCII equivalents. This normalization improves model consistency and reduces variability in text embeddings.
· def clean(text: str) -> str: A nested function that applies all the above transformations to a given text block: unescaping HTML, applying regex filters, and stripping extra whitespace. This encapsulation allows reusable, orderly text cleaning.
· if len(cleaned.split()) >= min_word_count: After cleaning, only retain blocks that have enough word count to provide meaningful semantic content. This prevents overly short or trivial sentences from influencing the retrieval process.
Returns: A flat list of cleaned text blocks, each a meaningful unit of content ready for use in sentence embedding and retrieval tasks. The result is optimized to feed directly into embedding models like SentenceTransformer or for inclusion in generator prompts.
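A condensed sketch of this cleaning pipeline is shown below. The specific regex patterns and the substitution table are representative examples of the categories described above, not the project's exact definitions:

```python
import html
import re
import unicodedata

def preprocess_blocks(blocks, min_word_count=5):
    """Normalize and filter raw text blocks for embedding and generation."""
    boilerplate = re.compile(
        r"\b(click here|privacy policy|subscribe|cookie policy)\b", re.IGNORECASE
    )
    url_pattern = re.compile(r"https?://\S+|www\.\S+")
    bullet_pattern = re.compile(r"^[\-\*\u2022]+\s*")   # leading bullet symbols
    numbered_pattern = re.compile(r"^\d+[\.\)]\s*")     # "1." / "2)" style steps

    # Map common typographic characters to plain ASCII equivalents.
    substitutions = {"\u2018": "'", "\u2019": "'", "\u201c": '"',
                     "\u201d": '"', "\u2013": "-", "\u2014": "-"}

    def clean(text: str) -> str:
        text = html.unescape(text)                  # decode HTML entities
        text = unicodedata.normalize("NFKC", text)  # unify Unicode forms
        for src, dst in substitutions.items():
            text = text.replace(src, dst)
        text = url_pattern.sub("", text)
        text = boilerplate.sub("", text)
        text = bullet_pattern.sub("", text)
        text = numbered_pattern.sub("", text)
        return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

    # Retain only blocks that still carry enough words after cleaning.
    return [c for c in (clean(b) for b in blocks)
            if len(c.split()) >= min_word_count]
```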
Function: load_retriever_model
Purpose Summary:
This function loads a pre-trained sentence embedding model from the SentenceTransformer library. The loaded model is used to convert both the user query and content blocks into dense vector representations suitable for semantic similarity comparison. It acts as the backbone for the retrieval mechanism in the Open-Domain QA pipeline.
Key Logic and Explanation
- return SentenceTransformer(model_name) Initializes and returns the model instance using the specified architecture. By default, it loads “all-mpnet-base-v2”, a highly effective general-purpose model trained for semantic search tasks. Internally, this loads the model weights, tokenizer, and configuration files from Hugging Face’s model hub. This model is used later to encode text into dense vectors for similarity-based retrieval using FAISS or direct dot-product ranking.
Returns: A SentenceTransformer model object capable of encoding any input text (queries or content blocks) into numerical embeddings. This model serves as the encoder component in the retrieval pipeline and is reused throughout the project for both index building and query encoding.
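Since the behavior is fully described above, the function itself reduces to a few lines (sketch):

```python
from sentence_transformers import SentenceTransformer

def load_retriever_model(model_name: str = "all-mpnet-base-v2") -> SentenceTransformer:
    """Load the sentence-embedding model shared by passages and queries."""
    return SentenceTransformer(model_name)

retriever = load_retriever_model()  # downloads weights from the Hub on first call
```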
Model Overview: all-mpnet-base-v2 for Semantic Retrieval
Model Summary:
The all-mpnet-base-v2 model is a powerful, pre-trained transformer developed by SentenceTransformers. It belongs to the MPNet (Masked and Permuted Pre-training for Language Understanding) family, which is a successor to BERT and RoBERTa, designed to enhance both contextual understanding and semantic matching. This model is specifically fine-tuned for semantic textual similarity and information retrieval tasks, making it a highly suitable choice for Open-Domain Question Answering (ODQA) systems.
Technical Details and Suitability
- Architecture: MPNet-based transformer with 110 million parameters. MPNet improves on BERT by integrating permutation-based training with masked language modeling, capturing both local and global dependencies better.
- Embedding Output: It generates 768-dimensional sentence embeddings. These embeddings are dense vector representations of the semantic meaning of the input text.
- Input Token Limit: Supports sequences up to 512 tokens, allowing relatively long passages or complex queries to be embedded accurately.
- Performance: Compared to smaller models like MiniLM or older DistilBERT-based models, all-mpnet-base-v2 consistently outperforms them in real-world semantic search tasks.
- Inference Efficiency: Despite its larger size, it remains reasonably fast for inference, especially on GPU-backed environments like Colab, making it suitable for both development and production-scale batch processing.
Why This Model Was Selected
The core task in ODQA is to identify and rank the most relevant passages from a large set of unstructured content that might come from any domain or topic. This requires a model that can deeply understand both the user’s question and the content text, then map them to a shared embedding space where similar meaning corresponds to high cosine similarity.
The all-mpnet-base-v2 model is designed precisely for this:
- It has been trained using Multiple Negatives Ranking Loss, which enables it to excel in capturing fine-grained semantic similarity between a question and its possible answers.
- It achieves state-of-the-art performance on benchmarks like MS MARCO, the STS Benchmark, and BEIR, all of which evaluate sentence-level understanding and retrieval performance.
In the context of this project, where the goal is to accurately match user queries to meaningful blocks of SEO-related content across multiple domains and web pages, all-mpnet-base-v2 ensures a reliable and generalizable retrieval quality.
Role in This Project
In this Open-Domain QA system, all-mpnet-base-v2 is used to:
- Encode the user query into a semantic vector.
- Encode all cleaned and preprocessed content blocks from multiple URLs into vectors.
- Compare these vectors using a FAISS index (or, optionally, direct cosine similarity) to identify and retrieve the top-K most semantically similar passages to serve as context for the final answer generation model.
By accurately narrowing down large content into highly relevant snippets, this model ensures that the downstream answer generator works with only the most contextually aligned inputs—directly enhancing the precision and usefulness of the final answers produced.
Function: encode_passages
Purpose Summary:
This function transforms a list of text passages into dense vector embeddings using a SentenceTransformer model. These embeddings capture the semantic meaning of the texts and are suitable for high-performance retrieval tasks using similarity search. Optionally, it normalizes the vectors for use with cosine similarity in FAISS.
Key Logic and Explanation
· embeddings = model.encode(passages, batch_size=32, show_progress_bar=False) Performs batch encoding of all input text blocks into numerical embeddings using the provided transformer model. The batch size of 32 ensures efficiency while maintaining memory stability.
· if normalize: faiss.normalize_L2(embeddings) If the normalize flag is enabled, it normalizes the embeddings to unit vectors using L2 norm. This is essential when performing cosine similarity search via inner product in FAISS, ensuring consistent scoring.
Returns: A NumPy array of normalized or raw dense vectors, each corresponding to the semantic representation of a single text passage. These embeddings are ready to be used for indexing or matching against user queries in the ODQA pipeline.
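A minimal sketch consistent with the description above (the float32 cast is an added safeguard, since FAISS requires float32 input):

```python
import faiss
import numpy as np

def encode_passages(model, passages, normalize: bool = True) -> np.ndarray:
    """Encode passages into a dense matrix, optionally L2-normalized for cosine search."""
    embeddings = model.encode(passages, batch_size=32, show_progress_bar=False)
    embeddings = np.asarray(embeddings, dtype="float32")  # FAISS expects float32
    if normalize:
        faiss.normalize_L2(embeddings)  # unit vectors: inner product == cosine
    return embeddings
```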
Function: build_faiss_index
Purpose Summary:
This function initializes and constructs a FAISS index using dense vector embeddings for high-performance similarity search. It enables fast and scalable retrieval of semantically similar content passages in open-domain question answering tasks.
Key Logic and Explanation
· dim = embeddings.shape[1] Extracts the dimensionality of the embedding vectors, which is required to initialize the FAISS index structure properly.
· index = faiss.IndexFlatIP(dim) Creates a flat (non-clustered) FAISS index that uses inner product similarity for nearest neighbor search. This structure is efficient for moderate-scale datasets and works well when paired with L2-normalized vectors.
· index.add(embeddings) Populates the index with the given list of embedding vectors, allowing future similarity-based retrievals using a user query embedding.
· return index Returns the fully constructed and populated FAISS index object, ready to be queried with vector representations of questions.
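The function follows directly from the steps above (sketch):

```python
import faiss
import numpy as np

def build_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Build an exact inner-product index over the passage embeddings."""
    dim = embeddings.shape[1]        # 768 for all-mpnet-base-v2
    index = faiss.IndexFlatIP(dim)   # flat index: exhaustive, exact search
    index.add(embeddings)            # register all passage vectors
    return index
```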
Function: encode_query
Purpose Summary:
This function transforms the user’s input question into a dense vector embedding using the same SentenceTransformer model used for content passages. The resulting embedding enables direct comparison with pre-indexed content via semantic similarity search.
Key Logic and Explanation
· embedding = model.encode([query]) Encodes the input query into a single dense vector using the SentenceTransformer model. The query must be wrapped in a list to maintain consistent batching format with the model’s interface.
· if normalize: faiss.normalize_L2(embedding) Optionally normalizes the embedding to unit length using L2 norm. This normalization is critical when using FAISS with inner product similarity, as it converts the similarity metric into cosine similarity, improving semantic alignment.
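As a sketch, mirroring the passage-encoding path so query and passage vectors live in the same embedding space:

```python
import faiss
import numpy as np

def encode_query(model, query: str, normalize: bool = True) -> np.ndarray:
    """Encode one query into a (1, dim) float32 vector for FAISS search."""
    embedding = model.encode([query])               # list input keeps batch shape
    embedding = np.asarray(embedding, dtype="float32")
    if normalize:
        faiss.normalize_L2(embedding)               # match passage normalization
    return embedding
```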
Function: retrieve_top_k
Purpose Summary:
This function performs a semantic similarity search to retrieve the top-k most relevant passages for a given query embedding from a FAISS index. It returns the top-ranked results along with their relevance scores.
Key Logic and Explanation
· D, I = index.search(query_vector, top_k) Performs the vector similarity search using FAISS. D contains the similarity scores, and I holds the indices of the most similar vectors (i.e., blocks) in the index.
· return […] Constructs and returns a list of dictionaries, each pairing a retrieved passage’s text with its similarity score and source URL.
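A sketch of the function, assuming blocks and sources are parallel lists recording each passage's text and originating URL:

```python
def retrieve_top_k(index, query_vector, blocks, sources, top_k: int = 5):
    """Return the top-k passages with similarity scores and source URLs."""
    D, I = index.search(query_vector, top_k)  # scores and row indices, shape (1, k)
    return [
        {"text": blocks[i], "score": float(score), "source": sources[i]}
        for score, i in zip(D[0], I[0])
        if i != -1  # FAISS pads with -1 when fewer than top_k vectors exist
    ]
```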
Function: load_generation_model
Purpose Summary:
This function loads the pretrained FLAN-T5 model and its tokenizer for use in the answer generation phase. It automatically assigns the model to available hardware (GPU or CPU) using device_map="auto" for seamless compatibility and performance.
Key Logic and Explanation
· tokenizer = T5Tokenizer.from_pretrained(model_name) Loads the tokenizer associated with the specified T5 model. This tokenizer is responsible for converting textual prompts into input token IDs that the model can process.
· model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto") Loads the actual FLAN-T5 model architecture fine-tuned for conditional text generation. The device_map="auto" parameter intelligently distributes model layers across available devices, ensuring optimal memory utilization and speed—especially useful in environments like Google Colab with GPU support.
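Sketch (the Hub identifier google/flan-t5-large corresponds to the FLAN-T5-Large model discussed below; device_map="auto" additionally requires the accelerate package):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

def load_generation_model(model_name: str = "google/flan-t5-large"):
    """Load the FLAN-T5 tokenizer and model, spreading layers across available devices."""
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
    return tokenizer, model
```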
Model Overview: FLAN-T5-Large for Controlled Answer Generation
Model Summary:
FLAN-T5-Large is an instruction-tuned generative language model developed by Google, built on top of the original T5 (Text-to-Text Transfer Transformer) architecture. The “FLAN” (Fine-tuned LAnguage Net) series is fine-tuned on a diverse set of instruction-following tasks, allowing it to better understand and execute natural-language prompts. This specific model has approximately 770 million parameters, providing a strong balance between performance and resource usage, particularly suitable for real-time, multi-domain applications such as Open-Domain Question Answering (ODQA).
Technical Details and Suitability
- Architecture: Based on the T5 encoder-decoder transformer, trained to convert any NLP task into a text-to-text format. FLAN adds an additional layer of instruction tuning across diverse datasets.
- Model Size: 770M parameters — significantly more expressive than base variants like flan-t5-base, while still lightweight enough for Colab GPUs.
- Input Limit: Can handle up to 512 tokens of input, making it ideal for multi-passage QA with moderate-length context.
- Output Control: Supports generation tuning via parameters such as max_new_tokens, min_new_tokens, and beam or sampling strategies. This allows fine-grained control over the response length, specificity, and diversity.
- Instruction Tuning: Fine-tuned on a wide variety of prompt-based tasks (QA, summarization, reasoning, etc.), it demonstrates strong generalization to unseen prompts, especially when explicit constraints are provided.
Why This Model Was Selected
In this ODQA project, the final answer must be generated from passages that were extracted from unstructured SEO-related pages across various domains. It is essential that the generation model:
- Follows prompt instructions closely.
- Remains grounded in provided content.
- Generates human-readable, practical answers tailored to client expectations.
FLAN-T5-Large was chosen because of its robust instruction-following capabilities and its ability to generate contextually accurate, concise, and factually grounded outputs. Unlike vanilla T5 models or smaller variants, FLAN-T5-Large reliably interprets nuanced prompt guidelines and avoids hallucinations when supplied with sufficient, well-structured input context.
Role in This Project
In this SEO-focused Open-Domain QA system, FLAN-T5-Large plays a critical role in transforming retrieved context into a professional-grade, client-ready answer. Specifically, it is responsible for:
- Taking a natural language question and the top-K retrieved passages as input.
- Producing a concise and actionable SEO answer aligned with both the query and supporting content.
- Following structured prompt guidelines to avoid verbosity, generic SEO filler, or unsupported statements.
By leveraging instruction tuning and flexible generation parameters, this model enables scalable, high-quality answer synthesis across multiple SEO domains—ensuring each response is both semantically relevant and commercially valuable.
Function: format_prompt
Purpose Summary:
This function constructs a structured and instruction-tuned prompt to guide the answer generation model. It ensures the generated output is focused, context-aware, and tailored for SEO-specific QA tasks.
Key Logic and Explanation
· instructions = (…) Defines the instructional block at the beginning of the prompt. This outlines how the generator model should behave—such as avoiding generic SEO advice, staying grounded in the given context, and answering all aspects of the query in a well-developed form. These constraints are essential to keep the response accurate and aligned with the intended goal of factual content generation.
· if allow_fallback: … If allow_fallback is set to True, the instruction allows the model to draw from general SEO expertise only if the context is insufficient. This provides a controlled flexibility to the model in edge cases where relevant context is missing or incomplete.
· context_block = “\n”.join([…]) Converts the list of retrieved passages into bullet-style lines to be presented under the Context: section. This formatting helps the model distinguish separate evidence points clearly and supports better comprehension of the retrieved content.
· return f"{instructions}Question: {question.strip()}…\n\nAnswer:" Combines the instructional block, the user’s question, and the formatted context list into a complete prompt. This prompt is fed into the FLAN-T5 model to generate a high-quality SEO answer that is directly supported by the retrieved text.
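A sketch of the prompt builder; the instruction wording is a paraphrase of the constraints described above, and contexts is assumed to be the list of dictionaries produced by retrieve_top_k:

```python
def format_prompt(question: str, contexts, allow_fallback: bool = False) -> str:
    """Assemble instructions, retrieved context, and the question into one prompt."""
    instructions = (
        "Answer the question using only the context provided below. "
        "Avoid generic SEO advice and unsupported claims, and address "
        "every part of the question in a well-developed answer.\n"
    )
    if allow_fallback:
        instructions += ("If the context is insufficient, you may draw on "
                         "general SEO knowledge.\n")

    # One bullet per retrieved passage, so evidence points stay distinct.
    context_block = "\n".join(f"- {c['text']}" for c in contexts)

    return (f"{instructions}\nQuestion: {question.strip()}\n\n"
            f"Context:\n{context_block}\n\nAnswer:")
```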
Function: generate_answer
Purpose Summary:
This function uses a prompt and a pretrained language model to generate a complete SEO-focused answer from retrieved context blocks and a user query. It wraps together prompt formatting, tokenization, controlled generation, and decoding in a single callable interface.
Key Logic and Explanation
- prompt = format_prompt(…) The prompt is created using a dedicated formatting function that takes the user’s question and relevant context passages. The allow_fallback flag optionally lets the generator rely on general SEO knowledge if the context is insufficient. This ensures flexibility while keeping answers grounded in real content.
- inputs = tokenizer(…).to(model.device) The input prompt is tokenized and transferred directly to the model’s computation device (CPU or GPU). This ensures the model and inputs are on the same device for efficient inference. The use of .to(model.device) ensures compatibility in any runtime setting.
- output = model.generate(…) This is the core text generation step. Key parameters used:
- max_new_tokens and min_new_tokens restrict the length of the output to stay within meaningful and balanced bounds.
- do_sample=True enables non-deterministic generation, allowing for varied and more natural outputs.
- temperature and top_p work together to balance randomness and focus in token selection.
- repetition_penalty discourages repetitive phrases, improving readability and coherence.
- return tokenizer.decode(…) The generated token sequence is decoded back into a human-readable string. Special tokens are skipped to keep the output clean. The result is a finalized answer that adheres to both the input context and the prompt’s structural guidance.
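Putting the pieces together, a sketch of the generation wrapper; the specific sampling values (temperature, top_p, repetition_penalty) are illustrative defaults, not the project's tuned settings:

```python
def generate_answer(question, contexts, tokenizer, model,
                    allow_fallback=False, max_new_tokens=256, min_new_tokens=64):
    """Format the prompt, run FLAN-T5 generation, and decode the answer."""
    prompt = format_prompt(question, contexts, allow_fallback=allow_fallback)

    # Tokenize and move tensors to the model's device (CPU or GPU).
    inputs = tokenizer(prompt, return_tensors="pt",
                       truncation=True, max_length=512).to(model.device)

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # upper bound on answer length
        min_new_tokens=min_new_tokens,  # avoid underdeveloped one-liners
        do_sample=True,                 # sampled rather than greedy decoding
        temperature=0.7,                # moderate randomness
        top_p=0.9,                      # nucleus sampling keeps output focused
        repetition_penalty=1.2,         # discourage repeated phrases
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```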
Function: display_qa_result
This function presents the final output of the QA system—question, answer, and supporting sources—in a clean, stacked, and client-readable format. It wraps the answer in HTML with line wrapping for improved visibility in notebook environments, while the question and source URLs with relevance scores are printed using standard formatting. This structured display ensures clarity for end users or clients reviewing generated insights.
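A sketch of the display helper; the HTML styling shown here is illustrative:

```python
from IPython.display import HTML, display

def display_qa_result(question, answer, results):
    """Print the question and sources; render the answer as wrapped HTML."""
    print(f"Question: {question}\n")
    display(HTML(
        "<div style='max-width:800px; word-wrap:break-word;'>"
        f"<b>Answer:</b> {answer}</div>"
    ))
    print("\nSources:")
    for r in results:
        print(f"  {r['source']}  (relevance: {r['score']:.4f})")
```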
Result Analysis and Explanation
This section presents a focused evaluation of how the system responded to a real-world SEO query using live web content. The analysis assesses both the quality of the generated answer and the effectiveness of the underlying retrieval process that guided the generation.
Input Question Overview
The query submitted for evaluation was:
“How to handle different document URLs”
This is a typical open-ended SEO query that reflects a practical concern related to managing duplicate content, canonicalization, and search engine indexing for non-HTML resources (e.g., PDFs, images, videos). The goal of the system is to generate an informative, actionable, and precise answer based on relevant website content.
Retrieved Content Analysis
The system identified content blocks from the page:
One of the top retrieved blocks (Relevance Score: 0.5006) stated:
“Ensure that the canonical URL is the most relevant and high-quality version of the file.”
This passage was ranked among the top due to its alignment with both the query topic (“different document URLs”) and its relevance to canonicalization, a key SEO technique. The FAISS-based retrieval successfully prioritized context blocks that contained detailed and actionable information about HTTP header usage—critical for SEO on non-HTML content.
Generated Answer Evaluation
The answer generated by the system was:
“If your server runs on Nginx, modifying the nginx.conf file will allow you to specify canonical headers for different file types. Unlike regular web pages, these file types do not have an HTML <head> section where canonical tags are usually placed. Instead, HTTP headers allow webmasters to specify the preferred version of a file, preventing duplicate content issues and improving SEO rankings.”
This output mirrors the retrieved content accurately, demonstrating that:
- The generation step faithfully used the provided context without introducing unsupported or generic SEO statements.
- The content was presented in a clean, client-understandable structure, directly answering the original query.
- Technical details like “nginx.conf” and “HTTP headers” were preserved in the answer, enhancing its credibility and practical utility.
Business and SEO Relevance
The answer reflects a high-value application for clients dealing with non-HTML content. Key takeaways for clients include:
- Canonical handling via HTTP headers is a reliable SEO strategy for non-HTML resources.
- The generated explanation gives practical server-side guidance (Nginx configuration), enabling implementation without requiring generic SEO templates.
- The system has the potential to serve as a self-guided knowledge assistant for in-house SEO teams, offering precise, reference-backed answers that improve decision-making and reduce the need for manual document scanning.
Final Observations
- The retrieval relevance score (~0.50) indicates a moderate contextual match, which may improve with multi-document support or batch-level processing.
- The generator’s performance is strong when the input block is semantically aligned and contains technical depth.
- The system demonstrates reliability in preserving technical terminology and producing human-readable, context-grounded answers.
Result Analysis and Explanation: Multi-Source Queries
This section provides a detailed evaluation of the system’s effectiveness in addressing open-ended, multi-dimensional SEO queries by leveraging content aggregated from multiple web sources. The evaluation focused on the system’s response to a complex question that spanned non-HTML SEO techniques, performance tracking metrics, and the application of SEO tools—three distinct yet interconnected sub-domains that frequently arise in real-world search engine optimization workflows. By testing the system against such a layered query, the analysis aims to measure both retrieval accuracy and response coherence under realistic operational conditions.
The results demonstrate that the system can successfully interpret broad intent, decompose the query into its constituent themes, and synthesize insights drawn from distributed content sources. This capability is essential for modern SEO environments, where actionable knowledge is rarely confined to a single page or document.
Multi-Page Contextual Retrieval
The retrieval framework is engineered to scan, identify, and rank the most contextually relevant information blocks from a diverse pool of webpages. During testing, the system consistently surfaced passages that closely aligned with the underlying intent of the query, even when relevant information was scattered across multiple pages or embedded within different content formats. This behavior confirms the system’s ability to:
- Navigate and evaluate content across numerous webpages simultaneously.
- Prioritize semantically meaningful sections rather than relying on surface-level keyword matches.
- Sustain contextual relevance when a single query blends multiple SEO subtopics.
By aggregating and ranking content from varied sources, the system ensures that the downstream answer generation phase receives a refined and contextually rich input set. This multi-page contextual awareness is critical for generating comprehensive answers, especially when addressing complex SEO questions that demand cross-domain understanding.
Answer Generation Behavior
The generated responses exhibit a strong alignment with the intent and scope of the original query. Rather than delivering generalized or loosely related explanations, the system produces outputs that demonstrate topic awareness, depth, and practical relevance. Common characteristics of the generated answers include:
- The use of domain-specific terminology and actionable SEO insights.
- Balanced coverage of multiple themes when the query spans more than one subject area.
- Clear structure and logical flow, supporting ease of understanding and usability.
Equally important is the system’s ability to respect informational boundaries. The model avoids introducing speculative details, unsupported claims, or external assumptions not present in the retrieved content. This restraint is particularly valuable in SEO contexts, where accuracy, source fidelity, and consistency with existing documentation are critical for client trust and professional decision-making.
Understanding Relevance Scores and Source URLs
To maintain grounding and transparency, the system incorporates a relevance scoring mechanism that evaluates extracted content based on its semantic alignment with the user’s query. Each generated response is accompanied by a list of supporting source URLs, each assigned a relevance score that reflects contextual proximity rather than factual correctness.
The relevance score ranges from 0.0 to 1.0 and functions as a confidence indicator within the retrieval process. In practical applications, scores above 0.60 typically indicate strong semantic alignment with the query. Scores between 0.40 and 0.60 may still provide valuable supporting context, particularly for complex or multi-faceted questions. Content with scores below 0.30 is generally excluded unless no higher-quality contextual matches are available.
It is important to emphasize that these scores do not directly measure accuracy or authority. Instead, they represent how closely a passage aligns with the query within the semantic embedding space. The inclusion of source URLs further enhances transparency by allowing users to trace each answer back to its original webpage, enabling verification, deeper exploration, or citation when needed. While relevance scores guide retrieval, the synthesized answer remains the primary vehicle for delivering SEO insight.
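As a worked example of how these bands could be applied in practice, the helper below filters retrieved results using the thresholds described above (the function name and structure are illustrative, not part of the pipeline itself):

```python
def filter_by_relevance(results, strong=0.60, supporting=0.40, floor=0.30):
    """Keep strong and supporting matches; fall back to low scores only if needed."""
    strong_matches = [r for r in results if r["score"] >= strong]
    supporting_matches = [r for r in results if supporting <= r["score"] < strong]
    if strong_matches or supporting_matches:
        return strong_matches + supporting_matches
    # Below 0.30 the content is generally excluded; between 0.30 and 0.40 it is
    # retained only when no higher-quality context is available.
    return [r for r in results if r["score"] >= floor]
```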
Value to Website Owners and Practical Reliability
From the perspective of website owners and digital professionals, the results highlight the system’s core value proposition:
- Domain-Spanning Synthesis: The ability to merge insights from technical SEO, analytics interpretation, and tool-driven workflows into a unified response.
- Efficiency and Scalability: By eliminating the need to manually review multiple documents, the system accelerates research and decision-making for SEO teams, consultants, and marketing managers.
- Interpretability and Trust: The availability of source URLs and relevance scores provides visibility into how information was selected, reinforcing confidence in the output.
This performance validates the system’s suitability for supporting content audits, strategic planning, and internal discussions. By surfacing precise, context-aware insights from dispersed digital assets, the system reduces cognitive load while maintaining informational integrity.
Final Thoughts
Overall, this project delivers a robust and application-ready solution for open-domain question answering (ODQA) within SEO-centric use cases. By combining advanced retrieval techniques with a generation model aligned to domain-specific requirements, the system produces responses that are both informative and grounded in verifiable web content. The integration of semantic relevance scoring and transparent source attribution further strengthens trust, traceability, and professional usability.
The true strength of the approach lies in its capacity to process unstructured content from diverse sources and translate it into coherent, accurate, and contextually precise answers. This makes the system especially valuable for SEO consultants, digital strategists, and optimization teams seeking scalable, automated insights derived directly from their own or competitive web ecosystems. The resulting output is not only informative but also transparent, reliable, and well-suited for real-world SEO decision-making.
