The AI-Powered Neural Topic Modeling for Content Clustering and SEO Strategy project uses advanced AI technology (specifically, Neural Topic Modeling) to help website owners understand their content better. The project aims to:
- Automatically organize content into meaningful groups (called clusters) based on the discussed topics.
- Improve the website’s SEO (Search Engine Optimization) by identifying the best keywords and linking similar pages together to boost visibility in search engine rankings.
- Recommend similar content to users, helping them easily find other relevant pages on the website.
Let’s break down each part in simple language:
1. Neural Topic Modeling (NTM):
Neural Topic Modeling is an advanced method that uses AI to analyze large amounts of text (like website content) and automatically discover hidden topics within that text. Topics are the main themes or subjects that appear frequently in the content.
For example, if your website has articles about SEO, digital marketing, and web development, the NTM will automatically find these themes by analyzing the words used in each article. It’s like AI reading all your content and figuring out the key subjects your website discusses.
2. Content Clustering:
Once the NTM identifies the topics, the project groups similar content together; this is called content clustering.
Think of it this way: if you have several articles about SEO strategies, the project automatically clusters them into one group, while another cluster might include articles on social media marketing. Organizing the website’s content into clear, meaningful groups makes it easier for users to navigate your website and find the information they want.
Why is this useful?
- It helps users by showing them related articles or services they might be interested in.
- It helps website owners keep their content well-organized and easy to manage.
3. SEO Strategy: How does it help SEO?
SEO (Search Engine Optimization) is the practice of improving a website’s ranking on search engines like Google. When your website ranks higher, more people find it when they search for related terms. The project uses Neural Topic Modeling to help with SEO in two ways:
- Keyword Strategy: The project identifies the most important keywords for each topic. These keywords are what people are likely to type into search engines. For example, if the NTM finds that “SEO services” and “link-building” are common topics, you can focus on these keywords to attract more traffic from search engines.
- Internal Linking: The project finds which pages are similar to each other. You can use this information to create links between similar pages. Internal linking is important for SEO because it helps search engines understand the structure of your website, making it easier to index your pages and boost your rankings.
4. Recommendation System: What does it do?
In addition to organizing content and improving SEO, the project also acts as a recommendation system. When someone reads an article or visits a page on your website, the project can suggest other similar pages that the user might be interested in based on the content they are viewing.
For example, if someone is reading about SEO strategies, the project can recommend other related pages like link-building techniques or competitor keyword analysis. This keeps visitors engaged with your website for longer and increases the chances of them exploring more of your content.
5. How Does This Help a Website Owner?
As a website owner, the project helps you in the following ways:
- Content Clustering: It automatically organizes your website’s content, saving you time and effort in manually managing pages.
- SEO Optimization: By showing you the most important keywords and helping you link similar content, it improves your website’s visibility in search engines, attracting more visitors.
- User Engagement: The recommendation system keeps users engaged by suggesting relevant content, which helps improve the user experience and increase the time visitors spend on your site.
Example of How It Works:
Let’s say you own a website that offers various digital marketing services. You have pages on:
- SEO services
- Social media marketing
- Link-building techniques
- Content proofreading
Using this project:
- Neural Topic Modeling analyzes all your pages and discovers that the main topics are SEO, social media, and content services.
- The project clusters these pages into meaningful groups (like all SEO-related pages together, all social media marketing pages together, etc.).
- It suggests the best keywords for each topic (like “SEO services” for SEO-related pages) so that you can optimize your content for search engines.
- It shows you which pages are similar, so you can link them together (for example, linking SEO services to competitor keyword analysis).
- It provides a list of recommended pages for users to see based on the content they are currently viewing, helping them discover more content on your site.
Key Benefits for Website Owners:
- Save time by automating the content organization process.
- Improve SEO by identifying the most important keywords and linking related content.
- Increase user engagement by providing page recommendations and keeping users on the site longer.
What is Neural Topic Modeling (NTM)?
Neural Topic Modeling combines traditional topic modeling techniques (like Latent Dirichlet Allocation, LDA) with neural networks. Topic modeling is a process that discovers hidden topics or themes within a large collection of text data. Neural Topic Modeling enhances this by using deep learning (neural networks) to identify complex, nuanced topics in the content, improving the accuracy of topic discovery.
Use Cases of Neural Topic Modeling:
- Content Organization: Automatically organize content into topics, making it easier for websites to create clusters or groups of related articles.
- SEO Optimization: NTM helps in finding hidden themes within your website content, which can guide your keyword strategy to target the right search terms.
- Recommendation Systems: E-commerce or content websites can use NTM to recommend relevant products or articles to users based on topic similarities.
Real-Life Implementations:
- Customer Reviews Analysis: E-commerce sites use NTM to analyze customer reviews and discover the hidden topics (e.g., “shipping,” “quality,” or “price”) that matter most to customers.
- News Websites: News websites use NTM to group related news articles automatically and create content clusters.
- Search Engines: Search engines can enhance their understanding of queries by categorizing content into more nuanced topics.
How is NTM used on Websites?
For your project related to a website, Neural Topic Modeling can be used to analyze the text content of the website and help group related pages or articles into topics. This is great for:
- Optimizing SEO and Keywords: NTM will find the best hidden topics in your content, which can be used to improve your website’s search engine ranking.
- Content Clustering: You can create groups of related content on the website that users will find easier to navigate and explore.
What kind of data does NTM need?
- Text Data: NTM needs a lot of text data to analyze. For your website project, this would be the written content on each page of the website (articles, blogs, descriptions, etc.).
- Input Formats: This text data can come from URLs of the webpages or be provided in CSV format. If you use URLs, you need to scrape or extract the text content from those webpages. If you have the content in CSV format, the text should be in a structured way (e.g., with a column for the page title and a column for the text content).
How does NTM work technically?
- Preprocessing the Data: The text needs to be cleaned first (removing stopwords like “the,” “is,” etc.). The cleaned text is then converted into numbers through a process called “vectorization” so the neural network can work with it.
- Neural Network and Topic Discovery: The neural network processes the text data and uncovers hidden topics by analyzing patterns in the text. Traditional models like LDA focus on simpler topics, while NTM goes deeper into complex patterns and relationships.
- Output: After processing, NTM outputs a list of topics (keywords that represent each topic) along with their associated content. For your website, this means the model will tell you the main themes in the website’s content and how they are related, which can guide your content strategy.
Why is NTM helpful for content clustering and keyword strategies?
By discovering hidden topics, NTM helps:
- Optimize Content Clusters: It groups related content together, improving the user experience on your website.
- Enhance Keyword Strategy: The topics uncovered by NTM can guide which keywords or search terms are most relevant to your website’s content, improving SEO.
1. Import Required Libraries for the Project
- Purpose: requests is a Python library used to make HTTP requests. In this project, we use it to access the content of the web pages listed in the URLs. When we “request” a webpage, this library gets the HTML content of that webpage for us to work with.
- Purpose: BeautifulSoup is a library used for parsing HTML and XML documents. Webpages are written in HTML, and this tool helps extract only the relevant text (ignoring HTML tags like <div>, <p>, etc.). It’s like scraping the meaningful content from the raw HTML that the requests library fetches.
- Purpose: re is Python’s regular expression module. Regular expressions are used for searching, matching, and manipulating text. We use it here to clean the text (e.g., removing digits, punctuation, or unwanted characters) before we analyze it.
- Purpose: nltk is the Natural Language Toolkit, a powerful library for processing text. It provides tools for tasks such as tokenization (splitting text into words), removing stopwords (like “the,” “is,” etc.), and performing other natural language processing (NLP) tasks.
- Purpose: The stopwords corpus is a part of the nltk library and contains lists of common words in a language (like “the,” “is,” “in,” etc.). These words are not very informative for analysis, so we remove them from the text to focus only on meaningful words.
- Purpose: CountVectorizer is a tool that transforms text into a Bag of Words model. This converts the text into a matrix of word counts where each row represents a document, and each column represents a word. This is essential because machine learning models like Neural Topic Modeling need text in a numerical format to work with.
- Purpose: cosine_similarity is used to calculate how similar two vectors are. In this project, it calculates the similarity between the topics of different web pages. Two pages that have a high cosine similarity score are considered to be about similar subjects, which is useful for making content recommendations.
- Purpose: corpora and models are part of the Gensim library, which is used for topic modeling. In this case, we use them to build the Latent Dirichlet Allocation (LDA) model. The LDA model automatically discovers topics from text data by analyzing word patterns across documents (in this case, the web pages).
- corpora: Helps create a dictionary that maps words to their unique IDs.
- models: Contains tools to build the LDA model, which will discover the hidden topics in the content.
- Purpose: numpy is a library for working with arrays and matrices, which are essential data structures in machine learning. Here, it is used to store the topic distributions (i.e., how much each topic is present in each webpage) in matrix form. It’s also useful for mathematical operations.
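Collected together, the imports described above might look like the following sketch (the module paths are the standard ones for these libraries; exact versions and aliases are assumptions):

```python
import requests                          # fetch the HTML of each URL
from bs4 import BeautifulSoup            # parse HTML and extract the visible text
import re                                # regular expressions for text cleaning
import nltk                              # natural language processing toolkit
from nltk.corpus import stopwords        # lists of common, low-information words
from sklearn.feature_extraction.text import CountVectorizer   # Bag of Words vectorizer
from sklearn.metrics.pairwise import cosine_similarity        # similarity between topic vectors
from gensim import corpora, models       # dictionary building and the LDA topic model
import numpy as np                       # matrices for storing topic distributions
```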
2. Download English Stopwords Using NLTK
- Purpose: Before using the list of stopwords (common words that don’t add value to the analysis), we need to download them using nltk. This command ensures that we have the necessary stopwords in English to remove from the text before further processing.
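A one-line sketch of this step, using the standard NLTK download call:

```python
# Download the English stopword list once; it is then available via nltk.corpus.stopwords
nltk.download('stopwords')
```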
Function: fetch_webpage_content
This function takes a list of URLs as input and returns a list of raw text extracted from each webpage. A complete sketch of the function follows the step-by-step walkthrough below.
1. Define the function and initialize the list:
- Explanation:
- The function fetch_webpage_content accepts one parameter called urls, which is a list of web page URLs.
- content_list is an empty list that will be used to store the text content extracted from each webpage.
- Example: In this case, the input could be a list containing a single URL, such as ['https://thatware.co/'].
2. Loop over each URL:
· Explanation:
- This loop goes through each URL in the list urls. It will repeat the process for every URL to retrieve the content.
· Example: The URL https://thatware.co/ will be the first (and only) URL in this case, so the function will loop through it.
3. Send an HTTP request to get the webpage content:
· Explanation:
- requests.get(url) sends an HTTP request to the website server and tries to download the webpage’s content.
- The response from the server is stored in the variable response. This contains the HTML content of the webpage.
· Example:
- For the URL https://thatware.co/, this sends a request to the server hosting the ThatWare website. The response will contain the HTML code for the homepage.
4. Parse the HTML content:
· Explanation:
- The BeautifulSoup object soup parses the HTML content (which is in response.content) and allows us to work with it as a structured document.
- html.parser is a built-in HTML parser that helps break down the HTML into meaningful elements.
· Example:
- The BeautifulSoup library will now read through the HTML from https://thatware.co/ and allow us to extract text from the page, ignoring the HTML tags (like <div>, <h1>, etc.).
5. Extract visible text from the webpage:
· Explanation:
- soup.get_text(separator=' ') extracts all the visible text from the webpage (without HTML tags). The separator=' ' argument ensures that different blocks of text are separated by spaces.
- The result is a long string containing all the visible text on the page, with spaces between different sections.
· Example:
- From the page https://thatware.co/, this will extract the visible text content such as the homepage’s headers, body text, and any other readable content, ignoring HTML code like <div>, <h1>, etc.
6. Add the extracted text to the list:
· Explanation:
- The extracted text is then added to the content_list. This ensures that for each URL processed in the loop, we store its content in this list.
· Example:
- For the URL https://thatware.co/, the visible text of the homepage (such as “ThatWare SEO Services,” “AI-Powered SEO Solutions,” etc.) will be added as a string to content_list.
7. Handle any errors that occur:
- Explanation:
- The try block ensures that if something goes wrong (e.g., the webpage cannot be reached, or the URL is incorrect), the except block will handle it.
- It prints an error message that includes the URL that caused the error and a description of the error (str(e)).
- Example:
- If there’s an issue fetching the content from https://thatware.co/, this part of the code will catch the error and print a message along the lines of “Error fetching content from https://thatware.co/: <error description>”.
8. Return the list of content:
- Explanation:
- After all the URLs have been processed, the function returns content_list, which contains the extracted text from each webpage.
- Example:
- For https://thatware.co/, the function would return a single-item list containing the visible text from the homepage.
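Putting steps 1–8 together, the whole function could look like this sketch (the exact wording of the error message is an assumption):

```python
def fetch_webpage_content(urls):
    """Fetch each URL and return a list of raw visible text, one entry per page."""
    content_list = []                                             # step 1: storage for page text
    for url in urls:                                              # step 2: process each URL in turn
        try:
            response = requests.get(url)                          # step 3: download the page HTML
            soup = BeautifulSoup(response.content, 'html.parser') # step 4: parse the HTML
            text = soup.get_text(separator=' ')                   # step 5: keep only the visible text
            content_list.append(text)                             # step 6: store this page's text
        except Exception as e:                                    # step 7: report failures and continue
            print(f"Error fetching content from {url}: {str(e)}")
    return content_list                                           # step 8: one text string per URL
```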
Complete Example:
Let’s say we call this function with one URL, https://thatware.co/. The output would be a list with one element: the visible text content of that page, as in the sketch below.
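A minimal usage sketch (the printed text is illustrative; the real output depends on the live page):

```python
content = fetch_webpage_content(['https://thatware.co/'])
print(content[0][:100])
# Illustrative output (actual text depends on the current homepage):
# ThatWare SEO Services   AI-Powered SEO Solutions   Advanced SEO Services ...
```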
What Does This Function Do?
The preprocess_text function is designed to clean text data. Raw text collected from webpages contains unnecessary information such as punctuation, numbers, and common words like “the,” “and,” and “is”; this function strips that noise out so only the meaningful words remain. A complete sketch of the function follows the step-by-step explanation below.
Step-by-Step Explanation
1. Initialize the Stopwords:
· Explanation:
- The stopwords.words('english') gets a list of common English words that are not useful for analysis (like “the,” “is,” “and”).
- These words are called stopwords, and they don’t help when you are trying to understand the main ideas or topics in a document.
- set(stopwords.words('english')) stores these stopwords in a set (which is a type of collection that makes checking for words faster).
· Why this is important:
- Removing these common words helps us focus on the more important and meaningful words in the text.
2. Add Custom Stopwords and Symbols:
· Explanation:
- In addition to the default stopwords from the nltk library, this line defines custom stopwords and symbols that we also want to remove.
- This includes:
- Extra common words: “the,” “and,” “when,” “where,” etc.
- Punctuation: Symbols like “:”, “;”, “(“, “)”, etc.
- These words and symbols are not useful for understanding the topics in the text, so we will remove them.
· Why this is important:
- By removing symbols and unnecessary words, we make the text cleaner and more meaningful for further analysis.
3. Merge Default and Custom Stopwords:
· Explanation:
- This line combines the default stopwords from nltk with the custom stopwords and symbols that were defined above.
- The union() function merges these two lists together, so we have one complete list of all the words and symbols we want to remove.
· Why this is important:
- It ensures that all the unnecessary words (default and custom) will be removed when we clean the text.
4. Create an Empty List to Store the Cleaned Text:
· Explanation:
- This line creates an empty list called preprocessed_content.
- As we process each webpage’s text and clean it, the cleaned version of the text will be added to this list.
· Why this is important:
- This list will store the final, cleaned versions of the text for all the webpages, so we can use it later in the analysis.
5. Start Looping Through Each Webpage’s Content:
· Explanation:
- This line starts a loop that goes through each piece of text in content_list.
- content_list is the list of raw text from each webpage (from the previous step where we fetched the content from the URLs).
· Why this is important:
- This loop allows us to process each webpage’s content one by one.
6. Convert All Text to Lowercase:
· Explanation:
- This converts the entire text of each webpage to lowercase.
- For example, “SEO Services” becomes “seo services.”
· Why this is important:
- By converting everything to lowercase, we treat words like “SEO” and “seo” as the same. This makes the text uniform and avoids confusion when analyzing it.
7. Remove Digits (Numbers):
· Explanation:
- This line uses a regular expression (re.sub) to remove digits from the text.
- The pattern r'\d+' matches any sequence of digits, and each match is replaced with an empty string, effectively deleting the numbers.
· Why this is important:
- Numbers are usually not useful when analyzing topics in text, so removing them helps make the text cleaner and easier to analyze.
8. Remove Punctuation and Symbols:
· Explanation:
- This line removes punctuation and special symbols from the text using another regular expression.
- The pattern r'[^\w\s]' matches any character that is not a word character (\w) or a whitespace (\s), meaning it removes everything else (punctuation, symbols, etc.).
· Why this is important:
- Punctuation and symbols don’t add meaning to the text, so removing them makes the text cleaner for further processing.
9. Split the Text into Words:
· Explanation:
- This line splits the text into a list of words.
- For example, the sentence “seo services digital marketing” will be split into ['seo', 'services', 'digital', 'marketing'].
· Why this is important:
- This makes it easier to remove stopwords (we can now look at each word individually) and analyze the text later.
10. Remove Stopwords:
· Explanation:
- This line goes through each word in the words list and checks if it is in the stop_words set (which includes default and custom stopwords).
- If the word is not in stop_words, it is kept. Otherwise, it is removed.
- The cleaned list of words (without stopwords) is stored in filtered_words.
· Why this is important:
- Removing stopwords helps focus the analysis on the meaningful words, rather than common, less important ones like “the,” “is,” “and.”
11. Rejoin the Cleaned Words and Add to Final List:
· Explanation:
- This takes the list of cleaned words (filtered_words) and joins them back into a single string (with spaces between the words).
- The cleaned text is then added to the preprocessed_content list.
· Why this is important:
- This step completes the cleaning process for each piece of text and stores the cleaned version so that it can be used later in the analysis.
12. Return the Preprocessed Text:
· Explanation:
- After processing all the webpages, the function returns the list of cleaned text (preprocessed_content).
- This cleaned text can now be used for further analysis, such as building a topic model or finding related content.
· Why this is important:
- We need the cleaned text to move on to the next steps in the analysis. This is the final output of this step.
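Assembled from steps 1–12, the function might look like the sketch below (the custom stopword set shown is only a small illustrative subset of what the project could use):

```python
def preprocess_text(content_list):
    """Clean raw page text: lowercase it, strip digits and punctuation, remove stopwords."""
    stop_words = set(stopwords.words('english'))              # step 1: default English stopwords
    # step 2: extra words/symbols to drop (illustrative subset, not the full project list)
    custom_stopwords = {'the', 'and', 'when', 'where', ':', ';', '(', ')'}
    stop_words = stop_words.union(custom_stopwords)            # step 3: merge both lists

    preprocessed_content = []                                  # step 4: cleaned text goes here
    for content in content_list:                               # step 5: one webpage at a time
        content = content.lower()                              # step 6: lowercase everything
        content = re.sub(r'\d+', '', content)                  # step 7: remove digits
        content = re.sub(r'[^\w\s]', '', content)              # step 8: remove punctuation/symbols
        words = content.split()                                # step 9: split into words
        filtered_words = [w for w in words if w not in stop_words]   # step 10: drop stopwords
        preprocessed_content.append(' '.join(filtered_words))        # step 11: rejoin and store
    return preprocessed_content                                # step 12: cleaned text per page
```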
Example of How the Function Works:
Let’s say we have some raw text such as: “SEO Services 2024! Learn digital marketing & get the best results.”
After running it through the preprocess_text function:
- The text is converted to lowercase: “seo services 2024! learn digital marketing & get the best results.”
- The numbers are removed: “seo services! learn digital marketing & get the best results.”
- The punctuation is removed: “seo services learn digital marketing get the best results”
- The stopwords are removed: “seo services learn digital marketing best results”
The final cleaned version would be: “seo services learn digital marketing best results”.
What is the Purpose of This Function?
The vectorize_text function converts text into a numerical format that the machine learning model can understand. This step is necessary because machine learning algorithms cannot work with plain text; they need numbers. So, this function turns the cleaned text (from the previous step) into a Bag of Words (BoW) model, which is essentially a table that counts how many times each word appears in the text. A sketch of the function follows the step-by-step explanation below.
Step-by-Step Explanation
1. Define the Function:
- Explanation:
- The function vectorize_text takes one input called preprocessed_content, which is a list of the cleaned text from all the webpages. This is the result from the previous step where we removed unnecessary words, symbols, and stopwords.
- The goal of this function is to convert this cleaned text into a matrix of word counts (numbers), so we can use it in further analysis or modeling.
2. Initialize a CountVectorizer Object:
- Explanation:
- CountVectorizer is a tool from the scikit-learn library that helps convert text into numbers by counting how many times each word appears.
- We are creating a vectorizer object here, which will be used to transform the text into a matrix of word counts.
Let’s break down the parameters used in CountVectorizer:
- max_df=0.9: This means we will ignore words that appear in more than 90% of the documents (webpages). If a word appears almost everywhere, it’s probably not very meaningful (e.g., “service” might appear on every page).
- min_df=2: This means we will ignore words that appear in fewer than 2 documents. This helps remove rare or unusual words that don’t add much value to the analysis.
- stop_words='english': This is an additional safety measure to remove common English stopwords like “the,” “and,” “is.” Even though we removed stopwords in the previous step, this ensures that any missed stopwords are removed.
- Why this is important:
- The CountVectorizer tool transforms the text into a Bag of Words model, which is basically a big table showing how often each word appears in each document (webpage). This table is the numerical format that the machine learning model can work with.
3. Fit and Transform the Preprocessed Text:
· Explanation:
- The fit_transform method does two things:
- Fit: It looks at the preprocessed text and learns which unique words are present across all the webpages. It creates a vocabulary of these words.
- Transform: It then counts how many times each word appears in each document (webpage) and puts that information into a matrix (a grid of numbers).
- The result is a matrix where:
- Each row represents a webpage.
- Each column represents a unique word.
- The numbers in the matrix show how many times a word appears on a particular webpage.
· Why this is important:
- This step transforms the text into a format (numbers) that a machine learning model can work with. Without converting text to numbers, the model wouldn’t be able to understand or process it.
4. Return the Vectorizer and Word Matrix:
· Explanation:
- The function returns two things:
- vectorizer: This is the CountVectorizer object that knows the vocabulary (the list of words it found) and can be used later for further analysis.
- word_matrix: This is the matrix of word counts for each document (webpage), which will be used in the next steps of the analysis.
· Why this is important:
- The vectorizer object helps us keep track of the words (vocabulary), and the word_matrix is the actual data we need to perform further analysis, such as topic modeling or similarity calculations.
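Based on steps 1–4, a sketch of the function (the parameter values follow the description above):

```python
def vectorize_text(preprocessed_content):
    """Convert cleaned page text into a Bag of Words matrix of word counts."""
    vectorizer = CountVectorizer(
        max_df=0.9,             # ignore words appearing in more than 90% of pages
        min_df=2,               # ignore words appearing in fewer than 2 pages
        stop_words='english',   # safety net for any stopwords missed earlier
    )
    # Learn the vocabulary, then count how often each word appears on each page
    word_matrix = vectorizer.fit_transform(preprocessed_content)
    return vectorizer, word_matrix
```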
What Happens in This Step?
In this step, we are turning our cleaned text into numbers so that it can be used in machine learning models. Specifically, we are creating a Bag of Words model, which is like a big table where:
- Each row is a webpage.
- Each column is a word.
- The numbers in the table represent how many times each word appears in each webpage.
This conversion is crucial because computers can only process numbers—not raw text—so the text has to be transformed into a numerical format before any further analysis can be done.
Example of How It Works:
Let’s say we have two cleaned documents (webpages):
- “seo services digital marketing best results”
- “learn digital marketing tips seo experts”
The Bag of Words model would look like this:
|      | best | digital | experts | learn | marketing | results | seo | services | tips |
|------|------|---------|---------|-------|-----------|---------|-----|----------|------|
| Doc1 | 1    | 1       | 0       | 0     | 1         | 1       | 1   | 1        | 0    |
| Doc2 | 0    | 1       | 1       | 1     | 1         | 0       | 1   | 0        | 1    |
- Rows: Each row represents a document (or webpage).
- Columns: Each column represents a unique word that appears in the documents.
- Numbers: The numbers in the table show how many times a word appears in each document.
For example:
- The word “seo” appears 1 time in both Doc1 and Doc2.
- The word “services” appears 1 time in Doc1 and 0 times in Doc2.
What Does This Function Do?
The create_topic_model function builds a Latent Dirichlet Allocation (LDA) topic model, which is a tool that discovers the hidden topics within a collection of documents. In this case, the documents are the webpages you are analyzing. The function does this by:
- Tokenizing the text (splitting it into words).
- Creating a dictionary (mapping each word to a unique ID).
- Creating a Bag of Words (BoW) corpus (counting how many times each word appears in each document).
- Building the LDA model to find hidden topics based on the words and their frequencies.
Step-by-Step Explanation
1. Define the Function:
- Explanation:
- This function, create_topic_model, takes two inputs:
- preprocessed_content: A list of cleaned text from the webpages (the output from the previous steps).
- num_topics=5: The number of topics you want the model to find. In this case, it’s set to 5 by default, but you can change this to find more or fewer topics.
- The goal is to discover the most common themes (topics) across the documents (webpages).
2. Tokenize the Preprocessed Content:
· Explanation:
- Tokenization is the process of splitting text into individual words (tokens). This line goes through each cleaned document and splits it into words.
- For example, the sentence “seo services digital marketing” becomes ['seo', 'services', 'digital', 'marketing'].
· Why this is important:
- Tokenizing the text is necessary because the topic model needs to look at individual words to find patterns and topics. Without splitting the text into words, the model wouldn’t be able to analyze the content properly.
3. Create a Dictionary:
· Explanation:
- A dictionary is created from the tokenized data using the corpora.Dictionary function from the gensim library.
- This dictionary assigns a unique ID to every word in the text. For example:
- “seo” might be assigned ID 0,
- “services” might be assigned ID 1, and so on.
- The dictionary helps the model understand which words are present in your documents and how frequently they appear.
· Why this is important:
- The dictionary is like a map that connects each word to its unique ID, which is necessary for the LDA model to process the text.
4. Convert the Tokenized Data into a Bag of Words Corpus:
· Explanation:
- This step converts each document into a Bag of Words (BoW) representation.
- The Bag of Words model creates a list of word IDs and their frequencies (how many times each word appears) for each document.
- For example, the document “seo services digital marketing” might become:
[(0, 1), (1, 1), (2, 1), (3, 1)]
- Here, the number 0 represents “seo”, and 1 indicates that it appears once in the document.
- The number 1 represents “services”, and it also appears once.
· Why this is important:
- The Bag of Words corpus is how the LDA model understands each document. Instead of raw text, the model works with word IDs and frequencies. This allows it to find which words often appear together and therefore uncover topics.
5. Create the LDA Topic Model:
· Explanation:
- This line creates the LDA model using the LdaModel function from gensim.
- The LDA model tries to find hidden topics in the documents by analyzing which words often appear together.
- The parameters used are:
- corpus: This is the Bag of Words representation of your documents.
- num_topics=num_topics: The number of topics you want to discover (set to 5 by default).
- id2word=dictionary: The dictionary that maps word IDs to the actual words.
- passes=10: This tells the model how many times it should go through the data to find the topics. More passes mean the model has more opportunities to refine its understanding of the topics.
· Why this is important:
- The LDA model is the core part of the topic discovery process. It analyzes the word patterns in the documents to find groups of words that tend to appear together, which represent the hidden topics.
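A sketch assembled from steps 1–5 (returning the model together with the corpus and dictionary is an assumption, made so the later steps can reuse them):

```python
def create_topic_model(preprocessed_content, num_topics=5):
    """Build an LDA topic model from the cleaned page text."""
    tokenized = [doc.split() for doc in preprocessed_content]   # step 2: split each document into words
    dictionary = corpora.Dictionary(tokenized)                  # step 3: map every unique word to an ID
    corpus = [dictionary.doc2bow(doc) for doc in tokenized]     # step 4: (word ID, count) pairs per document
    lda_model = models.LdaModel(corpus,                         # step 5: discover the hidden topics
                                num_topics=num_topics,
                                id2word=dictionary,
                                passes=10)
    return lda_model, corpus, dictionary
```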
Example of How It Works:
Let’s say you have two preprocessed documents:
- “seo services digital marketing”
- “learn digital marketing seo tips”
1. Tokenization:
- Document 1 becomes: ['seo', 'services', 'digital', 'marketing']
- Document 2 becomes: ['learn', 'digital', 'marketing', 'seo', 'tips']
2. Create a Dictionary:
- The dictionary assigns an ID to each unique word:
- 'seo': ID 0
- 'services': ID 1
- 'digital': ID 2
- 'marketing': ID 3
- 'learn': ID 4
- 'tips': ID 5
3. Convert to Bag of Words (BoW):
- Document 1 (BoW): [(0, 1), (1, 1), (2, 1), (3, 1)] (this means “seo” appears once, “services” once, etc.)
- Document 2 (BoW): [(0, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
4. Create the LDA Model:
- The LDA model analyzes the words and tries to group them into 5 topics (since num_topics=5 is set by default). After analyzing which words appear together, it creates topics.
Output:
- You can now explore the topics that the LDA model has discovered and see which words are important for each topic.
What Does This Function Do?
The assign_topics_to_pages function uses the LDA model (created in the previous step) to figure out which topics are most relevant for each webpage. In simple terms, it checks each webpage (which we’ve processed as a document) and gives a score for how much each topic is discussed on that page. The result is a list that shows which topics are most important for each page. A sketch of the function follows the step-by-step explanation below.
Step-by-Step Explanation
1. Define the Function:
· Explanation:
- The function assign_topics_to_pages takes two inputs:
- lda_model: This is the LDA model we created earlier, which has already learned about the topics in the text.
- corpus: This is the Bag of Words representation of all the webpages, where each document is a list of word IDs and their frequencies (how many times each word appears).
· Why this is important:
- The LDA model needs to be applied to each document (webpage) to figure out which topics are most relevant to that particular document. This helps us understand which topics are being discussed on each page.
2. Initialize an Empty List for Topic Distributions:
· Explanation:
- This line creates an empty list called page_topics. This list will store the topic distribution for each webpage.
- A topic distribution means the percentage of each topic that is present in a webpage. For example, if a webpage discusses two main topics—SEO and digital marketing—the function will assign percentages like 60% SEO and 40% digital marketing for that page.
· Why this is important:
- We need to store the results (the topic distributions) somewhere so we can refer to them later. This list will eventually contain one set of results for each webpage.
3. Loop Through Each Document (Webpage):
· Explanation:
- This line starts a loop that goes through each document (webpage) in the corpus. The corpus is the Bag of Words representation of all the documents. Each document is a list of word IDs and their counts (how often each word appears).
- The loop allows us to analyze each document one by one.
· Why this is important:
- By looping through the documents, we can apply the topic model to each webpage and see which topics are present on that page.
4. Get the Topic Distribution for Each Document:
· Explanation:
- For each document (webpage), we use the LDA model to get the topic distribution. This tells us how much of each topic is present in that document.
- lda_model.get_document_topics(doc, minimum_probability=0) is the key method here:
- get_document_topics(doc): This analyzes the document using the LDA model and returns a list of topics along with their percentages.
- minimum_probability=0: This ensures that all topics are included in the result, even if the topic is only slightly present in the document (like 0% or very close to 0%).
- The result is a list that looks like this:
[(0, 0.1), (1, 0.3), (2, 0.6)]
- This example means:
- Topic 0 is 10% relevant to the document.
- Topic 1 is 30% relevant.
- Topic 2 is 60% relevant.
- The topic distribution for each document is then added to the page_topics list.
· Why this is important:
- This step is crucial because it tells us which topics are discussed on each webpage and how much of each topic is present. This is the main output of the function, and it allows us to see the dominant topics for each webpage.
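A sketch of the function based on steps 1–4:

```python
def assign_topics_to_pages(lda_model, corpus):
    """Return, for each page, how strongly each discovered topic is present."""
    page_topics = []                                   # step 2: one distribution per page
    for doc in corpus:                                 # step 3: each Bag of Words document
        # step 4: full topic distribution, including topics with near-zero probability
        topic_distribution = lda_model.get_document_topics(doc, minimum_probability=0)
        page_topics.append(topic_distribution)
    return page_topics
```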
Example of How It Works
Let’s say we have three documents (webpages) in our corpus, and the LDA model has found 3 topics. The assign_topics_to_pages function will return something like this:
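An illustrative distribution consistent with the interpretation below (the two minor topic values for Document 3 are assumed):

```
Document 1: [(0, 0.20), (1, 0.50), (2, 0.30)]
Document 2: [(0, 0.60), (1, 0.20), (2, 0.20)]
Document 3: [(0, 0.15), (1, 0.70), (2, 0.15)]
```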
Here’s how to interpret the results:
- Document 1: Topic 0 makes up 20% of the content, Topic 1 is 50%, and Topic 2 is 30%.
- Document 2: Topic 0 is the most important (60%), while Topic 1 and Topic 2 are less important (20% each).
- Document 3: Topic 1 is the dominant topic at 70%, and the other two topics are less important.
This shows how much each webpage discusses each topic, which helps in understanding the focus of the content on that page.
What Does This Function Do?
The display_top_keywords function is designed to show the most important words (keywords) for each topic that the LDA model has discovered. This is helpful for understanding the main ideas or themes behind each topic. For example, if a topic is about “SEO,” the important words (keywords) could be “SEO,” “ranking,” “optimization,” “search engines,” etc. Showing these top words helps you understand what each topic is really about. A sketch of the function follows the step-by-step explanation below.
Step-by-Step Explanation
1. Define the Function:
· Explanation:
- The function is called display_top_keywords, and it takes two inputs:
- lda_model: This is the LDA model we created earlier, which has learned about the hidden topics in the documents (webpages).
- num_keywords=10: This tells the function to show the top 10 keywords for each topic by default. You can change this number if you want to see more or fewer keywords for each topic.
· Why this is important:
- Displaying the most important words helps you understand what each topic is really talking about. For example, if Topic 1 is about “SEO,” showing the top keywords like “SEO,” “search engines,” and “rankings” helps you confirm the focus of that topic.
2. Loop Over Each Topic:
· Explanation:
- The print_topics() method from the LDA model is used here to get a list of the most important words for each topic. It returns a list of topics and their top keywords.
- (-1, num_keywords):
- -1 means that the function will loop through all the topics in the LDA model.
- num_keywords (which is 10 by default) tells the function to display the top 10 words for each topic.
- The for loop goes through each topic one by one:
- idx: The index (number) of the topic (like Topic 0, Topic 1, etc.).
- topic: A list of the top words (keywords) for that topic.
· Why this is important:
- This loop is necessary because we want to display the keywords for all topics, not just one. The print_topics() function makes it easy to get the top words for each topic and loop through them.
3. Print the Top Keywords for Each Topic:
· Explanation:
- This line prints out the top keywords for each topic in a readable format.
- The f"..." part is called an f-string, which makes it easy to combine variables (like idx and topic) into a sentence.
- Topic {idx}: Shows the topic number (e.g., Topic 0, Topic 1, etc.).
- Top {num_keywords} Keywords: Shows the number of keywords being displayed (default is 10).
- {topic}: This shows the actual top words (keywords) for that topic.
· Why this is important:
- This is the part of the function where the results are shown to you. It prints out the topic number and the list of important words for each topic so that you can understand what each topic is about.
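A sketch of the function based on steps 1–3:

```python
def display_top_keywords(lda_model, num_keywords=10):
    """Print the top keywords that define each discovered topic."""
    # print_topics(-1, num_keywords) returns every topic with its top keywords
    for idx, topic in lda_model.print_topics(-1, num_keywords):
        print(f"Topic {idx}: Top {num_keywords} Keywords: {topic}")
```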
Example of How It Works:
Let’s say you have an LDA model with 3 topics, and you want to display the top 5 keywords for each topic. Here’s what the output might look like:
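An illustrative output (the keyword lists are taken from the interpretation below; a real gensim model would also print a weight next to each word):

```
Topic 0: Top 5 Keywords: seo, search, engine, optimization, ranking
Topic 1: Top 5 Keywords: content, writing, article, blog, publish
Topic 2: Top 5 Keywords: social, media, marketing, platforms, facebook
```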
- Topic 0 is probably about SEO (Search Engine Optimization) because the top words are “seo,” “search,” “engine,” “optimization,” and “ranking.”
- Topic 1 seems to be about writing or publishing content, with words like “content,” “writing,” “article,” “blog,” and “publish.”
- Topic 2 looks like it’s about social media, with words like “social,” “media,” “marketing,” “platforms,” and “facebook.”
This output helps you understand the themes or topics that your documents (webpages) are discussing, and it helps with things like SEO optimization because you can see which keywords are associated with each topic.
What Does This Function Do?
The recommend_similar_pages function takes a list of URLs and their associated topic distributions (which topics are present on each page and to what extent) and then calculates how similar each page is to the others. Based on this similarity, it generates recommendations for each page, showing the most similar pages. This is useful for content recommendation systems or suggesting internal links on a website. A sketch of the function follows the step-by-step explanation below.
Step-by-Step Explanation
1. Define the Function:
- Explanation:
- The function recommend_similar_pages takes three inputs:
- urls: This is a list of URLs of the webpages that you want to analyze.
- page_topics: This is a list of topic distributions for each page (which topics are present on each webpage and how much).
- num_topics: This is the total number of topics that were identified by the LDA model.
- The goal of this function is to find out which pages are most similar based on their topic distributions and provide human-readable recommendations.
2. Create an Empty Matrix to Store Topic Distributions:
· Explanation:
- Here, we are creating an empty matrix (a grid of numbers) called topic_vectors using the numpy library (np). This matrix will store the topic distribution for each page.
- The matrix will have:
- Rows: Each row represents a webpage (the number of rows will be the same as the number of webpages in page_topics).
- Columns: Each column represents a topic (the number of columns will be the same as the number of topics in num_topics).
- This matrix will be filled in the next step to show how much of each topic is present on each webpage.
· Why this is important:
- The matrix will allow us to store and compare the topic distributions for each webpage in a structured way, which is necessary for calculating the similarity between pages.
3. Fill in the Topic Distributions for Each Page:
· Explanation:
- This loop goes through each page and fills the topic_vectors matrix with the topic distribution data.
- enumerate(page_topics): This gives us both the index (i, which represents the page number) and the topic distribution (topics) for each page.
- For each page (i), we go through its topic distribution (topics). The distribution contains pairs of numbers:
- topic_num: The topic number (like Topic 0, Topic 1, etc.).
- topic_value: The percentage or score of how much this topic is present in the page (e.g., 0.5 means 50% of the page is about this topic).
- We fill the topic_vectors matrix so that for each page, the appropriate value for each topic is stored.
· Why this is important:
- This step makes sure that each page’s topic distribution is stored in a way that allows us to compare pages based on their topic similarity.
4. Calculate the Cosine Similarity Between Pages:
· Explanation:
- Now that the topic_vectors matrix is filled with topic distributions, we calculate the similarity between all pages using cosine similarity.
- cosine_similarity(topic_vectors): This function calculates how similar two pages are based on their topic distributions. The similarity is measured on a scale from 0 to 1:
- A score of 1 means the pages are very similar (discussing almost the same topics).
- A score of 0 means the pages are completely different (discussing unrelated topics).
- The result is a similarity_matrix, which shows how similar each pair of pages is.
· Why this is important:
- Cosine similarity helps us understand which pages are discussing similar topics. This is essential for creating recommendations based on the content of the pages.
5. Create a Dictionary to Store the Recommendations:
· Explanation:
- This line creates an empty dictionary called recommendations, which will store the most similar pages for each URL.
- The dictionary will have:
- Keys: The URLs of the pages.
- Values: A list of the most similar pages and their similarity scores.
· Why this is important:
- We need to store the recommendations for each page in a format that is easy to read and use. The dictionary will store the top 3 most similar pages for each URL.
6. Loop Through Each URL to Find Similar Pages:
· Explanation:
- This part of the code loops through each webpage (i) and compares it to every other webpage (j) to find the most similar ones.
- if i != j: This ensures that the page does not recommend itself. We want to find other pages that are similar, not recommend the same page.
- similarity_score = similarity_matrix[i][j]: This gets the similarity score between page i and page j.
- The most similar pages and their scores are stored in similar_pages, which will be sorted in the next step.
· Why this is important:
- This loop compares each page to all the others and collects the similarity scores, which are necessary for creating meaningful recommendations.
7. Sort and Keep the Top 3 Most Similar Pages:
· Explanation:
- sorted(similar_pages, key=lambda x: x[1], reverse=True): This sorts the similar pages by their similarity score (from highest to lowest). The x[1] refers to the similarity score in each tuple (URL, similarity score).
- similar_pages[:3]: After sorting, this keeps only the top 3 most similar pages for each URL.
- These top 3 similar pages are then stored in the recommendations dictionary for the current page (urls[i]).
· Why this is important:
- We don’t want to overwhelm the user with too many recommendations, so we limit the suggestions to the top 3 most similar pages. This ensures the recommendations are useful and focused.
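A sketch of the function assembled from steps 1–7:

```python
def recommend_similar_pages(urls, page_topics, num_topics):
    """For each URL, recommend the 3 pages with the most similar topic mix."""
    topic_vectors = np.zeros((len(page_topics), num_topics))   # step 2: rows = pages, columns = topics
    for i, topics in enumerate(page_topics):                   # step 3: fill in each page's distribution
        for topic_num, topic_value in topics:
            topic_vectors[i, topic_num] = topic_value
    similarity_matrix = cosine_similarity(topic_vectors)       # step 4: pairwise similarity between pages
    recommendations = {}                                       # step 5: URL -> list of (similar URL, score)
    for i in range(len(urls)):                                 # step 6: compare each page to every other page
        similar_pages = [(urls[j], similarity_matrix[i][j])
                         for j in range(len(urls)) if i != j]
        # step 7: sort by similarity and keep only the top 3
        similar_pages = sorted(similar_pages, key=lambda x: x[1], reverse=True)[:3]
        recommendations[urls[i]] = similar_pages
    return recommendations
```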
Example of How It Works:
Let’s say we have 4 webpages:
- URL 1: “SEO strategies”
- URL 2: “Social media marketing”
- URL 3: “Content writing tips”
- URL 4: “SEO and content optimization”
After running this function, the recommendations might look like this:
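An illustrative result for the four pages above (only the top match per page is shown; the scores come from the interpretation below):

```
"SEO strategies"          -> [("SEO and content optimization", 0.85), ...]
"Social media marketing"  -> [("Content writing tips", 0.90), ...]
```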
- For URL 1, the most similar page is URL 4 with a similarity score of 0.85.
- For URL 2, the most similar page is URL 3 with a similarity score of 0.90.
- These scores tell us which pages are most similar based on their topic distributions.
What Does This Output Mean?
This output is the result of running a Neural Topic Modeling (NTM) model on the website’s content, and it shows two main things:
- Topics Discovered on The Website: The model has identified the main topics discussed across the website pages and displays the top 10 keywords that define each topic.
- Similar Pages Recommendations: For each page on the website, the model shows the most similar pages (based on content) and assigns them a similarity score between 0 and 1. A score of 1.00 means the pages are very similar, while lower scores indicate less similarity.
Let’s now break down the two parts of the output in more detail.
Part 1: Topics Discovered on The Website
The first part of the output lists the top 5 topics discovered from the website’s content. Each topic shows the top 10 keywords associated with it.
Example:
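An illustrative excerpt of the topic output (the 0.047 weight for “seo” comes from the explanation below; the other weights are placeholders):

```
Topic 0: 0.047*"seo" + 0.031*"services" + 0.024*"marketing" + 0.018*"link" + ...
```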
Explanation:
- Topic 0 contains the words “seo,” “services,” “marketing,” and “link,” which means this topic is related to SEO services and digital marketing. The numbers (like 0.047) next to the words represent the importance of the keyword in that topic (higher numbers mean the word is more relevant to the topic).
- Each topic is a hidden theme that your website content is addressing. Other topics could be about web development, content editing, or social media marketing.
Use Case:
- By understanding the key topics that the website covers, you can optimize your content. For example, if you are targeting SEO services, you can ensure that your content for SEO is well-organized and aligned with these keywords to improve search engine rankings.
What You Should Do:
- Enhance your content around the keywords shown in each topic. This will help you strengthen your SEO strategy. For instance, if your website’s main topic is about SEO, make sure to frequently use keywords like “SEO,” “services,” “link,” and “marketing” in relevant sections of your website.
Part 2: Similar Pages Recommendations
This part of the output provides recommendations for each webpage on the site, showing which pages are most similar based on the content. The output gives you the top 3 similar pages for each page, along with a similarity score.
Example:
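An illustrative formatting of the recommendations (the URLs and scores are taken from the explanation below):

```
Similar pages for https://thatware.co/:
  https://thatware.co/competitor-keyword-analysis/   (similarity: 1.00)
  https://thatware.co/link-building-services/        (similarity: 1.00)
  https://thatware.co/digital-marketing-services/    (similarity: 1.00)
```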
Explanation:
- For the page ‘https://thatware.co/’, the model has identified that the most similar pages, based on the content, are:
- ‘https://thatware.co/competitor-keyword-analysis/’ with a similarity score of 1.00.
- ‘https://thatware.co/link-building-services/’ with a similarity score of 1.00.
- ‘https://thatware.co/digital-marketing-services/’ with a similarity score of 1.00.
What Does the Similarity Score Mean?
- The similarity score ranges from 0 to 1. A score of 1.00 means the two pages are very similar in terms of content. A lower score (e.g., 0.05) means the pages are less similar.
Use Case:
- These recommendations tell you which pages are closely related in content. You can use this information to:
- Create internal links between similar pages to improve user navigation and SEO.
- Group similar content together into clusters (e.g., creating a category for “SEO services” that includes all similar pages).
- Cross-promote content: You can suggest or recommend similar pages to users to keep them on your site for longer.
Step-by-Step Example of What to Do Next
Here’s a guide to what steps you should take after receiving this output:
1. Optimize Website Content Based on Topics
- Look at the topics that the model has discovered (like Topic 0, Topic 1, etc.). These topics represent the key areas your website covers. For each topic, focus on the top keywords.
- For example, if Topic 0 is about SEO services, revise and expand your content to include more in-depth articles or sections on SEO services, link-building, and digital marketing.
2. Use Similar Pages for Internal Linking
- For each page on your site, the model has recommended the top 3 most similar pages. You should:
- Link these similar pages together within your website. For example, on the page ‘https://thatware.co/’, you can add internal links to ‘https://thatware.co/competitor-keyword-analysis/’ and ‘https://thatware.co/link-building-services/’.
- This helps users navigate your site easily and improves your website’s SEO because search engines like websites that are well-linked internally.
3. Create Content Clusters
- Group pages that are highly similar into content clusters or categories. For example, if the pages ‘https://thatware.co/’ and ‘https://thatware.co/link-building-services/’ are highly similar, consider grouping them under a category like “SEO Services”.
- This will make your website more organized and user-friendly.
4. Cross-Promote Content
- Use the similarity information to cross-promote content. For example, if a user is reading the page on digital marketing services, you can recommend related content like the link-building services page to keep them engaged with your website longer.
Example Actions for Client
1. Say to Your Client: “We’ve discovered that the main topics on your site are SEO, services, web development, and marketing. To improve SEO, we need to make sure these keywords are frequently used and create content around these topics.”
2. Guide Your Client: “Based on the recommendations, we should create internal links between the most similar pages. For example, on the SEO services page, we’ll link to the competitor keyword analysis page and the link-building services page to help users find related content and improve search rankings.”
3. For Growth: “We’ll group similar pages together under categories like SEO Services and Content Marketing. This will help users easily find the information they need and keep them on the website longer.”