How To Find Low Content Pages Using Python

In the realm of web development and content management, identifying low-content pages on a website is crucial for maintaining quality and user experience. These pages typically offer minimal value to users and can contribute to a cluttered and unengaging online presence. Python, with its versatile libraries and tools, can be employed to automate the process of finding low-content pages on a website.

Using this Python tool, we can identify the low-content pages of a website. After the analysis, we can improve the content on those pages so that their authority and keyword rankings improve.

Step 1:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def extract_urls(domain):
    # Send a GET request to the domain
    response = requests.get(domain)
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all anchor tags (<a>) in the HTML
    anchor_tags = soup.find_all('a')
    urls = []
    # Extract the href attribute from each anchor tag
    for tag in anchor_tags:
        href = tag.get('href')
        if href:
            # Check if the URL is relative or absolute
            parsed_url = urlparse(href)
            if parsed_url.netloc:
                # Absolute URL
                urls.append(href)
            else:
                # Relative URL, construct an absolute URL using the domain
                absolute_url = urljoin(domain, href)
                urls.append(absolute_url)
    return urls

def analyze_urls(urls):
    word_counts = []
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        # Count the number of words on the page
        word_count = len(text.split())
        word_counts.append((url, word_count))
    return word_counts

# Example usage
domain = 'https://www.minto.co.nz/'
urls = extract_urls(domain)
url_word_counts = analyze_urls(urls)

for url, word_count in url_word_counts:
    print(f"URL: {url}")
    print(f"Word Count: {word_count}")
    print()

Edit the code and replace the domain as per the screenshot, putting your desired domain in the domain variable.

Now create a folder on your desktop and save the code as a Python file (urls.py) in that folder.

Step 2:

Now open the Anaconda Prompt and navigate to that folder using the cd command.
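For example, assuming the folder on the desktop is named low-content-pages (use whatever name you actually gave it):

cd Desktop\low-content-pages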

Now install the required packages with pip, one by one:

pip install beautifulsoup4

pip install requests

Now run the Python script:

python urls.py

We have extracted the word count of all pages.

Now copy the list into an Excel file.
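Alternatively, instead of copying by hand, a few extra lines at the end of the same script can write the results to a CSV file that opens directly in Excel. This is a minimal sketch; the filename word_counts.csv is just an example and can be changed.

import csv

# Save the (url, word_count) pairs collected above to a CSV file
# so the list can be reviewed and sorted in Excel.
with open('word_counts.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['URL', 'Word Count'])  # header row
    writer.writerows(url_word_counts)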

Now manually analyse the list and remove any landing pages with more than 1200 words.

Also remove irrelevant pages such as the contact us, login, and sign-up pages.

Then make a list of the pages below 1200 words for further improvement.
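This filtering can also be done in Python rather than by hand. The sketch below assumes the url_word_counts list from the script above; the 1200-word threshold and the keywords used to skip pages like contact or login are only examples and can be adjusted.

# Keep only pages under the word-count threshold, skipping obviously
# irrelevant pages such as contact, login, and sign-up pages.
THRESHOLD = 1200
SKIP_KEYWORDS = ['contact', 'login', 'sign-up', 'signup']

low_content_pages = [
    (url, count)
    for url, count in url_word_counts
    if count < THRESHOLD and not any(word in url.lower() for word in SKIP_KEYWORDS)
]

for url, count in low_content_pages:
    print(f"{url}: {count} words")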

Remember that web scraping should be done responsibly and ethically, adhering to a website’s terms of use and respecting robots.txt guidelines. Also, websites’ structures may change, so periodic updates to your scraping script might be necessary.
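If you want the script itself to respect robots.txt, Python's standard library includes urllib.robotparser. The sketch below shows one way it could be added before the analysis step; the '*' user agent is just an example.

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Read the site's robots.txt and keep only the URLs we are allowed to fetch.
robot_parser = RobotFileParser()
robot_parser.set_url(urljoin(domain, '/robots.txt'))
robot_parser.read()

allowed_urls = [url for url in urls if robot_parser.can_fetch('*', url)]
url_word_counts = analyze_urls(allowed_urls)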
