How To Find Low Content Pages Using Python


    In the realm of web development and content management, identifying low-content pages on a website is crucial for maintaining quality and user experience. These pages typically offer minimal value to users and can contribute to a cluttered and unengaging online presence. Python, with its versatile libraries and tools, can be employed to automate the process of finding low-content pages on a website.


    Using this Python tool we can identify the low-content pages of a website. After the analysis, we can improve the content on those pages so that their authority and keyword rankings improve.

    Step 1:


    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse, urljoin

    def extract_urls(domain):
        # Send a GET request to the domain
        response = requests.get(domain)
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all anchor tags (<a>) in the HTML
        anchor_tags = soup.find_all('a')
        urls = []
        # Extract the href attribute from each anchor tag
        for tag in anchor_tags:
            href = tag.get('href')
            if href:
                # Check if the URL is relative or absolute
                parsed_url = urlparse(href)
                if parsed_url.netloc:
                    # Absolute URL
                    urls.append(href)
                else:
                    # Relative URL, build an absolute URL from the domain
                    absolute_url = urljoin(domain, href)
                    urls.append(absolute_url)
        return urls

    def analyze_urls(urls):
        word_counts = []
        for url in urls:
            response = requests.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            text = soup.get_text()
            # Count the number of words on the page
            word_count = len(text.split())
            word_counts.append((url, word_count))
        return word_counts

    # Example usage
    domain = 'https://www.minto.co.nz/'
    urls = extract_urls(domain)
    url_word_counts = analyze_urls(urls)
    for url, word_count in url_word_counts:
        print(f"URL: {url}")
        print(f"Word Count: {word_count}")
        print()
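
    A small caveat about the script above: extract_urls() also collects links that point to other websites and can return the same URL more than once, so the word-count loop may fetch pages you do not care about. The sketch below is an optional refinement, not part of the original tutorial: it keeps only unique URLs on the same domain and adds a request timeout. The function name extract_internal_urls and the 10-second timeout are illustrative choices.

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urlparse, urljoin

    def extract_internal_urls(domain):
        # Fetch the page, giving up after 10 seconds instead of hanging
        response = requests.get(domain, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        base_netloc = urlparse(domain).netloc
        internal_urls = set()  # a set removes duplicate links automatically
        for tag in soup.find_all('a'):
            href = tag.get('href')
            if not href:
                continue
            absolute_url = urljoin(domain, href)
            # Keep the URL only if it points at the same host as the domain
            if urlparse(absolute_url).netloc == base_netloc:
                internal_urls.add(absolute_url)
        return sorted(internal_urls)

    If you want this behaviour, you can then call analyze_urls(extract_internal_urls(domain)) in place of analyze_urls(urls).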

    Edit the code and replace the example domain with the domain you want to analyse.

    Now create a folder on your desktop.

    And save the code as a Python file (for example, urls.py) in this folder.

    Step 2:

    Now open the Anaconda Prompt.

    And go to that folder using the cd command.

    Now install the required packages with pip, one by one:

    pip install beautifulsoup4

    pip install requests
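
    If you prefer, both packages can also be installed with a single command:

    pip install requests beautifulsoup4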

    Now run the Python script:

    python urls.py

    We have extracted the word count of all pages.

    Now copy the list into an Excel file.
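
    If copying and pasting feels tedious, the script can also write the results straight to a CSV file that Excel opens directly. This is an optional addition appended to the end of urls.py, not part of the original steps; the file name word_counts.csv is just an example.

    import csv

    # Write the (URL, word count) pairs produced by analyze_urls()
    # to a CSV file that can be opened in Excel.
    with open('word_counts.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['URL', 'Word Count'])
        writer.writerows(url_word_counts)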

    Now manually analyse the list and delete any page that has more than 1,200 words.

    Also remove irrelevant pages such as the contact, login, and sign-up pages.

    Then make a list of the pages with fewer than 1,200 words for further improvement.
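
    If you prefer, this manual clean-up can be roughly pre-filtered in code. The sketch below assumes the 1,200-word threshold mentioned above and a hypothetical EXCLUDE_KEYWORDS list for pages such as contact, login, and sign-up; adjust both to suit your own site.

    # Rough pre-filter for the manual review, using the 1,200-word threshold.
    WORD_LIMIT = 1200
    EXCLUDE_KEYWORDS = ['contact', 'login', 'signup', 'sign-up']

    low_content_pages = [
        (url, count)
        for url, count in url_word_counts
        if count < WORD_LIMIT
        and not any(keyword in url.lower() for keyword in EXCLUDE_KEYWORDS)
    ]

    for url, count in low_content_pages:
        print(f"Needs improvement: {url} ({count} words)")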

    Remember that web scraping should be done responsibly and ethically, adhering to a website’s terms of use and respecting robots.txt guidelines. Also, websites’ structures may change, so periodic updates to your scraping script might be necessary.
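
    Python's standard library includes urllib.robotparser, which can check whether a URL is allowed to be fetched before you request it. The snippet below is a minimal illustration of that check, not something the original script does; the page URL used here is purely an example.

    from urllib.robotparser import RobotFileParser

    # Read the site's robots.txt and check a URL before scraping it.
    robots = RobotFileParser()
    robots.set_url('https://www.minto.co.nz/robots.txt')
    robots.read()

    page = 'https://www.minto.co.nz/some-page/'  # example URL only
    if robots.can_fetch('*', page):
        print('Allowed to fetch this page.')
    else:
        print('Disallowed by robots.txt, skip this page.')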
