In the realm of web development and content management, identifying low-content pages on a website is crucial for maintaining quality and user experience. These pages typically offer minimal value to users and can contribute to a cluttered and unengaging online presence. Python, with its versatile libraries and tools, can be employed to automate the process of finding low-content pages on a website.
Using this Python tool we can identify the low-content pages of a website; after the analysis, we can improve the content on those pages so that their authority and keyword rankings improve.
Step 1:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin

def extract_urls(domain):
    # Send a GET request to the domain
    response = requests.get(domain)
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all anchor tags (<a>) in the HTML
    anchor_tags = soup.find_all('a')
    urls = []
    # Extract the href attribute from each anchor tag
    for tag in anchor_tags:
        href = tag.get('href')
        if href:
            # Check if the URL is relative or absolute
            parsed_url = urlparse(href)
            if parsed_url.netloc:
                # Absolute URL
                urls.append(href)
            else:
                # Relative URL, build an absolute URL from the domain
                absolute_url = urljoin(domain, href)
                urls.append(absolute_url)
    return urls

def analyze_urls(urls):
    word_counts = []
    for url in urls:
        # Fetch each page and strip the HTML down to plain text
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        text = soup.get_text()
        # Count the number of words
        word_count = len(text.split())
        word_counts.append((url, word_count))
    return word_counts

# Example usage
domain = 'https://www.minto.co.nz/'
urls = extract_urls(domain)
url_word_counts = analyze_urls(urls)
for url, word_count in url_word_counts:
    print(f"URL: {url}")
    print(f"Word Count: {word_count}")
    print()
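The script above requests every link it finds, including duplicates and links that point to other sites, and a single failed request will stop it. As a rough sketch (the helper names, the 10-second timeout, and the same-host filter are assumptions, not part of the original script), the run can be made more robust like this:

from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests

def filter_internal_urls(urls, domain):
    # Keep only unique URLs that belong to the same host as the start domain
    host = urlparse(domain).netloc
    return sorted({u for u in urls if urlparse(u).netloc == host})

def safe_word_count(url):
    # Return None instead of crashing when a page cannot be fetched
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    return len(soup.get_text().split())

With these helpers, analyze_urls could call safe_word_count and simply skip any URL that returns None.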
Edit the code and replace the domain as per the screenshot –
Put your desired domain here.
Now create a folder on the desktop –
And save the code as a Python file named urls.py in this folder –
Step 2:
Now open the Anaconda Prompt –
And go to that folder using the cd command –
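For example, assuming the folder sits on the Windows desktop and is named low-content-audit (both the path and the folder name are placeholders to adjust to your own setup):

cd C:\Users\<your-username>\Desktop\low-content-audit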
Now install the required packages one by one –
pip install beautifulsoup4
pip install requests
Now run the Python script –
python urls.py
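The script prints one entry per page. The output looks roughly like this (the domain and word counts below are illustrative placeholders, not real results):

URL: https://www.example.com/
Word Count: 850

URL: https://www.example.com/services
Word Count: 320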
We have extracted the word count of all pages.
Now copy the list into an Excel file –
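Instead of copying from the terminal by hand, a small addition at the end of urls.py can write the results straight to a CSV file that opens in Excel (the filename word_counts.csv is an arbitrary choice):

import csv

# Write the (url, word_count) pairs collected above to a CSV file
with open('word_counts.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['URL', 'Word Count'])
    writer.writerows(url_word_counts)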
Now manually analyse the list and delete the pages that already have more than 1,200 words.
Also remove irrelevant pages such as the contact-us, login and sign-up pages.
Then make a list of the pages with fewer than 1,200 words for further improvement; this filtering can also be done in code, as sketched below.
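As a sketch of that filtering step, appended to the end of urls.py (the 1,200-word threshold comes from the step above, while the list of URL keywords to skip is an assumption to adapt to your own site):

# Pages above this word count are considered to have enough content already
WORD_LIMIT = 1200
# URL fragments that usually mark pages not worth improving
SKIP_KEYWORDS = ['contact', 'login', 'sign-up', 'signup']

low_content_pages = [
    (url, count)
    for url, count in url_word_counts
    if count < WORD_LIMIT and not any(k in url.lower() for k in SKIP_KEYWORDS)
]

for url, count in low_content_pages:
    print(url, count)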
Remember that web scraping should be done responsibly and ethically, adhering to a website’s terms of use and respecting robots.txt guidelines. Also, websites’ structures may change, so periodic updates to your scraping script might be necessary.
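One way to respect robots.txt from the script itself is Python's standard-library urllib.robotparser; the sketch below reads the file once and returns a checker function (the default user agent of '*' is an assumption):

from urllib.robotparser import RobotFileParser
from urllib.parse import urljoin

def build_robots_checker(domain, user_agent='*'):
    # Read the site's robots.txt once and return a function that
    # reports whether a given URL may be fetched
    parser = RobotFileParser()
    parser.set_url(urljoin(domain, '/robots.txt'))
    parser.read()
    return lambda url: parser.can_fetch(user_agent, url)

URLs for which the checker returns False should simply be skipped before they are passed to analyze_urls.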