⭐What are XML Sitemaps?
An XML Sitemap is a file that lists all the webpages and resources of a website
in a structured format. This file is primarily used by search engine bots for crawling.
XML, or “Extensible Markup Language”, is a standard format for organizing and encoding
data.
Apart from simply listing a site’s URLs, a sitemap also contains information
on:
1. The last time each page was updated.
2. The frequency of changes to each page.
3. The priority of crawling each page relative to the others.
This ultimately helps search engines understand your website’s structure and improves the
speed and accuracy of crawling.
XML sitemaps are particularly useful for larger websites, helping search engines navigate,
discover and index the most valuable pages efficiently and keep them updated.
⭐Why Automate XML Sitemaps?
Now that we understand the importance of XML Sitemaps, it is imperative
to keep them updated frequently to ensure your website stays current in Google’s
index as you make changes to it.
Although updating sitemaps once a month may suffice for small websites, larger
websites with complex, dynamic content may need their XML sitemaps updated weekly or even more frequently.
Having said that, here are the reasons why an SEO would want to automate updating the XML Sitemap file.
- While Using a Non-User Friendly CMS
Although platforms like WordPress and Shopify offer solutions that keep the sitemap updated, a custom-built CMS can make this task challenging. The technique shared in this blog works regardless of the platform.
- More Control and Customization over the Sitemap
Even though many platforms may offer solutions to generate dynamic sitemaps, they might not give much control over what’s being updated. Sometimes we may need to exclude some links, adjust change frequency or even remove entire subfolders. The technique shared in this blog gives you full control over your sitemaps.
- Server Files Not Accessible
In many cases, the webmaster may not have access to the server to update the sitemap files on their own. Here we will use a reverse proxy to overcome this issue. However, if one has proper access to the server folders, then this is not necessary.
- Error Prevention
With more control over your automation, one can prevent errors and inconsistencies in the sitemap’s URLs and metadata, which could otherwise cause indexing issues and waste the website’s crawl budget.
So let’s get started. Here is what you will need to automate your sitemap:
- The paid version of the Screaming Frog SEO Spider
- An IT/tech team that can implement reverse proxies
- A dedicated machine (this isn’t necessary but it’ll make your life a lot easier)
👉What are the Actual Steps for Automating XML Sitemaps?
- Scheduling an Automated Crawl
- Establishing a Central Location for Storing the Screaming Frog Output Files
- Implementing a Reverse Proxy
- Final Testing
👉Things You Would Need
⭐Screaming Frog SEO Tool Full Version
Screaming Frog is one of the most reliable SEO tools for setting up a crawl and getting a host of data and insights into the on-page and technical SEO details of a website. But the main reason you need it here is its scheduled crawl feature.
⭐An IT Guy/Team for Implementing Reverse Proxies
If you are working by yourself, then you might have access to the server files. However, in most organizations, IT and dev teams are not really comfortable giving SEOs access to server files. Hence, you might need a developer to implement your reverse proxies.
⭐A Dedicated Machine
In order to run scheduled crawls, the machine needs to stay on. If you set up a scheduled crawl for the weekend and shut the computer down on Friday, that’s essentially saying goodbye to your automation.
👉How to Automate your XML Sitemaps
⭐Setting Up your Automated Crawl
Scheduling a crawl is one of the best features of this SEO tool. You can run scheduled crawls anywhere from daily to weekly. You should also consider whether you require a custom-built crawl settings file.
This is primarily determined by whether or not you wish to customise the contents of your sitemap. The majority of my clients require this. In some circumstances, this is because we have a sitemap index and thus a separate settings file for each segmented XML. In other circumstances, I want to bake in certain customisations.
⭐Setting Up a Scheduled Crawl:
Go to File > Scheduling, create a task and give it a name. Set the frequency and timing; you can also use the description field to note the frequency.
Running in headless mode is required for exports, so make sure that box is checked. You should also overwrite files in output so that your filename does not change; a consistent file path is required for the reverse proxy to function. Finally, save the crawl and export the XML sitemap.
Two more quick points to consider:
- If you are setting up a sitemap index with nested sitemaps, you can set up individual crawls with includes and excludes to segment them the way you want.
- Go to the sitemap export configuration and choose your settings there before saving the crawl settings file. This ensures the export format is customised; otherwise it will use default values for change frequency and priority.
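For reference, the scheduler simply runs the SEO Spider headlessly under the hood, so you could also wire a roughly equivalent command into Windows Task Scheduler or cron yourself. The sketch below is based on the SEO Spider’s command-line interface; the URL, config file and output folder are placeholders, so check the flags against your installed version.

```
rem Sketch: headless crawl that overwrites previous output and exports an XML sitemap
rem (Windows shown; the paths and URL below are placeholders)
"C:\Program Files (x86)\Screaming Frog SEO Spider\ScreamingFrogSEOSpiderCli.exe" ^
  --crawl https://example.com ^
  --headless ^
  --config "Z:\client-name\configs\sitemap-crawl.seospiderconfig" ^
  --output-folder "Z:\client-name\sitemaps" ^
  --overwrite ^
  --create-sitemap
```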
👉Set Up a Central Hub for Storing Screaming Frog Export Files
To enable the reverse proxy, ensure that your scheduled crawl saves its files to a specific directory and, as previously indicated, that the ‘overwrite files’ option is enabled rather than date-stamping your files. This server location will also need to be web-accessible.
So if your file path on the server is Z:\client-name\sitemaps\sitemap.xml, it should also render at example.com/client-name/sitemaps/sitemap.xml.
👉Creating a Reverse Proxy
The reverse proxy is basically the bridge between the dynamically generated Screaming Frog sitemap file and your website.
I won’t be covering the details of setting up a reverse proxy in this section, as you can find them in developer docs from qualified resources. In brief, what we are doing is rerouting a request for /sitemap.xml to a different location: the URL stays the same, but the rendered content no longer comes from the server’s root folder; it comes from the alternative file you’re dropping with the crawl.
Here’s what a reverse proxy looks like in the web.config file.
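The sketch below assumes the site runs on IIS with the URL Rewrite module installed and that the crawl output folder is web-accessible on the same server (as in the Z:\client-name\sitemaps example above); if the exported file lives on a different host, Application Request Routing would also be needed to proxy the request. The rule name and paths are placeholders.

```
<!-- Sketch: rewrite /sitemap.xml to the file exported by the scheduled crawl -->
<!-- "client-name" and the target path are placeholders from the example above -->
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="Sitemap reverse proxy" stopProcessing="true">
          <match url="^sitemap\.xml$" />
          <action type="Rewrite" url="/client-name/sitemaps/sitemap.xml" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
```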
However, if one is looking to create a reverse proxy in the .htaccess file, the syntax is different.
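A comparable sketch for Apache, assuming mod_rewrite is enabled and the exported file sits on the same server at the placeholder path below (if it lives on another host, you would use the [P] flag with mod_proxy instead):

```
# Sketch: serve /sitemap.xml from the Screaming Frog output folder
RewriteEngine On
RewriteRule ^sitemap\.xml$ /client-name/sitemaps/sitemap.xml [L]
```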
Lastly, it is beneficial to drop a robots.txt file in the same folder where the Screaming Frog files are stored and set up a reverse proxy for that too. This lets you change robots.txt alongside the sitemap without touching the server, further diminishing the need for developer help in the automation process.
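If you take that route, the same placeholder pattern from the .htaccess sketch above applies; one extra rule in the same file is all it takes:

```
# Sketch: serve robots.txt from the same Screaming Frog output folder
RewriteRule ^robots\.txt$ /client-name/sitemaps/robots.txt [L]
```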
👉Testing is Key!
You’ll want to test here because you’ll be influencing how the production site works.
We set up the reverse proxy in a staging environment initially, but if you don’t have access to one, I’d propose coordinating with the developers so you can test right away and roll the changes back as soon as problems arise. I always open the file the SEO Spider produced, make a small adjustment, and then reload the site’s XML sitemap to confirm the change shows up.
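A quick command-line sanity check also helps confirm the proxied URL is serving the freshly exported file; the hostname and local path below are placeholders for your own setup:

```
# Confirm the public sitemap URL resolves and returns a 200
curl -I https://example.com/sitemap.xml

# Compare what the site serves with what Screaming Frog exported
curl -s https://example.com/sitemap.xml | diff - /path/to/sitemaps/sitemap.xml
```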
⭐Final Thoughts
Overall, this project should not take more than a few hours to set up. Final testing and debugging may take a bit longer but that’s time well spent.