How To Create The Perfect XML Sitemap

While Google and other search engines are getting better at finding pages on their own, sitemaps still help by effectively giving them more data about your web pages. There are many XML sitemap generators available for purchase, or even for free. They do what they’re supposed to – they crawl your site and spit out a properly formatted, static XML sitemap.

The only problem with these XML sitemap generators is that they don’t know which URLs should (or should not) be in the XML sitemap. Sure, you can tell some of them to obey directives like robots.txt and canonical tags, but unless your site is perfectly optimized, you’ll need to do some work by hand. It’s extremely rare to see a larger, database-driven site that’s perfectly optimized, so these tools rarely produce a flawless XML sitemap on their own. Parameters tend to create duplication and page bloat. Language directories sometimes get included improperly. Runaway folder structures tend to reveal process files and junk pages you didn’t know existed. The bigger and more dynamic the website, the higher the likelihood of unnecessary page URLs.

At the end of the day, the XML sitemap should only expose the URLs you actually want Google to see. Nothing more, nothing less. The XML sitemap file helps search engines get a data dump of all your important pages, supplementing what they haven’t found on their own. In return, those “unfound” pages can get found, crawled, and ideally ranked within the search results.
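For reference, the format itself is simple. Here’s a minimal sitemap per the sitemaps.org protocol (example.com is a placeholder, and optional fields like <lastmod> can be dropped):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2020-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products/blue-widget/</loc>
  </url>
</urlset>
```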

So what should be in the ultimate XML sitemap?

  • Only pages that return a 200 (page found) status. No 404s, redirects, 500 errors, etc. (For a quick way to script-check this rule and the canonical rule, see the sketch after this list.)
  • Only pages that are not blocked by robots.txt.
  • Only pages that are the canonical page.
  • Only pages that belong to the second-level domain (meaning, no subdomains – they should get their own XML sitemap).
  • In most cases, only pages of the same language (even if all your language versions live on the same domain, each language usually gets its own sitemap).
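If you’d rather script the spot-check, a few lines of Python cover the status rule and the canonical rule. This is a minimal sketch, assuming the requests library and a plain-text file of candidate URLs (urls.txt is a placeholder); the canonical check is a naive regex rather than a full HTML parse, and robots.txt checking is left out for brevity (Python’s built-in urllib.robotparser can handle that):

```python
import re
import requests

# Candidate URLs, one per line (urls.txt is a placeholder filename).
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

keep = []
for url in urls:
    # Rule 1: only URLs that answer 200 directly (no redirects, 404s, or 500s).
    resp = requests.get(url, allow_redirects=False, timeout=10)
    if resp.status_code != 200:
        continue
    # Rule 3: only pages that are their own canonical (naive regex check).
    match = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)', resp.text
    )
    if match and match.group(1).rstrip("/") != url.rstrip("/"):
        continue
    keep.append(url)

print(f"{len(keep)} of {len(urls)} URLs look sitemap-worthy")
```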

So, in the end, the perfect XML sitemap file should 100% mirror what – in a perfect world – Google crawls and indexes. Ideally, your website has a process for building these perfect sitemaps routinely without your intervention. As new products or pages come in and go out, your XML sitemap should simply overwrite itself. However, the rest of this post explains how to create a one-off XML sitemap for those occasions where a prototype sitemap is needed, or as a quick fix for a broken sitemap generator.

What Tools Do You Need to Create an XML Sitemap?

Screaming Frog is an incredibly powerful site crawler, ideal for all kinds of SEO tasks. One of its many features is the ability to export perfectly written XML sitemaps. If your export is large, it properly breaks the sitemaps up and includes a sitemapindex.xml file. While you’re at it, you can even export an image sitemap. Screaming Frog is free for small crawls, but if your website is larger than 500 URLs, pony up for the paid version. This is one tool you’ll be happy you paid for if you do SEO work. It’s a mere £99 per year (around $130).
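For reference, a sitemap index is just a small XML file that points at the individual sitemap files. A minimal example per the sitemaps.org protocol, with placeholder URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>
```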

Once you install it on your desktop, you’re just about ready to go. If you’re working on extremely large sites, you’ll probably need to expand its memory allocation. Out of the box, Screaming Frog allocates 512MB of RAM for its use. As you can imagine, the more you crawl, the more memory you’ll need.
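In older versions, the allocation lives in ScreamingFrogSEOSpider.l4j.ini, a small Java launcher config file in the install directory (newer releases expose the same setting in the UI, and the exact filename and path can vary by version and OS). The file holds JVM arguments, so the change is simply replacing the default -Xmx512M line with a larger heap, for example:

```
-Xmx4g
```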

Download Screaming Frog: https://www.screamingfrog.co.uk/seo-spider/

Now that Screaming Frog is installed and super-charged, you’re ready to go.

Setting Up For The Perfect Crawl

Screaming Frog looks like a lot, but it’s very easy to use. In the Configuration > Spider settings, you have several checkboxes you can use to tell Screaming Frog how to behave. We’re trying to get Screaming Frog to emulate Google, so we want to check a few boxes here.

Check the following boxes before you run your crawl (examples of the directives behind these settings follow the list):

  • Respect Noindex
  • Respect Robots.txt
  • Do not crawl nofollow
  • Do not crawl external links
  • Respect canonical
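As a refresher, here’s what the directives behind the noindex, canonical, and robots.txt settings look like in the wild. These snippets are illustrative only, with placeholder URLs and paths:

```html
<!-- "Respect Noindex": pages carrying this meta tag get dropped from the crawl results -->
<meta name="robots" content="noindex">

<!-- "Respect Canonical": only the canonical target is kept, not the page pointing at it -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
```

And the robots.txt equivalent for “Respect Robots.txt”:

```
User-agent: *
Disallow: /search/
```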

At this point, I recommend crawling the site.  Consider this the first wave.

How to Examine Crawl Data for Sitemap Creation

Export the full site data from Screaming Frog. We’re going to evaluate all the pages in Excel or Google Sheets. We took steps to crawl only what search engines can access on their own, but we still want to make sure there aren’t pages they’re seeing that we didn’t know about. You know, those ?color= parameters on eCommerce sites, or /search/ URLs that maybe you didn’t want indexed. I like to sort the URL column A-Z so I can quickly scan down and spot duplicate URLs.
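If the export is too big to comfortably eyeball, a few lines of Python can surface the usual suspects. A rough sketch, assuming the pandas library and that you saved the Internal tab export as internal_all.csv (the filename and the column headers are assumptions; check what your version actually exports):

```python
import pandas as pd

# Load the Screaming Frog export (filename and column names are assumptions).
df = pd.read_csv("internal_all.csv")

# URLs carrying query parameters: the classic duplication culprits.
params = df[df["Address"].str.contains(r"\?", na=False)]
print(params["Address"].head(20))

# Anything that isn't a straight 200 doesn't belong in the sitemap.
bad = df[df["Status Code"] != 200]
print(bad[["Address", "Status Code"]].head(20))
```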

This data is super valuable not only for creating a strong XML sitemap, but also for going back and blocking pages on your website that need tightening up. Unless your site is 100% optimized, this is a valuable, hard look at potentially runaway URLs. I recommend doing this crawl and reviewing the data at least once per quarter.

Scrubbing Out Bad URLs

A “bad” URL, in this case, is simply one we don’t want Google to see. Ultimately, we need to get these additional exclusions into Screaming Frog. At this point, you have two options:

  1. Upload your cleaned URL list back into Screaming Frog, or
  2. run a new Screaming Frog crawl with the exclusions built in.

Option 1: Using your spreadsheet, delete the rows containing URLs you don’t want. Speed up the process by using Excel’s filters (i.e., contains, does not contain, etc.). The only column of data we care about is the one with your URLs. Also, use Excel’s filters to show only the 200 (page found) URLs. The time it takes to audit this spreadsheet depends on how many URLs you have, the different URL conventions in play, and how comfortable you are with Excel.

Next, copy the entire column of “good” URLs and return to Screaming Frog. Start a new crawl using the Mode > List option. Paste your URLs and start your crawl. Once all the appropriate URLs are back in Screaming Frog, move on to the next section.

Option 2: Now that you know which URLs you want to block, you can do it with Screaming Frog’s exclude feature. Configuration > Exclude pulls up a small window for entering regular expressions (regex). Not familiar with regex? No problem, it’s really quite easy, and Screaming Frog gives you great examples you just need to bend to your will: https://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#exclude.
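A couple of illustrative patterns, one regex per line in the exclude window (the paths are hypothetical, and Screaming Frog matches each pattern against the full URL):

```
.*\?.*
.*/search/.*
```

The first keeps out any URL with a query string; the second keeps out anything under a /search/ directory.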

(Alternatively, you can use the include function if there are certain types of URLs or sections you specifically want to crawl.  Take the directions above, and simply reverse them.) Once you have a perfect crawl in Screaming Frog, move on to the next section below.

Export The XML Sitemap

At this stage, you’ve either chosen Option 1 or Option 2 above.  You have all the URLs you want to be indexed loaded in Screaming Frog.  You just need to do the simplest step of all – export!

In Screaming Frog, the export lives under the Sitemaps menu. The export dialog gives you some extra checkboxes to consider, like whether to include noindexed or canonicalised pages, and whether to write out fields like last modified, priority, and change frequency. A very smart set of selections if you ask me.

This helps you really refine what goes into the XML sitemap in case you missed something in the steps above. Simply select what makes sense to you and run the export. Screaming Frog will generate the sitemaps to your desired location, and you’re ready to upload them to your website. Don’t forget to submit the new sitemaps in Google Search Console and Bing Webmaster Tools.

(If you need some clarity on what these definitions are, visit https://www.sitemaps.org/protocol.html)

You’re all set. Remember, this is just a snapshot of your ever-changing site. I still fully recommend a dynamic XML sitemap that updates as your site changes. Hope this was helpful.

 

