
How to Find and Fix “Index Bloat” SEO Issues

What Is Index Bloat?

Index bloat is when a website has unnecessary pages in the search engine index. It forces Google and other retrieval systems to spend more processing time than needed filtering out irrelevant or duplicate results, and it can lead to a poor experience for users. Index bloat often comes from paginated pages, processing files, and duplicate pages, and is commonly found on news and eCommerce sites.

It’s ideal to have a very clean indexation in search engines – no bloat, no noise, and only the things you want Google to see (and serve). Just like a database, bloat can slow processing and keep data from being indexed correctly.

If a website is a mess of URLs and duplicate content, Google will throw their hands up in the air out of frustration. This is a bad spot to be in. You’ll find your traffic and rankings drop while your indexation becomes bloated. Your crawl rate (which we’ve found correlates with traffic levels) will be curbed. It could seem all very sudden, or it could be gradual. Every case is different – but it’s always a nightmare. At the end of the day, the World Wide Web is full of junk, so it’s important to avoid adding to the problem Google Search already has to face.

Keeping track of website changes is critical to SEO. The other day I peeked into our own Google Webmaster Tools account and saw something pretty alarming in the “Index Status” report:

[Image: Crawl Index – Index Status report showing the spike]

Yikes! Our site is relatively small, so when a spike like this occurs, it’s much easier to identify. We’re lucky in that regard. In this case, it was easy to tie the 12/14/2014 spike to a new website theme we had implemented. Something, somehow, got past me while I was developing the site. Time for some detective work…

Our core website is relatively small – the majority of the site is made up of blog posts. But that doesn’t mean I trust Google to never struggle. I’ve always liked the crawl budget and frequency we’ve received, so I certainly don’t want noise to slip in and eat away at that crawl budget. Remember, Google comes to a site with a set amount of fuel. If you let them waste that fuel on duplicate and junk pages, you risk adding noise to their processing. Noise = bad.

Difference Between Google.com and Google Search Console

While Google Search Console is known to have more bugs than you’d expect Google would allow, the truth is, it’s typically more accurate. Remember – Google doesn’t index everything it crawls. Nor does it show everything it indexes. Case in point: a site: search in Google shows “About 536” pages in the index, while Search Console shows 512:

[Image: SERPs 1 – site: search result count]

But if you go to the last page of results, that number changes. Why? No clue. It’s always been flaky like that. All the more reason to rely on Google Webmaster Tools data over Google.com data.

[Image: SERPs 2 – result count on the last page of results]

The Spike Was Caught, Now What?

In this case, a simple site: command in Google showed me something I didn’t expect. After digging through the results, I found a ton of URLs like:

https://www.greenlanemarketing.com/blog/2014/07/how-entities-and-knowledge-cards-can-help-with-query-intent/?replytocom=39447#respond

Yikes. A rogue parameter. Duplicate content. There’s my index bloat. The #respond isn’t the issue so much (since Google doesn’t index anything past a # in the URL, though that’s a separate argument in the mobile world); the real issue was the ?replytocom= parameter. This new theme creates the link whenever anyone leaves a comment (so you can reply directly to that comment):

[Image: Reply Link – the comment reply link generated by the theme]
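As a side note, if you want to isolate URLs like these in the SERPs yourself, pairing the site: operator with inurl: narrows the results down to the offending parameter – something like:

site:greenlanemarketing.com inurl:replytocom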

But wait a minute… before launch I added wildcards to my robots.txt file to make sure parameters didn’t get crawled:  https://www.greenlanemarketing.com/robots.txt 

[Image: My robots.txt file]
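I won’t reproduce the file verbatim here, but the relevant piece is a wildcard Disallow on query strings. A simplified sketch of that kind of rule (not my exact file) looks like this:

User-agent: *
# keep crawlers out of any URL containing a query string
Disallow: /*?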

The actual file holds my default WordPress robots.txt directives (when not using a cart or any custom coding); I tweak them as necessary for each install. And, as expected, these directives did block spiders from crawling any /?replytocom=39447#respond URLs. Yet, there they are, indexed – but not in all their glory:

[Image: Indexation – the indexed URLs showing the robots.txt description message]

See the description Google put in? “A description for this result is not available because of the site’s robots.txt – learn more.” If you’re an SEO, you’ve surely seen that before. The truth is, a robots.txt file keeps a page from being crawled but does not keep it from being indexed. An oftentimes confusing concept.


If a web document links to a page that is blocked by robots.txt, the spider still discovers the page because of the link. PageRank still flows through the link. Except Google doesn’t see what’s actually on that page – so it usually doesn’t rank well at all.

The problem is I have seen these blocked pages rank, often in place of a better domain result. Google isn’t perfect. An SEO needs to mitigate those situations. An SEO also needs to maximize the flow of PageRank and keep it flowing to pages that matter. For older SEOs, I’m not necessarily talking about PageRank Sculpting here – but more about keeping the index and crawl budget optimized.

So here I stand, with a load of URLs in the index that I really don’t want there. I have the pages blocked with robots.txt, so the URL removal tool in Google Webmaster Tools will do the trick. But until I plug up that link, they will keep getting indexed. With a little PHP coding, I added a nofollow to that link in the post template:

/**
 * Add a rel="nofollow" to the comment reply links.
 */
function add_nofollow_to_reply_link( $link ) {
	return str_replace( '")\'>', '")\' rel=\'nofollow\'>', $link );
}

add_filter( 'comment_reply_link', 'add_nofollow_to_reply_link' );

Next, I ran a crawl using Screaming Frog, set to follow “nofollow” URLs and ignore robots.txt. I wanted to see everything, especially these “replytocom” URLs. After the crawl, I found more than 700 of them – about 450 more than Google was showing with the site: operator. That’s not surprising, though; Google doesn’t show everything it knows about. It will, however, make cleanup less than perfect.

I decided to remove only the URLs Google was showing, which, after running the site: operator, was around 60. Though Matt Cutts warns us against overdoing URL removal, 60 seemed reasonable to me. Back in Google Webmaster Tools, go to Google Index > Remove URLs. Enter your URLs one by one – to my knowledge, there’s no way to bulk upload. You can also access the tool directly at https://www.google.com/webmasters/tools/removals. Protip: use Scrapebox to scrape the results out of the SERPs and export them for an easier time collecting URLs.

[Image: URL Removal tool in Google Webmaster Tools]

Within a few hours, these specific URLs should be out of the index.

However, that doesn’t solve the entire issue – just a small set of URLs. There are still indexed pages that Google didn’t show to Scrapebox – pages now blocked by robots.txt that Google will still ping once in a while. Google has a long memory, so these will stay in the index and my “Index Status” count will stay high.

The ultimate answer: the <meta name="robots" content="noindex,nofollow"> tag. These duplicate URLs gotta go, and I don’t want to wait for a canonical tag to do its “slow” thing. I need to get the aforementioned meta tag on only https://www.greenlanemarketing.com/resources/articles/how-entities-and-knowledge-cards-can-help-with-query-intent/?replytocom=39447#respond and not https://www.greenlanemarketing.com/resources/articles/how-entities-and-knowledge-cards-can-help-with-query-intent/. Typically this would require some conditional coding, but Yoast’s SEO WordPress plugin allows for it.
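If you’re not running Yoast, the conditional coding route is simple enough to sketch. Something along these lines – purely illustrative, with a made-up function name, and not what’s actually running on our site since Yoast handles it for us – prints the tag only when the replytocom parameter is present:

/**
 * Illustrative sketch (hypothetical function name): output a noindex,nofollow
 * robots meta tag on any URL carrying a ?replytocom= parameter.
 */
function glm_maybe_noindex_replytocom() {
	if ( isset( $_GET['replytocom'] ) ) {
		echo '<meta name="robots" content="noindex,nofollow">' . "\n";
	}
}
add_action( 'wp_head', 'glm_maybe_noindex_replytocom', 1 );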

So now it’s in place, but for Google to see this tag, I have one crucial step left – I need to remove the robots.txt directive that stops these pages from being crawled. That’s right – the robots.txt I was so proud of is actually keeping the noindex tag from being seen, and keeping the unwanted pages in the index.
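In robots.txt terms, that just means the wildcard rule sketched earlier (or whichever line matches these URLs) has to come out, so Googlebot can actually fetch the pages and see the noindex tag:

# remove (or narrow) a blanket rule like this one
Disallow: /*?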

The Waiting Game

Now we wait. All the changes have been made, so it’s a matter of waiting for the spiders to come, discover the changes, and send back the “noindex” information, and for the processing center to start kicking these updates through the data centers.

Fetch as Google within Google Webmaster Tools offers a “Submit to index” button that tends to get things recrawled and re-evaluated quickly, but it works on a URL-by-URL basis – far from quick unless you have a small website.

I plan on waiting a month or two. We’re a fairly small site – 60 days seems reasonable. There are hundreds of duplicate pages here that I suspect Google is not in a hurry to go back to.  A larger site usually gets a higher crawl budget but has a higher volume of pages – in my experience, a month is a good rule of thumb for those sites too.

2 Months Later

[Image: GWT – Indexation Being Controlled]

Well, Google threw me a curveball. Since January 4th, just when I expected the “Index Status” report to start trending down, Google found even more URLs that I had missed. I won’t go into what they were (since it’s not really important to this piece), but we peaked at 609 URLs indexed (instead of the ~200 we were supposed to have).

But as of today (3/9/2015), we are down to 333 indexed pages. Only a little longer and we’ll be back to where we belong.

Summary

Indexation can be controlled with analysis and some smart tweaking, though it really does test your patience as an SEO. While this is a small example, I’ve seen indexation run wild on sites with hundreds of thousands of URLs, where a little management gave Google a whole new appreciation of the website and its content. We have a case study on how this translated into a huge return in traffic. I’ve outlined our steps in hopes you can use this information to solve your own indexation problems. Good luck – it’s completely worth the hard work.
