Site crawling is an area of search engine optimisation that is often overlooked, yet the ability of a robot to crawl and understand your website is essential. Nothing is more important than making your website robot friendly.
Here’s an example of a campaign where the client’s website was not robot friendly. It took months to rework every aspect of the pagination and site structure and to remove legacy content. However, once the dust settled, the rankings began to soar.
Now they are enjoying more than double the traffic they had this time last year. You can get similar results by fixing your crawl stats.
Results from fixing crawl issues
What is an SEO Spider?
No doubt if you watched the video and you’re new to SEO, the first term that probably confused you is a spider.
Simply put, this is the name developers give to the crawling bots that explore the internet. They move from page to page by following links. It’s the world wide web, so it makes sense to call them spiders.
There are directives for spiders, and classifications for them too. Some spiders are considered good bots, whilst others are bad bots. Some listen to your robots.txt directives, and others choose to ignore them.
But the important thing is simply to know that these crawling bots are not necessarily harmful. However, if they’re crawling too much of your website too quickly, they may cause your server to slow down. This brings us to the next step.
What is the Robots.txt Directive?
According to Wikipedia, the robots exclusion standard began in 1994, when Martijn Koster proposed it. This is backed up by robotstxt.org, which hosts the original standard document.
The trigger was a site crawler that inadvertently caused a denial of service on the server being worked on, which led to the need for robots directives.
For most of its history, robots.txt was not recognised as an internet standard by any standards body (the IETF only published it as RFC 9309 in 2022), but it has long been recognised by the webmaster community. More importantly, it’s something that Google supports and recommends.
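The basic format of a robots.txt for most of you will look something like this (the WordPress defaults described in the next section):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php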
If this is familiar to you, then you’ve probably seen a robots.txt before or looked it up whilst reading this article. But in the next section, I’m going to break down each of the directives and how they work.
A List of Robots Directives.
Here’s a list of directives, and what they do, as well as a handful of interesting and less commonly known facts:
User-Agent: This tells the robots whether the rules apply to them, and by default, most people will use an asterisk to represent all robots.
Disallow: This tells the robots whether this section of the website should be crawled. They can still ping to see if the page is there, but if they follow directives then they won’t visit.
Allow: This tells the robots that this page or section can be reached. So if you want to block off a section of your website, but allow a single file there, then you can do so.
Crawl-Delay: This tells the robots that they should wait a certain amount of time before crawling another page. Some search engines treat this as a delay before the next page is crawled, whilst others interpret it as the length of time before revisiting the website.
Sitemap: This is used to specify where your XML sitemap is located. This helps web crawlers to find your sitemap and crawl the website.
Disallow: /wp-admin/: This is a default for Yoast SEO on WordPress. It’s blocking off the wp-admin dashboard so that robots do not crawl your back end.
Allow: /wp-admin/admin-ajax.php: This is the default for Yoast, because an important WordPress file called admin-ajax.php is located inside the /wp-admin/ folder. If you’ve ever password protected your /wp-admin/ section then you’ll notice a 401 error due to the browser not reaching the admin-ajax.php file.
rel="nofollow": This is not something that is included in your robots.txt file, but is still part of the robots exclusion protocol. This tells the robot not to follow the link, but it can be ignored and is not useful for protecting secret files or documents.
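The directives listed above can be checked programmatically. The following is a minimal sketch using Python’s standard-library urllib.robotparser. The sample file contents, the example.com URLs, the Crawl-delay value of 10 and the ExampleBot user agent are all assumptions made for illustration. One caveat: Python’s parser applies the first matching rule in file order, whereas Google documents that the most specific (longest) rule wins, so the Allow line is listed first here to give the same answer under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, modelled on the WordPress rules described above.
# The Allow line comes first because Python's parser uses the first matching rule.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text (no network request) and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)
BOT = "ExampleBot"  # hypothetical user agent; it falls under the "*" group

if __name__ == "__main__":
    for path in ("/blog/", "/wp-admin/options.php", "/wp-admin/admin-ajax.php"):
        url = "https://www.example.com" + path
        verdict = "allowed" if rules.can_fetch(BOT, url) else "blocked"
        print(path, verdict)
    print("crawl delay:", rules.crawl_delay(BOT))
    print("sitemaps:", rules.site_maps())  # site_maps() needs Python 3.8+
```

A well-behaved bot would run a check like this before every request and sleep for the crawl delay between requests. Bad bots simply skip the check, which is why robots.txt cannot be relied on to protect anything.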
How often does Google crawl?
The truth is that Google crawls every website at a different rate, and that rate changes from day to day. However, you can use Google Search Console to help you here. It will tell you how often Google is crawling your own website, which is the figure that matters.
To do this, open up Google Search Console and select your property. Then select Crawl > Crawl Stats.
When you do this, you’ll be presented with the below graphs. The one you’re interested in is Pages crawled per day.
You’re looking for the average number of pages crawled per day, not the high or low values. There’s normally a great amount of variance between these values, but with a small website such as mine, you can expect these numbers:
Checking Crawl Stats in Google Search Console
How to get Google to crawl my website?
If you’re looking to get Google to crawl your website, then the most important thing is to make sure that it’s well connected and providing value.
First and foremost, if the website is full of thin and duplicate content, this is going to discourage Google from crawling your website. There’s no substitute for a good website design.
However, situations do arise from time to time that require a little more encouragement. Perhaps you’ve created a new page you want to rank. Maybe there are old pages in the index that you want removed, with their value attributed to the new page.
To do this, you’ve got a few options:
You can create a list of the pages you want Google to crawl, then create a fresh XML sitemap file. This can then be submitted on the Sitemaps page, which can be found in Crawl > Sitemaps.
Submitting sitemap in Google Search Console
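For the sitemap option, the file itself is simple to generate. The following is a minimal sketch using Python’s standard library. The example.com URLs are placeholders and build_sitemap is a hypothetical helper name, not part of any SEO tool. It produces only the required urlset, url and loc elements from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder list of pages you want Google to crawl.
pages = [
    "https://www.example.com/",
    "https://www.example.com/new-article/",
]
sitemap_xml = build_sitemap(pages)

if __name__ == "__main__":
    print(sitemap_xml)
```

Save the output as sitemap.xml on your server, then submit its URL in Google Search Console.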
The other way to get Google to crawl your website is to fetch a page and request indexing. This can be done by visiting Crawl > Fetch as Google.
This page is fairly self-explanatory. Simply submit the URL and select either Fetch, or Fetch and Render. Then a button appears saying Request Indexing. When you press this, another pop-up will ask how you want to do this.
The first option simply requests that Google crawl this page alone. So if you’re looking to submit a new article and get it quickly indexed, this is a great option. However, if you’re looking to recrawl an entire section of a website, select Crawl this URL and its direct links.
If your pages are full of good quality content, and you’ve got lots of links, then this isn’t going to be a problem. Site owners who struggle to get pages indexed need to address navigational and content problems.
Fetch & Render with GSC
When did Google last crawl my website?
To get a quick idea of when Google last crawled your website, you should check the crawl stats in Google Search Console. This can be found by visiting Crawl > Crawl Stats. This data can show some of Google’s behaviour, which is useful to see.
However, to get the best data, you need to check your server logs. This will provide you data on what user-agent crawled your website, as well as which pages. This is a lot more detailed than the crawl stats and errors from Google Search Console.
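To make the server log approach concrete, here is a minimal sketch that pulls Googlebot requests out of access-log lines and finds the most recent one. It assumes the common "combined" log layout used by Apache and Nginx, and the three sample lines are made up for illustration. Your own log format may differ, in which case the pattern needs adjusting. Note that a user-agent string can be spoofed, so this only shows requests claiming to be Googlebot.

```python
import re
from datetime import datetime

# Pattern for the "combined" access-log layout (an assumption about your server).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Yield (timestamp, path, status) for requests whose user agent mentions Googlebot."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match["agent"]:
            # Month names are parsed with the current locale (English assumed).
            when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
            yield when, match["path"], int(match["status"])


# Made-up sample lines standing in for a real access log.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Mar/2019:06:25:14 +0000] "GET /blog/ HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Mar/2019:07:01:02 +0000] "GET / HTTP/1.1" 200 1024 '
    '"-" "Mozilla/5.0 (Windows NT 10.0)"',
    '66.249.66.1 - - [11/Mar/2019:09:40:55 +0000] "GET /old-page/ HTTP/1.1" 404 312 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

hits = list(googlebot_hits(SAMPLE_LOG))
last_crawl = max(hits, key=lambda hit: hit[0])  # most recent Googlebot request

if __name__ == "__main__":
    print("Googlebot requests:", len(hits))
    print("Last crawl:", last_crawl[0].isoformat(), last_crawl[1], last_crawl[2])
```

With a real log you would pass the open file object to googlebot_hits instead of SAMPLE_LOG. The status codes in the output also show which crawled pages returned errors.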
Checking Crawl Stats in Google Search Console
What are crawl errors in Google Search Console?
A crawl error in Google is any page that returned a not found error or a server error to Google, or that was blocked when it shouldn’t have been. These types of errors are common, and they don’t seem to devalue your website. However, you should make efforts to reduce them. They can be found in Crawl > Crawl Errors.
If the page is returning a 404 not found, you have two options.
First, you can return a 410 Gone status instead. This will indicate to Google that you have purposefully removed the content, and that it’s not an error. The result is that the page will quickly be removed from the index.
Status Code 410 Response
Second, you can add a 301 redirect towards another relevant page. This will show Google that the page has permanently moved, and that it should treat the destination as the new home for that content. This is my personal favourite strategy, as it gives the best user experience.
A simple way to do this is to use the Redirection plugin on WordPress. It’s a fantastic tool that allows for easy importing and exporting of redirects. However, if you’re using Magento or another platform running on Apache, then you may wish to add the redirects to your .htaccess file.
Simply add the code as follows:
Redirect 301 /old-url-path/ https://www.example.com/new-url-path/
Identifying Crawl Errors in GSC
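The choice between 404, 410 and 301 can be summarised as a small decision function. This is a sketch only: the URL sets below are hypothetical, and in practice your server or CMS makes this decision rather than a Python script.

```python
# Hypothetical URL inventory for illustration.
REDIRECTS = {"/old-url-path/": "https://www.example.com/new-url-path/"}
REMOVED = {"/discontinued-product/"}
LIVE = {"/", "/new-url-path/"}


def response_for(path):
    """Return (status_code, redirect_target) for a requested path."""
    if path in REDIRECTS:
        return 301, REDIRECTS[path]  # moved permanently: send bots and users on
    if path in REMOVED:
        return 410, None             # gone: removed on purpose
    if path in LIVE:
        return 200, None             # normal page
    return 404, None                 # unknown URL: a genuine not found


if __name__ == "__main__":
    for path in ("/old-url-path/", "/discontinued-product/", "/", "/typo/"):
        print(path, response_for(path))
```

Any URL that lands in the final 404 branch but still receives traffic or links is a candidate for moving into one of the first two sets.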
How to perform a Site Crawl?
Performing a quick site crawl can provide great insight into your website. My two favourite tools for site audits are Ahrefs and Screaming Frog. Both of these tools provide awesome data to help you analyse your website.
However, for me Screaming Frog is ever-so-slightly better for site crawling. If you want to find out how to use this tool, then check out my Screaming Frog guide.
If you’re still too new to SEO to analyse a website yourself, then a tool like Ahrefs Site Audit can do much of the work for you. It reports on titles, headings, and even page speed performance.
Ahrefs site audit performance screen
Tools such as Screaming Frog do not parse all the HTML on your website. This prevents you from seeing all the potential issues on your website. If you want to fix all your on-site issues then you will need to use a variety of tools.
I demonstrate in the below video how to use Sitebulb to crawl a website and audit the broken links. It’s an interesting case where 18,000 not found errors were hidden from Screaming Frog, but could be picked up through Sitebulb’s HTML parsing.
When considering site indexing and crawl rates, you should always take HTML parsing into consideration.
What is Depth First Crawling?
If an SEO spider uses depth-first crawling, it starts from the homepage and collects all the URIs from that page. This means that the page you start on will be Level 0, and all links from that page are Level 1. The bot then follows one of these links and starts collecting all the Level 2 links from there, and so forth. Once the bot has exhausted all the links within that journey, it returns to the root and follows the next path.
depth first crawling diagram
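Depth-first order is easiest to see on a tiny site. The sketch below uses a made-up link graph (LINKS is hypothetical, with "/" as the homepage) and shows only the traversal order. It is not how any particular crawler schedules its requests.

```python
# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/blog/", "/shop/"],
    "/blog/": ["/blog/post-1/", "/blog/post-2/"],
    "/shop/": ["/shop/item-1/"],
    "/blog/post-1/": [],
    "/blog/post-2/": ["/"],  # links back to the homepage
    "/shop/item-1/": [],
}


def depth_first_crawl(start, links):
    """Return pages in the order a depth-first crawler would visit them."""
    order = []
    seen = set()

    def visit(page):
        seen.add(page)
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never crawl the same page twice
                visit(target)       # go deeper before trying sibling links

    visit(start)
    return order


dfs_order = depth_first_crawl("/", LINKS)

if __name__ == "__main__":
    print(dfs_order)
```

The whole /blog/ branch is exhausted before /shop/ is touched, which matches the journey described above.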
What is Breadth First Crawling?
This works by crawling your site horizontally, gradually working through all the links level by level. This means that the bot struggles to find your deepest pages quickly, but will retrieve the link levels in the correct order. This has some advantages over depth-first crawling in that it reveals site structure more clearly. However, for crawlers such as Googlebot, finding your deepest page can only occur by regularly crawling your entire website.
This means that crawling your website regularly becomes costly. By crawling your website based on depth-first, Google can find your deepest pages each time it enters your site from a different location. Since the internet is a series of connections between different sites, Google doesn’t always enter through the homepage.
breadth first crawling diagram
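For comparison, here is a minimal breadth-first sketch over a made-up link graph (LINKS is hypothetical, with "/" as the homepage). It records the level at which each page is first discovered, which is why this strategy reveals site structure so clearly.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINKS = {
    "/": ["/blog/", "/shop/"],
    "/blog/": ["/blog/post-1/", "/blog/post-2/"],
    "/shop/": ["/shop/item-1/"],
    "/blog/post-1/": [],
    "/blog/post-2/": ["/"],  # links back to the homepage
    "/shop/item-1/": [],
}


def breadth_first_crawl(start, links):
    """Return (page, level) pairs in the order a breadth-first crawler visits them."""
    levels = {start: 0}     # level at which each page was first discovered
    queue = deque([start])  # first in, first out: finish a level before the next
    order = []
    while queue:
        page = queue.popleft()
        order.append((page, levels[page]))
        for target in links.get(page, []):
            if target not in levels:
                levels[target] = levels[page] + 1
                queue.append(target)
    return order


bfs_order = breadth_first_crawl("/", LINKS)

if __name__ == "__main__":
    for page, level in bfs_order:
        print(level, page)
```

Every Level 1 page is visited before any Level 2 page, so the deepest pages are always reached last.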