Why did we create this Google robots.txt tester alternative?
Google has added a robots.txt report to Search Console and, at the same time, announced that it is retiring its long-standing Robots.txt Tester tool. Barry Schwartz reported the change on November 15, 2023. It alters how webmasters check their sites' interaction with Google's crawlers.
You can continue accessing Google's legacy robots.txt tester until December 12, 2023, by clicking on this link.
What's a Robots.txt file?
A robots.txt file is a simple text file located in the root directory of a website. It tells web crawlers which pages or files they may or may not request from your site and, for crawlers that honor the relevant directive, how long to wait between requests. It is primarily used to keep your site from being overloaded with requests from crawlers and spiders; it is not a mechanism for keeping a web page out of Google. Key use cases include:
- Directing search engines or other web crawlers on which parts of your site should or should not be crawled.
- Preventing certain pages from being crawled to avoid overloading your site’s resources.
- Helping search engines focus their crawling on your important pages by excluding low-value URLs.
- Discouraging crawlers from accessing sensitive areas of your site (though not a security measure).
- Assisting in SEO by preventing search engines from crawling duplicate pages or internal search results pages.
Robots.txt Syntax Example
Understanding the syntax of a robots.txt file is key to effectively managing how search engines or other web spiders interact with your site. Here's a basic example to illustrate common directives and their usage:
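A minimal sketch of such a file is shown here, embedded in a short Python script so the rules can be checked with the standard library's urllib.robotparser. The www.example.com host and sitemap URL are placeholders, not values from this article. Keep in mind that urllib.robotparser applies the first matching rule in a group, whereas Google uses the most specific (longest) match, so results can differ when Allow and Disallow paths overlap.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; the host and sitemap URL are placeholders (assumptions).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Googlebot" has no group of its own here, so the "*" group applies.
private_ok = parser.can_fetch("Googlebot", "https://www.example.com/private/report.html")
public_ok = parser.can_fetch("Googlebot", "https://www.example.com/public/index.html")
delay = parser.crawl_delay("Googlebot")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(private_ok, public_ok, delay, sitemaps)
```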
In this example:
- The User-agent: * directive applies the following rules to all crawlers.
- Disallow: /private/ tells crawlers to avoid the /private/ directory.
- Allow: /public/ explicitly permits crawling of the /public/ directory.
- The Sitemap: directive points crawlers to the site's sitemap for efficient navigation.
- Crawl-delay: 10 asks crawlers to wait 10 seconds between requests, reducing server load. Not every crawler honors it; Googlebot, for example, ignores this directive.
This simple example shows how a robots.txt file is structured and how each directive affects crawler behavior. The directives that can be included in a robots.txt file are listed below.
- User-agent: Names the crawler that the rules which follow apply to.
- Disallow: Tells crawlers not to crawl the given paths, such as private or sensitive folders. It does not by itself keep a URL out of the index.
- Allow: Ensures important pages or directories are crawled within restricted areas.
- Sitemap: Helps search engines find and index all eligible pages on your site.
- Crawl-delay: Reduces server load by limiting the rate of crawling.
Less Common Directives
- Noindex: Indicates that a page should not be indexed. Google stopped honoring this directive in robots.txt in 2019, so use a robots meta tag or an X-Robots-Tag HTTP header instead.
- Host: Specifies the preferred domain for indexing (primarily recognized by Yandex).
- Noarchive: Prevents search engines from storing a cached copy of the page (more commonly used in meta tags).
- Nofollow: Instructs search engines not to follow links on the specified page (more effective in meta tags).
Common Web Crawlers and Their Purposes
Understanding the variety of web crawlers is crucial for effectively managing your robots.txt file. Each crawler has its specific purpose and origin. Here are some common crawlers and their descriptions:
- Googlebot: Google's main web crawler for indexing sites for Google Search.
- Bingbot: Microsoft's crawler for indexing content for Bing search results.
- Yandex Bot: A crawler for the Russian search engine Yandex.
- Applebot: Indexes webpages for Apple’s Siri and Spotlight Suggestions.
- DuckDuckBot: The web crawler for DuckDuckGo.
- Baiduspider: The primary crawler for Baidu, the leading Chinese search engine.
- Sogou Spider: A crawler for the Chinese search engine Sogou.
- Facebook External Hit: Indexes content shared on Facebook.
- Exabot: The web crawler for Exalead.
- Swiftbot: Swiftype's web crawler for custom search engines.
- Slurp Bot: Yahoo's crawler for indexing pages for Yahoo Search.
- CCBot: By Common Crawl, providing internet data for research.
- GoogleOther: A newer Google crawler for internal use.
- Google-InspectionTool: Used by Google's Search testing tools.
- MJ12bot: Majestic's web crawler.
- Pinterestbot: Pinterest's crawler for images and metadata.
- SemrushBot: For collecting SEO data by Semrush.
- Dotbot: Moz's crawler for SEO and technical issues.
- AhrefsBot: Ahrefs' bot for SEO audits and link building.
- Archive.org_bot: By the Internet Archive to save web page snapshots.
- Soso Spider: Crawler for the Soso search engine by Tencent.
- SortSite: A crawler for testing, monitoring, and auditing websites.
- Apache Nutch: An extensible and scalable open-source web crawler.
- Open Search Server: A Java web crawler for creating search engines or indexing content.
- Headless Chrome: A browser operated from the command line or server environment.
- Chrome-Lighthouse: A browser addon for auditing and performance metrics.
- Adbeat: A crawler for site and marketing audits.
- Comscore / Proximic: Used for online advertising.
- Bytespider: ByteDance's web crawler.
- PetalBot: A crawler for Petal Search.
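Because each crawler identifies itself with a user-agent token, rules can be grouped per crawler. The sketch below is a hypothetical policy, not a recommendation from this article: it blocks two SEO crawlers entirely, applies a general rule to everyone else, and then checks the result with Python's urllib.robotparser. The example.com URL and the helper name allowed_by_agent are assumptions, and the tokens should be confirmed against each vendor's documentation before use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: two crawlers share one group, everyone else gets "*".
ROBOTS_TXT = """\
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed_by_agent(robots_txt, agents, url):
    """Map each user-agent token to whether it may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_by_agent(
    ROBOTS_TXT,
    ["Googlebot", "Bingbot", "AhrefsBot", "SemrushBot"],
    "https://www.example.com/blog/post.html",  # placeholder URL
)
print(result)
```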
Understanding Google's New Robots.txt Report
The new robots.txt report in Google Search Console provides webmasters with critical insights into their website's accessibility to Google's crawlers. Key features of this report include:
- Identification of robots.txt files for the top 20 hosts on your site.
- The size, fetch status, and date of Google's last attempt to crawl each robots.txt file.
- Details of any warnings or errors encountered during the crawl.
- Fetch history for robots.txt files over the past 30 days.
- Ability to request an emergency recrawl of a robots.txt file.
Webmasters can find this report under the "Settings" section in Google Search Console. It gives a concise view of each robots.txt file, so you can quickly assess and address issues affecting your site's crawling and indexing.
Why This Matters
The retirement of the original Robots.txt Tester tool by Google signifies a shift in how webmasters will manage and troubleshoot their site's interaction with Google's crawlers, and it also leaves a gap for some specific use cases:
- You can no longer quickly check if a new rule you are adding might inadvertently block Googlebot from crawling your important pages.
- There is no longer a way to check whether a new page or section of the site is inadvertently blocked by the current rules and directives in your robots.txt file.
Google's new robots.txt report aims to provide more comprehensive insights, making it a crucial tool for anyone facing indexing and crawling issues. It's advisable for site managers to regularly review this report in Google Search Console to ensure optimal site accessibility and performance. For other quick checks, they can use alternative robots.txt testers/validators like ours.
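A quick check of the kind the retired tester offered can also be scripted. The sketch below uses Python's urllib.robotparser to list which URLs a draft robots.txt would block for a given crawler; the function name find_blocked, the draft rules, and the example.com URLs are illustrative assumptions. The standard-library parser uses first-match, prefix-only rules and does not implement Google's wildcard or longest-match handling, so treat the result as a rough check rather than a statement of what Googlebot will do.

```python
from urllib.robotparser import RobotFileParser


def find_blocked(robots_txt, user_agent, urls):
    """Return the URLs that the given rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Draft rules to test before publishing them (hypothetical).
DRAFT_RULES = """\
User-agent: *
Disallow: /search
Disallow: /products/archive/
"""

# Pages that must stay crawlable (placeholder URLs).
IMPORTANT_URLS = [
    "https://www.example.com/",
    "https://www.example.com/products/widget.html",
    "https://www.example.com/search-tips.html",
]

blocked = find_blocked(DRAFT_RULES, "Googlebot", IMPORTANT_URLS)
print(blocked)
```

In this draft, Disallow: /search also catches /search-tips.html because rules match by path prefix, which is exactly the kind of inadvertent block worth testing for.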
Robots.txt Validator, Tester and Interpreter: Filling the Gap
Recognizing the gap left by the sunsetting of Google's Robots.txt Tester, we've developed an alternative tool that helps you test a new rule or check whether a page or section of your site is crawlable. Our tool does more than syntax checking: it interprets the rules and directives in your robots.txt file and gives you a simple summary plus a brief syntax report on any errors or warnings found. It aims not only to mimic the functionality of the retired Google robots.txt tester but also to offer an enhanced robots.txt validator and tester.
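To illustrate what a basic syntax report involves, here is a minimal sketch of a line-by-line check in Python. It is not the implementation of the tool described here; the function name lint_robots_txt, the set of accepted directives, and the warning messages are all assumptions chosen for illustration.

```python
# Directives this sketch accepts (an assumption; real validators know more).
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}
# Directives that only make sense inside a User-agent group.
GROUP_RULES = {"allow", "disallow", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line_number, message) warnings for a robots.txt body."""
    warnings = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        name, _value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name not in KNOWN_DIRECTIVES:
            warnings.append((number, f"unknown directive '{name}'"))
        elif name == "user-agent":
            seen_user_agent = True
        elif name in GROUP_RULES and not seen_user_agent:
            warnings.append((number, f"'{name}' appears before any User-agent line"))
    return warnings


SAMPLE = """\
Disallow: /tmp/
User-agent: *
Disalow: /private/
Allow /public/
Crawl-delay: 10
"""

report = lint_robots_txt(SAMPLE)
for number, message in report:
    print(f"line {number}: {message}")
```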
Stay ahead in managing your site's accessibility and ensure seamless crawling and indexing by Google. Explore our new tool today and take control of your website's performance in the ever-evolving digital landscape.