The robots.txt file is a simple text file placed in the publicly accessible root directory of a website (e.g., https://www.example.com/robots.txt). While its syntax is straightforward, its importance in technical SEO is critical. A misconfigured file can lead to serious consequences, ranging from important parts of the website being ignored by bots to the entire website being excluded from search engine results.
The goal of this article is not only to provide an overview of the basic rules but also to offer context, advanced examples, and specific recommendations for different platforms – especially WordPress.
What Is robots.txt and Why Use It?
The robots.txt file is used to control access for search engine crawlers (also known as user-agents) such as Googlebot, Bingbot, Yandexbot, etc., to various parts of your website. It's part of the so-called Robots Exclusion Protocol (REP), which was designed for efficiently managing site crawling.
The primary goals of using robots.txt are to:
- Prevent crawling of duplicate or irrelevant content
- Restrict access to areas not intended for the public
- Optimize crawl budget – the time and resources a bot spends on your site
- Block crawling of technical structures that carry no informational value
- Prevent crawling of endpoints such as the REST API, search queries, AJAX scripts, etc.
Structure of the robots.txt File
The file is composed of blocks, each beginning with the User-agent directive, followed by Disallow, Allow, and optionally Sitemap (the Sitemap directive is independent of user-agent blocks and can appear anywhere in the file).
Example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap_index.xml
- User-agent: Specifies which bot the rules apply to; * means all bots.
- Disallow: Denies access to a specific path or URL.
- Allow: Explicitly permits access – often used for exceptions to Disallow.
- Sitemap: Recommends the location of the website's XML sitemap(s).
Each group is evaluated independently: a crawler follows only the group that best matches its user-agent, so rules intended for different user-agents should not be mixed in a single block.
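For example, in the hypothetical file below (the paths are made up for illustration), Googlebot matches its own group and therefore follows only that group's rules, ignoring the generic * group entirely:
User-agent: Googlebot
Disallow: /only-for-googlebot/

User-agent: *
Disallow: /blocked-for-other-bots/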
Wildcards and Special Characters
To fine-tune access control, you can use wildcard characters.
- * – matches any number of characters. Example: Disallow: /private/*.pdf → blocks all PDF files in the /private/ directory.
- $ – matches the end of a URL. Example: Disallow: /*.pdf$ → blocks all URLs ending in .pdf, but not something like /download.php?file=document.pdf.
The combination of * and $ allows for precise control over URL structures.
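For example (reusing the paths from above), the following rule blocks only URLs inside /private/ that end exactly in .pdf, so /private/docs/report.pdf is blocked while /private/report.pdf?download=1 is not:
Disallow: /private/*.pdf$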
What Should the robots.txt File Contain?
1. Sitemap Definition
Search engines read the Sitemap: directive and use it to discover and crawl your XML sitemap.
Sitemap: https://www.example.com/sitemap_index.xml
If you’re using plugins like Yoast SEO or RankMath, the sitemap is generated automatically.
- Use full URLs including https://.
- For large sites, consider splitting into multiple sitemaps (e.g., /sitemap-products.xml, /sitemap-categories.xml), as sketched below.
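If you do split the sitemap, each file can be listed with its own Sitemap: directive (the file names here are illustrative):
Sitemap: https://www.example.com/sitemap-posts.xml
Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-categories.xml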
2. Blocking Non-Essential Site Sections
This includes technical directories, pagination, search results, and AJAX functions — covered in more detail below.
What Should Not Be Indexed and Why
Admin Interface
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
WordPress stores all admin functions under /wp-admin/. These pages should not be crawled or indexed. admin-ajax.php is an exception, as it is required by some plugins and themes (e.g., WooCommerce, AJAX filtering).
Internal Search
Disallow: /?s=
Disallow: /search
Internal search queries can generate hundreds or thousands of URL variations with different parameters. These pages usually:
- Contain no unique content
- Cause content duplication
- Waste crawl budget
- Pollute the search index
Search pages should be blocked via robots.txt or marked as noindex.
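Note that Disallow: /?s= only matches search URLs launched from the site root (e.g., /?s=query). If search queries can also be appended to other paths, a broader wildcard pattern may be needed; a possible variant:
Disallow: /*?s=
Disallow: /*&s=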
Author, Archive, and Tag Pages
Disallow: /author/
Disallow: /tag/
Disallow: /category/uncategorized/
- /author/ – redundant if there's only one author; usually no unique content
- /tag/ – creates noise unless tags are well-structured
- /category/uncategorized/ – the default WordPress category; it should be renamed or blocked
Such pages can be blocked via robots.txt or marked with noindex, follow.
Parameterized and Paginated URLs
Disallow: /*?*
Disallow: */page/
WooCommerce and other e-commerce platforms often use parameters like ?orderby=, ?filter_price=, ?color=, ?size=, etc., resulting in hundreds of duplicate pages (a more targeted alternative to the blanket rule above is sketched at the end of this section).
Pagination (/page/2/) should be handled carefully – either via canonical URLs or by disallowing the pages.
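If the blanket Disallow: /*?* rule is too aggressive for your site (it also blocks query strings you may want crawled), a more targeted approach is to disallow only the known filter parameters; a sketch using the parameters mentioned above:
# Catch the filter/sort parameters whether they appear first or later in the query string
Disallow: /*?orderby=
Disallow: /*&orderby=
Disallow: /*?filter_price=
Disallow: /*&filter_price=
# ...repeat for ?color=, ?size= and any other filter parameters your shop uses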
REST API and AJAX Endpoints
Disallow: /wp-json/
Disallow: /graphql/
Modern WordPress themes often use REST APIs. While these endpoints don’t serve HTML, some bots still crawl them. Unless they are public APIs, they should be blocked.
Technical Directories
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /backup/
Disallow: /*.sql$
Disallow: /*.zip$
These directories/files may contain sensitive or unnecessary data and should not be indexed.
What Should Not Be Blocked
CSS and JavaScript
Some site owners mistakenly block:
Disallow: /wp-includes/
Disallow: /wp-content/
This is a serious mistake. Googlebot needs access to all CSS and JS required to render your pages; blocking these files can hurt rankings because Google cannot render the page correctly.
Correct approach: Do not block /wp-content/ or /wp-includes/ unless necessary. Assets in /wp-content/uploads/ (e.g., images) should be indexable.
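If a legacy configuration must keep a Disallow on part of /wp-content/, explicit Allow exceptions can keep the rendering assets crawlable. A sketch of this pattern (Google resolves conflicts in favor of the most specific, i.e. longest, matching rule):
User-agent: *
Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js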
WordPress-Specific Considerations
WordPress is a very popular CMS but suffers from common SEO issues. robots.txt can help mitigate some of them:
Problematic areas:
- /wp-admin/, /wp-includes/ – technical directories
- /feed/, /trackback/, /comments/feed/ – often unnecessary
- ?replytocom= – comment display via URL parameter → duplicates (see the note below)
- /?s= – internal search
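The ?replytocom= duplicates are not covered by the base configuration below; if they show up in your crawl reports, a pattern like this (a suggested addition, adapt it to what actually appears on your site) can be added to the User-agent: * group:
Disallow: /*?replytocom=
Disallow: /*&replytocom=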
Recommended Base Configuration:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search
Disallow: /author/
Disallow: /category/uncategorized/
Disallow: /tag/
Disallow: /feed/
Disallow: /trackback/
Disallow: /comments/feed/
Sitemap: https://www.example.com/sitemap_index.xml
Testing and Validation
After editing your robots.txt, make sure that:
- The file is accessible at https://www.example.com/robots.txt
- It returns HTTP status 200 OK
- The Sitemap: directive is correct
- It doesn't conflict with other rules (e.g., noindex in meta tags)
You can validate using:
- Google Search Console – robots.txt Tester
- Online validators like https://technicalseo.com/tools/robots-txt/
Advanced Use Cases
Separate Sections for Different Bots
User-agent: Googlebot
Disallow: /private-google/
User-agent: Bingbot
Disallow: /private-bing/
User-agent: *
Disallow: /common/
Google and Bing can be given separate crawling rules.
Blocking Parameterized URLs with Exceptions
Disallow: /*?*
Allow: /*?orderby=
This blocks all URLs with parameters except those containing orderby=, which is useful for product sorting. When a Disallow and an Allow rule both match, Google follows the more specific (longer) rule, so the Allow exception takes precedence here.
Final Checklist Summary
✅ File is placed in the correct location
✅ Contains a correct Sitemap: directive
✅ Does not block access to essential CSS/JS
✅ Blocks access to search and parameterized URLs
✅ Blocks access to technical folders and files
✅ Does not accidentally include Disallow: /
✅ Rules are clear and conflict-free
✅ Validation in Search Console shows no errors