How to Properly Use the robots.txt File

The robots.txt file is a simple text file placed in the publicly accessible root directory of a website (e.g., https://www.example.com/robots.txt). While its syntax is straightforward, its importance in technical SEO is critical. A misconfigured file can lead to serious consequences — from important parts of the website being ignored by bots to the entire website being excluded from search engine results.

The goal of this article is not only to provide an overview of the basic rules but also to offer context, advanced examples, and specific recommendations for different platforms — especially WordPress.


What Is robots.txt and Why Use It?

The robots.txt file controls access for search engine crawlers (also known as user-agents) such as Googlebot, Bingbot, or YandexBot to various parts of your website. It is part of the Robots Exclusion Protocol (REP), a convention designed for managing how bots crawl a site.

The primary goals of using robots.txt are to:

  • Prevent crawling of duplicate or irrelevant content
  • Restrict crawler access to areas not intended for the public
  • Optimize crawl budget – the time and resources a bot spends on your site
  • Block crawling of technical structures that carry no informational value
  • Keep bots away from endpoints such as the REST API, search queries, AJAX scripts, etc.

Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still appear in search results if other pages link to it. To keep a page out of the index entirely, use a noindex directive and leave the page crawlable so bots can actually see it.

Structure of the robots.txt File

The file is composed of blocks, each beginning with the User-agent directive, followed by Disallow, Allow, and optionally Sitemap.

Example:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.example.com/sitemap_index.xml

  • User-agent: Specifies which bot the rules apply to. * means all bots.
  • Disallow: Denies access to a specific path or URL.
  • Allow: Explicitly permits access – often used for exceptions to Disallow.
  • Sitemap: Recommends the location of the website’s XML sitemap(s).

Crawlers do not combine rules across blocks. Each bot looks for the most specific User-agent block that matches it and follows only that block, so every section must contain the complete set of rules for that bot.
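
For example, in the hypothetical file below, Googlebot follows only its own block: it is kept out of /drafts/ but may still crawl /private/, because the rules in the * block do not carry over to it.

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/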


Wildcards and Special Characters

To fine-tune access control, you can use wildcard characters.

  • * – matches any number of characters
    Example:
    Disallow: /private/*.pdf
    → Blocks all PDF files in the /private/ directory.
  • $ – matches the end of a URL
    Example:
    Disallow: /*.pdf$
    → Blocks all URLs ending in .pdf (e.g., /files/report.pdf), but not /files/report.pdf?download=1, because that URL no longer ends in .pdf.

The combination of * and $ allows for precise control over URL structures.
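
For instance, a hypothetical rule that blocks only PDF files inside /private/, while leaving PDFs elsewhere on the site crawlable, could look like this (the directory name is illustrative):

Disallow: /private/*.pdf$

Here * covers any file name inside /private/, and $ ensures only URLs that actually end in .pdf are affected.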


What Should the robots.txt File Contain?

1. Sitemap Definition

Search engines read the Sitemap: directive and use it to discover and crawl your XML sitemap.

Sitemap: https://www.example.com/sitemap_index.xml

If you’re using plugins like Yoast SEO or RankMath, the sitemap is generated automatically.
Use full URLs including https://.
For large sites, consider splitting into multiple sitemaps (e.g., /sitemap-products.xml, /sitemap-categories.xml) and listing each one in robots.txt, as shown below.
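
A hypothetical file referencing several sitemaps (the file names are illustrative) would simply list them on separate lines:

Sitemap: https://www.example.com/sitemap-products.xml
Sitemap: https://www.example.com/sitemap-categories.xml
Sitemap: https://www.example.com/sitemap-posts.xml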

2. Blocking Non-Essential Site Sections

This includes technical directories, pagination, search results, and AJAX functions — covered in more detail below.


What Should Not Be Indexed and Why

Admin Interface

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

WordPress stores all admin functions under /wp-admin/. These pages should not be crawled or indexed.
admin-ajax.php is an exception because some plugins and themes (e.g., WooCommerce, AJAX product filtering) call it from the front end, and blocking it would prevent Googlebot from rendering those features correctly.

Internal Search

Disallow: /?s=
Disallow: /search

Internal search queries can generate hundreds or thousands of URL variations with different parameters. These pages usually:

  • Contain no unique content
  • Cause content duplication
  • Waste crawl budget
  • Pollute the search index

Search pages should either be blocked via robots.txt or marked as noindex; note that a noindex tag only works if the page remains crawlable.

Author, Archive, and Tag Pages

Disallow: /author/
Disallow: /tag/
Disallow: /category/uncategorized/

  • /author/ – redundant if there’s only one author; usually no unique content
  • /tag/ – creates noise unless tags are well-structured
  • /category/uncategorized/ – the default WordPress category, should be renamed or blocked

Such pages can be blocked via robots.txt or marked with noindex, follow.

Parameterized and Paginated URLs

Disallow: /*?*
Disallow: */page/

WooCommerce and other e-commerce platforms often use parameters like ?orderby=, ?filter_price=, ?color=, ?size=, etc., resulting in hundreds of duplicate pages.

Pagination (/page/2/) should be handled carefully – either keep it crawlable with correct canonical URLs or disallow the paginated pages, as in the rule above.
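
Be careful with the blanket Disallow: /*?* rule on WordPress: theme and plugin assets are often loaded with a ?ver= query string, so it can inadvertently block CSS and JavaScript needed for rendering. A more targeted sketch, using the WooCommerce parameters mentioned above as examples, blocks only the sorting and filtering parameters wherever they appear in the query string:

Disallow: /*?*orderby=
Disallow: /*?*filter_price=
Disallow: /*?*add-to-cart=
Disallow: /*?*color=
Disallow: /*?*size=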

REST API and AJAX Endpoints

Disallow: /wp-json/
Disallow: /graphql/

Modern WordPress themes often use REST APIs. While these endpoints don’t serve HTML, some bots still crawl them. Unless they are public APIs, they should be blocked.

Technical Directories

Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /backup/
Disallow: /*.sql$
Disallow: /*.zip$

These directories and files may contain sensitive or unnecessary data and should not be crawled or indexed. Keep in mind, though, that robots.txt is publicly readable and offers no real protection: genuinely sensitive files such as backups or database dumps should be removed from the web root or access-restricted at the server level, not merely disallowed.


What Should Not Be Blocked

CSS and JavaScript

Some site owners mistakenly block:

Disallow: /wp-includes/
Disallow: /wp-content/

This is a serious mistake. Googlebot needs access to all CSS and JavaScript required to render your pages; if these assets are blocked, Google cannot evaluate the rendered page properly, which can hurt rankings.

Correct approach: Do not block /wp-content/ or /wp-includes/ unless necessary. Assets in /wp-content/uploads/ (e.g., images) should be indexable.
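
If you nevertheless have a reason to keep part of /wp-content/ blocked, add explicit exceptions for render-critical assets. A hedged sketch (paths are illustrative) might look like this; for Google and other crawlers that follow the longest-match rule, the longer Allow patterns take precedence over the shorter Disallow:

Disallow: /wp-content/plugins/
Allow: /wp-content/plugins/*.css
Allow: /wp-content/plugins/*.js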


WordPress-Specific Considerations

WordPress is a very popular CMS, but its default setup comes with several common SEO issues. robots.txt can help mitigate some of them:

Problematic areas:

  • /wp-admin/, /wp-includes/ – technical directories
  • /feed/, /trackback/, /comments/feed/ – often unnecessary
  • ?replytocom= – comment display via URL param → duplicates
  • /?s= – internal search

Recommended Base Configuration:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search
Disallow: /author/
Disallow: /category/uncategorized/
Disallow: /tag/
Disallow: /feed/
Disallow: /trackback/
Disallow: /comments/feed/
Sitemap: https://www.example.com/sitemap_index.xml
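
For WooCommerce shops, this base file is often extended with rules for the cart and checkout flow. The paths below are the WooCommerce defaults and are meant as an illustration; adjust them if your store uses renamed pages:

Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /*?*add-to-cart=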

Testing and Validation

After editing your robots.txt, make sure that:

  • The file is accessible at https://www.example.com/robots.txt
  • It returns HTTP status 200 OK
  • The Sitemap: directive is correct
  • It doesn’t conflict with on-page directives (e.g., a page you want noindexed must not also be blocked in robots.txt, otherwise crawlers never see the meta tag)

You can validate the file using the robots.txt report in Google Search Console, which shows the version Google last fetched and flags any parsing errors.
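
If you prefer to script a quick check, Python's standard library includes a basic robots.txt parser. The sketch below assumes the recommended base configuration from this article is live at a hypothetical domain; adjust the URLs to your own site.

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Hypothetical location – replace with your own domain.
ROBOTS_URL = "https://www.example.com/robots.txt"

# 1. Check that the file is served successfully
#    (urlopen raises HTTPError for 4xx/5xx responses).
with urlopen(ROBOTS_URL) as response:
    print("HTTP status:", response.status)

# 2. Parse the live file and test a few representative URLs.
parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()

for url in (
    "https://www.example.com/blog/sample-post/",  # expected: allowed
    "https://www.example.com/?s=test",            # expected: blocked by Disallow: /?s=
):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)

# Note: urllib.robotparser implements the original Robots Exclusion Protocol
# and does not fully support Google's wildcard (* and $) matching, so treat
# the report in Search Console as the authoritative check.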


Advanced Use Cases

Separate Sections for Different Bots

User-agent: Googlebot
Disallow: /private-google/

User-agent: Bingbot
Disallow: /private-bing/

User-agent: *
Disallow: /common/

Google and Bing each follow only their own block here; the rules under User-agent: * apply to all other bots.

Blocking Parameterized URLs with Exceptions

Disallow: /*?*
Allow: /*?orderby=

This blocks all URLs with parameters except those where orderby= is the first parameter (useful for product sorting). The exception works because Google applies the most specific matching rule, so the longer Allow pattern overrides the shorter Disallow.


Final Checklist Summary

✅ File is placed in the correct location
✅ Contains a correct Sitemap: directive
✅ Does not block access to essential CSS/JS
✅ Blocks access to search and parameterized URLs
✅ Blocks access to technical folders and files
✅ Does not accidentally include Disallow: /
✅ Rules are clear and conflict-free
✅ Validation in Search Console shows no errors
