How to Use Robots.txt to Control Search Engine Crawling (2025 Guide)

The robots.txt file is a small but powerful configuration tool that tells search engines which parts of your website they can or cannot access.
When used correctly, it helps optimize crawl efficiency, protect sensitive areas, and improve SEO performance. When used incorrectly, it can deindex your entire site overnight.

What Is Robots.txt?

The robots.txt file is a plain text document located at the root of your domain — for example:

https://example.com/robots.txt

It communicates directly with web crawlers (like Googlebot, Bingbot, or GPTBot) to define which folders or pages should be crawled and which should be excluded.

Every time a bot visits your site, it first checks this file to understand your crawl preferences.

How Robots.txt Works

The file follows a simple logic:

  • User-agent: specifies which bot the rules apply to (e.g., Googlebot, Bingbot)
  • Disallow: lists the folders or pages that bot should not crawl
  • Allow: (optional) re-permits specific files inside an otherwise disallowed folder

For example, a typical WordPress configuration looks like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap.xml

This setup is ideal for WordPress: it blocks access to administrative areas but allows crawling of the AJAX file required for proper plugin functionality.

How to Disallow Everything

To block all bots from crawling your site, use:

User-agent: *
Disallow: /

Warning: never use this directive on a live site. Blocking all crawling causes your pages to drop out of Google's index over time and leads to a complete loss of organic traffic until the file is corrected.

How to Allow Everything

To allow all crawlers full access, you can leave your robots.txt file empty or use:

User-agent: *
Disallow:

This means “disallow nothing,” so all content is accessible to search engines.

How to Block Specific Files or Folders

You can prevent bots from accessing specific paths by listing them individually:

User-agent: *
Disallow: /private/
Disallow: /drafts/
Disallow: /hidden/file.html

Everything not listed remains crawlable.
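
Major crawlers such as Googlebot and Bingbot also support simple pattern matching: * matches any sequence of characters and $ marks the end of a URL. As an illustrative sketch (adjust the patterns to your own site), the following blocks all PDF files and any URL containing a query string:

User-agent: *
Disallow: /*.pdf$
Disallow: /*?

Wildcard support varies between crawlers, so test these patterns before relying on them.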

How to Block Specific Bots

User-agent: Bingbot
Disallow: /

User-agent: *
Disallow:

This example blocks only Bingbot, while Googlebot and other crawlers can still access your website.
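
You can also group several crawlers under a single rule by stacking User-agent lines. The bot names below (AhrefsBot and SemrushBot are common SEO crawlers) are only examples; substitute whichever bots you want to restrict:

User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: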

Recommended WordPress Configuration

The following configuration is SEO-friendly and works well with Google Search Console:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml

Best Practices

  • Always include the URL of your XML sitemap (see the example below).
  • Test your configuration with the robots.txt report in Google Search Console (which replaced the old robots.txt Tester).
  • Remember that paths in robots.txt are case-sensitive, so the case in your directives must match your actual URLs.
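
The Sitemap directive can appear anywhere in the file and can be repeated if you maintain several sitemaps (the URLs below are illustrative):

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml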

Robots.txt vs Noindex

If your goal is to prevent pages from appearing in search results, use the noindex tag rather than blocking them via robots.txt.

<meta name="robots" content="noindex,follow">

Why this matters: a URL blocked by robots.txt can still appear in search results (typically without a description) if other pages link to it.
The noindex directive instead lets crawlers access the page and tells them not to display it. For it to work, the page must remain crawlable, so do not block the same URL in robots.txt at the same time.

Plugins such as Yoast SEO, Rank Math, or The SEO Framework make it simple to apply the noindex directive to specific pages or categories.
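
For non-HTML files such as PDFs, which cannot carry a meta tag, the same effect can be achieved with the X-Robots-Tag HTTP response header, typically set at the web server level:

X-Robots-Tag: noindex

As with the meta tag, the file must remain crawlable for crawlers to see the header.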

When to Block Your Entire Site

If you want to restrict access entirely (for both users and bots), use password protection rather than robots.txt.
A free WordPress plugin such as Password Protected is ideal for staging or development environments.

Important Facts About Robots.txt

  • Not all bots respect robots.txt directives — malicious crawlers often ignore them.
  • The robots.txt file is public. Do not list confidential directories in it.
  • A single typo (for example, Disallow: / instead of Disallow: /admin/) can deindex an entire site.
  • Always back up and validate your file before deployment.

Robots.txt and AI Crawlers in 2025

In 2025, major AI companies such as OpenAI, Anthropic, and Google operate their own crawlers or robots.txt tokens (for example, GPTBot, ClaudeBot, and Google-Extended).
If you want to prevent your content from being used in AI training datasets, include the following directives:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

These companies state that their crawlers honor robots.txt, so compliant bots will stop collecting your content for training. Keep in mind that compliance is voluntary, and Google-Extended only controls AI training; it does not affect how Googlebot crawls or indexes your site for Search.

Key Takeaways

  • Place robots.txt in the root directory of your website (e.g., /public_html/robots.txt).
  • Use Allow and Disallow carefully; a misplaced slash can have critical SEO consequences.
  • Always include your XML sitemap.
  • Validate your configuration in Google Search Console before publishing.
  • Avoid Disallow: / on production sites.
  • Consider blocking AI crawlers if content protection is a priority.

SEO Optimization Summary

  • Primary Keyword: robots.txt file
  • Secondary Keywords: disallow all, allow all, block bots, WordPress robots.txt, AI crawlers
  • Meta Title: How to Use Robots.txt to Control Search Engine Crawling (2025 SEO Guide)
  • Meta Description: Learn how to configure your robots.txt file correctly in 2025 to control crawlers, protect content, and maintain SEO performance — including how to block AI crawlers.
  • Slug: /robots-txt-guide-2025
  • Schema Type: Article
  • Target Length: ~1,300 words

How to create a printable HTML page with custom A4 layout

Creating printable HTML pages can be tricky, even for experienced frontend developers. I remember tackling this challenge a few years ago and having to dig through documentation and community answers on Stack Overflow. To save you some time and effort, I've put together this guide that walks through the process step by step.


Step 1: Set up the basic structure

Start by creating a container div with the classes print-container and page-a4. Inside it, add one section element with the class page-content for each printed page. This structure will hold your content; below, I've placed an article element inside each page as an example.

<div class="print-container page-a4">
    <section class="page-content">
        <article>Ciao! I'm an A4 document.</article>
    </section>
    <section class="page-content">
        <article>Ciao! I'm an A4 document.</article>
    </section>
</div>

Step 2: Basic styling

Apply base styles to maintain layout integrity:

body {
    margin: 0;
}
.print-container {
    margin: 0;
}
.page-content {
    margin: 0;
    overflow: hidden;          /* clip anything that spills past the page edge */
    position: relative;
    box-sizing: border-box;    /* include padding in the fixed page dimensions */
    page-break-after: always;  /* start a new printed page after each section */
}
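
If you prefer the modern CSS fragmentation syntax, you can optionally add the equivalent break-after declaration alongside the legacy property; both are widely supported, and keeping both is harmless:

.page-content {
    break-after: page; /* modern equivalent of page-break-after: always */
}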

Step 3: Screen display enhancements

To make the page look more like a printed sheet when viewed on a screen, add these styles:

@media screen {
    body {
        background: #f4f4f4;
    }
    .page-content {
        background: white;
        box-shadow: 0 2px 8px rgba(0, 0, 0, 0.2);
        margin: 10mm auto;
    }
}

Step 4: Set page size to A4

Define the dimensions of the A4 page using millimeters:

.page-a4 .page-content {
    width: 210mm;
    height: 297mm;
}
.page-padding {
    padding: 10mm;
}

Add the page-padding class to give each page an inner margin (padding) that you can adjust easily:

<div class="print-container page-a4">
    <section class="page-content page-padding">
        <article>Ciao! I'm an A4 document.</article>
    </section>
    <section class="page-content page-padding">
        <article>Ciao! I'm an A4 document.</article>
    </section>
</div>
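
The same pattern extends to other paper sizes. As a purely illustrative sketch, a hypothetical page-letter class for US Letter paper would look like this (you would also change the @page size in Step 5 to Letter):

/* Hypothetical US Letter variant (8.5in x 11in), mirroring the A4 rules */
.page-letter .page-content {
    width: 8.5in;
    height: 11in;
}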

Step 5: Configure print settings

Lastly, ensure your CSS sets the correct page size for printing:

@page {
    size: A4;
    margin: 0;
}
@media print {
    .print-container.page-a4 {
        width: 210mm;
    }
}
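
One common companion pattern is hiding screen-only elements, such as buttons or navigation, when the document is printed. The .no-print class name below is just a convention; add it to any element you want to exclude from the printout:

@media print {
    /* Hide anything marked as screen-only (e.g., toolbars, print buttons) */
    .no-print {
        display: none;
    }
}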

By following these steps, you’ll achieve a properly formatted A4 printable layout for your web content.

Let me know in the comments if you try it or have any questions!