---
title: "A robots.txt that actually lets agents in"
url: "https://www.mondello.dev/posts/a-robots-txt-that-actually-lets-agents-in"
slug: "a-robots-txt-that-actually-lets-agents-in"
published: "2026-04-11T01:17:53.335Z"
updated: "2026-04-11T05:25:13.976Z"
excerpt: "I deployed a blog and checked /robots.txt. My hosting provider's default was blocking every AI crawler on the internet. That's the opposite of what I want. So I wrote a plugin."
author: "Romy Mondello"
tags: ["agents", "open-source", "robots-txt", "seo"]
audio:
  url: "https://www.mondello.dev/audio/01KNX1G07T426GSBK8EW2DWH4Z-narration-1776097055970.mp3"
  duration: "7:20"
  size_bytes: 7044525
  format: "audio/mpeg"
license: "CC BY 4.0 unless otherwise noted"
source: "https://github.com/integrate-your-mind/mondello-dev"
---
# A robots.txt that actually lets agents in

I deployed a blog this week and checked `/robots.txt` as a sanity pass. The default was blocking every AI crawler on the internet. ClaudeBot, GPTBot, Google-Extended, PerplexityBot, CCBot, Applebot-Extended, meta-externalagent, Bytespider — all `Disallow: /`. Plus a `Content-Signal: search=yes,ai-train=no` at the top.

This is not what I want. If you're publishing in 2026 and your goal is for agents to find your work, starting from "block every agent" is starting at -1. You end up paying the bandwidth bill for the traditional human-search crawlers while shutting out the audience that's actually growing.

So I wrote a plugin.

## The problem sits at two layers

The first layer is the template. The EmDash blog template I was using has a reasonable default `robots.txt` that lets everything in except the admin surface. Fine.

The second layer is the hosting provider. Cloudflare ships a zone-level feature called "AI Scrapers and Crawlers" that **prepends its own managed robots.txt content** above whatever your Worker returns. The prepended block is the one doing the blocking. It's on by default for a lot of accounts, and most people don't notice because they never diff the served `/robots.txt` against what their Worker actually emits.

`robots.txt` matching is per user-agent group. RFC 9309 actually says a crawler should merge every group that names its user-agent, but plenty of parsers in the wild obey only the first matching group. Either way I lose: a first-match parser stops at CF's `Disallow: /` and never reads my group, and a spec-compliant parser merges CF's blanket disallow into mine, where nothing more specific overrides it. CF's block list is above my allow list. The block wins.
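Concretely, the served file ends up shaped like this (abridged and illustrative; the managed block varies by account):

```txt
# --- Cloudflare's managed content, prepended ---
Content-Signal: search=yes,ai-train=no

User-agent: ClaudeBot
Disallow: /

# ...every other AI crawler, also blocked...

# --- my Worker's output, appended below ---
User-agent: ClaudeBot
Disallow:
```

A crawler resolving the ClaudeBot policy hits the blocking group first; my permissive group at the bottom never gets a say.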

The fix has two steps: toggle the CF setting off in the dashboard under Security → Bots, and make sure your Worker is emitting a `robots.txt` you actually want served. This post is about the second step.

## The bot catalog is data, not code

I published the generator as an open source plugin called `emdash-plugin-agent-seo`. The first thing in the package is a file called `bots.ts`. It is a flat list of every well-known AI crawler I could find documentation for as of Q2 2026:

```typescript
export interface AgentBot {
  readonly id: string;
  readonly userAgent: string;
  readonly operator: string;
  readonly purpose: readonly BotPurpose[];
  readonly docsUrl: string;
}

export const AGENT_BOTS: readonly AgentBot[] = [
  { id: "gptbot", userAgent: "GPTBot", operator: "OpenAI", purpose: ["training"], docsUrl: "https://platform.openai.com/docs/gptbot" },
  { id: "claudebot", userAgent: "ClaudeBot", operator: "Anthropic", purpose: ["training"], docsUrl: "https://docs.anthropic.com/claude/docs/crawler" },
  { id: "google-extended", userAgent: "Google-Extended", operator: "Google", purpose: ["training", "grounding"], docsUrl: "https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers" },
  { id: "perplexitybot", userAgent: "PerplexityBot", operator: "Perplexity", purpose: ["search", "grounding"], docsUrl: "https://docs.perplexity.ai/guides/bots" },
  // ... 12 more
];
```

The important part: this is **data**. It's versioned separately from the generator, auditable in code review, trivial to fork and override. If Anthropic ships a new bot tomorrow called `Claude-SearchBot`, you add one line to this file and everything downstream picks it up. You don't touch the generator, you don't touch the route, you don't touch the site that consumes it.

Each bot carries metadata: who operates it, what it's for (training / grounding / search / assistant), and a link to the vendor's published docs so anyone reviewing the list can verify the entry.
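That purpose metadata is what makes helpers like `filterBotsByPurpose`, used in the policy examples below, trivial to write. A sketch of how it might be implemented (the package's actual internals may differ):

```typescript
type BotPurpose = "training" | "grounding" | "search" | "assistant";

interface AgentBot {
  readonly id: string;
  readonly userAgent: string;
  readonly operator: string;
  readonly purpose: readonly BotPurpose[];
  readonly docsUrl: string;
}

// Keep every bot whose declared purposes intersect the requested set.
const filterBotsByPurpose = (
  bots: readonly AgentBot[],
  purposes: readonly BotPurpose[],
): AgentBot[] => bots.filter((b) => b.purpose.some((p) => purposes.includes(p)));
```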

## The generator is 80 lines of pure TypeScript

```typescript
export const buildRobotsTxt = (opts: RobotsTxtOptions): string => {
  const {
    siteUrl,
    defaultPolicy = { mode: "allow" },
    botPolicies = {},
    bots = AGENT_BOTS,
    globalDisallow = ["/_emdash/"],
  } = opts;

  const sections: string[] = [];
  for (const bot of bots) {
    const policy = botPolicies[bot.id] ?? { mode: "allow" };
    sections.push(...renderGroup(bot.userAgent, policy));
    // A bot that matches its own named group ignores the `*` group entirely,
    // so the global disallows have to be repeated inside every group.
    for (const path of globalDisallow) sections.push(`Disallow: ${path}`);
    sections.push("");
  }
  sections.push(...renderGroup("*", defaultPolicy));
  for (const path of globalDisallow) sections.push(`Disallow: ${path}`);
  const base = siteUrl.replace(/\/$/, "");
  sections.push(`Sitemap: ${base}/sitemap.xml`);
  return sections.join("\n") + "\n";
};
```

Same input always produces the same output. No I/O, no environment access, no framework coupling. Trivial to unit-test. The function takes policy as an argument — the default is "allow every bot in the catalog, disallow /\_emdash/, advertise the sitemap" — but any caller can override per-bot or change the default without forking the package.
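`renderGroup` does the per-group formatting and isn't shown above. A minimal sketch consistent with the three-mode policy type, assuming an empty `Disallow:` line as the standard way to spell "allow everything":

```typescript
type BotPolicy =
  | { mode: "allow" }
  | { mode: "disallow" }
  | { mode: "paths"; allow?: string[]; disallow?: string[] };

// Render one robots.txt group: a User-agent line followed by its rules.
const renderGroup = (userAgent: string, policy: BotPolicy): string[] => {
  const lines = [`User-agent: ${userAgent}`];
  switch (policy.mode) {
    case "allow":
      lines.push("Disallow:"); // empty Disallow = no restriction
      break;
    case "disallow":
      lines.push("Disallow: /");
      break;
    case "paths":
      for (const p of policy.allow ?? []) lines.push(`Allow: ${p}`);
      for (const p of policy.disallow ?? []) lines.push(`Disallow: ${p}`);
      break;
  }
  return lines;
};
```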

## Policy as data

This is the part that mattered most to me. Most robots.txt generators hardcode a policy and give you knobs for individual bots. I wanted the opposite: give the caller a small DSL they can compose.

```typescript
type BotPolicy =
  | { mode: "allow" }
  | { mode: "disallow" }
  | { mode: "paths"; allow?: string[]; disallow?: string[] };
```

Three modes. Every bot, plus the default fallback, takes one of these. That means you can compose any reasonable policy in a few lines.

**Maximum agent discoverability** — the default. Every bot allowed, admin hidden:

```typescript
buildRobotsTxt({ siteUrl, defaultPolicy: { mode: "allow" } });
```

**Block training bots, allow search and grounding** — if you want agents to surface your posts in answers but not train on them:

```typescript
import { AGENT_BOTS, filterBotsByPurpose } from "emdash-plugin-agent-seo/bots";

const trainingBots = filterBotsByPurpose(AGENT_BOTS, ["training"]);
const botPolicies = Object.fromEntries(
  trainingBots.map((b) => [b.id, { mode: "disallow" } as const]),
);
buildRobotsTxt({ siteUrl, botPolicies, defaultPolicy: { mode: "allow" } });
```

**Paid-only crawl** — every known bot disallowed, humans through, x402 paywall on the pages. Pairs well with pay-per-fetch revenue:

```typescript
const botPolicies = Object.fromEntries(
  AGENT_BOTS.map((b) => [b.id, { mode: "disallow" } as const]),
);
buildRobotsTxt({ siteUrl, botPolicies, defaultPolicy: { mode: "allow" } });
```

The policy is yours. The plugin just renders it.

## What to pair it with

`robots.txt` is the signal that tells bots what they're allowed to take. `llms.txt` is the signal that tells them what's worth taking. You want both.

`llms.txt` is a discovery manifest per [llmstxt.org](https://llmstxt.org) — a root-level file that lists your canonical posts and pages as link lines, plus a separate `llms-full.txt` that inlines the full text so an agent can ingest the whole site in one fetch. I published a separate plugin for that called `emdash-plugin-llms-txt`. Same design: pure functional generator, no framework coupling, caller controls the data shape.
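For context on the format, a minimal `llms.txt` looks roughly like this (illustrative content, not the plugin's literal output):

```markdown
# mondello.dev

> Posts about agents, plugins, and publishing for the agent-era web.

## Posts

- [A robots.txt that actually lets agents in](https://www.mondello.dev/posts/a-robots-txt-that-actually-lets-agents-in): why provider defaults block agents, and the plugin that fixes it
```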

The combination of an allow-everything `robots.txt`, a clean `llms.txt`, and a fresh `sitemap.xml` is the current best practice for agent-era SEO. It's also almost nothing to ship once you have the plugins.

## Wire it up

Drop this in `src/pages/robots.txt.ts` in any EmDash project:

```typescript
import type { APIRoute } from "astro";
import { buildRobotsTxt } from "emdash-plugin-agent-seo";

export const GET: APIRoute = async ({ site, url }) => {
  const siteUrl = site?.toString() ?? url.origin;
  const body = buildRobotsTxt({
    siteUrl,
    defaultPolicy: { mode: "allow" },
    globalDisallow: ["/_emdash/"],
  });
  return new Response(body, {
    headers: {
      "content-type": "text/plain; charset=utf-8",
      "cache-control": "public, max-age=3600",
    },
  });
};
```

That's the whole integration. Nine lines plus the response envelope.

## One last thing about Cloudflare

I spent longer than I should have on this, so flagging it explicitly: if your site is behind Cloudflare and you've turned on "AI Scrapers and Crawlers" in the zone settings, **your Worker cannot fully override the served robots.txt.** Cloudflare prepends its managed content above your Worker response. The dashboard path is Security → Bots → AI Scrapers and Crawlers. Toggle it off if you want the Worker's response to win.

This is not in Cloudflare's published gotcha list anywhere I could find. It's a setting a lot of accounts have enabled by default. I didn't notice for an embarrassing length of time because I kept `curl`-ing `/robots.txt` and seeing my agent-seo allow list at the bottom of the response, so I assumed it was working. First-match-wins means the top always wins.
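The diff that would have caught this on day one (the `workers.dev` hostname here is a placeholder; substitute your own preview URL):

```shell
# what the edge actually serves to the world
curl -s https://www.mondello.dev/robots.txt -o served.txt

# what the Worker emits when Cloudflare's zone feature isn't in the path
curl -s https://my-blog.example.workers.dev/robots.txt -o worker.txt

# any prepended managed block shows up immediately
diff served.txt worker.txt
```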

## Source

`emdash-plugin-agent-seo` is MIT-licensed. The code lives in the `plugins/` directory of the blog repo until I extract it to a standalone public repo next week. Pull requests welcome, especially for the bot catalog — if I missed a crawler or have stale docs URLs, open an issue or send a patch.

Same story for `emdash-plugin-llms-txt` — they're sister packages and most people will want both.

This is the kind of thing that should exist as a boring, small, well-tested utility that any EmDash site can install in one command. The v0.1.0 packages are the first pass. The goal is for robots.txt on any EmDash site, anywhere, to be a solved problem — the same way you don't think about how your site gets SSL certs anymore.
