WEBVTT - A robots.txt that actually lets agents in

1
00:00:00.000 --> 00:00:03.901
I deployed a blog this week and checked /robots.txt

2
00:00:03.901 --> 00:00:06.069
as a sanity pass.

3
00:00:06.069 --> 00:00:10.404
The default was blocking every AI crawler on the internet.

4
00:00:10.404 --> 00:00:15.606
ClaudeBot, GPTBot, Google-Extended, PerplexityBot, CCBot, Applebot-Extended, meta-externalagent, Bytespider — all Disallow: /.

5
00:00:15.606 --> 00:00:18.640
Plus a Content-Signal: search=yes,ai-train=no at the top.

6
00:00:18.640 --> 00:00:21.241
This is not what I want.

7
00:00:21.241 --> 00:00:31.645
If you're publishing in 2026 and your goal is for agents to find your work, starting from "block every agent" is starting at -1.

8
00:00:31.645 --> 00:00:39.015
You're paying the bandwidth bill for human-oriented crawlers while serving nothing to the audience that is actually growing.

9
00:00:39.015 --> 00:00:41.182
So I wrote a plugin.

10
00:00:41.182 --> 00:00:43.783
The problem sits at two layers

11
00:00:43.783 --> 00:00:55.488
The first layer is the template. The EmDash blog template I was using has a reasonable default robots.txt that lets everything in except the admin surface. Fine.

12
00:00:55.488 --> 00:00:58.522
The second layer is the hosting provider.

13
00:00:58.522 --> 00:01:05.458
Cloudflare ships a zone-level feature called "AI Scrapers and Crawlers" that prepends its own managed robots.txt

14
00:01:05.458 --> 00:01:08.493
content above whatever your Worker returns.

15
00:01:08.493 --> 00:01:12.394
The prepended block is the one doing the blocking.

16
00:01:12.394 --> 00:01:21.498
It's on by default for a lot of accounts, and most people don't notice because they never diff the served /robots.txt

17
00:01:21.498 --> 00:01:24.532
against what their Worker actually emits.

18
00:01:24.532 --> 00:01:38.837
robots.txt group handling is subtler than it looks. RFC 9309 actually says that groups matching the same user-agent should be merged, but many parsers simply obey the first matching group. Either way, CF's Disallow: / block sits above my allow list and wins. I lose.
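To make that concrete, the served file looked something like this (reconstructed for illustration, not a verbatim capture):

```
# Cloudflare's managed block, prepended at the top
User-agent: GPTBot
Disallow: /

# My Worker's output, appended below; a parser that stops
# at the first matching GPTBot group never reads this.
User-agent: GPTBot
Disallow: /_emdash/
```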

19
00:01:38.837 --> 00:01:50.108
The fix has two steps: toggle the CF setting off in the dashboard under Security → Bots, and make sure your Worker is emitting a robots.txt

20
00:01:50.108 --> 00:01:52.276
you actually want served.

21
00:01:52.276 --> 00:01:55.310
This post is about the second step.

22
00:01:55.310 --> 00:01:58.345
The bot catalog is data, not code

23
00:01:58.345 --> 00:02:03.113
I published the generator as an open source plugin called emdash-plugin-agent-seo.

24
00:02:03.113 --> 00:02:08.315
The first thing in the package is a file called bots.ts.

26
00:02:08.315 --> 00:02:16.552
It is a flat list of every well-known AI crawler I could find documentation for as of Q2 2026:
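The catalog itself isn't reproduced in this transcript. As a sketch of the shape it describes (the BotEntry name, the field names, and both entries here are my reconstruction, not the published file; the docs URLs are placeholders):

```typescript
// Hypothetical shape of bots.ts: a flat, auditable data file.
export type BotPurpose = "training" | "grounding" | "search" | "assistant";

export interface BotEntry {
  userAgent: string;     // token matched against robots.txt User-agent lines
  operator: string;      // who runs the crawler
  purposes: BotPurpose[];
  docsUrl: string;       // vendor's published documentation for the bot
}

export const BOTS: BotEntry[] = [
  {
    userAgent: "ClaudeBot",
    operator: "Anthropic",
    purposes: ["training"],
    docsUrl: "https://example.com/docs/claudebot", // placeholder
  },
  {
    userAgent: "GPTBot",
    operator: "OpenAI",
    purposes: ["training"],
    docsUrl: "https://example.com/docs/gptbot", // placeholder
  },
];
```

Adding a new crawler is one more object literal in this array; nothing downstream changes.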

27
00:02:16.552 --> 00:02:19.153
The important part: this is data.

28
00:02:19.153 --> 00:02:25.655
It's versioned separately from the generator, auditable in code review, trivial to fork and override.

29
00:02:25.655 --> 00:02:35.192
If Anthropic ships a new bot tomorrow called Claude-SearchBot, you add one line to this file and everything downstream picks it up.

30
00:02:35.192 --> 00:02:42.995
You don't touch the generator, you don't touch the route, you don't touch the site that consumes it.

31
00:02:42.995 --> 00:02:57.734
Each bot carries metadata: who operates it, what it's for (training / grounding / search / assistant), and a link to the vendor's published docs so anyone reviewing the list can verify the entry.

32
00:02:57.734 --> 00:03:01.202
The generator is 80 lines of pure TypeScript

33
00:03:01.202 --> 00:03:04.236
Same input always produces the same output.

34
00:03:04.236 --> 00:03:07.704
No I/O, no environment access, no framework coupling.

35
00:03:07.704 --> 00:03:09.005
Trivial to unit-test.

36
00:03:09.005 --> 00:03:25.044
The function takes policy as an argument — the default is "allow every bot in the catalog, disallow /_emdash/, advertise the sitemap" — but any caller can override per-bot or change the default without forking the package.
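A pure generator with that contract might look like this. The function name, option names, and two-mode policy are my reconstruction for illustration, not the plugin's published API (the real plugin reportedly has a third mode):

```typescript
// Sketch of a pure robots.txt generator: same input, same output, no I/O.
type Mode = "allow" | "block";

interface RobotsOptions {
  bots: string[];                    // user-agent tokens from the catalog
  defaultMode?: Mode;                // fallback for bots without an override
  overrides?: Record<string, Mode>;  // per-bot policy
  disallowPaths?: string[];          // paths hidden even from allowed bots
  sitemapUrl?: string;
}

export function generateRobotsTxt(opts: RobotsOptions): string {
  const {
    bots,
    defaultMode = "allow",
    overrides = {},
    disallowPaths = ["/_emdash/"],
    sitemapUrl,
  } = opts;

  const lines: string[] = [];
  for (const bot of bots) {
    const mode = overrides[bot] ?? defaultMode;
    lines.push(`User-agent: ${bot}`);
    if (mode === "block") {
      lines.push("Disallow: /");
    } else {
      for (const p of disallowPaths) lines.push(`Disallow: ${p}`);
    }
    lines.push("");
  }
  if (sitemapUrl) lines.push(`Sitemap: ${sitemapUrl}`);
  return lines.join("\n") + "\n";
}
```

Because it's a pure function of its arguments, a unit test is just string assertions on the output.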

37
00:03:25.044 --> 00:03:26.345
Policy as data

38
00:03:26.345 --> 00:03:41.517
This is the part that mattered most to me. Most robots.txt generators hardcode a policy and give you knobs for individual bots. I wanted the opposite: give the caller a small DSL they can compose.

39
00:03:41.517 --> 00:03:51.921
Three modes. Every bot, plus the default fallback, takes one of these. That means you can compose any reasonable policy in a few lines.

40
00:03:51.921 --> 00:03:56.690
Maximum agent discoverability — the default. Every bot allowed, admin hidden:

41
00:03:56.690 --> 00:04:06.660
Block training bots, allow search and grounding — if you want agents to surface your posts in answers but not train on them:

42
00:04:06.660 --> 00:04:14.897
Paid-only crawl — every known bot disallowed, humans through, x402 paywall on the pages. Pairs well with pay-per-fetch revenue:
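The transcript doesn't name the three modes or show the policy syntax, so these sketches are guesses at the shape, expressed as plain data (all type and field names are illustrative):

```typescript
// Three policy sketches as plain data. "omit" is a guess at the third mode:
// the bot gets no group of its own and falls through to the wildcard rules.
type Mode = "allow" | "block" | "omit";

interface Policy {
  defaultMode: Mode;
  overrides?: Record<string, Mode>;
  disallowPaths?: string[];
}

// 1. Maximum agent discoverability: everything allowed, admin hidden.
const discoverable: Policy = {
  defaultMode: "allow",
  disallowPaths: ["/_emdash/"],
};

// 2. Allow search and grounding, block training bots (illustrative picks).
const noTraining: Policy = {
  defaultMode: "allow",
  overrides: { GPTBot: "block", ClaudeBot: "block", CCBot: "block" },
};

// 3. Paid-only crawl: every known bot blocked; pair with an x402 paywall.
const paidOnly: Policy = { defaultMode: "block" };
```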

43
00:04:14.897 --> 00:04:18.798
The policy is yours. The plugin just renders it.

44
00:04:18.798 --> 00:04:20.966
What to pair it with

45
00:04:20.966 --> 00:04:31.803
robots.txt is the signal that tells bots what they're allowed to take. llms.txt is the signal that tells them what's worth taking. You want both.

46
00:04:31.803 --> 00:04:35.271
llms.txt is a discovery manifest per llmstxt.org

48
00:04:35.271 --> 00:04:43.507
— a root-level file that lists your canonical posts and pages as link lines, plus a separate llms-full.txt

49
00:04:43.507 --> 00:04:50.877
that inlines the full text so an agent can ingest the whole site in one fetch.
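Per the llmstxt.org proposal, the manifest is markdown with an H1 title, an optional blockquote summary, and sections of link lines. A minimal sketch with invented titles and URLs:

```
# My Blog

> Posts about shipping software for the agent-era web.

## Posts

- [A robots.txt that actually lets agents in](https://example.com/posts/agents.md): why hosting defaults block AI crawlers and how to fix it
```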

50
00:04:50.877 --> 00:04:54.778
I published a separate plugin for that called emdash-plugin-llms-txt.

51
00:04:54.778 --> 00:05:00.414
Same design: pure functional generator, no framework coupling, caller controls the data shape.

52
00:05:00.414 --> 00:05:14.286
The combination of an allow-everything robots.txt + a clean llms.txt + a fresh sitemap.xml is the current best practice for agent-era SEO. It's also almost nothing to ship once you have the plugins.

53
00:05:14.286 --> 00:05:15.586
Wire it up

54
00:05:15.586 --> 00:05:19.054
Drop this in src/pages/robots.txt.ts in any EmDash project:
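The route itself isn't shown in this transcript. Assuming an Astro-style endpoint and guessing the plugin's export names (generateRobotsTxt and BOTS are my reconstruction, as is the sitemap URL), the integration would look roughly like:

```typescript
// src/pages/robots.txt.ts -- sketch; plugin export names are assumptions.
import { generateRobotsTxt, BOTS } from "emdash-plugin-agent-seo";

export function GET(): Response {
  const body = generateRobotsTxt({
    bots: BOTS.map((b) => b.userAgent),
    disallowPaths: ["/_emdash/"],
    sitemapUrl: "https://example.com/sitemap.xml", // your site's sitemap
  });
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```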

55
00:05:19.054 --> 00:05:23.389
That's the whole integration. Nine lines plus the response envelope.

56
00:05:23.389 --> 00:05:25.557
One last thing about Cloudflare

57
00:05:25.557 --> 00:05:42.897
I spent longer than I should have on this, so flagging it explicitly: if your site is behind Cloudflare and you've turned on "AI Scrapers and Crawlers" in the zone settings, your Worker cannot fully override the served robots.txt.

59
00:05:42.897 --> 00:05:46.798
Cloudflare prepends its managed content above your Worker response.

60
00:05:46.798 --> 00:05:52.000
The dashboard path is Security → Bots → AI Scrapers and Crawlers.

61
00:05:52.000 --> 00:05:56.768
Toggle it off if you want the Worker's response to win.

62
00:05:56.768 --> 00:06:01.970
This is not in Cloudflare's published gotcha list anywhere I could find.

63
00:06:01.970 --> 00:06:06.739
It's a setting a lot of accounts have enabled by default.

64
00:06:06.739 --> 00:06:12.808
I didn't notice for an embarrassing length of time because I kept curl-ing /robots.txt

65
00:06:12.808 --> 00:06:21.044
and seeing my agent-seo allow list at the bottom of the response, so I assumed it was working.

66
00:06:21.044 --> 00:06:23.645
For any parser that obeys the first matching group, the top always wins.

67
00:06:23.645 --> 00:06:24.079
Source

68
00:06:24.079 --> 00:06:25.379
emdash-plugin-agent-seo is MIT-licensed.

69
00:06:25.379 --> 00:06:34.916
The code lives in the plugins/ directory of the blog repo until I extract it to a standalone public repo next week.

70
00:06:34.916 --> 00:06:46.187
Pull requests welcome, especially for the bot catalog — if I missed a crawler or have stale docs URLs, open an issue or send a patch.

71
00:06:46.187 --> 00:06:52.256
Same story for emdash-plugin-llms-txt — they're sister packages and most people will want both.

72
00:06:52.256 --> 00:07:02.660
This is the kind of thing that should exist as a boring, small, well-tested utility that any EmDash site can install in one command.

73
00:07:02.660 --> 00:07:06.562
The v0.1.0 packages are the first pass.

76
00:07:06.562 --> 00:07:08.729
The goal is for robots.txt

77
00:07:08.729 --> 00:07:20.000
on any EmDash site, anywhere, to be a solved problem — the same way you don't think about how your site gets SSL certs anymore.
