I've been treating this as two separate decisions, because people mix them up:
- Training/data collection crawlers - bots that scrape content to build datasets.
- Answer-generation / "AI search" fetchers - systems that fetch a live page to quote or cite in a response.
Blocking the first group makes perfect sense if you don't want your work feeding someone else's model. But blocking the second can shoot you in the foot if you do want your content showing up in AI surfaces with proper citations - they often need to grab the source to quote it accurately.
Here's the practical workflow I've settled on:
- Decide your policy per bot class. Don't just throw a blanket "Disallow: /" at everything labelled "AI".
- Keep your robots.txt explicit and document the reasoning internally - future you will thank past you when a new crawler shows up.
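- Add logging to see which bots actually hit your site. Server or CDN access logs are the reliable source for UA strings (Screaming Frog's Log File Analyser can read them). GSC's crawl stats only cover Google's own crawlers, and GTM custom events miss most bots because they don't run JavaScript.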
- If you let fetchers through, make sure the pages they hit are clean, canonical, and not paywalled-by-JS - nothing worse than an AI quoting a loading spinner.
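To make the per-class policy concrete, here is a minimal sketch that writes one out and checks it with Python's standard `urllib.robotparser`. The split of tokens into "training" and "fetcher" groups is my assumption for illustration (GPTBot and CCBot as dataset crawlers, OAI-SearchBot as a search fetcher), so confirm each vendor's current documentation before copying it. `ROBOTS_TXT`, `policy_report`, and the example.com URLs are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy only: which token belongs to which class is an
# assumption here, so check each vendor's documentation.
ROBOTS_TXT = """\
# Training / dataset crawlers: blocked
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Live answer / search fetchers: allowed
User-agent: OAI-SearchBot
Allow: /

# Everyone else: only keep them out of the admin area
User-agent: *
Disallow: /admin/
"""


def policy_report(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot", "OAI-SearchBot", "SomeOtherBot"]
    report = policy_report(ROBOTS_TXT, agents, "https://example.com/blog/post")
    for agent, allowed in report.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this sample policy the two training tokens should come back blocked and the fetcher allowed. Real crawlers may match rules a little differently from the standard-library parser, so treat this as a sanity check on your own intent, not proof of how any given bot behaves.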
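And whatever you choose, monitor it. Most teams set robots.txt once and never touch it again, but the bot landscape keeps shifting: vendors add new user-agent tokens, rename old ones, and split training crawlers from search fetchers, so a rule written last year may no longer match the bot it was written for.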
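For the monitoring side, a rough sketch of counting bot hits from access logs could look like this. It assumes the common "combined" log format, where the user agent is the last quoted field. The `BOT_TOKENS` watch list, the `count_bot_hits` helper, and the sample log lines are all made up for illustration, and UA strings can be spoofed, so read the counts as a signal, not as proof of identity.

```python
import re
from collections import Counter

# Hypothetical watch list; extend it from each vendor's current documentation.
BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "CCBot", "ClaudeBot", "PerplexityBot"]

# In the combined log format the user agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(lines, tokens=BOT_TOKENS):
    """Count log lines per watched token (case-insensitive substring match)."""
    hits = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        ua = match.group(1).lower()
        for token in tokens:
            if token.lower() in ua:
                hits[token] += 1
                break  # count each request once
    return hits


# Made-up sample lines, not real traffic or exact vendor UA strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:10:00:02 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:01:00 +0000] "GET /blog/post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.9 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for token, count in count_bot_hits(SAMPLE_LOG).most_common():
        print(token, count)
```

Run it on a schedule against real logs and compare the counts over time: a token you allow that suddenly drops to zero, or a bot-looking UA that matches nothing on the list, is the cue to revisit robots.txt.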