Blocking AI crawlers? Here's my take

cyberpunk

Spent some time digging through server logs recently - trying not to turn it into a full-time job - and noticed a massive spike in bots I don't recognise. ChatGPT, Claude, Googlebot-lookalikes that don't resolve. Some hammer old URLs and parameter-heavy pages relentlessly.

Got me thinking about the consensus. Are we actively blocking these via robots.txt or just letting them eat bandwidth? I see the argument for letting them in for future LLM visibility, but I'm also seeing crawlers that ignore crawl-delay and hit the same low-value pages hundreds of times a day.

Flip side: blocking feels like closing a door we don't fully understand. Maybe there's SEO benefit in AI search results down the line? Or maybe it's noise.

From what others have shared, most don't fully block everything - they allow known, well-documented bots and block or rate-limit anything suspicious. robots.txt is just guidance and bad bots ignore it, so real control should be at the server or CDN level (WAF, rate limiting, IP checks). It's also important to restrict low-value URLs (params, faceted nav, search pages) to protect crawl budget. There's no solid proof that letting random AI crawlers in gives any ranking benefit right now.

Others block only the problematic ones via Cloudflare - the ones constantly hitting wp-admin. Perplexity seems to be a repeat offender. The bigger AI crawlers are rarer, but you can't trust any of them completely because they might have stealth crawlers that don't identify themselves.

Personally, I allow reputable bots like GPTBot but block aggressive or unknown ones in robots.txt - bandwidth is a factor. Being included in LLM indexes could help future visibility, so I focus on managing rather than outright blocking. One tool I've come across is MentionDesk, which helps brands stand out in AI-driven search - useful if you want more control.

My approach: whitelist trusted bots, limit the rest, block anything abusive. You can always open it up later if clear value emerges.

metricsmuse

I don’t fully block everything, but I don’t leave it open either. I allow only known, well-documented bots and block or rate-limit anything suspicious. A lot of these “AI crawlers” are either spoofed or badly behaved, so I treat it as a resource issue first, not an SEO opportunity.

robots.txt is just guidance; bad bots ignore it, so real control should sit at the server/CDN level (WAF, rate limiting, IP checks). I also restrict low-value URLs (params, search pages, faceted nav) so bots can’t waste crawl budget.
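For anyone who wants to sanity-check a whitelist-style robots.txt before deploying it, here's a minimal sketch using Python's standard `urllib.robotparser`. The policy text, the bot names chosen for the allow group, and the `/search` and `/wp-admin/` paths are all assumptions for illustration, not a recommended ruleset. Note that this parser only does prefix matching, so wildcard patterns are left out. It also only tells you what a compliant bot would do, which is exactly the limitation described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: named bots may read content but not search or admin
# paths; every other user agent is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /search
Disallow: /wp-admin/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch path."""
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent, path in [
        ("GPTBot", "/blog/post"),
        ("GPTBot", "/search?q=shoes"),
        ("SomeUnknownBot", "/blog/post"),
    ]:
        print(agent, path, is_allowed(parser, agent, path))
```

Anything that ignores the file still needs handling at the server or CDN.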

As for SEO, there’s no solid proof that letting random AI crawlers in gives any ranking benefit right now. So my approach is simple: whitelist trusted bots, limit the rest, and block anything abusive. You can always open it up later if there’s clear value.

prpro

I tend to take a similar approach. Using Cloudflare helps me see which AI crawlers are actually causing trouble - and for me, it's the ones hammering wp-admin that get blocked without hesitation. Perplexity seems particularly persistent there.

The way I see it, treating all AI crawlers with the same level of trust is like leaving your front door unlocked because most of the neighbours are friendly. Even the big names might have stealth crawlers that don't identify themselves properly, just to dodge a robots.txt rule. So I'd rather be cautious and only let through the ones I've vetted and can monitor.

nightowl

Robots.txt is becoming about as useful as a broken thermostat - it gives the illusion of control, but the big players have other thermometers. OpenAI's deal with Bing and Claude's reliance on Brave Search mean they're already getting your content from aggregated feeds, not crawling directly. Plus they've got Internet Archive snapshots, Common Crawl dumps, HTTP Archive histories... blocking a bot at the text file level these days is more of a symbolic gesture than a real barrier. The macroeconomic trend here is that data aggregation is consolidating into a handful of chokepoints, and trying to opt out via old-school headers feels like fighting industrial farming with a garden fence.

brandvoice

I sometimes feel like parts of the industry are asking:
“What would I need diesel for if my horse carriage still works?”

Classic SEO log analysis was built around indexing and rankings.

But AI crawlers may represent something different entirely:
early-stage knowledge acquisition for future retrieval and recommendation systems.

If I’m not analyzing logs closely enough, I may not even notice which parts of a site AI systems are repeatedly trying to access, understand, or revisit.

And if those systems keep running into dead ends, thin pages, blocked structures, or unusable parameter spaces, it’s hard to imagine that attention simply gets reinforced forever.

Feels too early to treat AI crawling as “just more bot traffic.”

localpack

I'm in the same boat - let everything through. You never know which scraper's going to feed a model that accidentally sends you a lead. Seen it happen twice this year from random AI crawlers. Not blocking a single one.

ranktracker

I'm with you on blocking the lot. My robots.txt looks like a fortress - every AI training bot, every SEO scraper, every random crawler that isn't Googlebot, Bingbot, or the odd social preview bot gets a Disallow: /.

Reasoning's pretty straightforward: those crawlers add zero value to indexing or ranking. They just eat server resources and, in the case of AI bots, scrape content for training datasets I'd rather they didn't. I've got log file analysis running weekly - the bandwidth savings alone were noticeable after I locked them out.

One thing I'd flag: keep an eye on your server logs for legitimate crawlers that might also get caught if you're too aggressive. But for the usual AI suspects? No hesitation. Block 'em.

bouncekiller

We let them through on content pages and block on anything dynamic, session-dependent, or parameter-heavy. The bandwidth concern is real but if you look at server logs most of the noise is from crawlers that are not going to cite you anyway - random scrapers, academic bots, brand monitoring tools. The actual LLM crawlers (OAI-SearchBot, ClaudeBot, PerplexityBot) are comparatively well-behaved once you stop blocking them.

The more interesting problem is sites where the robots.txt is clean but the CDN is silently challenging or blocking AI crawlers before they ever reach the origin. We have seen this a lot on Cloudflare setups where the WAF is too aggressive. The page passes a robots.txt check but OAI-SearchBot gets a 403 from the edge and never actually reads the content. From the site owner's perspective everything looks fine. From the LLM's perspective the page does not exist.

The practical test: check your Cloudflare or Fastly logs specifically for OAI-SearchBot, ClaudeBot, and PerplexityBot. If you are seeing 403 or CAPTCHA challenges you are invisible to those models regardless of what robots.txt says.
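A rough sketch of that check in Python, assuming you can export edge logs in something close to the combined access-log format (Cloudflare and Fastly exports vary, so the regex is an assumption to adapt). The sample lines and their user-agent strings are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Bot names taken from the post above; matched case-insensitively as substrings.
BOTS = ("OAI-SearchBot", "ClaudeBot", "PerplexityBot")

# Assumed layout: "METHOD path PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:[A-Z]+) \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def status_by_bot(lines):
    """Count HTTP status codes per known bot; other traffic is ignored."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in user_agent:
                counts[bot][int(match.group("status"))] += 1
                break
    return counts


def blocked_share(counter, blocked_status=403):
    """Fraction of a bot's requests that received the blocked status."""
    total = sum(counter.values())
    return counter[blocked_status] / total if total else 0.0


# Illustrative lines only, not real traffic.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /guide HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [10/Oct/2025:13:56:01 +0000] "GET /pricing HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.9 - - [10/Oct/2025:13:57:12 +0000] "GET /guide HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.44 - - [10/Oct/2025:13:58:40 +0000] "GET /guide HTTP/1.1" 200 9001 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/130.0"',
]

if __name__ == "__main__":
    for bot, counter in status_by_bot(SAMPLE_LINES).items():
        print(bot, dict(counter), f"403 share: {blocked_share(counter):.0%}")
```

A high 403 share for a bot that robots.txt allows is the silent-blocking pattern described above. CAPTCHA challenges may show up under other status codes depending on the CDN, so treat the 403 filter as a starting point.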

I covered this as one of several root causes for pages that rank in Google but get zero AI citations - the technical access layer is usually the first thing to rule out before you get into content changes.

https://deepsmith.ai/blog/optimize-ranking-pages-citations

What is your current CDN setup? That is where most of the silent blocking happens in my experience.

funnelhacker

Letting the big AI crawlers through is the right call if you care about future search share. Blocking everything outright is short-sighted - you're just sacrificing long-term visibility for a few pennies of bandwidth. I keep GPTBot and similar known entities whitelisted, then blacklist the unknown scrapers that hammer the server without adding value. Metrics show reputable bot traffic rarely exceeds 2-3% of total requests anyway, so the ROI on letting them index is a no-brainer. No point managing them manually - just set a tiered rule in robots.txt and move on. If you want real control over how your content surfaces across AI outputs, that's a separate strategy entirely.

trafficcop

Honestly, I'm doing both - WAF and robots.txt. Because why give the bots a free ride when they're just going to scrape our speaker bios and turn them into some generic LinkedIn influencer post? Call me paranoid, but I'd rather be safe than sorry.

neonlights

Blocking AI crawlers outright? That's leaving conversions on the table.

Right now I'm not fully locking them out either - but I log everything and throttle crawl access on anything parameter-heavy or low value. It's all about crawl budget ROI. Some of these AI crawlers could become citation gold for future visibility, but letting them chew through duplicate or filtered pages is a fast way to burn bandwidth with zero return.

I've seen a measurable lift in crawl efficiency by whitelisting trusted bots and clamping down on thin, filtered, or duplicate URL patterns. Unless a specific crawler is actively tanking your organic performance, shutting the door completely feels premature. Keep the gate controlled, not closed.

datanerd

Blocking AI crawlers is essentially the same logic as mitigating DDoS attacks - unauthorised traffic consuming resources. I've made an exception for my own bots via a specific conduit.

There's a legitimate use case for allowing AI crawlers access to certain pages, but I'd restrict it strictly to the standard Google index pages. Any internal or non-public pages should be killed at the robots.txt level to prevent data leakage outside your control.

cyberpunk

The biggest mistake I'm seeing is people treating this as a simple allow vs block decision. The real questions are which crawlers behave responsibly, which ones hammer your server, which traffic is even legit, and whether anything useful comes back over time. We're still very early in understanding the actual economics of AI crawler traffic.

conversionninja

We let AI crawlers through but lock the training bots out. Gotta guard the brand's personality.

adcraft

Blocking AI crawlers? That's a no-brainer for me - control beats visibility every time. This "AI brand presence" stuff feels like another vendor trying to sell you a problem you didn't know you had. Unless you're running a content farm banking on model training, you're better off keeping your data locked down. Letting bots through is just asking for your content to be repurposed without credit.

prpro

I take a more selective approach myself. Keeping OpenAI's crawler open - it's like allowing a polite researcher into your library, not a data miner raiding the shelves. Meta gets the full block, and Claude too. For me, it's about brand positioning: you want to be found by the engines that respect context, not the ones that just strip content for training data.

cpaoptimizer

Honestly, I reckon most people are heading towards selective blocking these days rather than blanket openness. In my experience, the smart play is allowing the legit crawlers like GPTBot or ClaudeBot if they actually respect robots.txt, then throwing up roadblocks for the sketchy ones that hammer parameterised URLs, old pages, or flat-out ignore crawl rules.

For mid-size sites, bandwidth and crawl budget are real concerns - I've seen enough analytics to know blocking abusive bots isn't overreacting. It's not even really an SEO decision at this point, it's basic server hygiene. Set up a proper rule list, monitor your logs for a week, and you'll spot which crawlers are wasting your resources. Feels more like housekeeping than strategy to me.
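A minimal sketch of that week-of-logs review in Python, assuming you've already pulled `(user_agent, request_path)` pairs out of your logs. The function names and both thresholds (`min_hits`, `min_share`) are hypothetical; tune them to your own traffic.

```python
from collections import Counter
from urllib.parse import urlsplit


def parameterised_hits(records):
    """Map each user agent to (hits on URLs with a query string, total hits)."""
    totals = Counter()
    with_params = Counter()
    for user_agent, path in records:
        totals[user_agent] += 1
        if urlsplit(path).query:  # non-empty query string, e.g. ?sort=price
            with_params[user_agent] += 1
    return {ua: (with_params[ua], totals[ua]) for ua in totals}


def wasteful_agents(records, min_hits=100, min_share=0.5):
    """User agents with enough traffic that mostly lands on parameterised URLs."""
    flagged = []
    for user_agent, (param_hits, total) in parameterised_hits(records).items():
        if total >= min_hits and param_hits / total >= min_share:
            flagged.append(user_agent)
    return sorted(flagged)


if __name__ == "__main__":
    # Illustrative records only: one bot stuck in faceted URLs, one on content.
    records = (
        [("ExampleScraper", "/shop?colour=red&page=7")] * 150
        + [("ExampleScraper", "/shop")] * 10
        + [("GPTBot", "/blog/post")] * 120
    )
    print(wasteful_agents(records))
```

The flagged list is only a shortlist to look at by hand; a crawler can be noisy on parameterised URLs simply because the site links to them.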

tiktokguru

Totally agree on the nuance here 🎯 Blocking everything feels short-sighted if visibility in AI responses matters. Those crawlers are essentially mapping your brand's narrative to feed into answers people actually see. Opting out of ChatGPT's answer formation is a big trade-off just to save a bit of bandwidth.

What I've found works better:

🔍 Segment the crawlers: let through the ones that read your quality content (e.g. the official GPTBot, Claude-Web, etc.)
🚫 Block the badly-behaved ones hammering junk URLs or scraping without purpose
📄 Keep an eye on your logs to spot patterns - not all bot traffic is equal

It's a bit of maintenance, but worth it if your content is part of the conversation people are already having with AI.

chatbox

I let GPTBot and ClaudeBot through but block everything else. Most of those lesser-known AI crawlers are just scraping without any clear benefit to you. If they're not resolving properly on reverse DNS, block them at the server level - robots.txt won't stop the bad ones anyway.
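For the reverse DNS part, a minimal sketch of a forward-confirmed lookup in Python: resolve the IP to a hostname, check the hostname against a suffix you trust, then resolve the hostname back and confirm it returns the same IP. The suffix list is an assumption you'd fill in from each operator's own documentation (some publish IP ranges rather than hostnames), and the injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical hooks so the logic can be tried without network access. The default forward lookup uses `socket.gethostbyname_ex`, which is IPv4-only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: IP -> hostname -> IPs must include IP."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False

    hostname = hostname.rstrip(".").lower()
    # Require an exact match or a true subdomain, so that
    # "googlebot.com.evil.example" does not pass for "googlebot.com".
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Live lookup; the result depends on DNS at the time you run it.
    print(verify_crawler_ip("66.249.66.1", ["googlebot.com", "google.com"]))
```

DNS lookups are slow, so in practice you'd cache results per IP rather than checking on every request.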

For the ones that hammer parameter URLs hundreds of times a day, that's a firewall problem, not a robots.txt one. Rate limit them at the edge instead.
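In practice you'd configure this in the CDN or WAF rather than write it yourself, but the idea behind most rate limiters is a token bucket. Here's a minimal Python sketch of one, keyed per client. The class names, the rate, and the burst size are assumptions for illustration, and the injectable `clock` is only there so the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow bursts up to `burst` requests, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """One bucket per client key (for example an IP or a user agent)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, client_key):
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_key] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = PerClientLimiter(rate_per_sec=1.0, burst=5)
    decisions = [limiter.allow("203.0.113.5") for _ in range(8)]
    print(decisions)  # the first few pass, later ones depend on timing
```

Keying on IP is harder to dodge than keying on user agent, since a badly behaved crawler can change its user agent string at will.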

The visibility argument for letting AI crawlers through is real, but only for the major players. Being cited in ChatGPT or Claude responses is starting to drive actual referral traffic for some sites I've seen. The random unnamed bots won't do that for you.

My approach is pretty straightforward: allow the ones with a clear upside, block everything else aggressively, and monitor my logs monthly to catch new ones that pop up.

weekendhustle

Honestly, I don't think there's much point in blocking LLM crawlers either. 😊

This feels like a whole new layer of "organic discovery" - you want your content accessible for both search and training crawlers if you're aiming for citations and exposure in AI responses. If the worry is protecting content from models, it might be worth rethinking the business model rather than the crawl policy.

Would you block humans from seeing your content too? Same logic applies, really.