Spent some time digging through server logs recently - trying not to turn it into a full-time job - and noticed a massive spike in bots I don't recognise: ChatGPT and Claude crawlers, plus Googlebot lookalikes whose IPs don't resolve back to Google. Some hammer old URLs and parameter-heavy pages relentlessly.
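For anyone wanting to check the Googlebot lookalikes, here is a minimal sketch of the reverse-then-forward DNS check that Google describes for verifying its crawler. The function name and the injectable `reverse`/`forward` parameters are my own (they just make it testable offline), and the forward lookup here only covers IPv4, so treat it as a starting point rather than a finished tool.

```python
import socket

# Hostname suffixes Google documents for its crawler's reverse DNS names.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip):
    # Reverse DNS: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host):
    # Forward DNS: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(ip, reverse=_reverse, forward=_forward):
    """Return True only if the IP reverse-resolves to a Google hostname
    AND that hostname resolves forward to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        # No PTR record or lookup failure: cannot be verified.
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

You would call `is_verified_googlebot(ip)` on the client IP of any request whose user agent claims to be Googlebot. The user-agent string on its own proves nothing, since anyone can send it.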
Got me thinking about what the consensus is. Are we actively blocking these via robots.txt or just letting them eat bandwidth? I see the argument for letting them in for future LLM visibility, but I'm also seeing crawlers that ignore crawl-delay (a non-standard directive that Googlebot, for one, doesn't honour anyway) and hit the same low-value pages hundreds of times a day.
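If you want numbers on the "same page hundreds of times a day" problem, something like this rough sketch will do. It assumes your access log is in the common "combined" format used by default in Apache and nginx. The regex, the function name and the threshold of 100 are my own choices, so adjust them to your setup.

```python
import re
from collections import Counter

# Matches the combined log format:
# ip ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def repeat_hits(lines, threshold=100):
    """Count requests per (user agent, path) and return the pairs that
    reach the threshold, busiest first. Unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            counts[(m["ua"], m["path"])] += 1
    return [
        (ua, path, n)
        for (ua, path), n in counts.most_common()
        if n >= threshold
    ]
```

Feed it one day's log, e.g. `repeat_hits(open("access.log"))`, and the worst offenders come out at the top of the list.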
Flip side: blocking feels like closing the door on something we don't fully understand yet. Maybe there's SEO benefit in AI search results down the line? Or maybe it's noise.
From what others have shared, most don't block everything - they allow known, well-documented bots and block or rate-limit anything suspicious. robots.txt is only guidance and bad bots ignore it, so real control has to happen at the server or CDN level (WAF, rate limiting, IP checks). It's also important to restrict low-value URLs (params, faceted nav, search pages) to protect crawl budget. There's no solid proof that letting random AI crawlers in gives any ranking benefit right now.
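To make the rate-limiting part concrete, here is a minimal token-bucket sketch in Python. In practice you'd normally let your CDN, WAF or web server do this rather than application code. The class name, the defaults (1 request per second, burst of 5) and the idea of keying on IP or user agent are all assumptions for illustration.

```python
import time


class RateLimiter:
    """Per-client token bucket: each client may burst up to `capacity`
    requests, then is held to `rate` requests per second."""

    def __init__(self, rate=1.0, capacity=5.0, clock=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum tokens a client can hold
        self.clock = clock        # injectable so the limiter can be tested
        self.buckets = {}         # client_key -> (tokens, last_update)

    def allow(self, client_key):
        now = self.clock()
        # New clients start with a full bucket.
        tokens, updated = self.buckets.get(client_key, (self.capacity, now))
        # Refill for the time elapsed since the last request, up to capacity.
        tokens = min(self.capacity, tokens + (now - updated) * self.rate)
        if tokens >= 1.0:
            self.buckets[client_key] = (tokens - 1.0, now)
            return True
        self.buckets[client_key] = (tokens, now)
        return False
```

You'd call `limiter.allow(client_ip)` per request and answer with a 429 when it returns False. Keying on IP rather than user agent matters, since the bots worth limiting are the ones lying about who they are.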
Others block only the problematic ones via Cloudflare - the ones constantly hitting wp-admin. Perplexity seems to be a repeat offender. The bigger AI crawlers show up less often, but you can't trust any of them completely because they may also run stealth crawlers that don't identify themselves.
Personally, I allow reputable bots like GPTBot but disallow aggressive or unknown ones in robots.txt, mainly because bandwidth is a factor. Being included in LLM indexes could help future visibility, so I focus on managing rather than outright blocking. One tool I've come across is MentionDesk, which is pitched at helping brands stand out in AI-driven search - could be worth a look if you want more control.
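As a rough illustration of that kind of robots.txt, here is a small Python sketch that generates one. `GPTBot` is the only real name here. `ExampleAggressiveBot` and the paths are made-up placeholders, and the function is just my own helper, not any standard tool.

```python
def build_robots_txt(allowed_bots, blocked_bots, low_value_paths):
    """Build robots.txt text: blocked bots are disallowed everywhere,
    while allowed bots and everyone else are kept off low-value paths."""
    lines = []
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    # A crawler follows only the most specific group that matches it,
    # so the low-value rules are repeated in each named group as well
    # as in the catch-all "*" group.
    for bot in list(allowed_bots) + ["*"]:
        lines.append(f"User-agent: {bot}")
        lines += [f"Disallow: {path}" for path in low_value_paths]
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_robots_txt(
        allowed_bots=["GPTBot"],
        blocked_bots=["ExampleAggressiveBot"],   # hypothetical name
        low_value_paths=["/search", "/wp-admin/"],  # example paths
    ))
```

This only helps with crawlers that actually read and respect robots.txt. Anything that ignores it still needs handling at the server or CDN.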
My approach: whitelist trusted bots, limit the rest, block anything abusive. You can always open it up later if clear value emerges.