When users ask questions in ChatGPT, OpenAI's ChatGPT-User bot runs web searches and fetches the resulting pages in real time to source up-to-date information for the user. Perplexity and Meta's AI search features behave the same way.
This is a separate process from search indexing (the traditional way Googlebot or Bingbot crawl websites on an ongoing basis) and from the periodic collection of training data that AI companies use to build models. (On most AI search platforms you can opt out of having models train on your content while still being surfaced in search results.)
As a website operator, just as you want to be findable in Google, you want to be citable by AI assistants. AI search is one of the fastest-growing consumer applications of all time, and users who click through to your site from AI answers tend to be better qualified and more likely to convert. And unlike traditional web search, when you update a page that ChatGPT or Perplexity is retrieving and citing, the updated content is reflected in the AI's answers in real time.
tl;dr – What user agents should I allowlist to be cited and get traffic from AI platforms?
| Platform | robots.txt identifier | User-Agent header contains |
| --- | --- | --- |
| ChatGPT | OAI-SearchBot, ChatGPT-User | OAI-SearchBot, ChatGPT-User |
| Claude | Claude-SearchBot, Claude-User | Claude-SearchBot, Claude-User |
| Meta AI | meta-externalfetcher | meta-externalfetcher |
| Perplexity | PerplexityBot, Perplexity-User | PerplexityBot, Perplexity-User |
| Google AI Overviews | Googlebot | Googlebot |
| Google Gemini | Google-Extended | (crawls as Googlebot; Google-Extended is a robots.txt control token only) |

(Bot tokens change over time; check each platform's published crawler documentation for the current identifiers and IP ranges.)
Your site must also be indexed by normal web search engines, so allow Googlebot and Bingbot if you have not already done so.
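Translated into robots.txt, an allowlist for the retrieval and search agents above might look like the sketch below. Note that robots.txt is allow-by-default, so explicit Allow rules only matter if other groups in your file are restrictive:

```robots.txt
# Sketch: explicitly allow AI search / retrieval agents.
# robots.txt allows everything by default, so these rules only matter
# if other groups in this file are restrictive.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
Allow: /
```

Grouping several User-agent lines over one rule set is valid under RFC 9309, the Robots Exclusion Protocol.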
What user agents is it safe to block? How can I stop AI models from being trained on my data?
In general, we recommend that organizations for whom content itself is not a primary revenue stream consider allowing at least some training-data collection from their web properties. The more models are trained on general information about your brand, products, and services, the more likely you are to be correctly represented in model answers across the widest range of inquiries.
That said, OpenAI's training-data crawler can be blocked without affecting your ability to receive citations in ChatGPT. You can likewise safely block training-data collection by Anthropic (Claude) and Common Crawl without affecting your ability to be cited by chatbots. So you can block:

- GPTBot
- ClaudeBot
- CCBot

without any immediate negative effects.
Note that blocking ClaudeBot will prevent training data collection, but you should still allow Claude-SearchBot and Claude-User if you want to be cited in Claude's search and retrieval features.
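Putting that together, a robots.txt sketch that blocks the training crawlers above while keeping Claude's retrieval agents (and everything else) allowed:

```robots.txt
# Block training-data crawlers
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /

# Retrieval/search agents remain allowed (shown explicitly for Claude)
User-agent: Claude-SearchBot
User-agent: Claude-User
Allow: /
```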
Unfortunately, it's unclear whether you can disable training-data collection by Meta or Google without also affecting your presence and ability to be cited in Meta AI and Gemini. We'll update this guidance as their user-agent and bot policies become clearer.
How to allow or block AI user agents
There are two mechanisms for bot control that websites commonly deploy:
robots.txt
These are rules published on a per-domain basis that reputable bot operators (including OpenAI / ChatGPT) follow when deciding whether to access content. You can block user agents entirely or restrict them to specific content.
However, robots.txt is a voluntary, honor-system mechanism: the crawlers and other agents retrieving your content must specifically decide to obey it, and nothing technically enforces it. (Ignoring robots.txt is considered a major breach of internet etiquette.)
You should also be aware that there are some cases where robots.txt rules may be considered not to apply. For example, it's common for internet platforms to retrieve URLs that users submit to generate content previews or check for malicious pages. These aren't typically considered to be violations of robots.txt rules because they are the direct result of a user action — not an automated crawler.
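You can sanity-check how a given crawler will interpret your rules with Python's standard-library robots.txt parser. The rules below are a made-up example blocking only OpenAI's training crawler:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block OpenAI's training crawler, allow everyone else.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# GPTBot (training) is blocked; ChatGPT-User (retrieval) falls through to *.
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))        # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/pricing"))  # True
```

Running this against your live file (via `RobotFileParser.set_url` and `read`) is a quick way to catch rules that accidentally block a retrieval agent.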
Anti-bot middleware, web application firewalls, etc.
Some hosting providers employ more stringent network-level anti-bot protections; if you run your website in-house, these are often deployed by in-house IT teams. Examples of such products include Cloudflare Bot Management, Imperva, Akamai Bot Manager, and the Bot Control feature of AWS Web Application Firewall.
These are a technical enforcement mechanism that either outright blocks suspected bot traffic (based on the user agent, but also on signals like source IP address and browsing behavior) or requires completion of a CAPTCHA or other anti-bot challenge before visitors can access the actual page content.
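If you operate your own reverse proxy, the usual pattern is to exempt known AI retrieval agents from blanket bot challenges. The nginx sketch below is hypothetical (the variable name and the idea of gating a challenge on it are ours), and User-Agent strings are trivially spoofed, so production deployments should also verify requests against the IP ranges each platform publishes:

```nginx
# Hypothetical sketch: flag known AI search/retrieval agents so they can
# be exempted from a bot challenge elsewhere in the config.
map $http_user_agent $ai_search_agent {
    default              0;
    ~*OAI-SearchBot      1;
    ~*ChatGPT-User       1;
    ~*Claude-SearchBot   1;
    ~*Claude-User        1;
    ~*PerplexityBot      1;
    ~*Perplexity-User    1;
}
```

Managed products like Cloudflare expose the equivalent as "verified bot" categories in their dashboards.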
Recommendations
We've seen that many websites across the internet have turned on anti-bot protections from vendors like Cloudflare. In particular, many media properties do not want their content to be used to train LLMs that may compete with them in the future.
However, this anti-bot blocking is also preventing ChatGPT's
