AI & search crawlers
Every crawler NeuralCrawl tracks in robots.txt — vendor, purpose, and how many monitored sites block it by name.
AI crawlers (27 bots)
| Bot | Vendor | Purpose | Sites blocking |
|---|---|---|---|
| CCBot | Common Crawl | Open web corpus (used for LLM training) | 93 (8.7%) |
| GPTBot | OpenAI | LLM training crawler | 89 (8.3%) |
| ClaudeBot | Anthropic | LLM training crawler | 87 (8.1%) |
| Bytespider | ByteDance | LLM training crawler | 84 (7.8%) |
| Google-Extended | Google | Gemini training opt-out token | 75 (7.0%) |
| meta-externalagent | Meta | Meta AI training crawler | 72 (6.7%) |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | 69 (6.4%) |
| Amazonbot | Amazon | Alexa / LLM crawler | 63 (5.9%) |
| anthropic-ai | Anthropic | Legacy crawler token | 60 (5.6%) |
| Diffbot | Diffbot | Structured data extraction | 60 (5.6%) |
| omgili | Webz.io | Data feeds resold for AI training | 56 (5.2%) |
| PerplexityBot | Perplexity | Answer-engine indexing | 55 (5.1%) |
| cohere-ai | Cohere | LLM training crawler | 51 (4.7%) |
| FacebookBot | Meta | Meta AI crawler (legacy) | 50 (4.7%) |
| Claude-Web | Anthropic | Legacy crawler token | 45 (4.2%) |
| ChatGPT-User | OpenAI | User-triggered browsing | 42 (3.9%) |
| Timpibot | Timpi | Decentralised index crawler | 37 (3.4%) |
| Claude-User | Anthropic | User-triggered browsing | 32 (3.0%) |
| Claude-SearchBot | Anthropic | Search indexing | 30 (2.8%) |
| Perplexity-User | Perplexity | User-triggered browsing | 29 (2.7%) |
| YouBot | You.com | Answer-engine indexing | 29 (2.7%) |
| DuckAssistBot | DuckDuckGo | DuckAssist answers | 28 (2.6%) |
| meta-externalfetcher | Meta | User-triggered fetcher | 28 (2.6%) |
| AI2Bot | Allen Institute for AI | Research crawler | 27 (2.5%) |
| OAI-SearchBot | OpenAI | Search indexing | 26 (2.4%) |
| MistralAI-User | Mistral AI | User-triggered browsing | 22 (2.0%) |
| PanguBot | Huawei | LLM training crawler | 22 (2.0%) |
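
The "Sites blocking" column counts sites that block a bot by name. This page does not describe how that check is made, so the following is only a minimal sketch of one plausible reading: a site counts as blocking a bot when a `User-agent` group naming that bot contains `Disallow: /`. The function name `blocks_by_name` and the sample file are hypothetical, and wildcard (`*`) groups are deliberately ignored.

```python
def blocks_by_name(robots_txt, bot):
    """Return True if a User-agent group that names `bot` contains 'Disallow: /'.

    Assumption: "blocked by name" means a full-site disallow in a group that
    lists the bot's token explicitly; wildcard groups do not count.
    """
    bot = bot.lower()
    agents = []        # user-agent tokens of the group currently being read
    in_rules = False   # True once the current group has seen an Allow/Disallow line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line that follows rules starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and bot in agents:
                return True
    return False


# Hypothetical robots.txt used only to exercise the function.
SAMPLE = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

print(blocks_by_name(SAMPLE, "GPTBot"))
print(blocks_by_name(SAMPLE, "Googlebot"))
```

Matching is case-insensitive on the token, since robots.txt user-agent matching is case-insensitive. A stricter or looser definition (for example, counting partial disallows) would change the counts.
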
Search-engine crawlers (12 bots)
| Bot | Vendor | Purpose | Sites blocking |
|---|---|---|---|
| PetalBot | Huawei | Petal Search index | 46 (4.3%) |
| Baiduspider | Baidu | Baidu Search index | 25 (2.3%) |
| SeznamBot | Seznam | Seznam Search index (Czechia) | 7 (0.7%) |
| Applebot | Apple | Siri & Spotlight Search index | 5 (0.5%) |
| YandexBot | Yandex | Yandex Search index | 5 (0.5%) |
| Slurp | Yahoo | Yahoo Search index | 4 (0.4%) |
| Sogou | Sogou | Sogou Search index (China) | 3 (0.3%) |
| Googlebot-News | Google | Google News index | 1 (0.1%) |
| Bingbot | Microsoft | Bing Search index | 0 (0.0%) |
| DuckDuckBot | DuckDuckGo | DuckDuckGo Search index | 0 (0.0%) |
| Googlebot | Google | Google Search index | 0 (0.0%) |
| Googlebot-Image | Google | Google Images index | 0 (0.0%) |
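
The two tables suggest a common pattern: sites name AI crawler tokens in a blocking group while leaving the major search crawlers alone. As a rough illustration, the sketch below builds such a robots.txt and checks it with Python's standard `urllib.robotparser`. The helper names `build_robots_txt` and `is_allowed`, the chosen tokens, and the `example.com` URL are assumptions made for this example, not anything the page prescribes.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_tokens):
    """Build a robots.txt that disallows the named tokens and allows all others."""
    lines = [f"User-agent: {token}" for token in blocked_tokens]
    lines += ["Disallow: /", "", "User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, bot, url="https://example.com/"):
    """Ask the standard-library parser whether `bot` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


# Hypothetical selection of tokens taken from the AI crawlers table.
robots_txt = build_robots_txt(["GPTBot", "CCBot", "ClaudeBot"])

print(robots_txt)
print(is_allowed(robots_txt, "GPTBot"))
print(is_allowed(robots_txt, "Googlebot"))
```

Note that robots.txt is advisory: it states the site's preference, and compliance depends on the crawler.
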
See also The Wall — a site-by-site matrix of blocking status across the most newsworthy crawlers.