NeuralCrawl

🇺🇸 Cornell University

cornell.edu · Universities · rank #13 · University

AI crawler access (latest snapshot, 3h ago)

Legend: blocked · restricted · allowed · faded = inherited from the * wildcard group

GPTBot
ChatGPT-User
OAI-SearchBot
ClaudeBot
Claude-User
Claude-SearchBot
anthropic-ai
Claude-Web
CCBot
Google-Extended
Applebot-Extended
PerplexityBot
Perplexity-User
Bytespider
Amazonbot
FacebookBot
meta-externalagent
meta-externalfetcher
cohere-ai
AI2Bot
Diffbot
omgili
YouBot
DuckAssistBot
MistralAI-User
PanguBot
Timpibot

Current robots.txt 1018 bytes · sha256 cf32953b9920

User-agent: *
Crawl-Delay: 6
Disallow: /_dynamic_files/
Disallow: /_tasks/
Disallow: /test/
Disallow: /tools/
Disallow: /template/
Disallow: /search/
Disallow: /visit/plan/
Disallow: /video/kaltura/
Disallow: /video/tasks/
Disallow: /server-health-check/


# SiteImprove should ignore these page particularly because they aren't actually used, but are still linked for historical reasons
User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) SiteCheck-sitecrawl by Siteimprove.com
Disallow: /cuinfo/specialconditions/
Disallow: /_includes/header.cfm

User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com
Disallow: /cuinfo/specialconditions/
Disallow: /_includes/header.cfm

User-agent: HTML validator: Siteimprove_W3C_Validator/1.3
Disallow: /cuinfo/specialconditions/
Disallow: /_includes/header.cfm

User-agent: CSS Validator: Jigsaw/2.3.0 W3C_CSS_Validator_JFouffa/2.0
Disallow: /cuinfo/specialconditions/
Disallow: /_includes/header.cfm 
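
None of the AI crawlers listed above has its own group in this file, so each one falls back to the `User-agent: *` rules. The following is a minimal sketch of that fallback using Python's standard `urllib.robotparser`. The helper name `check` and the sample paths `/admissions/` and `/search/results` are illustrative assumptions, and only the wildcard group is reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Wildcard group copied from the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 6
Disallow: /_dynamic_files/
Disallow: /_tasks/
Disallow: /test/
Disallow: /tools/
Disallow: /template/
Disallow: /search/
Disallow: /visit/plan/
Disallow: /video/kaltura/
Disallow: /video/tasks/
Disallow: /server-health-check/
"""


def build_parser(robots_txt: str = ROBOTS_TXT) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def check(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` under these rules?"""
    return build_parser().can_fetch(agent, path)


if __name__ == "__main__":
    # GPTBot has no group of its own, so the * rules apply to it.
    for path in ("/admissions/", "/search/results"):
        print("GPTBot", path, check("GPTBot", path))
    # The crawl delay is inherited from the * group in the same way.
    print("crawl delay:", build_parser().crawl_delay("GPTBot"))
```

Under these rules a crawler such as GPTBot can fetch most of the site but not the listed directories. How the dashboard maps that to one of its legend labels is not specified in the page itself.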

Change history

  1. initial snapshot
    • First snapshot of robots.txt archived
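
The byte count and short sha256 shown next to the current robots.txt suggest that snapshots can be compared by content digest. The sketch below assumes the displayed value is the first 12 hex characters of the SHA-256 of the raw file bytes. The function names `fingerprint` and `has_changed` are hypothetical and not taken from the page.

```python
import hashlib


def fingerprint(body: bytes, prefix_len: int = 12):
    """Return (size in bytes, truncated SHA-256 hex digest) for a robots.txt body."""
    digest = hashlib.sha256(body).hexdigest()
    return len(body), digest[:prefix_len]


def has_changed(previous: bytes, current: bytes) -> bool:
    """Treat a snapshot as changed when its full digest differs from the previous one."""
    return hashlib.sha256(previous).digest() != hashlib.sha256(current).digest()


if __name__ == "__main__":
    old = b"User-agent: *\nDisallow: /search/\n"
    new = b"User-agent: *\nDisallow: /search/\nDisallow: /test/\n"
    print(fingerprint(old))
    print("changed:", has_changed(old, new))
```

Comparing full digests rather than the truncated prefix avoids false matches; the short prefix is only convenient for display.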