NeuralCrawl

🇮🇱 Bar-Ilan University

biu.ac.il · Universities · rank #42 · University

AI crawler access (latest snapshot, 2h ago)

Legend: blocked · restricted · allowed · faded = inherited from the * wildcard group

GPTBot
ChatGPT-User
OAI-SearchBot
ClaudeBot
Claude-User
Claude-SearchBot
anthropic-ai
Claude-Web
CCBot
Google-Extended
Applebot-Extended
PerplexityBot
Perplexity-User
Bytespider
Amazonbot
FacebookBot
meta-externalagent
meta-externalfetcher
cohere-ai
AI2Bot
Diffbot
omgili
YouBot
DuckAssistBot
MistralAI-User
PanguBot
Timpibot
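
The legend distinguishes crawlers that have their own group in robots.txt from crawlers that fall back to the `*` wildcard group. The following is a minimal sketch of how such a classification could be derived. It is not NeuralCrawl's actual method: the function names `parse_groups` and `classify` are hypothetical, and the status rules (`Disallow: /` means "blocked", any other non-empty `Disallow` means "restricted") are assumptions. Agent names are matched exactly and case-insensitively, which is simpler than full product-token matching.

```python
from typing import Dict, List, Tuple

Rules = List[Tuple[str, str]]


def parse_groups(robots_txt: str) -> Dict[str, Rules]:
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups: Dict[str, Rules] = {}
    current: List[str] = []   # agents named by the group being read
    in_rules = False          # True once the current group has a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:      # a new User-agent after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups


def classify(agent: str, groups: Dict[str, Rules]) -> Tuple[str, bool]:
    """Return (status, inherited) for one crawler name (assumed rules)."""
    key = agent.lower()
    inherited = key not in groups             # no own group: use '*'
    rules = groups.get(key, groups.get("*", []))
    disallowed = [path for directive, path in rules
                  if directive == "disallow" and path]
    if "/" in disallowed:
        status = "blocked"
    elif disallowed:
        status = "restricted"
    else:
        status = "allowed"
    return status, inherited


# Shortened, illustrative input; not the full file.
SAMPLE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

groups = parse_groups(SAMPLE)
print(classify("GPTBot", groups))        # ('allowed', False)
print(classify("Claude-User", groups))   # ('restricted', True)
```

Under these assumptions, a crawler without its own group, such as `Claude-User` in the sample, would be reported as inheriting the wildcard group's restrictions.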

Current robots.txt 3363 bytes · sha256 1b3eba58b064

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

# Allow AI crawlers full access

User-agent: AI2Bot
Allow: /

User-agent: Ai2Bot-Dolma
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Brightbot 1.0
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: cohere-training-data-crawler
Allow: /

User-agent: Crawlspace
Allow: /

User-agent: Diffbot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: FacebookBot
Allow: /

User-agent: FriendlyCrawler
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: GoogleOther-Image
Allow: /

User-agent: GoogleOther-Video
Allow: /

User-agent: GPTBot
Allow: /

User-agent: iaskspider/2.0
Allow: /

User-agent: ICC-Crawler
Allow: /

User-agent: ImagesiftBot
Allow: /

User-agent: img2dataset
Allow: /

User-agent: ISSCyberRiskCrawler
Allow: /

User-agent: Kangaroo Bot
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: omgili
Allow: /

User-agent: omgilibot
Allow: /

User-agent: PanguBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: PetalBot
Allow: /

User-agent: Scrapy
Allow: /

User-agent: SemrushBot-OCOB
Allow: /

User-agent: SemrushBot-SWA
Allow: /

User-agent: Sidetrade indexer bot
Allow: /

User-agent: Timpibot
Allow: /

User-agent: VelenPublicWebCrawler
Allow: /

User-agent: Webzio-Extended
Allow: /

User-agent: YouBot
Allow: /


# General rules for all other bots
User-agent: *
# Disallow: /
# CSS, JS, Images
Allow: /core/*.css$
Allow: /core/*.css?
Allow: /core/*.js$
Allow: /core/*.js?
Allow: /core/*.gif
Allow: /core/*.jpg
Allow: /core/*.jpeg
Allow: /core/*.png
Allow: /core/*.svg
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /profiles/*.svg
# Directories
Disallow: /core/
Disallow: /profiles/
# Files
Disallow: /README.txt
Disallow: /web.config
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /index.php/admin/
Disallow: /index.php/comment/reply/
Disallow: /index.php/filter/tips
Disallow: /index.php/node/add/
Disallow: /index.php/search/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
Disallow: /taxonomy/*

Sitemap: https://www.biu.ac.il/sitemap.xml
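
To see how a file like this is evaluated for a specific crawler, the sketch below feeds a shortened excerpt to Python's standard `urllib.robotparser`. The excerpt, the helper `allowed`, and the agent name `ExampleBot` are illustrative assumptions. The standard-library parser matches paths by plain prefix and does not interpret the `*` and `$` patterns, so the `Allow: /core/*.css$` style lines are left out of the excerpt.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the file above (assumption: enough to show the idea).
EXCERPT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
Disallow: /search/

Sitemap: https://www.biu.ac.il/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXCERPT.splitlines())

BASE = "https://www.biu.ac.il"


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` on this host?"""
    return parser.can_fetch(agent, BASE + path)


print(allowed("GPTBot", "/admin/"))      # True: GPTBot has its own Allow: / group
print(allowed("ExampleBot", "/admin/"))  # False: falls back to the * group
print(allowed("ExampleBot", "/about"))   # True: no rule in the * group matches
print(parser.site_maps())                # ['https://www.biu.ac.il/sitemap.xml']
```

A crawler with its own group ignores the wildcard group entirely, so the named AI crawlers are not bound by the `Disallow` lines listed under `User-agent: *`.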

Change history

  1. initial snapshot
    • First snapshot of robots.txt archived
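
A change history like this can be kept by fingerprinting each fetched copy of robots.txt and recording an entry only when the fingerprint differs from the previous one. The sketch below assumes the displayed `sha256` value is the first 12 hex characters of the SHA-256 digest of the raw bytes; that is a guess about the site's format. The names `fingerprint` and `record_snapshot` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(body: bytes, digits: int = 12) -> Tuple[int, str]:
    """Return (size in bytes, truncated SHA-256 hex digest) for a snapshot."""
    return len(body), hashlib.sha256(body).hexdigest()[:digits]


def record_snapshot(history: List[Dict[str, object]], body: bytes) -> bool:
    """Append an entry if `body` is new or changed; return True if appended."""
    size, digest = fingerprint(body)
    if not history:
        note = "initial snapshot"
    elif history[-1]["sha256"] != digest:
        note = "content changed"
    else:
        return False          # same content as the last snapshot
    history.append({"sha256": digest, "bytes": size, "note": note})
    return True


# Illustrative bodies; not the archived file.
history: List[Dict[str, object]] = []
first = b"User-agent: *\nDisallow: /admin/\n"
record_snapshot(history, first)                            # initial snapshot
record_snapshot(history, first)                            # unchanged, skipped
record_snapshot(history, first + b"Disallow: /search/\n")  # changed
print([entry["note"] for entry in history])
# ['initial snapshot', 'content changed']
```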