NeuralCrawl

The Guardian / robots.txt snapshot

fetched 2026-06-20T01:10:30Z · HTTP 200 · 2844 bytes · sha256 86ff4f30943f492a

final URL: https://www.theguardian.com/robots.txt

# This is the robots.txt file for theguardian.com

# Guardian content is made available under our terms and conditions of use.
# Any other uses are not permitted, incl. but not limited to: for large language
# models (LLMs), machine learning and/or artificial intelligence-related
# purposes; with any of the aforementioned technologies; and/or for any
# commercial purposes. Contact [email protected] for assistance

User-agent: *
Disallow: /sendarticle/
Disallow: /Users/
Disallow: /users/
Disallow: /*/print$
Disallow: /email/
Disallow: /contactus/
Disallow: /share/
Disallow: /websearch
Disallow: /*?commentpage=
Disallow: /whsmiths/
Disallow: /external/overture/
Disallow: /discussion/report-abuse/*
Disallow: /discussion/report-abuse-ajax/*
Disallow: /discussion/comment-permalink/*
Disallow: /discussion/report-abuse/*
Disallow: /discussion/user-report-abuse/*
Disallow: /discussion/handlers/*
Disallow: /discussion/your-profile
Disallow: /discussion/your-comments
Disallow: /discussion/edit-profile
Disallow: /discussion/search/comments
Disallow: /discussion/*
Disallow: /search
Disallow: /music/artist/*
Disallow: /music/album/*
Disallow: /books/data/*
Disallow: /settings/
Disallow: /embed/
Disallow: /*styles/js-on.css$
Disallow: /sport/olympics/2008/events/*
Disallow: /sport/olympics/2008/medals/*
Disallow: /f/healthcheck
Disallow: /sections
Disallow: /top-stories
Disallow: /most-read/sport
Disallow: /articles
Disallow: /global$
Disallow: /*/feedarticle/*
Disallow: /travel/2013/aug/22/been-there-readers-competition?*
Disallow: /preference/*
Disallow: /59666047/
Disallow: /print/
Disallow: /info/tech-feedback
Disallow: /production-monitoring/
Disallow: *.emailjson
Disallow: *.emailtxt
Disallow: /headline.txt
Disallow: *?*dcr=apps*

User-agent: Mediapartners-Google
Disallow:

Sitemap: http://www.theguardian.com/sitemaps/news.xml
Sitemap: http://www.theguardian.com/sitemaps/video.xml

User-agent: NewsNow
User-agent: CCBot
User-agent: TurnitinBot
User-agent: PetalBot
User-agent: MoodleBot
User-agent: FacebookBot
User-agent: Bytespider
User-agent: Mojeek
User-agent: JenkersBot
User-agent: Seekr
User-agent: YouBot
User-agent: Arquivo-web-crawler
User-agent: coccocbot-web
User-agent: SeznamBot
User-Agent: PerplexityBot
User-Agent: yacy
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: SentiOne
User-agent: ImageSift
User-agent: Applebot-Extended
User-agent: YandexAdditional
User-agent: YandexAdditionalBot
User-agent: scalepostAI
User-agent: Buck
User-agent: meta-externalagent
User-agent: Amazonbot
User-agent: DuckAssistBot
User-agent: Google-CloudVertexBot
User-agent: Amzn-SearchBot
User-agent: AhrefsBot
User-agent: AhrefsSiteAudit
Disallow: /

License: https://theguardian.com/license.xml
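
How a crawler might evaluate these rules can be sketched in Python. This is an illustrative sketch under stated assumptions, not how any particular crawler reads the file. It feeds a short excerpt of the snapshot (not the full file) to the standard library's urllib.robotparser to pick the right user-agent group. That module matches paths by plain prefix and does not interpret the `*` and `$` wildcards seen in rules such as `/*/print$`, so a small hypothetical helper, pattern_matches, shows one way those patterns could be translated to a regular expression. The agent name ExampleBot is made up for illustration.

```python
import re
from urllib.robotparser import RobotFileParser

# Excerpt of the snapshot above, limited to wildcard-free rules.
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /search
Disallow: /email/

User-agent: Mediapartners-Google
Disallow:

User-agent: CCBot
User-agent: ClaudeBot
Disallow: /
"""


def pattern_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: match a robots.txt path pattern against a path.

    '*' stands for any run of characters and a trailing '$' anchors the
    end of the path; without '$' the pattern is treated as a prefix.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


parser = RobotFileParser()
parser.parse(ROBOTS_EXCERPT.splitlines())

base = "https://www.theguardian.com"
# Listed agents fall under the shared "Disallow: /" group.
claudebot_world = parser.can_fetch("ClaudeBot", base + "/world")
# Unlisted agents fall back to the "*" group.
generic_search = parser.can_fetch("ExampleBot", base + "/search")
generic_world = parser.can_fetch("ExampleBot", base + "/world")
# An empty Disallow permits everything for that agent.
adsense_search = parser.can_fetch("Mediapartners-Google", base + "/search")

# Wildcard rules copied from the "*" group of the snapshot.
print_page_blocked = pattern_matches("/*/print$", "/world/2024/print")
global_blocked = pattern_matches("/global$", "/global")
global_subpath_blocked = pattern_matches("/global$", "/global/2024")
```

With this excerpt, ClaudeBot is refused everywhere, an unlisted agent is refused /search but allowed /world, and Mediapartners-Google is allowed everything. The helper reports `/world/2024/print` and `/global` as blocked but not `/global/2024`, because `$` requires the path to end there.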