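I examined the robots.txt files of British newspapers. This is the third installment in my series on investigating robots.txt files.
According to Wikipedia's list of newspapers in the United Kingdom, the British quality papers are the following five.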
- The Daily Telegraph / The Sunday Telegraph
- The Times / The Sunday Times
- The Independent
- The Guardian / The Observer
- The Financial Times
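The results are shown in the table below. In general, none of the files prohibits automated access by an individual's program. Note, however, that some sites ban specific crawlers such as Nutch by name.
Note: No guarantee is made about the contents of this article. If you actually retrieve data from these sites, please do your own investigation first.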
Newspaper | Restrictions by User-agent | Article lists | Article body |
---|---|---|---|
The Daily Telegraph | Only Endeca is handled separately; all other agents share one rule set. | OK | OK |
The Times | Specific crawlers are banned | Not found | OK |
The Independent | Nutch is not allowed | OK | OK |
The Guardian | - | OK | OK |
The Financial Times | - | OK | OK |
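
To find crawlers that are banned by name, it can help to group a robots.txt file by user agent and list the agents whose group contains `Disallow: /`. The following is a minimal sketch under that assumption. The function name `fully_blocked_agents` and the shortened `SAMPLE` text (assembled from rules quoted later in this post) are my own illustrations, not part of any site's tooling.

```python
from collections import defaultdict


def fully_blocked_agents(robots_txt: str) -> list[str]:
    """Return the user agents whose rule group contains 'Disallow: /'."""
    rules = defaultdict(list)   # agent -> list of (directive, value)
    current_agents = []         # agents of the group currently being read
    reading_agents = False      # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:            # a new group starts here
                current_agents = []
                reading_agents = True
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                rules[agent].append((field, value))
    return [agent for agent, group in rules.items()
            if ("disallow", "/") in group]


# Shortened sample assembled for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /search/

User-agent: NewsNow
Disallow: /

User-agent: Nutch
Disallow: /
"""

print(fully_blocked_agents(SAMPLE))  # ['NewsNow', 'Nutch']
```

If your own crawler's name shows up in the returned list, the site has opted out of being crawled by it.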
The Daily Telegraph
robots.txt
# Robots.txt file
# All robots will spider the domain
User-agent: *
Disallow: */application/*
Disallow: */ixale/
Disallow: /core/Content/
Disallow: /*?source=rss
Disallow: /*?mobile=true
Disallow: /*?mobile=basic
Disallow: /*?ModPagespeed=noscript
Disallow: /promotions/emails/
Disallow: /r/
Disallow: /search/*
Disallow: /searchbeta/*
Disallow: /sponsored/travel/msc-cruises/
Disallow: /travel/8711559/The-Telegraph-Travel-Awards-2011.html
Disallow: /travel/hotel/e/*
Disallow: /sponsored/staging/
Disallow: /sponsored/email/
Disallow: /sponsored/business/northern-ireland-business/
Disallow: /sponsored/business/lloyds-tsb-enterprise-awards/
Disallow: /sponsored/earth/statoil/
Disallow: /sponsored/motoring/alfa-romeo-cars/
Disallow: /sponsored/motoring/vw-up/
Disallow: /sponsored/property/all-saints-eastbourne/
Disallow: /sponsored/supplement-portfolio/
Disallow: /sponsored/travel/cunard-cruises/
Disallow: /sponsored/travel/cruise-holidays/
Disallow: /sponsored/travel/macau/macaumap/
Disallow: /sponsored/travel/telegraph-cottages/
Disallow: /sponsored/finance/spread-betting/
Disallow: /sponsored/finance/retirement-annuity/
Disallow: /sponsored/travel/hidden-britain/
Disallow: /sponsored/business/sme-business-essentials/
Disallow: /sponsored/in-the-know/london-cultural-attractions
Disallow: /sponsored/in-the-know/london-dining
Disallow: /sponsored/in-the-know/london-entertainment
Disallow: /sponsored/in-the-know/london-lifestyle
Disallow: /sponsored/in-the-know/london-nightlife
Disallow: /sponsored/in-the-know/london-shopping
Disallow: /sponsored/in-the-know/london-sport-activities
Disallow: /sponsored/in-the-know/london-transport-accommodation
Disallow: /sponsored/in-the-know/london-video-guides
Disallow: /sponsored/motoring/suzuki-motorbikes/
Disallow: /sponsored/technology/cool-list/
Disallow: /sponsored/why-not/11633071/Voltarol-Tool.html
Disallow: /sponsored/why-not/11632940/voltarol-Poll.html
Disallow: /sponsored/why-not/11633725/Voltarol-tool-Mobile.html
Disallow: /sponsored/11874202/Header-for-article-bottom-containers-Dove-Rugby.html
Disallow: /news/0/general-election-2017-latest-results-exit-polls-reaction/
Disallow: /events/jeffrey-archer/order-confirmation/
Disallow: /events/jodi-and-graham/order-confirmation/
Disallow: /events/champions-tennis-vip-experience/order-confirmation/
Disallow: /events/an-evening-with-sir-ranulph-fiennes/order-confirmation/
Disallow: /events/the-telegraph-dining-experience/order-confirmation/
Disallow: /travel/*/articles/page-*
Disallow: /*_jcr_content*
Sitemap: http://www.telegraph.co.uk/cars/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/news/news_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/section/section_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/video/video_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/web/web_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/film/sitemap.xml
Sitemap: http://www.telegraph.co.uk/fashion/sitemap.xml
Sitemap: http://www.telegraph.co.uk/beauty/sitemap.xml
Sitemap: http://www.telegraph.co.uk/books/sitemap.xml
Sitemap: http://www.telegraph.co.uk/music/sitemap.xml
Sitemap: http://www.telegraph.co.uk/theatre/sitemap.xml
Sitemap: http://www.telegraph.co.uk/dance/sitemap.xml
Sitemap: http://www.telegraph.co.uk/opera/sitemap.xml
Sitemap: http://www.telegraph.co.uk/photography/sitemap.xml
Sitemap: http://www.telegraph.co.uk/art/sitemap.xml
Sitemap: http://www.telegraph.co.uk/radio/sitemap.xml
Sitemap: http://www.telegraph.co.uk/comedy/sitemap.xml
Sitemap: http://www.telegraph.co.uk/food-and-drink/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gardening/sitemap.xml
Sitemap: http://www.telegraph.co.uk/interiors/sitemap.xml
Sitemap: http://www.telegraph.co.uk/pets/sitemap.xml
Sitemap: http://www.telegraph.co.uk/wellbeing/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gaming/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemap-news.xml
Sitemap: http://www.telegraph.co.uk/sitemap.xml
Sitemap: http://www.telegraph.co.uk/news/sitemap.xml
Sitemap: http://www.telegraph.co.uk/travel/sitemap.xml
Sitemap: http://www.telegraph.co.uk/financial-services/sitemap.xml
User-Agent: endeca
Disallow: /archive/
Disallow: /search/*
Article lists
These paths are not listed under Disallow, so they can be accessed.
PREMIUM http://www.telegraph.co.uk/premium/
NEWS http://www.telegraph.co.uk/news/
SPORT http://www.telegraph.co.uk/sport/
BUSINESS http://www.telegraph.co.uk/business/
MONEY http://www.telegraph.co.uk/money/
OPINION http://www.telegraph.co.uk/opinion/
TECH & SCIENCE http://www.telegraph.co.uk/science-technology/
CULTURE http://www.telegraph.co.uk/culture/
TRAVEL http://www.telegraph.co.uk/travel/
Articles
The TRAVEL section needs care because of the following rule.
Disallow: /travel/*/articles/page-*
Some TRAVEL articles do not match this pattern.
Everything else is accessible.
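Rules like the one above use `*` wildcards, which, as far as I know, Python's standard `urllib.robotparser` does not expand (it compares plain path prefixes). A rough way to test such a rule is to translate the pattern into a regular expression. This is only a sketch: `pattern_to_regex`, `is_blocked`, and the example paths are hypothetical, and the rule list is a small excerpt of the Telegraph file.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end;
    everything else is literal and matched from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list[str]) -> bool:
    """True if any Disallow pattern matches the path (Allow rules are ignored here)."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


# Excerpt of the Telegraph's Disallow rules quoted in this post.
TELEGRAPH_RULES = ["/travel/*/articles/page-*", "/search/*", "/*?source=rss"]

# Hypothetical paths, used only to exercise the patterns.
print(is_blocked("/travel/destinations/articles/page-2", TELEGRAPH_RULES))  # True
print(is_blocked("/travel/destinations/articles/", TELEGRAPH_RULES))        # False
print(is_blocked("/news/?source=rss", TELEGRAPH_RULES))                     # True
```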
The Times
User-agent:*
Sitemap: https://www.thetimes.co.uk/sitemaps/sitemap.xml
Disallow: /tto/papers/
Disallow: /tto/papers.do
Disallow: /tto/feeds/
Disallow: /tto/adtest/
Disallow: /tto/public/sitesearch.do
Disallow: /tto/public/needtoknow/
Disallow: /article/back-in-time-l9ksbst58
Disallow: /article/editors-top-10-c2xhkbmml
Disallow: /article/3-of-the-best-3qs2jnx8f
Disallow: /article/review-william-hunter-to-damien-hirst-the-dead-teach-the-living-tdfxxsxsn
Disallow: /article/whats-on-music-rpmq3kt6n
Disallow: /article/whats-on-theatre-8jjz928wx
Disallow: /article/days-out-c9gh3pmpw
Disallow: /article/whats-on-comedy-g59s9tb3c
Disallow: /article/visual-art-nhfgfd8cf
Disallow: /article/books-lwp8l0lvf
Disallow: /article/grubs-up-ztlgc03j7
Disallow: /article/baby-blues-j2vmg5qjp
Disallow: /article/editors-top-10-d07x6rgdp
Disallow: /article/3-of-the-best-music-festivals-3rrqnqvpw
Disallow: /article/box-of-tricks-wsrdw98c8
Disallow: /article/music-v8gg0rc7d
Disallow: /article/whats-on-theatre-77hbgd32s
Disallow: /article/whats-on-days-out-92mr9hk3c
Disallow: /article/whats-on-comedy-8z3ks9h9c
Disallow: /article/whats-on-visual-art-mswsvtzfc
Disallow: /article/whats-on-books-pc808tkc5
Disallow: /article/theatre-jackie-the-musical-0t6dknhqg
Disallow: /article/3-of-the-best-foodie-events-76fbzj9wt
Disallow: /article/editors-top-10-h77tfl322
Disallow: /article/best-of-the-fest-2w078t90f
Disallow: /article/review-simon-starling-at-twilight-b9jsk66lt
Disallow: /article/whats-on-music-k9mgl9sx9
Disallow: /article/whats-on-comedy-8dns3nm3d
Disallow: /article/whats-on-theatre-5drb0vdxd
Disallow: /article/whats-on-visual-art-nphswswgd
Disallow: /article/whats-on-days-out-v9v6vjfrn
Disallow: /article/bucket-list-nb76vzqw9
Disallow: /article/whats-on-books-hs260lxxp
Disallow: /article/a-timely-revival-wmgstzjhb
Disallow: /article/editors-top-10-7x30l5tgg
Disallow: /article/great-expectations-vcpjz5mtt
Disallow: /article/review-richard-obriens-rocky-horror-show-2xlqxvssm
Disallow: /article/3-of-the-best-events-for-book-lovers-pmfh7cg6j
Disallow: /article/whats-on-music-gkn6vs0qr
Disallow: /article/whats-on-comedy-wghc98t9g
Disallow: /article/whats-on-theatre-k08kcl95h
Disallow: /article/whats-on-visual-art-pgsvn8j6r
Disallow: /article/whats-on-days-out-prj7ftkwm
Disallow: /article/whats-on-books-0xcq7jc9f
Disallow: /article/virgin-money-fireworks-concert-zjsbfnx3z
Disallow: /article/california-dreamin-rng6hjksw
Disallow: /article/editors-top-10-flqmnlx6n
Disallow: /article/3-of-the-best-fringe-kids-show-mpklk6bb7
Disallow: /article/review-botanic-gardens-i-still-believe-in-miracles-8q0fbkbb0
Disallow: /article/whats-on-music-r95jjbj89
Disallow: /article/what-s-on-theatre-vqwdc6n3n
Disallow: /article/what-s-on-days-out-hj7d32hkj
Disallow: /article/whats-on-comedy-zxjg78pnb
Disallow: /article/whats-on-visual-art-pgf9hpn0w
Disallow: /article/whats-on-books-686d5mnxl
Disallow: /article/story-time-67gr6rwj2
# Agent Specific Disallowed Sections
User-agent: NewsNow
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: WebVac
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: psbot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: Meltwater
Article lists
There do not appear to be list pages for each section.
Perhaps articles are meant to be found by following sitemap.xml.
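If the sitemap is indeed the entry point, the URLs could be pulled out of the XML with the standard library. The sketch below parses an inline, hypothetical sample rather than the real file, which I have not inspected; `extract_locs` is an illustrative name. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall(".//sm:loc", SITEMAP_NS)
            if loc.text]


# Hypothetical sample; the real sitemap may be structured differently.
SAMPLE = """\
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/news-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/news-2.xml</loc></sitemap>
</sitemapindex>
"""

for url in extract_locs(SAMPLE):
    print(url)
```

A sitemap index points to further sitemap files, so the same function would be applied again to each file it returns until article URLs appear.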
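Articles
Access is allowed except for the specific articles listed under /article.
However, only part of each article can be read without registering.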
Uber loses appeal against employment rights ruling
https://www.thetimes.co.uk/edition/news/uber-loses-appeal-against-employment-rights-ruling-qhbkzvbl9
The Independent
#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used: http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html
User-agent: *
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/
# Ignore refresh URLs
Disallow: /*ILC-refresh
User-agent: Nutch
Disallow: /
Sitemap: http://www.independent.co.uk/googlenewssitemap
Sitemap: http://www.independent.co.uk/sitemap.xml
Disallow: /pugpig/
Sitemap: http://www.independent.co.uk/sitemap.xml
Article lists
These paths are not listed under Disallow, so there is no problem.
http://www.independent.co.uk/news/uk/politics
http://www.independent.co.uk/topic/brexit
http://www.independent.co.uk/news/uk
http://www.independent.co.uk/infact
http://www.independent.co.uk/voices
http://www.independent.co.uk/life-style
http://www.independent.co.uk/sport
http://www.independent.co.uk/news/business
http://www.independent.co.uk/arts-entertainment
Articles
These look fine as well.
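Most of the Independent's rules are plain path prefixes, so Python's standard `urllib.robotparser` should be able to evaluate them. The sketch below feeds it a shortened excerpt of the file, with no network access, and compares a generic agent with Nutch. The agent name `my-research-bot` and the helper `allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Independent's robots.txt quoted above.
RULES = """\
User-agent: *
Disallow: /search/
Disallow: /user/login/

User-agent: Nutch
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Ask the parsed rules whether the agent may fetch the URL."""
    return parser.can_fetch(agent, url)


url = "http://www.independent.co.uk/news/uk"
print(allowed("my-research-bot", url))  # True
print(allowed("Nutch", url))            # False
print(allowed("my-research-bot",
              "http://www.independent.co.uk/search/brexit"))  # False
```

The `/*ILC-refresh` rule is left out of the excerpt because it relies on a wildcard, which this parser compares literally.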
The Guardian
# this is the robots.txt file for theguardian.com
User-agent: *
Disallow: /sendarticle/
Disallow: /Users/
Disallow: /users/
Disallow: /*/print$
Disallow: /email/
Disallow: /contactus/
Disallow: /share/
Disallow: /websearch
Disallow: /*?commentpage=
Disallow: /whsmiths/
Disallow: /external/overture/
Disallow: /discussion/report-abuse/*
Disallow: /discussion/report-abuse-ajax/*
Disallow: /discussion/comment-permalink/*
Disallow: /discussion/report-abuse/*
Disallow: /discussion/user-report-abuse/*
Disallow: /discussion/handlers/*
Disallow: /discussion/your-profile
Disallow: /discussion/your-comments
Disallow: /discussion/edit-profile
Disallow: /discussion/search/comments
Disallow: /discussion/*
Disallow: /search
Disallow: /music/artist/*
Disallow: /music/album/*
Disallow: /books/data/*
Disallow: /settings/
Disallow: /embed/
Disallow: /*styles/js-on.css$
Disallow: /sport/olympics/2008/events/*
Disallow: /sport/olympics/2008/medals/*
Disallow: /f/healthcheck
Disallow: /sections
Disallow: /top-stories
Disallow: /most-read/sport
Disallow: /articles
Disallow: /podcasts
Disallow: /global$
Disallow: /*/feedarticle/*
Disallow: /travel/2013/aug/22/been-there-readers-competition?*
Disallow: /preference/*
Disallow: /59666047/
Disallow: /print/
Disallow: /info/tech-feedback
Disallow: /production-monitoring/
User-agent: Mediapartners-Google
Disallow:
Sitemap: http://www.theguardian.com/sitemaps/news.xml
Sitemap: http://www.theguardian.com/sitemaps/video.xml
User-agent: bingbot
Crawl-delay: 1
Article lists
These paths are not listed under Disallow, so they can be accessed.
UK https://www.theguardian.com/uk-news
world https://www.theguardian.com/world
sport https://www.theguardian.com/uk/sport
football https://www.theguardian.com/football
opinion https://www.theguardian.com/uk/commentisfree
culture https://www.theguardian.com/uk/culture
business https://www.theguardian.com/uk/business
lifestyle https://www.theguardian.com/uk/lifeandstyle
fashion https://www.theguardian.com/fashion
environment https://www.theguardian.com/uk/environment
tech https://www.theguardian.com/uk/technology
travel https://www.theguardian.com/uk/travel
Articles
Articles basically appear to sit under the list paths, so they can be accessed as well.
iPhone X: most expensive Apple phone is also easiest to break
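The Guardian's file also contains two agent-specific groups: an empty `Disallow:` for Mediapartners-Google, which allows everything, and `Crawl-delay: 1` for bingbot. A sketch of how a crawler might read these with `urllib.robotparser` follows. The excerpt is shortened, and `DEFAULT_DELAY`, `delay_for`, and the agent name `my-research-bot` are assumptions of mine, not values taken from the Guardian.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Guardian's robots.txt quoted above.
RULES = """\
User-agent: *
Disallow: /search
Disallow: /email/

User-agent: Mediapartners-Google
Disallow:

User-agent: bingbot
Crawl-delay: 1
"""

DEFAULT_DELAY = 5.0  # hypothetical fallback in seconds, chosen for illustration


def delay_for(agent: str, parser: RobotFileParser) -> float:
    """Return the Crawl-delay for the agent, or a conservative default."""
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else DEFAULT_DELAY


parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(delay_for("bingbot", parser))          # 1.0
print(delay_for("my-research-bot", parser))  # 5.0
print(parser.can_fetch("Mediapartners-Google",
                       "https://www.theguardian.com/search"))  # True
print(parser.can_fetch("my-research-bot",
                       "https://www.theguardian.com/search"))  # False
```

Waiting at least `delay_for(...)` seconds between requests (for example with `time.sleep`) keeps a personal crawler on the polite side even where no delay is specified.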
The Financial Times
robots.txt
# all use of FT content is subject to the Terms & Conditions and Copyright Policy set out on FT.com
Sitemap: https://www.ft.com/sitemaps/index.xml
User-agent: *
Disallow: /__
Disallow: /search
Disallow: /advanced-search
Disallow: /offline/
Disallow: /myft/
Allow: /myft/list/
Allow: /__origami/
Allow: /__assets/
Article lists
These paths are not listed under Disallow, so they can be accessed.
Global economy https://www.ft.com/global-economy
Politics https://www.ft.com/world/us/politics
Articles
Access is allowed, but a subscription is required to view the content.
https://www.ft.com/content/824929b4-c471-11e7-a1d2-6786f39ef675
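The FT file mixes Disallow and Allow, for example `Disallow: /myft/` together with `Allow: /myft/list/`. Under the longest-match rule described in RFC 9309, the more specific Allow wins, whereas a first-match parser (which I believe includes Python's `urllib.robotparser`) would stop at the earlier Disallow. The sketch below implements longest-match for plain prefixes only. `is_allowed` and the sample paths other than the article URL are hypothetical.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match evaluation of plain-prefix Allow/Disallow rules.

    The longest matching prefix wins; on a tie Allow wins;
    a path that matches no rule is allowed.
    """
    best_len = -1
    verdict = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            verdict = allow
    return verdict


# Rules for "User-agent: *" as quoted from the FT's robots.txt above.
FT_RULES = [
    ("Disallow", "/__"),
    ("Disallow", "/search"),
    ("Disallow", "/advanced-search"),
    ("Disallow", "/offline/"),
    ("Disallow", "/myft/"),
    ("Allow", "/myft/list/"),
    ("Allow", "/__origami/"),
    ("Allow", "/__assets/"),
]

print(is_allowed("/myft/list/example", FT_RULES))   # True
print(is_allowed("/myft/following", FT_RULES))      # False
print(is_allowed("/__origami/service", FT_RULES))   # True
print(is_allowed("/content/824929b4-c471-11e7-a1d2-6786f39ef675", FT_RULES))  # True
```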