robots.txt

A robots.txt survey: British newspapers, quality papers

I surveyed the robots.txt files of British newspapers. This is the third installment in my series of robots.txt surveys.

According to Wikipedia's list of newspapers in the United Kingdom, the British quality press consists of the following five papers.

  • The Daily Telegraph / The Sunday Telegraph
  • The Times / The Sunday Times
  • The Independent
  • The Guardian / The Observer
  • The Financial Times

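The results are summarized in the table below. In general, none of these robots.txt files rules out automated access by an individual's program. Note, however, that some sites ban specific crawlers, such as Nutch, by name.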

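Note: no guarantee is given for the content of this article. If you actually intend to retrieve data from any of these sites, please check the rules yourself.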

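| Newspaper | Restrictions by User-Agent | Article lists | Article body |
|---|---|---|---|
| The Daily Telegraph | Only Endeca is treated separately; all other agents share the same rules. | OK | OK |
| The Times | Specific crawlers are banned | OK | OK |
| The Independent | Nutch is disallowed | Not found | OK |
| The Guardian | - | OK | OK |
| The Financial Times | - | OK | OK |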
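
A check like the ones in this table can be scripted with Python's standard `urllib.robotparser`. The following is only a minimal sketch: the rules are a shortened excerpt modelled on the Independent's file quoted later in this post, and the agent name `my-personal-bot` is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt modelled on the Independent's robots.txt (not the full file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/

User-agent: Nutch
Disallow: /
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
URL = "http://www.independent.co.uk/news/uk"

# "my-personal-bot" is a hypothetical agent name; it falls under "User-agent: *".
print(parser.can_fetch("my-personal-bot", URL))  # True
# Nutch is banned by name from the whole site.
print(parser.can_fetch("Nutch", URL))            # False
```

Against a live site you would call `set_url()` and `read()` instead of `parse()`, and the answer only reflects robots.txt, not the site's terms of use.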

The Daily Telegraph

The Daily Telegraph
robots.txt

# Robots.txt file
# All robots will spider the domain

User-agent: *

Disallow: */application/*
Disallow: */ixale/
Disallow: /core/Content/
Disallow: /*?source=rss
Disallow: /*?mobile=true
Disallow: /*?mobile=basic
Disallow: /*?ModPagespeed=noscript
Disallow: /promotions/emails/
Disallow: /r/
Disallow: /search/*
Disallow: /searchbeta/*
Disallow: /sponsored/travel/msc-cruises/
Disallow: /travel/8711559/The-Telegraph-Travel-Awards-2011.html
Disallow: /travel/hotel/e/*
Disallow: /sponsored/staging/
Disallow: /sponsored/email/
Disallow: /sponsored/business/northern-ireland-business/
Disallow: /sponsored/business/lloyds-tsb-enterprise-awards/
Disallow: /sponsored/earth/statoil/
Disallow: /sponsored/motoring/alfa-romeo-cars/
Disallow: /sponsored/motoring/vw-up/
Disallow: /sponsored/property/all-saints-eastbourne/
Disallow: /sponsored/supplement-portfolio/
Disallow: /sponsored/travel/cunard-cruises/
Disallow: /sponsored/travel/cruise-holidays/
Disallow: /sponsored/travel/macau/macaumap/
Disallow: /sponsored/travel/telegraph-cottages/
Disallow: /sponsored/finance/spread-betting/
Disallow: /sponsored/finance/retirement-annuity/
Disallow: /sponsored/travel/hidden-britain/
Disallow: /sponsored/business/sme-business-essentials/
Disallow: /sponsored/in-the-know/london-cultural-attractions
Disallow: /sponsored/in-the-know/london-dining
Disallow: /sponsored/in-the-know/london-entertainment
Disallow: /sponsored/in-the-know/london-lifestyle
Disallow: /sponsored/in-the-know/london-nightlife
Disallow: /sponsored/in-the-know/london-shopping
Disallow: /sponsored/in-the-know/london-sport-activities
Disallow: /sponsored/in-the-know/london-transport-accommodation
Disallow: /sponsored/in-the-know/london-video-guides
Disallow: /sponsored/motoring/suzuki-motorbikes/
Disallow: /sponsored/technology/cool-list/
Disallow: /sponsored/why-not/11633071/Voltarol-Tool.html
Disallow: /sponsored/why-not/11632940/voltarol-Poll.html
Disallow: /sponsored/why-not/11633725/Voltarol-tool-Mobile.html
Disallow: /sponsored/11874202/Header-for-article-bottom-containers-Dove-Rugby.html
Disallow: /news/0/general-election-2017-latest-results-exit-polls-reaction/
Disallow: /events/jeffrey-archer/order-confirmation/
Disallow: /events/jodi-and-graham/order-confirmation/
Disallow: /events/champions-tennis-vip-experience/order-confirmation/
Disallow: /events/an-evening-with-sir-ranulph-fiennes/order-confirmation/
Disallow: /events/the-telegraph-dining-experience/order-confirmation/
Disallow: /travel/*/articles/page-*
Disallow: /*_jcr_content*

Sitemap: http://www.telegraph.co.uk/cars/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/news/news_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/section/section_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/video/video_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/web/web_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/film/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/fashion/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/beauty/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/books/sitemap.xml
Sitemap: http://www.telegraph.co.uk/music/sitemap.xml
Sitemap: http://www.telegraph.co.uk/theatre/sitemap.xml
Sitemap: http://www.telegraph.co.uk/dance/sitemap.xml
Sitemap: http://www.telegraph.co.uk/opera/sitemap.xml
Sitemap: http://www.telegraph.co.uk/photography/sitemap.xml
Sitemap: http://www.telegraph.co.uk/art/sitemap.xml
Sitemap: http://www.telegraph.co.uk/radio/sitemap.xml
Sitemap: http://www.telegraph.co.uk/comedy/sitemap.xml
Sitemap: http://www.telegraph.co.uk/food-and-drink/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gardening/sitemap.xml
Sitemap: http://www.telegraph.co.uk/interiors/sitemap.xml
Sitemap: http://www.telegraph.co.uk/pets/sitemap.xml
Sitemap: http://www.telegraph.co.uk/wellbeing/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gaming/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemap-news.xml
Sitemap: http://www.telegraph.co.uk/sitemap.xml
Sitemap: http://www.telegraph.co.uk/news/sitemap.xml
Sitemap: http://www.telegraph.co.uk/travel/sitemap.xml
Sitemap: http://www.telegraph.co.uk/financial-services/sitemap.xml

User-Agent: endeca
Disallow: /archive/

Disallow: /search/*

Article lists

Not listed under Disallow, so access is allowed.

PREMIUM http://www.telegraph.co.uk/premium/
NEWS http://www.telegraph.co.uk/news/
SPORT http://www.telegraph.co.uk/sport/
BUSINESS http://www.telegraph.co.uk/business/
MONEY http://www.telegraph.co.uk/money/
OPINION http://www.telegraph.co.uk/opinion/
TECH & SCIENCE http://www.telegraph.co.uk/science-technology/
CULTURE http://www.telegraph.co.uk/culture/
TRAVEL http://www.telegraph.co.uk/travel/

Articles

Care is needed with TRAVEL.

Disallow: /travel/*/articles/page-*

Some TRAVEL articles do not match the pattern above, for example:

http://www.telegraph.co.uk/travel/activity-and-adventure/base-jumper-miles-daisher-on-the-world-s-most-dangerous-sport/

Everything else is accessible.
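
Whether a given TRAVEL URL falls under that wildcard rule can be checked mechanically. As far as I can tell, Python's `urllib.robotparser` compares rule paths as plain prefixes and does not expand `*`, so the sketch below converts the pattern to a regular expression instead, assuming the common convention that `*` matches any run of characters and a trailing `$` anchors the end of the path. The `PAGED` path is a hypothetical example, not a URL taken from the site.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning of the path.
    """
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(rule_path: str, url_path: str) -> bool:
    """Return True if url_path is covered by the rule pattern."""
    return rule_to_regex(rule_path).match(url_path) is not None


RULE = "/travel/*/articles/page-*"
ARTICLE = ("/travel/activity-and-adventure/"
           "base-jumper-miles-daisher-on-the-world-s-most-dangerous-sport/")
PAGED = "/travel/destinations/articles/page-2/"  # hypothetical paginated list

print(matches(RULE, ARTICLE))  # False: the article is not covered by the rule
print(matches(RULE, PAGED))    # True: paginated article lists are covered
```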

The Times

The Times
robots.txt

User-agent:*
Sitemap: https://www.thetimes.co.uk/sitemaps/sitemap.xml
Disallow: /tto/papers/
Disallow: /tto/papers.do
Disallow: /tto/feeds/
Disallow: /tto/adtest/
Disallow: /tto/public/sitesearch.do
Disallow: /tto/public/needtoknow/
Disallow: /article/back-in-time-l9ksbst58
Disallow: /article/editors-top-10-c2xhkbmml
Disallow: /article/3-of-the-best-3qs2jnx8f
Disallow: /article/review-william-hunter-to-damien-hirst-the-dead-teach-the-living-tdfxxsxsn
Disallow: /article/whats-on-music-rpmq3kt6n
Disallow: /article/whats-on-theatre-8jjz928wx
Disallow: /article/days-out-c9gh3pmpw
Disallow: /article/whats-on-comedy-g59s9tb3c
Disallow: /article/visual-art-nhfgfd8cf
Disallow: /article/books-lwp8l0lvf
Disallow: /article/grubs-up-ztlgc03j7
Disallow: /article/baby-blues-j2vmg5qjp
Disallow: /article/editors-top-10-d07x6rgdp
Disallow: /article/3-of-the-best-music-festivals-3rrqnqvpw
Disallow: /article/box-of-tricks-wsrdw98c8
Disallow: /article/music-v8gg0rc7d
Disallow: /article/whats-on-theatre-77hbgd32s
Disallow: /article/whats-on-days-out-92mr9hk3c
Disallow: /article/whats-on-comedy-8z3ks9h9c
Disallow: /article/whats-on-visual-art-mswsvtzfc
Disallow: /article/whats-on-books-pc808tkc5
Disallow: /article/theatre-jackie-the-musical-0t6dknhqg
Disallow: /article/3-of-the-best-foodie-events-76fbzj9wt
Disallow: /article/editors-top-10-h77tfl322
Disallow: /article/best-of-the-fest-2w078t90f
Disallow: /article/review-simon-starling-at-twilight-b9jsk66lt
Disallow: /article/whats-on-music-k9mgl9sx9
Disallow: /article/whats-on-comedy-8dns3nm3d
Disallow: /article/whats-on-theatre-5drb0vdxd
Disallow: /article/whats-on-visual-art-nphswswgd
Disallow: /article/whats-on-days-out-v9v6vjfrn
Disallow: /article/bucket-list-nb76vzqw9
Disallow: /article/whats-on-books-hs260lxxp
Disallow: /article/a-timely-revival-wmgstzjhb
Disallow: /article/editors-top-10-7x30l5tgg
Disallow: /article/great-expectations-vcpjz5mtt
Disallow: /article/review-richard-obriens-rocky-horror-show-2xlqxvssm
Disallow: /article/3-of-the-best-events-for-book-lovers-pmfh7cg6j
Disallow: /article/whats-on-music-gkn6vs0qr
Disallow: /article/whats-on-comedy-wghc98t9g
Disallow: /article/whats-on-theatre-k08kcl95h
Disallow: /article/whats-on-visual-art-pgsvn8j6r
Disallow: /article/whats-on-days-out-prj7ftkwm
Disallow: /article/whats-on-books-0xcq7jc9f
Disallow: /article/virgin-money-fireworks-concert-zjsbfnx3z
Disallow: /article/california-dreamin-rng6hjksw
Disallow: /article/editors-top-10-flqmnlx6n
Disallow: /article/3-of-the-best-fringe-kids-show-mpklk6bb7
Disallow: /article/review-botanic-gardens-i-still-believe-in-miracles-8q0fbkbb0
Disallow: /article/whats-on-music-r95jjbj89
Disallow: /article/what-s-on-theatre-vqwdc6n3n
Disallow: /article/what-s-on-days-out-hj7d32hkj
Disallow: /article/whats-on-comedy-zxjg78pnb
Disallow: /article/whats-on-visual-art-pgf9hpn0w
Disallow: /article/whats-on-books-686d5mnxl
Disallow: /article/story-time-67gr6rwj2

#Agent Specific Disallowed Sections

User-agent: NewsNow
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: WebVac
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: psbot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: Meltwater

Article lists

There do not appear to be per-section listing pages.
Perhaps the intended route is to follow the links from sitemap.xml.
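
If the sitemap is indeed the way in, the URLs can be pulled out with the standard library. This is a sketch under assumptions: the sample XML is invented (the `example.com` entries are placeholders), and the real sitemap may be an index that points at further sitemap files. The namespace is the standard sitemaps.org one.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; a real file would be downloaded from the Sitemap: URL.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/article/first-story</loc></url>
  <url><loc>https://www.example.com/article/second-story</loc></url>
</urlset>
"""


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> value, for both <urlset> and <sitemapindex> files."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in extract_locs(SAMPLE):
    print(url)
```

Because `<loc>` appears in both sitemap indexes and URL sets, the same function can be applied again to each file an index points to.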

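Articles

Everything except a number of specific articles under /article is accessible.
However, only part of each article can be read without registering.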

Uber loses appeal against employment rights ruling
https://www.thetimes.co.uk/edition/news/uber-loses-appeal-against-employment-rights-ruling-qhbkzvbl9

The Independent

The Independent
robots.txt

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

User-agent: *
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/

# Ignore refresh URLs
Disallow: /*ILC-refresh

User-agent: Nutch
Disallow: /

Sitemap: http://www.independent.co.uk/googlenewssitemap
Sitemap: http://www.independent.co.uk/sitemap.xml
Disallow: /pugpig/
Sitemap: http://www.independent.co.uk/sitemap.xml

Article lists

Not listed under Disallow, so there is no problem.

http://www.independent.co.uk/news/uk/politics
http://www.independent.co.uk/topic/brexit
http://www.independent.co.uk/news/uk
http://www.independent.co.uk/infact
http://www.independent.co.uk/voices
http://www.independent.co.uk/life-style
http://www.independent.co.uk/sport
http://www.independent.co.uk/news/business
http://www.independent.co.uk/arts-entertainment

Articles

These look fine as well.

http://www.independent.co.uk/travel/news-and-advice/smartphone-apps-tripadvisor-changing-future-travel-digital-holiday-bookings-etickets-social-media-a8043246.html

The Guardian

The Guardian
robots.txt

# this is the robots.txt file for theguardian.com

User-agent: *
Disallow: /sendarticle/
Disallow: /Users/
Disallow: /users/
Disallow: /*/print$
Disallow: /email/
Disallow: /contactus/
Disallow: /share/
Disallow: /websearch
Disallow: /*?commentpage=
Disallow: /whsmiths/
Disallow: /external/overture/
Disallow: /discussion/report-abuse/*
Disallow: /discussion/report-abuse-ajax/*
Disallow: /discussion/comment-permalink/*
Disallow: /discussion/report-abuse/*
Disallow: /discussion/user-report-abuse/*
Disallow: /discussion/handlers/*
Disallow: /discussion/your-profile
Disallow: /discussion/your-comments
Disallow: /discussion/edit-profile
Disallow: /discussion/search/comments
Disallow: /discussion/*
Disallow: /search
Disallow: /music/artist/*
Disallow: /music/album/*
Disallow: /books/data/*
Disallow: /settings/
Disallow: /embed/
Disallow: /*styles/js-on.css$
Disallow: /sport/olympics/2008/events/*
Disallow: /sport/olympics/2008/medals/*
Disallow: /f/healthcheck
Disallow: /sections
Disallow: /top-stories
Disallow: /most-read/sport
Disallow: /articles
Disallow: /podcasts
Disallow: /global$
Disallow: /*/feedarticle/*
Disallow: /travel/2013/aug/22/been-there-readers-competition?*
Disallow: /preference/*
Disallow: /59666047/
Disallow: /print/
Disallow: /info/tech-feedback
Disallow: /production-monitoring/

User-agent: Mediapartners-Google
Disallow:

Sitemap: http://www.theguardian.com/sitemaps/news.xml
Sitemap: http://www.theguardian.com/sitemaps/video.xml

User-agent: bingbot
Crawl-delay: 1

Article lists

Not listed under Disallow, so access is allowed.

UK https://www.theguardian.com/uk-news
world https://www.theguardian.com/world
sport https://www.theguardian.com/uk/sport
football https://www.theguardian.com/football
opinion https://www.theguardian.com/uk/commentisfree
culture https://www.theguardian.com/uk/culture
business https://www.theguardian.com/uk/business
lifestyle https://www.theguardian.com/uk/lifeandstyle
fashion https://www.theguardian.com/fashion
environment https://www.theguardian.com/uk/environment
tech https://www.theguardian.com/uk/technology
travel https://www.theguardian.com/uk/travel

Articles

Articles generally seem to sit under the list paths, so they are accessible as well.

iPhone X: most expensive Apple phone is also easiest to break

https://www.theguardian.com/technology/2017/nov/07/iphone-x-most-expensive-apple-phone-fragile-drop-screen-display
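
The Guardian's file also has two agent-specific groups: an empty `Disallow:` for Mediapartners-Google, which by convention means nothing is disallowed, and a `Crawl-delay` for bingbot. The sketch below shows how Python's standard `urllib.robotparser` reads a shortened excerpt of those rules. The agent name `my-personal-bot` is made up, and other parsers may treat `Crawl-delay` differently, since it is not part of the core standard.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Guardian's rules quoted above (not the full file).
EXCERPT = """\
User-agent: *
Disallow: /search

User-agent: Mediapartners-Google
Disallow:

User-agent: bingbot
Crawl-delay: 1
"""

guardian = RobotFileParser()
guardian.parse(EXCERPT.splitlines())

SEARCH_URL = "https://www.theguardian.com/search"

# An empty Disallow value allows everything for that agent.
print(guardian.can_fetch("Mediapartners-Google", SEARCH_URL))  # True
# A hypothetical personal bot falls back to the "*" group.
print(guardian.can_fetch("my-personal-bot", SEARCH_URL))       # False
# Crawl-delay is only declared for bingbot.
print(guardian.crawl_delay("bingbot"))                         # 1
print(guardian.crawl_delay("my-personal-bot"))                 # None
```

Even though no delay is requested from other agents, pausing between requests is a sensible default for a personal crawler.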

The Financial Times

The Financial Times
robots.txt

# all use of FT content is subject to the Terms & Conditions and Copyright Policy set out on FT.com
Sitemap: https://www.ft.com/sitemaps/index.xml

User-agent: *
Disallow: /__
Disallow: /search
Disallow: /advanced-search
Disallow: /offline/
Disallow: /myft/
Allow: /myft/list/
Allow: /__origami/
Allow: /__assets/

Article lists

Not listed under Disallow, so access is allowed.

Global economy https://www.ft.com/global-economy
Politics https://www.ft.com/world/us/politics
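
The FT file mixes `Disallow` and `Allow` rules on overlapping paths (`/myft/` versus `/myft/list/`, `/__` versus `/__origami/`). Under the longest-match precedence described in RFC 9309, the most specific matching rule wins and `Allow` wins a tie. The sketch below implements only that precedence on plain prefixes, with no wildcard support; the `/myft/following` path is a hypothetical example. Python's standard `urllib.robotparser` appears to apply rules in file order instead, so it may give a different answer for `/myft/list/`.

```python
# (kind, path prefix) pairs copied from the FT robots.txt quoted above.
FT_RULES = [
    ("disallow", "/__"),
    ("disallow", "/search"),
    ("disallow", "/advanced-search"),
    ("disallow", "/offline/"),
    ("disallow", "/myft/"),
    ("allow", "/myft/list/"),
    ("allow", "/__origami/"),
    ("allow", "/__assets/"),
]


def is_allowed(path: str, rules=FT_RULES) -> bool:
    """Longest-match precedence: the longest matching prefix decides.

    On a tie between Allow and Disallow, Allow wins.
    A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


print(is_allowed("/global-economy"))   # True: no rule matches
print(is_allowed("/myft/following"))   # False: hypothetical path under /myft/
print(is_allowed("/myft/list/abc"))    # True: the longer Allow rule wins
print(is_allowed("/content/824929b4-c471-11e7-a1d2-6786f39ef675"))  # True
```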

Articles

Accessible, but a subscription is required to view the content.

https://www.ft.com/content/824929b4-c471-11e7-a1d2-6786f39ef675