

robots.txt Survey: British Newspapers, Quality Papers Edition

I surveyed the robots.txt files of British newspapers. This is the third installment in my series of robots.txt surveys.

According to Wikipedia's list of British newspapers, the following five are the UK's quality papers.

  • The Daily Telegraph / The Sunday Telegraph
  • The Times / The Sunday Times
  • The Independent
  • The Guardian / The Observer
  • The Financial Times

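The results are shown in the table below. In general, none of the robots.txt files prohibit automated access by an individual's program. Note, however, that some of them ban specific crawlers, such as Nutch, by name.

* The content of this article is not guaranteed. If you actually retrieve data from these sites, please check the rules yourself.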

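| Newspaper | Restrictions by User-agent | Article lists | Article body |
| --- | --- | --- | --- |
| The Daily Telegraph | Only endeca has its own group; all other agents share one rule set | OK | OK |
| The Times | Specific crawlers are banned | Not found | OK |
| The Independent | Nutch is banned | OK | OK |
| The Guardian | - | OK | OK |
| The Financial Times | - | OK | OK |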
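
As a rough illustration of the "banned by name" case, the sketch below uses Python's standard `urllib.robotparser` to check two user agents against a short excerpt of The Independent's rules quoted later in this article. The excerpt is not the full file, and `my-crawler` is a hypothetical user-agent name chosen only for the example.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of The Independent's rules quoted in this article (not the full file).
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /search/

User-agent: Nutch
Disallow: /
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_EXCERPT)
url = "http://www.independent.co.uk/news/uk"

# "my-crawler" is a hypothetical user-agent name used only for illustration.
print(parser.can_fetch("my-crawler", url))  # falls under the "*" group
print(parser.can_fetch("Nutch", url))       # falls under the Nutch group
```

Parsing from a string keeps the check offline. Against a live site, the rules should be re-fetched first, since they may have changed since this survey.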

The Daily Telegraph

robots.txt

# Robots.txt file
# All robots will spider the domain

User-agent: *

Disallow: */application/*
Disallow: */ixale/
Disallow: /core/Content/
Disallow: /*?source=rss
Disallow: /*?mobile=true
Disallow: /*?mobile=basic
Disallow: /*?ModPagespeed=noscript
Disallow: /promotions/emails/
Disallow: /r/
Disallow: /search/*
Disallow: /searchbeta/*
Disallow: /sponsored/travel/msc-cruises/
Disallow: /travel/8711559/The-Telegraph-Travel-Awards-2011.html
Disallow: /travel/hotel/e/*
Disallow: /sponsored/staging/
Disallow: /sponsored/email/
Disallow: /sponsored/business/northern-ireland-business/
Disallow: /sponsored/business/lloyds-tsb-enterprise-awards/
Disallow: /sponsored/earth/statoil/
Disallow: /sponsored/motoring/alfa-romeo-cars/
Disallow: /sponsored/motoring/vw-up/
Disallow: /sponsored/property/all-saints-eastbourne/
Disallow: /sponsored/supplement-portfolio/
Disallow: /sponsored/travel/cunard-cruises/
Disallow: /sponsored/travel/cruise-holidays/
Disallow: /sponsored/travel/macau/macaumap/
Disallow: /sponsored/travel/telegraph-cottages/
Disallow: /sponsored/finance/spread-betting/
Disallow: /sponsored/finance/retirement-annuity/
Disallow: /sponsored/travel/hidden-britain/
Disallow: /sponsored/business/sme-business-essentials/
Disallow: /sponsored/in-the-know/london-cultural-attractions
Disallow: /sponsored/in-the-know/london-dining
Disallow: /sponsored/in-the-know/london-entertainment
Disallow: /sponsored/in-the-know/london-lifestyle
Disallow: /sponsored/in-the-know/london-nightlife
Disallow: /sponsored/in-the-know/london-shopping
Disallow: /sponsored/in-the-know/london-sport-activities
Disallow: /sponsored/in-the-know/london-transport-accommodation
Disallow: /sponsored/in-the-know/london-video-guides
Disallow: /sponsored/motoring/suzuki-motorbikes/
Disallow: /sponsored/technology/cool-list/
Disallow: /sponsored/why-not/11633071/Voltarol-Tool.html
Disallow: /sponsored/why-not/11632940/voltarol-Poll.html
Disallow: /sponsored/why-not/11633725/Voltarol-tool-Mobile.html
Disallow: /sponsored/11874202/Header-for-article-bottom-containers-Dove-Rugby.html
Disallow: /news/0/general-election-2017-latest-results-exit-polls-reaction/
Disallow: /events/jeffrey-archer/order-confirmation/
Disallow: /events/jodi-and-graham/order-confirmation/
Disallow: /events/champions-tennis-vip-experience/order-confirmation/
Disallow: /events/an-evening-with-sir-ranulph-fiennes/order-confirmation/
Disallow: /events/the-telegraph-dining-experience/order-confirmation/
Disallow: /travel/*/articles/page-*
Disallow: /*_jcr_content*

Sitemap: http://www.telegraph.co.uk/cars/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/news/news_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/section/section_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/video/video_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/sitemaps/web/web_sitemap_index.xml
Sitemap: http://www.telegraph.co.uk/film/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/fashion/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/beauty/sitemap.xml 
Sitemap: http://www.telegraph.co.uk/books/sitemap.xml
Sitemap: http://www.telegraph.co.uk/music/sitemap.xml
Sitemap: http://www.telegraph.co.uk/theatre/sitemap.xml
Sitemap: http://www.telegraph.co.uk/dance/sitemap.xml
Sitemap: http://www.telegraph.co.uk/opera/sitemap.xml
Sitemap: http://www.telegraph.co.uk/photography/sitemap.xml
Sitemap: http://www.telegraph.co.uk/art/sitemap.xml
Sitemap: http://www.telegraph.co.uk/radio/sitemap.xml
Sitemap: http://www.telegraph.co.uk/comedy/sitemap.xml
Sitemap: http://www.telegraph.co.uk/food-and-drink/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gardening/sitemap.xml
Sitemap: http://www.telegraph.co.uk/interiors/sitemap.xml
Sitemap: http://www.telegraph.co.uk/pets/sitemap.xml
Sitemap: http://www.telegraph.co.uk/wellbeing/sitemap.xml
Sitemap: http://www.telegraph.co.uk/gaming/sitemap.xml
Sitemap: http://www.telegraph.co.uk/sitemap-news.xml
Sitemap: http://www.telegraph.co.uk/sitemap.xml
Sitemap: http://www.telegraph.co.uk/news/sitemap.xml
Sitemap: http://www.telegraph.co.uk/travel/sitemap.xml
Sitemap: http://www.telegraph.co.uk/financial-services/sitemap.xml

User-Agent: endeca
Disallow: /archive/

Disallow: /search/*

Article lists

None of these paths are listed under Disallow, so access is allowed.

PREMIUM http://www.telegraph.co.uk/premium/
NEWS http://www.telegraph.co.uk/news/
SPORT http://www.telegraph.co.uk/sport/
BUSINESS http://www.telegraph.co.uk/business/
MONEY http://www.telegraph.co.uk/money/
OPINION http://www.telegraph.co.uk/opinion/
TECH & SCIENCE http://www.telegraph.co.uk/science-technology/
CULTURE http://www.telegraph.co.uk/culture/
TRAVEL http://www.telegraph.co.uk/travel/

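Articles

The TRAVEL section needs some care because of this rule:

Disallow: /travel/*/articles/page-*

Some TRAVEL articles do not match this pattern and therefore remain accessible.

Everything else is accessible.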
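
To see which TRAVEL paths the wildcard rule covers, here is a minimal sketch of wildcard matching. It assumes the common interpretation in which `*` matches any run of characters and a trailing `$` anchors the end of the URL. As far as I know, the standard library's `urllib.robotparser` only does prefix matching, so the helpers `rule_to_regex` and `rule_matches` are written here by hand, and the two sample paths are hypothetical.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    All other characters are matched literally from the start of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule: str, path: str) -> bool:
    """Return True if the rule applies to the given URL path."""
    return rule_to_regex(rule).match(path) is not None


TRAVEL_RULE = "/travel/*/articles/page-*"

# Both paths are hypothetical examples, not URLs verified on the site.
print(rule_matches(TRAVEL_RULE, "/travel/destinations/articles/page-2"))
print(rule_matches(TRAVEL_RULE, "/travel/destinations/europe/some-article/"))
```

The first path matches the rule and would be disallowed; the second does not contain `/articles/page-` and is unaffected by this particular rule.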

The Times

robots.txt

User-agent:*
Sitemap: https://www.thetimes.co.uk/sitemaps/sitemap.xml
Disallow: /tto/papers/
Disallow: /tto/papers.do
Disallow: /tto/feeds/
Disallow: /tto/adtest/
Disallow: /tto/public/sitesearch.do
Disallow: /tto/public/needtoknow/
Disallow: /article/back-in-time-l9ksbst58
Disallow: /article/editors-top-10-c2xhkbmml
Disallow: /article/3-of-the-best-3qs2jnx8f
Disallow: /article/review-william-hunter-to-damien-hirst-the-dead-teach-the-living-tdfxxsxsn
Disallow: /article/whats-on-music-rpmq3kt6n
Disallow: /article/whats-on-theatre-8jjz928wx
Disallow: /article/days-out-c9gh3pmpw
Disallow: /article/whats-on-comedy-g59s9tb3c
Disallow: /article/visual-art-nhfgfd8cf
Disallow: /article/books-lwp8l0lvf
Disallow: /article/grubs-up-ztlgc03j7
Disallow: /article/baby-blues-j2vmg5qjp
Disallow: /article/editors-top-10-d07x6rgdp
Disallow: /article/3-of-the-best-music-festivals-3rrqnqvpw
Disallow: /article/box-of-tricks-wsrdw98c8
Disallow: /article/music-v8gg0rc7d
Disallow: /article/whats-on-theatre-77hbgd32s
Disallow: /article/whats-on-days-out-92mr9hk3c
Disallow: /article/whats-on-comedy-8z3ks9h9c
Disallow: /article/whats-on-visual-art-mswsvtzfc
Disallow: /article/whats-on-books-pc808tkc5
Disallow: /article/theatre-jackie-the-musical-0t6dknhqg
Disallow: /article/3-of-the-best-foodie-events-76fbzj9wt
Disallow: /article/editors-top-10-h77tfl322
Disallow: /article/best-of-the-fest-2w078t90f
Disallow: /article/review-simon-starling-at-twilight-b9jsk66lt
Disallow: /article/whats-on-music-k9mgl9sx9
Disallow: /article/whats-on-comedy-8dns3nm3d
Disallow: /article/whats-on-theatre-5drb0vdxd
Disallow: /article/whats-on-visual-art-nphswswgd
Disallow: /article/whats-on-days-out-v9v6vjfrn
Disallow: /article/bucket-list-nb76vzqw9
Disallow: /article/whats-on-books-hs260lxxp
Disallow: /article/a-timely-revival-wmgstzjhb
Disallow: /article/editors-top-10-7x30l5tgg
Disallow: /article/great-expectations-vcpjz5mtt
Disallow: /article/review-richard-obriens-rocky-horror-show-2xlqxvssm
Disallow: /article/3-of-the-best-events-for-book-lovers-pmfh7cg6j
Disallow: /article/whats-on-music-gkn6vs0qr
Disallow: /article/whats-on-comedy-wghc98t9g
Disallow: /article/whats-on-theatre-k08kcl95h
Disallow: /article/whats-on-visual-art-pgsvn8j6r
Disallow: /article/whats-on-days-out-prj7ftkwm
Disallow: /article/whats-on-books-0xcq7jc9f
Disallow: /article/virgin-money-fireworks-concert-zjsbfnx3z
Disallow: /article/california-dreamin-rng6hjksw
Disallow: /article/editors-top-10-flqmnlx6n
Disallow: /article/3-of-the-best-fringe-kids-show-mpklk6bb7
Disallow: /article/review-botanic-gardens-i-still-believe-in-miracles-8q0fbkbb0
Disallow: /article/whats-on-music-r95jjbj89
Disallow: /article/what-s-on-theatre-vqwdc6n3n
Disallow: /article/what-s-on-days-out-hj7d32hkj
Disallow: /article/whats-on-comedy-zxjg78pnb
Disallow: /article/whats-on-visual-art-pgf9hpn0w
Disallow: /article/whats-on-books-686d5mnxl
Disallow: /article/story-time-67gr6rwj2

#Agent Specific Disallowed Sections

User-agent: NewsNow
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: WebVac
Disallow: /
User-agent: WebZip
Disallow: /
User-agent: psbot
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: Meltwater

Article lists

There appear to be no per-section list pages.
Perhaps articles are meant to be discovered by following sitemap.xml.
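
If article URLs do have to be discovered through the sitemap, the traversal could look roughly like the sketch below. It parses a sitemap document with the standard `xml.etree.ElementTree` module and collects every `<loc>` entry. The sample XML and its `example.com` URLs are hypothetical; the actual structure of the Times sitemap was not inspected for this article.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

# Hypothetical sitemap index; the real file may be structured differently.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemaps/news-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemaps/news-2.xml</loc></sitemap>
</sitemapindex>
"""


def extract_locs(xml_text: str) -> list:
    """Return every <loc> URL found in a sitemap or sitemap index."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [loc.text.strip() for loc in root.iter(LOC_TAG) if loc.text]


for loc in extract_locs(SAMPLE_INDEX):
    print(loc)
```

The same function works for both a sitemap index (whose entries point to further sitemaps) and a leaf sitemap (whose entries are page URLs), so a crawler can apply it repeatedly until it reaches article URLs.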

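Articles

Everything except the specific articles listed under /article is accessible.
However, only part of each article can be read without registering.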

Uber loses appeal against employment rights ruling
https://www.thetimes.co.uk/edition/news/uber-loses-appeal-against-employment-rights-ruling-qhbkzvbl9

The Independent

robots.txt

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

User-agent: *
# Files
Disallow: /CHANGELOG.txt
Disallow: /cron.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /INSTALL.sqlite.txt
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /LICENSE.txt
Disallow: /MAINTAINERS.txt
Disallow: /update.php
Disallow: /UPGRADE.txt
Disallow: /xmlrpc.php
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=filter/tips/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
Disallow: /?q=user/logout/

# Ignore refresh URLs
Disallow: /*ILC-refresh

User-agent: Nutch
Disallow: /

Sitemap: http://www.independent.co.uk/googlenewssitemap
Sitemap: http://www.independent.co.uk/sitemap.xml
Disallow: /pugpig/
Sitemap: http://www.independent.co.uk/sitemap.xml

Article lists

These paths are not listed under Disallow, so there is no problem.

http://www.independent.co.uk/news/uk/politics
http://www.independent.co.uk/topic/brexit
http://www.independent.co.uk/news/uk
http://www.independent.co.uk/infact
http://www.independent.co.uk/voices
http://www.independent.co.uk/life-style
http://www.independent.co.uk/sport
http://www.independent.co.uk/news/business
http://www.independent.co.uk/arts-entertainment

Articles

These also appear to be fine.

The Guardian

robots.txt

# this is the robots.txt file for theguardian.com

User-agent: *
Disallow: /sendarticle/
Disallow: /Users/
Disallow: /users/
Disallow: /*/print$
Disallow: /email/
Disallow: /contactus/
Disallow: /share/
Disallow: /websearch
Disallow: /*?commentpage=
Disallow: /whsmiths/
Disallow: /external/overture/
Disallow: /discussion/report-abuse/*
Disallow: /discussion/report-abuse-ajax/*
Disallow: /discussion/comment-permalink/*
Disallow: /discussion/report-abuse/*
Disallow: /discussion/user-report-abuse/*
Disallow: /discussion/handlers/*
Disallow: /discussion/your-profile
Disallow: /discussion/your-comments
Disallow: /discussion/edit-profile
Disallow: /discussion/search/comments
Disallow: /discussion/*
Disallow: /search
Disallow: /music/artist/*
Disallow: /music/album/*
Disallow: /books/data/*
Disallow: /settings/
Disallow: /embed/
Disallow: /*styles/js-on.css$
Disallow: /sport/olympics/2008/events/*
Disallow: /sport/olympics/2008/medals/*
Disallow: /f/healthcheck
Disallow: /sections
Disallow: /top-stories
Disallow: /most-read/sport
Disallow: /articles
Disallow: /podcasts
Disallow: /global$
Disallow: /*/feedarticle/*
Disallow: /travel/2013/aug/22/been-there-readers-competition?*
Disallow: /preference/*
Disallow: /59666047/
Disallow: /print/
Disallow: /info/tech-feedback
Disallow: /production-monitoring/

User-agent: Mediapartners-Google
Disallow:

Sitemap: http://www.theguardian.com/sitemaps/news.xml
Sitemap: http://www.theguardian.com/sitemaps/video.xml

User-agent: bingbot
Crawl-delay: 1

Article lists

These paths are not listed under Disallow, so access is allowed.

UK https://www.theguardian.com/uk-news
world https://www.theguardian.com/world
sport https://www.theguardian.com/uk/sport
football https://www.theguardian.com/football
opinion https://www.theguardian.com/uk/commentisfree
culture https://www.theguardian.com/uk/culture
business https://www.theguardian.com/uk/business
lifestyle https://www.theguardian.com/uk/lifeandstyle
fashion https://www.theguardian.com/fashion
environment https://www.theguardian.com/uk/environment
tech https://www.theguardian.com/uk/technology
travel https://www.theguardian.com/uk/travel

Articles

Articles generally appear to live under the list paths above, so they are accessible as well.

iPhone X: most expensive Apple phone is also easiest to break

The Financial Times

robots.txt

# all use of FT content is subject to the Terms & Conditions and Copyright Policy set out on FT.com
Sitemap: https://www.ft.com/sitemaps/index.xml

User-agent: *
Disallow: /__
Disallow: /search
Disallow: /advanced-search
Disallow: /offline/
Disallow: /myft/
Allow: /myft/list/
Allow: /__origami/
Allow: /__assets/

Article lists

These paths are not listed under Disallow, so access is allowed.

Global economy https://www.ft.com/global-economy
Politics https://www.ft.com/world/us/politics
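
The FT file mixes Disallow and Allow lines, so a plain "is it listed under Disallow?" check is not quite enough. The sketch below evaluates the rules quoted above under one assumption: the longest matching rule wins, with Allow winning ties, as described in RFC 9309. Some parsers instead apply the first matching rule in file order, which can give a different answer. The function `is_allowed` and the `/myft/...` sample paths are hypothetical.

```python
# Rules copied from the "User-agent: *" group of the FT robots.txt quoted above.
FT_RULES = [
    ("disallow", "/__"),
    ("disallow", "/search"),
    ("disallow", "/advanced-search"),
    ("disallow", "/offline/"),
    ("disallow", "/myft/"),
    ("allow", "/myft/list/"),
    ("allow", "/__origami/"),
    ("allow", "/__assets/"),
]


def is_allowed(path, rules=FT_RULES):
    """Longest-match evaluation: the longest matching prefix wins; Allow wins ties.

    A path that matches no rule at all is allowed.
    """
    best_kind, best_len = "allow", -1
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_kind, best_len = kind, len(prefix)
    return best_kind == "allow"


# The /myft/... paths are hypothetical; /global-economy is from the list above.
for path in ["/global-economy", "/myft/list/example", "/myft/example", "/search"]:
    print(path, is_allowed(path))
```

Under this assumption, `/myft/list/...` stays accessible even though `/myft/` is disallowed, because the Allow rule is the more specific match.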

Articles

Accessible, but a subscription is required to view the content.

https://www.theguardian.com/world
https://www.ft.com/content/824929b4-c471-11e7-a1d2-6786f39ef675
