robots.txt

robots.txt survey: US general-interest newspapers


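Following up on the earlier survey of Japanese general-interest newspapers (robots.txtの調査 国内の新聞・一般紙編), I looked at the robots.txt files of US newspaper sites. The targets are four papers: USA Today, New York Times, The Washington Post, and Los Angeles Times. I picked general-interest papers with large circulations.

The results are summarized in the table below. Broadly, none of the four disallows article lists or article pages for the default (*) user agent, so robots.txt does not rule out automated access by an individual's program. The agent-specific rules appear to be aimed at controlling crawlers run by commercial services, such as Googlebot.

Note: the contents of this article are not guaranteed. If you actually plan to fetch data from these sites, please check the rules yourself.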

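| Newspaper | User-agent-specific rules | Article lists | Article pages |
|---|---|---|---|
| USA Today | Only Googlebot-News has its own group; all other agents share one group. | OK | OK |
| New York Times | None | OK | OK |
| The Washington Post | Twitterbot, Googlebot-News, and Googlebot have their own groups; all other agents share one group. | OK | OK |
| Los Angeles Times | Crawlers for commercial web services (Googlebot-News and others) and Nutch have their own groups; all other agents share one group. | OK | OK |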

USA Today

USA Today
robots.txt

# robots.txt for https://www.usatoday.com/

User-agent: Googlebot-News
Disallow: /story/sponsor-story/
Disallow: /picture-gallery/sponsor-story/
Disallow: /videos/sponsor-story/
Disallow: /longform/sponsor-story/
Disallow: /pages/interactives/sponsor-story/
Disallow: /videos/embed/

User-Agent: *
Disallow: /errors
Disallow: /interactive/
Disallow: /userauth/
Disallow: /ugc/
Disallow: /feeds/
Disallow: /services/
Disallow: /facebook/
Disallow: /version-info/
Disallow: /longform/draft/
Disallow: /story/draft/
Disallow: /topic/*/smart/
Disallow: /search
Disallow: /module-showcase/
Disallow: /newsletter/
Disallow: /blended-newsletter/
Disallow: /story/nletter/

Disallow: /optimus

Disallow: /exp-cruise

Disallow: /exp-las-vegas2

Disallow: /exp-faw

Disallow: /exp-caribbean

Disallow: /exp-beach

Disallow: /exp-cruise2

Disallow: /story/advisory/


Disallow: /yourtake

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-daniel-kelly-amfam/6661213/

Disallow: /story/2014/03/20/ge-cfo-challenge-david-bartlett-amway/6653003/

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-art-mccarthy-neulion/6655521/

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-david-gross-major-league-lacrosse/6646987/

Disallow: /money/lookup/stocks/



Sitemap: https://www.usatoday.com/news-sitemap.xml
Sitemap: https://www.usatoday.com/web-sitemap-index.xml
Sitemap: https://www.usatoday.com/video-sitemap-index.xml

Article lists

Not listed under Disallow, so access is allowed.

Articles

Not listed under Disallow, so access is allowed.
However, /story/draft/ is disallowed. Presumably these are drafts.

Russia exploited race divisions on Facebook. More black staffers, diversity could have helped.

https://www.usatoday.com/story/tech/news/2017/11/03/facebooks-lack-diversity-blamed-failing-catch-fake-russian-accounts-spreading-racially-divisive-mess/829905001/
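
The "not listed under Disallow" check above can also be done mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`, fed with a short excerpt of the `User-Agent: *` group quoted earlier rather than the live file. The helper name `is_allowed` and the agent name `my-crawler` are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the "User-Agent: *" group quoted above (not the full file).
USATODAY_RULES = """\
User-Agent: *
Disallow: /errors
Disallow: /feeds/
Disallow: /story/draft/
Disallow: /search
"""


def is_allowed(rules: str, path: str, agent: str = "my-crawler") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Path of the example article linked above.
article = (
    "/story/tech/news/2017/11/03/facebooks-lack-diversity-blamed-failing-"
    "catch-fake-russian-accounts-spreading-racially-divisive-mess/829905001/"
)

print(is_allowed(USATODAY_RULES, article))                  # True
print(is_allowed(USATODAY_RULES, "/story/draft/example/"))  # False
```

Parsing an inline excerpt keeps the sketch offline. To check the real file, `RobotFileParser` also offers `set_url()` and `read()`, at the cost of a network request.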

New York Times

New York Times
robots.txt

User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*

Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz

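Article lists

Paths start with /section. Not listed under Disallow, so access is allowed.

Articles

Paths start with the publication date (year/month/day). Not listed under Disallow, so access is allowed.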

The iPhone X Is Cool. That Doesn’t Mean You Are Ready for It.

https://www.nytimes.com/2017/11/01/technology/personaltech/apple-iphone-x-review.html
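
The New York Times file mixes Allow and Disallow lines for overlapping paths (for example, `Allow: /svc/news/v3/all/pshb.rss` next to `Disallow: /svc`). The sketch below, again on an excerpt of the rules quoted above, shows how Python's `urllib.robotparser` resolves them. As far as I know, this parser applies the first matching rule in file order, whereas the Robots Exclusion Protocol standard (RFC 9309) prefers the longest match. For the paths checked here both approaches should agree. The agent name `my-crawler` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules quoted above (not the full file).
NYT_RULES = """\
User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /archives/
Disallow: /register
Disallow: /svc
"""

parser = RobotFileParser()
parser.parse(NYT_RULES.splitlines())

paths = [
    "/section/technology",  # an article list
    "/2017/11/01/technology/personaltech/apple-iphone-x-review.html",
    "/svc/news/v3/all/pshb.rss",  # explicitly allowed ahead of "Disallow: /svc"
    "/svc/search",  # covered by "Disallow: /svc"
]

# Map each path to whether the generic group permits it.
results = {path: parser.can_fetch("my-crawler", path) for path in paths}
for path, allowed in results.items():
    print("allowed " if allowed else "blocked ", path)
```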

The Washington Post

The Washington Post
robots.txt

User-agent: *
Disallow: /*_print.html
Disallow: /*_email.html
Disallow: /*_singlePage.html
Disallow: /*_allComments.html
Disallow: /*_jsn.json
Disallow: /*_jsonpStatic.js
Disallow: /*_nitf.xml
Disallow: /*_newsml.html
Disallow: /*_qa.html
Disallow: /*_meta.xml
Disallow: /*_jsnp.js
Disallow: /*_json.json
Disallow: /*_search.html
Disallow: /*_jsonp.js
Disallow: /*_jsnpStatic.js
Disallow: /*_rss.xml
Disallow: /*_mobile.mobile
Disallow: /*_mobile.xml
Disallow: /*_allCommentsClassicBlog.html
Disallow: /*_seo.html
Disallow: /*_nimbusJson.json
Disallow: /*_nimbusJsonp.js
Disallow: /*_nimbusJsonpStatic.js
Disallow: /*_modal.html
Disallow: /todays_paper/
Disallow: /rw/WashingtonPost/Content/Epaper/
Disallow: /ac2/
Disallow: /blogs/slow-ride/
Disallow: /local/blogsandcolumns/slow-ride-story-tanked
Disallow: /local/blogsandcolumns/slow-ride-story-achenblog
Disallow: /local/blogsandcolumns/slow-ride-stream-tanked
Disallow: /local/blogsandcolumns/slow-ride-front
Disallow: /utils
Disallow: /jobs/JS_JobSearchResult
Disallow: /jobs/UpdateJobEmployerCounterServlet
Disallow: /jobs/JS_Login
Disallow: /jobs/EU_UpdateJobEmployerCounter
Disallow: /blogs/nationals-journal-beta/
Disallow: /blogs/test/
Disallow: /posttv-beta/
Disallow: /posttv/sponsored-video/
Disallow: /posttv/c/trendex/
Disallow: /posttv/c/video_search/
Disallow: /posttv/posttv/trendex
Disallow: /rweb/
Disallow: /wp-stat/vrroom/
Disallow: /classic-apps/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /sf/test/
Disallow: /news/test-liveblog/
Disallow: /pb/
Disallow: /homepage-video-test
Disallow: /testpage-forhomepage
Disallow: /knowmore
Disallow: /test
Disallow: /brand-studio/
Disallow: /sslsingle
Disallow: /sf/brand-connect/$

User-agent: Twitterbot
Allow: /posttv-beta/

User-agent: Googlebot-News
Disallow: /sf/brand-connect/wp/
Disallow: /posttv/sponsored-video/
Disallow: /posttv/c/trendex/
Disallow: /posttv/posttv/trendex
Disallow: /conversations/the-washington-post/2017/05/18/242f4306-3be4-11e7-a058-ddbb23c75d82_story.html
Disallow: /blogs/test/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /classic-apps/

User-agent: Googlebot
Disallow: /conversations/the-washington-post/2017/05/18/242f4306-3be4-11e7-a058-ddbb23c75d82_story.html
Disallow: /blogs/test/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /classic-apps/

Sitemap: https://www.washingtonpost.com/web-sitemap-index.xml
Sitemap: https://www.washingtonpost.com/news-sitemap-index.xml
Sitemap: https://www.washingtonpost.com/video-sitemap.xml
Sitemap: https://www.washingtonpost.com/real-estate/sitemap.xml
Sitemap: https://jobs.washingtonpost.com/sitemapindex.xml
Sitemap: https://www.washingtonpost.com/wp-stat/sitemaps/index.xml

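Article lists

The path is simply the category name. Not listed under Disallow, so access is allowed.

Articles

The category name appears to come first in the path. For example, the article below is under /news/the-switch/. It is not listed under Disallow, so access is allowed. Pages appear to be accessible as long as they are not test or tablet pages (/news/test/ and /news/tablet/ are disallowed).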

With the iPhone X, Apple is asking you to break up with the home button

https://www.washingtonpost.com/news/the-switch/wp/2017/10/31/iphone-x-review-apple-is-asking-you-to-break-up-with-the-home-button/
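
Many of the Washington Post rules use wildcards (`/*_print.html`) and one uses an end anchor (`/sf/brand-connect/$`). As far as I know, Python's standard `urllib.robotparser` treats rule paths as plain prefixes and does not interpret `*` or `$`, so the sketch below translates such patterns into regular expressions by hand. The function names and the sample paths other than the linked article are hypothetical, and this is an illustration of the matching idea, not a complete robots.txt implementation.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with `*` and a trailing `$` into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with ".*" where the pattern had "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: list[str]) -> bool:
    """Return True if any pattern matches `path` from its beginning."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


# A few of the generic-group rules quoted above.
WAPO_PATTERNS = [
    "/*_print.html",
    "/*_rss.xml",
    "/news/test/",
    "/news/tablet/",
    "/sf/brand-connect/$",
]

article = (
    "/news/the-switch/wp/2017/10/31/"
    "iphone-x-review-apple-is-asking-you-to-break-up-with-the-home-button/"
)

print(is_disallowed(article, WAPO_PATTERNS))                              # False
print(is_disallowed("/news/the-switch/example_print.html", WAPO_PATTERNS))  # True
print(is_disallowed("/sf/brand-connect/", WAPO_PATTERNS))                 # True
print(is_disallowed("/sf/brand-connect/wp/example", WAPO_PATTERNS))       # False
```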

Los Angeles Times

Los Angeles Times
robots.txt

User-agent: *
Disallow: /search/
Disallow: /changebrowser
Disallow: /thirdpartyservice
Disallow: /*/thirdpartyservice
Disallow: /config/
Disallow: /*/config
Disallow: /dzcfg
Disallow: /*/dzcfg
Disallow: /*/test-template
Disallow: /*.json
Disallow: /test/
Disallow: /topic/*?
Disallow: /deeplinkid/
Disallow: /searchsuggest
Disallow: /*/searchsuggest
Disallow: /svgimageproc
Disallow: /hive/

User-agent: Googlebot-News
Disallow: /*-ugc-*.html
Disallow: /about/
Disallow: /cbp-*.html
Disallow: /*/cbp-*.html
Disallow: /bp/*
Disallow: /*-adstory.html
Disallow: /*sns-ap
Disallow: /*photogallery
Disallow: /brandpublishing/
Disallow: /*/brandpublishing/
Disallow: /shopping/
Disallow: /*/shopping/
Disallow: /paid-posts/
Disallow: /*/paid-posts/

User-agent: Twitterbot
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: discobot
Disallow: /

User-agent: CCBot
Crawl-Delay: 2

User-agent: Nutch
Crawl-Delay: 2

Sitemap: http://www.latimes.com/sitemap.xml

Article lists

Not listed under Disallow, so access is allowed.

Articles

Not listed under Disallow, so access is allowed.

With iPhone 8 and iPhone X, Apple creates another tier of luxury

http://www.latimes.com/business/technology/la-fi-tn-apple-iphone-20170912-htmlstory.html
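
The Los Angeles Times file treats several crawlers differently: discobot is shut out entirely, while CCBot and Nutch are only asked to slow down. A minimal sketch of how those per-agent groups play out, using Python's `urllib.robotparser` on an excerpt of the rules quoted above. The agent name `my-crawler` is hypothetical and stands in for a crawler that is not named in the file, so it falls back to the `*` group.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the rules quoted above (not the full file).
LATIMES_RULES = """\
User-agent: *
Disallow: /search/
Disallow: /test/

User-agent: discobot
Disallow: /

User-agent: CCBot
Crawl-Delay: 2
"""

parser = RobotFileParser()
parser.parse(LATIMES_RULES.splitlines())

# Path of the example article linked above.
article = "/business/technology/la-fi-tn-apple-iphone-20170912-htmlstory.html"

generic_ok = parser.can_fetch("my-crawler", article)  # uses the "*" group
discobot_ok = parser.can_fetch("discobot", article)   # uses its own group
ccbot_delay = parser.crawl_delay("CCBot")             # seconds between requests
generic_delay = parser.crawl_delay("my-crawler")      # "*" group sets no delay

print(generic_ok, discobot_ok, ccbot_delay, generic_delay)  # True False 2 None
```

Even where no Crawl-Delay applies, spacing out requests is a sensible default for a personal script.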

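Summary

None of the four papers surveyed prohibits access by individuals in its robots.txt. If anything, they tend to apply fine-grained control to crawlers run by commercial services, such as Googlebot.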

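References

Previous survey
robots.txtの調査 国内の新聞・一般紙編 (robots.txt survey: Japanese general-interest newspapers)

Sites consulted
Robots.txt の仕様 (Robots.txt specifications)
Webスクレイピングの注意事項一覧 (A list of points to watch out for in web scraping)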
robots.txt - Qiita