
robots.txt Survey: US General-Interest Newspapers

Posted at 2017-11-04

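Following on from "robots.txt Survey: Japanese General-Interest Newspapers" (robots.txtの調査 国内の新聞・一般紙編), I surveyed the robots.txt files of US newspaper sites. The targets are four papers: USA Today, New York Times, The Washington Post, and Los Angeles Times. I chose general-interest papers with large circulations.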

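The results are shown in the table below. In general, none of the sites prohibits an individual from accessing article lists or article bodies with an automated program. The rules instead seem aimed at controlling the behavior of crawlers run by commercial services, such as Googlebot.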

Note: No guarantee is made about the contents of this article. If you actually retrieve data from any of these sites, please check the rules yourself.

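| Newspaper | Restrictions by User-Agent | Article lists | Article bodies |
| --- | --- | --- | --- |
| USA Today | Only Googlebot-News is treated separately; all others share common rules. | OK | OK |
| New York Times | None | OK | OK |
| The Washington Post | Googlebot and Twitterbot are treated separately; all others share common rules. | OK | OK |
| Los Angeles Times | Bots for commercial web services (Googlebot and others) and Nutch are treated separately; all others share common rules. | OK | OK |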

USA Today

USA Today
robots.txt

# robots.txt for https://www.usatoday.com/

User-agent: Googlebot-News
Disallow: /story/sponsor-story/
Disallow: /picture-gallery/sponsor-story/
Disallow: /videos/sponsor-story/
Disallow: /longform/sponsor-story/
Disallow: /pages/interactives/sponsor-story/
Disallow: /videos/embed/

User-Agent: *
Disallow: /errors
Disallow: /interactive/
Disallow: /userauth/
Disallow: /ugc/
Disallow: /feeds/
Disallow: /services/
Disallow: /facebook/
Disallow: /version-info/
Disallow: /longform/draft/
Disallow: /story/draft/
Disallow: /topic/*/smart/
Disallow: /search
Disallow: /module-showcase/
Disallow: /newsletter/
Disallow: /blended-newsletter/
Disallow: /story/nletter/

Disallow: /optimus

Disallow: /exp-cruise

Disallow: /exp-las-vegas2

Disallow: /exp-faw

Disallow: /exp-caribbean

Disallow: /exp-beach

Disallow: /exp-cruise2

Disallow: /story/advisory/


Disallow: /yourtake

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-daniel-kelly-amfam/6661213/

Disallow: /story/2014/03/20/ge-cfo-challenge-david-bartlett-amway/6653003/

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-art-mccarthy-neulion/6655521/

Disallow: /story/sports/ncaab/2014/03/20/ge-cfo-challenge-david-gross-major-league-lacrosse/6646987/

Disallow: /money/lookup/stocks/



Sitemap: https://www.usatoday.com/news-sitemap.xml
Sitemap: https://www.usatoday.com/web-sitemap-index.xml
Sitemap: https://www.usatoday.com/video-sitemap-index.xml

Article lists

Not listed under Disallow, so access is allowed.

Articles

Not listed under Disallow, so access is allowed.
However, /story/draft/ is disallowed; these are presumably drafts.

Russia exploited race divisions on Facebook. More black staffers, diversity could have helped.
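
The per-agent grouping can be checked with Python's standard `urllib.robotparser`. The following is a minimal sketch, assuming a short excerpt of the rules quoted above; the user agent name `my-crawler` and the example article paths are hypothetical and not taken from the site.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the USA Today rules quoted above (no blank lines inside a group).
ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /story/sponsor-story/
Disallow: /videos/embed/

User-Agent: *
Disallow: /story/draft/
Disallow: /search
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
BASE = "https://www.usatoday.com"

# Hypothetical paths, used only for illustration.
draft_ok = parser.can_fetch("my-crawler", BASE + "/story/draft/example/")
story_ok = parser.can_fetch("my-crawler", BASE + "/story/news/2017/11/04/example/")
sponsor_generic = parser.can_fetch("my-crawler", BASE + "/story/sponsor-story/example/")
sponsor_gnews = parser.can_fetch("Googlebot-News", BASE + "/story/sponsor-story/example/")

print(draft_ok, story_ok, sponsor_generic, sponsor_gnews)
# prints: False True True False
```

One caveat: as far as I can tell, this parser treats a blank line as the end of a group, so feeding it the original file verbatim, with its blank lines between Disallow entries, may cause the later entries to be ignored.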

New York Times

New York Times
robots.txt

User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /adx/bin/
Disallow: /archives/
Disallow: /auth/
Disallow: /cnet/
Disallow: /college/
Disallow: /external/
Disallow: /financialtimes/
Disallow: /idg/
Disallow: /indexes/
Disallow: /library/
Disallow: /nytimes-partners/
Disallow: /packages/flash/multimedia/TEMPLATES/
Disallow: /pages/college/
Disallow: /paidcontent/
Disallow: /partners/
Disallow: /register
Disallow: /thestreet/
Disallow: /svc
Disallow: /video/embedded/*
Disallow: /web-services/
Disallow: /gst/travel/travsearch*

Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/sitemap.xml.gz
Sitemap: http://www.nytimes.com/sitemaps/sitemap_news/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/sitemap_video/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com_realestate/sitemap.xml.gz
Sitemap: http://spiderbites.nytimes.com/sitemaps/www.nytimes.com/2016_election_sitemap.xml.gz

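Article lists

Paths start with /section. Not listed under Disallow, so access is allowed.

Articles

Paths start with the date (year/month/day). Not listed under Disallow, so access is allowed.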

The iPhone X Is Cool. That Doesn’t Mean You Are Ready for It.
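
The New York Times file mixes Allow and Disallow lines under a single `User-agent: *` group. The sketch below, assuming an excerpt of the rules quoted above, checks a section path, a dated article path, and the Allow exceptions. The base URL, the user agent name `my-crawler`, and the concrete paths under /section and /2017 are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the New York Times rules quoted above.
ROBOTS_TXT = """\
User-agent: *
Allow: /ads/public/
Allow: /svc/news/v3/all/pshb.rss
Disallow: /ads/
Disallow: /archives/
Disallow: /svc
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

BASE = "https://www.nytimes.com"  # assumed base URL
PATHS = [
    "/section/technology",                  # hypothetical article list
    "/2017/11/01/technology/example.html",  # hypothetical dated article
    "/svc/news/v3/all/pshb.rss",            # explicitly allowed
    "/svc/other",                           # hypothetical path under /svc
    "/ads/public/example",                  # hypothetical path under the Allow rule
    "/ads/example",                         # hypothetical path under /ads/
]

results = {path: parser.can_fetch("my-crawler", BASE + path) for path in PATHS}
for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'disallowed'}")
```

`urllib.robotparser` applies the first rule that matches, whereas some crawlers pick the longest matching rule. For this excerpt both approaches should give the same answers, because the Allow lines are listed first and are also the more specific ones.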

The Washington Post

The Washington Post
robots.txt

User-agent: *
Disallow: /*_print.html
Disallow: /*_email.html
Disallow: /*_singlePage.html
Disallow: /*_allComments.html
Disallow: /*_jsn.json
Disallow: /*_jsonpStatic.js
Disallow: /*_nitf.xml
Disallow: /*_newsml.html
Disallow: /*_qa.html
Disallow: /*_meta.xml
Disallow: /*_jsnp.js
Disallow: /*_json.json
Disallow: /*_search.html
Disallow: /*_jsonp.js
Disallow: /*_jsnpStatic.js
Disallow: /*_rss.xml
Disallow: /*_mobile.mobile
Disallow: /*_mobile.xml
Disallow: /*_allCommentsClassicBlog.html
Disallow: /*_seo.html
Disallow: /*_nimbusJson.json
Disallow: /*_nimbusJsonp.js
Disallow: /*_nimbusJsonpStatic.js
Disallow: /*_modal.html
Disallow: /todays_paper/
Disallow: /rw/WashingtonPost/Content/Epaper/
Disallow: /ac2/
Disallow: /blogs/slow-ride/
Disallow: /local/blogsandcolumns/slow-ride-story-tanked
Disallow: /local/blogsandcolumns/slow-ride-story-achenblog
Disallow: /local/blogsandcolumns/slow-ride-stream-tanked
Disallow: /local/blogsandcolumns/slow-ride-front
Disallow: /utils
Disallow: /jobs/JS_JobSearchResult
Disallow: /jobs/UpdateJobEmployerCounterServlet
Disallow: /jobs/JS_Login
Disallow: /jobs/EU_UpdateJobEmployerCounter
Disallow: /blogs/nationals-journal-beta/
Disallow: /blogs/test/
Disallow: /posttv-beta/
Disallow: /posttv/sponsored-video/
Disallow: /posttv/c/trendex/
Disallow: /posttv/c/video_search/
Disallow: /posttv/posttv/trendex
Disallow: /rweb/
Disallow: /wp-stat/vrroom/
Disallow: /classic-apps/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /sf/test/
Disallow: /news/test-liveblog/
Disallow: /pb/
Disallow: /homepage-video-test
Disallow: /testpage-forhomepage
Disallow: /knowmore
Disallow: /test
Disallow: /brand-studio/
Disallow: /sslsingle
Disallow: /sf/brand-connect/$

User-agent: Twitterbot
Allow: /posttv-beta/

User-agent: Googlebot-News
Disallow: /sf/brand-connect/wp/
Disallow: /posttv/sponsored-video/
Disallow: /posttv/c/trendex/
Disallow: /posttv/posttv/trendex
Disallow: /conversations/the-washington-post/2017/05/18/242f4306-3be4-11e7-a058-ddbb23c75d82_story.html
Disallow: /blogs/test/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /classic-apps/

User-agent: Googlebot
Disallow: /conversations/the-washington-post/2017/05/18/242f4306-3be4-11e7-a058-ddbb23c75d82_story.html
Disallow: /blogs/test/
Disallow: /news/test/
Disallow: /news/tablet/
Disallow: /classic-apps/

Sitemap: https://www.washingtonpost.com/web-sitemap-index.xml
Sitemap: https://www.washingtonpost.com/news-sitemap-index.xml
Sitemap: https://www.washingtonpost.com/video-sitemap.xml
Sitemap: https://www.washingtonpost.com/real-estate/sitemap.xml
Sitemap: https://jobs.washingtonpost.com/sitemapindex.xml
Sitemap: https://www.washingtonpost.com/wp-stat/sitemaps/index.xml

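Article lists

The path is simply the category name. Not listed under Disallow, so access is allowed.

Articles

The category name appears to come first in the path. For example, the article below is under /news/the-switch/. It is not listed under Disallow, so access is allowed. Access seems to be fine as long as the page is not a test or tablet page.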

With the iPhone X, Apple is asking you to break up with the home button
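
Many of the Washington Post rules use the `*` wildcard, and one ends with `$`. As far as I know, Python's `urllib.robotparser` does not interpret these as wildcards and matches them as literal prefixes instead. The following is a minimal sketch of wildcard matching under the common convention that `*` matches any sequence of characters and a trailing `$` anchors the end of the path. The helper names `pattern_to_regex` and `is_disallowed` are my own, the article paths are hypothetical, and Allow rules and rule precedence are deliberately ignored.

```python
from __future__ import annotations

import re

# Excerpt of the Washington Post Disallow patterns quoted above.
DISALLOW = [
    "/*_print.html",
    "/news/test/",
    "/news/tablet/",
    "/test",
    "/sf/brand-connect/$",
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal parts and join them with ".*" for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches the path from its beginning."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


# Hypothetical paths under /news/the-switch/, used only for illustration.
print(is_disallowed("/news/the-switch/wp/2017/09/12/example/", DISALLOW))            # False
print(is_disallowed("/news/the-switch/wp/2017/09/12/example_print.html", DISALLOW))  # True
print(is_disallowed("/news/test/page", DISALLOW))                                    # True
print(is_disallowed("/sf/brand-connect/", DISALLOW))                                 # True
print(is_disallowed("/sf/brand-connect/wp/example", DISALLOW))                       # False
```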

Los Angeles Times

Los Angeles Times
robots.txt

User-agent: *
Disallow: /search/
Disallow: /changebrowser
Disallow: /thirdpartyservice
Disallow: /*/thirdpartyservice
Disallow: /config/
Disallow: /*/config
Disallow: /dzcfg
Disallow: /*/dzcfg
Disallow: /*/test-template
Disallow: /*.json
Disallow: /test/
Disallow: /topic/*?
Disallow: /deeplinkid/
Disallow: /searchsuggest
Disallow: /*/searchsuggest
Disallow: /svgimageproc
Disallow: /hive/

User-agent: Googlebot-News
Disallow: /*-ugc-*.html
Disallow: /about/
Disallow: /cbp-*.html
Disallow: /*/cbp-*.html
Disallow: /bp/*
Disallow: /*-adstory.html
Disallow: /*sns-ap
Disallow: /*photogallery
Disallow: /brandpublishing/
Disallow: /*/brandpublishing/
Disallow: /shopping/
Disallow: /*/shopping/
Disallow: /paid-posts/
Disallow: /*/paid-posts/

User-agent: Twitterbot
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: discobot
Disallow: /

User-agent: CCBot
Crawl-Delay: 2

User-agent: Nutch
Crawl-Delay: 2

Sitemap: http://www.latimes.com/sitemap.xml

Article lists

Not listed under Disallow, so access is allowed.

Articles

Not listed under Disallow, so access is allowed.

With iPhone 8 and iPhone X, Apple creates another tier of luxury
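
The Los Angeles Times file shuts out one bot entirely (discobot) and asks others (CCBot, Nutch) to slow down via Crawl-Delay. A minimal sketch of reading those settings with `urllib.robotparser`, assuming an excerpt of the rules quoted above; the article URL and the user agent name `my-crawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the Los Angeles Times rules quoted above (wildcard lines omitted).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /test/

User-agent: discobot
Disallow: /

User-agent: Nutch
Crawl-Delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical article URL, used only for illustration.
ARTICLE = "http://www.latimes.com/business/technology/example-story.html"

generic_ok = parser.can_fetch("my-crawler", ARTICLE)  # falls back to the * group
discobot_ok = parser.can_fetch("discobot", ARTICLE)   # blocked by "Disallow: /"
nutch_delay = parser.crawl_delay("Nutch")             # seconds between requests
generic_delay = parser.crawl_delay("my-crawler")      # no delay set in the * group

print(generic_ok, discobot_ok, nutch_delay, generic_delay)
# prints: True False 2 None
```

Even when no Crawl-Delay applies to your own user agent, pausing between requests is a reasonable courtesy.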

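Summary

None of the four papers surveyed prohibits access by individuals. Rather, they tend to control in detail the behavior of crawlers run by commercial services, such as Googlebot.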
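
Since robots.txt files change over time, the conclusion is worth re-checking before any real crawl. The sketch below is one way to do that, not a definitive tool. The robots.txt URLs are assumed to follow the usual /robots.txt convention, and `fetch_text`, `survey`, and the user agent name `my-crawler` are hypothetical names introduced here. Some sites may also reject requests sent with the default user agent of `urllib`.

```python
from __future__ import annotations

from typing import Callable
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# Assumed robots.txt locations for the four papers.
SITES = {
    "USA Today": "https://www.usatoday.com/robots.txt",
    "New York Times": "https://www.nytimes.com/robots.txt",
    "The Washington Post": "https://www.washingtonpost.com/robots.txt",
    "Los Angeles Times": "https://www.latimes.com/robots.txt",
}


def fetch_text(url: str) -> str:
    """Download a robots.txt file and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def survey(
    sites: dict[str, str],
    user_agent: str,
    path: str = "/",
    fetch: Callable[[str], str] = fetch_text,
) -> dict[str, bool]:
    """Report, per site, whether user_agent may fetch the given path."""
    results = {}
    for name, robots_url in sites.items():
        parser = RobotFileParser()
        parser.parse(fetch(robots_url).splitlines())
        results[name] = parser.can_fetch(user_agent, path)
    return results
```

Calling `survey(SITES, "my-crawler")` performs live requests. Passing a different `fetch` function lets the same logic run against saved copies of the files.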

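References

Previous survey
robots.txt Survey: Japanese General-Interest Newspapers (robots.txtの調査 国内の新聞・一般紙編)

Sites consulted
Robots.txt Specifications (Robots.txt の仕様)
List of Precautions for Web Scraping (Webスクレイピングの注意事項一覧)
robots.txt - Qiita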
