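What is robots.txt?
It is a file in which a website writes instructions for crawlers, such as "don't access this part of the site".
So when you crawl a site, I think you should respect it at the very least.
How do you parse it?
First, here is how to do it with urllib (using Google as the example this time).
We use the urllib.robotparser.RobotFileParser class.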
from urllib.robotparser import RobotFileParser
url = "https://www.google.com/robots.txt"
rp = RobotFileParser() # create a RobotFileParser instance
rp.set_url(url)
rp.read() # fetch and parse robots.txt
rp.can_fetch("*", "URL you want to access") # check whether the given user agent may fetch the URL; returns a bool
That covers the basic usage.
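If you want to try can_fetch without touching the network, something like the following should work. It is only a sketch: the robots.txt content and the example.com URLs are made up for illustration, and parse() is fed the lines directly instead of calling read(). Note that can_fetch takes the user agent as its first argument and the URL as its second.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
# parse() accepts an iterable of lines, so no HTTP request is made here
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

# First argument: user agent, second argument: URL to check
allowed_home = rp.can_fetch("*", "https://www.example.com/")
allowed_private = rp.can_fetch("*", "https://www.example.com/private/page.html")

print(allowed_home)     # True
print(allowed_private)  # False
```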
The annoying part of RobotFileParser is that you have to give rp.set_url a direct link to robots.txt.
You can get that link with urllib.parse.urljoin(url, "/robots.txt"), but doing this every single time is a hassle, isn't it?
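For reference, the urljoin step could be wrapped up roughly like this. The helper name robots_txt_url is my own and is not part of any library.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: a leading "/" makes urljoin replace the whole
    path (and drop the query string) while keeping the scheme and host.
    """
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("https://www.google.com/search?q=python"))
# https://www.google.com/robots.txt
```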
On top of that, some of you may want to look at the robots.txt instructions for each User-agent separately!
When that's the case, give robotsparsetools a try.
robotsparsetools
You can install it with pip.
$ pip install robotsparsetools
Usage is simple: just pass a URL to robotsparsetools.Parse.
from robotsparsetools import Parse
url = "https://www.google.com/" # passing a direct link to robots.txt also works
p = Parse(url)
print(p)
Running this gives you a dict (strictly speaking it is not quite a dict) with each User-agent as the key and the directives for that User-agent as the value.
{'*': {'Disallow': ['/search', '/sdch', '/groups', '/index.html?', '/?', '/?hl=*&', '/?hl=*&*&gws_rd=ssl', '/imgres', '/u/', '/preferences', '/setprefs', '/default', '/m?', '/m/', '/wml?', '/wml/?', '/wml/search?', '/xhtml?', '/xhtml/?', '/xhtml/search?', '/xml?', '/imode?', '/imode/?', '/imode/search?', '/jsky?', '/jsky/?', '/jsky/search?', '/pda?', '/pda/?', '/pda/search?', '/sprint_xhtml', '/sprint_wml', '/pqa', '/palm', '/gwt/', '/purchases', '/local?', '/local_url', '/shihui?', '/shihui/', '/products?', '/product_', '/products_', '/products;', '/print', '/books/', '/bkshp?*q=*', '/books?*q=*', '/books?*output=*', '/books?*pg=*', '/books?*jtp=*', '/books?*jscmd=*', '/books?*buy=*', '/books?*zoom=*', '/ebooks/', '/ebooks?*q=*', '/ebooks?*output=*', '/ebooks?*pg=*', '/ebooks?*jscmd=*', '/ebooks?*buy=*', '/ebooks?*zoom=*', '/patents?', '/patents/download/', '/patents/pdf/', '/patents/related/', '/scholar', '/citations?', '/citations?*cstart=', '/s?', '/maps?', '/mapstt?', '/mapslt?', '/maps/stk/', '/maps/br?', '/mapabcpoi?', '/maphp?', '/mapprint?', '/maps/api/js/', '/maps/api/place/js/', '/maps/api/staticmap', '/maps/api/streetview', '/maps/_/sw/manifest.json', '/mld?', '/staticmap?', '/maps/preview', '/maps/place', '/maps/timeline/', '/help/maps/streetview/partners/welcome/', '/help/maps/indoormaps/partners/', '/lochp?', '/center', '/ie?', '/blogsearch/', '/blogsearch_feeds', '/advanced_blog_search', '/uds/', '/chart?', '/transit?', '/calendar/', '/cl2/feeds/', '/cl2/ical/', '/coop/directory', '/coop/manage', '/trends?', '/trends/music?', '/trends/hottrends?', '/trends/viz?', '/trends/embed.js?', '/trends/fetchComponent?', '/trends/beta', '/trends/topics', '/musica', '/musicad', '/musicas', '/musicl', '/musics', '/musicsearch', '/musicsp', '/musiclp', '/urchin_test/', '/movies?', '/wapsearch?', '/reviews/search?', '/orkut/albums', '/cbk', '/recharge/dashboard/car', '/recharge/dashboard/static/', '/profiles/me', '/s2/profiles/me', '/s2', '/transconsole/portal/', 
'/gcc/', '/aclk', '/cse?', '/cse/home', '/cse/panel', '/cse/manage', '/tbproxy/', '/imesync/', '/shenghuo/search?', '/support/forum/search?', '/reviews/polls/', '/hosted/images/', '/ppob/?', '/ppob?', '/accounts/ClientLogin', '/accounts/ClientAuth', '/accounts/o8', '/topicsearch?q=', '/xfx7/', '/squared/api', '/squared/search', '/squared/table', '/qnasearch?', '/app/updates', '/sidewiki/entry/', '/quality_form?', '/labs/popgadget/search', '/buzz/post', '/compressiontest/', '/analytics/feeds/', '/analytics/partners/comments/', '/analytics/portal/', '/analytics/uploads/', '/alerts/', '/ads/search?', '/ads/plan/action_plan?', '/ads/plan/api/', '/ads/hotels/partners', '/phone/compare/?', '/travel/clk', '/hotelfinder/rpc', '/hotels/rpc', '/commercesearch/services/', '/evaluation/', '/chrome/browser/mobile/tour', '/compare/*/apply*', '/forms/perks/', '/shopping/suppliers/search', '/ct/', '/edu/cs4hs/', '/trustedstores/s/', '/trustedstores/tm2', '/trustedstores/verify', '/adwords/proposal', '/shopping?*', '/shopping/product/', '/shopping/seller', '/shopping/ratings/account/metrics', '/shopping/ratings/merchant/immersivedetails', '/shopping/reviewer', '/about/careers/applications/', '/landing/signout.html', '/webmasters/sitemaps/ping?', '/ping?', '/gallery/', '/landing/now/ontap/', '/maps/reserve/api/', '/maps/reserve/search', '/maps/reserve/bookings', '/maps/reserve/settings', '/maps/reserve/manage', '/maps/reserve/payment', '/maps/reserve/receipt', '/maps/reserve/sellersignup', '/maps/reserve/payments', '/maps/reserve/feedback', '/maps/reserve/terms', '/maps/reserve/m/', '/maps/reserve/b/', '/maps/reserve/partner-dashboard', '/about/views/', '/intl/*/about/views/', '/local/cars', '/local/cars/', '/local/dealership/', '/local/dining/', '/local/place/products/', '/local/place/reviews/', '/local/place/rap/', '/local/tab/', '/localservices/*', '/nonprofits/account/'], 'Allow': ['/search/about', '/search/static', '/search/howsearchworks', '/?hl=', '/?hl=*&gws_rd=ssl$', 
'/?gws_rd=ssl$', '/?pt1=true$', ' /m/finance', '/books?*q=related:*', '/books?*q=editions:*', '/books?*q=subject:*', '/books/about', '/booksrightsholders', '/books?*zoom=1*', '/books?*zoom=5*', '/books/content?*zoom=1*', '/books/content?*zoom=5*', '/ebooks?*q=related:*', '/ebooks?*q=editions:*', '/ebooks?*q=subject:*', '/ebooks?*zoom=1*', '/ebooks?*zoom=5*', '/citations?user=', '/citations?view_op=new_profile', '/citations?view_op=top_venues', '/scholar_share', '/maps?*output=classic*', '/maps?*file=', '/maps/d/', '/maps/api/js', ' /calendar$', ' /calendar/about/', '/safebrowsing/diagnostic', '/safebrowsing/report_badware/', '/safebrowsing/report_error/', '/safebrowsing/report_phish/', '/profiles', '/s2/profiles', '/s2/oz', '/s2/photos', '/s2/search/social', '/s2/static', '/accounts/o8/id', '/alerts/manage', '/alerts/remove', '/alerts/$', '/searchhistory/', '/maps/reserve', '/maps/reserve/partners', '/finance', '/js/']}, 'AdsBot-Google': {'Disallow': ['/maps/api/js/', '/maps/api/place/js/', '/maps/api/staticmap', '/maps/api/streetview'], 'Allow': ['/maps/api/js']}, 'Twitterbot': {'Allow': ['/imgres']}, 'facebookexternalhit': {'Allow': ['/imgres']}, 'Sitemap': ['https://www.google.com/sitemap.xml']}
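To make the shape of that structure a bit more concrete, here is a rough sketch that builds a similar mapping by hand from a small robots.txt string. This is not how robotsparsetools is implemented (I am not showing its internals here); group_rules and the sample content are hypothetical, and the parsing is deliberately simplified.

```python
def group_rules(robots_txt: str) -> dict:
    """Group Allow/Disallow paths by User-agent (simplified sketch)."""
    rules = {}
    agents = []       # user agents of the group currently being read
    in_rules = False  # True once the current group has at least one rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.capitalize()  # "user-agent" -> "User-agent"
        if field == "User-agent":
            if in_rules:
                # A User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
            rules.setdefault(value, {})
        elif field in ("Allow", "Disallow"):
            in_rules = True
            for agent in agents:
                rules[agent].setdefault(field, []).append(value)
        elif field == "Sitemap":
            rules.setdefault("Sitemap", []).append(value)
    return rules


# Hypothetical robots.txt content, used only for illustration
SAMPLE = """\
User-agent: *
Disallow: /search
Allow: /search/about

User-agent: Twitterbot
Allow: /imgres

Sitemap: https://www.example.com/sitemap.xml
"""

rules = group_rules(SAMPLE)
print(rules["*"]["Disallow"])  # ['/search']
print(rules["Twitterbot"])     # {'Allow': ['/imgres']}
```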
Methods such as get are also available.
p.get("*").get("Disallow")
p["*"]["Disallow"]
['/search', '/sdch', '/groups', '/index.html?', '/?', '/?hl=*&', '/?hl=*&*&gws_rd=ssl', '/imgres', '/u/', '/preferences', '/setprefs', '/default', '/m?', '/m/', '/wml?', '/wml/?', '/wml/search?', '/xhtml?', '/xhtml/?', '/xhtml/search?', '/xml?', '/imode?', '/imode/?', '/imode/search?', '/jsky?', '/jsky/?', '/jsky/search?', '/pda?', '/pda/?', '/pda/search?', '/sprint_xhtml', '/sprint_wml', '/pqa', '/palm', '/gwt/', '/purchases', '/local?', '/local_url', '/shihui?', '/shihui/', '/products?', '/product_', '/products_', '/products;', '/print', '/books/', '/bkshp?*q=*', '/books?*q=*', '/books?*output=*', '/books?*pg=*', '/books?*jtp=*', '/books?*jscmd=*', '/books?*buy=*', '/books?*zoom=*', '/ebooks/', '/ebooks?*q=*', '/ebooks?*output=*', '/ebooks?*pg=*', '/ebooks?*jscmd=*', '/ebooks?*buy=*', '/ebooks?*zoom=*', '/patents?', '/patents/download/', '/patents/pdf/', '/patents/related/', '/scholar', '/citations?', '/citations?*cstart=', '/s?', '/maps?', '/mapstt?', '/mapslt?', '/maps/stk/', '/maps/br?', '/mapabcpoi?', '/maphp?', '/mapprint?', '/maps/api/js/', '/maps/api/place/js/', '/maps/api/staticmap', '/maps/api/streetview', '/maps/_/sw/manifest.json', '/mld?', '/staticmap?', '/maps/preview', '/maps/place', '/maps/timeline/', '/help/maps/streetview/partners/welcome/', '/help/maps/indoormaps/partners/', '/lochp?', '/center', '/ie?', '/blogsearch/', '/blogsearch_feeds', '/advanced_blog_search', '/uds/', '/chart?', '/transit?', '/calendar/', '/cl2/feeds/', '/cl2/ical/', '/coop/directory', '/coop/manage', '/trends?', '/trends/music?', '/trends/hottrends?', '/trends/viz?', '/trends/embed.js?', '/trends/fetchComponent?', '/trends/beta', '/trends/topics', '/musica', '/musicad', '/musicas', '/musicl', '/musics', '/musicsearch', '/musicsp', '/musiclp', '/urchin_test/', '/movies?', '/wapsearch?', '/reviews/search?', '/orkut/albums', '/cbk', '/recharge/dashboard/car', '/recharge/dashboard/static/', '/profiles/me', '/s2/profiles/me', '/s2', '/transconsole/portal/', '/gcc/', '/aclk', 
'/cse?', '/cse/home', '/cse/panel', '/cse/manage', '/tbproxy/', '/imesync/', '/shenghuo/search?', '/support/forum/search?', '/reviews/polls/', '/hosted/images/', '/ppob/?', '/ppob?', '/accounts/ClientLogin', '/accounts/ClientAuth', '/accounts/o8', '/topicsearch?q=', '/xfx7/', '/squared/api', '/squared/search', '/squared/table', '/qnasearch?', '/app/updates', '/sidewiki/entry/', '/quality_form?', '/labs/popgadget/search', '/buzz/post', '/compressiontest/', '/analytics/feeds/', '/analytics/partners/comments/', '/analytics/portal/', '/analytics/uploads/', '/alerts/', '/ads/search?', '/ads/plan/action_plan?', '/ads/plan/api/', '/ads/hotels/partners', '/phone/compare/?', '/travel/clk', '/hotelfinder/rpc', '/hotels/rpc', '/commercesearch/services/', '/evaluation/', '/chrome/browser/mobile/tour', '/compare/*/apply*', '/forms/perks/', '/shopping/suppliers/search', '/ct/', '/edu/cs4hs/', '/trustedstores/s/', '/trustedstores/tm2', '/trustedstores/verify', '/adwords/proposal', '/shopping?*', '/shopping/product/', '/shopping/seller', '/shopping/ratings/account/metrics', '/shopping/ratings/merchant/immersivedetails', '/shopping/reviewer', '/about/careers/applications/', '/landing/signout.html', '/webmasters/sitemaps/ping?', '/ping?', '/gallery/', '/landing/now/ontap/', '/maps/reserve/api/', '/maps/reserve/search', '/maps/reserve/bookings', '/maps/reserve/settings', '/maps/reserve/manage', '/maps/reserve/payment', '/maps/reserve/receipt', '/maps/reserve/sellersignup', '/maps/reserve/payments', '/maps/reserve/feedback', '/maps/reserve/terms', '/maps/reserve/m/', '/maps/reserve/b/', '/maps/reserve/partner-dashboard', '/about/views/', '/intl/*/about/views/', '/local/cars', '/local/cars/', '/local/dealership/', '/local/dining/', '/local/place/products/', '/local/place/reviews/', '/local/place/rap/', '/local/tab/', '/localservices/*', '/nonprofits/account/']
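To check whether access to a URL is disallowed, use p.can_crawl (this is the counterpart of rp.can_fetch).
p.can_crawl("URL to check", "User-agent name") # returns a bool
If you omit the second argument, the User-agent name, it is treated as * (all crawlers).
You can get the Crawl-delay (crawl interval) with p.delay(User-agent name).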
p.delay("User-agent")
Here too, if you omit the User-agent name, it is treated as *.
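For comparison, the standard library's RobotFileParser has a crawl_delay method as well. A minimal sketch, assuming a made-up robots.txt with a Crawl-delay of 5 seconds; the 1-second fallback is just an arbitrary choice of mine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

# crawl_delay returns None when no Crawl-delay applies to the user agent
delay = rp.crawl_delay("*")

# Fall back to an arbitrary 1-second pause when nothing is specified
wait_seconds = delay if delay is not None else 1
print(wait_seconds)  # 5
```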
You can get the Sitemap with either p.get("Sitemap") or p["Sitemap"].
p.get("Sitemap")
p["Sitemap"]
# either one works
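If you only need the Sitemap entries, the standard library can also return them through site_maps (available since Python 3.8). Again a small sketch with made-up content:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(SAMPLE_ROBOTS_TXT.splitlines())

# site_maps returns a list of URLs, or None when there is no Sitemap line
sitemaps = rp.site_maps()
print(sitemaps)  # ['https://www.example.com/sitemap.xml']
```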
Also, though you may not need it often, you can fetch robots.txt with requests instead of urlopen.
You can specify options to pass to requests.get as well.
p = Parse("https://www.google.com/", requests=True, timeout=(1.0, 10.0))
Summary
Its use cases may be limited, but please give robotsparsetools a try for parsing robots.txt!