
The Robots Exclusion Protocol (REP) in the web crawler "Heritrix"

Last updated at 2015-01-07. Posted at 2015-01-06.

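Introduction

When you use a crawler, you must always keep the Robots Exclusion Protocol in mind.

So I went looking through the Heritrix documentation to see what it says about the Robots Exclusion Protocol.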

What is the Robots Exclusion Protocol?

For reference:
[What is the Robots Exclusion Protocol (REP)? Basics of meta tags and robots.txt](http://web-tan.forum.impressrd.jp/e/2008/02/27/2710)
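
To make the robots.txt side of the protocol concrete, here is a minimal sketch of how a crawler might check a URL against robots.txt rules before fetching it. It uses Python's standard `urllib.robotparser` and is not how Heritrix (a Java program) implements it. The robots.txt content, the user agent name `example-crawler`, the helper `is_allowed`, and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/public/index.html", "/private/secret.html"):
        url = "http://example.com" + path
        print(url, is_allowed(ROBOTS_TXT, "example-crawler", url))
```

A polite crawler would run a check like this for every URL and skip the ones that are disallowed.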

Heritrix and the Robots Exclusion Protocol

It was stated right on the top page. It is important, so of course it would be.

Heritrix - Heritrix - IA Webteam Confluence
https://webarchive.jira.com/wiki/display/Heritrix/Heritrix#Heritrix-Webmasters!

The quotation and my translation follow.

Webmasters!
Heritrix is designed to respect the robots.txt exclusion directives and META robots tags,
and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.

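Ucchi's translation

Webmasters!
Heritrix is designed to respect the pages that robots.txt says to exclude, as well as meta tags (REP tags).
It also collects material at a steady pace, adapted so that it does not get in the way of a website's normal activity.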
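
The META robots tags mentioned in the quotation are the page-level half of the protocol. As a rough sketch of what honoring them involves, the following Python code extracts the directives from a page's `<meta name="robots">` tag using the standard `html.parser` module. This is an illustration only, not Heritrix's actual implementation; the class `RobotsMetaParser`, the function `robots_directives`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    # Sample page, invented for this illustration.
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```

A crawler that respects these tags would, for example, avoid following links from a page whose directives include `nofollow`.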

So I was able to confirm that Heritrix properly supports the Robots Exclusion Protocol.

And that's that.

The end.
