robots.txtを取得しクロール拒否されていないかチェック①

Last updated at 2024-07-28Posted at 2024-07-28

概要

robots.txtを取得しクロール拒否されていないかをチェックするプログラムをPHPで作成したいと思います。

今回は、robots.txtを取得する処理を作成します。
次回は、クロール拒否されていないかチェックする処理を作成します。

前提

robots.txtとは

検索エンジンのクローラーなどに、アクセスしていいURLを伝えるファイルです。

クローラーを作成する場合は、robots.txtに準ずる必要があります。

robots.txtの場所

基本的にURLドメインの/直下に置くことになっているので、Qiitaのrobots.txtは、https://qiita.com/robots.txtにあります。

中身の説明は次回とさせていただきます。

コーディング

処理内容

プログラムは汎用的に作ろうと思うので、robots.txtを直接指定しなくても、どんなURLでもrobots.txtのURLに変換します。

このようなURLが指定された場合に、robots.txtのファイルのURLを生成します。
https://example.com/
https://example.com/robots.txt
https://example.com/hoge/
↓
https://example.com/robots.txt

プログラム（PHP）

robots.txtファイルのURLを生成する関数

/**
 * 指定されたウェブサイトのrobots.txtファイルのURLを生成
 *
 * @param string $url ウェブサイトの元のURL。
 * @return string ウェブサイトのrobots.txtファイルを指すURL。
 */
function makeRobotsUrl($url) {
    return parse_url($url)['scheme'] ."://". parse_url($url)['host'] ."/robots.txt";
}

robots.txtを取得する処理

// GETパラメータの取得
$requestUrl = isset($_GET["url"]) && filter_var($_GET["url"], FILTER_VALIDATE_URL) ? $_GET["url"] : false;
if ($requestUrl === false) {
   exit("urlが指定されていません");
}

// robots.txtの取得
$robotsUrl = makeRobotsUrl($requestUrl);
$robotsTxt = @file_get_contents($robotsUrl);

// robots.txtの存在チェック
if ($robotsTxt === false) {
    exit("robots.txtがありません");
}

このコードでrobots.txtが取得できるようになりました。

おわりに

簡単なプログラムですが、robots.txtを取得する処理を作成しました。次回は、robots.txtの内容の説明を含めクロール拒否されていないかチェックする処理を紹介したいと思います。

少しでも参考になれば幸いです。

ここまで読んでいただきありがとうございました！

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up