
Scraping with PHP

Last updated at 2022-06-06 / Posted at 2022-06-05

Background

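While I was developing a web application, the client asked me to pull in and use data from another site that they manage separately.
That data was not stored in a database but written directly into HTML files, so I retrieved it by scraping with PHP. This article records how I did it.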

The tool I used is ~~Google's~~ phpQuery (*).
With it, scraping turned out to be far easier than I expected.
(*) It appears to have been written by Tobiasz Cudnik, not by Google.

Prerequisites

The HTML structure (excerpt) of the site I wanted to get data from is as follows.

site-to-scrape.html

...
 <div class="info">
   <h3>hoge</h3>
   <div class="txt">
     <p>hoge_text</p>
   </div>
 </div>
 <div class="info">
   <h3>huge</h3>
   <div class="txt">
     <p>huge_text</p>
   </div>
 </div>
 <div class="info">
   <h3>hoga</h3>
   <div class="txt">
     <p>hoga_text</p>
   </div>
 </div>
...

What I did

Installing and setting up phpQuery

Download the PHP file from Google Code.
(* This article uses phpQuery-0.9.5.386-onefile.zip.)

Extracting the downloaded zip file should give you the following file.

phpQuery-onefile.php

Creating the file that runs phpQuery

Next, create a file (scraping.php) that runs the downloaded phpQuery.
In it, load phpQuery-onefile.php with require_once.

scraping.php

<?php 

require_once("./phpQuery-onefile.php");

$target_site = file_get_contents("URL of the site to scrape");
$file = fopen("./result.php", "a"); // file the scraped data is appended to

$htmlDOM = phpQuery::newDocument($target_site); // build the DOM from $target_site
$info = $htmlDOM->find("h3")->text(); // this time we want the contents of the h3 tags

fputs($file, $info); // write the scraped data to $file

fclose($file);

Running scraping.php should now write the following to result.php.

result.php

hoge
huge
hoga
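
The scraping.php above does not check whether the fetch succeeded. file_get_contents returns false on failure, so a guard along the following lines may help. This is only a minimal sketch: the helper name fetch_html_or_fail is hypothetical (it is not part of phpQuery or of the original script), and a local temporary file stands in for the target site.

```php
<?php

/**
 * Hypothetical helper: read a URL or local path and fail loudly.
 * file_get_contents() returns false on failure, so compare with ===.
 */
function fetch_html_or_fail(string $source): string
{
    // @ suppresses the warning so the failure is reported once, via the exception.
    $html = @file_get_contents($source);
    if ($html === false) {
        throw new RuntimeException("Could not read: {$source}");
    }
    return $html;
}

// Usage: a temporary local file stands in for the site to scrape.
$tmp = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($tmp, '<h3>hoge</h3>');
$fetched = fetch_html_or_fail($tmp);
unlink($tmp);

echo $fetched, "\n";
```

The same kind of check applies to fopen, which also returns false when the file cannot be opened.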

When you also want data from other elements

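This time, however, I wanted not only the h3 data but also the data in the p tag that sits inside the h3's sibling element (div.txt), so I modified the code above.
Fortunately, the target site is structured as shown in the prerequisites, so the data can be collected in a loop.
While I was at it, I changed the output to an array-like form so that the data is easier to use.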

scraping.php

<?php 

require_once("./phpQuery-onefile.php");

$target_site = file_get_contents("URL of the site to scrape");
$file = fopen("./result.php", "a");

// ↓ changed from here
$htmlDOM = phpQuery::newDocument($target_site); // parse once, outside the loop
$info_count = count($htmlDOM->find("div.info > h3"));

$info_title = [];
$info_text = [];

for ($i = 0; $i < $info_count; $i++) {
  // :eq($i) picks the i-th match, so titles and texts are paired by index
  $info_title[] = $htmlDOM->find("div.info > h3:eq($i)")->text();
  $info_text[] = $htmlDOM->find("div.txt > p:eq($i)")->text();

  fputs($file, $i . ' => [' . "\n" . '  "title" => "' . $info_title[$i] . '",' . "\n" . '  "text" => "'  . $info_text[$i] . '",' . "\n" . '],' . "\n");
}

fclose($file);

Running scraping.php should now write the following to result.php.

result.php

0 => [
  "title" => "hoge",
  "text" => "hoge_text",
],
1 => [
  "title" => "huge",
  "text" => "huge_text",
],
2 => [
  "title" => "hoga",
  "text" => "hoga_text",
],
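
Note that the output above is not a complete PHP file on its own, and pairing titles with texts by index assumes every div.info contains exactly one p. As a rough alternative sketch, and not the method used in this article, the following uses PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of phpQuery, reads each title and text relative to its own div.info block, and writes the result with var_export so that another script can load it with require. The sample markup is copied from the prerequisites, the temporary output path is an assumption for illustration, and the XPath expression only matches elements whose class attribute is exactly "info".

```php
<?php

// Sample markup copied from the structure shown in the prerequisites.
$sample_html = <<<'HTML'
<div class="info">
  <h3>hoge</h3>
  <div class="txt">
    <p>hoge_text</p>
  </div>
</div>
<div class="info">
  <h3>huge</h3>
  <div class="txt">
    <p>huge_text</p>
  </div>
</div>
HTML;

// Built-in DOM extension, swapped in here for phpQuery.
$doc = new DOMDocument();
// These flags keep libxml quiet about imperfect markup.
$doc->loadHTML($sample_html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($doc);

$items = [];
// Visit each div.info block and read its own h3 and p, so a pair cannot drift.
foreach ($xpath->query('//div[@class="info"]') as $block) {
    $items[] = [
        'title' => trim($xpath->evaluate('string(./h3)', $block)),
        'text'  => trim($xpath->evaluate('string(./div[@class="txt"]/p)', $block)),
    ];
}

// Write a file that returns the array, so it can be loaded with require.
$out_path = tempnam(sys_get_temp_dir(), 'result'); // assumed output location
file_put_contents($out_path, '<?php return ' . var_export($items, true) . ';' . "\n");

// Load the data back the way a consuming script would.
$loaded = require $out_path;
unlink($out_path);

print_r($loaded);
```

Because var_export escapes quotes in the values, this form also avoids breaking the generated file when a title or text contains a double quote.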
