
Scraping with php-html-parser

Posted at 2022-09-20

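I want to do some scraping
php-html-parser is apparently a good choice.
I had been using a library called embed. It is fine for pulling out header metadata, but it seems weak at extracting body text.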

Reference
https://www.utakata.work/entry/php/webscraping-with-php-html-parser

Install ♡

composer require paquettg/php-html-parser

Huh? A dependency error?
(Use the -w option, short for --update-with-dependencies, which also updates the related packages in one go.)

composer require paquettg/php-html-parser -w

OK

Now let's pull the titles and links out of Yahoo! News.

hoge.php

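<?php
// Composer autoloader; the path is an assumption, adjust it to your project layout
require __DIR__ . '/vendor/autoload.php';

use PHPHtmlParser\Dom;
use PHPHtmlParser\Options;

$options = new Options();
//        $options->setEnforceEncoding('utf8');

// Yahoo! News search results for the keyword 婚活 (percent-encoded in the query string)
$url = 'https://news.yahoo.co.jp/search?ei=UTF-8&p=%E5%A9%9A%E6%B4%BB';

$dom = new Dom();
$dom->loadFromUrl($url, $options);

// Every link inside the news feed list
$elements = $dom->find('.newsFeed_list a');

// Map of article URL => headline
$res = [];
foreach( $elements as $v )
{
    // The headline sits in a nested element, so look it up inside the link
    $title = $v->find('.newsFeed_item_title');
    $res[$v->href] = $title->text;
}

print_r($res);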
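
The `$url` in the script above carries the search keyword as a percent-encoded query parameter. As a rough sketch of how such a URL could be put together for other keywords, the snippet below uses PHP's built-in `http_build_query`. The helper name `buildSearchUrl` is hypothetical and not part of the original script, and the parameter names `ei` and `p` are simply copied from the URL above.

```php
<?php
// Hypothetical helper (not in the original script): builds the search URL for a keyword.
function buildSearchUrl(string $keyword): string
{
    // http_build_query percent-encodes the values; the separator is passed
    // explicitly so the result does not depend on the arg_separator.output ini setting.
    $query = http_build_query([
        'ei' => 'UTF-8',   // parameter names taken from the URL used above
        'p'  => $keyword,
    ], '', '&');

    return 'https://news.yahoo.co.jp/search?' . $query;
}

$url = buildSearchUrl('婚活');
echo $url, PHP_EOL;
```

For the keyword 婚活 this reproduces the URL hard-coded in hoge.php, so the helper can replace the literal string when you want to try another search term.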

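Each a tag contains a div with the class .newsFeed_item_title, and text only returns the text sitting directly under an element, so $v->text; alone cannot retrieve the headline.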
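
To check this behaviour without hitting the network, here is a minimal sketch that parses a hard-coded fragment with `loadStr`. The HTML is made up for illustration and only mimics the class names mentioned above. It assumes the 3.x release of php-html-parser and that `vendor/autoload.php` sits next to the script.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader location

use PHPHtmlParser\Dom;

// Made-up fragment that mimics the structure described above
$html = '<ul class="newsFeed_list"><li><a href="/articles/1">'
      . '<div class="newsFeed_item_title">Sample headline</div>'
      . '</a></li></ul>';

$dom = new Dom();
$dom->loadStr($html);

// Passing an index to find() returns a single node instead of a collection
$a = $dom->find('.newsFeed_list a', 0);

$directText = $a->text;                                  // text directly under <a> only
$nestedText = $a->text(true);                            // also looks into child elements
$titleText  = $a->find('.newsFeed_item_title', 0)->text; // text of the nested div

var_dump($directText, $nestedText, $titleText);
```

With this fragment the first value should come back as an empty string, while the other two should both hold the headline. Looking up the nested element explicitly, as hoge.php does, keeps the result limited to the title even when the link contains other text.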

Result

Partially redacted

Array
(
    [https://news.yahoo.co.jp/articles/749fd17c64cxxxf77a6119f6afe8fee7cb8] => xxxと不倫…43歳娘が遺品整理で見つけてショックを受けた、不倫日記の「ヤバすぎる中身」
    [https://news.yahoo.co.jp/articles/16e3ce9badxxx274adb474d1af59a4d06d] => 年下の彼xxx、ある日衝撃の告白をされア然
    [https://news.yahoo.co.jp/articles/0d9ec0cxxxd25abac9472b0f9ae3c19877268] => 結婚相xxxマったアラサー女の末路
    [https://news.yahoo.co.jp/articles/bd3b6ebf79cxxe90f35b1b758937db94a14fa] => 「キンプリとか、ジャニーズ級のイケxxxの地獄。

That's all.
