
How to Use DomCrawler

Posted at 2019-09-29

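Introduction

I did some web scraping while working on a Laravel project.
This article summarizes how to use Symfony's DomCrawler component, which I used for that task.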

Import

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">1番目のP</p>
        <p>2番目のP</p>
        <p>3番目のP</p>
    </body>
</html>
HTML;

// Pass the HTML string to the Crawler class.
$crawler = new Crawler($html);

Filtering

Using XPath

$crawler->filterXPath('html/body/p');

Using CSS selectors

$crawler->filter('body > p');
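
To make the comparison concrete, here is a minimal sketch showing that both approaches select the same three p elements. It assumes symfony/dom-crawler and symfony/css-selector are installed with Composer (filter() needs the latter), that the autoloader lives at vendor/autoload.php, and it uses the sample markup with English placeholder text.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Same structure as the sample HTML, with placeholder text.
$html = '<html><body><p class="message">First P</p><p>Second P</p><p>Third P</p></body></html>';
$crawler = new Crawler($html);

// Select the p elements directly under body, once with XPath and once with CSS.
$byXPath = $crawler->filterXPath('//body/p');
$byCss = $crawler->filter('body > p');

// Both calls return a new Crawler holding the matched nodes.
echo $byXPath->count(), PHP_EOL; // 3
echo $byCss->count(), PHP_EOL;   // 3
```

Neither call modifies $crawler, so the same instance can be filtered repeatedly.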

Selecting nodes

Get the p element at index 0

$crawler->filter('body > p')->eq(0);

Get the first p element

$crawler->filter('body > p')->first();

Get the last p element

$crawler->filter('body > p')->last();

Get the siblings of the first p element

$crawler->filter('body > p')->first()->siblings();

For example, given the following HTML,
this returns <p>2番目のP</p> and <p>3番目のP</p>.

<body>
    <p class="message">1番目のP</p>
    <p>2番目のP</p>
    <p>3番目のP</p>
</body>
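
The selection methods above can be tried together as follows. This is a rough sketch under the same assumptions as the article's setup (Composer autoloader at vendor/autoload.php, symfony/css-selector installed), with English placeholder text in the markup.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p class="message">First P</p><p>Second P</p><p>Third P</p></body></html>';
$crawler = new Crawler($html);
$paragraphs = $crawler->filter('body > p');

// eq(0) and first() both point at the first matched node.
$atIndexZero = $paragraphs->eq(0)->text(); // "First P"
$first = $paragraphs->first()->text();     // "First P"
$last = $paragraphs->last()->text();       // "Third P"

// siblings() returns the other element nodes sharing the same parent.
$siblingTexts = $paragraphs->first()->siblings()->each(function (Crawler $node) {
    return $node->text();
});

print_r($siblingTexts); // "Second P" and "Third P"
```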

Get the child elements of the selected node. Passing a CSS selector narrows the result to matching children.

$crawler->filter('body')->children();
$crawler->filter('body')->children('p.message');
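
As an illustration, the sketch below counts the children with and without a selector. It assumes a DomCrawler version whose children() accepts a selector argument, symfony/css-selector being installed, and a Composer autoloader at vendor/autoload.php; the markup uses English placeholder text.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p class="message">First P</p><p>Second P</p><p>Third P</p></body></html>';
$crawler = new Crawler($html);
$body = $crawler->filter('body');

// All direct child elements of body.
$allChildren = $body->children();

// Only the direct children that match the selector.
$messageChildren = $body->children('p.message');

echo $allChildren->count(), PHP_EOL;     // 3
echo $messageChildren->count(), PHP_EOL; // 1
echo $messageChildren->text(), PHP_EOL;  // First P
```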

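Get the ancestor elements (parent, grandparent, and so on) of the selected node.
Note that parents() was deprecated in Symfony 5.3 in favor of ancestors().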

$crawler->filter('body > p')->parents();

Getting node values

Get the node name

$tag = $crawler->filterXPath('//body/*')->nodeName();

For example, given the following HTML,
this returns the name of the first matching node, so in this case p is returned.

<body>
    <p class="message">1番目のP</p>
    <p>2番目のP</p>
    <p>3番目のP</p>
</body>

Get the text of the node

$message = $crawler->filterXPath('//body/p')->text();

Get the class name

$class = $crawler->filterXPath('//body/p')->attr('class');
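
nodeName(), text(), and attr() all read from the first node in the selection, even when several nodes match. The sketch below shows this, plus one way to guard against an empty selection, since text() throws an InvalidArgumentException when nothing matched. It assumes a Composer autoloader at vendor/autoload.php and uses English placeholder text.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p class="message">First P</p><p>Second P</p><p>Third P</p></body></html>';
$crawler = new Crawler($html);
$paragraphs = $crawler->filterXPath('//body/p');

// Three nodes match, but each call reads only the first one.
$tag = $paragraphs->nodeName();      // "p"
$message = $paragraphs->text();      // "First P"
$class = $paragraphs->attr('class'); // "message"

// The second paragraph has no class attribute, so attr() returns null.
$missingClass = $paragraphs->eq(1)->attr('class');

// Guard against an empty selection before calling text().
$divs = $crawler->filterXPath('//body/div');
$divText = $divs->count() > 0 ? $divs->text() : null;

var_dump($tag, $message, $class, $missingClass, $divText);
```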

Get the values of multiple nodes at once

$nodeValues = $crawler->filter('p')->each(function (Crawler $node, $i) {
    return $node->text();
});
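
each() calls the closure once per matched node, passing a Crawler for that node and its zero-based index, and collects the return values into an array. A minimal sketch, assuming a Composer autoloader at vendor/autoload.php, symfony/css-selector being installed, and English placeholder text:

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><p class="message">First P</p><p>Second P</p><p>Third P</p></body></html>';
$crawler = new Crawler($html);

// Build one labelled string per paragraph using the index and the text.
$nodeValues = $crawler->filter('p')->each(function (Crawler $node, $i) {
    return $i . ': ' . $node->text();
});

print_r($nodeValues); // "0: First P", "1: Second P", "2: Third P"
```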

Getting an image

// Pass the value of the alt attribute to selectImage
$image = $crawler->selectImage('イラスト1')->image();
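
image() returns an Image object whose getUri() resolves the src attribute to an absolute URL. The sketch below passes a base URI as the second constructor argument so that a relative src can be resolved. The markup, the alt text, and the example.com URL are made up for illustration, and the autoloader path is an assumption.

```php
<?php
// Assumption: dependencies were installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical page containing one image with a root-relative src.
$html = '<html><body><img src="/images/illust1.png" alt="Illustration 1"></body></html>';

// The second argument is the URI of the page, used to resolve relative URLs.
$crawler = new Crawler($html, 'https://example.com/articles/1');

// Select the image by its alt text and wrap it in an Image object.
$image = $crawler->selectImage('Illustration 1')->image();

$uri = $image->getUri();
echo $uri, PHP_EOL; // https://example.com/images/illust1.png
```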
