Scraping with PHP

Last updated at 2019-04-17, posted at 2017-04-14

$html = '<html>
	<head>
            <meta content="htmlスクレイピング - Qiita" property="og:title">
            <meta content="https://cdn.qiita.com/assets/qiita-fb-2887e7b4aad86fd8c25cea84846f2236.png" property="og:image">
            <meta content="ogのdescription" property="og:description">
        </head>
	<body>
		<p id="first">上</p>
		<p id="second">中</p>
		<p id="third" class="test">下</p>
		<div>sampleA</div>
		<div>
			<p class="test">sampleB</p>
		</div>
		<div id="area">
			<p>
				<span>sampleC</span>
			</p>
			<div>sampleD</div>
            sampleE
		</div>
	</body>
</html>';

$dom_document = new DOMDocument();
@$dom_document->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$xml_object = simplexml_import_dom($dom_document);
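The loading code above relies on two things that may need attention on newer PHP versions: the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2, and the @ operator hides every warning. The following is a minimal sketch of an alternative, assuming the input is UTF-8. The function name load_html_as_simplexml is hypothetical and does not appear in the original code.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML string into a SimpleXMLElement.
function load_html_as_simplexml(string $html): ?SimpleXMLElement
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    // Encode every non-ASCII character as a numeric entity so the HTML parser
    // does not have to guess the input encoding.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom_document = new DOMDocument();
    $dom_document->loadHTML($encoded);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return simplexml_import_dom($dom_document);
}

$xml_object = load_html_as_simplexml('<html><body><p id="first">上</p></body></html>');
var_dump((string)$xml_object->body->p[0]);
// string(3) "上"
```

The rest of the examples work the same way on the object returned by this helper.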

Using DOMDocument

foreach ($dom_document->getElementsByTagName('p') as $item)
{
    foreach ($item->childNodes as $node)
    {
        var_dump($node->nodeValue, $node->textContent);
    }
}

Using SimpleXMLElement

Basics

(string)$xml_object->body->p[0];
// string(3) "上"

(string)$xml_object->body->p[0]->attributes()->id;
// string(5) "first"

(string)$xml_object->body->p[0]['id'];
// string(5) "first"

XPath

Selecting elements

Select by element content

$xml_object->xpath('/html/body/p[.="中"]'); // absolute path from the root
$xml_object->xpath('//body/p[.="中"]'); // path that starts at any depth
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(6) "second" } [0]=> string(3) "中" } }

Select by attribute

$xml_object->xpath('//p[@id="first"]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(5) "first" } [0]=> string(3) "上" } }

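$xml_object->xpath('//p[@class="test"]');
// array(2) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(2) { ["id"]=> string(5) "third" ["class"]=> string(4) "test" } [0]=> string(3) "下" } [1]=> object(SimpleXMLElement)#4 (2) { ["@attributes"]=> array(1) { ["class"]=> string(4) "test" } [0]=> string(7) "sampleB" } }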

Select by position (the first element is 1)

$xml_object->xpath('//p[2]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(6) "second" } [0]=> string(3) "中" } }
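A position predicate such as [2] is evaluated relative to each parent, so //p[2] can match more than one element in other documents. The sketch below uses a small made-up XML string (not the sample HTML from this article) to contrast it with (//p)[2], which picks the second p in document order.

```php
<?php
// Made-up document for illustration: body has two p children, div has two p children.
$xml = simplexml_load_string(
    '<body><p>a</p><div><p>b</p><p>c</p></div><p>d</p></body>'
);

// Convert a list of SimpleXMLElement objects to plain strings.
$to_strings = fn(array $elements): array => array_map(fn($e) => (string)$e, $elements);

// Every p that is the second p child of its own parent.
$per_parent = $to_strings($xml->xpath('//p[2]'));
var_dump($per_parent);
// array(2) { [0]=> string(1) "c" [1]=> string(1) "d" }

// The second p in the whole document.
$in_document = $to_strings($xml->xpath('(//p)[2]'));
var_dump($in_document);
// array(1) { [0]=> string(1) "b" }
```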

Select elements that have a particular child element

$xml_object->xpath('//div[p]'); // div elements that have a p as a direct child
// array(2) { [0]=> object(SimpleXMLElement)#3 (1) { ["p"]=> string(7) "sampleB" } [1]=> object(SimpleXMLElement)#4 (2) { ["p"]=> object(SimpleXMLElement)#5 (1) { ["span"]=> string(7) "sampleC" } ["div"]=> string(7) "sampleD" } }

Select by navigating the hierarchy

$xml_object->xpath('//div/p[1]/../../p[1]');
// array(1) { [0]=> object(SimpleXMLElement)#3 (2) { ["@attributes"]=> array(1) { ["id"]=> string(5) "first" } [0]=> string(3) "上" } }

Getting values

$xml_object->xpath($xpath_string) returns an empty array when nothing matches and false when an error occurs.

Check the result with if ( ! empty($xml_object->xpath($xpath_string))) and continue only when it actually contains objects.
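The check described above can be wrapped in a small helper so that indexing with [0] never runs on an empty or false result. This is only a sketch; the function name xpath_first is hypothetical, and the XML string is a shortened stand-in for the sample document.

```php
<?php
// Hypothetical helper: return the first match as a string, or null if there is none.
function xpath_first(SimpleXMLElement $xml, string $query): ?string
{
    $result = $xml->xpath($query);

    // empty() covers both the "no match" case ([]) and the error case (false/null).
    if (empty($result)) {
        return null;
    }

    return (string)$result[0];
}

$xml_object = simplexml_load_string(
    '<html><body><p id="first">a</p><p id="second">b</p></body></html>'
);

var_dump(xpath_first($xml_object, '//p[2]'));     // string(1) "b"
var_dump(xpath_first($xml_object, '//p[2]/@id')); // string(6) "second"
var_dump(xpath_first($xml_object, '//p[9]'));     // NULL
```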

Get the content of an element

(string)$xml_object->xpath('//p[2]')[0];
// string(3) "中"

Get an attribute of an element

(string)$xml_object->xpath('//p[2]/@id')[0];
// string(6) "second"

(string)$xml_object->xpath('//p[3]/@class')[0];
// string(4) "test"

(string)$xml_object->xpath('//meta[@property="og:title"]/@content')[0];
// string(33) "htmlスクレイピング - Qiita"

(string)$xml_object->xpath('//meta[@property="og:image"]/@content')[0];
// string(74) "https://cdn.qiita.com/assets/qiita-fb-2887e7b4aad86fd8c25cea84846f2236.png"

(string)$xml_object->xpath('//meta[@property="og:description"]/@content')[0];
// string(16) "ogのdescription"
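When several og: properties are needed, querying them one by one gets repetitive. One possible approach, sketched here with a made-up XML string and the XPath 1.0 function starts-with(), is to collect them into an associative array in a single pass. The variable name $og is an assumption for illustration.

```php
<?php
// Made-up head section; the third meta has no property attribute and is skipped.
$xml_object = simplexml_load_string(
    '<html><head>'
    . '<meta content="title text" property="og:title"/>'
    . '<meta content="desc text" property="og:description"/>'
    . '<meta content="ignored" name="viewport"/>'
    . '</head><body/></html>'
);

// Map each og:* property name to its content attribute.
$og = [];
foreach ($xml_object->xpath('//meta[starts-with(@property, "og:")]') as $meta) {
    $og[(string)$meta['property']] = (string)$meta['content'];
}

var_dump($og);
// array(2) { ["og:title"]=> string(10) "title text" ["og:description"]=> string(9) "desc text" }
```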

XPath queries can also be split up and chained

$xml_object->xpath('//div')[1]->xpath('p');
// array(1) { [0]=> object(SimpleXMLElement)#6 (2) { ["@attributes"]=> array(1) { ["class"]=> string(4) "test" } [0]=> string(7) "sampleB" } }

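Getting all of the text inside a given element

When a SimpleXMLElement contains other SimpleXMLElement children, a (string) cast does not return the text inside those children.
(string) returns only the text directly under the element that is not wrapped in a tag.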

$xml_object->xpath('//div[@id="area"]');

array(1) {
  [0]=>
  object(SimpleXMLElement)#10948 (3) {
    ["@attributes"]=>
    array(1) {
      ["id"]=>
      string(4) "area"
    }
    ["p"]=>
    object(SimpleXMLElement)#10950 (1) {
      ["span"]=>
      string(7) "sampleC"
    }
    ["div"]=>
    string(7) "sampleD"
  }
}

(string)$xml_object->xpath('//div[@id="area"]')[0];

string(87) "


                    sampleE
                "

To get all of the text under the element, serialize it and strip the tags as follows.

strip_tags($xml_object->xpath('//div[@id="area"]')[0]->asXml());

string(147) "

                        sampleC

                    sampleD
                    sampleE
                "
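As a hedged alternative to strip_tags(), the element can be handed to the DOM extension with dom_import_simplexml() and read through textContent, which concatenates the text of all descendants. Unlike the asXml() route, this returns decoded text rather than serialized markup. The sketch uses a compact stand-in for the area div, without the indentation of the sample document.

```php
<?php
// Compact stand-in for the <div id="area"> block of the sample document.
$xml_object = simplexml_load_string(
    '<body><div id="area"><p><span>sampleC</span></p><div>sampleD</div>sampleE</div></body>'
);
$area = $xml_object->xpath('//div[@id="area"]')[0];

// (string) only sees text nodes that are direct children of the element.
$direct_text = (string)$area;
var_dump($direct_text);
// string(7) "sampleE"

// textContent includes the text of every descendant element.
$all_text = dom_import_simplexml($area)->textContent;
var_dump($all_text);
// string(21) "sampleCsampleDsampleE"
```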

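Be careful with XPath expressions that select an attribute with /@: asXml() then returns the serialized attribute, so the result looks like the following (with /@id you get id="XXXXXX").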

(string)$xml_object->xpath('//@id')[0];

string(5) "first"

strip_tags($xml_object->xpath('//@id')[0]->asXml());

string(11) " id="first""

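In real scraping work, some sources returned strings containing HTML entities, so I used something like the following.

mb_ereg_replace('\s', '', html_entity_decode($string, ENT_QUOTES)); // removes whitespace characters other than the full-width space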
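A variant of the same cleanup can spell out the whitespace characters explicitly, so that whether the full-width space survives does not depend on how the regex engine interprets \s. This is a sketch under that assumption, not the original author's code; the function name normalize_scraped_text is hypothetical.

```php
<?php
// Hypothetical helper: decode HTML entities, then delete ASCII whitespace only.
function normalize_scraped_text(string $string): string
{
    $decoded = html_entity_decode($string, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // The explicit character class leaves the full-width space (U+3000) untouched.
    return (string)preg_replace('/[ \t\r\n]+/', '', $decoded);
}

var_dump(normalize_scraped_text("  A &amp; B\n&lt;tag&gt; "));
// string(8) "A&B<tag>"
```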
