
Extracting Only the Parts You Want from the Web with XPath

Last updated at 2023-02-20, posted at 2014-08-24

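Lately I have been running a machine learning engine that specializes in curry, so I have been building a corpus from the Web. Concretely, I copy and paste useful information from gourmet sites and the like and stockpile it, but stripping out ads and other clutter by hand is tedious. That reminded me: there was this thing called XPath, wasn't there?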

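The one trick here, I suppose, is that the array is nested: each domain maps to a list of XPath expressions, so several parts can be pulled out of a single document.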

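<?php
$s = "";
$url = "http://www.goo.ne.jp/";
// Map each domain to the XPath expressions to extract from its pages.
$patterns = [
    'www.goo.ne.jp' => [
        '//title',
        'id("news-chu-new")'
    ],
];

foreach ($patterns as $domain => $xpaths) {
    // Skip patterns whose domain does not appear in the URL.
    if (strpos($url, $domain) === false) {
        continue;
    }
    $content = @file_get_contents($url);
    if ($content === false || $content === '') {
        continue; // Fetch failed, so there is nothing to parse.
    }
    $page = new DOMDocument();
    // Real-world HTML is rarely valid, so suppress parser warnings.
    @$page->loadHTML($content);
    $xpath = new DOMXPath($page);
    foreach ($xpaths as $path) {
        $nodes = $xpath->query($path);
        // Skip expressions that are invalid or match nothing.
        if ($nodes === false || $nodes->length === 0) {
            continue;
        }
        // Only the first match is used; each result is prepended to $s.
        $s = $nodes->item(0)->textContent . " " . $s;
    }
}
$s = strip_tags($s);
// Replace line breaks and tabs with spaces so the output is a single line.
$s = str_replace(["\r\n", "\r", "\n", "\t"], ' ', $s);
echo $s;
?>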
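
The same nested-array idea can also be sketched as a function that works on an HTML string, which makes it easy to try without hitting the network. This is only a rough sketch under a few assumptions: the function name `extractByPatterns`, the sample HTML, and the `example.com` domain are all made up for illustration. Unlike the script above, it matches the domain exactly against the URL's host and collects every matching node instead of only the first one.

```php
<?php
/**
 * Extract text for each XPath expression registered for the URL's host.
 * extractByPatterns() and its parameters are hypothetical names.
 */
function extractByPatterns(string $html, string $url, array $patterns): array
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || !isset($patterns[$host]) || $html === '') {
        return [];
    }

    // Collect parser errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $page = new DOMDocument();
    $page->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($page);
    $results = [];
    foreach ($patterns[$host] as $path) {
        $nodes = $xpath->query($path);
        if ($nodes === false) {
            continue; // Invalid XPath expression.
        }
        foreach ($nodes as $node) {
            // Collapse runs of whitespace into single spaces.
            $results[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
        }
    }
    return $results;
}

// Example input: a made-up page, not the real goo.ne.jp markup.
$html = '<html><head><title>Curry News</title></head><body>'
      . '<div id="ad">Buy now</div>'
      . '<div id="news">Spice prices <b>rise</b></div>'
      . '</body></html>';
$patterns = ['example.com' => ['//title', '//div[@id="news"]']];

print_r(extractByPatterns($html, 'http://example.com/page', $patterns));
```

With the sample page, the result should contain the title text and the news text, while the ad block is left out because no expression selects it. Returning an array rather than one concatenated string keeps each extracted part separate, which may be handier when feeding a corpus.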
