
Extracting just the parts you want from the Web with XPath

Posted at 2014-08-24

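Lately I have been running a machine learning engine that specializes in curry, so I have been building a corpus from the Web. In practice that means copying useful text from gourmet sites and similar pages and stockpiling it, but stripping out the ads by hand is tedious. Then I remembered that XPath exists, so I gave it a try.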

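The one trick here, I suppose, is nesting the arrays: each domain maps to a list of XPath expressions, so several parts can be pulled out of a single document.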

<?php
$s = "";
$url = "http://www.goo.ne.jp/";
$patterns = [
    'www.goo.ne.jp' => [
        '//title',
        'id("news-chu-new")'
    ],
];

foreach( $patterns as $domain => $xpaths ){
    if( strstr($url,$domain) ){
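        $content = @file_get_contents($url);
        if( $content === false || $content === '' ){
            continue; // download failed or the page was empty
        }
        // Collect libxml parse warnings quietly instead of hiding them with @.
        libxml_use_internal_errors(true);
        $page = new DOMDocument();
        $page->loadHTML($content);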
        $xpath = new DOMXPath($page);
        foreach( $xpaths as $path){
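            $nodes = $xpath->query($path);
            if( $nodes === false || $nodes->length === 0 ){
                continue; // invalid expression or no matching node
            }
            // Use only the first match; each new text is prepended to $s.
            $s = $nodes->item(0)->textContent." ".$s;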
        }
    }
}
$s = strip_tags($s);
$s = str_replace(["\r\n","\r","\n","\t"], ' ', $s);
echo $s;
?>
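
To try the nested-array idea without touching the network, here is a minimal sketch that runs the same kind of lookup against an inlined page. The sample HTML, the `example.com` domain, and the `extractTexts` helper are all made up for illustration and are not part of the script above. It also compares the URL's host exactly instead of searching the whole URL string, which is one possible way to avoid accidental matches.

```php
<?php
// Hypothetical sample page, inlined so the sketch runs offline.
$html = <<<HTML
<html><head><title>Curry Guide</title></head>
<body>
  <div id="ad">Buy now!</div>
  <div id="review">Rich and spicy.</div>
</body></html>
HTML;

// Domain => list of XPath expressions, in the same shape as $patterns above.
// 'example.com' and both expressions are assumptions for this sketch.
$patterns = [
    'example.com' => ['//title', '//div[@id="review"]'],
];

/**
 * Return the trimmed text of the first node matching each XPath expression,
 * in the order the expressions are listed. Expressions that are invalid or
 * match nothing are skipped.
 */
function extractTexts(string $html, array $xpaths): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpaths as $path) {
        $nodes = $xpath->query($path);
        if ($nodes === false || $nodes->length === 0) {
            continue;
        }
        $texts[] = trim($nodes->item(0)->textContent);
    }
    return $texts;
}

$url = 'http://example.com/curry'; // hypothetical URL
foreach ($patterns as $domain => $xpaths) {
    // Compare the host exactly rather than searching the whole URL.
    if (parse_url($url, PHP_URL_HOST) === $domain) {
        echo implode(' ', extractTexts($html, $xpaths)), "\n";
    }
}
```

With this sample page, the ad block is left out because no expression selects it; only the title and the review text are collected.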