
Web scraping with php-phantomjs



Environment and tools


  • macOS High Sierra 10.13.2

  • PHP 7.1.7 -> PHP 5.6.35

  • Composer version 1.6.3


Environment setup


  • Check the PHP version

$ php -v

PHP 7.1.7 (cli) (built: Jul 15 2017 18:08:09) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2017 Zend Technologies


  • Install PHP 5.6

$ brew install php@5.6



  • Switch from PHP 7.1 to PHP 5.6

$ vim ~/.bash_profile 


~/.bash_profile

# php5.6 #

export PATH=/usr/local/Cellar/php\@5.6/5.6.35/bin:$PATH

$ source ~/.bash_profile 

$ php -v

PHP 5.6.35 (cli) (built: Mar 31 2018 20:21:31)
Copyright (c) 1997-2016 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2016 Zend Technologies
with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2016, by Zend Technologies
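
Because the switch depends on PATH, a script can still end up running under a different PHP than the one shown by php -v in another shell. The following is a minimal sketch of a runtime check, not part of the original setup; phpSeries() and isExpectedSeries() are hypothetical helper names introduced here for illustration.

```php
<?php
// Hypothetical helper: reduce a full version such as "5.6.35" to its "5.6" series.
function phpSeries($version)
{
    $parts = explode('.', $version);
    return $parts[0] . '.' . (isset($parts[1]) ? $parts[1] : '0');
}

// Hypothetical helper: true when $version belongs to the expected series.
function isExpectedSeries($version, $expectedSeries)
{
    return phpSeries($version) === $expectedSeries;
}

// "5.6" is the series selected through ~/.bash_profile above.
$expected = '5.6';
if (!isExpectedSeries(PHP_VERSION, $expected)) {
    echo 'Running PHP ' . PHP_VERSION . ', expected ' . $expected . ".x\n";
}
```

Placing a check like this at the top of a scraping script makes a PATH mix-up visible immediately instead of surfacing later as an unrelated error.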


  • Install Composer

$ brew install homebrew/php/composer

$ composer -v

   ______
  / ____/___  ____ ___  ____  ____  ________  _____
 / /   / __ \/ __ `__ \/ __ \/ __ \/ ___/ _ \/ ___/
/ /___/ /_/ / / / / / / /_/ / /_/ (__  )  __/ /
\____/\____/_/ /_/ /_/ .___/\____/____/\___/_/
                    /_/
Composer version 1.6.3 2018-01-31 16:28:17

$ composer init

Note: in hindsight, the composer init step above may not have been necessary.


  • Install php-phantomjs


composer.json

{
    "config": {
        "bin-dir": "bin"
    },
    "scripts": {
        "post-install-cmd": [
            "PhantomInstaller\\Installer::installPhantomJS"
        ],
        "post-update-cmd": [
            "PhantomInstaller\\Installer::installPhantomJS"
        ]
    }
}

$ composer require "jonnyw/php-phantomjs:4.*"
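
The install scripts in composer.json are meant to place a PhantomJS binary in the bin directory. Before writing any scraping code it can help to confirm that the binary is really there. This is a small sketch under assumptions: findPhantomJs() is a hypothetical helper, and the file names phantomjs and phantomjs.exe are assumed rather than taken from the original article.

```php
<?php
// Hypothetical helper: return the path of an executable PhantomJS binary
// inside $binDir, or null when none is found.
function findPhantomJs($binDir)
{
    // Assumed file names: "phantomjs" on macOS/Linux, "phantomjs.exe" on Windows.
    foreach (array('phantomjs', 'phantomjs.exe') as $name) {
        $path = rtrim($binDir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }
    return null;
}

// "bin" matches the bin-dir value configured in composer.json.
$path = findPhantomJs(__DIR__ . '/bin');
echo $path === null
    ? "PhantomJS not found in bin/; try running composer install again\n"
    : "PhantomJS found at {$path}\n";
```

If the binary is missing, the post-install script probably did not run, which can happen when the scripts section was added after the package was already installed.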


Coding


scrape-php-phantomjs.php

<?php

require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;

// Set up the PhantomJS client and an empty request/response pair.
$client = Client::getInstance();
$request = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();

// Fetch the page through PhantomJS.
$url = 'URL'; // placeholder: the page to scrape
$request->setUrl($url);
$client->send($request, $response);

// Parse the returned HTML. "@" suppresses warnings about malformed markup.
$htmlstr = $response->getContent();
$dom = new DOMDocument;
@$dom->loadHTML($htmlstr);
$xpath = new DOMXPath($dom);

// Collect a title and a price from every matching list item.
// "CLASS NAME" is a placeholder for the class attribute on the target page.
$entries = [];
$q_product = '//li[@class="CLASS NAME"]';
foreach ($xpath->query($q_product) as $node) {
    $entries[] = [
        'title' => $xpath->evaluate('string(.//h2[@class="CLASS NAME"]/a)', $node),
        'price' => $xpath->evaluate('string(.//span[@class="CLASS NAME"][1])', $node),
    ];
}
var_dump($entries);
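
The XPath part of the script can be tried out without PhantomJS by feeding DOMDocument a fixed HTML string. The sketch below does that with made-up markup and class names. It also swaps in two techniques that the script above does not use: a class-token match (hasClass() is a hypothetical helper) so that an element with several classes still matches, and libxml_use_internal_errors() in place of the "@" operator.

```php
<?php
// Sample markup standing in for a fetched page; the class names are made up.
$html = <<<'HTML'
<ul>
  <li class="item featured">
    <h2 class="title"><a href="/a">Widget A</a></h2>
    <span class="price">1,200</span>
  </li>
  <li class="item">
    <h2 class="title"><a href="/b">Widget B</a></h2>
    <span class="price">980</span>
  </li>
</ul>
HTML;

// Hypothetical helper: XPath predicate that matches one class token among several.
function hasClass($name)
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
}

// Collect parser warnings internally instead of silencing them with "@".
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$entries = array();
foreach ($xpath->query('//li[' . hasClass('item') . ']') as $node) {
    $entries[] = array(
        'title' => $xpath->evaluate('string(.//h2[' . hasClass('title') . ']/a)', $node),
        'price' => $xpath->evaluate('string(.//span[' . hasClass('price') . '][1])', $node),
    );
}
var_dump($entries);
```

With an exact comparison such as //li[@class="item"], the first list item would be skipped because its class attribute is "item featured". The token match picks up both.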



  • A warning is raised when the script runs

Declaration of JonnyW\..\ServiceContainer::load() should be compatible with Symfony\..\Container::load($file) 

Resolved by bringing ServiceContainer.php up to date with the change in "Update ServiceContainer.php #217".
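
For background, this kind of message appears when a child class overrides a method with a parameter list that does not match the parent's. The sketch below uses hypothetical stand-in classes (BaseContainer and AppContainer), not the actual library code, and shows one signature that stays compatible. It is not a statement of what #217 changed.

```php
<?php
// Hypothetical stand-in for the parent class named in the warning.
class BaseContainer
{
    public function load($file)
    {
        return 'base:' . $file;
    }
}

// Hypothetical stand-in for the child class named in the warning.
class AppContainer extends BaseContainer
{
    // Declaring load() without any parameter would trigger an
    // incompatible-declaration message on PHP 5.6 and 7.x, and a fatal
    // error on PHP 8. Keeping $file with a default value stays compatible.
    public function load($file = null)
    {
        return 'app:' . ($file === null ? 'default' : $file);
    }
}

$container = new AppContainer();
echo $container->load(), "\n";
echo $container->load('services.yml'), "\n";
```

Giving the parameter a default keeps existing calls without arguments working while matching the parent's declaration.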


Bonus (scraping with php-phantomjs, with Excel input and output)


  • Install PhpSpreadsheet

$ composer require phpoffice/phpspreadsheet


  • Read and write files with PhpSpreadsheet


scrape-php-phantomjs-spreadsheet.php

<?php

require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;
use PhpOffice\PhpSpreadsheet\Reader\Xlsx as Reader;
use PhpOffice\PhpSpreadsheet\Writer\Xlsx as Writer;

// Load the workbook; the codes to look up are in column A of the first sheet.
$reader = new Reader();
$spreadsheet = $reader->load('example.xlsx');
$sheet0 = $spreadsheet->getSheet(0);
$sheet1 = $spreadsheet->getSheet(1);

$client = Client::getInstance();

// Process rows 2 and 3.
for ($i = 2; $i <= 3; $i++) {
    $code = $sheet0->getCell('A' . $i)->getValue();
    $url = 'http://www.example.com/?code=' . urlencode($code);

    // Fetch the page for this code through PhantomJS.
    $request = $client->getMessageFactory()->createRequest();
    $response = $client->getMessageFactory()->createResponse();
    $request->setUrl($url);
    $client->send($request, $response);

    $dom = new DOMDocument;
    @$dom->loadHTML($response->getContent());
    $xpath = new DOMXPath($dom);

    // Default to empty strings so a page without a match raises no notices.
    // "ID NAME", "CLASS NAME" and XXX are placeholders for the target page.
    $entries = [1 => '', 2 => '', 3 => ''];
    $q_1 = '//div[@id="ID NAME"]';
    foreach ($xpath->query($q_1) as $node) {
        $entries = [
            1 => $xpath->evaluate('string(.//div[@class="CLASS NAME"]/table/tbody/tr[XXX]/td[XXX])', $node),
            2 => $xpath->evaluate('string(.//div[@class="CLASS NAME"]/table/tbody/tr[XXX]/td[XXX])', $node),
            3 => $xpath->evaluate('string(.//table[XXX]/tbody/tr[XXX]/td[XXX])', $node),
        ];
    }

    // Write the results: columns B and C on the first sheet, D on the second.
    $sheet0->setCellValue('B' . $i, $entries[1]);
    $sheet0->setCellValue('C' . $i, $entries[2]);
    $sheet1->setCellValue('D' . $i, $entries[3]);
}

// Save the results over the input workbook.
$writer = new Writer($spreadsheet);
$writer->save('example.xlsx');
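
Two parts of this script can be exercised without PhantomJS or PhpSpreadsheet: building the lookup URL and extracting fields from a page. The sketch below isolates them. buildLookupUrl() and extractFields() are hypothetical helpers, and the markup and field name are made up. Unlike the loop above, which keeps the last matching node, extractFields() reads from the first match.

```php
<?php
// Hypothetical helper: build the lookup URL with an encoded query string.
function buildLookupUrl($baseUrl, $code)
{
    return rtrim($baseUrl, '/') . '/?' . http_build_query(array('code' => $code));
}

// Hypothetical helper: evaluate each XPath string expression against the first
// node matched by $rootQuery; every field falls back to '' when nothing matches.
function extractFields(DOMXPath $xpath, $rootQuery, array $fieldQueries)
{
    $fields = array_fill_keys(array_keys($fieldQueries), '');
    $root = $xpath->query($rootQuery)->item(0);
    if ($root !== null) {
        foreach ($fieldQueries as $key => $query) {
            $fields[$key] = $xpath->evaluate($query, $root);
        }
    }
    return $fields;
}

// Made-up markup standing in for a fetched page.
$dom = new DOMDocument();
$dom->loadHTML('<div id="detail"><table><tr><td>Name</td><td>Widget</td></tr></table></div>');
$xpath = new DOMXPath($dom);

$queries = array('name' => 'string(.//table/tr[1]/td[2])');
var_dump(buildLookupUrl('http://www.example.com', 'A 01'));
var_dump(extractFields($xpath, '//div[@id="detail"]', $queries));
var_dump(extractFields($xpath, '//div[@id="missing"]', $queries));
```

Returning empty strings for a missing block means the spreadsheet row is simply left blank when a code has no page, rather than the script stopping partway through the workbook.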



Sites I referred to

Introduction to scraping with PHP's native DOM

Installing composer on a Mac with homebrew

Developing in PHP without Composer is unthinkable! The basics

Headless browsing in PHP with PHP PhantomJS

GitHub - jonnnnyw/php-phantomjs: Execute PhantomJS commands through PHP

PHP PhantomJS

Scraping Amazon search results

PHPExcel is deprecated, so trying out its successor PhpSpreadsheet

GitHub - PHPOffice/PhpSpreadsheet: A pure PHP library for reading and writing spreadsheet files

PhpSpreadsheet Documentation