Qiita Teams that are logged in
You are not logged in to any team

Log in to Qiita Team
Community
OrganizationAdvent CalendarQiitadon (β)
Service
Qiita JobsQiita ZineQiita Blog
5
Help us understand the problem. What is going on with this article?
@ttrun5706

php-phantomjsでWebスクレイピング

More than 3 years have passed since last update.

環境・ツール

  • macOS High Sierra 10.13.2
  • PHP 7.1.7 -> PHP 5.6.35
  • Composer version 1.6.3

環境設定

  • PHPバージョン確認
$ php -v
PHP 7.1.7 (cli) (built: Jul 15 2017 18:08:09) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.1.0, Copyright (c) 1998-2017 Zend Technologies
  • PHP5.6インストール
$ brew install php56
  • PHP7.1からPHP5.6に切り替え
$ vim ~/.bash_profile 
~/.bash_profile
# php5.6 #
export PATH=/usr/local/Cellar/php\@5.6/5.6.35/bin:$PATH
$ source ~/.bash_profile 
$ php -v
PHP 5.6.35 (cli) (built: Mar 31 2018 20:21:31) 
Copyright (c) 1997-2016 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2016 Zend Technologies
    with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2016, by Zend Technologies
  • Composerインストール
$ brew install homebrew/php/composer
$ composer -v
   ______
  / ____/___  ____ ___  ____  ____  ________  _____
 / /   / __ \/ __ `__ \/ __ \/ __ \/ ___/ _ \/ ___/
/ /___/ /_/ / / / / / / /_/ / /_/ (__  )  __/ /
\____/\____/_/ /_/ /_/ .___/\____/____/\___/_/
                    /_/
Composer version 1.6.3 2018-01-31 16:28:17
$ composer init

↑init必要なかったか?

  • php-phantomjsインストール
composer.json
{
    "config": {
        "bin-dir": "bin"
    },
    "scripts": {
        "post-install-cmd": [
            "PhantomInstaller\\Installer::installPhantomJS"
        ],
        "post-update-cmd": [
            "PhantomInstaller\\Installer::installPhantomJS"
        ]
    }
}
$ composer require "jonnyw/php-phantomjs:4.*"

コーディング

scrape-php-phantomjs.php
<?php

require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;
use JonnyW\PhantomJs\DependencyInjection\ServiceContainer;

$client = Client::getInstance();
$request = $client->getMessageFactory()->createRequest();
$response = $client->getMessageFactory()->createResponse();

$url = 'URL';
$request->setUrl($url);
$client->send($request,$response);

$htmlstr = $response->getContent();
$dom = new DOMDocument;
@$dom->loadHTML($htmlstr);
$xpath = new DOMXPath($dom);
$entries = [];
$q_product = '//li[@class="CLASS NAME"]';
foreach ($xpath->query($q_product) as $node) {
        $entries[] = [
                'title' => $xpath->evaluate('string(.//h2[@class="CLASS NAME"]/a)',$node),
                'price' => $xpath->evaluate('string(.//span[@class="CLASS NAME"][1])',$node)
        ];
}
var_dump($entries);
?>
  •  スクリプト実行時Warning発生
Declaration of JonnyW\..\ServiceContainer::load() should be compatible with Symfony\..\Container::load($file) 

Update ServiceContainer.php #217でServiceContainer.phpを最新化して解決

おまけ(php-phantomjsでWebスクレイピングしてExcelで入出力)

  • PhpSpreadsheetインストール
$ composer require phpoffice/phpspreadsheet
  • PhpSPreadsheetでのファイル読み込み、ファイル書き込み
scrape-php-phantomjs-spreadsheet.php
<?php

require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;
use JonnyW\PhantomJs\DependencyInjection\ServiceContainer;

use PhpOffice\PhpSpreadsheet\Writer\Xlsx as Writer;
use PhpOffice\PhpSpreadsheet\Reader\Xlsx as Reader;

$reader = new Reader();
$spreadsheet = $reader->load('example.xlsx');

for($i=2;$i<=3;$i++){
        $sheet0 = $spreadsheet->getSheet(0);
        $cell = 'A'.$i;
        $code = $sheet0->getCell($cell)->getValue();
        $url = 'http://www.example.com?code='.$code;
        $client = Client::getInstance();
        $request = $client->getMessageFactory()->createRequest();
        $response = $client->getMessageFactory()->createResponse();
        $request->setUrl($url);
        $client->send($request,$response);
        $htmlstr = $response->getContent();
        $dom = new DOMDocument;
        @$dom->loadHTML($htmlstr);
        $xpath = new DOMXPath($dom);
        $entries = [];
        $q_1 = '//div[@id="ID NAME"]';
        foreach ($xpath->query($q_1) as $node) {
                $entries = [
                        '1' => $xpath->evaluate('string(.//div[@class="CLASS NAME"]/table/tbody/tr[XXX]/td[XXX])',$node),
                        '2' => $xpath->evaluate('string(.//div[@class="CLASS NAME"]/table/tbody/tr[XXX]/td[XXX])',$node),
                        '3' => $xpath->evaluate('string(.//table[XXX]/tbody/tr[XXX]/td[XXX])',$node),
              ];
        }
        $sheet0->setCellValue('B'.$i,$entries[1]);
        $sheet0->setCellValue('C'.$i,$entries[2]);
        $sheet1 = $spreadsheet->getSheet(1);
        $sheet1->setCellValue('D'.$i,$entries[3]);
}
$writer = new Writer($spreadsheet);
$writer->save('example.xlsx');
?>

参考にしたサイト

PHPネイティブのDOMによるスクレイピング入門
Macにhomebrewでcomposerをインストール
PHP開発でComposerを使わないなんてありえない!基礎編
PHP PhantomJS を使ってPHPでヘッドレスブラウジング
GitHub - jonnnnyw/php-phantomjs: Execute PhantomJS commands through PHP
PHP PhantomJS
Amazonの検索結果をスクレイピング
PHPExcelが非推奨になったので後継のPhpSpreadsheetを使ってみる
GitHub - PHPOffice/PhpSpreadsheet: A pure PHP library for reading and writing spreadsheet files
PhpSpreadsheet Documentation

5
Help us understand the problem. What is going on with this article?
Why not register and get more from Qiita?
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away

Comments

No comments
Sign up for free and join this conversation.
Sign Up
If you already have a Qiita account Login
5
Help us understand the problem. What is going on with this article?