Asynchronous scraping with Puppeteer

Posted at 2018-10-06

Taking screenshots behaved oddly in my tests, so some care seems to be needed.
Without fullPage: true, some pages never responded and the process never finished.
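One way to keep a hanging call from blocking the whole run is to race it against a timer. The sketch below is only an illustration: withTimeout and its label parameter are hypothetical names, not part of Puppeteer, and the underlying operation is not cancelled when the timer wins.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// Note: this only stops waiting; it does not cancel the underlying work.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A screenshot call could then be wrapped as withTimeout(page.screenshot({path: fileName}), 30000, 'screenshot'), so that a page that never responds produces an error instead of stalling the script.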

Code

test.js
const puppeteer = require('puppeteer');

// Unhandled promise rejection
process.on('unhandledRejection', (error) => {
  console.error(error);
  process.exit(1);
});

const articleUrlList = [
  'https://qiita.com/horikeso/items/0bf9a78454b8124a6dfa',
  'https://qiita.com/horikeso/items/f87d3e703828aa13e2ff',
  'https://qiita.com/horikeso/items/ec34a8e3d6731a94f5f9',
  'https://qiita.com/horikeso/items/bb255eede8a051dfa785',
];

(async () => {
  try {
    const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--lang=ja,en-US;q=0.9,en;q=0.8',
        '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
      ]
    });
    let index = 0;
    const promiseList = [];
    articleUrlList.forEach(targetUrl => {
      promiseList.push((async (index) => {
        const page = await browser.newPage();
        page.setDefaultNavigationTimeout(30000);// default is 30000 ms; pass 0 to disable the timeout
        const response = await page.goto(targetUrl);
        await new Promise(resolve => setTimeout(resolve, 1000));// wait 1 second (plain timer; page.waitFor is gone from recent Puppeteer)

        if (response.status() !== 200) {
          await page.close();// release the tab before bailing out
          return [];
        }

        console.log(index);
        const fileName = index + '.png';
        await page.screenshot({path: fileName, fullPage: true});

        const result = await page.evaluate(() => {
          return [document.querySelector('meta[property="og:title"]').getAttribute('content')];
        });

        await page.close();

        return result;
      })(index));
      index++;
    });

    let articleTitleList = [];
    await Promise.all(promiseList).then(valueList => {
      valueList.forEach(value => {
        articleTitleList = articleTitleList.concat(value);
      });
    }).catch(reject => {
      throw reject;
    });

    console.log(articleTitleList);

    await browser.close();
    
  } catch (error) {
    throw error;
  }
})().catch((error) => {
  console.log(error);
  process.exit(1);
});

Notes

MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 Symbol(Connection.Events.Disconnected) listeners added. Use emitter.setMaxListeners() to increase limit

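If a warning like the one above appears, you will probably need to do one of two things.
Either raise the MaxListeners limit as shown below (only if the server has resources to spare, since this uses a fair amount of memory),
or reduce the number of Promises running concurrently, for example by splitting the work with chunk from Lodash or Underscore.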

require('events').EventEmitter.defaultMaxListeners = 15;// default is 10; raise it when more pages run concurrently
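The chunked alternative mentioned above can be sketched without any dependency. This is a minimal illustration, not the article's original code: chunk here is a small stand-in for Lodash's _.chunk, and runInChunks and worker are hypothetical names.

```javascript
// Split an array into consecutive groups of at most `size` items
// (a dependency-free stand-in for Lodash's _.chunk).
function chunk(list, size) {
  const chunks = [];
  for (let i = 0; i < list.length; i += size) {
    chunks.push(list.slice(i, i + size));
  }
  return chunks;
}

// Call `worker(item, index)` for every item, running at most `size`
// of them at the same time. Results come back in input order.
async function runInChunks(list, size, worker) {
  const results = [];
  let offset = 0;
  for (const group of chunk(list, size)) {
    // Each group runs in parallel; the next group starts only after
    // every promise in the current group has resolved.
    const groupResults = await Promise.all(
      group.map((item, i) => worker(item, offset + i))
    );
    results.push(...groupResults);
    offset += group.length;
  }
  return results;
}
```

In test.js, the worker would be the async function that opens a page for targetUrl, and a size of 2 or 3 would keep the number of open tabs, and therefore listeners, low. The trade-off is that each group waits for its slowest page before the next group starts.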

Some pages never return from goto

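When fetching many different kinds of pages, a brute-force approach that tries every waitUntil option may also work.

If some pages cannot be fetched properly, try the options one after another.
The timeout needs tuning.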

const optionList = [
    {waitUntil: 'load'},
    {waitUntil: 'domcontentloaded'},
    {waitUntil: 'networkidle0'},
    {waitUntil: 'networkidle2'}
];

let response = null;
for (let optionIndex = 0; optionIndex < optionList.length; optionIndex++) {
    if (response) {
        break;
    }
    response = await page.goto(targetUrl, optionList[optionIndex]).catch(error => {
        if (optionIndex === optionList.length - 1) {
            throw error;
        }
    });
}

// Fragment: assumes an enclosing loop over the target URLs.
if (!response || response.status() !== 200) {
    continue;
}
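The same fallback can be packaged as a helper so it can be called from inside each page task. This is a sketch under assumptions: gotoWithFallback and waitUntilOptions are hypothetical names, page is any object with a Puppeteer-style goto(url, options) method, and unlike the fragment above it rethrows the most recent error when no option yields a response.

```javascript
// The waitUntil values to try, in order.
const waitUntilOptions = ['load', 'domcontentloaded', 'networkidle0', 'networkidle2'];

// Try page.goto with each waitUntil value until one returns a response.
async function gotoWithFallback(page, url, timeout = 30000) {
  let lastError = null;
  for (const waitUntil of waitUntilOptions) {
    try {
      const response = await page.goto(url, { waitUntil, timeout });
      if (response) {
        return response;
      }
    } catch (error) {
      // Remember the failure and move on to the next option.
      lastError = error;
    }
  }
  if (lastError) {
    throw lastError;
  }
  // Every attempt resolved without a response object.
  return null;
}
```

Because each failed attempt waits for the full timeout, the worst case is four times the timeout per URL, which is one reason the timeout value needs tuning.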

Output

node test.js
3
0
2
1
[ 'CentOS7でPuppeteerを使う - Qiita',
  'Puppeteerのevaluateに引数 - Qiita',
  'Puppeteerでステータスコード - Qiita',
  '非同期処理を含むループを同期処理 - Qiita' ]

The index values are not printed in order, so the pages do appear to be processed asynchronously.
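The index values are logged as each task finishes, while the title array stays in input order because Promise.all returns results in the order of its input. A minimal sketch with timers standing in for page loads (fakeFetch, demo, and the delay values are hypothetical):

```javascript
// Simulate a page load that takes `delayMs` and record when it finishes.
function fakeFetch(index, delayMs, completionOrder) {
  return new Promise(resolve => {
    setTimeout(() => {
      completionOrder.push(index);
      resolve(`title-${index}`);
    }, delayMs);
  });
}

async function demo() {
  const completionOrder = [];
  const delays = [60, 120, 90, 30];
  // All four timers start immediately and run concurrently.
  const titles = await Promise.all(
    delays.map((delayMs, index) => fakeFetch(index, delayMs, completionOrder))
  );
  return { completionOrder, titles };
}

demo().then(({ completionOrder, titles }) => {
  console.log(completionOrder); // order in which the tasks finished
  console.log(titles);          // always in input order
});
```

The task with the shortest delay finishes first, yet its title is still last in the array, which mirrors the output above.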

Generated screenshots

They are long...

0.png

1.png

2.png

3.png
