If you're scraping with Node.js, this is the one to go with (probably)

Posted at 2018-09-29

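Introduction

As I described in the article "Introducing the Aozora Bunko API server" (青空文庫APIサーバーのご紹介), I have been building an API server for retrieving information from Aozora Bunko. I do belong to a group called Aozorahack, but I don't have permission to access Aozora Bunko's DB directly, so the information the server provides comes from CSV files and from scraping the Aozora Bunko site.

For scraping the web, something like Scrapy is the standard choice, and as a Python user it feels like the only option. But having built the API server with Node + Koa.js, I rode that momentum and wrote the scraper in Node as well. After a few twists, turns, and rewrites, I arrived at a personal conclusion of "if you're scraping with Node, this is probably the way to go," so here is a short write-up.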

Scraping with Node.js

scraperjs

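The first thing I tried was a library called Scraperjs. You create a scraper by specifying the URL you want to fetch, then pass the scrape method a handler function that takes a jQuery object as its argument. As the chained then and catch suggest, each result comes back as a promise, so if you write scrapers for several sites they run concurrently (although it is the library's own promise implementation).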

const scraperjs = require('scraperjs');

scraperjs.StaticScraper.create('http://www.amazon.com')
  .scrape(($) => {
    return $('title').text();
  }).then((title) => {
    console.log(title);
  }).catch((error) => {
    console.error('Error:', error);
  });

This did almost everything I wanted, so I ran with it for a while, but recently running npm started to print a somewhat odd warning.

found 3 vulnerabilities (2 low, 1 high)
  run `npm audit fix` to fix them, or `npm audit` for details

For now, I did exactly what the message told me to.

$ npm audit fix

up to date in 0.555s
fixed 0 of 3 vulnerabilities in 110 scanned packages
  3 vulnerabilities required manual review and could not be updated

$ npm audit

                       === npm audit security report ===

┌──────────────────────────────────────────────────────────────────────────────┐
│                                Manual Review                                 │
│            Some vulnerabilities require your attention to resolve            │
│                                                                              │
│         Visit https://go.npm.me/audit-guide for additional guidance          │
└──────────────────────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ High          │ Cross-Site Scripting (XSS)                                   │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ jquery                                                       │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=3.0.0                                                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ scraperjs                                                    │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ scraperjs > jquery                                           │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://nodesecurity.io/advisories/328                       │
└───────────────┴──────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Low           │ Prototype Pollution                                          │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ lodash                                                       │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=4.17.5                                                     │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ scraperjs                                                    │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ scraperjs > cheerio > lodash                                 │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://nodesecurity.io/advisories/577                       │
└───────────────┴──────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Low           │ Insecure Entropy Source - Math.random()                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ node-uuid                                                    │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=1.4.4                                                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ scraperjs                                                    │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ scraperjs > phantom > shoe > sockjs > node-uuid              │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://nodesecurity.io/advisories/93                        │
└───────────────┴──────────────────────────────────────────────────────────────┘
found 3 vulnerabilities (2 low, 1 high) in 110 scanned packages
  3 vulnerabilities require manual review. See the full report for details.

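Apparently it is pointing out security vulnerabilities. After a little digging, it seems that starting with npm v6, warnings like this appear whenever you use libraries with known vulnerabilities. The vulnerabilities are in libraries that Scraperjs depends on, so I can't simply swap them out, and the repository was last updated two years ago, so there is little hope of an update. Continuing to use it might well be fine, but with a HIGH severity vulnerability in there it made me uneasy, so I decided to look for another solution.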

node-crawler

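What I found next was a library called node-crawler. With this one, you specify the callback function that does the processing when you create the Crawler object, and then push the URLs you want to scrape onto a queue. Once the page at a URL has been downloaded, that callback is called.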

const Crawler = require('crawler');

const crawler = new Crawler({
  callback : (error, res, done) => {
    if(error){
      console.log(error);
    } else {
      const $ = res.$;
      console.log($('title').text());
    }
    done();
  }
});

crawler.queue('http://www.amazon.com');
crawler.queue('http://www.google.com/');
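
The callback above is registered once, when the Crawler is created, away from the place where the URLs are queued. If I read the node-crawler README correctly, queue also accepts an options object with `uri` and a per-request `callback`, which would make it possible to wrap a single request in a Promise and handle the result right where the URL is queued. The following is only a sketch under that assumption: `queueAsPromise` is a hypothetical helper, and `stubCrawler` is a stand-in object (not part of the library) so that the snippet runs without node-crawler or network access.

```javascript
// Hypothetical helper: wraps one queued request in a Promise.
// Assumes `crawler.queue` accepts { uri, callback } for a per-request callback.
function queueAsPromise(crawler, uri) {
  return new Promise((resolve, reject) => {
    crawler.queue({
      uri,
      callback: (error, res, done) => {
        if (error) {
          reject(error);
        } else {
          resolve(res);
        }
        done(); // tell the crawler this request is finished
      },
    });
  });
}

// Stand-in for a real Crawler so the sketch runs without the library or a network.
const stubCrawler = {
  queue({ uri, callback }) {
    const res = { $: () => ({ text: () => `title of ${uri}` }) };
    callback(null, res, () => {});
  },
};

queueAsPromise(stubCrawler, 'http://www.example.com/')
  .then((res) => {
    console.log(res.$('title').text());
  })
  .catch((error) => {
    console.error('Error:', error);
  });
```

With the real library, `stubCrawler` would presumably be replaced by a Crawler instance created without a global callback.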

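Callbacks feel a bit old-fashioned, and because the context where you queue a URL is separate from the context where you handle the response, it is a little awkward to write. Still, I spent some time rewriting everything with it and thought the matter was settled, until I noticed that this one also triggers a vulnerability warning (sob).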

$ npm audit

                       === npm audit security report ===

┌──────────────────────────────────────────────────────────────────────────────┐
│                                Manual Review                                 │
│            Some vulnerabilities require your attention to resolve            │
│                                                                              │
│         Visit https://go.npm.me/audit-guide for additional guidance          │
└──────────────────────────────────────────────────────────────────────────────┘
┌───────────────┬──────────────────────────────────────────────────────────────┐
│ Low           │ Prototype Pollution                                          │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Package       │ lodash                                                       │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Patched in    │ >=4.17.5                                                     │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Dependency of │ crawler                                                      │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ Path          │ crawler > seenreq > ioredis > lodash                         │
├───────────────┼──────────────────────────────────────────────────────────────┤
│ More info     │ https://nodesecurity.io/advisories/577                       │
└───────────────┴──────────────────────────────────────────────────────────────┘
found 1 low severity vulnerability in 366 scanned packages
  1 vulnerability requires manual review. See the full report for details.

I should have noticed that earlier, but oh well. I wasn't that fond of it anyway, so off I went on the next leg of my wandering.

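Request-Promise + cheerio (the favorite)

Where I ended up is the combination of Request-Promise and cheerio. I've come back to something very, very simple, but in the end this feels like the easiest to use.

Usage is almost the same as with Scraperjs. You just add, as an option on the GET request, a step that converts the fetched HTML into a jQuery-style object.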


const rp = require('request-promise');
const cheerio = require('cheerio');

const options = {
  transform: (body) => {
    return cheerio.load(body);
  }
};

rp.get('http://www.amazon.com', options)
  .then(($) => {
    return $('title').text();
  }).then((title) => {
    console.log(title);
  }).catch((error) => {
    console.error('Error:', error);
  });

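The promise this returns is a standards-compliant one (strictly speaking, Request-Promise hands back a Bluebird promise rather than a native ES6 Promise, but it is Promises/A+ compliant), so you can wait for several operations to finish with Promise.all, or combine it with async/await. For example, like this.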


const rp = require('request-promise');
const cheerio = require('cheerio');

const options = {
  transform: (body) => {
    return cheerio.load(body);
  }
};

const urls = [
  'https://www.amazon.com',
  'https://www.google.com/',
];

// Start one request per URL; each async callback returns a promise.
const promises = urls.map(async (url) => {
  try {
    const $ = await rp.get(url, options);
    return $('title').text();
  } catch (error) {
    console.error('Error:', error);
  }
});
Promise.all(promises).then((result) => {
  console.log(result);
});
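
One thing to note about the example above: a URL that fails is only logged inside the catch block, so its slot in the result array ends up as undefined. If you would rather keep each failure together with its URL, a variation along these lines might work. This is only a sketch: `fetchTitle` is a hypothetical stand-in that fakes the request so the snippet runs without network access (in the real script it would be the `rp.get(url, options)` call followed by `$('title').text()`), and `scrapeAll` and the example.com URLs are likewise made up for illustration.

```javascript
// Hypothetical stand-in for `rp.get(url, options)` plus `$('title').text()`.
// It fakes the request so that this sketch runs without network access.
const fetchTitle = async (url) => {
  if (url.includes('broken')) {
    throw new Error(`request failed: ${url}`);
  }
  return `Title of ${url}`;
};

// Resolve every URL to { url, title } or { url, error } instead of undefined.
const scrapeAll = (urls, fetcher) => Promise.all(
  urls.map(async (url) => {
    try {
      return { url, title: await fetcher(url) };
    } catch (error) {
      return { url, error: error.message };
    }
  })
);

scrapeAll(['https://example.com/', 'https://example.com/broken'], fetchTitle)
  .then((results) => {
    console.log(results);
  });
```

Because every inner promise resolves (errors are turned into values), Promise.all never rejects here, and the caller can decide afterwards what to do with the entries that carry an `error` field.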

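So, for now, I think this is the best option.

Summary

One of Node.js's strengths is the rich set of libraries distributed on npmjs.com, but not all of them will keep being maintained. If anything, I suspect many get abandoned, and until now I hadn't paid much attention to whether the libraries I use (and the libraries they depend on) carry security risks. I think it is a very good move that npm now warns about this by default.

That summary has drifted completely off topic, but as far as scraping goes, I concluded that the combination of request-promise and cheerio seems to be the least hassle.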
