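Introduction
While scraping with Puppeteer in a Docker container, I ran into the error "Possible EventEmitter memory leak detected."
Calling emitter.setMaxListeners() did not make it go away, which left me stuck, but following the advice in an issue about this problem resolved it, so I am recording the fix here for future reference.
The sample code for this article is on GitHub, so refer to it as needed.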
Environment
The TypeScript + Puppeteer program runs on Node.js inside Docker (Docker Desktop) on Windows 10 Pro.
$ docker.exe --version
Docker version 19.03.2, build 6a30dfc
$ docker-compose.exe --version
docker-compose version 1.24.1, build 4667896b
$ docker-compose.exe run scraping cat //etc/issue
Debian GNU/Linux 9 \n \l
$ docker-compose.exe run scraping node --version
v10.15.3
{
    "dependencies": {
        "@types/node": "^12.7.8",
        "@types/puppeteer": "^1.19.1",
        "puppeteer": "^1.20.0",
        "ts-node": "^8.4.1",
        "tsc": "^1.20150623.0",
        "typescript": "^3.6.3"
    }
}
Problematic code
import puppeteer from 'puppeteer';

(async () => {
    const urlList = [
        "https://qiita.com/nobodytolove123/items/5fbb35d3a036989acc04",
        "https://qiita.com/nobodytolove123/items/895463907df00aba912f",
        "https://qiita.com/nobodytolove123/items/112562699f8ac8d36937"
    ];
    let browser: puppeteer.Browser;
    try {
        browser = await puppeteer.launch({
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox'
            ]
        }).catch(e => { throw(e) });
        const page: puppeteer.Page = await browser.newPage();
        // Note: forEach does not wait for an async callback, so control reaches
        // the finally block (and browser.close()) while page.goto is still pending.
        urlList.forEach(async (url) => {
            await page.goto(url, { waitUntil: "domcontentloaded" }).catch(e => {
                throw (e.message);
            });
            console.log(await page.title());
        });
    } catch(e) {
        throw (e.message);
    } finally {
        if (browser) {
            browser.close();
        }
    }
})();
Error
$ docker-compose run scraping
(node:1) UnhandledPromiseRejectionWarning: Navigation failed because browser has disconnected!
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:1) UnhandledPromiseRejectionWarning: Navigation failed because browser has disconnected!
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:1) UnhandledPromiseRejectionWarning: Navigation failed because browser has disconnected!
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
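One likely reason for the "Navigation failed because browser has disconnected!" messages is that Array.prototype.forEach does not wait for an async callback, so the surrounding function can close the browser while the navigations are still in flight. The sketch below reproduces that timing without Puppeteer. FakeBrowser, visit, sampleUrls, sleep, withForEach, and withForOf are hypothetical names made up for illustration and are not part of any library.

```typescript
// Hypothetical stand-in for a browser; this is NOT the Puppeteer API.
class FakeBrowser {
    private isClosed = false;

    // Simulates a navigation that takes 10 ms and fails if the browser was closed meanwhile.
    async visit(url: string): Promise<string> {
        await sleep(10);
        if (this.isClosed) {
            throw new Error(`browser closed before ${url} finished`);
        }
        return `visited ${url}`;
    }

    close(): void {
        this.isClosed = true;
    }
}

const sleep = (ms: number): Promise<void> =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

// Placeholder URLs, assumed for the example only.
const sampleUrls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"];

// forEach discards the promises returned by the async callback, so the
// finally block runs while every visit is still pending.
async function withForEach(): Promise<string[]> {
    const browser = new FakeBrowser();
    const failures: string[] = [];
    try {
        sampleUrls.forEach(async (url) => {
            try {
                await browser.visit(url);
            } catch (e) {
                failures.push(e instanceof Error ? e.message : String(e));
            }
        });
    } finally {
        browser.close();
    }
    await sleep(50); // only here so the sketch can observe the late failures
    return failures;
}

// for...of with await finishes each visit before starting the next one,
// so close() runs only after the last visit has settled.
async function withForOf(): Promise<string[]> {
    const browser = new FakeBrowser();
    const results: string[] = [];
    try {
        for (const url of sampleUrls) {
            results.push(await browser.visit(url));
        }
    } finally {
        browser.close();
    }
    return results;
}

withForEach().then((failures) => console.log("forEach failures:", failures.length));
withForOf().then((results) => console.log("for...of results:", results));
```

In this sketch every visit started from forEach fails, while the for...of version completes all of them before the browser is closed.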
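Fix
To resolve the error above I consulted the issue, where one reply suggested that when page.goto is called several times in a row, browser.close() should be called each time.
So I define a job function named work that calls browser.close() after every goto, and have work call itself recursively.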
import puppeteer from 'puppeteer';

(async () => {
    const urlList = [
        "https://qiita.com/nobodytolove123/items/5fbb35d3a036989acc04",
        "https://qiita.com/nobodytolove123/items/895463907df00aba912f",
        "https://qiita.com/nobodytolove123/items/112562699f8ac8d36937",
    ];

    // Launches a fresh browser for one URL, closes it, then moves on to the next URL.
    const work = async (url: string): Promise<void> => {
        let browser: puppeteer.Browser | undefined;
        try {
            browser = await puppeteer.launch({
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox'
                ]
            });
            const page: puppeteer.Page = await browser.newPage();
            await page.goto(url, { waitUntil: "domcontentloaded" });
            console.log(await page.title());
        } finally {
            if (browser) {
                // Wait for the browser to shut down before launching the next one.
                await browser.close();
            }
            if (urlList.length) {
                // Awaited so that errors from later URLs reach the caller's catch.
                await work(urlList.shift()!);
            }
        }
    };

    try {
        // Without await, a rejection from work() would bypass this catch.
        await work(urlList.shift()!);
    } catch (e) {
        console.log(e);
        process.exit(-1);
    }
})();
$ docker-compose.exe run scraping
最近傍点の抽出 - Qiita
Babelでnamespace、moduleをトランスパイルする - Qiita
Docker + Laravel 学習メモ - Qiita
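The same idea of a fresh browser per URL, closed before the next one starts, can also be written as a plain loop instead of recursion. The following is a minimal sketch under that assumption, using a fake resource so that it runs without Puppeteer. forEachWithFreshResource, FakeSession, and demo are hypothetical names introduced here and do not come from the original sample.

```typescript
// Hypothetical helper: for each item, open a fresh resource, use it,
// and always release it before moving on to the next item.
async function forEachWithFreshResource<R, T>(
    items: readonly string[],
    open: () => Promise<R>,
    use: (resource: R, item: string) => Promise<T>,
    close: (resource: R) => Promise<void>
): Promise<T[]> {
    const results: T[] = [];
    for (const item of items) {
        const resource = await open();
        try {
            results.push(await use(resource, item));
        } finally {
            // Runs whether use() succeeded or threw.
            await close(resource);
        }
    }
    return results;
}

// Fake resource used only to make the sketch runnable.
interface FakeSession {
    id: number;
}

async function demo(): Promise<{ results: string[]; openCount: number; closeCount: number }> {
    let openCount = 0;
    let closeCount = 0;
    const results = await forEachWithFreshResource<FakeSession, string>(
        ["a", "b", "c"],
        async () => ({ id: ++openCount }),
        async (session, item) => `${item} via session ${session.id}`,
        async () => {
            closeCount += 1;
        }
    );
    return { results, openCount, closeCount };
}

demo().then((summary) => console.log(summary));
```

With Puppeteer, open would presumably wrap puppeteer.launch(...), use would call browser.newPage() followed by page.goto(url), and close would call browser.close(). A loop keeps the call stack flat and makes the order of launch, navigate, and close explicit.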
Splitting into classes
The rest is a bonus: to tidy up the source, I split it into classes.
// main.ts
import CrewlWorker from './crewl_worker';

(new CrewlWorker()).run();
// crewl_worker.ts
import Puppeteer from './puppeteer';

export default class CrewlWorker {
    urls: string[] = [];

    public async run(): Promise<void> {
        this.urls = await this.getWorkUrls();
        await this.crawlPage(this.urls.shift()!);
    }

    // Returns the URLs to crawl; async so a real source could replace the fixed list.
    public async getWorkUrls(): Promise<string[]> {
        return [
            "https://qiita.com/nobodytolove123/items/5fbb35d3a036989acc04",
            "https://qiita.com/nobodytolove123/items/895463907df00aba912f",
            "https://qiita.com/nobodytolove123/items/112562699f8ac8d36937",
        ];
    }

    // Crawls one URL with a fresh browser, closes it, then recurses into the next URL.
    async crawlPage(url: string): Promise<void> {
        let pup: Puppeteer | undefined;
        try {
            pup = await new Puppeteer().initialize();
            await pup.page.goto(url, { waitUntil: "domcontentloaded" });
            console.log(await pup.page.title());
        } finally {
            if (pup) {
                // Wait for the browser to shut down before launching the next one.
                await pup.browser.close();
            }
            if (this.urls.length) {
                await this.crawlPage(this.urls.shift()!);
            } else {
                process.exit(0);
            }
        }
    }
}
// puppeteer.ts
import puppeteer from 'puppeteer';

export default class Puppeteer {
    // Both fields are assigned in initialize().
    public browser!: puppeteer.Browser;
    public page!: puppeteer.Page;
    private launchArg: puppeteer.LaunchOptions = {
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox'
        ]
    };

    // Launches the browser and opens one page, then resolves to this instance.
    // An async method already returns a promise, so no new Promise wrapper is needed.
    public async initialize(): Promise<this> {
        this.browser = await puppeteer.launch(this.launchArg);
        this.page = await this.browser.newPage();
        return this;
    }
}
$ docker-compose.exe run scraping ./node_modules/.bin/ts-node main.ts
最近傍点の抽出 - Qiita
Babelでnamespace、moduleをトランスパイルする - Qiita
Docker + Laravel 学習メモ - Qiita
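Conclusion
This time I scraped pages with Puppeteer while working around the error.
That said, I am a beginner with both Puppeteer and TypeScript, so if you spot anything that should be corrected, please leave a comment.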