Building a scraping API step by step with Puppeteer + Google Cloud Functions + TypeScript!

Posted at 2020-08-08

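Introduction

As the title says, this article builds, step by step, a Web API with this stack that accepts some parameters and responds with information fetched from a page on another website.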

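STEP0: Why Puppeteer?

Sites built as SPAs have become common, and the HTML that is loaded first often does not contain the information in an easily extractable form. Sometimes it is not there at all.
Take the Qiita top page: even if you fetch this content, pulling information out of it looks difficult.

→ view-source:https://qiita.com/ (paste this into the address bar of Chrome, Firefox, etc.)

Puppeteer gives you the DOM as finally rendered by JavaScript, so if you know the familiar browser JS APIs, collecting information or performing arbitrary operations on a website becomes straightforward.
That is the reason for using Puppeteer.

Apparently Puppeteer also works on Google Cloud Functions (GCF) without any special setup. Thank you, Puppeteer.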

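Watch out for the Puppeteer version!

To be blunt right from the start: be careful about which Puppeteer version you install.

I first used 5.0.0, but the contents of node_modules/puppeteer/.local_chromium were not extracted correctly, and I worked around it by manually unzipping the ZIP file in the nearby folder into the expected folder.
I hoped it was only a problem on my local machine, but the same problem occurred after pushing to GCF, and my time melted away.

With version 2.1.1, it works without problems both locally and on GCF.
Unless you have a specific reason not to, I recommend using this version.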

STEP1: Create the scraping function locally

As an example, let's build an API that fetches the recent posts for an arbitrary Qiita tag.

Run the following commands:


mkdir scrape-qiita-tags && cd scrape-qiita-tags && mkdir src
npm init -y
npm i puppeteer@2.1.1 typescript
npm i --save-dev @types/express @types/puppeteer @google-cloud/functions-framework ts-node

Preparation

Create the following files.

tsconfig.json

{
  "compilerOptions": {
    "sourceMap": true,
    "target": "es2017",
    "module": "commonjs",
    "lib": ["dom", "es2017"],
    "outDir": "./dist",
    "rootDir": "./src"
  }
}

↑ The idea is that TS files live under the src folder and the compiled files are written to the dist folder.

.gitignore

dist
node_modules

.gcloudignore

# This file specifies files that are *not* uploaded to Google Cloud Platform
# using gcloud. It follows the same syntax as .gitignore, with the addition of
# "#!include" directives (which insert the entries of the given .gitignore-style
# file at that point).
#
# For more information, run:
#   $ gcloud topic gcloudignore
#
.gcloudignore
# If you would like to upload your .git directory, .gitignore file or files
# from your .gitignore file, remove the corresponding line
# below:
.git
.gitignore

node_modules
#!include:.gitignore

LICENSE
README.md

!dist
src

The last two lines of the .gcloudignore file turn out to be important. The reason is explained later.

Now for the coding.

src/getPagesByTag.ts

import { launch } from 'puppeteer';

type QittaResponse = {
  pages: QiitaPage[];
  hasNextpage: boolean;
};

type QiitaPage = {
  title: string;
  url: string;
  lgtm: number;
  postedAt: string;
};

const getPagesByTag = async (tagname: string): Promise<QittaResponse> => {
  const browser = await launch({
    headless: true,
    defaultViewport: {
      width: 1280,
      height: 882
    },
    // Options intended to speed things up. I have not measured how much they actually help
    args: [
      '--no-sandbox',
      '--disable-canvas-aa',
      '--disable-2d-canvas-clip-aa',
      '--disable-gl-drawing-for-tests',
      '--use-gl=swiftshader',
      '--enable-webgl',
      '--hide-scrollbars',
      '--mute-audio',
      '--no-first-run',
      '--disable-infobars',
      '--disable-breakpad',
      '--window-size=1280,882',
      '--disable-setuid-sandbox'
    ]
  });
  const page = await browser.newPage();
  await page.setUserAgent('bot');
  // Uncomment when you want console.log debugging from the page
  // page.on('console', msg => console.log(msg.text()));

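  const res = await page.goto(buildPageURL(tagname));
  if (!res || !res.ok()) {
    // Close the browser before bailing out so the Chromium process is not leaked
    await browser.close();
    return { pages: [], hasNextpage: false };
  }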

  await page.waitFor(1000);
  const pages = await page.evaluate(getPages);
  const { hasNextpage } = await page.evaluate(getPageInfo);
  const ret: QittaResponse = { pages, hasNextpage };

  await browser.close();

  return ret;
};

const buildPageURL = (tagname: string): string => {
  return `https://qiita.com/tags/${tagname}`;
};

const getPages = async (): Promise<QiitaPage[]> => {
  // The function passed to page.evaluate can access the familiar browser objects such as document and window
  // Extract the information we need from those objects

  // Selector for the articles under "Recent posts"
  const targetSelector = '[class^=TagNewestItemList__TagNewestItemListContainer] [class^=ItemListArticleWithAvatar__Item-sc]';

  return Array.from(document.querySelectorAll(targetSelector)).map((e: Element) => {
    const titleElement = e.querySelector('[class^=ItemListArticleWithAvatar__ItemBodyTitle-sc]') as HTMLAnchorElement;
    const title = titleElement.text;
    const url = titleElement.href;
    const lgtmElement = e.querySelector('[class^=ItemListArticleWithAvatar__LgtmCount-sc]') as HTMLElement;
    const lgtm = Number(lgtmElement.textContent);
    const postedAt = (e.querySelector('[class^=ItemListArticleWithAvatar__Timestamp-sc]') as HTMLElement).textContent;

    return { title, url, lgtm, postedAt };
  });
};

const getPageInfo = async (): Promise<{ hasNextpage: boolean }> => {
  const pagerElement = document.querySelector('ul.st-Pager > .st-Pager_next');
  const hasNextpage = !!pagerElement;
  return { hasNextpage };
};

export default getPagesByTag;
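
One thing the function above leaves open is what happens when page.evaluate throws: browser.close() is then never reached and the Chromium process stays alive. Below is a minimal sketch of one way to guard against that. The helper name withResource is hypothetical (it is not part of Puppeteer), and a fake object stands in for the browser so that the snippet runs on its own.

```typescript
// Hypothetical helper: acquire a resource, run a task, and always release it.
// With Puppeteer, `acquire` would be `() => launch(...)` and `release` would be
// `(browser) => browser.close()`.
async function withResource<R, T>(
  acquire: () => Promise<R>,
  release: (resource: R) => Promise<void>,
  task: (resource: R) => Promise<T>
): Promise<T> {
  const resource = await acquire();
  try {
    return await task(resource);
  } finally {
    // Runs on success and on failure alike
    await release(resource);
  }
}

// A stand-in for the browser, used only to keep the sketch self-contained
type FakeBrowser = { closed: boolean };

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const fake: FakeBrowser = { closed: false };
  try {
    await withResource(
      async () => fake,
      async (b) => { b.closed = true; log.push('closed'); },
      async () => { throw new Error('scrape failed'); }
    );
  } catch (e) {
    log.push('caught');
  }
  return log;
}
```

Calling demo() resolves to ['closed', 'caught']: the release step runs before the error reaches the caller.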

Write a script for a quick check and see whether you get the expected result.

src/test.ts

import getPagesByTag from './getPagesByTag';

(async () => {
  // npx ts-node src/test.ts
  console.log(await getPagesByTag('puppeteer'));
})();
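
When trying other tags with this script, note that buildPageURL inserts the tag into the URL verbatim. A tag name containing a character such as # would be cut off, because everything after # is treated as a URL fragment. A minimal sketch of an escaping variant follows; the name buildEncodedPageURL is hypothetical and not part of the original code.

```typescript
// Hypothetical variant of buildPageURL that escapes the tag before interpolating it
const buildEncodedPageURL = (tagname: string): string => {
  const trimmed = tagname.trim();
  if (trimmed === '') {
    throw new Error('tagname must not be empty');
  }
  // encodeURIComponent escapes characters such as '#', '+', '/' and non-ASCII text
  return `https://qiita.com/tags/${encodeURIComponent(trimmed)}`;
};

console.log(buildEncodedPageURL('puppeteer')); // https://qiita.com/tags/puppeteer
console.log(buildEncodedPageURL('C#'));        // https://qiita.com/tags/C%23
```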

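STEP2: Create the glue between GCF and the scraping function

The GCF Node.js runtime uses Express as its HTTP framework (see "HTTP frameworks" in the GCF documentation), so the Express request and response types can be used.

In this case, the name of the method exported from src/index.ts becomes the GCF function name as-is. Give it an easy-to-understand name.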

src/index.ts

import { Request, Response } from 'express';
import getPagesByTag from './getPagesByTag';

export const getQiitaPagesByTagPage = (req: Request, res: Response) => {
  const { tagname } = req.query;

  if (!tagname) {
    return res.status(400).send('tagname is not present');
  }

  getPagesByTag(tagname.toString())
    .then((result) => {
      res.status(200)
         .type('application/json')
         .send(result);
    })
    .catch((err) => {
      // A synchronous try/catch cannot see a rejected promise, so respond with 500 here
      console.error(err);
      res.status(500).send(String(err));
    });
};
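
A query value in Express is not always a plain string: with the default query parser, ?tagname=a&tagname=b arrives as an array and ?tagname[x]=1 as an object, and calling toString() on those gives "a,b" or "[object Object]". The sketch below shows one possible way to normalize the value before passing it on. The helper name pickTagname is hypothetical, and the parameter is typed as unknown so that the snippet does not depend on the Express typings.

```typescript
// Hypothetical helper: reduce an Express-style query value to a single usable string
const pickTagname = (value: unknown): string | undefined => {
  // ?tagname=a&tagname=b arrives as an array; take the first entry
  const first = Array.isArray(value) ? value[0] : value;
  if (typeof first !== 'string') {
    return undefined; // missing, or a nested object such as ?tagname[x]=1
  }
  const trimmed = first.trim();
  return trimmed === '' ? undefined : trimmed;
};

console.log(pickTagname('puppeteer'));    // puppeteer
console.log(pickTagname(['go', 'rust'])); // go
console.log(pickTagname({ x: '1' }));     // undefined
console.log(pickTagname(undefined));      // undefined
```

In the handler, the 400 response would then be returned whenever pickTagname(req.query.tagname) is undefined.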

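Verify locally with `functions-framework`

functions-framework is a library for running and checking functions locally.

We have been developing in TypeScript, but the GCF runtime cannot execute TS directly, so we compile to JS. The output is generated under the dist folder, which raises the following question:

Which file actually gets deployed as the function?

The answer is the file at the path specified in main in package.json. This information was hard to find, and my time melted away again.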

package.json


{
  "name": "scrape-qiita-tags",
  "version": "1.0.0",
  "description": "",
-  "main": "index.js",
+  "main": "dist/index.js",
...
}
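
To see why the main field matters, here is a simplified model of how Node picks a file when a package directory is loaded: the main field is tried first, then index.js as a fallback. This is only an illustration under that assumption, not code from GCF or Node.js, and the names resolveEntry and uploaded are hypothetical.

```typescript
// Simplified model of entry-file selection for a package directory
type PackageJson = { main?: string };

// Treat './dist/index.js' and 'dist/index.js' as the same path
const normalize = (p: string): string => p.replace(/^\.\//, '');

const resolveEntry = (pkg: PackageJson, existingFiles: string[]): string | undefined => {
  const files = existingFiles.map(normalize);
  const candidates = pkg.main ? [normalize(pkg.main), 'index.js'] : ['index.js'];
  for (const candidate of candidates) {
    if (files.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return undefined; // nothing loadable: the function cannot start
};

// After compiling, the uploaded package only contains compiled files under dist/
const uploaded = ['package.json', 'dist/index.js', 'dist/getPagesByTag.js'];

console.log(resolveEntry({ main: 'index.js' }, uploaded));      // undefined
console.log(resolveEntry({ main: 'dist/index.js' }, uploaded)); // dist/index.js
```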

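Verify that it works with the following command (npx is used because typescript and functions-framework were installed into the project, not globally):

npx tsc && npx functions-framework --target=getQiitaPagesByTagPage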
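
Once the framework is running, the function can be called over HTTP. The sketch below builds the request URL, assuming the default functions-framework port of 8080 and the function being served at the root path. The helper name buildRequestURL is hypothetical.

```typescript
// Hypothetical helper: build the URL used to call the function.
// `baseURL` is an assumption: http://localhost:8080 matches the default port of
// functions-framework; a deployed function has its own URL.
const buildRequestURL = (baseURL: string, tagname: string): string => {
  const base = baseURL.replace(/\/+$/, ''); // drop trailing slashes
  return `${base}/?tagname=${encodeURIComponent(tagname)}`;
};

console.log(buildRequestURL('http://localhost:8080', 'puppeteer'));
// http://localhost:8080/?tagname=puppeteer
```

Opening that URL in a browser should return JSON with the pages and hasNextpage fields defined in src/getPagesByTag.ts.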

STEP3: Deploy

This time we deploy with the gcloud command.
Install gcloud first.

Deploy with the following command:

gcloud functions deploy getQiitaPagesByTagPage --runtime nodejs10 --trigger-http --memory=2048MB --timeout=120s

Here the function is created by pushing local files, so we do not want to push unnecessary files. Files listed in .gcloudignore are conveniently skipped, but there is a catch.

`.gcloudignore` and `.gitignore`

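Writing #!include:.gitignore in .gcloudignore makes it inherit the rules from .gitignore.
That is fine in itself, but the folder holding the compiled JS (dist) is usually listed in .gitignore, so that rule has to be negated.
Writing !dist makes dist part of what is pushed to GCF again.
The src folder does not need to be pushed, so add it to .gcloudignore.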
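
The interaction between the inherited rules and the last two lines can be sketched with a simplified model of gitignore-style matching: rules are checked in order, the last matching rule decides, and a leading ! re-includes a path. The function isIgnored below is hypothetical and only handles top-level names; the real gcloud matching handles far more.

```typescript
// Simplified model: the last rule that matches the top-level name wins
const isIgnored = (path: string, rules: string[]): boolean => {
  const topLevel = path.split('/')[0];
  let ignored = false;
  for (const rule of rules) {
    const negated = rule.charAt(0) === '!';
    const target = negated ? rule.slice(1) : rule;
    if (target === topLevel) {
      ignored = !negated;
    }
  }
  return ignored;
};

// Effective order: entries pulled in from .gitignore first, then the last two lines
const rules = ['dist', 'node_modules', '!dist', 'src'];

console.log(isIgnored('dist/index.js', rules));       // false
console.log(isIgnored('src/index.ts', rules));        // true
console.log(isIgnored('node_modules/x/y.js', rules)); // true
console.log(isIgnored('package.json', rules));        // false
```

Without the !dist line, dist/index.js would be ignored as well and the compiled code would never reach GCF.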

If you get this setting wrong, you end up in the nasty pattern where everything works locally but fails in production.
My time mel

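Also, billing must be set up in order to use GCF.

Conclusion

There were more pitfalls than expected, but with this architecture I feel most sites can be scraped and operated.
If there is a site you visit regularly, why not take this opportunity to turn it into an API?
Also, check that the site does not prohibit scraping, and run it only to an extent that does not cause trouble.

Aside
It looks like this could have been fetched from the Qiita API...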
