A CDK Stack for Using Selenium 4 on an AL2-Based Lambda

Last updated at 2024-01-14. Posted at 2024-01-14.

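Overview

I wrote sample code that uses CDK to deploy a Selenium-based web crawling process to AWS Lambda on an Amazon Linux 2 base image, and this post introduces it. If you already work regularly with the technologies listed in this post's tags, you do not need to read the explanation; just take a look at the code.

The code I used as a reference is here. Many thanks to its author.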

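Background

There was an internal web app for which an API could not be developed right away. I needed to display a page and, depending on its content, click one of several buttons, so I decided to use Selenium.

This is not meant for large-scale processing, but I wanted to be able to extend it later so that it can be invoked via Step Functions or API Gateway. So the idea was: it would be nice to run it on Lambda, and nicer still to deploy it with CDK.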

CDK Stack

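This is a simple stack that deploys a single Lambda function.

Note, however, that the Lambda function is deployed as a container image. I wanted to describe and manage, in a Dockerfile, the steps that install the Chrome browser and the driver used by Selenium on Amazon Linux 2.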

test_stack.py

from aws_cdk import Duration, Stack
from aws_cdk import aws_lambda as _lambda
from constructs import Construct

LAMBDA_TIMEOUT = 120  # seconds


class SeleniumTestStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        """Stack definition."""
        super().__init__(scope, construct_id, **kwargs)

        # Lambda handler definition: build the container image from
        # local_debug.Dockerfile and use local_debug.handler as the entry point
        local_debug_lambda_handler = _lambda.DockerImageCode.from_image_asset(
            directory=".",
            cmd=["local_debug.handler"],
            file="local_debug.Dockerfile",
        )
        # Lambda function definition
        _lambda.DockerImageFunction(
            scope=self,
            id="SeleniumTest",
            code=local_debug_lambda_handler,
            timeout=Duration.seconds(LAMBDA_TIMEOUT),
            memory_size=1024,
            retry_attempts=0,
            description="sample lambda function for selenium test",
            environment={"APP_NAME": "web-crawler"},
        )

Dockerfile

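As mentioned above, the Dockerfile contains the steps that install the Chrome browser and the driver.

Using a container lets you manage dependencies, including the OS and the browser, as code, and it makes the build procedure idempotent. I recommend it for anything that is not throwaway code.

Python package dependencies are managed with Poetry.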

local_debug.Dockerfile

# Chrome Web Browser for Selenium Automate Development
FROM public.ecr.aws/lambda/python:3.10 as build
RUN yum install -y unzip && \
    curl -Lo "/tmp/chromedriver-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linux64.zip" && \
    curl -Lo "/tmp/chrome-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chrome-linux64.zip" && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/

FROM public.ecr.aws/lambda/python:3.10
RUN yum install -y atk cups-libs gtk3 libXcomposite alsa-lib \
    libXcursor libXdamage libXext libXi libXrandr libXScrnSaver \
    libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb \
    xorg-x11-xauth dbus-glib dbus-glib-devel nss mesa-libgbm
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/

# Install poetry
RUN curl -sSL https://install.python-poetry.org | POETRY_HOME=/opt/poetry python \
    && cd /usr/local/bin \
    && ln -s /opt/poetry/bin/poetry \
    && poetry config virtualenvs.create false

COPY src/ ${LAMBDA_TASK_ROOT}/
COPY pyproject.toml poetry.lock ${LAMBDA_TASK_ROOT}/
# Install python packages
RUN poetry install --no-root
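
The Dockerfile above pins the same Chrome for Testing version in two download URLs, and the browser and driver must stay on matching versions. As a minimal sketch (not part of the original sample), a small check like the following could confirm that every URL in the Dockerfile refers to one version. The function names are hypothetical and the regular expression assumes the URL layout shown above.

```python
import re

# Matches the version segment of a Chrome for Testing download URL, e.g.
# .../chrome-for-testing/120.0.6099.109/linux64/chrome-linux64.zip
CFT_VERSION_PATTERN = re.compile(r"chrome-for-testing/(\d+\.\d+\.\d+\.\d+)/")


def find_cft_versions(dockerfile_text: str) -> set[str]:
    """Return every distinct Chrome for Testing version referenced in the text."""
    return set(CFT_VERSION_PATTERN.findall(dockerfile_text))


def check_single_version(dockerfile_text: str) -> str:
    """Return the pinned version, or raise if the URLs disagree or none is found."""
    versions = find_cft_versions(dockerfile_text)
    if len(versions) != 1:
        raise ValueError(
            f"expected exactly one pinned version, found: {sorted(versions)}"
        )
    return versions.pop()
```

It could be run against the file contents, for example `check_single_version(open("local_debug.Dockerfile").read())`, before building the image.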

pyproject.toml

[tool.poetry]
name = "sample-crawler"
version = "0.1.0"
description = "web crawler"
authors = ["sample <124345558+xxxxxxx@users.noreply.github.com>"]
license = "MIT"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.10"
aws-cdk-lib = "^2.110.0"
boto3 = "^1.29.3"
selenium = "^4.16.0"
chromedriver-binary = "120.0.6099.109"

[tool.poetry.group.dev.dependencies]
pytest = "^7.4.3"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

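Key Points of the Code

The entry point when the Lambda function runs is src/local_debug.py:handler(). It searches Google for the keyword test and writes the title of the returned results page to CloudWatch Logs.

In other words, it is sample code that logs the result of a web crawl.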

src/libs/selenium.py

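from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC


class GoogleSearchSampleDriver(BaseWebDriver):
    """
    Sample WebDriver that runs a Google search
    """

    def test(self):
        # Open Google, type the keyword, and submit the search
        self.driver.get("https://www.google.com")
        query = self.wait.until(EC.visibility_of_element_located((By.NAME, "q")))
        query.send_keys("test")
        query.send_keys(Keys.ENTER)
        # Wait for the transition to the results page, whose window title
        # contains the search keyword
        self.wait.until(EC.title_contains("test"))
        # Return the title of the search results page
        return self.driver.title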
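
For orientation, the entry point that calls this class might look roughly like the sketch below. This is not the article's actual src/local_debug.py: the import path libs.selenium assumes that src/ is copied to the Lambda task root, and the crawler_factory parameter and the returned dictionary are hypothetical additions that make the handler easy to test without a browser.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def _default_crawler() -> Any:
    # Imported lazily so this module can be loaded without Selenium installed.
    # Assumption: src/ is copied to the task root, so the module is libs.selenium.
    from libs.selenium import GoogleSearchSampleDriver

    return GoogleSearchSampleDriver()


def handler(
    event: dict,
    context: Any,
    crawler_factory: Callable[[], Any] = _default_crawler,
) -> dict:
    """Lambda entry point: run the sample crawl and log the page title."""
    crawler = crawler_factory()
    title = crawler.test()
    # On Lambda, records written through the logging module go to CloudWatch Logs.
    logger.info("search result page title: %s", title)
    return {"title": title}
```

Lambda only passes event and context, so the default factory is used in production, while a test can inject a stub object that has a test() method.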

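Since Selenium performs the web crawling, the Selenium WebDriver instance is set up beforehand in the constructor of the BaseWebDriver class.

The important part of this setup is specifying the correct paths to the Chrome browser and the driver, as follows.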

src/libs/selenium.py

import os

CHROME_PATH = os.getenv("CHROME_PATH", "/opt/chrome/chrome")
CHROME_DRIVER_PATH = os.getenv("CHROME_DRIVER_PATH", "/opt/chromedriver")

The Dockerfile installs the browser and the driver at /opt/chrome/chrome and /opt/chromedriver respectively, which is why these paths are specified.

local_debug.Dockerfile

# Chrome Web Browser for Selenium Automate Development
FROM public.ecr.aws/lambda/python:3.10 as build
RUN yum install -y unzip && \
    curl -Lo "/tmp/chromedriver-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linux64.zip" && \
    curl -Lo "/tmp/chrome-linux64.zip" "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chrome-linux64.zip" && \
    unzip /tmp/chromedriver-linux64.zip -d /opt/ && \
    unzip /tmp/chrome-linux64.zip -d /opt/
...
COPY --from=build /opt/chrome-linux64 /opt/chrome
COPY --from=build /opt/chromedriver-linux64 /opt/
...

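Selenium in particular requires three components to be set up: the browser, the driver, and Selenium itself. That means you have to manage where the browser and driver come from, the build procedure, and the program code. Having all of this managed as code should make it easier to explain the setup to future maintainers and to develop with multiple people.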
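
To make the path wiring concrete, here is a minimal sketch of what a BaseWebDriver constructor could look like with the Selenium 4 API. It is an assumption, not the article's actual implementation: the resolve_paths helper, the wait_seconds parameter, and the list of Chrome flags (typical choices for headless Chrome in a container) are all illustrative.

```python
import os
from typing import Mapping

# Flags often used for headless Chrome in containers (assumed, not from the article).
DEFAULT_CHROME_ARGUMENTS = [
    "--headless=new",
    "--no-sandbox",
    "--disable-gpu",
    "--disable-dev-shm-usage",
    "--user-data-dir=/tmp/chrome-user-data",  # /tmp is writable on Lambda
]


def resolve_paths(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (chrome_path, driver_path), falling back to the Dockerfile locations."""
    return (
        env.get("CHROME_PATH", "/opt/chrome/chrome"),
        env.get("CHROME_DRIVER_PATH", "/opt/chromedriver"),
    )


class BaseWebDriver:
    """Hypothetical base class that owns a Chrome WebDriver and a WebDriverWait."""

    def __init__(self, wait_seconds: float = 10.0) -> None:
        # Imported here so the module can be loaded without Selenium installed.
        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        from selenium.webdriver.support.ui import WebDriverWait

        chrome_path, driver_path = resolve_paths()
        options = webdriver.ChromeOptions()
        options.binary_location = chrome_path  # browser installed by the Dockerfile
        for argument in DEFAULT_CHROME_ARGUMENTS:
            options.add_argument(argument)

        service = Service(executable_path=driver_path)  # driver installed by the Dockerfile
        self.driver = webdriver.Chrome(service=service, options=options)
        self.wait = WebDriverWait(self.driver, wait_seconds)

    def close(self) -> None:
        """Shut down the browser and the driver process."""
        self.driver.quit()
```

Passing both paths explicitly keeps Selenium from looking for a browser or driver elsewhere, so the versions pinned in the Dockerfile are the ones actually used.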

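Supplementary Notes on Running This Sample

This sample is written for readers who have experience with AWS CDK, Docker, Lambda, and Python.

If you are starting from zero with CDK, complete the prerequisites of the AWS CDK Workshop, skim the Python workshop, install Docker, and then follow the Quickstart of this sample code.

A single Lambda function is deployed via CloudFormation, so try running a test invocation of it. This Lambda function does not use the event, so any test input is fine.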
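
Besides the console's test button, the function can also be invoked from a script. The following is a minimal sketch that assumes a boto3 Lambda client is passed in. The helper name invoke_once and the shape of its return value are hypothetical, and the function name has to be looked up after deployment because CDK generates it.

```python
import json
from typing import Any


def invoke_once(lambda_client: Any, function_name: str) -> dict:
    """Invoke the function synchronously with an empty event and decode the result."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps({}).encode("utf-8"),  # the handler ignores the event
    )
    body = response["Payload"].read()
    return {
        "status_code": response["StatusCode"],
        # Present only when the function itself raised an error
        "function_error": response.get("FunctionError"),
        "result": json.loads(body) if body else None,
    }
```

With boto3 installed and credentials configured, a call would look like `invoke_once(boto3.client("lambda"), "<deployed function name>")`.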

Compatibility Between the Chrome Browser (115 and Newer) and ChromeDriver

You can select a compatible ChromeDriver using the method described on this site.

Specifically, this sample uses Chrome 120.0.6099.109 in the Dockerfile, so I took the version string without its final patch component, that is 120.0.6099, and looked it up in CfT JSON endpoints -> known-good-versions-with-downloads.json. The URL found there, https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.109/linux64/chromedriver-linux64.zip, is the download URL of the compatible driver.
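
The manual lookup described above can be sketched in Python as follows. This assumes the JSON file has a top-level "versions" list whose entries carry a "version" string and per-platform "downloads", and that it is published at the URL below. The function names are hypothetical, and picking the newest entry that shares MAJOR.MINOR.BUILD is one reasonable reading of the selection rule.

```python
import json
import urllib.request
from typing import Optional

KNOWN_GOOD_VERSIONS_URL = (
    "https://googlechromelabs.github.io/chrome-for-testing/"
    "known-good-versions-with-downloads.json"
)


def fetch_known_good_versions(url: str = KNOWN_GOOD_VERSIONS_URL) -> dict:
    """Download and parse the known-good-versions JSON file."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def find_chromedriver_url(
    data: dict, chrome_version: str, platform: str = "linux64"
) -> Optional[str]:
    """Return the ChromeDriver URL of the newest version sharing MAJOR.MINOR.BUILD."""
    # "120.0.6099.109" -> "120.0.6099." (trailing dot avoids matching "120.0.60990")
    build_prefix = ".".join(chrome_version.split(".")[:3]) + "."
    matches = [
        entry for entry in data["versions"]
        if entry["version"].startswith(build_prefix)
    ]
    # Sort numerically so that patch 109 ranks above patch 71
    matches.sort(
        key=lambda entry: [int(part) for part in entry["version"].split(".")],
        reverse=True,
    )
    for entry in matches:
        for download in entry["downloads"].get("chromedriver", []):
            if download["platform"] == platform:
                return download["url"]
    return None
```

Calling `find_chromedriver_url(fetch_known_good_versions(), "120.0.6099.109")` would then return a driver URL for linux64, or None if no matching entry is listed.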

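I did not do it this time, but as an alternative there are also API endpoints that include the Chrome version string, so it should be possible to build the curl command dynamically and fetch a compatible driver.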
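
As a rough sketch of that alternative, the helpers below build the endpoint and download URLs from a Chrome version. They assume that Chrome for Testing publishes a plain-text LATEST_RELEASE_<MAJOR.MINOR.BUILD> endpoint and that downloads follow the URL layout used in the Dockerfile. The function names are hypothetical, and this is written in Python rather than as a curl command.

```python
import urllib.request

CFT_ENDPOINT_BASE = "https://googlechromelabs.github.io/chrome-for-testing"
CFT_DOWNLOAD_BASE = "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing"


def latest_release_endpoint(chrome_version: str) -> str:
    """Build the endpoint URL from the MAJOR.MINOR.BUILD part of a Chrome version."""
    build = ".".join(chrome_version.split(".")[:3])
    return f"{CFT_ENDPOINT_BASE}/LATEST_RELEASE_{build}"


def fetch_latest_release(chrome_version: str) -> str:
    """Ask the endpoint for the newest full version of that build (network call)."""
    with urllib.request.urlopen(
        latest_release_endpoint(chrome_version), timeout=30
    ) as response:
        return response.read().decode("utf-8").strip()


def chromedriver_download_url(driver_version: str, platform: str = "linux64") -> str:
    """Build the ChromeDriver zip URL using the layout seen in the Dockerfile."""
    return (
        f"{CFT_DOWNLOAD_BASE}/{driver_version}/{platform}/"
        f"chromedriver-{platform}.zip"
    )
```

Chaining the two, `chromedriver_download_url(fetch_latest_release("120.0.6099.109"))`, would give a URL that could replace the hard-coded one at build time.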
