
[Python] Scraping in AWS Lambda

Last updated at 2017-03-19, posted at 2017-03-19

To handle sites that depend on JavaScript, this article sets up a scraping environment on AWS Lambda with the following stack:

PhantomJS
Selenium

Required libraries
Local execution

pip install python-lambda-local

Deployment

pip install lambda-uploader

If you just want to try it quickly, use the repository below.
https://github.com/akichim21/python_scraping_in_lambda

Script

The script uses Selenium (driver: PhantomJS) to produce the HTML after JavaScript has run and extracts the page title. It then calls close() and quit() to kill PhantomJS, and that is all it does.

lambda_function.py

#!/usr/bin/env python

import os  # for os.path.devnull
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

def lambda_handler(event, context):
  # Present a desktop Chrome user agent instead of the PhantomJS default
  user_agent = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36")

  dcap = dict(DesiredCapabilities.PHANTOMJS)
  dcap["phantomjs.page.settings.userAgent"] = user_agent
  dcap["phantomjs.page.settings.javascriptEnabled"] = True

  # Use the phantomjs binary bundled next to this file and discard its log
  browser = webdriver.PhantomJS(
              service_log_path=os.path.devnull,
              executable_path="./phantomjs",
              service_args=['--ignore-ssl-errors=true', '--load-images=no', '--ssl-protocol=any'],
              desired_capabilities=dcap
            )

  # Load the page (JavaScript runs) and read the rendered title
  browser.get('http://google.com')
  title = browser.title
  # Close the window, then quit so the phantomjs process is terminated
  browser.close()
  browser.quit()
  return title
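
If browser.get() raises, the handler never reaches close() and quit(), and the PhantomJS process may be left running. The following is a minimal sketch of one way to guarantee cleanup, not part of the original script. scrape_title and FakeBrowser are hypothetical names; the fake stands in for webdriver.PhantomJS so the example runs without a browser binary.

```python
def scrape_title(browser, url):
    """Load url, return the page title, and always shut the browser down."""
    try:
        browser.get(url)
        return browser.title
    finally:
        # Runs on success and on failure, so quit() is always attempted.
        try:
            browser.close()
        finally:
            browser.quit()


class FakeBrowser(object):
    """Hypothetical stand-in for webdriver.PhantomJS, used only for this demo."""

    def __init__(self, title, fail=False):
        self.title = title
        self.fail = fail
        self.calls = []  # records the order of method calls

    def get(self, url):
        self.calls.append("get")
        if self.fail:
            raise RuntimeError("page load failed")

    def close(self):
        self.calls.append("close")

    def quit(self):
        self.calls.append("quit")


demo = FakeBrowser("Google")
print(scrape_title(demo, "http://google.com"))
print(demo.calls)
```

In the real handler, the object returned by webdriver.PhantomJS(...) would be passed in place of FakeBrowser.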

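JSON passed as the argument

It is not used this time, but this file is how you pass arguments to the handler, which reads them as event["key1"] and so on.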

event.json

{
  "key3": "value3",
  "key2": "value2",
  "key1": "value1"
}
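
As a rough sketch of how a handler could consume these values, the snippet below reads a key from the event and falls back to a default when it is missing. target_url_from_event is a hypothetical helper, and treating key1 as a URL is an assumption made only for illustration; the original script ignores the event.

```python
import json

DEFAULT_URL = "http://google.com"  # the URL hard-coded in the script above


def target_url_from_event(event, key="key1"):
    """Return event[key] if present and non-empty, otherwise the default URL."""
    value = (event or {}).get(key)
    return value if value else DEFAULT_URL


# Hypothetical event payload; the real event.json holds placeholder values.
event = json.loads('{"key1": "http://example.com", "key2": "value2"}')
print(target_url_from_event(event))
print(target_url_from_event({}))
```

Falling back to a default keeps the handler usable when it is invoked with an empty event.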

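Lambda configuration JSON

Create an IAM role and replace the role value with its ARN.
This package is under 50 MB, so S3 is not used here, but adding other libraries quickly pushes it to 50 MB or more, at which point S3 is required. Setting s3_bucket uploads the file via S3.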

"s3_bucket": "xxx-lambda"
"s3_key": "deploy/lambda_function.zip"

name is used often when invoking the function, so choose a proper name for production.
Set memory, timeout, and the other values to suit your script.

lambda.json

{
  "name": "python_scraping_test",
  "description": "python_scraping_test",
  "region": "ap-northeast-1",
  "runtime": "python2.7",
  "handler": "lambda_function.lambda_handler",
  "role": "arn:aws:iam::00000000:role/lambda_basic_execution",
  "timeout": 60,
  "memory": 128,
  "variables": {
    "production": "True"
  },
  "ignore": [
    "\\.git.*",
    "/.*\\.pyc$",
    "/.*\\.zip$"
  ]
}
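
The 50 MB threshold mentioned above can be checked before deploying. This is a minimal sketch under stated assumptions: needs_s3 is a hypothetical helper, the limit is taken from the figure in this article, and the path is whatever zip your packaging step produces.

```python
import os
import tempfile

DIRECT_UPLOAD_LIMIT = 50 * 1024 * 1024  # 50 MB, the figure used in this article


def needs_s3(zip_path, limit=DIRECT_UPLOAD_LIMIT):
    """Return True when the archive at zip_path is larger than limit bytes."""
    return os.path.getsize(zip_path) > limit


# Demo with a small temporary file standing in for lambda_function.zip.
with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
    tmp.write(b"x" * 1024)
demo_path = tmp.name

print(needs_s3(demo_path))             # 1 KB file against the 50 MB limit
print(needs_s3(demo_path, limit=512))  # same file against a tiny limit
os.remove(demo_path)
```

When the check reports True, adding s3_bucket and s3_key to lambda.json is the route described above.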

Dependencies

Only selenium is installed with pip. phantomjs is kept in the project as a binary.

requirements.txt

selenium

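Local execution command

Run the handler with python-lambda-local. -f specifies the function name, -t the timeout in seconds, and -l the directory holding third-party libraries.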

python-lambda-local -f lambda_handler -l ./ -t 60 lambda_function.py event.json

Result

[root - INFO - 2017-03-19 08:16:05,271] Event: {u'test': u'test'}
[root - INFO - 2017-03-19 08:16:05,271] START RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] END RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6
[root - INFO - 2017-03-19 08:16:06,766] RESULT:
Google
[root - INFO - 2017-03-19 08:16:06,766] REPORT RequestId: 4e881a1b-3f7a-4de8-9afb-aee6f6b5dac6  Duration: 1494.06 ms
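
For a quick check without the tool, the same idea (load an event file, call the handler) can be sketched in a few lines. invoke_locally and echo_handler are hypothetical names, and the sketch skips the timeout handling and log output that python-lambda-local provides.

```python
import json
import os
import tempfile


def invoke_locally(handler, event_path, context=None):
    """Load the JSON file at event_path and pass it to handler as the event."""
    with open(event_path) as f:
        event = json.load(f)
    return handler(event, context)


def echo_handler(event, context):
    # Hypothetical handler used only for this demo; it returns one event value.
    return event["key1"]


# Write a throwaway event file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"key1": "value1"}, tmp)

print(invoke_locally(echo_handler, tmp.name))
os.remove(tmp.name)
```

With the real project, lambda_function.lambda_handler and event.json would take the place of the demo handler and the temporary file.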

Deploy command

The deploy uses ~/.aws/credentials, so configure it if you have not already.

If the AWS CLI is not installed, install it first.

pip install awscli
aws configure

If the environment's packages live in a non-standard location, as with conda, note the path shown in the Location field.

pip show virtualenv

lambda-uploader

## If the virtualenv is in a non-standard location, pass the Location directory as the argument (with vagrant, anaconda3, and a py2 env it looks like this)
lambda-uploader --virtualenv=/home/vagrant/.pyenv/versions/anaconda3-4.1.0/envs/py2/lib/python2.7/site-packages
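
The Location reported by pip show is normally the interpreter's site-packages directory. As a rough alternative, that directory can usually be read from the standard library. site_packages_dir is a hypothetical helper, and the snippet should be run with the interpreter of the environment you intend to package.

```python
import sysconfig


def site_packages_dir():
    """Return the directory where this interpreter installs pure-Python packages."""
    return sysconfig.get_paths()["purelib"]


print(site_packages_dir())
```

The printed path is the kind of value passed to --virtualenv in the command above.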
