LoginSignup
8
3

More than 3 years have passed since last update.

Lambda+EFSで自然言語処理ライブラリ(GiNZA)使ってみる

Posted at

背景

アドベントカレンダー用記事を書いていて、サイズが大きい自然言語処理ライブラリをLambdaで使う部分で技術的障壁が出てきている。そんな中、EFSにセットアップしたPythonライブラリをLambdaにimportする方法という記事を見つける。こちらの技術で要件が満たせそうなので試してみる。

関係する拙記事

背景で述べた技術的障壁を乗り越えるべく各種技術を検証した時の記事。
LambdaLayer用zipをCodeBuildでお手軽に作ってみる。
LambdaでDockerコンテナイメージ使えるってマジですか?(Python3でやってみる)

GiNZA とは

形態素解析を始めとして各種自然言語処理が出来るpythonライブラリ。spaCyの機能をラップしてる(はず)なのでその機能は使える。形態素解析エンジンにSudachiを使用したりもしている。

前提

リソース群は基本CloudFormationで作成。AWSコンソールからCloudFormationで、「スタックの作成」でCloudFormationのTemplateを読み込む形。すいませんが、CloudFormationの適用方法などは把握している方前提になります。

KeyPairの準備(無い場合)

後ほどのCloudFormationのパラメーター指定で必要になるので、AWSコンソールから作成しておく。もちろん、.sshフォルダへの配置など、sshログインの為の準備はしておく。(SSMでやれという話もあるが・・・)

VPCとかSubnetの準備(無い場合)

公式ページ AWS CloudFormation VPC テンプレート に記載のCloudFormationテンプレートを修正し、AWSコンソールから適用。
修正内容は以下の通り

  • 料金節約の為にPrivateSubnetとかNATを削除
  • 別のCloudFormationで使う値をExport
  • 実際にはリソース名など変更しています

修正後のVPC+SubnetのCloudFormation
# It's based on the following sample.
# https://docs.aws.amazon.com/ja_jp/codebuild/latest/userguide/cloudformation-vpc-template.html
Description:  This template deploys a VPC, with a pair of public and private subnets spread
  across two Availability Zones. It deploys an internet gateway, with a default
  route on the public subnets. It deploys a pair of NAT gateways (one in each AZ),
  and default routes for them in the private subnets.

Parameters:
  EnvironmentName:
    Description: An environment name that is prefixed to resource names
    Type: String

  VpcCIDR:
    Description: Please enter the IP range (CIDR notation) for this VPC
    Type: String
    Default: 10.192.0.0/16

  PublicSubnet1CIDR:
    Description: Please enter the IP range (CIDR notation) for the public subnet in the first Availability Zone
    Type: String
    Default: 10.192.10.0/24

  PublicSubnet2CIDR:
    Description: Please enter the IP range (CIDR notation) for the public subnet in the second Availability Zone
    Type: String
    Default: 10.192.11.0/24

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref VpcCIDR
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: !Ref EnvironmentName

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Ref EnvironmentName

  InternetGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref VPC

  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [ 0, !GetAZs '' ]
      CidrBlock: !Ref PublicSubnet1CIDR
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName} Public Subnet (AZ1)

  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [ 1, !GetAZs  '' ]
      CidrBlock: !Ref PublicSubnet2CIDR
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName} Public Subnet (AZ2)

  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName} Public Routes

  DefaultPublicRoute:
    Type: AWS::EC2::Route
    DependsOn: InternetGatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet1

  PublicSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet2

  NoIngressSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupName: "no-ingress-sg"
      GroupDescription: "Security group with no ingress rule"
      VpcId: !Ref VPC

Outputs:
  VPC:
    Description: A reference to the created VPC
    Value: !Ref VPC
    Export:
      Name: "VPC"

  PublicSubnets:
    Description: A list of the public subnets
    Value: !Join [ ",", [ !Ref PublicSubnet1, !Ref PublicSubnet2 ]]
    Export:
      Name: "PublicSubnets"

  PublicSubnet1:
    Description: A reference to the public subnet in the 1st Availability Zone
    Value: !Ref PublicSubnet1
    Export:
      Name: "PublicSubnet1"

  PublicSubnet2:
    Description: A reference to the public subnet in the 2nd Availability Zone
    Value: !Ref PublicSubnet2
    Export:
      Name: "PublicSubnet2"

  NoIngressSecurityGroup:
    Description: Security group with no ingress rule
    Value: !Ref NoIngressSecurityGroup

EFS+EC2(AutoScaling)の準備

公式ページ Amazon Elastic File System サンプルテンプレート に記載のCloudFormationを修正し、AWSコンソールから適用。VPCなどを既存の物を使う場合、適宜修正お願いします。

修正内容は以下の通り

  • インスタンスタイプなど要らない部分削除
  • AMIのImageIDは直接指定する形に(ami-00f045aed21a55240:Amazon Linux 2 AMI 2.0.20201126.0 x86_64 HVM gp2を使用)
  • MountTargetを2つ(AZ分)に変更
  • 別のCloudFormationで使うMountTargetなどをExportして参照可能に
  • AccessPointのpathなど修正
  • 実際にはリソース名など変更しています

修正後のEFS+EC2(AutoScaling)CloudFormation
# https://docs.aws.amazon.com/ja_jp/AWSCloudFormation/latest/UserGuide/quickref-efs.html
AWSTemplateFormatVersion: '2010-09-09'
Description: This template creates an Amazon EFS file system and mount target and
  associates it with Amazon EC2 instances in an Auto Scaling group. **WARNING** This
  template creates Amazon EC2 instances and related resources. You will be billed
  for the AWS resources used if you create a stack from this template.
Parameters:
  InstanceType:
    Description: WebServer EC2 instance type
    Type: String
    Default: t3.small
    AllowedValues:
      - t3.nano
      - t3.micro
      - t3.small
      - t3.medium
      - t3.large
    ConstraintDescription: must be a valid EC2 instance type.
  AMIImageId:
    Type: String
    # Amazon Linux 2 AMI (HVM), SSD Volume Type
    Default: ami-00f045aed21a55240
  KeyName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: Name of an existing EC2 key pair to enable SSH access to the ECS
      instances
  AsgMaxSize:
    Type: Number
    Description: Maximum size and initial desired capacity of Auto Scaling Group
    Default: '1'
  SSHLocation:
    Description: The IP address range that can be used to connect to the EC2 instances
      by using SSH
    Type: String
    MinLength: '9'
    MaxLength: '18'
    Default: 221.249.116.206/32
    AllowedPattern: "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})"
    ConstraintDescription: must be a valid IP CIDR range of the form x.x.x.x/x.
  VolumeName:
    Description: The name to be used for the EFS volume
    Type: String
    MinLength: '1'
    Default: efsvolume
  MountPoint:
    Description: The Linux mount point for the EFS volume
    Type: String
    MinLength: '1'
    Default: efsmountpoint
Mappings:
  AWSInstanceType2Arch:
    t3.nano:
      Arch: HVM64
    t3.micro:
      Arch: HVM64
    t3.small:
      Arch: HVM64
    t3.medium:
      Arch: HVM64
    t3.large:
      Arch: HVM64
  AWSRegionArch2AMI:
    ap-northeast-1:
      HVM64: ami-00f045aed21a55240
Resources:
  CloudWatchPutMetricsRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ec2.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: "/"
  CloudWatchPutMetricsRolePolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: CloudWatch_PutMetricData
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Sid: CloudWatchPutMetricData
          Effect: Allow
          Action:
          - cloudwatch:PutMetricData
          Resource:
          - "*"
      Roles:
      - Ref: CloudWatchPutMetricsRole
  CloudWatchPutMetricsInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
      - Ref: CloudWatchPutMetricsRole
  InstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId:
        Fn::ImportValue: VPC
      GroupDescription: Enable SSH access via port 22
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '22'
        ToPort: '22'
        CidrIp:
          Ref: SSHLocation
  MountTargetSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId:
        Fn::ImportValue: VPC
      GroupDescription: Security group for mount target
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '2049'
        ToPort: '2049'
        CidrIp: 0.0.0.0/0
  FileSystem:
    Type: AWS::EFS::FileSystem
    Properties:
      PerformanceMode: generalPurpose
      FileSystemTags:
      - Key: Name
        Value:
          Ref: VolumeName
  MountTarget1:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId:
        Ref: FileSystem
      SubnetId:
        Fn::ImportValue: PublicSubnet1
      SecurityGroups:
      - Ref: InstanceSecurityGroup
      - Ref: MountTargetSecurityGroup
  MountTarget2:
    Type: AWS::EFS::MountTarget
    Properties:
      FileSystemId:
        Ref: FileSystem
      SubnetId:
        Fn::ImportValue: PublicSubnet2
      SecurityGroups:
      - Ref: InstanceSecurityGroup
      - Ref: MountTargetSecurityGroup
  EFSAccessPoint:
    Type: 'AWS::EFS::AccessPoint'
    Properties:
      FileSystemId: !Ref FileSystem
      RootDirectory:
        Path: "/"
  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Metadata:
      AWS::CloudFormation::Init:
        configSets:
          MountConfig:
          - setup
          - mount
        setup:
          packages:
            yum:
              nfs-utils: []
          files:
            "/home/ec2-user/post_nfsstat":
              content: !Sub |
                #!/bin/bash

                INPUT="$(cat)"
                CW_JSON_OPEN='{ "Namespace": "EFS", "MetricData": [ '
                CW_JSON_CLOSE=' ] }'
                CW_JSON_METRIC=''
                METRIC_COUNTER=0

                for COL in 1 2 3 4 5 6; do

                 COUNTER=0
                 METRIC_FIELD=$COL
                 DATA_FIELD=$(($COL+($COL-1)))

                 while read line; do
                   if [[ COUNTER -gt 0 ]]; then

                     LINE=`echo $line | tr -s ' ' `
                     AWS_COMMAND="aws cloudwatch put-metric-data --region ${AWS::Region}"
                     MOD=$(( $COUNTER % 2))

                     if [ $MOD -eq 1 ]; then
                       METRIC_NAME=`echo $LINE | cut -d ' ' -f $METRIC_FIELD`
                     else
                       METRIC_VALUE=`echo $LINE | cut -d ' ' -f $DATA_FIELD`
                     fi

                     if [[ -n "$METRIC_NAME" && -n "$METRIC_VALUE" ]]; then
                       INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
                       CW_JSON_METRIC="$CW_JSON_METRIC { \"MetricName\": \"$METRIC_NAME\", \"Dimensions\": [{\"Name\": \"InstanceId\", \"Value\": \"$INSTANCE_ID\"} ], \"Value\": $METRIC_VALUE },"
                       unset METRIC_NAME
                       unset METRIC_VALUE

                       METRIC_COUNTER=$((METRIC_COUNTER+1))
                       if [ $METRIC_COUNTER -eq 20 ]; then
                         # 20 is max metric collection size, so we have to submit here
                         aws cloudwatch put-metric-data --region ${AWS::Region} --cli-input-json "`echo $CW_JSON_OPEN ${!CW_JSON_METRIC%?} $CW_JSON_CLOSE`"

                         # reset
                         METRIC_COUNTER=0
                         CW_JSON_METRIC=''
                       fi
                     fi
                     COUNTER=$((COUNTER+1))
                   fi

                   if [[ "$line" == "Client nfs v4:" ]]; then
                     # the next line is the good stuff
                     COUNTER=$((COUNTER+1))
                   fi
                 done <<< "$INPUT"
                done

                # submit whatever is left
                aws cloudwatch put-metric-data --region ${AWS::Region} --cli-input-json "`echo $CW_JSON_OPEN ${!CW_JSON_METRIC%?} $CW_JSON_CLOSE`"
              mode: '000755'
              owner: ec2-user
              group: ec2-user
            "/home/ec2-user/crontab":
              content: "* * * * * /usr/sbin/nfsstat | /home/ec2-user/post_nfsstat\n"
              owner: ec2-user
              group: ec2-user
          commands:
            01_createdir:
              command: !Sub "mkdir /${MountPoint}"
        mount:
          commands:
            01_mount:
              command: !Sub >
                mount -t nfs4 -o nfsvers=4.1 ${FileSystem}.efs.${AWS::Region}.amazonaws.com:/ /${MountPoint}
            02_permissions:
              command: !Sub "chown ec2-user:ec2-user /${MountPoint}"
    Properties:
      AssociatePublicIpAddress: true
      ImageId:
        Ref: AMIImageId
      InstanceType:
        Ref: InstanceType
      KeyName:
        Ref: KeyName
      SecurityGroups:
      - Ref: InstanceSecurityGroup
      IamInstanceProfile:
        Ref: CloudWatchPutMetricsInstanceProfile
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash -xe
          yum install -y aws-cfn-bootstrap
          /opt/aws/bin/cfn-init -v --stack ${AWS::StackName} --resource LaunchConfiguration --configsets MountConfig --region ${AWS::Region}
          crontab /home/ec2-user/crontab
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource AutoScalingGroup --region ${AWS::Region}
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn:
    - MountTarget1
    - MountTarget2
    CreationPolicy:
      ResourceSignal:
        Timeout: PT15M
        Count:
          Ref: AsgMaxSize
    Properties:
      VPCZoneIdentifier:
        - Fn::ImportValue: PublicSubnet1
        - Fn::ImportValue: PublicSubnet2
      LaunchConfigurationName:
        Ref: LaunchConfiguration
      MinSize: '1'
      MaxSize:
        Ref: AsgMaxSize
      DesiredCapacity:
        Ref: AsgMaxSize
      Tags:
      - Key: Name
        Value: EFS FileSystem Mounted Instance
        PropagateAtLaunch: 'true'
Outputs:
  MountTargetID1:
    Description: Mount target ID
    Value:
      Ref: MountTarget1
  MountTargetID2:
    Description: Mount target ID
    Value:
      Ref: MountTarget2
  LambdaEFSArn:
    Description: File system Arn
    Value: !GetAtt FileSystem.Arn
    Export:
      Name: !Sub "LambdaEFSArn"
  LambdaEFSAccessPointArn:
    Description: File system AccessPointArn
    Value: !GetAtt EFSAccessPoint.Arn
    Export:
      Name: !Sub "LambdaEFSAccessPointArn"
  InstanceSecurityGroup:
    Description: A reference to the InstanceSecurityGroup
    Value: !Ref InstanceSecurityGroup
    Export:
      Name: "InstanceSecurityGroup"
  MountTargetSecurityGroup:
    Description: A reference to the MountTargetSecurityGroup
    Value: !Ref MountTargetSecurityGroup
    Export:
      Name: "MountTargetSecurityGroup"

EC2へログインしてモジュールインストール

EFSにセットアップしたPythonライブラリをLambdaにimportする方法をトレースさせて頂く。

ログイン

AWSコンソールからPublicIPを調べてssh。

ssh -i ~/.ssh/hogehoge-keypair.pem ec2-user@xx.yyy.xxx.zzz

マウント確認

コマンド実行
df -h
結果表示
Filesystem                                      Size  Used Avail Use% Mounted on
devtmpfs                                        469M     0  469M   0% /dev
tmpfs                                           479M     0  479M   0% /dev/shm
tmpfs                                           479M  388K  479M   1% /run
tmpfs                                           479M     0  479M   0% /sys/fs/cgroup
/dev/nvme0n1p1                                  8.0G  1.6G  6.5G  20% /
xx-yyyyyyyz.efs.ap-northeast-1.amazonaws.com:/  8.0E     0  8.0E   0% /efsmountpoint
tmpfs                                            96M     0   96M   0% /run/user/1000

/efsmountpoint にEFSがマウントされているのを確認。

Pythonなどのモジュールインストール

su にならないとginzaが上手くインストールできなかったのでその部分修正

sudo su -
cd /efsmountpoint
yum update
yum -y install gcc openssl-devel bzip2-devel libffi-devel
wget https://www.python.org/ftp/python/3.8.6/Python-3.8.6.tgz
tar xzf Python-3.8.6.tgz

cd Python-3.8.6
./configure --enable-optimizations
make altinstall

# check
python3.8 --version
pip3.8 --version

GiNZAインストール

pip3.8 install --upgrade --target lambda/ ginza==4.0.5

# 念のためフル権限にしておく
chmod 777 -R lambda/

※ここまででEC2は必要無くなります。AWSコンソールからEC2 => AutoScalingグループ => 対象のAutoScalingグループ選択 => グループの詳細 の「編集」で 「希望する容量」「最小キャパシティ」「最大キャパシティ」を全て0にしてインスタンスを終了。でないと不必要なお金がかかってしまうので注意!!!!

テスト用Lambdaを登録(メイン部分)

こちらのCloudFormationをAWSコンソールから適用。重要なのはインラインで記載されてるソースの以下部分。あと、FileSystemConfigs プロパティの設定。EFSを使うので、VPCに属するLambdaにしています。

ポイント部分
sys.path.append("/mnt/efs0/lambda")
EFSマウント指定部分
      FileSystemConfigs:
      - Arn:
          Fn::ImportValue: LambdaEFSAccessPointArn
        LocalMountPath: "/mnt/efs0"

テスト用LambdaのCloudFormation(Policy+Lambda)
AWSTemplateFormatVersion: '2010-09-09'
Description: Lambda test with EFS
Resources:
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: "LambdaRole"
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - lambda.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: "/"
      Policies:
      - PolicyName: "LambdaPolicy"
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: "*"
            - Effect: Allow
              Action:
                - cloudwatch:GetMetricStatistics
              Resource: "*"
            - Effect: Allow
              Action:
                - dynamodb:GetRecords
                - dynamodb:GetItem
                - dynamodb:BatchGetItem
                - dynamodb:BatchWriteItem
                - dynamodb:DeleteItem
                - dynamodb:Query
                - dynamodb:Scan
                - dynamodb:PutItem
                - dynamodb:UpdateItem
              Resource: "*"

            - Effect: Allow
              Action:
                - ec2:CreateNetworkInterface
                - ec2:DescribeNetworkInterfaces
                - ec2:DeleteNetworkInterface
                - ec2:DescribeSecurityGroups
                - ec2:DescribeSubnets
                - ec2:DescribeVpcs
              Resource: "*"
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: "*"
            - Effect: Allow
              Action:
                - elasticfilesystem:ClientMount
                - elasticfilesystem:ClientWrite
                - elasticfilesystem:DescribeMountTargets
              Resource: "*"

  LambdaEFSTest:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: efstestlambda
      Handler: index.handler
      Runtime: python3.8
      Code:
        ZipFile: |
          import sys
          sys.path.append("/mnt/efs0/lambda")
          import json
          import spacy
          import logging
          from ginza import *

          logger = logging.getLogger()

          def handler(event, context):
              logger.info(context)
              target_text = event['text']
              nlp = spacy.load('ja_ginza')

              doc = nlp(target_text)
              morpheme_list = []
              for sent_idx, sent in enumerate(doc.sents):
                  for token_idx, tk in enumerate(sent):
                      wk_morpheme = {}
                      wk_morpheme['text'] = tk.text
                      wk_morpheme['dep'] = tk.dep_
                      wk_morpheme['pos'] = tk.pos_
                      wk_morpheme['tag'] = tk.tag_
                      morpheme_list.append(wk_morpheme)
              return morpheme_list
      FileSystemConfigs:
      - Arn:
          Fn::ImportValue: LambdaEFSAccessPointArn
        LocalMountPath: "/mnt/efs0"
      Description: Lambda test with EFS.
      MemorySize: 2048
      Timeout: 15
      Role: !GetAtt LambdaRole.Arn
      VpcConfig:
        SecurityGroupIds:
          - Fn::ImportValue: InstanceSecurityGroup
          - Fn::ImportValue: MountTargetSecurityGroup
        SubnetIds:
          - Fn::ImportValue: PublicSubnet1
          - Fn::ImportValue: PublicSubnet2

テストする

  • 「テスト」ボタンを押す
  • イベント名は適当に
  • {"text":"テストしてみる"} をテスト用Bodyに指定
  • 「作成」を押す
  • 元の画面に戻る。テストが作成されているのでその状態で「テスト」ボタンを押す。

image.png

成功!(2回目以降の実行なので622msになってます。1回目は4秒以上かかりました)

終わりに

いくつかの検討を経て、ようやくサーバーレスで自然言語処理が出来そうです(EFSはストレージなので許容します)。
LambdaコンテナもEFSとのマウントも今年の機能っぽいです。去年検討していたら諦めていた事になります。AWSの機能追加速度には目を見張るものがあります。すなわち日々キャッチアップが必要という事になる訳で。大変ですw

参考にさせて頂いた良記事

EFSにセットアップしたPythonライブラリをLambdaにimportする方法

8
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
8
3