
Building a Data Platform for Data Analysis

Last updated at 2021-09-27, posted at 2021-08-14

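Introduction

To collect, process, and store data for analysis, I built a data platform on GCP.
This article is a record of what I did.

Prerequisites

If you scrape websites, do so at your own risk.
You may be charged for GCP usage.
I use a Mac.
The Google Cloud SDK must be available.
- If you have not set it up yet, install the Google Cloud SDK and add it to your PATH.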

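Overview

The overall architecture implemented in this article is shown below.

I built the platform shown in the diagram above.
The data flows roughly as follows.

1. A Python scraper running on Compute Engine collects data from Carsensor.
2. The collected data is stored in Cloud Storage.
3. The data in Cloud Storage is cleaned and loaded into BigQuery.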

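About step 1
Scraping can take a long time.
It is inconvenient to have your local PC's resources tied up for that long.
Running the scraper on Compute Engine leaves the local PC unaffected.
A Compute Engine instance can access various GCP resources when it is given the appropriate permissions.
Granting the role shown in green in the diagram above to the instance allows it to access Cloud Storage.
About step 2
The scraped data is stored in Cloud Storage without any processing.
Cloud Storage serves as the data lake.
About step 3
When data is added to Cloud Storage, Cloud Functions is notified via Cloud Pub/Sub.
The function reads the CSV in Cloud Storage into a pandas DataFrame and cleans it.
The cleaning makes the data easier to analyze.
The details of the cleaning are explained toward the end of this article.
The cleaned data is accumulated in BigQuery.
Once the data is in BigQuery, you can run queries to get a quick overview of it, visualize it in Data Portal, or load it with pandas from outside GCP.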

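Implementing the platform

Let's get hands-on and build the platform.
I avoid the GCP console as much as possible and build almost the entire architecture from the command line.
The reason is that the console UI changes frequently, whereas the command line rarely changes.
Each step includes reference URLs, so check them for the details of each command.

Steps

1. Create a GCP project
2. Enable the APIs
3. Create a bucket in Cloud Storage
4. Create a dataset in BigQuery
5. Upload the code to Cloud Functions
6. Create a service account
7. Create a firewall rule
8. Create a Compute Engine instance
9. Connect to the instance over SSH
10. Set up the environment on the instance
11. Implement the scraping code in Python on the instance
12. Run the scraper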

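Create a GCP project

Create a project named car-scraping-20210813.
- --set-as-default sets it as the default project.
- Project IDs must be globally unique, so replace car-scraping-20210813 with your own ID before running the command.
- Reference: https://cloud.google.com/sdk/gcloud/reference/projects/create

Local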

$ gcloud projects create car-scraping-20210813 --set-as-default

Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/car-scraping-20210813].
Waiting for [operations/cp.9220104895151486651] to finish...done.                         
Enabling service [cloudapis.googleapis.com] on project [car-scraping-20210813]...
Operation "operations/acf.p2-182414688985-3d0df817-8121-42a2-b971-0ac5b088bfd0" finished successfully.
Updated property [core/project] to [car-scraping-20210813].

List the projects and confirm that the project was created.
Reference: https://cloud.google.com/sdk/gcloud/reference/projects/list

Local

$ gcloud projects list

PROJECT_ID             NAME                   PROJECT_NUMBER
car-scraping-20210813  car-scraping-20210813  182414688985

Confirm that it was set as the default project.
Reference: https://cloud.google.com/sdk/gcloud/reference/config/list

Local

$ gcloud config list

[core]
account = ******@gmail.com
disable_usage_reporting = False
project = car-scraping-20210813

Your active configuration is: [default]

The project was created on GCP with the specified name and set as the default project for the gcloud command.

Enable the APIs

The following services are needed, so enable the API for each of them.

Compute Engine
Cloud Storage
BigQuery
Cloud Build
Cloud Functions

First, enable billing for the project. The APIs cannot be enabled without billing.
- Open the GCP console from the reference URL below and enable billing.
- Reference: https://cloud.google.com/billing/docs/how-to/modify-project?hl=ja&visit_id=637636688872527912-1948123821&rd=1
Next, list the services that can be enabled for the project.
- Reference: https://cloud.google.com/endpoints/docs/openapi/enable-api#gcloud

Local

$ gcloud services list --available

NAME                                                                                                 TITLE
abilitec-api.endpoints.liveramp-identity-public.cloud.goog                                           AbiliTec API
abusiveexperiencereport.googleapis.com                                                               Abusive Experience Report API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<省略>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
youtubeoembed.googleapis.com                                                                         YouTube oEmbed API
youtubereporting.googleapis.com                                                                      YouTube Reporting API

Finally, enable the services by name.
Copy the NAME value of each required API from the list obtained above.
Pass the API names to gcloud services enable, separated by spaces.
Reference: https://cloud.google.com/endpoints/docs/openapi/enable-api#gcloud

Local

$ gcloud services enable compute.googleapis.com storage.googleapis.com bigquery.googleapis.com cloudbuild.googleapis.com cloudfunctions.googleapis.com

Operation "operations/acf.p2-182414688985-d135a069-b50f-4d69-b2df-1a08fb045b2e" finished successfully.

All the APIs needed for this article are now enabled.

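Create a bucket in Cloud Storage

Create a bucket named mybucket-20210813 in Cloud Storage.
- Bucket names must be globally unique, so replace mybucket-20210813 with your own name before running the command.
- Reference: https://cloud.google.com/storage/docs/creating-buckets/?hl=ja#storage-create-bucket-gsutil

Local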

$ gsutil mb gs://mybucket-20210813

Creating gs://mybucket-20210813/...

List the buckets in the project.
Reference: https://cloud.google.com/storage/docs/listing-buckets?hl=ja#gsutil

Local

$ gsutil ls

gs://mybucket-20210813/

The bucket is listed.

Create a dataset in BigQuery

Create a dataset in BigQuery.
- The dataset is named mydataset.
- See the following references.
  - https://cloud.google.com/bigquery/docs/datasets?hl=ja#bq
  - https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_mk

Local

$ bq mk \
--dataset \
car-scraping-20210813:mydataset

Dataset 'car-scraping-20210813:mydataset' successfully created.

List the datasets in the project.
Reference: https://cloud.google.com/bigquery/docs/listing-datasets?hl=ja#bq

Local

$ bq ls

  datasetId  
 ----------- 
  mydataset

The dataset was created.

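Upload the code to Cloud Functions

The Cloud Function deployed here does the following.

1. It starts when it detects that data has been stored in the Cloud Storage bucket.
2. It cleans that data so it can be used for analysis.
3. It loads the cleaned data into BigQuery.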
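Prepare the Python code.
- Create the following directory structure on your local machine.
  - The directory can have any name. Here it is trigger.
- main.py cleans the data stored in Cloud Storage and loads it into BigQuery.
- requirements.txt lists the required libraries.

Local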

trigger/
    ├ main.py
    └ requirements.txt

main.py

import os
from google.cloud import storage as gcs
import pandas as pd
from io import BytesIO

def trigger_gcs(event, context):
    """
    Clean the data in GCS and load it into BigQuery.

    Parameters
    ----------
    event : dict
    context : google.cloud.functions_v1.context.Context

    Returns
    -------
    None :
        The data is loaded into BigQuery.
    """
    project_id = os.getenv('GOOGLE_CLOUD_PROJECT')
    bucket_name = event['bucket']
    file_name = event['name']

    client = gcs.Client(project_id)
    bucket = client.get_bucket(bucket_name)
    blob = bucket.blob(file_name)
    data = blob.download_as_bytes()
    df = pd.read_csv(BytesIO(data))
    
    for i in range(len(df)):
        if "交換" in df.loc[i, "mileage"]:
            df.loc[i, "mileage"] = "-"
        elif df.loc[i, "mileage"] in ["不明", "改ざん車"]:
            df.loc[i, "mileage"] = "-"
        elif df.loc[i, "mileage"] == "-":
            pass
        else:
            df.loc[i, "mileage"] = df.loc[i, "mileage"].replace("km", "")
            if "万" in df.loc[i, "mileage"]:
                df.loc[i, "mileage"] = str(int(float(df.loc[i, "mileage"][:-1]) * 10000))

    for i in range(len(df)):
        # Leave non-date labels (and the "-" placeholder for missing values) untouched.
        if df.loc[i, "inspection"] not in ['新車未登録', '車検整備付', '車検整備別', '車検整備無', '国内未登録', '-']:
            yyyymm = df.loc[i, "inspection"][:4] + df.loc[i, "inspection"][10:-1]
            df.loc[i, "inspection"] = pd.to_datetime(yyyymm, format="%Y%m").strftime("%Y-%m")

    for i in range(len(df)):
        if df.loc[i, "base_price"] == "応談":
            df.loc[i, "base_price"] = "-"

    for i in range(len(df)):
        if df.loc[i, "total"] == "---万円":
            df.loc[i, "total"] = "-"

    dataset_id = 'mydataset'
    table_id = df['title'][0]
    df = df.astype('str')
    df.to_gbq(f'{dataset_id}.{table_id}')

requirements.txt

google-cloud-storage==1.41.1
pandas-gbq==0.15.0

Deploy the Cloud Function.
- Move into the trigger directory created above first.
- The function is deployed under the name trigger_gcs.
- --runtime specifies the runtime. I use Python 3.9 locally, so I specified python39.
- --trigger-resource specifies the bucket created above.
- --trigger-event specifies the event type. With google.storage.object.finalize, the function starts whenever data is added to the bucket.
- See the following references.
  - https://cloud.google.com/functions/docs/deploying/filesystem?hl=ja
  - https://cloud.google.com/functions/docs/tutorials/storage?hl=ja#functions-prepare-environment-python

Local

$ gcloud functions deploy trigger_gcs \
> --runtime python39 \
> --trigger-resource mybucket-20210813 \
> --trigger-event google.storage.object.finalize

Deploying function (may take a while - up to 2 minutes)...⠹                               
For Cloud Build Logs, visit: https://console.cloud.google.com/cloud-build/builds;region=us-central1/413ee12b-133a-4392-abe5-e0155ce47fa4?project=182414688985
Deploying function (may take a while - up to 2 minutes)...done.                           
availableMemoryMb: 256
buildId: 413ee12b-133a-4392-abe5-e0155ce47fa4
entryPoint: trigger_gcs
eventTrigger:
  eventType: google.storage.object.finalize
  failurePolicy: {}
  resource: projects/_/buckets/mybucket-20210813
  service: storage.googleapis.com
ingressSettings: ALLOW_ALL
labels:
  deployment-tool: cli-gcloud
name: projects/car-scraping-20210813/locations/us-central1/functions/trigger_gcs
runtime: python39
serviceAccountEmail: car-scraping-20210813@appspot.gserviceaccount.com
sourceUploadUrl: https://storage.googleapis.com/gcf-upload-us-central1-9f696e4c-21b2-4f55-ab7a-b2eef800ca90/1247bded-fa67-4f4b-bbdc-b4c0f450934b.zip
status: ACTIVE
timeout: 60s
updateTime: '2021-08-13T05:49:31.228Z'
versionId: '1'

List the Cloud Functions in the project.
Reference: https://cloud.google.com/sdk/gcloud/reference/functions/list

Local

$ gcloud functions list

NAME         STATUS  TRIGGER        REGION
trigger_gcs  ACTIVE  Event Trigger  us-central1

The function was deployed.

Create a service account

The service account created here is used later when starting the Compute Engine instance.

Create the service account.
- The service account is named up-to-storage.
- --display-name specifies the display name. Here it is the same as the service account name.
- See the following references.
  - https://cloud.google.com/iam/docs/creating-managing-service-accounts?hl=ja
  - https://cloud.google.com/sdk/gcloud/reference/iam/service-accounts/create?hl=ja

Local

$ gcloud iam service-accounts create up-to-storage \
> --display-name="up-to-storage"

Created service account [up-to-storage].

List the service accounts in the project.
- See the following references.
  - https://cloud.google.com/iam/docs/creating-managing-service-accounts?hl=ja#listing
  - https://cloud.google.com/sdk/gcloud/reference/iam/service-accounts/list?hl=ja

Local

$ gcloud iam service-accounts list

DISPLAY NAME                            EMAIL                                                        DISABLED
Compute Engine default service account  182414688985-compute@developer.gserviceaccount.com           False
App Engine default service account      car-scraping-20210813@appspot.gserviceaccount.com            False
up-to-storage                           up-to-storage@car-scraping-20210813.iam.gserviceaccount.com  False

Grant a role to the service account.
- The role to grant is roles/storage.admin.
- roles/storage.admin grants full control over buckets and objects.
- Starting the Compute Engine instance with a service account that has this role allows the instance to upload data to Cloud Storage. (The green part of the diagram.)

Local

$ gcloud projects add-iam-policy-binding car-scraping-20210813 \
> --member="serviceAccount:up-to-storage@car-scraping-20210813.iam.gserviceaccount.com" \
> --role="roles/storage.admin"

Updated IAM policy for project [car-scraping-20210813].
bindings:
- members:
  - serviceAccount:182414688985@cloudbuild.gserviceaccount.com
  role: roles/cloudbuild.builds.builder
- members:
  - serviceAccount:service-182414688985@gcp-sa-cloudbuild.iam.gserviceaccount.com
  role: roles/cloudbuild.serviceAgent
~~~~~~~~~~<中略>~~~~~~~~~~
- members:
  - serviceAccount:service-182414688985@gcp-sa-pubsub.iam.gserviceaccount.com
  role: roles/pubsub.serviceAgent
- members:
  - serviceAccount:up-to-storage@car-scraping-20210813.iam.gserviceaccount.com
  role: roles/storage.admin
etag: BwXJanj56XQ=
version: 1

The last members entry shows that the role roles/storage.admin has been granted to the service account up-to-storage.

Create a firewall rule

Jupyter will be started on the Compute Engine instance, and it is displayed in a browser.
Creating an ingress firewall rule makes the instance reachable from the browser.

Create the firewall rule.
- The firewall rule is named ingress-rule.
- --direction is the direction of traffic. Here it is ingress, meaning traffic coming from the outside world into the instance (also called inbound).
- --action specifies whether matching connections are allowed or blocked. Here it is allow, so connections matching this rule are allowed.
- --target-tags defines a tag named tcp8000. The rule applies to any instance that is given this tag.
- --rules defines what the rule matches. Here the protocol is tcp and the port is 8000.

Local

$ gcloud compute firewall-rules create ingress-rule \
> --direction ingress \
> --action allow \
> --target-tags tcp8000 \
> --rules tcp:8000

Creating firewall...⠹Created [https://www.googleapis.com/compute/v1/projects/car-scraping-20210813/global/firewalls/ingress-rule].
Creating firewall...done.                                                                 
NAME          NETWORK  DIRECTION  PRIORITY  ALLOW     DENY  DISABLED
ingress-rule  default  INGRESS    1000      tcp:8000        False

List the firewall rules.
- See the following references.
  - https://cloud.google.com/vpc/docs/using-firewalls?hl=ja#listing-firewall-rules-for-a-vpc-network
  - https://cloud.google.com/sdk/gcloud/reference/compute/firewall-rules/list?hl=ja

Local

$ gcloud compute firewall-rules list
NAME                    NETWORK  DIRECTION  PRIORITY  ALLOW                         DENY  DISABLED
default-allow-icmp      default  INGRESS    65534     icmp                                False
default-allow-internal  default  INGRESS    65534     tcp:0-65535,udp:0-65535,icmp        False
default-allow-rdp       default  INGRESS    65534     tcp:3389                            False
default-allow-ssh       default  INGRESS    65534     tcp:22                              False
ingress-rule            default  INGRESS    1000      tcp:8000                            False

To show all fields of the firewall, please show in JSON format: --format=json
To show all fields in table format, please see the examples in --help.

The firewall rule named ingress-rule appears at the bottom of the list.

Create a Compute Engine instance

Create the Compute Engine instance.
- The instance is named scraping-instance-20210813.
- --service-account takes the email address of the service account created earlier.
  - This lets the instance use the permissions of that service account.
- --scopes specifies the access scopes.
  - cloud-platform allows access to most Cloud APIs. (However, the service account only has roles/storage.admin, so in practice this instance can access Cloud Storage only.)
- --tags takes the --target-tags value of the firewall rule created earlier.
- Any zone should work for --zone. Here I specified asia-northeast1-a.
- See the following references.
  - https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances?hl=ja#using
  - https://cloud.google.com/sdk/gcloud/reference/compute/instances/create?hl=ja

Local

$ gcloud compute instances create scraping-instance-20210813 \
> --service-account up-to-storage@car-scraping-20210813.iam.gserviceaccount.com \
> --scopes cloud-platform \
> --tags tcp8000 \
> --zone asia-northeast1-a

Created [https://www.googleapis.com/compute/v1/projects/car-scraping-20210813/zones/asia-northeast1-a/instances/scraping-instance-20210813].
NAME                        ZONE               MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
scraping-instance-20210813  asia-northeast1-a  n1-standard-1               10.146.0.2   35.194.109.40  RUNNING

Check the details of the created instance.
- Reference: https://cloud.google.com/sdk/gcloud/reference/compute/instances/describe

Local

$ gcloud compute instances describe scraping-instance-20210813 --zone=asia-northeast1-a
canIpForward: false
cpuPlatform: Intel Broadwell
creationTimestamp: '2021-08-12T22:59:22.110-07:00'
deletionProtection: false
disks:
- autoDelete: true
  boot: true
  deviceName: persistent-disk-0
  diskSizeGb: '10'
  guestOsFeatures:
  - type: UEFI_COMPATIBLE
  - type: VIRTIO_SCSI_MULTIQUEUE
  index: 0
  interface: SCSI
  kind: compute#attachedDisk
  licenses:
  - https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster
  mode: READ_WRITE
  source: https://www.googleapis.com/compute/v1/projects/car-scraping-20210813/zones/asia-northeast1-a/disks/scraping-instance-20210813
  type: PERSISTENT
~~~~~~~~~~<中略>~~~~~~~~~~
selfLink: https://www.googleapis.com/compute/v1/projects/car-scraping-20210813/zones/asia-northeast1-a/instances/scraping-instance-20210813
serviceAccounts:
- email: up-to-storage@car-scraping-20210813.iam.gserviceaccount.com
  scopes:
  - https://www.googleapis.com/auth/cloud-platform
shieldedInstanceConfig:
  enableIntegrityMonitoring: true
  enableSecureBoot: false
  enableVtpm: true
shieldedInstanceIntegrityPolicy:
  updateAutoLearnPolicy: true
startRestricted: false
status: RUNNING
tags:
  fingerprint: N7eDKdVmcIs=
  items:
  - tcp8000
zone: https://www.googleapis.com/compute/v1/projects/car-scraping-20210813/zones/asia-northeast1-a

No image was specified when creating the instance, but licenses contains https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster, which shows that the OS is Debian 10 (buster).
In addition, email is the address of the specified service account,
and items under tags shows the firewall tag tcp8000.

Connect to the instance over SSH

Connect to the created instance.
- --project specifies the project that contains the instance.
- --zone takes the same value used when creating the instance.
- The last argument is the name of the instance to connect to.
See the following references.
- https://cloud.google.com/compute/docs/instances/connecting-to-instance?hl=ja#gcloud
- https://cloud.google.com/sdk/gcloud/reference/compute/ssh?hl=ja

Local

$ gcloud compute ssh --project=car-scraping-20210813 --zone=asia-northeast1-a scraping-instance-20210813

Updated [https://www.googleapis.com/compute/v1/projects/car-scraping-20210813].           
Updating project ssh metadata...done.                                                     
Waiting for SSH key to propagate.
Warning: Permanently added 'compute.8943762423492097241' (ECDSA) to the list of known hosts.
Linux scraping-instance-20210813 4.19.0-17-cloud-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
adachikeisuke@scraping-instance-20210813:~$

If the last line shows a prompt of the form account-name@instance-name:~$, you are connected to the instance.

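Set up the environment on the instance

Let's prepare an environment for scraping on the instance.
This article uses Docker to build the environment.

Update the apt package index on the instance.

Instance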

adachikeisuke@scraping-instance-20210813:~$ sudo apt update

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Hit:2 http://deb.debian.org/debian buster InRelease                             
Get:3 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]  
~~~~~~~~~~<中略>~~~~~~~~~~
Building dependency tree       
Reading state information... Done
9 packages can be upgraded. Run 'apt list --upgradable' to see them.

Install Docker on the instance.
- -y answers yes to every prompt during installation.
- The package to install is docker.io.

Instance

adachikeisuke@scraping-instance-20210813:~$ sudo apt install -y docker.io

Reading package lists... Done
Building dependency tree       
Reading state information... Done
~~~~~~~~~~<中略>~~~~~~~~~~
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for man-db (2.8.5-2) ...
Processing triggers for libc-bin (2.28-10) ...

Check the Docker version.

Instance

adachikeisuke@scraping-instance-20210813:~$ docker --version

Docker version 18.09.1, build 4c52b90

Docker was installed.

Check the current Linux user.

Instance

adachikeisuke@scraping-instance-20210813:~$ who

adachikeisuke pts/0        2021-08-13 06:33 (106.72.51.130)

The current user is adachikeisuke.

Check which users belong to the docker group.

Instance

adachikeisuke@scraping-instance-20210813:~$ getent group docker

docker:x:113:

Add the current user to the docker group.
- After this, docker commands can be run without sudo.

Instance

adachikeisuke@scraping-instance-20210813:~$ sudo gpasswd -a adachikeisuke docker

Adding user adachikeisuke to group docker

Check the members of the docker group again.

Instance

adachikeisuke@scraping-instance-20210813:~$ getent group docker

docker:x:113:adachikeisuke

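The current user was added to the docker group.

Log out of the instance and log back in.
- Group membership is evaluated at login, so the docker command cannot be run as the current user until a new session is started.

Instance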

adachikeisuke@scraping-instance-20210813:~$ exit

logout
Connection to 35.194.109.40 closed.

Local

$ gcloud compute ssh --project=car-scraping-20210813 --zone=asia-northeast1-a scraping-instance-20210813

Linux scraping-instance-20210813 4.19.0-17-cloud-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Aug 13 06:33:01 2021 from 106.72.51.130
adachikeisuke@scraping-instance-20210813:~$

Check that the docker command can be used.

Instance

adachikeisuke@scraping-instance-20210813:~$ docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

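It works.

Create the following directory structure on the instance. (The commands are omitted.)
- Create a directory under the home directory. Any name will do. Here it is docker_env.
- Create a directory under docker_env. Again, any name will do. Here it is main_dir. Leave it empty.
- Create a file named Dockerfile under docker_env, for example with vim.
  - FROM ubuntu:latest: uses Ubuntu as the base image.
  - RUN mkdir /work: creates the /work directory, which serves as the mount point for a directory on the instance.
  - RUN apt update (rest omitted): installs the required packages and libraries.
  - WORKDIR /work: sets /work as the working directory.
  - CMD ["jupyter", (rest omitted): starts JupyterLab.
    - "jupyter", "lab": equivalent to running $ jupyter lab, which starts JupyterLab.
    - "--ip=0.0.0.0", "--allow-root": JupyterLab does not work properly in the container without these options. --ip=0.0.0.0 makes it listen on all network interfaces so that it can be reached from outside the container, and --allow-root is needed because the process in the container runs as root.
    - "--LabApp.token=''": sets the token to an empty string, so JupyterLab can be opened without a token by anyone who can reach the port.
    - "--port=8080": sets the port to 8080, so JupyterLab listens on port 8080 of the container.

Instance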

~/docker_env/
    ├ main_dir/
    └ Dockerfile

Dockerfile

FROM ubuntu:latest
RUN mkdir /work
RUN apt update \
	&& apt install -y \
	python3 \
	python3-pip \
	&& pip3 install jupyterlab \
	urllib3 \
	beautifulsoup4 \
	pandas \
	numpy \
	google-cloud-storage
WORKDIR /work
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--LabApp.token=''", "--port=8080"]

Build the Docker image.
- Move into the ~/docker_env/ directory first.

Instance

adachikeisuke@scraping-instance-20210813:~/docker_env$ docker build .

Sending build context to Docker daemon   2.56kB
Step 1/5 : FROM ubuntu:latest
latest: Pulling from library/ubuntu
~~~~~~~~~~<中略>~~~~~~~~~~
Removing intermediate container 3f23e47588f2
 ---> f174348cc5c9
Successfully built f174348cc5c9

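Start the Docker container.
- -d: detached mode. You can keep entering commands after the container starts.
- --rm: automatically cleans up the container and removes its file system when the container exits.
- -p 8000:8080: maps port 8000 of the instance to port 8080 of the container.
  - The firewall rule allows ingress on port 8000, so the instance is reached on port 8000.
  - The Dockerfile starts Jupyter on port 8080, so port 8080 of the container is specified.
- -v ~/docker_env/main_dir/:/work/: mounts the main_dir directory of the instance on the /work directory of the container.
  - A container normally cannot access directories on the instance, but this mount lets the container work with the instance's directory.
  - As a result, the source code lives on the instance and the container serves only as the execution environment.
- Reference: https://docs.docker.jp/engine/reference/run.html

Instance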

adachikeisuke@scraping-instance-20210813:~/docker_env$ docker run -d --rm -p 8000:8080 -v ~/docker_env/main_dir/:/work/ f174348cc5c9

9c83942d064a7b843902d4eb819373695f9da23f34c85640b998dc4563e87d28

Close the SSH connection to the instance.

adachikeisuke@scraping-instance-20210813:~/docker_env$ exit

logout
Connection to 35.194.109.40 closed.

Look up the external IP of the instance.

Local

$ gcloud compute instances list

NAME                        ZONE               MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP    STATUS
scraping-instance-20210813  asia-northeast1-a  n1-standard-1               10.146.0.2   35.194.109.40  RUNNING

The external IP is 35.194.109.40.

Open Jupyter in a browser.

Browser address bar

35.194.109.40:8000

If the JupyterLab screen opens, the setup is complete.

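Implement the scraping code in Python on the instance

From here on, the work is done in JupyterLab.
First, create the following directory structure in JupyterLab.
The files appear to sit directly under the root directory, but they are actually in the main_dir directory of the instance.
Because the container was started with -v ~/docker_env/main_dir/:/work/, the main_dir directory of the instance is visible as the /work directory of the container.
In addition, WORKDIR /work in the Dockerfile makes JupyterLab start in the /work directory of the container, so unrelated directories are not visible from JupyterLab at all.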

JupyterLab

/
├ functions.py
└ Untitled.ipynb

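Implement the Python code for scraping.
- This code collects information on the following three car models.
  - Mazda MAZDA3 Fastback (MAZDA3ファストバック)
  - Toyota RAV4
  - BMW X3
- Before running Untitled.ipynb, change the following two values to match your own settings.
  - project_id = "car-scraping-20210813"
  - bucket_name = "mybucket-20210813"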

functions.py

from urllib import request
from bs4 import BeautifulSoup
import re
import time



def func_scraping(cars, i):
    """
    Get the detailed information for a single car.

    Parameters
    ----------
    cars : bs4.element.ResultSet
        List of HTML elements holding the details of each car
    i : int
        Index into cars

    Returns
    -------
    car_info : list
        List of the car's details
    """
    try:
        brand = cars[i].select(".casetMedia__body__maker")[0].text
    except:
        brand = "-"

    try:
        title = cars[i].select(".casetMedia__body__title")[0].text.split()[0]
    except:
        title = "-"

    try:
        body_type = cars[i].select(".casetMedia__body__spec > p")[0].text
    except:
        body_type = "-"

    try:
        year = cars[i].select(".specWrap__box__num")[0].text
    except:
        year = "-"

    try:
        running = cars[i].select(".specWrap__box")[1].select("p")
        distance = running[1].text + running[2].text 
    except:
        distance = "-"

    try:
        displacement = cars[i].select(".specWrap__box__num")[2].text
    except:
        displacement = "-"

    try:
        inspection = ''.join(cars[i].select(".specWrap__box")[3].text.split()[1:])
    except:
        inspection = "-"

    try:
        repair = cars[i].select(".specWrap__box")[4].text.split()[1]
    except:
        repair = "-"

    try:
        color = cars[i].select(".casetMedia__body__spec")[0].text.split()[2]
    except:
        color = "-"

    try:
        price = cars[i].select(".basePrice__price")[0].text.split()[0]
    except:
        price = "-"

    try:
        payment = cars[i].select(".totalPrice__price")[0].text.split()[0]
    except:
        payment = "-"

    try:
        location = cars[i].select(".casetSub__area > p")[0].text
    except:
        location = "-"

    try:
        evaluation = cars[i].select(".casetSub__review__score.js_shop > a > span")
        score = evaluation[0].text
        number = evaluation[1].text
    except:
        score = "-"
        number = "-"

    car_info = [brand, title, body_type, year, distance, displacement,
                inspection, repair, color, price, payment, location, score, number]

    return car_info


def func_detail(car_url, DETAIL_LIST):
    """
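    Call func_scraping for every car on the page to collect one page of details,
    then move on to the next page.

    Parameters
    ----------
    car_url : str
        URL of the car model listing
    DETAIL_LIST : list
        List of details

    Returns
    -------
    car_url : str
        URL of the car model listing (next page)
    DETAIL_LIST : list
        List of details (with the new rows appended)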
    """
    time.sleep(3)
    url = "https://www.carsensor.net/"
    try:
        html = request.urlopen(car_url)
        soup = BeautifulSoup(html, "html.parser")
        cars = soup.select(".caset.js_listTableCassette")
    except:
        car_url = "None2"
        DETAIL_LIST = list()
        return car_url, DETAIL_LIST

    for i in range(len(cars)):
        car_info = func_scraping(cars, i)
        DETAIL_LIST.append(car_info)

    try:
        next_url = soup.select(".btnFunc.pager__btn__next")[1].get("onclick")
        pattern = "(?<=').*?(?=')"
        car_url = url + re.search(pattern, next_url).group()[1:]
    except:
        car_url = "None1"

    return car_url, DETAIL_LIST


def func_detail_list(car_url, DETAIL_LIST):
    """
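    Call func_detail repeatedly to collect all details for the given car model.
    Show the scraping progress.
    Note: always initialize DETAIL_LIST with "list()" before passing it in.

    Parameters
    ----------
    car_url : str
        URL of the car model listing
    DETAIL_LIST : list
        List of details

    Returns
    -------
    title : str
        Name of the car model
    DETAIL_LIST : list
        List of details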
    """
    time.sleep(3)
    try:
        html = request.urlopen(car_url)
        soup = BeautifulSoup(html, "html.parser")
        title = soup.select(".casetMedia__body__title > a")[0].text.split()[0]
        all_num = int(soup.select(".resultBar__result > p")[0].text[:-1].replace(",", ""))
        pro_bar = (" " * 30)
        print('\r{0} [{1}] {2}/{3}'.format(title, pro_bar, 0, all_num), end='')
    except:
        pass

    while car_url != "None1" and car_url != "None2":
        car_url, DETAIL_LIST = func_detail(car_url, DETAIL_LIST)
        try:
            now_num = len(DETAIL_LIST)
            progress = int((now_num / all_num) * 30)
            pro_bar = ("=" * progress) + (" " * (30 - progress))
            print('\r{0} [{1}] {2}/{3}'.format(title, pro_bar, now_num, all_num), end='')
        except:
            pass
        
    try:
        print('\r{0} [{1}] {2}/{3}'.format(title, pro_bar, now_num, all_num), end="\n")
    except:
        pass
    
    if car_url == "None1":
        return title, DETAIL_LIST
    else:
        title = "Error"
        return title, DETAIL_LIST
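The step in func_detail that builds the next page URL can be tried in isolation. The sketch below reuses the same lookbehind/lookahead pattern and the same base URL; the onclick string is a made-up sample, since the real attribute value on the site is not shown in this article, and next_page_url is a hypothetical helper name.

```python
import re

BASE_URL = "https://www.carsensor.net/"
PATTERN = "(?<=').*?(?=')"  # same pattern as in func_detail


def next_page_url(onclick: str) -> str:
    """Return the absolute URL embedded in an onclick attribute value."""
    # Take the text between the first pair of single quotes and
    # drop its leading "/" because BASE_URL already ends with one.
    path = re.search(PATTERN, onclick).group()[1:]
    return BASE_URL + path


# Made-up sample value (assumption): the real attribute may look different.
sample_onclick = "location.href='/usedcar/bTO/s147/index2.html';"
print(next_page_url(sample_onclick))
```

If the attribute contains no quoted path, re.search returns None and the attribute access raises, which is the case the try/except in func_detail turns into "None1".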

Untitled.ipynb

# [1]:
from functions import func_detail_list
import pandas as pd
from google.cloud import storage as gcs

# [2]:
CAR_URL_LIST = [
    "https://www.carsensor.net/usedcar/bMA/s097/index.html",
    "https://www.carsensor.net/usedcar/bTO/s147/index.html",
    "https://www.carsensor.net/usedcar/bBM/s025/index.html"
]

# [3]:
columns = ['brand', 'title', 'type', 'year', 'mileage', 'displacement', 'inspection', 'repair',
            'color', 'base_price', 'total', 'location', 'evaluation', 'num_evaluation']

for car_url in CAR_URL_LIST:
    DETAIL_LIST = list()
    title, DETAIL_LIST = func_detail_list(car_url, DETAIL_LIST)
    if title != "Error":
        df = pd.DataFrame(DETAIL_LIST, columns=columns)

        project_id = "car-scraping-20210813" # change this to match your own settings
        bucket_name = "mybucket-20210813" # change this to match your own settings
        gcs_path = "{}.csv".format(title)

        client = gcs.Client(project_id)
        bucket = client.get_bucket(bucket_name)
        blob_gcs = bucket.blob(gcs_path)
        blob_gcs.upload_from_string(
            data=df.to_csv(index=False),
            content_type="text/csv"
        )
    else:
        pass
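The notebook serializes the DataFrame with to_csv(index=False), and the Cloud Function reads the object back with pd.read_csv(BytesIO(...)). The following sketch reproduces that round trip in memory, without touching Cloud Storage, to show that the column list survives unchanged. The two sample rows are invented for illustration, and the explicit UTF-8 encoding stands in for what upload_from_string does with a str.

```python
from io import BytesIO

import pandas as pd

# Two invented rows with a subset of the real columns (assumption).
df = pd.DataFrame(
    [["トヨタ", "RAV4", "4.5万km"], ["トヨタ", "RAV4", "10km"]],
    columns=["brand", "title", "mileage"],
)

# index=False keeps the row index out of the CSV, so no extra
# unnamed column appears when the file is read back.
payload = df.to_csv(index=False).encode("utf-8")

# This mirrors how main.py reads the downloaded bytes.
restored = pd.read_csv(BytesIO(payload))
print(list(restored.columns))
print(restored["title"][0])
```

The last line is the same lookup main.py uses to choose the BigQuery table name.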

Run the scraper

Run the cells of Untitled.ipynb.
- If the progress bar advances as in the lower right of the image below, the scraper is running.
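For reference, the bar is plain text that func_detail_list recomputes after each page. A minimal sketch of that calculation follows; progress_bar is a hypothetical helper name, and the width of 30 matches the constant used in functions.py.

```python
def progress_bar(done: int, total: int, width: int = 30) -> str:
    """Return a fixed-width text bar such as '=======    ...'."""
    # Same arithmetic as func_detail_list: truncate to whole cells.
    filled = int((done / total) * width)
    return "=" * filled + " " * (width - filled)


# Example: 15 of 60 cars scraped so far.
print("RAV4 [{}] {}/{}".format(progress_bar(15, 60), 15, 60))
```

Printing with '\r' and end='' as in functions.py redraws the bar on the same line instead of adding a new line per page.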

Check Cloud Storage.
- Reference: https://cloud.google.com/storage/docs/listing-objects?hl=ja

Local

$ gsutil ls -r gs://mybucket-20210813/

gs://mybucket-20210813/MAZDA3ファストバック.csv
gs://mybucket-20210813/RAV4.csv
gs://mybucket-20210813/X3.csv

Check BigQuery.
- Reference: https://cloud.google.com/bigquery/docs/information-schema-tables?hl=ja#tables_view

Local

$ bq query --nouse_legacy_sql \
'SELECT
   * EXCEPT(is_typed)
 FROM
   mydataset.INFORMATION_SCHEMA.TABLES'

Waiting on bqjob_r336b0bcba9ff4b14_0000017b3e85c25d_1 ... (0s) Current status: DONE   
+-----------------------+--------------+----------------------+------------+--------------------+---------------------+
|     table_catalog     | table_schema |      table_name      | table_type | is_insertable_into |    creation_time    |
+-----------------------+--------------+----------------------+------------+--------------------+---------------------+
| car-scraping-20210813 | mydataset    | RAV4                 | BASE TABLE | YES                | 2021-08-13 07:19:55 |
| car-scraping-20210813 | mydataset    | X3                   | BASE TABLE | YES                | 2021-08-13 07:21:29 |
| car-scraping-20210813 | mydataset    | MAZDA3ファストバック | BASE TABLE | YES                | 2021-08-13 07:17:31 |
+-----------------------+--------------+----------------------+------------+--------------------+---------------------+

The scraped data is stored in both Cloud Storage and BigQuery.
This confirms that the architecture works as intended.

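About the data cleaning

This section explains the data cleaning implemented in the Cloud Function.

Check the data before conversion

The Cloud Function cleans the data stored in Cloud Storage.
Cloud Storage holds the scraped data exactly as it was collected.
Let's look at the data stored in Cloud Storage.

Download the data from Cloud Storage to your local machine.
- Create an empty directory anywhere on your local machine and move into it.
- Reference: https://cloud.google.com/storage/docs/downloading-objects?hl=ja#gsutil

Local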

$ gsutil cp gs://mybucket-20210813/RAV4.csv .

Copying gs://mybucket-20210813/RAV4.csv...
/ [1 files][110.8 KiB/110.8 KiB]                                                
Operation completed over 1 objects/110.8 KiB.

Inspect the downloaded data in Jupyter
- Start Jupyter on your local machine.
- The rows shown below were hand-picked to show the messy parts of the data.

Untitled.ipynb

import pandas as pd

df = pd.read_csv("RAV4.csv")
df.head()

	brand	title	type	year	mileage	displacement	inspection	repair	color	base_price	total	location	evaluation	num_evaluation
0	トヨタ	RAV4	クロカン・ＳＵＶ	2019	4.5万km	2000	2022(R04)年09月	なし	真珠白	247.0	259.8	滋賀県	4.9	13
1	トヨタ	RAV4	クロカン・ＳＵＶ	2020	10km	2000	2023(R05)年02月	なし	薄赤	299.9	316.4	愛知県	4.6	152
2	トヨタ	RAV4	クロカン・ＳＵＶ	2010	7.6万km	2400	車検整備付	あり	白真珠	38.5	56	大阪府	4.8	30
3	トヨタ	RAV4	クロカン・ＳＵＶ	2020	0.5万km	2000	車検整備別	なし	黒Ｍ	375.9	---万円	千葉県	4.8	12
4	トヨタ	RAV4	クロカン・ＳＵＶ	2021	100km	2000	2024(R06)年04月	なし	白	応談	---万円	広島県	-	-
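
One quick way to find values like these is to list every entry in a column that does not parse as a number. The following is a minimal sketch and is not part of the original notebook: `non_numeric_values` is a hypothetical helper, and the sample lists are copied from the rows above.

```python
def non_numeric_values(values):
    """Return the distinct entries that cannot be parsed as a float, sorted."""
    dirty = set()
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            dirty.add(value)
    return sorted(dirty)


# Sample values copied from the table above.
base_price = ["247.0", "299.9", "38.5", "375.9", "応談"]
total = ["259.8", "316.4", "56", "---万円", "---万円"]
evaluation = ["4.9", "4.6", "4.8", "4.8", "-"]

print(non_numeric_values(base_price))
print(non_numeric_values(total))
print(non_numeric_values(evaluation))
```

With pandas, the same check could probably be run on a whole column, for example `non_numeric_values(df["total"].astype(str))`.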

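How the data is processed

The highlights are explained column by column.

mileage
- Most values are written in units of 万 (10,000), such as 4.5万km, but some have no 万, such as 10km.
- The part of the Functions code that handles mileage is shown below.
- For values that contain 万, the numeric part is multiplied by 10,000 so that every value is expressed in km.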

main.py

for i in range(len(df)):
    # Values containing 交換 (replaced), or marked unknown or tampered: treat as missing.
    if "交換" in df.loc[i, "mileage"]:
        df.loc[i, "mileage"] = "-"
    elif df.loc[i, "mileage"] in ["不明", "改ざん車"]:
        df.loc[i, "mileage"] = "-"
    elif df.loc[i, "mileage"] == "-":
        pass
    else:
        # Drop the "km" suffix, then expand values written in 万 (10,000) units.
        df.loc[i, "mileage"] = df.loc[i, "mileage"].replace("km", "")
        if "万" in df.loc[i, "mileage"]:
            # e.g. "4.5万" -> "45000"
            df.loc[i, "mileage"] = str(int(float(df.loc[i, "mileage"][:-1]) * 10000))
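
The rule in the loop above can also be written as a small function that handles one value at a time, which makes it easy to test outside Cloud Functions. This is only a sketch under assumptions: `normalize_mileage` is a hypothetical name that does not appear in the original main.py, and every input is assumed to be a string. It uses `round()` instead of `int()` in case the floating-point multiplication lands slightly below the whole number.

```python
def normalize_mileage(raw):
    """Convert a scraped mileage string to a plain km string, or "-" if unknown."""
    # Values containing 交換, the labels 不明 and 改ざん車, and values that are
    # already "-" are all treated as missing, as in the original loop.
    if "交換" in raw or raw in ("不明", "改ざん車", "-"):
        return "-"
    value = raw.replace("km", "")
    if "万" in value:
        # "4.5万" means 4.5 x 10,000 km.
        return str(round(float(value.replace("万", "")) * 10000))
    return value


for sample in ["4.5万km", "10km", "7.6万km", "不明"]:
    print(sample, "->", normalize_mileage(sample))
```

If this approach is adopted, the whole column could be converted with `df["mileage"] = df["mileage"].map(normalize_mileage)` instead of a row-by-row loop.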

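inspection
- This column holds vehicle inspection (車検) information. Some values are a year and month, while others are non-date strings such as 車検整備付.
- The part of the Functions code that handles inspection is shown below.
- Non-date values such as 新車未登録 and 車検整備付 seem meaningful in their own right, so they are left as they are.
- For year-month values, the first four characters (the year) and the two digits just before the trailing 月 (the month) are extracted, parsed with the pandas to_datetime function, and written back as a YYYY-MM string.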

main.py

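for i in range(len(df)):
    # Non-date labels are kept as they are; only year-month strings are converted.
    if df.loc[i, "inspection"] not in ['新車未登録', '車検整備付', '車検整備別', '車検整備無', '国内未登録']:
        # e.g. "2022(R04)年09月": [:4] is the year "2022", [10:-1] is the month "09".
        yyyymm = df.loc[i, "inspection"][:4] + df.loc[i, "inspection"][10:-1]
        df.loc[i, "inspection"] = pd.to_datetime(yyyymm, format="%Y%m").strftime("%Y-%m")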
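
The slicing above depends on every dated value having exactly the layout of 2022(R04)年09月. As an alternative sketch, a regular expression can pick out the year and month and leave anything unexpected untouched. The names `NON_DATE_LABELS`, `INSPECTION_PATTERN`, and `normalize_inspection` are hypothetical, and the pattern assumes a four-digit year and a two-digit month, as in the sample rows. Unlike the original code, this version does not raise an error on an unrecognized format.

```python
import re

# Labels that are not dates and are kept unchanged (same list as in main.py).
NON_DATE_LABELS = {"新車未登録", "車検整備付", "車検整備別", "車検整備無", "国内未登録"}

# Matches strings such as "2022(R04)年09月": a 4-digit year, an era note in
# parentheses, then a 2-digit month.
INSPECTION_PATTERN = re.compile(r"^(\d{4})\(.*?\)年(\d{2})月$")


def normalize_inspection(raw):
    """Return "YYYY-MM" for dated values; leave any other value unchanged."""
    if raw in NON_DATE_LABELS:
        return raw
    match = INSPECTION_PATTERN.match(raw)
    if match is None:
        # Unexpected format: keep the raw value instead of guessing.
        return raw
    year, month = match.groups()
    return f"{year}-{month}"


for sample in ["2022(R04)年09月", "車検整備付", "2024(R06)年04月"]:
    print(sample, "->", normalize_inspection(sample))
```

Keeping unrecognized values as they are is a design choice for this sketch. Failing loudly, as the original loop would, may be preferable if silent pass-through could hide a format change on the source site.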

base_price
- This column holds the base price of the vehicle.
- Some listings do not disclose a price and show 応談 (negotiable) instead.
- These are rare, so they are converted to "-".

main.py

for i in range(len(df)):
    if df.loc[i, "base_price"] == "応談":
        df.loc[i, "base_price"] = "-"
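
The same replacement can be sketched as a function over a set of placeholder strings. In the sample rows, the total column shows ---万円 before conversion and "-" afterwards, so the sketch assumes that marker is handled the same way. The code that actually handles total is not shown in this section, and `MISSING_PRICE_MARKERS` and `normalize_price` are hypothetical names.

```python
# Placeholders seen in the sample data that mean "no price available".
# "応談" appears in base_price; "---万円" appears in total (assumed to be
# handled the same way).
MISSING_PRICE_MARKERS = {"応談", "---万円"}


def normalize_price(raw):
    """Return "-" for undisclosed prices; leave every other value unchanged."""
    return "-" if raw in MISSING_PRICE_MARKERS else raw


for sample in ["247.0", "応談", "---万円"]:
    print(sample, "->", normalize_price(sample))
```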

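Check the data after conversion

Check the data in BigQuery.
- The output below gathers the rows that correspond to the ones shown above, rather than the actual first five rows.
- Reference: https://cloud.google.com/bigquery/docs/reference/bq-cli-reference#bq_head

Local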

$ bq head -s 0 -n 5 mydataset.RAV4

+--------+-------+------------------+------+---------+--------------+------------+--------+--------+------------+-------+----------+------------+----------------+
| brand  | title |       type       | year | mileage | displacement | inspection | repair | color  | base_price | total | location | evaluation | num_evaluation |
+--------+-------+------------------+------+---------+--------------+------------+--------+--------+------------+-------+----------+------------+----------------+
| トヨタ | RAV4  | クロカン・ＳＵＶ | 2019 | 45000   | 2000         | 2022-09    | なし   | 真珠白 | 247.0      | 259.8 | 滋賀県   | 4.9        | 13             |
+--------+-------+------------------+------+---------+--------------+------------+--------+--------+------------+-------+----------+------------+----------------+
| トヨタ | RAV4  | クロカン・ＳＵＶ | 2020 | 10      | 2000         | 2023-02    | なし   | 薄赤  | 299.9      | 316.4 | 愛知県   | 4.6        | 152            |
+--------+-------+------------------+------+---------+--------------+------------+--------+-------+------------+-------+----------+------------+----------------+
| トヨタ | RAV4  | クロカン・ＳＵＶ | 2010 | 76000   | 2400         | 車検整備付 | あり   | 白真珠 | 38.5       | 56    | 大阪府   | 4.8        | 30             |
+--------+-------+------------------+------+---------+--------------+------------+--------+--------+------------+-------+----------+------------+----------------+
| トヨタ | RAV4  | クロカン・ＳＵＶ | 2020 | 5000    | 2000         | 車検整備別 | なし   | 黒Ｍ  | 375.9      | -     | 千葉県   | 4.8        | 12             |
+--------+-------+------------------+------+---------+--------------+------------+--------+-------+------------+-------+----------+------------+----------------+
| トヨタ | RAV4  | クロカン・ＳＵＶ | 2021 | 100     | 2000         | 2024-04    | なし   | 白     | -          | -     | 広島県   | -          | -              |
+--------+-------+------------------+------+---------+--------------+------------+--------+-------+------------+-------+----------+------------+----------------+

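As shown above, the data is cleaner than it was at the start.
Now that the data is in BigQuery, I plan to visualize it with Google Data Portal and load it into pandas for analysis.

Conclusion

By combining GCP services, I was able to build a platform for collecting, processing, and storing data, working from the command line.
Combining the pieces into a single system felt like solving a puzzle and was a lot of fun.
There are many tools beyond GCP, so I plan to keep combining them to build other systems.
Thank you for reading to the end!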
