Python
Beginner
docker

Studying Scraping & Machine Learning with Python, Part 0: Preparation and Environment Setup

Background

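At the time of writing, I am a beginner with about one week of Python experience.
In this series I work through a textbook (reference 1) to pick up programming skills, mainly for machine learning and deep learning. The learning here is done by a human, not a machine.
The textbook has an official support page from the publisher where the sample code can be downloaded, but out of respect for copyright I will avoid reproducing the book's source code wholesale in these articles.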

Goal for this article

Before starting on the textbook, I set up a Python development environment. To keep it isolated as a dedicated study environment, I build it in a Docker container.

Method and results

My main machine is:
* MacBook Pro 13inch (2017), Core i5 3.1GHz / 16GB RAM
* OS: macOS High Sierra 10.13.4
Docker was already installed during a previous study series.

Creating the Docker image for studying

First, as in my previous study series, I use the miniconda image as the base Docker image and commit a new image from it.

$ docker run -t -i conda/miniconda3 /bin/bash
# exit
$ docker ps -l
CONTAINER ID        IMAGE               COMMAND             CREATED                  STATUS                     PORTS               NAMES
88775d7bbf31        conda/miniconda3    "/bin/bash"         Less than a second ago   Exited (0) 3 seconds ago                       elated_curie
$ docker commit 88775d7bbf31 pylearn2:latest
sha256:34f6bf8e93cba3c32d16aeae8e9b853c0774a3fc01ffa1d015a7808c3f2fa7c6
$ docker images | grep pylearn2
pylearn2                 latest              34f6bf8e93cb        29 seconds ago      228MB

This gives a new study image, pylearn2. Next I install the commonly used libraries into it and commit the image again to fix its state.

$ docker run -i -t pylearn2 /bin/bash
#Update conda
# conda update --all
Fetching package metadata .........
Solving package specifications: .

Package plan for installation in environment /usr/local:

The following NEW packages will be INSTALLED:

    certifi:    2016.2.28-py36_0

The following packages will be UPDATED:

    conda:      4.3.22-py36_0    --> 4.3.30-py36h5d9f9f4_0
    idna:       2.5-py36_0       --> 2.6-py36_0           
    pycparser:  2.17-py36_0      --> 2.18-py36_0          
    pyparsing:  2.1.4-py36_0     --> 2.2.0-py36_0         
    python:     3.6.1-2          --> 3.6.2-0              
    setuptools: 27.2.0-py36_0    --> 36.4.0-py36_1        
    xz:         5.2.2-1          --> 5.2.3-0              
    zlib:       1.2.8-3          --> 1.2.11-0             

Proceed ([y]/n)? y
#(rest of output omitted)

#Update pip
# pip install --upgrade pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/0f/74/ecd13431bcc456ed390b44c8a6e917c1820365cbebcb6a8974d1cd045ab4/pip-10.0.1-py2.py3-none-any.whl (1.3MB)
    100% |################################| 1.3MB 585kB/s 
Installing collected packages: pip
  Found existing installation: pip 9.0.1
    Uninstalling pip-9.0.1:
      Successfully uninstalled pip-9.0.1
Successfully installed pip-10.0.1

#Install Pillow
# pip install pillow
Collecting pillow
  Downloading https://files.pythonhosted.org/packages/5f/4b/8b54ab9d37b93998c81b364557dff9f61972c0f650efa0ceaf470b392740/Pillow-5.1.0-cp36-cp36m-manylinux1_x86_64.whl (2.0MB)
    100% |################################| 2.0MB 542kB/s 
Installing collected packages: pillow
Successfully installed pillow-5.1.0

#Install requests
# pip install requests
Requirement already satisfied: requests in /usr/local/lib/python3.6/site-packages (2.14.2)

#Install BeautifulSoup4
# pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/9e/d4/10f46e5cfac773e22707237bfcd51bbffeaf0a576b0a847ec7ab15bd7ace/beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
    100% |################################| 92kB 1.1MB/s 
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0

#Install PyYAML
# pip install pyyaml
Collecting pyyaml
  Downloading https://files.pythonhosted.org/packages/4a/85/db5a2df477072b2902b0eb892feb37d88ac635d36245a72a6a69b23b383a/PyYAML-3.12.tar.gz (253kB)
    100% |################################| 256kB 185kB/s 
Building wheels for collected packages: pyyaml
  Running setup.py bdist_wheel for pyyaml ... done
  Stored in directory: /root/.cache/pip/wheels/03/05/65/bdc14f2c6e09e82ae3e0f13d021e1b6b2481437ea2f207df3f
Successfully built pyyaml
Installing collected packages: pyyaml
Successfully installed pyyaml-3.12


#Install Pandas
# pip install pandas
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/69/ec/8ff0800b8594691759b78a42ccd616f81e7099ee47b167eb9bbd502c02b9/pandas-0.23.0-cp36-cp36m-manylinux1_x86_64.whl (11.7MB)
    100% |################################| 11.7MB 699kB/s 
Collecting python-dateutil>=2.5.0 (from pandas)
  Downloading https://files.pythonhosted.org/packages/cf/f5/af2b09c957ace60dcfac112b669c45c8c97e32f94aa8b56da4c6d1682825/python_dateutil-2.7.3-py2.py3-none-any.whl (211kB)
    100% |################################| 215kB 850kB/s 
Collecting pytz>=2011k (from pandas)
  Downloading https://files.pythonhosted.org/packages/dc/83/15f7833b70d3e067ca91467ca245bae0f6fe56ddc7451aa0dc5606b120f2/pytz-2018.4-py2.py3-none-any.whl (510kB)
    100% |################################| 512kB 236kB/s 
Collecting numpy>=1.9.0 (from pandas)
  Downloading https://files.pythonhosted.org/packages/71/90/ca61e203e0080a8cef7ac21eca199829fa8d997f7c4da3e985b49d0a107d/numpy-1.14.3-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
    100% |################################| 12.2MB 340kB/s 
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas) (1.10.0)
Installing collected packages: python-dateutil, pytz, numpy, pandas
Successfully installed numpy-1.14.3 pandas-0.23.0 python-dateutil-2.7.3 pytz-2018.4

#Install scipy
# pip install scipy
Collecting scipy
  Downloading https://files.pythonhosted.org/packages/a8/0b/f163da98d3a01b3e0ef1cab8dd2123c34aee2bafbb1c5bffa354cc8a1730/scipy-1.1.0-cp36-cp36m-manylinux1_x86_64.whl (31.2MB)
    100% |################################| 31.2MB 888kB/s 
Requirement already satisfied: numpy>=1.8.2 in /usr/local/lib/python3.6/site-packages (from scipy) (1.14.3)
Installing collected packages: scipy
Successfully installed scipy-1.1.0


#Install matplotlib
# pip install matplotlib
Collecting matplotlib
  Downloading https://files.pythonhosted.org/packages/49/b8/89dbd27f2fb171ce753bb56220d4d4f6dbc5fe32b95d8edc4415782ef07f/matplotlib-2.2.2-cp36-cp36m-manylinux1_x86_64.whl (12.6MB)
    100% |################################| 12.6MB 799kB/s 
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/site-packages (from matplotlib) (2.7.3)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/site-packages (from matplotlib) (1.10.0)
Requirement already satisfied: numpy>=1.7.1 in /usr/local/lib/python3.6/site-packages (from matplotlib) (1.14.3)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib) (2.2.0)
Collecting kiwisolver>=1.0.1 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/69/a7/88719d132b18300b4369fbffa741841cfd36d1e637e1990f27929945b538/kiwisolver-1.0.1-cp36-cp36m-manylinux1_x86_64.whl (949kB)
    100% |################################| 952kB 1.6MB/s 
Requirement already satisfied: pytz in /usr/local/lib/python3.6/site-packages (from matplotlib) (2018.4)
Collecting cycler>=0.10 (from matplotlib)
  Downloading https://files.pythonhosted.org/packages/f7/d2/e07d3ebb2bd7af696440ce7e754c59dd546ffe1bbe732c8ab68b9c834e61/cycler-0.10.0-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib) (36.4.0)
Installing collected packages: kiwisolver, cycler, matplotlib
Successfully installed cycler-0.10.0 kiwisolver-1.0.1 matplotlib-2.2.2

#Install scikit-image
# pip install scikit-image
Collecting scikit-image
  Downloading https://files.pythonhosted.org/packages/34/79/cefff573a53ca3fb4c390739d19541b95f371e24d2990aed4cd8837971f0/scikit_image-0.14.0-cp36-cp36m-manylinux1_x86_64.whl (25.3MB)
    100% |################################| 25.3MB 1.4MB/s 
Collecting PyWavelets>=0.4.0 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/32/c0/3646053c0ce297686da524bc968bff6017151a9089d16c33afe7d330a48b/PyWavelets-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (5.7MB)
    100% |################################| 5.7MB 2.7MB/s 
Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.6/site-packages (from scikit-image) (1.10.0)
Requirement already satisfied: matplotlib>=2.0.0 in /usr/local/lib/python3.6/site-packages (from scikit-image) (2.2.2)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/site-packages (from scikit-image) (1.1.0)
Collecting cloudpickle>=0.2.1 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/e7/bf/60ae7ec1e8c6742d2abbb6819c39a48ee796793bcdb7e1d5e41a3e379ddd/cloudpickle-0.5.3-py2.py3-none-any.whl
Requirement already satisfied: pillow>=4.3.0 in /usr/local/lib/python3.6/site-packages (from scikit-image) (5.1.0)
Collecting networkx>=1.8 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/11/42/f951cc6838a4dff6ce57211c4d7f8444809ccbe2134179950301e5c4c83c/networkx-2.1.zip (1.6MB)
    100% |################################| 1.6MB 972kB/s 
Collecting dask[array]>=0.9.0 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/91/1a/71be14f468f8f3f94e708afd5662cf75a0ca33a78924ca9f129a9c45c66b/dask-0.17.5-py3-none-any.whl (598kB)
    100% |################################| 604kB 768kB/s 
Requirement already satisfied: numpy>=1.9.1 in /usr/local/lib/python3.6/site-packages (from PyWavelets>=0.4.0->scikit-image) (1.14.3)
Requirement already satisfied: pytz in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->scikit-image) (2018.4)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->scikit-image) (1.0.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->scikit-image) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->scikit-image) (2.2.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/site-packages (from matplotlib>=2.0.0->scikit-image) (2.7.3)
Collecting decorator>=4.1.0 (from networkx>=1.8->scikit-image)
  Downloading https://files.pythonhosted.org/packages/bc/bb/a24838832ba35baf52f32ab1a49b906b5f82fb7c76b2f6a7e35e140bac30/decorator-4.3.0-py2.py3-none-any.whl
Collecting toolz>=0.7.3; extra == "array" (from dask[array]>=0.9.0->scikit-image)
  Downloading https://files.pythonhosted.org/packages/14/d0/a73c15bbeda3d2e7b381a36afb0d9cd770a9f4adc5d1532691013ba881db/toolz-0.9.0.tar.gz (45kB)
    100% |################################| 51kB 2.7MB/s 
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib>=2.0.0->scikit-image) (36.4.0)
Building wheels for collected packages: networkx, toolz
  Running setup.py bdist_wheel for networkx ... done
  Stored in directory: /root/.cache/pip/wheels/44/c0/34/6f98693a554301bdb405f8d65d95bbcd3e50180cbfdd98a94e
  Running setup.py bdist_wheel for toolz ... done
  Stored in directory: /root/.cache/pip/wheels/f4/0c/f6/ce6b2d1aa459ee97cc3c0f82236302bd62d89c86c700219463
Successfully built networkx toolz
Installing collected packages: PyWavelets, cloudpickle, decorator, networkx, toolz, dask, scikit-image
Successfully installed PyWavelets-0.5.2 cloudpickle-0.5.3 dask-0.17.5 decorator-4.3.0 networkx-2.1 scikit-image-0.14.0 toolz-0.9.0


#Install scikit-learn
# pip install scikit-learn
Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/3d/2d/9fbc7baa5f44bc9e88ffb7ed32721b879bfa416573e85031e16f52569bc9/scikit_learn-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (12.4MB)
    100% |################################| 12.4MB 2.3MB/s 
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.19.1

#Install Janome
# pip install janome
Collecting janome
  Downloading https://files.pythonhosted.org/packages/b4/7b/6f4fa5243a235cd682693b448f05afacedb2b10fc2efea3369d6336ab83b/Janome-0.3.6.tar.gz (20.0MB)
    100% |################################| 20.0MB 1.3MB/s 
Building wheels for collected packages: janome
  Running setup.py bdist_wheel for janome ... done
  Stored in directory: /root/.cache/pip/wheels/53/60/be/fe884e2d0ebc9fec0988736cf08a2820ab34e3569fc0c5a25a
Successfully built janome
Installing collected packages: janome
Successfully installed janome-0.3.6
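
Before committing the image, it can be useful to confirm that everything installed above is actually importable. The following is a minimal sketch, not part of the textbook or of my original procedure: the IMPORT_NAMES table and the missing_modules helper are names I made up for illustration, and the pip-name to import-name mapping is written by hand.

```python
import importlib.util

# Hand-written mapping for this sketch: pip package name -> import name.
# Several packages are imported under a different name than they are installed with.
IMPORT_NAMES = {
    "pillow": "PIL",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "pyyaml": "yaml",
    "pandas": "pandas",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "scikit-image": "skimage",
    "scikit-learn": "sklearn",
    "janome": "janome",
}


def missing_modules(import_names):
    """Return the names from import_names that Python cannot locate for import."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(IMPORT_NAMES.values())
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All libraries are importable.")
```

Saved as a file inside the container and run with python, this only checks that each module can be located; it does not import the libraries or verify their versions.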

The textbook appears to use other libraries as well, but this set is enough for now, so I update the Docker image.

# exit #inside the Docker container
$ docker ps -l
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                      PORTS               NAMES
7472fcb46134        pylearn2            "/bin/bash"         26 minutes ago      Exited (0) 16 seconds ago                       eager_leavitt
$ docker commit 7472fcb46134 pylearn2:latest
sha256:2c59b5ead597ebf6c1c20b259a189c6721fcfe07ade9aba08e13df9cdab6d85e
$ docker images | grep pylearn2
pylearn2                 latest              2c59b5ead597        14 seconds ago      1.07GB
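
Because the image was built by hand with docker commit, the installed versions are recorded only in the logs above. As a sketch of one way to keep them reproducible, the snippet below turns those versions into pinned lines in the format pip accepts with pip install -r. The version numbers are copied from the pip output in this article; writing a requirements file at all is my own addition, and PINNED and requirements_lines are hypothetical names.

```python
# Versions copied from the pip logs in this article (top-level packages only).
PINNED = {
    "pillow": "5.1.0",
    "requests": "2.14.2",
    "beautifulsoup4": "4.6.0",
    "pyyaml": "3.12",
    "pandas": "0.23.0",
    "scipy": "1.1.0",
    "matplotlib": "2.2.2",
    "scikit-image": "0.14.0",
    "scikit-learn": "0.19.1",
    "janome": "0.3.6",
}


def requirements_lines(pins):
    """Format a {package: version} mapping as sorted 'package==version' lines."""
    return ["{}=={}".format(name, version)
            for name, version in sorted(pins.items())]


if __name__ == "__main__":
    # Print the lines; redirect the output to requirements.txt if desired.
    print("\n".join(requirements_lines(PINNED)))
```

Dependencies pulled in automatically (numpy, dask, networkx and so on) are left out here, so pip would resolve them again at install time.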

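This completes the study environment image with the various libraries installed. For future study sessions, the following command starts a container with the source directory mounted:

$ docker run -t -i -v $HOME/src:$HOME/src pylearn2 /bin/bash

Because $HOME is expanded by the host shell before docker runs, the directory appears at the same absolute path inside the container as on the host.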
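
To make the structure of that command explicit, here is a small sketch that assembles the same argument list in Python. It only builds and prints the command and does not run docker; build_run_command is a hypothetical helper, and the default image name and shell are simply the ones used in this article.

```python
import shlex
from pathlib import Path


def build_run_command(src_dir, image="pylearn2", shell="/bin/bash"):
    """Build the docker run argument list with src_dir mounted at the same path."""
    src = str(Path(src_dir).expanduser())
    # -v takes host_path:container_path; the same path is used on both sides.
    return ["docker", "run", "-t", "-i", "-v", "{0}:{0}".format(src), image, shell]


if __name__ == "__main__":
    cmd = build_run_command("~/src")
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the host and container paths identical means file paths shown in error messages inside the container match the ones on the host.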

What was achieved this time

  • Built a Docker image for studying scraping & machine learning.

References

  1. クジラ飛行机, Pythonによるスクレイピング&機械学習[開発テクニック], ソシム株式会社, 2016