Building a data analysis environment on a Windows 10 PC (VirtualBox/Vagrant + Kubernetes + JupyterHub/JupyterLab)

Last updated at 2020-10-18 / Posted at 2020-08-23

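Introduction

In previous articles (※1, ※2), I set up an analysis environment on a Windows 10 PC.

That setup works fine for individual work, whether at home or at the office, but using it for team work at a company raises several questions:

User management (how do we authenticate and authorize users?)
Resource management (how do we allocate resources among users?)
Image management (how do we keep environments consistent?)
Data management (where do we store the data?)

JupyterHub is apparently designed to cover exactly these considerations for organizational use by multiple data scientists.
So in this article I build an analysis environment based on JupyterHub.
Since I cannot prepare several physical PCs, I use virtual machines on VirtualBox instead.

The figure below shows the analysis environment this article aims for.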

What this article covers

Installing Kubernetes
Installing Helm
Installing JupyterHub
Configuring JupyterHub

14 Steps to Install kubernetes on Ubuntu 18.04 and 16.04 ("hashicorp/bionic64") -
Project Jupyter | JupyterHub
JupyterHub — JupyterHub 1.1.0 documentation
Zero to JupyterHub with Kubernetes — Zero to JupyterHub with Kubernetes 0.0.1-set.by.chartpress documentation
The JupyterHub Architecture — Zero to JupyterHub with Kubernetes 0.0.1-set.by.chartpress documentation

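What this article does not cover

Setting up Windows 10 (including the web browser)　※Windows 10 Pro Ver. 1909
Installing VirtualBox　※VirtualBox 6.1.10 is used
Installing Vagrant　※Vagrant 2.2.9 is used
How Kubernetes works
Installing Active Directory (including its configuration)
Installing the NFS server (including its configuration)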

Oracle VM VirtualBox
Vagrant by HashiCorp
【まとめ】Vagrant コマンド一覧 (Summary: Vagrant command list)
Kubernetesのコンポーネント | Kubernetes (Kubernetes Components)

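Environment setup

There are four main steps.

Installing Kubernetes
Installing Helm
Installing JupyterHub
Configuring JupyterHub

The virtual machines that make up the environment in this article are listed below.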

OS	Role	IP-Addr	Hostname	CPU	MEM	DISK	Startup
Windows Server 2012 R2 Standard	Active Directory（LDAP）	100.0.0.10	winserv2012r2	2Core	2GB	50GB	Manual
Ubuntu 18.04.4 LTS	K8s-Master	100.0.0.1	master	4Core	8GB	50GB	Vagrant
Ubuntu 18.04.4 LTS	K8s-Worker	100.0.0.2	worker	4Core	8GB	50GB	Vagrant
CentOS 7.6.1810	NFS-Server	100.0.0.4	centos76	2Core	2GB	64GB	Vagrant
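
Before creating the four virtual machines, it helps to check whether the Windows 10 host can carry them all at once. The following is a minimal sketch that only adds up the figures from the table above; the function name `totals` is my own, and the disk total is an upper bound if the virtual disks are dynamically allocated.

```python
# VM specs copied from the table above: (hostname, cpu_cores, memory_gb, disk_gb)
VMS = [
    ("winserv2012r2", 2, 2, 50),
    ("master", 4, 8, 50),
    ("worker", 4, 8, 50),
    ("centos76", 2, 2, 64),
]


def totals(vms):
    """Sum CPU cores, memory (GB) and disk (GB) over all virtual machines."""
    return (
        sum(cpu for _, cpu, _, _ in vms),
        sum(mem for _, _, mem, _ in vms),
        sum(disk for _, _, _, disk in vms),
    )


cpu, mem, disk = totals(VMS)
print(f"vCPU: {cpu}, memory: {mem} GB, disk: {disk} GB")
```

With all four machines running, the host needs roughly this much memory on top of what Windows itself uses.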

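1. Installing Kubernetes

I installed Kubernetes by following the procedure at the link below.
As in that procedure, the cluster has two nodes in total (one Master and one Worker).
14 Steps to Install kubernetes on Ubuntu 18.04 and 16.04 ("hashicorp/bionic64") -

The state after installation is shown below.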

$ kubectl get nodes
NAME     STATUS   ROLES    AGE    VERSION
master   Ready    master   7d1h   v1.16.8
worker   Ready    node     7d1h   v1.16.8
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.8", GitCommit:"ec6eb119b81be488b030e849b9e64fda4caaf33c", GitTreeState:"clean", BuildDate:"2020-03-12T21:00:06Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.14", GitCommit:"d2a081c8e14e21e28fe5bdfa38a817ef9c0bb8e3", GitTreeState:"clean", BuildDate:"2020-08-13T12:24:51Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

The changes from the original procedure are as follows.

Virtual machine specs

Both nodes are raised to 4 CPU cores and 8 GB of memory, and each is allocated a 50 GB disk.

Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.define "master" do |master|
    master.vm.box_download_insecure = true
    master.vm.box = "ubuntu/bionic64" # Ubuntu 18.04
    master.vm.network "private_network", ip: "100.0.0.1"
    master.vm.hostname = "master"
    master.vm.box_version = "20200525.0.0"
    master.vm.box_check_update = false
    master.disksize.size = '50GB'
    master.vm.provider "virtualbox" do |v|
      v.gui = false
      v.memory = 8192
      v.cpus = 4
    end
  end

  config.vm.define "worker" do |worker|
    worker.vm.box_download_insecure = true 
    worker.vm.box = "ubuntu/bionic64" # Ubuntu 18.04
    worker.vm.network "private_network", ip: "100.0.0.2"
    worker.vm.hostname = "worker"
    worker.vm.box_version = "20200525.0.0"
    worker.vm.box_check_update = false
    worker.disksize.size = '50GB'
    worker.vm.provider "virtualbox" do |v|
      v.gui = false
      v.memory = 8192
      v.cpus = 4
    end
  end
end

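The Vagrant plugin versions in my environment are listed below. The plugins must be installed before running the vagrant up command (the disksize setting in the Vagrantfile requires vagrant-disksize).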

> vagrant plugin list
vagrant-disksize (0.1.3, global)
vagrant-proxyconf (2.0.8, global)
vagrant-vbguest (0.24.0, global)

Kubernetes version

Pin the versions of the Kubernetes components (kubelet, kubeadm, kubectl).

$ sudo apt-get update
$ sudo apt-get install -y kubelet=1.16.8-00 kubeadm=1.16.8-00 kubectl=1.16.8-00

2. Installing Helm

I installed Helm by following the procedure at the link below.
Setting up Helm — Zero to JupyterHub with Kubernetes 0.0.1-set.by.chartpress documentation

The state after installation is shown below.

$ helm version
Client: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}

3. Installing JupyterHub

I installed JupyterHub by following the procedure at the link below.
Setting up JupyterHub — Zero to JupyterHub with Kubernetes 0.0.1-set.by.chartpress documentation

The state after installation is shown below.

$ kubectl get pods,service -o wide --all-namespaces
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
jhub          pod/hub-578c55c475-ft42j              1/1     Running   0          15h     10.244.1.114   worker   <none>           <none>
jhub          pod/proxy-5df958fb79-r725p            1/1     Running   0          16h     10.244.1.105   worker   <none>           <none>
jhub          pod/user-scheduler-5d97fdd64f-dctdr   1/1     Running   0          16h     10.244.1.103   worker   <none>           <none>
jhub          pod/user-scheduler-5d97fdd64f-qj4kj   1/1     Running   0          16h     10.244.1.102   worker   <none>           <none>
kube-system   pod/coredns-5644d7b6d9-ms5zc          1/1     Running   2          7d2h    10.244.0.7     master   <none>           <none>
kube-system   pod/coredns-5644d7b6d9-qp5m7          1/1     Running   2          7d2h    10.244.0.6     master   <none>           <none>
kube-system   pod/etcd-master                       1/1     Running   3          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/kube-apiserver-master             1/1     Running   3          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/kube-controller-manager-master    1/1     Running   3          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/kube-flannel-ds-amd64-5rrqr       1/1     Running   4          7d2h    100.0.0.2      worker   <none>           <none>
kube-system   pod/kube-flannel-ds-amd64-rzzjs       1/1     Running   2          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/kube-proxy-njb86                  1/1     Running   2          7d2h    100.0.0.2      worker   <none>           <none>
kube-system   pod/kube-proxy-zs5vd                  1/1     Running   2          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/kube-scheduler-master             1/1     Running   3          7d2h    100.0.0.1      master   <none>           <none>
kube-system   pod/tiller-deploy-7fd74dc67c-hrmrx    1/1     Running   1          6d23h   10.244.1.22    worker   <none>           <none>

NAMESPACE     NAME                    TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
default       service/kubernetes      ClusterIP      10.96.0.1        <none>        443/TCP                      7d2h    <none>
jhub          service/hub             ClusterIP      10.101.67.145    <none>        8081/TCP                     6d23h   app=jupyterhub,component=hub,release=jhub
jhub          service/proxy-api       ClusterIP      10.103.182.69    <none>        8001/TCP                     6d23h   app=jupyterhub,component=proxy,release=jhub
jhub          service/proxy-public    LoadBalancer   10.101.206.109   100.0.0.1     443:32132/TCP,80:31124/TCP   6d23h   component=proxy,release=jhub
kube-system   service/kube-dns        ClusterIP      10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP       7d2h    k8s-app=kube-dns
kube-system   service/tiller-deploy   ClusterIP      10.107.4.92      <none>        44134/TCP                    6d23h   app=helm,name=tiller

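4. Configuring JupyterHub

The deployment is customized with the organizational considerations (multiple data scientists) from the introduction in mind.

4.1. User management

Active Directory (AD) is used for authentication.
However, not every user registered in AD should be able to log in. Define a security group, add the users of the analysis environment as its members, and use that group for authorization.

Notes on building Active Directory:

1.Download the Windows Server 2012 R2 Standard ISO file from the **Microsoft site**
2.Set up Windows Server manually on VirtualBox
3.Install Active Directory
4.Create the "People" OU
5.Create the "JupyterAllowed" security group under the "People" OU
6.Create a user under the "People" OU
7.Add the user as a member of the "JupyterAllowed" security group
8.Repeat steps 6 and 7 for each user

The configuration for user management follows.
It is based on **Example Active Directory Configuration** in the official documentation.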

config.yaml

auth:
  type: ldap
  ldap:
    server:
      address: 100.0.0.10:389
    dn:
      lookup: false
      templates:
        - 'CN={username},OU=People,DC=demo,DC=da,DC=com'
      user:
        searchBase: 'OU=People,DC=demo,DC=da,DC=com'
        escape: False
        attribute: 'sAMAccountName'
        dnAttribute: 'cn'
    allowedGroups:
      - 'CN=JupyterAllowed,OU=People,DC=demo,DC=da,DC=com'
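
To make the roles of `templates` and `allowedGroups` more concrete, here is a simplified sketch of the idea: the user name is substituted into the DN template for authentication, and the resulting DN must appear among the group's members for authorization. This is only an illustration, not the authenticator's actual implementation; the helper names and the sample member list are hypothetical.

```python
# DN template taken from the config.yaml above
BIND_DN_TEMPLATE = "CN={username},OU=People,DC=demo,DC=da,DC=com"


def bind_dn(username, template=BIND_DN_TEMPLATE):
    """Build the DN used to bind to the directory for the given user name."""
    return template.format(username=username)


def is_authorized(user_dn, group_members):
    """Return True if user_dn is listed among the group's member DNs.

    Assumption: DNs are compared case-insensitively, which is a
    simplification of real LDAP matching rules.
    """
    wanted = user_dn.lower()
    return any(member.lower() == wanted for member in group_members)


# Hypothetical member list of the JupyterAllowed security group
members = ["CN=alice,OU=People,DC=demo,DC=da,DC=com"]

print(bind_dn("alice"))
print(is_authorized(bind_dn("alice"), members))  # alice is a member
print(is_authorized(bind_dn("bob"), members))    # bob exists in AD but is not a member
```

In this sketch a user who can authenticate against AD but is not in the group is still rejected, which is the behavior section 4.1 is after.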

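4.2. Resource management

Because resources are shared within the organization, upper limits are placed on each JupyterLab container.
The limits apply to CPU and memory.

The configuration for resource management follows.
It is based on **Set user memory and CPU guarantees / limits** in the official documentation.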

config.yaml

singleuser:
  cpu:
    limit: .5
    guarantee: .5
  memory:
    limit: 256M
    guarantee: 256M
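
As a rough sanity check of these values, the sketch below estimates how many user containers fit on the single Worker node (4 cores, 8 GB) when each one is guaranteed 0.5 CPU and 256M of memory. It ignores the resources used by system pods and the hub itself, and it assumes that the `K`/`M`/`G` suffixes are binary multiples; both the helper names and that assumption are mine.

```python
# Assumption: suffixes are binary multiples (1M = 1024 * 1024 bytes)
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(spec):
    """Convert a value such as '256M' or '8G' to a number of bytes."""
    if isinstance(spec, (int, float)):
        return int(spec)
    return int(float(spec[:-1]) * UNITS[spec[-1]])


def max_users(node_cpus, node_mem, cpu_guarantee, mem_guarantee):
    """Upper bound on concurrent users, limited by CPU or memory guarantees."""
    by_cpu = int(node_cpus // cpu_guarantee)
    by_mem = to_bytes(node_mem) // to_bytes(mem_guarantee)
    return min(by_cpu, by_mem)


# Worker node from the table of virtual machines: 4 cores, 8 GB
print(max_users(4, "8G", 0.5, "256M"))
```

With these numbers CPU is the bottleneck (8 users by CPU versus 32 by memory), so lowering the CPU guarantee is the first knob to turn if more users need to share the node.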

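4.3. Image management

The required libraries differ depending on the analysis theme worked on in JupyterLab (the purpose for which the environment is used), so users choose from several container images when starting a container.
The choices are the **official Jupyter images**.

The configuration for image management follows.
It is based on **Using multiple profiles to let users select their environment** in the official documentation.
In connection with the resource management above, the resource limits can also be overridden per container image.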

config.yaml

singleuser:
  profileList:
    - display_name: "Minimal environment"
      description: "[Resource] CPU:0.5/Mem:256MB<br />Documentation conversion tools such as pandoc and texlive have been added to base-notebook."
      kubespawner_override:
        image: jupyter/minimal-notebook:2343e33dec46
      default: true
    - display_name: "Scipy environment"
      description: "[Resource] CPU:1.0/Mem:2GB<br />Contains data analysis libraries in Python such as pandas and scikit-learn."
      kubespawner_override:
        image: jupyter/scipy-notebook:2343e33dec46
        cpu_limit: 1
        mem_limit: '2G'
    - display_name: "Tensorflow environment"
      description: "[Resource] CPU:1.5/Mem:3GB<br />Tensorflow has been added to scipy-notebook. There is no GPU support."
      kubespawner_override:
        image: jupyter/tensorflow-notebook:2343e33dec46
        cpu_limit: 1.5
        mem_limit: '3G'
    - display_name: "Datascience environment"
      description: "[Resource] CPU:2.0/Mem:4GB<br />R and Julia has been added to scipy-notebook."
      kubespawner_override:
        image: jupyter/datascience-notebook:2343e33dec46
        cpu_limit: 2
        mem_limit: '4G'
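
The override behavior can be pictured as a dictionary merge: each profile starts from the defaults in section 4.2 and replaces only the keys listed under `kubespawner_override`. The sketch below illustrates that idea with two of the profiles above; it is not KubeSpawner's actual code, and `effective_settings` is a name I made up.

```python
# Default limits from section 4.2
DEFAULTS = {"cpu_limit": 0.5, "mem_limit": "256M"}

# Two of the profiles from the config.yaml above
PROFILES = [
    {
        "display_name": "Minimal environment",
        "kubespawner_override": {
            "image": "jupyter/minimal-notebook:2343e33dec46",
        },
    },
    {
        "display_name": "Scipy environment",
        "kubespawner_override": {
            "image": "jupyter/scipy-notebook:2343e33dec46",
            "cpu_limit": 1,
            "mem_limit": "2G",
        },
    },
]


def effective_settings(profile, defaults=DEFAULTS):
    """Merge a profile's overrides on top of the default settings."""
    merged = dict(defaults)
    merged.update(profile.get("kubespawner_override", {}))
    return merged


for profile in PROFILES:
    print(profile["display_name"], effective_settings(profile))
```

Note that the profiles override only the limits, so the guarantees stay at the defaults (0.5 CPU, 256M) for every image.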

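4.4. Data management

An external NFS server is used for persistent volumes.
A persistent volume is provided per user, and the NFS server's capacity is shared across the organization (each user gets a separate area, but no per-user capacity limit is enforced).

Preparing the persistent volumes takes three main steps, explained in order.

Build the NFS server
Define a PV (PersistentVolume) and a PVC (PersistentVolumeClaim) in Kubernetes
Configure storage in JupyterHub

4.4.1. Building the NFS server

Notes on building the NFS server:

1.Create a Vagrantfile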

Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box_download_insecure = true
  config.vm.box = "bento/centos-7.6"
  config.vm.network "private_network", ip: "100.0.0.4"
  config.vm.hostname = "centos76"
  config.vm.box_version = "201812.27.0"
  config.vm.box_check_update = false
  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.cpus = "2"
    vb.memory = "2048"
  end

  # config.vm.provision "shell", inline: <<-SHELL
  #   apt-get update
  #   apt-get install -y apache2
  # SHELL
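  config.vm.provision "shell", inline: <<-SHELL
    # Set the timezone to Japan
    timedatectl set-timezone Asia/Tokyo
    # Set the locale to Japanese
    localectl set-locale LANG=ja_JP.UTF-8
    # Activate the virtualenv only if it exists; on a fresh box it does not,
    # and a failing last command would abort provisioning
    if [ -f /home/vagrant/venv/bin/activate ]; then
      source /home/vagrant/venv/bin/activate
    fi
  SHELL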
  #config.vm.provision :shell, path: "nfs-installer.sh"
end

2.Start the virtual machine with the vagrant up command

> vagrant up

3.Install the NFS server　★See "CentOS7にNFSサーバーを秒で構築" (Build an NFS server on CentOS 7 in seconds)
4.Create the directories to export as personal volumes

$ sudo mkdir -p /mnt/personal/pv001
$ sudo mkdir -p /mnt/personal/pv002

5.Change the directory permissions (read, write and execute for owner, group and others)

$ sudo chmod 777 -R /mnt/personal/

6.Edit the NFS export settings in /etc/exports

$ cat /etc/exports
/mnt/personal/pv001 100.0.0.0/24(rw,no_root_squash)
/mnt/personal/pv002 100.0.0.0/24(rw,no_root_squash)

7.Apply the NFS export settings

$ sudo exportfs -ra

8.Verify the NFS export settings

$ sudo exportfs -v
/mnt/personal/pv001
                100.0.0.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
/mnt/personal/pv002
                100.0.0.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
$ sudo showmount -e
Export list for centos76:
/mnt/personal/pv002 100.0.0.0/24
/mnt/personal/pv001 100.0.0.0/24
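
Since one export is needed per user, writing /etc/exports by hand gets tedious as the team grows. The following sketch generates the lines in the same format as above; the function name and its default arguments are my own, with the defaults copied from the settings shown in this section.

```python
def export_lines(count, base="/mnt/personal", network="100.0.0.0/24",
                 options="rw,no_root_squash"):
    """Return /etc/exports lines for volumes pv001 .. pv<count>."""
    return [
        f"{base}/pv{i:03d} {network}({options})"
        for i in range(1, count + 1)
    ]


for line in export_lines(2):
    print(line)
```

The output can be pasted into /etc/exports, after which steps 7 and 8 apply and verify it as before. The directories themselves still have to be created as in steps 4 and 5.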

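4.4.2. Defining the PV and PVC in Kubernetes

With the NFS server in place, the next step is to configure Kubernetes to use the external storage.
Volumes can be provided by either Manual Volume Provisioning or Dynamic Volume Provisioning; this article adopts the former and defines the PV and PVC in advance.

Notes on defining the PV and PVC:

1.Create the YAML for the PV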

config_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-home-001
  labels:
    name: pv-nfs-home-jupyter
    type: nfs
    environment: stg
  annotations:
    "volume.beta.kubernetes.io/storage-class": "slow"
  namespace: jhub
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: slow
  mountOptions:
    - hard
  nfs:
    path: /mnt/personal/pv001
    server: 100.0.0.4

2.Create the YAML for the PVC

config_pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-home-jupyter
  annotations:
    "volume.beta.kubernetes.io/storage-class": "slow"
  namespace: jhub
spec:
  selector:
    matchLabels:
      name: pv-nfs-home-jupyter
      type: "nfs"
    matchExpressions:
      - {key: environment, operator: In, values: [stg]}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: slow

3.Create the PV and PVC resources

$ kubectl apply -f config_pv.yaml -f config_pvc.yaml

4.Check the PV and PVC resources

$ kubectl get pv,pvc --all-namespaces
NAME                            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                     STORAGECLASS   REASON   AGE
persistentvolume/nfs-home-001   1Gi        RWX            Retain           Bound    jhub/nfs-home-jupyter     slow                    20h
persistentvolume/nfs-home-002   1Gi        RWX            Retain           Bound    jhub/nfs-home-dummy       slow                    20h

NAMESPACE   NAME                                       STATUS   VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS   AGE
jhub        persistentvolumeclaim/nfs-home-dummy       Bound    nfs-home-002   1Gi        RWX            slow           20h
jhub        persistentvolumeclaim/nfs-home-jupyter     Bound    nfs-home-001   1Gi        RWX            slow           20h

5.Repeat steps 1 to 4 for each user
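
Repeating these steps by hand for every user is error-prone, so the manifests could also be generated. The sketch below builds a PV and PVC pair per user with the same fields as the YAML above and prints them as a single JSON `List`, which kubectl also accepts. The function names are hypothetical, the numbering of volumes by list position is my assumption, and I left out the beta storage-class annotation because `storageClassName` is already set.

```python
import json


def pv_manifest(index, username, server="100.0.0.4"):
    """PersistentVolume backed by the NFS directory pv<index>."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {
            "name": f"nfs-home-{index:03d}",
            "labels": {
                "name": f"pv-nfs-home-{username}",
                "type": "nfs",
                "environment": "stg",
            },
        },
        "spec": {
            "capacity": {"storage": "1Gi"},
            "accessModes": ["ReadWriteMany"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": "slow",
            "mountOptions": ["hard"],
            "nfs": {"path": f"/mnt/personal/pv{index:03d}", "server": server},
        },
    }


def pvc_manifest(username, namespace="jhub"):
    """PersistentVolumeClaim that selects the user's PV by label."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"nfs-home-{username}", "namespace": namespace},
        "spec": {
            "selector": {
                "matchLabels": {"name": f"pv-nfs-home-{username}", "type": "nfs"},
                "matchExpressions": [
                    {"key": "environment", "operator": "In", "values": ["stg"]}
                ],
            },
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": "1Gi"}},
            "storageClassName": "slow",
        },
    }


def manifest_list(usernames):
    """Bundle one PV and one PVC per user into a single List object."""
    items = []
    for index, username in enumerate(usernames, start=1):
        items.append(pv_manifest(index, username))
        items.append(pvc_manifest(username))
    return {"apiVersion": "v1", "kind": "List", "items": items}


print(json.dumps(manifest_list(["jupyter", "dummy"]), indent=2))
```

Saving the output to a file and passing it to `kubectl apply -f` should create all resources in one go, though I would check the result with `kubectl get pv,pvc` as in step 4.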

4.4.3. Configuring storage in JupyterHub

The configuration for data management follows.
It is based on **Customizing User Storage** in the official documentation.
Note that the persistent volume is mounted at /home/jovyan/.

config.yaml

singleuser:
  storage:
    type: "static"
    static:
      pvcName: 'nfs-home-{username}'
      subPath: 'private'
  extraEnv:
    GRANT_SUDO: "yes"
    CHOWN_HOME: "yes"
  uid: 0
  fsGid: 0
  cmd: 
    - jupyterhub-singleuser
    - --allow-root

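Complete configuration

The full configuration file follows.
In addition to the four settings covered above, the "hub:" section changes the timeout values and the path of the initial working directory.
For the "prePuller:" section, see **Pulling images before users arrive**.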

config.yaml

proxy:
  secretToken: "${YOUR RANDOM-HEX}"

hub:
  db:
    type: sqlite-memory
  extraConfig: |
    c.Spawner.start_timeout = 300
    c.Spawner.http_timeout = 300
    c.Spawner.notebook_dir = "/home/jovyan/"

auth:
  type: ldap
  ldap:
    server:
      address: 100.0.0.10:389
    dn:
      lookup: false
      templates:
        - 'CN={username},OU=People,DC=demo,DC=da,DC=com'
      user:
        searchBase: 'OU=People,DC=demo,DC=da,DC=com'
        escape: False
        attribute: 'sAMAccountName'
        dnAttribute: 'cn'
    allowedGroups:
      - 'CN=JupyterAllowed,OU=People,DC=demo,DC=da,DC=com'
  admin:
    users:
      - jupyter
    access: false

singleuser:
  defaultUrl: "/lab"
  cpu:
    limit: .5
    guarantee: .5
  memory:
    limit: 256M
    guarantee: 256M
  image:
    name: jupyter/base-notebook
    tag: 2343e33dec46
  profileList:
    - display_name: "Minimal environment"
      description: "[Resource] CPU:0.5/Mem:256MB<br />Documentation conversion tools such as pandoc and texlive have been added to base-notebook."
      kubespawner_override:
        image: jupyter/minimal-notebook:2343e33dec46
      default: true
    - display_name: "Scipy environment"
      description: "[Resource] CPU:1.0/Mem:2GB<br />Contains data analysis libraries in Python such as pandas and scikit-learn."
      kubespawner_override:
        image: jupyter/scipy-notebook:2343e33dec46
        cpu_limit: 1
        mem_limit: '2G'
    - display_name: "Tensorflow environment"
      description: "[Resource] CPU:1.5/Mem:3GB<br />Tensorflow has been added to scipy-notebook. There is no GPU support."
      kubespawner_override:
        image: jupyter/tensorflow-notebook:2343e33dec46
        cpu_limit: 1.5
        mem_limit: '3G'
    - display_name: "Datascience environment"
      description: "[Resource] CPU:2.0/Mem:4GB<br />R and Julia has been added to scipy-notebook."
      kubespawner_override:
        image: jupyter/datascience-notebook:2343e33dec46
        cpu_limit: 2
        mem_limit: '4G'
  storage:
    type: "static"
    static:
      pvcName: 'nfs-home-{username}'
      subPath: 'private'
    extraVolumes:
      - name: jupyterhub-shared
        persistentVolumeClaim:
          claimName: nfs1
    extraVolumeMounts:
      - name: jupyterhub-shared
        mountPath: "/mnt/jupyterhub/shared"
  extraEnv:
    GRANT_SUDO: "yes"
    CHOWN_HOME: "yes"
  uid: 0
  fsGid: 0
  cmd: 
    - jupyterhub-singleuser
    - --allow-root

prePuller:
  hook:
    enabled: false
  continuous:
    enabled: false
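
The `secretToken` placeholder in the file above has to be replaced with a random hex string. The Zero to JupyterHub guide suggests generating it with `openssl rand -hex 32`; as far as I can tell, the following Python sketch produces a value of the same shape (32 random bytes, 64 hex characters). The function name is my own.

```python
import secrets


def new_secret_token(num_bytes=32):
    """Return a random hex string suitable for proxy.secretToken."""
    return secrets.token_hex(num_bytes)


print(new_secret_token())
```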

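Checking that it works

Connect from a web browser (Google Chrome) on the local Windows 10 PC (that is, the VirtualBox host).
In VirtualBox, define a port forwarding rule for K8s-Master (TCP: 127.0.0.1:8888 -> 100.0.0.1:80).

JupyterHub login screen

Container selection screen

JupyterLab initial screen

Checking the persistent volume

Summary

This article showed how to build a JupyterHub analysis environment on VirtualBox virtual machines, along with an example JupyterHub configuration.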
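
Before opening the browser, the forwarding rule can be checked from the host with a few lines of Python. This sketch queries the hub's REST API root through the forwarded port and prints the reported version, or `None` if nothing answers. The helper names are my own, and it assumes the rule above (127.0.0.1:8888) is in place.

```python
import json
import urllib.error
import urllib.request


def hub_api_url(host="127.0.0.1", port=8888):
    """URL of the JupyterHub REST API root behind the forwarded port."""
    return f"http://{host}:{port}/hub/api/"


def hub_version(url, timeout=5):
    """Return the version string reported by the hub, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response).get("version")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error or a non-JSON response
        return None


if __name__ == "__main__":
    print(hub_version(hub_api_url()))
```

If this prints `None`, I would first check the port forwarding rule and the state of the proxy-public service.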
