
I Built the Ultimate Docker Image for PySpark

Posted at 2020-03-15

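Introduction

When you decide to start learning PySpark, getting an environment set up turns out to be quite a hassle.

Installing the correct Java version (for example, it has to be Java 8)
Configuring Python (Python 3? 2? Does it have to be 3.5 or later?)
Studying comfortably in Jupyter Notebook
Reading and writing data in S3 or Azure Storage, which you will naturally want once you take on big data
Setting environment variables properly (JAVA_HOME? SPARK_HOME? What are those?)

I packed all of it into a single Docker image.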

Image contents

baseimage: phusion/baseimage:0.11 (based on Ubuntu 18.04)
openjdk-8-jre
hadoop 3.2.1
spark 2.4.5
Anaconda3-2019.10-Linux-x86_64(python3.7.4)
Jupyter notebook
pixiedust 1.1.18

How to run

First, create a working directory on your local machine and start the container from it.


mkdir sparkstudy
cd sparkstudy
docker run -v `pwd`:/work -p8888:8888 -it --rm neppysan/pyspark

The following output then appears. Copy the URL on the last line (http://127.0.0.1...) and paste it into your browser.


To access the notebook, open this file in a browser:
            file:///root/.local/share/jupyter/runtime/nbserver-6-open.html
    Or copy and paste one of these URLs:
            http://2457d17b9863:8888/?token=7a466b3b8d558c34ea7b62ef3b6da95ed83a403d0a210847
     or http://127.0.0.1:8888/?token=7a466b3b8d558c34ea7b62ef3b6da95ed83a403d0a210847

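Jupyter Notebook then opens. Click work to move to your own directory (the local directory mounted at /work by the -v option).
The image is about 5 GB, so the first download takes a little while.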
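
Once the notebook is open, a quick sanity check in the first cell can confirm that the pieces listed above are visible to Python. The following is only a minimal sketch, not something taken from the image itself: the function name environment_report is hypothetical, and a value of None simply means that variable is not set in the notebook process.

```python
import importlib.util
import os
import sys
from typing import Dict


def environment_report() -> Dict[str, object]:
    """Collect the settings that usually cause trouble in a manual PySpark setup."""
    return {
        # Version of the Python interpreter running this cell
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        # None means the variable is not set in this process
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        "SPARK_HOME": os.environ.get("SPARK_HOME"),
        # True when the pyspark package can be found on the import path
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


for name, value in environment_report().items():
    print(f"{name}: {value}")
```

If pyspark_importable prints False, the notebook kernel is probably not the one provided by the image.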

Jupyter Notebook

I wrote samples showing how to connect to S3 and Azure Storage.
They are bundled in the image as sample.ipynb, so take a look.
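
For readers who want a rough idea before opening the notebook, here is a minimal sketch of how storage credentials are commonly passed to Spark through Hadoop configuration keys. It is not copied from sample.ipynb, which may do this differently. The helper names storage_conf and build_session are hypothetical, the credentials are placeholders, and the sketch assumes the hadoop-aws and hadoop-azure connector jars are on Spark's classpath.

```python
from typing import Dict, Optional


def storage_conf(
    s3_access_key: Optional[str] = None,
    s3_secret_key: Optional[str] = None,
    azure_account: Optional[str] = None,
    azure_key: Optional[str] = None,
) -> Dict[str, str]:
    """Build Spark config entries for the S3A and Azure Blob (wasb) connectors."""
    conf: Dict[str, str] = {}
    if s3_access_key and s3_secret_key:
        # The "spark.hadoop." prefix forwards the key to the Hadoop configuration
        conf["spark.hadoop.fs.s3a.access.key"] = s3_access_key
        conf["spark.hadoop.fs.s3a.secret.key"] = s3_secret_key
    if azure_account and azure_key:
        conf[
            f"spark.hadoop.fs.azure.account.key.{azure_account}.blob.core.windows.net"
        ] = azure_key
    return conf


def build_session(conf: Dict[str, str], app_name: str = "storage-sketch"):
    """Create a SparkSession with the given config (pyspark is imported lazily)."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage inside the notebook (bucket name and keys are placeholders):
# spark = build_session(storage_conf(s3_access_key="AKIA...", s3_secret_key="..."))
# df = spark.read.csv("s3a://my-bucket/path/data.csv", header=True)
```

Keeping the config in a plain dictionary makes it easy to inspect what will be sent to Spark before any session is started, and keeps real keys out of the notebook if they are read from environment variables instead.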

References

More details are written up in the GitHub repository.
https://github.com/ShumpeiWatanabe/pyspark/blob/master/Dockerfile
