CubieTruck (Cubieboard) Hadoop + Spark 1.0.2 cluster setup


I installed Apache Spark so it can work with Hadoop HDFS.

Installation steps

■ Hardware
 ・Self-built wireless router (Raspberry Pi)
 ・CubieTruck (Fedora 19) × 4
■ Middleware
  ・Oracle JDK 1.7 for ARM
  ・Hadoop 2.2.0
  ・Apache Spark 1.0.2

The hosts configuration, Java installation, and passwordless SSH setup done while building the Hadoop cluster are omitted here.
[Hadoop setup guide]
 http://qiita.com/tsunaki/items/41b9ea36ae99b7702ae3

Installing Spark

# cd /usr/local/src
# wget http://ftp.riken.jp/net/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz 

# tar zxf spark-1.0.2-bin-hadoop2.tgz 
# mv spark-1.0.2-bin-hadoop2 /opt/

At this point Spark already works in standalone mode.
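As a quick sanity check before any cluster configuration, you can run the bundled SparkPi example in local mode (a minimal check, assuming the directory has already been moved to /opt as above; the trailing 10 is just the number of partitions the example uses):

# cd /opt/spark-1.0.2-bin-hadoop2
# ./bin/run-example SparkPi 10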

Now let's configure the cluster.

# cd /opt/spark-1.0.2-bin-hadoop2
# cp conf/spark-env.sh.template conf/spark-env.sh

# vi conf/spark-env.sh

Add the master's IP address at the very bottom.

#!/usr/bin/env bash

# This file is sourced when running various Spark programs.
# Copy it as spark-env.sh and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_CLASSPATH, default classpath entries to append
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_LIBRARY, to point to your libmesos.so if you use Mesos

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, Number of workers to start (Default: 2)
# - SPARK_EXECUTOR_CORES, Number of cores for the workers (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
# - SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
# - SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
# - SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
# - SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

# Options for the daemons used in the standalone deploy mode:
# - SPARK_MASTER_IP, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_INSTANCES, to set the number of worker processes per node
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# ADD master00 
SPARK_MASTER_IP=192.168.100.2

Next, configure the slaves file.

# vi conf/slaves

Replace the localhost entry with the following:

master00
slave01
slave02
slave03

Copy this directory to each slave node (see the sketch below).
Note: it must be placed in the same path as on the master.
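One way to copy it, assuming passwordless SSH from master00 to the slaves (as set up for Hadoop) and an account that can write to /opt on each node, is a small rsync loop over the hostnames from conf/slaves; scp -r works just as well:

### run on master00; the target list matches conf/slaves (minus master00 itself)
# for h in slave01 slave02 slave03; do rsync -a /opt/spark-1.0.2-bin-hadoop2 ${h}:/opt/; done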

After distributing it, start the cluster from master00:

# su - hduser
$ /opt/spark-1.0.2-bin-hadoop2/sbin/start-all.sh

This completes the setup.
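In addition to the web UI below, jps on each node gives a quick process-level check; Spark's standalone daemons register as "Master" and "Worker" (master00 should show both, since it is also listed in conf/slaves):

$ jps
### master00: Master (and Worker, since master00 is in conf/slaves)
### slave01 - slave03: Worker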

Check the startup in the web UI (the master's web UI listens on port 8080 by default).

(Screenshot: Spark master web UI)

You can confirm that it is running as a cluster.

Once that is confirmed, let's hook it up to Hadoop and test it.

First, prepare some test data; here we use the Japanese postal-code data.
(All of the following steps are run on master00.)

# su - hduser
$ wget http://www.post.japanpost.jp/zipcode/dl/oogaki/zip/ken_all.zip
$ unzip ken_all.zip

$ nkf -w KEN_ALL.CSV > KEN_ALL.CSV.utf
### Convert to UTF-8
$ mv KEN_ALL.CSV.utf KEN_ALL.CSV 
### Put the file into HDFS
$ hdfs dfs -put KEN_ALL.CSV /user/
$ hdfs dfs -ls /user
14/08/29 21:55:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   4 hduser supergroup   12232566 2014-08-29 21:07 /user/KEN_ALL.CSV

Start the spark-shell:

$ /opt/spark-1.0.2-bin-hadoop2/bin/spark-shell

(Screenshot: spark-shell startup)

Load the text file from HDFS:

scala> val file = sc.textFile("hdfs://master00:9000/user/KEN_ALL.CSV")

(Screenshot: reading KEN_ALL.CSV from HDFS)

Check that it was loaded:

scala> file.count()

(Screenshot: result of file.count())
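As a rough cross-check, the count Spark returns should match the line count of the local copy on master00 (assuming KEN_ALL.CSV is still in hduser's home directory):

$ wc -l KEN_ALL.CSV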

Next, let's search for lines containing the string 横浜 (Yokohama):

scala> file.filter(line => line.contains("横浜")).foreach(println)

(Screenshot: lines containing 横浜)

You can confirm that the search worked.

That's all.

Compared with Hadoop MapReduce, this is dramatically faster.
It looks like even ARM CPU boards are perfectly usable for this.
