Running Spark on YARN

Posted at 2016-04-17

Introduction

As the title says, I tried running a Spark word-count program on YARN.
It produced output, so I believe it is working, but honestly it only came together through trial and error.
This is a summary of what I did with almost zero prior knowledge, so please treat it as rough notes.
If you are wondering what YARN or Spark even is, see the articles below (I don't fully understand them myself).

Reference articles
Introduction to YARN (IBM)
Part 15: Hadoop YARN, a Resource Management Platform for Compute Clusters
Getting Started with Distributed Processing (Hadoop + Spark)
Studying Apache Spark So You Can Say "I Can Do Distributed Processing!"

Contents

・ Environment setup (building Hadoop and Spark, and configuring conf files)
・ Word count on a text file using SparkContext

Environment

・ Ubuntu 14.04
・ Java 1.7.0_95
・ Hadoop 2.7.1
・ Spark 1.6.1
・ maven 3.3.9
・ Scala 2.11.7
・ sbt 0.13.7

Hadoop setup

< Build >
I downloaded http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1-src.tar.gz, extracted it, and built it.
The built files end up in hadoop-2.7.1-src/hadoop-dist/target/hadoop-2.7.1.
In this article, that hadoop-2.7.1 directory is referred to as ${HADOOP_HOME}.

wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1-src.tar.gz
tar xvfz hadoop-2.7.1-src.tar.gz
cd hadoop-2.7.1-src
mvn package -Pdist -DskipTests -Dtar

< conf settings >
I followed http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html.
Once you follow that article, the Hadoop configuration is done.

Spark setup

< Build >
I cloned git://github.com/apache/spark.git and built it. The branch is branch-1.6.
In this article, the spark directory is referred to as ${SPARK_HOME}.

git clone git://github.com/apache/spark.git
cd spark
git checkout branch-1.6
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests clean package

Since Hadoop is 2.7.1, I used -Phadoop-2.6 (in Spark 1.6 this profile covers Hadoop 2.6.x and later 2.x releases).
Since Scala is 2.11.7, I used -Dscala-2.11.
Note that the Spark 1.6 build documentation also says to run ./dev/change-scala-version.sh 2.11 before building against Scala 2.11.

If your versions differ, see http://spark.apache.org/docs/latest/building-spark.html.

Word count

Creating the word-count jar to pass to spark-submit

In this article, the spark_test directory is referred to as ${SPARK_TEST_HOME}.
Running sbt assembly produces ${SPARK_TEST_HOME}/target/scala-2.11/spark_test-assembly-1.0.jar.

/spark_test
  |- build.sbt
  |- project/
  |    |- plugins.sbt
  |- src/
       |- main/
            |- resources/
            |    |- wordcount_input.txt
            |- scala/
                 |- WordCount.scala
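With this layout in place, the jar is built from the project root. A minimal sketch of the build step, assuming sbt is already on your PATH:

```shell
cd ${SPARK_TEST_HOME}
# builds the fat jar under target/scala-2.11/spark_test-assembly-1.0.jar
sbt assembly
```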
build.sbt
name := "spark_test"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "1.6.1",
  "org.apache.spark" % "spark-streaming_2.11" % "1.6.1"
)

// Merge Strategy
assemblyMergeStrategy in assembly := {
  case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
  case PathList(ps @ _*) if ps.last endsWith ".class" => MergeStrategy.first
  case "application.conf"                            => MergeStrategy.concat
  case "unwanted.txt"                                => MergeStrategy.discard
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
plugins.sbt
logLevel := Level.Warn

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")
WordCount.scala
package examples

import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount extends Logging {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("spark_test")
    val sc = new SparkContext(conf)
    val files = sc.textFile("hdfs://localhost:9000/data/wordcount_input.txt")
    val words = files.flatMap(_.split(" "))
    val wordCounts = words.map(s => (s, 1)).reduceByKey(_ + _)
    wordCounts.saveAsTextFile("hdfs://localhost:9000/result")
    sc.stop() // shut down the SparkContext so the YARN application exits cleanly
  }
}
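The flatMap → map → reduceByKey pipeline above can be tried out on an ordinary Scala collection without a cluster. This is a hypothetical local illustration (LocalWordCount is not part of the project, and groupBy + sum stands in for what reduceByKey does on an RDD):

```scala
// Local sketch of the same word-count logic using plain Scala collections.
object LocalWordCount {
  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to be")
    val counts = lines
      .flatMap(_.split(" "))        // split each line into words
      .map(w => (w, 1))             // pair every word with a count of 1
      .groupBy(_._1)                // collect the pairs for each word
      .mapValues(_.map(_._2).sum)   // sum the 1s per word
    counts.toSeq.sortBy(_._1).foreach(println)
  }
}
```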

I borrowed wordcount_input.txt from http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip.

wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip
unzip Hadoop-WordCount.zip
    Archive:  Hadoop-WordCount.zip
    creating: Hadoop-WordCount/
    creating: Hadoop-WordCount/classes/
    creating: Hadoop-WordCount/input/
    inflating: Hadoop-WordCount/input/Word_Count_input.txt
    inflating: Hadoop-WordCount/WordCount.java
    inflating: Hadoop-WordCount/clean.sh
    inflating: Hadoop-WordCount/build.sh
    inflating: Hadoop-WordCount/classes/WordCount$Reduce.class
    inflating: Hadoop-WordCount/classes/WordCount.class
    inflating: Hadoop-WordCount/classes/WordCount$Map.class
    inflating: Hadoop-WordCount/wordcount.jar
src/main/resources/wordcount_input.txt
AFTER such a scene as the last, Walter Morel was for some days abashed
and ashamed, but he soon regained his old bullying indifference.
Yet there was a slight shrinking, a diminishing in his assurance.
Physically even, he shrank, and his fine full presence waned.
He never grew in the least stout, so that, as he sank from his erect,
assertive bearing, his physique seemed to contract along with his pride
and moral strength.

But now he realised how hard it was for his wife to drag
about at her work, and, his sympathy quickened by penitence,
hastened forward with his help. He came straight home from the pit,
...

Reference articles
Spark Streaming Development with Spark 1.5 [Getting Started]
Word Count (Spark, YARN, HDFS)

Run

1. Start Hadoop and put the word-count input file on HDFS.

${HADOOP_HOME}/bin/hdfs namenode -format
${HADOOP_HOME}/sbin/start-dfs.sh
${HADOOP_HOME}/bin/hdfs dfs -mkdir hdfs://localhost:9000/data
${HADOOP_HOME}/bin/hdfs dfs -put ${SPARK_TEST_HOME}/src/main/resources/wordcount_input.txt hdfs://localhost:9000/data
${HADOOP_HOME}/sbin/start-yarn.sh

2. Pass the jar file to spark-submit and run the word count.

${SPARK_HOME}/bin/spark-submit --class examples.WordCount --master yarn-client --num-executors 1 --driver-memory 2g --executor-memory 1g --executor-cores 1 ${SPARK_TEST_HOME}/target/scala-2.11/spark_test-assembly-1.0.jar

3. Check the results and save them to wordcount_result.txt.

${HADOOP_HOME}/bin/hdfs dfs -ls hdfs://localhost:9000/result
    # Output of ls. The results are stored in the part-* files.
    hdfs://localhost:9000/result/_SUCCESS
    hdfs://localhost:9000/result/part-00000
    hdfs://localhost:9000/result/part-00001

touch (some directory)/wordcount_result.txt
${HADOOP_HOME}/bin/hdfs dfs -cat hdfs://localhost:9000/result/part-00000 >> (some directory)/wordcount_result.txt
${HADOOP_HOME}/bin/hdfs dfs -cat hdfs://localhost:9000/result/part-00001 >> (some directory)/wordcount_result.txt
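As an aside, HDFS has a built-in subcommand that does this concatenation in one step. A sketch using the same paths as above (getmerge writes all the part files into a single local file):

```shell
${HADOOP_HOME}/bin/hdfs dfs -getmerge hdfs://localhost:9000/result wordcount_result.txt
```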

Reference articles
Running a Job on YARN (Spark Edition)
Trying Apache Spark on YARN

Results

I got output like the following.

wordcount_result.txt
(simply,,1)
(bone,1)
(roses.,3)
(Whatstandwell.,1)
(stuck,,1)
(bananas.,1)
(letter:,1)
(insufferably,1)
(derniers,1)
(hem,1)
(sweat.,1)
(think,",1)
(wasn't,7)
(been,85)
(they,,2)
(PAUL,2)
(jelly,",1)
(does---",1)
(pig,1)
(crying,5)
(soon;,1)
(Sunday,",1)
(breath,2)
(knows,4)
(so's,2)
(whistled,,1)
(ignore,1)
(Western,7)
(smooth,2)
(BLOWN,1)
...

Conclusion

I was able (probably) to run a word count with Spark on YARN.
There is still a lot I don't understand, so comments are very welcome.
