Steps to get an Apache Spark application running in Eclipse



Introduction


  1. Install the required software

  2. Create a project with giter8

  3. Build & test with sbt

  4. Generate Eclipse configuration files with sbt's Eclipse plugin

  5. Import into Eclipse

That is the overall flow.

The same steps should work almost unchanged in IntelliJ IDEA, but I don't have IntelliJ IDEA, so I haven't been able to verify this.


Installing the required software

Install Scala 2.10, which Apache Spark 1.2.1 supports.

sbt is the build tool.

giter8 is a tool that generates projects from templates (the templates are hosted on GitHub).

brew cask install java

brew install scala210
brew link --force scala210
brew install sbt
brew install giter8


Installing the Eclipse plugins

Before installing, it is a good idea to edit eclipse.ini and increase the heap size.


~/Applications/Eclipse.app/Contents/MacOS/eclipse.ini

-Xms256m

-Xmx1024m


Eclipse Scala IDE

Install Eclipse Scala IDE from the Eclipse Marketplace.


ScalaTest for Scala IDE

From "Install New Software", install the following:

http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site

Scala IDE plugins -> ScalaTest for Scala IDE


Creating a new project with giter8

Use the nttdata-oss/basic-spark-project.g8 project template.

You will be prompted for a name, package, and version; enter whatever suits you.

$ g8 nttdata-oss/basic-spark-project.g8

Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8

A basic spark application project

name [Basic Spark]: SparkExample
package [com.example]: spark
version [0.0.1]:

Template applied in ./sparkexample


Checking the generated files

It contains a few small examples.

$ tree

.
├── README.rst
├── assembly.sbt
├── build.sbt
├── project
│   ├── assembly.sbt
│   └── plugins.sbt
└── src
    ├── main
    │   └── scala
    │       └── spark
    │           ├── GroupByTest.scala
    │           ├── RandomTextWriter.scala
    │           ├── SparkHdfsLR.scala
    │           ├── SparkLR.scala
    │           ├── SparkLRTestDataGenerator.scala
    │           ├── SparkPi.scala
    │           ├── WordCount.scala
    │           └── Words.scala
    └── test
        └── scala
            └── spark
                └── SparkPiSpec.scala
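Among the generated examples, WordCount.scala is the classic starting point. As a rough illustration of the kind of logic such an example contains, here is a word count written with plain Scala collections instead of Spark RDDs (this sketch is mine, not the template's actual code):

```scala
// Hypothetical word-count sketch using plain Scala collections; a Spark
// version would express the same pipeline with flatMap/map/reduceByKey
// on an RDD instead.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+")) // split each line into words
      .filter(_.nonEmpty)       // drop empty tokens
      .groupBy(identity)        // group identical words together
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("hello spark", "hello scala"))
    println(counts) // "hello" maps to 2
  }
}
```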


Checking build.sbt

spark-streaming, spark-sql, spark-hive, and spark-mllib are commented out, so edit them as needed.


build.sbt

name := "SparkExample"

organization := ""

version := "0.0.1"

scalaVersion := "2.10.4"

resolvers ++= Seq("cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/")

libraryDependencies ++= Seq(
"org.scalatest" %% "scalatest" % "2.0.M5b" % "test" withSources() withJavadoc(),
"org.scalacheck" %% "scalacheck" % "1.10.0" % "test" withSources() withJavadoc(),
"org.apache.spark" %% "spark-core" % "1.2.1" % "provided" withSources() withJavadoc(),
// "org.apache.spark" %% "spark-streaming" % "1.2.1" % "provided" withSources() withJavadoc(),
// "org.apache.spark" %% "spark-sql" % "1.2.1" % "provided" withSources() withJavadoc(),
// "org.apache.spark" %% "spark-hive" % "1.2.1" % "provided" withSources() withJavadoc(),
// "org.apache.spark" %% "spark-mllib" % "1.2.1" % "provided" withSources() withJavadoc(),
"org.apache.hadoop" % "hadoop-client" % "2.5.0-cdh5.3.1" % "provided" withJavadoc(),
"com.github.scopt" %% "scopt" % "3.2.0"
)

initialCommands := "import .sparkexample._"
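For example, if you want Spark SQL, enabling it is just a matter of removing the leading // from the corresponding line, which would then read:

```scala
"org.apache.spark" %% "spark-sql" % "1.2.1" % "provided" withSources() withJavadoc(),
```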



Build & test with sbt

The first run takes quite a while because all the dependency libraries have to be downloaded.

$ sbt test

Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
[info] Loading project definition from /Users/ishihamat/Documents/workspace/sparkexample/project
[info] Updating {file:/Users/ishihamat/Documents/workspace/sparkexample/project/}sparkexample-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] downloading https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/com.eed3si9n/sbt-assembly/scala_2.10/sbt_0.13/0.11.1/jars/sbt-assembly.jar
(snip)
[info] Compiling 1 Scala source to /Users/ishihamat/Documents/workspace/sparkexample/target/scala-2.10/test-classes...
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/14 22:47:23 INFO SecurityManager: Changing view acls to: ishihamat
15/03/14 22:47:23 INFO SecurityManager: Changing modify acls to: ishihamat
15/03/14 22:47:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ishihamat); users
with modify permissions: Set(ishihamat)
15/03/14 22:47:23 INFO Slf4jLogger: Slf4jLogger started
15/03/14 22:47:23 INFO Remoting: Starting remoting
15/03/14 22:47:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.11.2:53331]
15/03/14 22:47:23 INFO Utils: Successfully started service 'sparkDriver' on port 53331.
15/03/14 22:47:23 INFO SparkEnv: Registering MapOutputTracker
15/03/14 22:47:23 INFO SparkEnv: Registering BlockManagerMaster
15/03/14 22:47:23 INFO DiskBlockManager: Created local directory at /var/folders/mh/yw9p58bj0q56r3n50qn07tgh0000gn/T/spark-c7a664c4-0fa4-4ca5-a1cf-04d
c5d8853dd/spark-254de019-ab13-43af-ab11-6900d0584549
15/03/14 22:47:23 INFO MemoryStore: MemoryStore started with capacity 510.3 MB
15/03/14 22:47:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/14 22:47:24 INFO HttpFileServer: HTTP File server directory is /var/folders/mh/yw9p58bj0q56r3n50qn07tgh0000gn/T/spark-87116159-39a8-4dae-a875-c6
c9352dafdf/spark-4b21cfc1-d4fc-45ba-a2c9-2320197ff0ed
15/03/14 22:47:24 INFO HttpServer: Starting HTTP Server
15/03/14 22:47:24 INFO Utils: Successfully started service 'HTTP file server' on port 53332.
15/03/14 22:47:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/14 22:47:24 INFO SparkUI: Started SparkUI at http://192.168.11.2:4040
15/03/14 22:47:24 INFO Executor: Starting executor ID <driver> on host localhost
15/03/14 22:47:24 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.11.2:53331/user/HeartbeatReceiver
15/03/14 22:47:24 INFO NettyBlockTransferService: Server created on 53333
15/03/14 22:47:24 INFO BlockManagerMaster: Trying to register BlockManager
15/03/14 22:47:24 INFO BlockManagerMasterActor: Registering block manager localhost:53333 with 510.3 MB RAM, BlockManagerId(<driver>, localhost, 53333
)
15/03/14 22:47:24 INFO BlockManagerMaster: Registered BlockManager
15/03/14 22:47:24 INFO SparkContext: Starting job: reduce at SparkPi.scala:50
15/03/14 22:47:24 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:50) with 1 output partitions (allowLocal=false)
15/03/14 22:47:24 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:50)
15/03/14 22:47:24 INFO DAGScheduler: Parents of final stage: List()
15/03/14 22:47:24 INFO DAGScheduler: Missing parents: List()
15/03/14 22:47:24 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:46), which has no missing parents
15/03/14 22:47:24 INFO MemoryStore: ensureFreeSpace(1688) called with curMem=0, maxMem=535088332
15/03/14 22:47:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1688.0 B, free 510.3 MB)
15/03/14 22:47:24 INFO MemoryStore: ensureFreeSpace(1228) called with curMem=1688, maxMem=535088332
15/03/14 22:47:25 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1228.0 B, free 510.3 MB)
15/03/14 22:47:25 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53333 (size: 1228.0 B, free: 510.3 MB)
15/03/14 22:47:25 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/14 22:47:25 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838
15/03/14 22:47:25 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:46)
15/03/14 22:47:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/03/14 22:47:25 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1317 bytes)
15/03/14 22:47:25 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/14 22:47:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 727 bytes result sent to driver
15/03/14 22:47:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 57 ms on localhost (1/1)
15/03/14 22:47:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/14 22:47:25 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:50) finished in 0.069 s
15/03/14 22:47:25 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:50, took 0.303240 s
[info] SparkPiSpec:
[info] Pi
[info] - should be less than 4 and more than 3
[info] Passed: Total 1, Failed 0, Errors 0, Passed 1
[success] Total time: 638 s, completed 2015/03/14 22:47:25
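The test that just ran, SparkPiSpec, checks that a Monte Carlo estimate of pi lands between 3 and 4. The core idea can be sketched without a SparkContext (this standalone version is illustrative, not the template's code):

```scala
import scala.util.Random

// Hypothetical standalone Monte Carlo estimate of pi: sample random points
// in the [-1, 1] square and count how many fall inside the unit circle.
object PiSketch {
  def estimatePi(samples: Int, seed: Long = 42L): Double = {
    val rnd = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rnd.nextDouble() * 2 - 1
      val y = rnd.nextDouble() * 2 - 1
      x * x + y * y <= 1 // point lies inside the unit circle
    }
    4.0 * inside / samples // circle/square area ratio is pi/4
  }

  def main(args: Array[String]): Unit = {
    val pi = estimatePi(100000)
    println(pi)
    assert(pi > 3 && pi < 4) // the same property SparkPiSpec asserts
  }
}
```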


Generating the Eclipse configuration files

$ sbt eclipse

Picked up _JAVA_OPTIONS: -Dfile.encoding=UTF-8
[info] Loading project definition from /Users/ishihamat/Documents/workspace/sparkexample/project
[info] Set current project to SparkExample (in build file:/Users/ishihamat/Documents/workspace/sparkexample/)
[info] About to create Eclipse project files for your project(s).
[info] Successfully created Eclipse project files for project(s):
[info] SparkExample

Incidentally, the sbt Eclipse plugin configuration is already included in nttdata-oss/basic-spark-project.g8. An IntelliJ IDEA plugin is included as well.


project/plugins.sbt

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.3.0")

addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0")



Importing into Eclipse

From the Eclipse menu, select


File -> Import -> General -> Existing Projects into Workspace


and import the project.

In my environment 14 problems were reported, but setting Eclipse's Scala compiler version to 2.10 resolved all of them.

After that, choose ScalaTest in the debug configuration and run whatever tests you like.