Installing Apache Spark on Windows 10

Posted at 2023-07-25

A memo for my own reference, since a number of features did not work at first.

Prerequisites

| Product | Version | URL |
| --- | --- | --- |
| Windows | 10 Pro | - |
| Java | 18.0.1 | - |
| Apache Spark | 3.4.1 | https://spark.apache.org/downloads.html |
| Hadoop | 3.3.0 | https://hadoop.apache.org/releases.html |
| winutils | - | https://github.com/ruslanmv/How-to-install-Hadoop-on-Windows/tree/master/winutils/hadoop-3.3.0-YARN-8246/bin |

Installing Apache Spark

Download spark-3.4.1-bin-hadoop3.tgz and extract it to any folder.
Add the extraction folder to the environment variables as SPARK_HOME.

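Installing Hadoop

Download hadoop-3.3.0.tar.gz and extract it to any folder.
7-Zip seemed to produce errors (possibly my imagination), so I extracted it from the command line instead.

tar -xvzf hadoop-3.3.0.tar.gz -C <destination folder>

Add the extraction folder to the environment variables as HADOOP_HOME.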
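
If the tar command is not available, the same extraction can be sketched with Python's standard tarfile module. This is only a minimal sketch: the function name extract_archive and the paths in the usage comment are hypothetical, and it should only be run on archives you trust.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz / .tgz archive into dest_dir.

    Returns the sorted names of the top-level entries found in dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the destination if missing
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Hypothetical usage (paths are placeholders, not taken from the article):
#   extract_archive(r"C:\Downloads\hadoop-3.3.0.tar.gz", r"C:\tools")
```

The folder created inside the destination is the one to register as HADOOP_HOME.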

Installing Java

Omitted.
Some features apparently do not work unless JAVA_HOME is also added to the environment variables.
Example: spark_df.write, where spark_df is a pyspark.sql.dataframe.DataFrame.

Installing winutils

Clone the repository with git:

git clone https://github.com/ruslanmv/How-to-install-Hadoop-on-Windows.git

Copy How-to-install-Hadoop-on-Windows\winutils\bin
into HADOOP_HOME (overwriting the existing bin folder).
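
The copy step can also be scripted. The following is a rough sketch using shutil.copytree, assuming Python 3.8 or later (for dirs_exist_ok); the function name copy_winutils_bin and the paths in the usage comment are hypothetical.

```python
import shutil
from pathlib import Path


def copy_winutils_bin(src_bin, hadoop_home):
    """Copy everything in src_bin into <hadoop_home>/bin.

    Files with the same name are overwritten; other files already in
    <hadoop_home>/bin are left in place. Returns True if winutils.exe
    exists in the target folder afterwards.
    """
    target = Path(hadoop_home) / "bin"
    # dirs_exist_ok=True merges into the existing bin folder instead of failing
    shutil.copytree(src_bin, target, dirs_exist_ok=True)
    return (target / "winutils.exe").is_file()


# Hypothetical usage (paths are placeholders):
#   copy_winutils_bin(r"How-to-install-Hadoop-on-Windows\winutils\bin",
#                     r"C:\tools\hadoop-3.3.0")
```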

Setting the Path

Add the following to Path:
SPARK_HOME\bin
HADOOP_HOME\bin
(This assumes Java is already on the Path.)
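
As a quick sanity check of the settings so far, something like the sketch below could be run in a new terminal. It only inspects environment variables; the function name check_spark_env is hypothetical, and the choice to report problems as a list of strings is just one possible design.

```python
import os


def _norm(path):
    """Normalize a path so that comparisons ignore case (on Windows) and trailing separators."""
    return os.path.normcase(os.path.normpath(path))


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    # The three variables the article registers
    for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    # SPARK_HOME\bin and HADOOP_HOME\bin should appear on PATH
    path_entries = [_norm(p) for p in env.get("PATH", "").split(os.pathsep) if p]
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        home = env.get(name)
        if home and _norm(os.path.join(home, "bin")) not in path_entries:
            problems.append(f"{name}\\bin is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

Environment variables changed through the Windows settings dialog are only visible to processes started afterwards, so the check is best run from a freshly opened terminal.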

Verification

Apache Spark

spark-submit --version  

Hadoop

hadoop version  

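
Both checks can be run from one script. The sketch below is a minimal example using subprocess; the function name run_version_command is hypothetical, and stdout and stderr are merged because the version text may be written to either stream.

```python
import shutil
import subprocess


def run_version_command(args):
    """Run a command and return (exit code, combined stdout/stderr text).

    Returns (None, message) when the executable cannot be found on PATH.
    """
    # shutil.which also resolves launcher scripts such as .cmd files on Windows
    executable = shutil.which(args[0])
    if executable is None:
        return None, f"{args[0]} was not found on PATH"
    result = subprocess.run(
        [executable, *args[1:]],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        errors="replace",  # avoid decode failures on non-UTF-8 consoles
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    for command in (["spark-submit", "--version"], ["hadoop", "version"]):
        code, output = run_version_command(command)
        print(" ".join(command), "-> exit code", code)
        print(output)
```

A "was not found on PATH" message here usually points back to the Path settings rather than to the installation itself.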

Troubleshooting

SocketTimeoutException when running show() in pyspark

Error message

Py4JJavaError: An error occurred while calling o44.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (2112N-14046.jp.misumi.pri executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
<-- snip -->
Caused by: java.net.SocketTimeoutException: Accept timed out

Fix: add the following

import findspark 

findspark.init()
findspark.find()
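
When findspark cannot be installed, a workaround sometimes suggested for the same "Python worker failed to connect back" error is to point Spark at the current interpreter through the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables before the session is created. This is an alternative and not the fix used in this memo; whether it resolves the error in a given environment is an assumption, and the function name pin_worker_python is hypothetical.

```python
import os
import sys


def pin_worker_python(env=None):
    """Set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the running interpreter.

    Assumption: the error is caused by Spark launching a different (or missing)
    Python for its workers. Call this before creating the SparkSession.
    """
    env = os.environ if env is None else env
    env["PYSPARK_PYTHON"] = sys.executable
    env["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return env


if __name__ == "__main__":
    pin_worker_python()
    print(os.environ["PYSPARK_PYTHON"])
```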

UnknownTimeZoneError when running toPandas()

Error message

UnknownTimeZoneError                      Traceback (most recent call last)
Input In [43], in <cell line: 1>()
----> 1 df_pandas = spark_df.toPandas()

Fix: specify the time zone when creating the Spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("app name").config("spark.sql.session.timeZone", "Asia/Tokyo").getOrCreate()
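
If more settings accumulate over time, they can be kept in one dictionary and applied in a loop. The sketch below assumes only that the builder exposes config(key, value) and returns a builder, as SparkSession.builder does; the names apply_spark_conf and SESSION_CONF are hypothetical.

```python
def apply_spark_conf(builder, conf):
    """Apply each key/value pair through builder.config(key, value) and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Settings to apply; only the time zone comes from this memo
SESSION_CONF = {"spark.sql.session.timeZone": "Asia/Tokyo"}

# Usage with PySpark installed (not executed here):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("app name")
#   spark = apply_spark_conf(builder, SESSION_CONF).getOrCreate()
#   print(spark.conf.get("spark.sql.session.timeZone"))
```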
