
Customizing SparkContext settings with SparkConf

Pyspark

Last updated at 2024-03-30, posted at 2022-02-14

# Create a SparkConf instance and pass it as an argument when instantiating SparkContext
# In the pyspark shell, a SparkContext instance (sc) is already initialized at startup

print(sc)  # already initialized by the shell

from pyspark.conf import SparkConf
from pyspark.context import SparkContext

conf = SparkConf()
# conf.set('spark.executor.memory', '2g')
print(conf.getAll())

# A SparkContext already exists, so stop it first to avoid an error
sc.stop()
sc = SparkContext(conf=conf)

<SparkContext master=local[*] appName=PySparkShell>
[('spark.master', 'local[*]'), ('spark.submit.pyFiles', ''), ('spark.submit.deployMode', 'client'), ('spark.app.name', 'PySparkShell'), ('spark.ui.showConsoleProgress', 'true')]

sc.getConf()

<pyspark.conf.SparkConf at 0x1c6aa4b2d00>

# Attributes
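print(sc.applicationId)  # A unique identifier for the Spark application.
print(sc.defaultMinPartitions)  # Default minimum number of partitions for Hadoop RDDs when not given by the user
print(sc.defaultParallelism)  # Number of CPU threads? If so, I thought it would be 21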
print(sc.resources)
print(sc.startTime)
print(sc.uiWebUrl)
print(sc.version)
sc.stop()

local-1644843835282
2
20
{}
1644843835148
http://kubernetes.docker.internal:4041
3.2.0
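
The values 2 and 20 printed above are consistent with how Spark derives them in local mode: with master local[*], defaultParallelism falls back to the number of logical cores the JVM sees (unless spark.default.parallelism is set), and defaultMinPartitions is min(defaultParallelism, 2). The following is a minimal sketch of that arithmetic in plain Python, without starting Spark. The function names are hypothetical, and os.cpu_count() is only an approximation of what the JVM reports.

```python
import os
from typing import Optional


def local_default_parallelism(configured: Optional[int] = None) -> int:
    """Approximate sc.defaultParallelism for master local[*].

    `configured` stands in for an explicit spark.default.parallelism value.
    """
    if configured is not None:
        return configured
    # os.cpu_count() may return None on unusual platforms, so fall back to 1.
    return os.cpu_count() or 1


def default_min_partitions(default_parallelism: int) -> int:
    """Mirror the rule defaultMinPartitions = min(defaultParallelism, 2)."""
    return min(default_parallelism, 2)


if __name__ == "__main__":
    parallelism = local_default_parallelism()
    print(parallelism, default_min_partitions(parallelism))
```

Comparing the first printed number with sc.defaultParallelism on the same machine is a quick way to check whether the value really tracks the logical core count.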

Trying out various SparkConf settings

conf = SparkConf()
conf.getAll()

[('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.name', 'PySparkShell'),
 ('spark.ui.showConsoleProgress', 'true')]

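from pprint import pprint
# Setter methods return the SparkConf itself, so calls can be chained
conf.setMaster("local").setAppName("changed app name")
# "CHANGED" is not a valid master URL; it only shows that the value is overwritten
conf.setMaster("CHANGED")
pprint(conf.getAll())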

conf.setExecutorEnv("VAR1", "value1")
pprint(conf.getAll())

[('spark.app.name', 'changed app name'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.master', 'CHANGED'),
 ('spark.ui.showConsoleProgress', 'true')]
[('spark.executorEnv.VAR1', 'value1'),
 ('spark.app.name', 'changed app name'),
 ('spark.submit.pyFiles', ''),
 ('spark.submit.deployMode', 'client'),
 ('spark.master', 'CHANGED'),
 ('spark.ui.showConsoleProgress', 'true')]

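Other approaches

SparkConf offers only a few dedicated setter methods (setMaster, setAppName, setExecutorEnv and so on), so let's look at other ways to configure Spark.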
https://spark.apache.org/docs/latest/configuration.html#environment-variables

Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

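According to the passage above, because of this precedence, options given on the command line or to spark-submit take effect over anything written in
spark-3.2.0-bin-hadoop3.2\conf\spark-defaults.conf.template
(Spark only reads conf\spark-defaults.conf, so the template has no effect until it is copied to that name).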
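
The precedence rule quoted above can be pictured as a layered dictionary merge, where later layers overwrite earlier ones. This is only a sketch of the rule, not how Spark implements it; the function name and all property values are hypothetical.

```python
from typing import Dict, Mapping


def resolve_spark_properties(
    defaults_file: Mapping[str, str],
    cli_flags: Mapping[str, str],
    spark_conf: Mapping[str, str],
) -> Dict[str, str]:
    """Merge three property sources, lowest precedence first."""
    resolved: Dict[str, str] = {}
    # Order matters: spark-defaults.conf < command-line flags < SparkConf.
    for layer in (defaults_file, cli_flags, spark_conf):
        resolved.update(layer)
    return resolved


# Hypothetical values for illustration only.
defaults_file = {"spark.app.name": "from-defaults", "spark.executor.memory": "1g"}
cli_flags = {"spark.app.name": "from-cli"}
spark_conf = {"spark.executor.memory": "2g"}

resolved = resolve_spark_properties(defaults_file, cli_flags, spark_conf)
print(resolved)
```

Here the application name comes from the command-line layer and the executor memory from the SparkConf layer, because each overrides the value in the defaults file.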

Running pyspark --name CHANGE on the command line results in spark.app.name=CHANGE. The settings can be checked in the Environment tab of the web UI (port 4040 by default).
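
When several properties need to be passed this way, the command line can be assembled from a dictionary. The sketch below only builds the argument list and does not launch anything; build_pyspark_command is a hypothetical helper, and spark.executor.memory=2g is just an example value. The --name and --conf KEY=VALUE flags are the ones accepted by pyspark and spark-submit.

```python
import shlex
from typing import Dict, List, Optional


def build_pyspark_command(
    app_name: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build an argument list for the pyspark launcher (hypothetical helper)."""
    command = ["pyspark"]
    if app_name is not None:
        command += ["--name", app_name]
    # Every extra property is passed as a separate --conf KEY=VALUE pair.
    for key, value in (conf or {}).items():
        command += ["--conf", f"{key}={value}"]
    return command


command = build_pyspark_command("CHANGE", {"spark.executor.memory": "2g"})
print(shlex.join(command))
```

Keeping the command as a list makes it safe to hand to subprocess.run later, while shlex.join gives a copy-and-paste form for the terminal.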

Options such as --executor-cores appear to show up in the sun.java.command system property, so they can be confirmed there.
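
If the sun.java.command string is copied out of the Environment tab, a flag value can be pulled out of it with a few lines of Python. This is a rough sketch: find_flag_value is a hypothetical helper, and the sample string is made up for illustration rather than taken from a real session.

```python
import shlex
from typing import Optional


def find_flag_value(java_command: str, flag: str) -> Optional[str]:
    """Return the token that follows `flag`, or None if the flag is absent."""
    tokens = shlex.split(java_command)
    # Stop one short of the end so tokens[index + 1] always exists.
    for index, token in enumerate(tokens[:-1]):
        if token == flag:
            return tokens[index + 1]
    return None


# Hypothetical sun.java.command value, not copied from a real session.
sample = (
    "org.apache.spark.deploy.SparkSubmit "
    "--name CHANGE --executor-cores 2 pyspark-shell"
)
print(find_flag_value(sample, "--executor-cores"))
```

shlex.split uses POSIX-style rules, so a command string containing Windows paths with backslashes would need different handling.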
