Replacing Python pandas with Spark

Posted at 2026-01-18

This post collects code that is commonly written in pandas and its correct replacement in Databricks (Spark), organized by the patterns that come up most often in real work.

Principle
❌ pandas for loops / apply
✅ Spark DataFrame + built-in functions (withColumn / groupBy / higher-order functions)

① Row-wise calculations

pandas (df is a pandas DataFrame)

Python
df["sum"] = df["a"] + df["b"]

Spark (df is a Spark DataFrame)

Python
from pyspark.sql.functions import col

df = df.withColumn("sum", col("a") + col("b"))

② Conditional logic (if / lambda)

pandas

Python
df["label"] = df["score"].apply(lambda x: "A" if x >= 80 else "B")

Spark (when)

Python
from pyspark.sql.functions import col, when

df = df.withColumn(
    "label",
    when(col("score") >= 80, "A").otherwise("B")
)
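
If the pandas side has more than two branches (nested conditionals or np.select), when can simply be chained; the first matching branch wins and otherwise() is the fallback. A minimal sketch, assuming hypothetical A/B/C thresholds:

Python
from pyspark.sql.functions import col, when

df = df.withColumn(
    "label",
    when(col("score") >= 80, "A")      # checked first
    .when(col("score") >= 60, "B")     # checked next
    .otherwise("C")                    # fallback
)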

③ apply(axis=1) → ❌ the biggest pitfall

pandas

Python
df["total"] = df.apply(lambda r: r["a"] * r["b"] + r["c"], axis=1)

Spark (broken down into column operations)

Python
df = df.withColumn("total", col("a") * col("b") + col("c"))

✔ In Spark, don't think in terms of individual "rows"

④ groupby + agg

pandas

Python
df.groupby("category")["sales"].sum().reset_index()

Spark

Python
from pyspark.sql.functions import sum

df.groupBy("category").agg(sum("sales").alias("sales_sum"))
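
When the pandas code aggregates several things at once (e.g. .agg with multiple functions), Spark's agg accepts any number of expressions. A minimal sketch reusing the same sales column:

Python
from pyspark.sql.functions import sum, avg, count

df.groupBy("category").agg(
    sum("sales").alias("sales_sum"),   # total per category
    avg("sales").alias("sales_avg"),   # mean per category
    count("*").alias("n_rows"),        # row count per category
)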

⑤ groupby + apply → must be rewritten for Spark

pandas

Python
df.groupby("id").apply(
    lambda g: g.assign(rate=g["x"] / g["x"].sum())
)

Spark (window function)

Python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, sum

w = Window.partitionBy("id")

df = df.withColumn(
    "rate",
    col("x") / sum("x").over(w)
)

✔ The most important Databricks technique
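
Another pattern that often hides behind groupby + apply is "take the latest / top row per group". That is also a window function, just with an ordering. A minimal sketch, assuming a hypothetical date column:

Python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number

# Ordered window: number the rows in each id group, newest date first
w = Window.partitionBy("id").orderBy(col("date").desc())

df_latest = (
    df.withColumn("rn", row_number().over(w))
      .filter(col("rn") == 1)   # keep only the newest row per id
      .drop("rn")
)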

⑥ merge (join)

pandas

Python
df = df1.merge(df2, on="id", how="left")

Spark

Python
df = df1.join(df2, on="id", how="left")

⑦ Handling missing values

pandas

Python
df["x"] = df["x"].fillna(0)

Spark

Python
df = df.fillna({"x": 0})
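
The dict form of fillna covers several columns with different defaults at once, and pandas' dropna(subset=...) has a direct counterpart. A minimal sketch; the y column is an assumption for illustration:

Python
# Different default per column
df = df.fillna({"x": 0, "y": "unknown"})

# Equivalent of pandas df.dropna(subset=["x"])
df = df.dropna(subset=["x"])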

⑧ sort_values → orderBy

pandas

Python
df = df.sort_values(["date", "score"], ascending=[True, False])

Spark

Python
df = df.orderBy(col("date"), col("score").desc())

⑨ unique / nunique

pandas

Python
df["user"].nunique()

Spark

Python
from pyspark.sql.functions import countDistinct

df.select(countDistinct("user")).show()
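
On very large tables an exact distinct count can be expensive; approx_count_distinct (a HyperLogLog-based estimate) is a common trade-off. A minimal sketch; the rsd tolerance shown is only an illustration:

Python
from pyspark.sql.functions import approx_count_distinct

# rsd = maximum allowed relative standard deviation of the estimate
df.select(approx_count_distinct("user", rsd=0.01)).show()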

⑩ isin

pandas

Python
df[df["status"].isin(["A", "B"])]

Spark

Python
df.filter(col("status").isin("A", "B"))

⑪ explode (list columns)

pandas

Python
df = df.explode("items")

Spark

Python
from pyspark.sql.functions import explode

df = df.withColumn("items", explode("items"))

⑫ When for / apply is truly unavoidable (last resort)

❌ pandas UDF (slow)
⭕ Spark UDF (truly the last resort)

Python
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def my_func(x):
    # Python UDFs receive None for null inputs, so guard against it
    if x is None:
        return None
    return x * 2

df = df.withColumn("y", my_func(col("x")))

👉 In 90% of cases a UDF is unnecessary

⑬ Decision flow for pandas → Spark replacement

Tempted to write apply / for?
        ↓
Can it be decomposed into column operations?        YES → withColumn
        NO ↓
Can it be expressed with a window function?         YES → over()
        NO ↓
Can a higher-order function (array / map) do it?    YES → transform / explode
        NO ↓
Last resort: UDF
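
The higher-order-function branch deserves a concrete example, since none appears above. A minimal sketch of transform on an array column (the lambda form needs Spark 3.1+; the prices column is an assumption):

Python
from pyspark.sql.functions import col, transform

# Element-wise work on an array column, without exploding it or writing a UDF
# (related higher-order functions: filter, aggregate, exists, ...)
df = df.withColumn(
    "prices_with_tax",
    transform(col("prices"), lambda x: x * 1.1)
)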

Summary (extremely important)

pandas                   Spark
for / apply              ❌ (avoid)
vectorized operations    withColumn
groupby + apply          Window
list handling            transform / explode
aggregation              groupBy / agg
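
As a closing illustration, a minimal sketch chaining several of the patterns above on one DataFrame (the category, sales and score columns are assumptions):

Python
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, sum, avg

w = Window.partitionBy("category")

result = (
    df
    .withColumn("label", when(col("score") >= 80, "A").otherwise("B"))   # ② conditional column
    .withColumn("share", col("sales") / sum("sales").over(w))            # ⑤ window-based share
    .groupBy("category", "label")                                        # ④ groupBy + agg
    .agg(sum("sales").alias("sales_sum"), avg("share").alias("avg_share"))
    .orderBy(col("sales_sum").desc())                                    # ⑧ orderBy
)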