PythonのpandasからSparkへの置き換え

Last updated at 2026-01-18Posted at 2026-01-18

「pandas でよく書かれるコード → Databricks（Spark）での正しい置き換え」を、実務で頻出するパターン別にまとめます。

原則
❌ pandasの for / apply
✅ Spark DataFrame + 組み込み関数（withColumn / groupBy / 高階関数）

① 行ごとの計算

pandas（dfはPandasのDataFrame）

Python

df["sum"] = df["a"] + df["b"]

Spark（dfはSparkのDataFrame）

Python

from pyspark.sql.functions import col

df = df.withColumn("sum", col("a") + col("b"))

② 条件分岐（if / lambda）

pandas

Python

df["label"] = df["score"].apply(lambda x: "A" if x >= 80 else "B")

Spark（when）

Python

from pyspark.sql.functions import when

df = df.withColumn(
    "label",
    when(col("score") >= 80, "A").otherwise("B")
)

③ apply(axis=1) → ❌一番の地雷

Python

df["total"] = df.apply(lambda r: r["a"] * r["b"] + r["c"], axis=1)

Spark（列演算に分解）

Python

df = df.withColumn("total", col("a") * col("b") + col("c"))

✔ Sparkでは「行」という概念で考えない

④ groupby + agg

pandas

Python

df.groupby("category")["sales"].sum().reset_index()

Spark

Python

from pyspark.sql.functions import sum

df.groupBy("category").agg(sum("sales").alias("sales_sum"))

⑤ groupby + apply → Sparkでは書き換え必須

pandas

Python

df.groupby("id").apply(
    lambda g: g.assign(rate=g["x"] / g["x"].sum())
)

Spark（Window関数）

Python

from pyspark.sql.window import Window
from pyspark.sql.functions import sum

w = Window.partitionBy("id")

df = df.withColumn(
    "rate",
    col("x") / sum("x").over(w)
)

✔ Databricks最重要テクニック

⑥ merge（join）

pandas

Python

df = df1.merge(df2, on="id", how="left")

Spark

Python

df = df1.join(df2, on="id", how="left")

⑦ 欠損値処理

pandas

Python

df["x"] = df["x"].fillna(0)

Spark

Python

df = df.fillna({"x": 0})

⑧ sort_values → orderBy

pandas

Python

df = df.sort_values(["date", "score"], ascending=[True, False])

Spark

Python

df = df.orderBy(col("date"), col("score").desc())

⑨ unique / nunique

pandas

Python

df["user"].nunique()

Spark

Python

from pyspark.sql.functions import countDistinct

df.select(countDistinct("user")).show()

⑩ isin

pandas

Python

df[df["status"].isin(["A", "B"])]

Spark

Python

df.filter(col("status").isin("A", "B"))

⑪ explode（リスト列）

pandas

Python

df = df.explode("items")

Spark

Python

from pyspark.sql.functions import explode

df = df.withColumn("items", explode("items"))

⑫ for / apply がどうしても必要な場合（最終手段）

❌ pandas UDF（遅い）
⭕ Spark UDF（本当に最後）

Python

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def my_func(x):
    return x * 2

df = df.withColumn("y", my_func(col("x")))

👉 90%のケースでUDFは不要

⑬ pandas → Spark 置き換え判断フロー

apply / for を書きたくなった
        ↓        
列演算に分解できる？ → YES → withColumn
        ↓ NO
Window関数で表現できる？ → YES → over()
        ↓ NO
高階関数（array/map）？ → YES
        ↓ NO
最終手段：UDF

まとめ（超重要）

pandas	Spark
for / apply	❌
ベクトル演算	withColumn
groupby+apply	Window
list処理	transform / explode
集計	groupBy / agg

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up