0
0

2024/4/12に翔泳社よりApache Spark徹底入門を出版します!

書籍のサンプルノートブックをウォークスルーしていきます。Python/Chapter02/2-2 M&M Countとなります。

翻訳ノートブックのリポジトリはこちら。

ノートブックはこちら

from pyspark.sql.functions import *

# CSVファイルのパス
mnm_file = "/databricks-datasets/learning-spark-v2/mnm_dataset.csv"

CSVからの読み込みおよびスキーマの推定

mnm_df = (spark
          .read
          .format("csv") # フォーマット指定
          .option("header", "true") # ヘッダーあり
          .option("inferSchema", "true") # スキーマを推定
          .load(mnm_file))

display(mnm_df) # displayコマンドでデータフレームを表示

Screenshot 2024-03-27 at 12.20.47.png

mnm_df.rdd.getNumPartitions()
1

全ての色のカウントを集計し、州と色でgroupByし、カウントの降順でorderBy

count_mnm_df = (mnm_df
                .select("State", "Color", "Count") # State、Color、Countを選択
                .groupBy("State", "Color") # StateとColorでグルーピング
                .agg(count("Count").alias("Total")) # カウントを集計し列名をTotalに
                .orderBy("Total", ascending=False)) # Totalの降順でソート

count_mnm_df.show(n=60, truncate=False) # 先頭60件を表示
print(f"Total Rows = {count_mnm_df.count()}")
+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
|TX   |Yellow|1703 |
|AZ   |Brown |1698 |
|NV   |Green |1698 |
|CO   |Blue  |1695 |
|WY   |Green |1695 |
|NM   |Red   |1690 |
|AZ   |Orange|1689 |
|NM   |Yellow|1688 |
|NM   |Brown |1687 |
|UT   |Orange|1684 |
|NM   |Green |1682 |
|UT   |Red   |1680 |
|AZ   |Green |1676 |
|NV   |Yellow|1675 |
|NV   |Blue  |1673 |
|WA   |Red   |1671 |
|WY   |Red   |1670 |
|WA   |Brown |1669 |
|NM   |Orange|1665 |
|WY   |Blue  |1664 |
|WA   |Yellow|1663 |
|WA   |Orange|1658 |
|CA   |Orange|1657 |
|NV   |Brown |1657 |
|CO   |Brown |1656 |
|CA   |Red   |1656 |
|UT   |Blue  |1655 |
|AZ   |Yellow|1654 |
|TX   |Orange|1652 |
|AZ   |Red   |1648 |
|OR   |Blue  |1646 |
|OR   |Red   |1645 |
|UT   |Yellow|1645 |
|CO   |Orange|1642 |
|TX   |Brown |1641 |
|NM   |Blue  |1638 |
|AZ   |Blue  |1636 |
|OR   |Green |1634 |
|UT   |Brown |1631 |
|WY   |Yellow|1626 |
|WA   |Blue  |1625 |
|CO   |Red   |1624 |
|OR   |Brown |1621 |
|TX   |Blue  |1614 |
|OR   |Yellow|1614 |
|NV   |Red   |1610 |
|CA   |Blue  |1603 |
|WY   |Orange|1595 |
|UT   |Green |1591 |
|WY   |Brown |1532 |
+-----+------+-----+

Total Rows = 60

Stateでフィルタリングすることでカルフォルニアのカウント集計値を取得

ca_count_mnm_df = (mnm_df
                   .select("State", "Color", "Count") # State、Color、Countを選択
                   .where(mnm_df.State == "CA") # StateがCAであるものをフィルタリング
                   .groupBy("State", "Color") # StateとColorでグルーピング
                   .agg(count("Count").alias("Total")) # カウントを集計し列名をTotalに
                   .orderBy("Total", ascending=False)) # Totalの降順でソート

ca_count_mnm_df.show(n=10, truncate=False) # 先頭10件を表示
+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|CA   |Green |1723 |
|CA   |Brown |1718 |
|CA   |Orange|1657 |
|CA   |Red   |1656 |
|CA   |Blue  |1603 |
+-----+------+-----+

はじめてのDatabricks

はじめてのDatabricks

Databricks無料トライアル

Databricks無料トライアル

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0