
A walkthrough of the English SDK for Spark sample notebook


The English SDK for Spark sample notebook referenced here has been updated, so I tried it out. The UDF-generation behavior in particular was instructive.

Environment setup

%pip install pyspark-ai
dbutils.library.restartPython()
import os
os.environ["OPENAI_API_KEY"] = dbutils.secrets.get("demo-token-takaaki.yayoi", "openai_api_key")
os.environ["GOOGLE_API_KEY"] = "<GOOGLE_API_KEY>"
os.environ["GOOGLE_CSE_ID"] = "<GOOGLE_CSE_ID>"

import openai
openai.api_key = os.getenv("OPENAI_API_KEY")

from pyspark_ai import SparkAI

spark_ai = SparkAI(verbose=True)
spark_ai.activate()  # activate partial functions for Spark DataFrame

Since verbose=True is specified here, the intermediate processing steps are printed as well. Setting it to False suppresses that output.

Example 1: 2022 US auto sales by brand

Generate a DataFrame by specifying a URL.

auto_df = spark_ai.create_df("https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand")
auto_df.show()
INFO: Parsing URL: https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_732c4c AS 
SELECT * FROM VALUES
('Toyota', 1849751, -9),
('Ford', 1767439, -2),
('Chevrolet', 1502389, 6),
('Honda', 881201, -33),
('Hyundai', 724265, -2),
('Kia', 693549, -1),
('Jeep', 684612, -12),
('Nissan', 682731, -25),
('Subaru', 556581, -5),
('Ram Trucks', 545194, -16),
('GMC', 517649, 7),
('Mercedes-Benz', 350949, 7),
('BMW', 332388, -1),
('Volkswagen', 301069, -20),
('Mazda', 294908, -11),
('Lexus', 258704, -15),
('Dodge', 190793, -12),
('Audi', 186875, -5),
('Cadillac', 134726, 14),
('Chrysler', 112713, -2),
('Buick', 103519, -42),
('Acura', 102306, -35),
('Volvo', 102038, -16),
('Mitsubishi', 102037, -16),
('Lincoln', 83486, -4),
('Porsche', 70065, 0),
('Genesis', 56410, 14),
('INFINITI', 46619, -20),
('MINI', 29504, -1),
('Alfa Romeo', 12845, -30),
('Maserati', 6413, -10),
('Bentley', 3975, 0),
('Lamborghini', 3134, 3),
('Fiat', 915, -61),
('McLaren', 840, -35),
('Rolls-Royce', 460, 7)
AS v1(Brand, US_Sales_2022, Sales_Change_Percentage)

INFO: Storing data into temp view: spark_ai_temp_view_732c4c

+-------------+-------------+-----------------------+
|        Brand|US_Sales_2022|Sales_Change_Percentage|
+-------------+-------------+-----------------------+
|       Toyota|      1849751|                     -9|
|         Ford|      1767439|                     -2|
|    Chevrolet|      1502389|                      6|
|        Honda|       881201|                    -33|
|      Hyundai|       724265|                     -2|
|          Kia|       693549|                     -1|
|         Jeep|       684612|                    -12|
|       Nissan|       682731|                    -25|
|       Subaru|       556581|                     -5|
|   Ram Trucks|       545194|                    -16|
|          GMC|       517649|                      7|
|Mercedes-Benz|       350949|                      7|
|          BMW|       332388|                     -1|
|   Volkswagen|       301069|                    -20|
|        Mazda|       294908|                    -11|
|        Lexus|       258704|                    -15|
|        Dodge|       190793|                    -12|
|         Audi|       186875|                     -5|
|     Cadillac|       134726|                     14|
|     Chrysler|       112713|                     -2|
+-------------+-------------+-----------------------+
only showing top 20 rows

Verify that the sales change percentage falls between -100 and 100.

auto_df.ai.verify("expect sales change percentage to be between -100 to 100")

True is returned.

INFO: LLM Output:
def check_sales_change_percentage(df) -> bool:
    from pyspark.sql.functions import col

    # Check if Sales_Change_Percentage is between -100 and 100
    df_filtered = df.filter((col("Sales_Change_Percentage") >= -100) & (col("Sales_Change_Percentage") <= 100))

    # If the number of rows in the filtered DataFrame is equal to the number of rows in the original DataFrame,
    # then all Sales_Change_Percentage values are between -100 and 100
    if df_filtered.count() == df.count():
        return True
    else:
        return False

result = check_sales_change_percentage(df)
INFO: Generated code:
def check_sales_change_percentage(df) -> bool:
    from pyspark.sql.functions import col

    # Check if Sales_Change_Percentage is between -100 and 100
    df_filtered = df.filter((col("Sales_Change_Percentage") >= -100) & (col("Sales_Change_Percentage") <= 100))

    # If the number of rows in the filtered DataFrame is equal to the number of rows in the original DataFrame,
    # then all Sales_Change_Percentage values are between -100 and 100
    if df_filtered.count() == df.count():
        return True
    else:
        return False

result = check_sales_change_percentage(df)

INFO: 
Result: True

Now plot it. Even without any arguments to plot(), it renders a sensible chart. Only the Python code is excerpted in the output below; in practice an explanation of the code is printed as well.

auto_df.ai.plot()
import plotly.graph_objects as go
from pyspark.sql import SparkSession
import pandas as pd

# Start Spark session
spark = SparkSession.builder.appName('SparkSQL').getOrCreate()

# Assuming df is the Spark DataFrame
# Convert Spark DataFrame to Pandas DataFrame
pandas_df = df.toPandas()

# Create a bar chart for US_Sales_2022
trace1 = go.Bar(
    x=pandas_df['Brand'],
    y=pandas_df['US_Sales_2022'],
    name='US Sales 2022'
)

# Create a line chart for Sales_Change_Percentage
trace2 = go.Scatter(
    x=pandas_df['Brand'],
    y=pandas_df['Sales_Change_Percentage'],
    name='Sales Change Percentage',
    yaxis='y2'
)

data = [trace1, trace2]

layout = go.Layout(
    title='Brand Sales Analysis',
    yaxis=dict(
        title='US Sales 2022'
    ),
    yaxis2=dict(
        title='Sales Change Percentage',
        titlefont=dict(
            color='rgb(148, 103, 189)'
        ),
        tickfont=dict(
            color='rgb(148, 103, 189)'
        ),
        overlaying='y',
        side='right'
    )
)

fig = go.Figure(data=data, layout=layout)
fig.show()

[Screenshot: the rendered chart (bars: US sales 2022; line: sales change percentage)]

Next, have it draw a pie chart.

auto_df.ai.plot("pie chart for US sales market shares, show the top 5 brands and the sum of others")
import plotly.graph_objects as go
from pyspark.sql import SparkSession
import pandas as pd

# Start Spark session
spark = SparkSession.builder.getOrCreate()

# Assuming df is the Spark DataFrame
df_pd = df.toPandas()

# Calculate the total sales
total_sales = df_pd['US_Sales_2022'].sum()

# Calculate the sales percentage for each brand
df_pd['Sales_Percentage'] = df_pd['US_Sales_2022'] / total_sales * 100

# Sort the dataframe by sales percentage in descending order
df_pd = df_pd.sort_values('Sales_Percentage', ascending=False)

# Get the top 5 brands
top_5_brands = df_pd.head(5)

# Calculate the sum of sales percentage for other brands
other_brands = pd.DataFrame([['Others', df_pd[5:]['Sales_Percentage'].sum()]], columns=['Brand', 'Sales_Percentage'])

# Concatenate the top 5 brands and other brands
final_df = pd.concat([top_5_brands, other_brands], ignore_index=True)

# Create a pie chart
fig = go.Figure(data=[go.Pie(labels=final_df['Brand'], values=final_df['Sales_Percentage'], hole=.3)])

# Display the plot
fig.show()

[Screenshot: the rendered pie chart (top 5 brands plus Others)]
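The top-5-plus-Others aggregation that the generated code performs with pandas can be reproduced in plain Python from the sales figures ingested above (a sketch over a subset of the brands; the logic is the same for all 36):

```python
# Sales figures ingested in Example 1 (subset of the 36 brands).
sales = {
    "Toyota": 1_849_751, "Ford": 1_767_439, "Chevrolet": 1_502_389,
    "Honda": 881_201, "Hyundai": 724_265, "Kia": 693_549,
    "Jeep": 684_612, "Nissan": 682_731, "Subaru": 556_581,
    "Ram Trucks": 545_194,
}

total = sum(sales.values())

# Sort brands by sales, keep the top 5, fold the rest into "Others".
ranked = sorted(sales.items(), key=lambda kv: kv[1], reverse=True)
top5 = [(brand, cnt / total * 100) for brand, cnt in ranked[:5]]
others = sum(cnt for _, cnt in ranked[5:]) / total * 100
shares = top5 + [("Others", others)]

for brand, pct in shares:
    print(f"{brand}: {pct:.1f}%")
```

The shares computed this way are exactly the labels/values fed to go.Pie in the generated code.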

Now transform the DataFrame. The example below extracts the brand with the highest growth.

# Apply transforms to a Dataframe
auto_top_growth_df=auto_df.ai.transform("brand with the highest growth")
auto_top_growth_df.show()

Cadillac comes back as the brand with the highest growth rate.

INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_01becc")



> Entering new AgentExecutor chain...
Thought: I will query the Brand and Sales_Change_Percentage columns, ordering by Sales_Change_Percentage in descending order.
Action: query_validation
Action Input: SELECT Brand, Sales_Change_Percentage FROM spark_ai_temp_view_01becc ORDER BY Sales_Change_Percentage DESC LIMIT 1
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT Brand, Sales_Change_Percentage FROM spark_ai_temp_view_01becc ORDER BY Sales_Change_Percentage DESC LIMIT 1

> Finished chain.
INFO: SQL query:
SELECT Brand, Sales_Change_Percentage FROM spark_ai_temp_view_01becc ORDER BY Sales_Change_Percentage DESC LIMIT 1

+--------+-----------------------+
|   Brand|Sales_Change_Percentage|
+--------+-----------------------+
|Cadillac|                     14|
+--------+-----------------------+
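The same lookup can be reproduced in plain Python over the ingested (Brand, Sales_Change_Percentage) pairs. One caveat worth noting: Genesis is also at +14 in the ingested data, so the generated ORDER BY ... LIMIT 1 just happens to return Cadillac from the tied rows (a sketch over a subset of the rows):

```python
# (brand, sales_change_percentage) pairs from the ingested data (subset).
changes = [
    ("Chevrolet", 6), ("GMC", 7), ("Mercedes-Benz", 7),
    ("Cadillac", 14), ("Genesis", 14), ("Rolls-Royce", 7),
    ("Lamborghini", 3), ("Toyota", -9),
]

# max() returns the first row with the largest change, mirroring LIMIT 1.
best = max(changes, key=lambda row: row[1])
print(best)  # ('Cadillac', 14)

# All tied leaders, in case the tie matters:
top = max(c for _, c in changes)
leaders = [b for b, c in changes if c == top]
print(leaders)  # ['Cadillac', 'Genesis']
```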

Use explain to describe the DataFrame.

# Explain what a DataFrame is retrieving.
auto_top_growth_df.ai.explain()

Out[9]: 'In summary, this dataframe is retrieving the brand with the highest sales change percentage in 2022 from a temporary view. The sales change percentage is calculated based on the US sales in 2022.'


Example 2: US presidents

Fetch data from Wikipedia.

# You can also specify the expected columns for the ingestion.
df=spark_ai.create_df("https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States", ["president", "vice_president"])
df.show()
INFO: Parsing URL: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_f24ac6 AS 
SELECT * FROM VALUES
('George Washington', 'John Adams'),
('John Adams', 'Thomas Jefferson'),
('Thomas Jefferson', 'Aaron Burr'),
('James Madison', 'George Clinton'),
('James Monroe', 'Daniel D. Tompkins'),
('John Quincy Adams', 'John C. Calhoun'),
('Andrew Jackson', 'John C. Calhoun'),
('Martin Van Buren', 'Richard Mentor Johnson'),
('William Henry Harrison', 'John Tyler'),
('John Tyler', 'None'),
('James K. Polk', 'George M. Dallas'),
('Zachary Taylor', 'Millard Fillmore'),
('Millard Fillmore', 'None'),
('Franklin Pierce', 'William R. King'),
('James Buchanan', 'John C. Breckinridge'),
('Abraham Lincoln', 'Hannibal Hamlin'),
('Andrew Johnson', 'None'),
('Ulysses S. Grant', 'Schuyler Colfax'),
('Rutherford B. Hayes', 'William A. Wheeler'),
('James A. Garfield', 'Chester A. Arthur'),
('Chester A. Arthur', 'None'),
('Grover Cleveland', 'Thomas A. Hendricks'),
('Benjamin Harrison', 'Levi P. Morton'),
('Grover Cleveland', 'Adlai Stevenson I'),
('William McKinley', 'Garret Hobart'),
('Theodore Roosevelt', 'Charles W. Fairbanks'),
('William Howard Taft', 'James S. Sherman'),
('Woodrow Wilson', 'Thomas R. Marshall'),
('Warren G. Harding', 'Calvin Coolidge'),
('Calvin Coolidge', 'Charles G. Dawes'),
('Herbert Hoover', 'Charles Curtis'),
('Franklin D. Roosevelt', 'John Nance Garner'),
('Harry S. Truman', 'Alben W. Barkley'),
('Dwight D. Eisenhower', 'Richard Nixon'),
('John F. Kennedy', 'Lyndon B. Johnson')
AS v1(president, vice_president)

INFO: Storing data into temp view: spark_ai_temp_view_f24ac6

+--------------------+--------------------+
|           president|      vice_president|
+--------------------+--------------------+
|   George Washington|          John Adams|
|          John Adams|    Thomas Jefferson|
|    Thomas Jefferson|          Aaron Burr|
|       James Madison|      George Clinton|
|        James Monroe|  Daniel D. Tompkins|
|   John Quincy Adams|     John C. Calhoun|
|      Andrew Jackson|     John C. Calhoun|
|    Martin Van Buren|Richard Mentor Jo...|
|William Henry Har...|          John Tyler|
|          John Tyler|                None|
|       James K. Polk|    George M. Dallas|
|      Zachary Taylor|    Millard Fillmore|
|    Millard Fillmore|                None|
|     Franklin Pierce|     William R. King|
|      James Buchanan|John C. Breckinridge|
|     Abraham Lincoln|     Hannibal Hamlin|
|      Andrew Johnson|                None|
|    Ulysses S. Grant|     Schuyler Colfax|
| Rutherford B. Hayes|  William A. Wheeler|
|   James A. Garfield|   Chester A. Arthur|
+--------------------+--------------------+
only showing top 20 rows

Extract the presidents who also served as vice president.

presidents_who_were_vp = df.ai.transform("presidents who were also vice presidents")
presidents_who_were_vp.show()
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_23f20d")



> Entering new AgentExecutor chain...
Thought: I will query the president column where the president's name is also in the vice_president column.
Action: query_validation
Action Input: SELECT president FROM spark_ai_temp_view_23f20d WHERE president IN (SELECT vice_president FROM spark_ai_temp_view_23f20d)
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT president FROM spark_ai_temp_view_23f20d WHERE president IN (SELECT vice_president FROM spark_ai_temp_view_23f20d)

> Finished chain.
INFO: SQL query:
SELECT president FROM spark_ai_temp_view_23f20d WHERE president IN (SELECT vice_president FROM spark_ai_temp_view_23f20d)

+-----------------+
|        president|
+-----------------+
|       John Adams|
| Thomas Jefferson|
|       John Tyler|
| Millard Fillmore|
|Chester A. Arthur|
|  Calvin Coolidge|
+-----------------+
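The generated IN subquery amounts to a membership test of the president column against the set of vice presidents. A plain-Python sketch over a subset of the ingested rows:

```python
# (president, vice_president) pairs as ingested (subset).
rows = [
    ("George Washington", "John Adams"),
    ("John Adams", "Thomas Jefferson"),
    ("Thomas Jefferson", "Aaron Burr"),
    ("James Madison", "George Clinton"),
    ("William Henry Harrison", "John Tyler"),
    ("John Tyler", "None"),
    ("Zachary Taylor", "Millard Fillmore"),
    ("Millard Fillmore", "None"),
]

# Mirrors: WHERE president IN (SELECT vice_president FROM ...)
vice_presidents = {vp for _, vp in rows}
presidents_who_were_vp = [p for p, _ in rows if p in vice_presidents]
print(presidents_who_were_vp)
# ['John Adams', 'Thomas Jefferson', 'John Tyler', 'Millard Fillmore']
```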

Explain the DataFrame.

presidents_who_were_vp.ai.explain()

Out[12]: 'In summary, this dataframe is retrieving the names of the presidents who have also been vice-presidents from the temporary view spark_ai_temp_view_23f20d.'


Verify that there are no NULL values.

presidents_who_were_vp.ai.verify("expect no NULL values")

True is returned.

INFO: LLM Output:
def has_no_null_values(df) -> bool:
    from pyspark.sql.functions import col, sum as _sum

    # Check if there are any null values in the DataFrame
    null_counts = df.select([_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).collect()[0].asDict()

    # If any column has null values, return False
    if any(value > 0 for value in null_counts.values()):
        return False
    else:
        return True

result = has_no_null_values(df)
INFO: Generated code:
def has_no_null_values(df) -> bool:
    from pyspark.sql.functions import col, sum as _sum

    # Check if there are any null values in the DataFrame
    null_counts = df.select([_sum(col(c).isNull().cast("int")).alias(c) for c in df.columns]).collect()[0].asDict()

    # If any column has null values, return False
    if any(value > 0 for value in null_counts.values()):
        return False
    else:
        return True

result = has_no_null_values(df)

INFO: 
Result: True
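The generated check casts isNull() to an integer and sums it per column. One detail worth noting: the ingested presidents table stores the literal string 'None' rather than a SQL NULL, so a check of this kind passes even for those rows. A plain-Python sketch of the same aggregation:

```python
# Rows as ingested above: note the *string* 'None', not a real null.
rows = [
    {"president": "John Tyler", "vice_president": "None"},
    {"president": "Millard Fillmore", "vice_president": "None"},
    {"president": "Abraham Lincoln", "vice_president": "Hannibal Hamlin"},
]

columns = ["president", "vice_president"]
# Mirrors _sum(col(c).isNull().cast("int")) for each column.
null_counts = {c: sum(1 for r in rows if r[c] is None) for c in columns}
has_no_null_values = not any(v > 0 for v in null_counts.values())
print(null_counts)          # {'president': 0, 'vice_president': 0}
print(has_no_null_values)   # True
```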

Example 3: Top 10 tech companies

Generate a DataFrame by running a web search. The column names are specified explicitly as ['company', 'cap', 'country'].

# Search and ingest web content into a DataFrame
company_df=spark_ai.create_df("Top 10 tech companies by market cap", ['company', 'cap', 'country'])
company_df.show()
INFO: Parsing URL: https://www.statista.com/statistics/1350976/leading-tech-companies-worldwide-by-market-cap/

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_e5a152 AS 
SELECT * FROM VALUES
('Apple', 2242, 'USA'), 
('Microsoft', 1821, 'USA'), 
('Alphabet (Google)', 1229, 'USA'), 
('Amazon', 902.4, 'USA'), 
('Tesla', 541.4, 'USA'), 
('TSMC', 410.9, 'Taiwan'), 
('NVIDIA', 401.7, 'USA'), 
('Tencent', 377.8, 'China'), 
('Meta Platforms (Facebook)', 302.1, 'USA'), 
('Samsung', 301.7, 'South Korea') 
AS v1(company, cap, country)

INFO: Storing data into temp view: spark_ai_temp_view_e5a152

+--------------------+------+-----------+
|             company|   cap|    country|
+--------------------+------+-----------+
|               Apple|2242.0|        USA|
|           Microsoft|1821.0|        USA|
|   Alphabet (Google)|1229.0|        USA|
|              Amazon| 902.4|        USA|
|               Tesla| 541.4|        USA|
|                TSMC| 410.9|     Taiwan|
|              NVIDIA| 401.7|        USA|
|             Tencent| 377.8|      China|
|Meta Platforms (F...| 302.1|        USA|
|             Samsung| 301.7|South Korea|
+--------------------+------+-----------+

Restrict to companies in the USA.

us_company_df=company_df.ai.transform("companies in USA")
us_company_df.show()
INFO: Creating temp view for the transform:
df.createOrReplaceTempView("spark_ai_temp_view_ca6168")



> Entering new AgentExecutor chain...
Final Answer: SELECT company FROM spark_ai_temp_view_ca6168 WHERE country = 'USA'

> Finished chain.
INFO: SQL query:
SELECT company FROM spark_ai_temp_view_ca6168 WHERE country = 'USA'

+--------------------+
|             company|
+--------------------+
|               Apple|
|           Microsoft|
|   Alphabet (Google)|
|              Amazon|
|               Tesla|
|              NVIDIA|
|Meta Platforms (F...|
+--------------------+

Explain the DataFrame.

us_company_df.ai.explain()

Out[16]: "In summary, this dataframe is retrieving the company names from a temporary view where the country is 'USA'."

Verify that the company names are unique.

us_company_df.ai.verify("expect all company names to be unique")
INFO: LLM Output:
def check_unique_companies(df) -> bool:
    from pyspark.sql import functions as F

    # Count the number of unique company names
    unique_companies = df.select(F.countDistinct("company")).collect()[0][0]

    # Check if the number of unique companies is equal to the total number of rows
    if unique_companies == df.count():
        return True
    else:
        return False

result = check_unique_companies(df)
INFO: Generated code:
def check_unique_companies(df) -> bool:
    from pyspark.sql import functions as F

    # Count the number of unique company names
    unique_companies = df.select(F.countDistinct("company")).collect()[0][0]

    # Check if the number of unique companies is equal to the total number of rows
    if unique_companies == df.count():
        return True
    else:
        return False

result = check_unique_companies(df)

INFO: 
Result: True

Example 4: Ingesting from a URL

Instead of a web search, you can also ingest data by specifying a URL directly.

best_albums_df = spark_ai.create_df('https://time.com/6235186/best-albums-2022/', ["album", "artist", "year"])
best_albums_df.show()
INFO: Parsing URL: https://time.com/6235186/best-albums-2022/

INFO: SQL query for the ingestion:
CREATE OR REPLACE TEMP VIEW spark_ai_temp_view_5d4812 AS SELECT * FROM VALUES
('Motomami', 'Rosalía', 2022),
('You Can’t Kill Me', '070 Shake', 2022),
('Mr. Morale & The Big Steppers', 'Kendrick Lamar', 2022),
('Big Time', 'Angel Olsen', 2022),
('Electricity', 'Ibibio Sound Machine', 2022),
('It’s Almost Dry', 'Pusha T', 2022),
('Chloe and the Next 20th Century', 'Father John Misty', 2022),
('Renaissance', 'Beyoncé', 2022),
('19 Masters', 'Saya Gray', 2022),
('Un Verano Sin Ti', 'Bad Bunny', 2022)
AS v1(album, artist, year)

INFO: Storing data into temp view: spark_ai_temp_view_5d4812

+--------------------+--------------------+----+
|               album|              artist|year|
+--------------------+--------------------+----+
|            Motomami|             Rosalía|2022|
|   You Can’t Kill Me|           070 Shake|2022|
|Mr. Morale & The ...|      Kendrick Lamar|2022|
|            Big Time|         Angel Olsen|2022|
|         Electricity|Ibibio Sound Machine|2022|
|     It’s Almost Dry|             Pusha T|2022|
|Chloe and the Nex...|   Father John Misty|2022|
|         Renaissance|             Beyoncé|2022|
|          19 Masters|           Saya Gray|2022|
|    Un Verano Sin Ti|           Bad Bunny|2022|
+--------------------+--------------------+----+

Verify that every year is 2022.

best_albums_df.ai.verify("expect each year to be 2022")
INFO: LLM Output:
def check_year(df) -> bool:
    from pyspark.sql.functions import col

    # Check if all years in the DataFrame are 2022
    if df.filter(col('year') != 2022).count() == 0:
        return True
    else:
        return False

result = check_year(df)
INFO: Generated code:
def check_year(df) -> bool:
    from pyspark.sql.functions import col

    # Check if all years in the DataFrame are 2022
    if df.filter(col('year') != 2022).count() == 0:
        return True
    else:
        return False

result = check_year(df)

INFO: 
Result: True

Example 5: Generating UDFs

Personally, this is the part that finally clicked for me; I had never properly understood it before. You declare a UDF (user-defined function) with the @spark_ai.udf decorator and write a comment, and the SDK implements the UDF according to that comment. In the example below, the docstring says to convert a grade percentage to a letter grade using standard cutoffs.

@spark_ai.udf
def convert_grades(grade_percent: float) -> str:
    """Convert the grade percent to a letter grade using standard cutoffs"""
    ...

Impressive: it actually got implemented.

INFO: Creating following Python UDF:
def convert_grades(grade_percent) -> str:
    if grade_percent is not None:
        if grade_percent >= 90:
            return 'A'
        elif grade_percent >= 80:
            return 'B'
        elif grade_percent >= 70:
            return 'C'
        elif grade_percent >= 60:
            return 'D'
        else:
            return 'F'

Once you register it as a UDF, you can call it from PySpark as usual.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.udf.register("convert_grades", convert_grades)
percentGrades = [(1, 97.8), (2, 72.3), (3, 81.2)]
df = spark.createDataFrame(percentGrades, ["student_id", "grade_percent"])
df.selectExpr("student_id", "convert_grades(grade_percent)").show()
+----------+-----------------------------+
|student_id|convert_grades(grade_percent)|
+----------+-----------------------------+
|         1|                            A|
|         2|                            C|
|         3|                            B|
+----------+-----------------------------+
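The generated function can also be exercised outside Spark to check the cutoff boundaries. One subtlety: when grade_percent is None, the generated code falls through every branch and implicitly returns None:

```python
def convert_grades(grade_percent) -> str:
    # Verbatim logic from the generated UDF above.
    if grade_percent is not None:
        if grade_percent >= 90:
            return 'A'
        elif grade_percent >= 80:
            return 'B'
        elif grade_percent >= 70:
            return 'C'
        elif grade_percent >= 60:
            return 'D'
        else:
            return 'F'

# Boundary checks for the standard cutoffs.
print([convert_grades(x) for x in (97.8, 90, 89.9, 72.3, 60, 59.9)])
# ['A', 'A', 'B', 'C', 'D', 'F']
print(convert_grades(None))  # None
```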

Cache

I confirmed that it works, but I am still not sure how it is meant to be used. At present, commit() is the only method implemented.

The SparkAI supports a simple in-memory and persistent cache system. It keeps an in-memory staging cache, which gets updated for LLM and web search results. The staging cache can be persisted through the commit() method. Cache lookup is always performed on both in-memory staging cache and persistent cache.

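As an illustration only (this is not the SDK's actual implementation), the behavior described above, an in-memory staging tier, lookups against both tiers, and commit() for persistence, could be sketched like this:

```python
import json
import os


class ToyCache:
    """Toy two-tier cache: in-memory staging plus a persistent JSON file."""

    def __init__(self, path="spark_ai_cache.json"):
        self.path = path
        self.staging = {}          # updated as new LLM / web-search results arrive
        self.persistent = {}
        if os.path.exists(path):   # load any previously committed cache
            with open(path) as f:
                self.persistent = json.load(f)

    def lookup(self, key):
        # Lookup always checks both the staging tier and the persistent tier.
        return self.staging.get(key, self.persistent.get(key))

    def update(self, key, value):
        self.staging[key] = value

    def commit(self):
        # Persist the staging cache, mirroring spark_ai.commit().
        self.persistent.update(self.staging)
        with open(self.path, "w") as f:
            json.dump(self.persistent, f)
        self.staging = {}
```

With this sketch, lookup() hits entries that are only staged, and after commit() a fresh instance pointed at the same file still finds them.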

spark_ai.commit()

A spark_ai_cache.json file is generated in the current path.

%sh
cat spark_ai_cache.json

The intermediate behavior appears to be recorded here.

{
    "key": "https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand",
    "value": "\n\n\nFull-Year 2022 National Auto Sales By Brand\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n  <img  src=\"https://data.adxcel-ec2.com/pixel/?ad_log=referer&amp;action=lead&amp;pixid=b719b86c-f9d4-4c28-a8a8-135a5ff2099d\" width=\"1\" height=\"1\" border=\"0\">\n  \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate Account\n\n\n\n\n\n\n\n\nLogin\n\n\n\n\n\n\n\n\nCarPro You have a friend.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nListen Live\n\n\n\n\n\nBuy a Car\n\nFind a CarPro Friend\nFind a Car\nDeliver Mine, Please!\n\nSell Your Car\nCar Buying Resources\n\nAdvice\nCar Reviews\nFAQs\nNews\n\nRadio Show\n\nRadio Schedule\nRadio Blog\nDealer True Stories\nCarPro Radio Show Podcast\nDriving N Heels Podcast\nCarPro Radio Show Newsletter\n\n\n\n\n\n\nFind a CarPro Friend\n\n\n\n\n\n\n\n\n\nListen Live\n\n\n\n\n\n\n\nMenu\n\n\n\n\n\n\n\n\n\n\n\n\nNews,\u00a0\n          \n\nAuto Sales\n\n\nFull-Year 2022 National Auto Sales By Brand\nWritten by Jerry Reynolds\nJanuary 12, 2023\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nShare: \n\n\n\nFacebook\n\n\n\n\n\n\n\nTwitter\n\n\n\n\n\n\n\nLinkedIn\n\n\n\n\n\n\n\nEmail\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nIt was a challenging year for the auto industry, primarily due to the microchip shortages, and certainly rising interest rates did not help.\u00a0 There was also a noticeable lack of incentives from the automakers, and lease deals were not nearly as good as in years past.\n\nDecember showed some sparks of life for many car companies, as sales rose 7.2%, but not enough to put the sales numbers over the 14 million mark, making 2022 sales the lowest since 2011.\nGeneral Motors regained the #1 sales leadership spot, wrestling it away from Toyota, which was #1 in 2021.\u00a0 Ford F-Series trucks retained the title of #1 truck in America for 46 straight years and the #1 
selling vehicle overall for 41 straight years.\nHere are your winners and losers by brand for the full year of 2022 and how the numbers fared versus the year of 2021:\n\n \u00a0\n\n\n\n  \u00a0\n \n\n\n\u00a0\n Brand \n \u00a02022 U.S. Sales \n VS 2021 \n\n\n1.\nToyota\n1,849,751 \u00a0\nDown 9%\n\n\n2.\nFord\n1,767,439\u00a0\u00a0\nDown 2%\n\n\n3.\nChevrolet\n1,502,389\u00a0\u00a0\nUp 6%\n\n\n4.\nHonda\n881,201\u00a0\nDown 33%\n\n\n5.\nHyundai\n724,265\u00a0\nDown 2%\n\n\n6.\nKia\n693,549\u00a0\nDown 1%\n\n\n7.\nJeep\n684,612\u00a0\nDown 12%\n\n\n8.\nNissan\n682,731\u00a0\u00a0\nDown 25%\n\n\n9.\nSubaru\u00a0\n556,581\u00a0\u00a0\nDown 5%\n\n\n10.\u00a0\nRam Trucks\n545,194\u00a0\nDown 16%\n\n\n11.\nGMC\n517,649\u00a0\u00a0\nUp 7%\n\n\n12.\nMercedes-Benz\n350,949\u00a0\nUp 7%\n\n\n13.\nBMW\n332,388\u00a0\nDown 1%\n\n\n14.\nVolkswagen\n301,069\u00a0\nDown 20%\n\n\n15.\nMazda\n294,908\u00a0\nDown 11%\n\n\n16.\nLexus\n258,704\u00a0\nDown 15%\n\n\n17.\nDodge\n190,793\u00a0\u00a0\nDown 12%\n\n\n18.\nAudi\n186,875\u00a0\nDown 5%\n\n\n19.\nCadillac\n134,726\u00a0\nUp 14%\n\n\n20.\nChrysler\u00a0\n112,713\nDown 2%\n\n\n21.\nBuick\n103,519\u00a0\nDown 42%\n\n\n22.\nAcura\n102,306\u00a0\u00a0\nDown 35%\n\n\n23.\nVolvo\n102,038\u00a0\nDown 16%\n\n\n24.\nMitsubishi\n102,037 \nDown 16%\n\n\n25.\nLincoln\n83,486\u00a0\nDown 4%\n\n\n26.\nPorsche\n 70,065 \n Flat \n\n\n27.\nGenesis\n 56,410\u00a0 \n Up 14% \n\n\n28.\nINFINITI\n46,619\u00a0\nDown 20%\n\n\n29.\nMINI\n29,504\u00a0\nDown 1%\n\n\n30.\nAlfa Romeo\n12,845\u00a0\u00a0\nDown 30%\n\n\n31.\nMaserati\n\u00a06,413\u00a0\nDown 10%\n\n\n32.\nBentley\n3,975\u00a0 \u00a0 \u00a0\u00a0\nFLAT\n\n\n33.\nLamborghini\n3,134\u00a0\nUp 3%\n\n\n34.\nFiat\u00a0\n915\u00a0\u00a0\nDown 61%\n\n\n35.\nMcLaren\n 840\u00a0 \n Down 35% \n\n\n36.\nRolls-Royce\n460\u00a0\nUp 7%\n\n\n\n\n\u00a0\nNote: Jaguar and Land Rover have not reported 2022 sales. 
Also, the all-electric makes like Tesla, Lucid, Karma, and Polestar do not report sales, only how many vehicles were produced, but that is not a reflection of actual sales since we don\u2019t know how many were built, but not sold. \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0\n \n\n\n      You may also like:\n    \n\n\n\n\n\n\n\n Photo Credit: Toyota.\n\n\n\n\n\n\n\nShare: \n\n\n\nFacebook\n\n\n\n\n\n\n\nTwitter\n\n\n\n\n\n\n\nLinkedIn\n\n\n\n\n\n\n\nEmail\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJerry Reynolds\n\n\nFrom police officer to top-selling Ford dealer to on-air car expert, Jerry helms our weekly radio show. Since selling his dealerships in 2006, Jerry has pursued his top three passions: helping consumers have a great car buying experience, supporting the military and first responders, being a loyal servant to his dog! While Jerry admits to hitting the pedal to the metal on test drives, it's all the service of finding out just how fast he can go from 0 to 60.\n\n\n\n\n\nShare On Facebook\n\n\n\n\n\n\n\nShare On Twitter\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nRecent Posts\n\n\n\n\n\n\n\n\n\n\n2024 Subaru Forester Starts At $26,895\n\n\n\n            \n              Read Post\n            \n          \n\n\n\n\n\n\n\n\n\n\n\n\n\nCheck Out The Lucid Air Sapphire:\u00a0 0-to-60 In 1.89 Seconds\n\n\n\n            \n              Read Post\n            \n          \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCarPro Newsletter\nGet our weekly newsletter packed with exclusive advice and news.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCarPro\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFollow us on Facebook\n\n\n\n\n\nFollow us on Twitter\n\n\n\n\n\n\nFollow us on Youtube\n\n\n\n\n\n\nFollow us on Instagram\n\n\n\n\n\n\n\n\nFind a CarPro Friend\n\nFind a Car\n\nRadio Show\n\nAbout Us\n\nWhy CarPro\n\n\nFAQs\n\nNews\n\nCar Reviews\n\nLaura's Auto News\n\nRadio Show Newsletter\n\n\nWebsite Help\n\nEmail the CarPro\n\nMedia Company Inquiry\n\nBecome a Radio Partner\n\nBecome 
a Certified CarPro\n\n\n\n\n\u00a92023 CarPro\n\n\nPrivacy Policy\nTerms Of Use\nAccessibility Statement\n\n\n\n\nCarPro requires our Certified Dealers to meet certain customer service requirements in accordance with our Certification Process and Agreement. CarPro does not broker, sell, or lease vehicles. All vehicles shown on this website are offered for sale by licensed motor vehicle dealers, unless where otherwise noted. Used vehicles are subject to prior sale. By accessing this website, you agree to the CarPro Terms of Service and Privacy Policy. We strive to update our website in a timely manner however CarPro cannot guarantee that the inventory shown will be available at the dealership.\nWe are not responsible for typographical and other errors, including data transmissions or software errors that may appear on the site. If the posted price, incentive, offer or other service is incorrect due to typographical or other error we will only be responsible for honoring the correct price, incentive or offer. We make every effort to provide you the most accurate, up-to-the-minute information however when you are ready to purchase products or services, it is your responsibility to verify with us that all details listed are accurate. Prices include all applicable rebates.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
}
{
    "key": "Given the following question:\n```\nFull-Year 2022 National Auto Sales By Brand\n```\nI got the following answer from a web page:\n```\nFull-Year 2022 National Auto Sales By Brand <img src=\"https://data.adxcel-ec2.com/pixel/?ad_log=referer&amp;action=lead&amp;pixid=b719b86c-f9d4-4c28-a8a8-135a5ff2099d\" width=\"1\" height=\"1\" border=\"0\"> Search Create Account Login CarPro You have a friend. Listen Live Buy a Car Find a CarPro Friend Find a Car Deliver Mine, Please! Sell Your Car Car Buying Resources Advice Car Reviews FAQs News Radio Show Radio Schedule Radio Blog Dealer True Stories CarPro Radio Show Podcast Driving N Heels Podcast CarPro Radio Show Newsletter Find a CarPro Friend Listen Live Menu News, Auto Sales Full-Year 2022 National Auto Sales By Brand Written by Jerry Reynolds January 12, 2023 Share: Facebook Twitter LinkedIn Email It was a challenging year for the auto industry, primarily due to the microchip shortages, and certainly rising interest rates did not help. There was also a noticeable lack of incentives from the automakers, and lease deals were not nearly as good as in years past. December showed some sparks of life for many car companies, as sales rose 7.2%, but not enough to put the sales numbers over the 14 million mark, making 2022 sales the lowest since 2011. General Motors regained the #1 sales leadership spot, wrestling it away from Toyota, which was #1 in 2021. Ford F-Series trucks retained the title of #1 truck in America for 46 straight years and the #1 selling vehicle overall for 41 straight years. Here are your winners and losers by brand for the full year of 2022 and how the numbers fared versus the year of 2021: Brand 2022 U.S. Sales VS 2021 1. Toyota 1,849,751 Down 9% 2. Ford 1,767,439 Down 2% 3. Chevrolet 1,502,389 Up 6% 4. Honda 881,201 Down 33% 5. Hyundai 724,265 Down 2% 6. Kia 693,549 Down 1% 7. Jeep 684,612 Down 12% 8. Nissan 682,731 Down 25% 9. Subaru 556,581 Down 5% 10. Ram Trucks 545,194 Down 16% 11. 
GMC 517,649 Up 7% 12. Mercedes-Benz 350,949 Up 7% 13. BMW 332,388 Down 1% 14. Volkswagen 301,069 Down 20% 15. Mazda 294,908 Down 11% 16. Lexus 258,704 Down 15% 17. Dodge 190,793 Down 12% 18. Audi 186,875 Down 5% 19. Cadillac 134,726 Up 14% 20. Chrysler 112,713 Down 2% 21. Buick 103,519 Down 42% 22. Acura 102,306 Down 35% 23. Volvo 102,038 Down 16% 24. Mitsubishi 102,037 Down 16% 25. Lincoln 83,486 Down 4% 26. Porsche 70,065 Flat 27. Genesis 56,410 Up 14% 28. INFINITI 46,619 Down 20% 29. MINI 29,504 Down 1% 30. Alfa Romeo 12,845 Down 30% 31. Maserati 6,413 Down 10% 32. Bentley 3,975 FLAT 33. Lamborghini 3,134 Up 3% 34. Fiat 915 Down 61% 35. McLaren 840 Down 35% 36. Rolls-Royce 460 Up 7% Note: Jaguar and Land Rover have not reported 2022 sales. Also, the all-electric makes like Tesla, Lucid, Karma, and Polestar do not report sales, only how many vehicles were produced, but that is not a reflection of actual sales since we don\u2019t know how many were built, but not sold. You may also like: Photo Credit: Toyota. Share: Facebook Twitter LinkedIn Email Jerry Reynolds From police officer to top-selling Ford dealer to on-air car expert, Jerry helms our weekly radio show. Since selling his dealerships in 2006, Jerry has pursued his top three passions: helping consumers have a great car buying experience, supporting the military and first responders, being a loyal servant to his dog! While Jerry admits to hitting the pedal to the metal on test drives, it's all the service of finding out just how fast he can go from 0 to 60. Share On Facebook Share On Twitter Recent Posts 2024 Subaru Forester Starts At $26,895 Read Post Check Out The Lucid Air Sapphire: 0-to-60 In 1.89 Seconds Read Post CarPro Newsletter Get our weekly newsletter packed with exclusive advice and news. 
CarPro Follow us on Facebook Follow us on Twitter Follow us on Youtube Follow us on Instagram Find a CarPro Friend Find a Car Radio Show About Us Why CarPro FAQs News Car Reviews Laura's Auto News Radio Show Newsletter Website Help Email the CarPro Media Company Inquiry Become a Radio Partner Become a Certified CarPro \u00a92023 CarPro Privacy Policy Terms Of Use Accessibility Statement CarPro requires our Certified Dealers to meet certain customer service requirements in accordance with our Certification Process and Agreement. CarPro does not broker, sell, or lease vehicles. All vehicles shown on this website are offered for sale by licensed motor vehicle dealers, unless where otherwise noted. Used vehicles are subject to prior sale. By accessing this website, you agree to the CarPro Terms of Service and Privacy Policy. We strive to update our website in a timely manner however CarPro cannot guarantee that the inventory shown will be available at the dealership. We are not responsible for typographical and other errors, including data transmissions or software errors that may appear on the site. If the posted price, incentive, offer or other service is incorrect due to typographical or other error we will only be responsible for honoring the correct price, incentive or offer. We make every effort to provide you the most accurate, up-to-the-minute information however when you are ready to purchase products or services, it is your responsibility to verify with us that all details listed are accurate. 
Prices include all applicable rebates.\n```\nNow help me write a SQL query to store the answer into a temp view.\nGive each column a clearly descriptive name (no abbreviations).\nIf a column can be either String or Numeric, ingest it as Numeric.\nHere is an example of how to store data into the temp view spark_ai_temp_view:\n```\nCREATE OR REPLACE TEMP VIEW spark_ai_temp_view AS SELECT * FROM VALUES('Citizen Kane', 1941), ('Schindler's List', 1993) AS v1(title, year)\n```\n\nThe answer MUST contain query only and the temp view MUST be spark_ai_temp_view.\n",
    "value": "Here is the SQL query to store the data into the temp view:\n\n```sql\nCREATE OR REPLACE TEMP VIEW spark_ai_temp_view_732c4c AS \nSELECT * FROM VALUES\n('Toyota', 1849751, -9),\n('Ford', 1767439, -2),\n('Chevrolet', 1502389, 6),\n('Honda', 881201, -33),\n('Hyundai', 724265, -2),\n('Kia', 693549, -1),\n('Jeep', 684612, -12),\n('Nissan', 682731, -25),\n('Subaru', 556581, -5),\n('Ram Trucks', 545194, -16),\n('GMC', 517649, 7),\n('Mercedes-Benz', 350949, 7),\n('BMW', 332388, -1),\n('Volkswagen', 301069, -20),\n('Mazda', 294908, -11),\n('Lexus', 258704, -15),\n('Dodge', 190793, -12),\n('Audi', 186875, -5),\n('Cadillac', 134726, 14),\n('Chrysler', 112713, -2),\n('Buick', 103519, -42),\n('Acura', 102306, -35),\n('Volvo', 102038, -16),\n('Mitsubishi', 102037, -16),\n('Lincoln', 83486, -4),\n('Porsche', 70065, 0),\n('Genesis', 56410, 14),\n('INFINITI', 46619, -20),\n('MINI', 29504, -1),\n('Alfa Romeo', 12845, -30),\n('Maserati', 6413, -10),\n('Bentley', 3975, 0),\n('Lamborghini', 3134, 3),\n('Fiat', 915, -61),\n('McLaren', 840, -35),\n('Rolls-Royce', 460, 7)\nAS v1(Brand, US_Sales_2022, Sales_Change_Percentage)\n```\n\nThis query creates a temporary view named `spark_ai_temp_view_732c4c` with three columns: `Brand`, `US_Sales_2022`, and `Sales_Change_Percentage`. The values are inserted into the view using the `VALUES` clause."
}
{
    "key": "\nGiven 1) a PySpark dataframe, df, and 2) a description of expected properties, desc,\ngenerate a Python function to test whether the given dataframe satisfies the expected properties.\nYour generated function should take 1 parameter, df, and the return type should be a boolean.\nYou will call the function, passing in df as the parameter, and return the output (True/False).\n\nIn total, your output must follow the format below, exactly (no explanation words):\n1. function definition f, in Python (Do NOT surround the function definition with quotes)\n2. 1 blank new line\n3. Call f on df and assign the result to a variable, result: result = name_of_f(df)\nThe answer MUST contain python code only. For example, do NOT include \"Here is your output:\"\n\nInclude any necessary import statements INSIDE the function definition, like this:\ndef gen_random():\n    import random\n    return random.randint(0, 10)\n\nYour output must follow the format of the example below, exactly:\nInput:\ndf = DataFrame[name: string, age: int]\ndesc = \"expect 5 columns\"\n\nOutput:\ndef has_5_columns(df) -> bool:\n    # Get the number of columns in the DataFrame\n    num_columns = len(df.columns)\n\n    # Check if the number of columns is equal to 5\n    if num_columns == 5:\n        return True\n    else:\n        return False\n\nresult = has_5_columns(df)\n\nNo explanation words (e.g. do not say anything like \"Here is your output:\")\n\nHere is your input df: DataFrame[Brand: string, US_Sales_2022: int, Sales_Change_Percentage: int]\nHere is your input description: expect sales change percentage to be between -100 to 100\n",
    "value": "def check_sales_change_percentage(df) -> bool:\n    from pyspark.sql.functions import col\n\n    # Check if Sales_Change_Percentage is between -100 and 100\n    df_filtered = df.filter((col(\"Sales_Change_Percentage\") >= -100) & (col(\"Sales_Change_Percentage\") <= 100))\n\n    # If the number of rows in the filtered DataFrame is equal to the number of rows in the original DataFrame,\n    # then all Sales_Change_Percentage values are between -100 and 100\n    if df_filtered.count() == df.count():\n        return True\n    else:\n        return False\n\nresult = check_sales_change_percentage(df)"
}
{
    "key": "You are an Apache Spark SQL expert, who can summary what a dataframe retrieves. Given an analyzed\nquery plan of a dataframe, you will\n1. convert the dataframe to SQL query. Note that an explain output contains plan\nnodes separated by `\\n`. Each plan node has its own expressions and expression ids.\n2. summary what the sql query retrieves.\n\n\n\nanalyzed_plan: \nGlobalLimit 100\n    +- LocalLimit 100\n       +- Sort [d_year ASC NULLS FIRST, sum_agg DESC NULLS LAST, brand_id ASC NULLS FIRST], true\n          +- Aggregate [d_year, i_brand, i_brand_id], [d_year, i_brand_id AS brand_id, i_brand AS brand, sum(ss_ext_sales_price) AS sum_agg]\n             +- Filter (((d_date_sk = ss_sold_date_sk) AND (ss_item_sk = i_item_sk)) AND ((i_manufact_id = 128) AND (d_moy = 11)))\n                +- Join Inner\n                   :- Join Inner\n                   :  :- SubqueryAlias dt\n                   :  :  +- SubqueryAlias spark_catalog.tpcds_sf1_delta.date_dim\n                   :  :     +- Relation spark_catalog.tpcds_sf1_delta.date_dim[d_date_sk,d_date_id,d_date,d_month_seq,d_week_seq,d_quarter_seq,d_year,d_dow,d_moy,d_dom,d_qoy,d_fy_year,d_fy_quarter_seq,d_fy_week_seq,d_day_name,d_quarter_name,d_holiday,d_weekend,d_following_holiday,d_first_dom,d_last_dom,d_same_day_ly,d_same_day_lq,d_current_day,... 
4 more fields] parquet\n                   :  +- SubqueryAlias spark_catalog.tpcds_sf1_delta.store_sales\n                   :     +- Relation spark_catalog.tpcds_sf1_delta.store_sales[ss_sold_date_sk,ss_sold_time_sk,ss_item_sk,ss_customer_sk,ss_cdemo_sk,ss_hdemo_sk,ss_addr_sk,ss_store_sk,ss_promo_sk,ss_ticket_numberL,ss_quantity,ss_wholesale_cost,ss_list_price,ss_sales_price,ss_ext_discount_amt,ss_ext_sales_price,ss_ext_wholesale_cost,ss_ext_list_price,ss_ext_tax,ss_coupon_amt,ss_net_paid,ss_net_paid_inc_tax,ss_net_profit] parquet\n                   +- SubqueryAlias spark_catalog.tpcds_sf1_delta.item\n                      +- Relation spark_catalog.tpcds_sf1_delta.item[i_item_sk,i_item_id,i_rec_start_date,i_rec_end_date,i_item_desc,i_current_price,i_wholesale_cost,i_brand_id,i_brand,i_class_id,i_class,i_category_id,i_category,i_manufact_id,i_manufact,i_size,i_formulation,i_color,i_units,i_container,i_manager_id,i_product_name] parquet\n\nexplain: \nThe analyzed plan can be translated into the following SQL query:\n```sql\nSELECT\n  dt.d_year,\n  item.i_brand_id brand_id,\n  item.i_brand brand,\n  SUM(ss_ext_sales_price) sum_agg\nFROM date_dim dt, store_sales, item\nWHERE dt.d_date_sk = store_sales.ss_sold_date_sk\n  AND store_sales.ss_item_sk = item.i_item_sk\n  AND item.i_manufact_id = 128\n  AND dt.d_moy = 11\nGROUP BY dt.d_year, item.i_brand, item.i_brand_id\nORDER BY dt.d_year, sum_agg DESC, brand_id\nLIMIT 100\n```\nIn summary, this dataframe is retrieving the top 100 brands (specifically of items manufactured by manufacturer with id 128) with the highest total sales price for each year in the month of November. 
It presents the results sorted by year, total sales (in descending order), and brand id.\n\n\n\nanalyzed_plan: SubqueryAlias spark_ai_temp_view\n+- View (`spark_ai_temp_view`, [Brand,US_Sales_2022,Sales_Change_Percentage])\n   +- Project [cast(Brand as string) AS Brand, cast(US_Sales_2022 as int) AS US_Sales_2022, cast(Sales_Change_Percentage as int) AS Sales_Change_Percentage]\n      +- Project [Brand, US_Sales_2022, Sales_Change_Percentage]\n         +- SubqueryAlias v1\n            +- LocalRelation [Brand, US_Sales_2022, Sales_Change_Percentage]\n\nexplain:",
    "value": "The analyzed plan can be translated into the following SQL query:\n```sql\nSELECT\n  CAST(Brand AS STRING) AS Brand,\n  CAST(US_Sales_2022 AS INT) AS US_Sales_2022,\n  CAST(Sales_Change_Percentage AS INT) AS Sales_Change_Percentage\nFROM v1\n```\nIn summary, this dataframe is retrieving the brand, US sales in 2022, and the sales change percentage from a local relation named `v1`. The data types of the columns are being explicitly cast to string for `Brand` and integer for `US_Sales_2022` and `Sales_Change_Percentage`."
}
{
    "key": "\nYou are an Apache Spark SQL expert programmer.\nIt is forbidden to include old deprecated APIs in your code.\nFor example, you will not use the pandas method \"append\" because it is deprecated.\n\nGiven a pyspark DataFrame `df`, with the output columns:\nBrand: string\nUS_Sales_2022: int\nSales_Change_Percentage: int\n\nAnd an explanation of `df`: The analyzed plan can be translated into the following SQL query:\n```sql\nSELECT\n  CAST(Brand AS STRING) AS Brand,\n  CAST(US_Sales_2022 AS INT) AS US_Sales_2022,\n  CAST(Sales_Change_Percentage AS INT) AS Sales_Change_Percentage\nFROM v1\n```\nIn summary, this dataframe is retrieving the brand, US sales in 2022, and the sales change percentage from a local relation named `v1`. The data types of the columns are being explicitly cast to string for `Brand` and integer for `US_Sales_2022` and `Sales_Change_Percentage`.\n\nWrite Python code to visualize the result of `df` using plotly. Make sure to use the exact column names of `df`.\nYour code may NOT contain \"append\" anywhere. Instead of append, use pd.concat.\nThere is no need to install any package with pip. Do include any necessary import statements.\nDisplay the plot directly, instead of saving into an HTML.\nDo not use scatter plot to display any kind of percentage data.\nYou must import and start your Spark session if you use a Spark DataFrame.\nRemember to ensure that your code does NOT include \"append\" anywhere, under any circumstance (use pd.concat instead).\n\nEnsure that your code is correct.\n\n",
    "value": "Here is the Python code to visualize the result of `df` using plotly:\n\n```python\nimport plotly.graph_objects as go\nfrom pyspark.sql import SparkSession\nimport pandas as pd\n\n# Start Spark session\nspark = SparkSession.builder.appName('SparkSQL').getOrCreate()\n\n# Assuming df is the Spark DataFrame\n# Convert Spark DataFrame to Pandas DataFrame\npandas_df = df.toPandas()\n\n# Create a bar chart for US_Sales_2022\ntrace1 = go.Bar(\n    x=pandas_df['Brand'],\n    y=pandas_df['US_Sales_2022'],\n    name='US Sales 2022'\n)\n\n# Create a line chart for Sales_Change_Percentage\ntrace2 = go.Scatter(\n    x=pandas_df['Brand'],\n    y=pandas_df['Sales_Change_Percentage'],\n    name='Sales Change Percentage',\n    yaxis='y2'\n)\n\ndata = [trace1, trace2]\n\nlayout = go.Layout(\n    title='Brand Sales Analysis',\n    yaxis=dict(\n        title='US Sales 2022'\n    ),\n    yaxis2=dict(\n        title='Sales Change Percentage',\n        titlefont=dict(\n            color='rgb(148, 103, 189)'\n        ),\n        tickfont=dict(\n            color='rgb(148, 103, 189)'\n        ),\n        overlaying='y',\n        side='right'\n    )\n)\n\nfig = go.Figure(data=data, layout=layout)\nfig.show()\n```\n\nThis code first converts the Spark DataFrame to a Pandas DataFrame. Then it creates a bar chart for `US_Sales_2022` and a line chart for `Sales_Change_Percentage`. The two charts are combined into a single figure with two y-axes. The figure is then displayed."
}
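The generated `check_sales_change_percentage` function in the second cache entry follows a simple count-comparison pattern: filter the DataFrame down to the rows that satisfy the property, then confirm that no rows were dropped. The same idea, sketched in plain Python with a list of dicts standing in for the Spark DataFrame (no Spark session or API key needed; the sample values are taken from the ingested table):

```python
# Plain-Python sketch of the count-comparison pattern used by the
# generated check_sales_change_percentage function (PySpark version above).
# A list of dicts stands in for the Spark DataFrame.
rows = [
    {"Brand": "Toyota", "Sales_Change_Percentage": -9},
    {"Brand": "Chevrolet", "Sales_Change_Percentage": 6},
    {"Brand": "Fiat", "Sales_Change_Percentage": -61},
]

def check_sales_change_percentage(rows) -> bool:
    # Keep only the rows whose percentage lies in [-100, 100] ...
    valid = [r for r in rows if -100 <= r["Sales_Change_Percentage"] <= 100]
    # ... and confirm that nothing was filtered out.
    return len(valid) == len(rows)

result = check_sales_change_percentage(rows)  # True for this data
```

Note that the PySpark version pays two `count()` actions per check; for large tables a single aggregation over a boolean condition would be cheaper, but the LLM's version is functionally correct.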

Fascinating stuff, though. Let's try the other samples as well.
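One side note on the ingestion query in the first cache entry: the `SELECT * FROM VALUES ... AS v1(...)` table-constructor pattern is plain SQL, not Spark-specific. A minimal sketch of the same pattern in SQLite (standard library, so no cluster needed; the view name `brand_sales` and the two sample rows are my own — SQLite names `VALUES` columns `column1`, `column2`, ..., so the view aliases them explicitly instead of using Spark's `AS v1(...)` syntax):

```python
import sqlite3

# Reproduce the "CREATE TEMP VIEW ... AS SELECT * FROM VALUES" ingestion
# pattern from the cache entry, using SQLite instead of Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TEMP VIEW brand_sales AS
    SELECT column1 AS Brand,
           column2 AS US_Sales_2022,
           column3 AS Sales_Change_Percentage
    FROM (VALUES ('Toyota', 1849751, -9), ('Ford', 1767439, -2))
""")

# Query the view just like the SDK queries its temp view.
top = conn.execute(
    "SELECT Brand FROM brand_sales ORDER BY US_Sales_2022 DESC LIMIT 1"
).fetchone()[0]  # 'Toyota'
```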

Databricks Quickstart Guide

Databricks Free Trial
