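1. Introduction
I came to ❄️Polars🐻‍❄️❄️ late, but once I tried it my reaction was: "What, this is great! Can anyone who doesn't know this even call themselves a data scientist!? That would be laughable!!" I got carried away enough to put together a short set of tips. They are aimed at people who can already use Pandas🐼 but have not tried Polars, so the code for both is written side by side.
(Quoted from the Kaggle notebook "Let's Play with Polars 🐻‍❄️ & Pandas 🐼")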
2. What's so good about it??
For those wondering what is so good about ❄️Polars🐻‍❄️❄️, here is a loose list of its strengths.
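- It is fast, above all.
  - Depending on the data, it can be many times faster, and the speedup can reach tens of times.
  - The larger the data, the more pronounced the gap with Pandas🐼 becomes.
- It can handle large data that does not fit in memory.
  - This is not impossible with Pandas🐼 either, but Polars does the same thing with simpler, shorter code.
    (Pandas🐼 uses more memory than it needs, which makes large data hard to handle.)
- It is intuitive and easy to understand.
  - Code can be written in a more intuitive, readable style than with Pandas🐼.
  - There is no confusing mechanism like the index.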
(Quoted from the Polars website)
3. 10 Tips for ❄️Polars🐻‍❄️❄️
Now for the main topic. Here we process the dataset from "Home Credit Default Risk", a competition previously held on Kaggle, using ❄️Polars🐻‍❄️❄️, and introduce the following 10 tips along the way. For comparison, the equivalent Pandas🐼 code is shown as well.
- #001 Read a file.
- #002 Select columns by name.
- #003 Select columns by part of their name.
- #004 Select columns by data type.
- #005 Select rows by row number.
- #006 Select rows by condition.
- #007 Compute basic statistics for numeric columns.
- #008 Count rows per unique value in non-numeric columns.
- #009 Count missing values.
- #010 Fill in missing values.
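#000 First, the setup.
First, install and import the required packages.
# ❄️Polars🐻‍❄️❄️
%pip install -Uq polars
import polars as pl
import polars.selectors as cs
# Pandas🐼
import pandas as pd
# Plotting libraries (used for the charts in #009)
import matplotlib.pyplot as plt
import seaborn as sns
That completes the setup.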
#001 Read a file.
Read the file.
# ❄️Polars🐻‍❄️❄️
path_to_your_file = "/kaggle/input/home-credit-default-risk/application_train.csv"
df_pl = pl.read_csv(path_to_your_file)
df_pl
# Pandas🐼
path_to_your_file = "/kaggle/input/home-credit-default-risk/application_train.csv"
df_pd = pd.read_csv(path_to_your_file)
df_pd
#002 Select columns by name.
Select columns by specifying their names. As an example, we select the columns named SK_ID_CURR and TARGET.
# ❄️Polars🐻‍❄️❄️
df_pl.select(pl.col(["SK_ID_CURR", "TARGET"]))
# Pandas🐼
df_pd[["SK_ID_CURR", "TARGET"]]
#003 Select columns by part of their name.
Select columns by specifying part of their name (a fuzzy match using a regular expression; if you are curious about regular expressions, try googling them once). As an example, we select the columns whose names contain the keyword CREDIT_BUREAU.
# ❄️Polars🐻‍❄️❄️
df_pl.select(pl.col(r"^.*CREDIT_BUREAU.*$"))
# Pandas🐼
df_pd.filter(regex=r"^.*CREDIT_BUREAU.*$")
#004 Select columns by data type.
Select columns by specifying their data type. As an example, we select the columns that hold string values.
# ❄️Polars🐻‍❄️❄️
df_pl.select(cs.string())
# Pandas🐼
df_pd.select_dtypes(include="object")
#005 Select rows by row number.
Select rows by specifying their row numbers. As an example, we take rows 2 through 7 (counting from 0, both ends included).
# ❄️Polars🐻‍❄️❄️
df_pl.with_row_index().filter(pl.col("index").is_between(2, 7))
# Pandas🐼
df_pd.iloc[2:7+1]
#006 Select rows by condition.
Select rows by specifying a condition. As an example, we take the rows where AMT_INCOME_TOTAL is 1_000_000 or more.
# ❄️Polars🐻‍❄️❄️
df_pl.filter(pl.col("AMT_INCOME_TOTAL") >= 1_000_000)
# Pandas🐼
df_pd[df_pd["AMT_INCOME_TOTAL"] >= 1_000_000]
#007 Compute basic statistics for numeric columns.
Compute basic statistics (mean, standard deviation, quantiles, and so on) for the numeric columns.
# ❄️Polars🐻‍❄️❄️
df_pl.select(cs.numeric()).describe()
# Pandas🐼
df_pd.select_dtypes(include="number").describe()
#008 Count rows per unique value in non-numeric columns.
Count the number of rows for each unique value in the non-numeric columns.
# ❄️Polars🐻‍❄️❄️
df_sliced = df_pl.select(~cs.numeric())
for col in df_sliced.columns:
    # Raise the printed row limit so long lists of values are not truncated
    with pl.Config(tbl_rows=1000):
        print(df_sliced.select(pl.col(col).value_counts()))
    print()
# Pandas🐼
df_sliced = df_pd.select_dtypes(exclude="number")
for col in df_sliced.columns:
    print(df_sliced[col].value_counts(dropna=False))
    print()
#009 Count missing values.
Count the missing values in each column. To make the result easier to read, only the columns that actually have missing values are kept, and they are shown as a bar chart.
# ❄️Polars🐻‍❄️❄️
plt.figure(figsize=(8, 16))
# melt() reshapes the one-row null counts into (variable, num_missing_values) pairs;
# Polars 1.0+ names this method unpivot()
sns.barplot(data=df_pl.null_count().melt(value_name="num_missing_values").filter(pl.col("num_missing_values") > 0),
            y="variable",
            x="num_missing_values",
            hue="variable")
plt.show()
# Pandas🐼
df_missing_values = df_pd.isna().sum()
df_missing_values = df_missing_values[df_missing_values > 0]
plt.figure(figsize=(8, 16))
sns.barplot(y=df_missing_values.index,
            x=df_missing_values.values,
            hue=df_missing_values.index)
plt.show()
#010 Fill in missing values.
Fill in the missing values. As an example, we fill the values of the column AMT_REQ_CREDIT_BUREAU_YEAR with ① the constant 999, or ② the median.
# ❄️Polars🐻‍❄️❄️
df_pl = df_pl.with_columns(
    pl.col("AMT_REQ_CREDIT_BUREAU_YEAR").fill_null(pl.lit(999)).name.suffix("_filled_with_constant"),
    pl.col("AMT_REQ_CREDIT_BUREAU_YEAR").fill_null(pl.median("AMT_REQ_CREDIT_BUREAU_YEAR")).name.suffix("_filled_with_median"),
)
df_pl.select(pl.col(r"^.*AMT_REQ_CREDIT_BUREAU_YEAR.*$"))
# Pandas🐼
median = df_pd["AMT_REQ_CREDIT_BUREAU_YEAR"].median()
df_pd["AMT_REQ_CREDIT_BUREAU_YEAR_filled_with_constant"] = df_pd["AMT_REQ_CREDIT_BUREAU_YEAR"].fillna(value=999)
df_pd["AMT_REQ_CREDIT_BUREAU_YEAR_filled_with_median"] = df_pd["AMT_REQ_CREDIT_BUREAU_YEAR"].fillna(value=median)
df_pd.filter(regex=r"^.*AMT_REQ_CREDIT_BUREAU_YEAR.*$")
Those were the 10 tips. There are actually a few more, or rather, the real strengths of Polars lie elsewhere (Lazy, and Lazy, and so on), but I would like to cover those next time or the time after.
4. References
If you have fallen in love with ❄️Polars🐻‍❄️❄️, the following are also worth a look. Most of them are in English, but the videos have subtitles and there is plenty of translation software around. If someone nearby understands English, asking them to translate can be fun too. Going through these will bring you one step closer to being a proper data scientist, so please do take a look.
❄️ Polars user guide
The user guide for ❄️Polars🐻‍❄️❄️. Everything is condensed here, to the point that you hardly need to look anywhere else. It is in English, but very easy to follow. It is written logically and simply, so even if you do not read English, you can understand it comfortably by running the code as you go.
❄️ Ritchie Vink - Keynote Polars | PyCon Lithuania 2024
A video from the PyCon held in Lithuania. Ritchie Vink, the author of ❄️Polars🐻‍❄️❄️, introduces ❄️Polars🐻‍❄️❄️ in an excellent keynote. Recommended if you are interested in the author himself, or if the user guide still feels too hard.
❄️ Let's Play with Polars 🐻‍❄️ & Pandas 🐼
The Kaggle notebook this article is based on. If you want to run the code and see the results for yourself, copy it and try it out.
❄️ Thomas Bierhance: Polars - make the switch to lightning-fast dataframes
A video from the PyCon held in Germany. Thomas Bierhance, a professional in the field, presents the benefits he found when introducing ❄️Polars🐻‍❄️❄️ in real work, along with examples of how to write (production-quality) code. Recommended for those who have grasped the basics of ❄️Polars🐻‍❄️❄️ and now want to use it at work. I also learned how to write my code from his presentation.
❄️ PyCon JP
PyCon is held in Japan too. I cannot say whether ❄️Polars🐻‍❄️❄️ will come up, but attending when you get the chance can be rewarding. Hearing talented people from outside your company speak in person is stimulating and helps you grow.
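5. Closing
Every living thing eventually dies, and we do not know when that will be. A missile might come flying tomorrow. Time is finite. How you accomplish what you want within it comes down to how you use your time. Let the computer do every job a computer can do, and then do the things only you can do, the things you want to do. What matters most is doing what you want and what you enjoy, and living a fun, interesting life.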