More than 3 years have passed since last update.

pandasとSQLの対応

Posted at 2021-01-15

pandasでは、SQLと似たデータ操作ができる。
SQLのデータ操作とpandasのデータ操作の対応をメモっておく。
個人的に利用頻度が高い順に書いてる。

limit -> head

任意の行数を抽出するために利用する。

select *
from table
limit 10

↓pandas

table.head(10)

order by -> sort_values

ソート処理

select *
from table
order by col1

↓pandas

table.sort_values(by='col1')

order byとsort_valuesのソートは、デフォルトで昇順になる。
order byなら最後にdescをつけるが、sort_valuesでは、ascending=Falseとする（Trueなら昇順）。

table.sort_values(by='col1', ascending=False)

また、複数の条件でソートも可能

select *
from table
order by col1, col2 desc

↓pandas

table.sort_values(by=['col1', 'col2'], ascending=[True, False])

where -> []

select *
from table
where col1 = 'a'

↓pandas

table[table.col1 == 'a']

in句

select *
from table
where col1 in ('a', 'b')

↓pandas

table[table.col1.isin(['a', 'b'])]

複数条件

select *
from table
where col1 = 'a' and col2 = 1

↓pandas

table[(table.col1 == 'a')&(table.col2 == 1)]

各条件を()で囲って、&で結ぶ。
andではなく、orの場合は、|を利用する

table[(table.col1 == 'a')|(table.col2 == 1)]

これらの条件指定は、下に書いてある、deleteやupdateの時にも利用できる。

関数 -> agg

select
    sum(col1),
    avg(col1),
    max(col1),
    min(col2),
    count(col2),
    count(distinct(col2)),
from table

↓pandas

table.agg({'col1': ['sum', 'mean', 'max'], 'col2': ['min', 'count', 'nunique']})

nuniqueが、count distinctに対応。

distinct関数は次のようになる。

select
    distinct(col1)
from talbe

↓pandas

table.agg({'col1': 'unique'})

sqlでは、集計関数とdistinctは一緒には使えないが、pandasでは利用可能

table.agg({'col1': ['sum', 'mean', 'max', 'min', 'count', 'nunique', 'unique']})

のようにできる。

sqlとpandasの関数対応

sum -> sum
avg -> mean
max -> max
min -> min
count -> count
count distinct -> nunique
distinct -> unique

group by -> groupby

select
    col1,
    sum(col2)
from table
group by col1

↓pandas

table.groupby('col1').agg({'col2': 'sum'})

groupbyに指定されたカラムは、結果のデータフレームにおいてインデックスになってしまうので、

table.groupby('col1', as_index=False).agg({'col2': 'sum'})

のように、as_index=Falseを入れると扱いやすい。

複数のカラムでgroup byする場合

select
    col1,
    col2,
    sum(col3)
from table
group by col1, col2

↓pandas

table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})

join -> merge

select
    *
from table t
left join table2 t2
on table.col1 = table2.col1

↓pandas

pd.merge(table, table2, on='col1', how='left')

howには、inner、outer、rightも揃ってる。

onで複数のカラムを利用する場合

select
    *
from table t
left join table2 t2
on table.col1 = table2.col1
    and table.col2 = table2.col2

↓pandas

pd.merge(table, table2, on=['col1', 'col2'], how='left')

onにリストを指定。

update -> loc[]

update table
set col1 = 'b'
where col2 = 1

↓pandas

table.loc[table.col2 == 1, 'col1'] = 'b'

.locの第一引数にwhere条件、第二引数に更新対象のカラムを指定する。

delete -> []

delete
from table
where col1 = 'a'

↓pandas

table = table[table.col1 != 'a']

条件一致しないものだけ抽出して、削除する。

columnの別名 -> rename

select
    col1 as new_col1
from table

↓pandas

table.rename(columns={'col1': 'new_col1'})

union -> concat

select
    *
from table
union
select
    *
from table2

↓pandas

pd.concat([table, table2], ignore_index=True)

concat関数の第一引数は、リストなので3つ以上のデータフレームのユニオンも可能。
ignore_indexを指定しないとインデックスの値は、UNION前のデータフレームのものが引き継がれる。
つまり、インデックスの値がかぶることがあるので、指定しておいた方が良いかも。

insert -> append

insert into
table(
    col1,
    col2
)
values(
    'a',
    '2'
)

↓pandas

table = table.append({'col': 'a', 'b': 2}, ignore_index=True)

ignore_indexは、unionと同様。

case when -> apply

select
    case when col1 = 'a' then 1
            when col1 = 'b' then 2
            else 3 end as case_col1
from table

↓pandas

def case_when(x):
    if x.col1 == 'a':
        return 1
    elif x.col1 == 'b':
        return 2
    else:
        return 3

table.apply(case_when, axis=1)

axis=1としているのは、行単位にデータを取ってくるため。axis=0としたらcase_whenのxには、列データが入ってくる。
printでxを表示させればわかりやすい。xの型はpandasのSeries型になる。

カラムを追加して返すことも可能。
下の方法は、上の方法で
table['case_col1'] = table.apply(case_when, axis=1)
とした場合と結果は一致する。

select
    *,
    case when col1 = 'a' then 1
            when col1 = 'b' then 2
            else 3 end as case_col1
from table

↓pandas

def case_when(x):
    if x.col1 == 'a':
        ret = 1
    elif x.col1 == 'b':
        ret = 2
    else:
        ret = 3
    
    x['case_col1'] = ret

    return x

table.apply(case_when, axis=1)

group by case when -> groupby.apply

select
    col1,
    sum(case when col2 < 10 then col3 else 0 end) as sum_under10_col3,
    sum(case when col2 >= 10 then col3 else 0 end) as sum_over10_col3
from table
group by col1

↓pandas

def case_when(df):
    sum_under10_col3 = df[df.col2 < 10].col3.sum()
    sum_over10_col3 = df[df.col2 >= 10].col3.sum()
    return pd.Series({'sum_under10_col3': sum_under10_col3, 'sum_over10_col3': sum_over10_col3})

table.groupby('col1').apply(case_when).reset_index()

結構煩雑だけど、一応同じことはできる。
case_when関数には、データフレームが送られるので、それらを集計して、その結果をSeriesで返している。
reset_index()は、col1をインデックスから外してる。
applyを使ってる時に、as_index=Falseにするとcol1が自動的に列に加わってくれない。おそらく、case_when関数の中でreturnするときにSeriesの中にcol1を含めないといけない。
それは面倒なので、reset_indexで対応してる。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up