More than 5 years have passed since last update.

Pandas　基礎

Posted at 2020-08-15

Pandas　基礎まとめ

SeriesとDataFrameについて

Series

Seriesとは. １次元の値のリスト

dict型のオブジェクトをSeriesにいれるとkeyをindexとして表現します。

data = {
    "Name":"Jhon",
    "Sex":"male",
    "AGe":22
}
pd.Series(data)
>
Name    Jhon
Sex     male
AGe       22
dtype: object

Numpy arrayからSeriesを作成

array = np.array([22,31,42,23])
age_series = pd.Series(array)
age_series

arrayにindexを指定して、indexで呼び出す

array = np.array(['John','male',22])
john_series = pd.Series(array,index = ['Name','Sex','Age'])
john_seiies["Name"]
>John

john_seiries
>
Name    John
Sex     male
Age       22
dtype: object

元々のNumpy arrayを取ってくる

age_series.values.values 
>array([22, 31, 42, 23])

DataFrame

イメージとして、行列そのものをテーブルとして扱う　（行Series　列Series)を操作し組み合わせたものが、DataFrameといった感じか。

上の図では、列Seriesのみですが、
行のSeriesも扱います

Numpy array から作成

ndarray = np.arange(10).reshape(2,5)
ndarray
>
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

pd.DataFrame(ndarray,index = ["index1",'index2'] ,columns = ['a','b','c','d','e' ])
>
         | a | b | c | d | e |
| index1 | 0 | 1 | 2 | 3 | 4 |
| index2 | 5 | 6 | 7 | 8 | 9 |

基本的流れ　
１　read_csvで読み込み
２　データの基礎情報を分析

df = pd.read_csv("dataset/tmdb_5000_movies.csv")
# len()でデータ数確認
len(df)

一覧を省略せず表示させたい時

# colomuの制限をなくす
pd.set_option('display.max_columns',None)

# row（１つ１つのデータ）の制限をなくす　　（※重たくなるので注意）
pd.set_option('display.max_rows',None)

df.describe()
type(df)  #describe自体もDataFrameとして扱える

DataFrameの操作

Seriesで返ってくる

df[" カラム名  "]　○　推奨
df.カラム名　　　　 ▲ 非推奨

DataFrameで返ってくる

df[["revenue"]]

＃columを複数選択も可能
df[["revenue","original_title","budget"]]

# 特定の行のindexを指定し、取り出す
df.iloc[10:13]

# 特定の行のindexを指定し、指定したカラムを取り出す
df.iloc[10:13]["original_title"]

行・列の削除

drop() # 元のDataFrameは変更されない

inplace = True で元のDataFrameを変更する


＜特定の行をまとめて削除 axis  = 0  （※デフォルトで指定済）＞
df.drop('id', (axis = 0) ,(inplace = True))  

＜指定した列を削除　　axis = 1＞
df.drop('id', axis = 1,(inplace = True))  

df = df.drop(5) #　inplaceよりメジャーな、元のデータを更新する手法！　 同じ変数を使いまわしていく　

dropna() 欠損値を全削除

np.isnan()　nan（欠損値） があるか判定　

fillna()　欠損値を埋める
>fillna(df["runtime"].mean())

Filter

フィルターのかけ方
# 例）日本の映画のみ指定したい
j_movie = df[df['original_language'] == 'ja'] #この書き方が基本よく使う


()&()や()|()で複数の条件を入れる
# 例）日本の映画で評価が８以上の作品だけ指定したい。
j_movie = df[(df['original_language'] == 'ja') & (df["vote_average"] >= 8 ) ]

df[ (df['budget'] == 0 ) | (df['revenue'] == 0 ) ]
→フィルタ　：「予算もしくは売り上げが０」
 

df[ ~ ((df['budget'] == 0 ) | (df['revenue'] == 0 )) ]
フィルタ　 ：「予算もしくは売り上げが０」ではない（NOT演算〜）

merge()使い方

引数howのオプション

df1 = pd.DataFrame({'key':["k0","k1","k2"],
                  'A':["a0","a1","a2"],
                  'B':["b0","b1","b2"]})

df2 = pd.DataFrame({'key':["k0","k1","k2"],
                  'C':["c0","c1","c2"],
                  'D':["d0","d1","d2"]})

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

Pandas 基礎