Pandas Basics: What Is a DataFrame Object?

Posted at 2018-03-28

DataFrames

A DataFrame is a collection of Series objects.

Put more simply, it is a collection of one-dimensional Series, which makes it two-dimensional. It may help to picture a SQL table or an Excel sheet.

A DataFrame uses Series objects to represent its columns. When we select a single column from a DataFrame, pandas returns the Series object representing that column. By default, pandas gives each Series in a DataFrame an integer index, so each value has a unique integer index, or position. Like most Python data structures, a Series uses 0-based indexing: the index runs from 0 to n-1, where n is the number of rows. If we know a value's position, we can use that integer to select it from the Series.
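
As a small sketch of the paragraph above, the following uses a made-up two-column DataFrame (the names `df`, `title`, and `score` are illustrative and not part of the Fandango data). Selecting one column yields a Series with the default 0-based integer index:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "score": [3.5, 4.0, 2.5],
})

# Selecting a single column returns a Series
score = df["score"]
print(type(score))

# The default index runs from 0 to n-1
print(list(score.index))

# Select one value by its integer position
first_score = score.iloc[0]
print(first_score)
```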

The columns of a pandas DataFrame share a single row index, which is an integer index by default. pandas enforces this shared row index: every column must have the same number of rows, and reading a CSV file in which a row has more fields than expected raises an error.
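
The shared row index can also be observed without a CSV file. In this minimal sketch (toy data with hypothetical column names `a` and `b`), columns of equal length share one index, while building a DataFrame from columns of unequal length fails:

```python
import pandas as pd

# Columns of equal length share a single row index
ok = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
shares_index = ok["a"].index.equals(ok["b"].index)
print(shares_index)

# Columns of different lengths cannot share a row index
try:
    pd.DataFrame({"a": [1, 2, 3], "b": [10, 20]})
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```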

Retrieving data

In a Series, an index value refers to a single data value, whereas in a DataFrame it refers to an entire row.

extract_rows.py

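import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')

# First five rows (positions 0 through 4)
fandango[0:5]
# Rows from position 140 to the end
fandango[140:]
# Only the row at position 50
fandango.iloc[50]
# Only the rows at positions 45 and 90
fandango.iloc[[45, 90]]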

Plain bracket notation only handles slices of rows, so to extract a single row you need iloc[]. The objects it accepts for selection are:

An integer
A list of integers
A slice object
A Boolean array
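
Each of the four selector types can be tried on a small stand-in DataFrame. The data and variable names below are invented for illustration; only `iloc[]` itself comes from the text.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"x": [10, 20, 30, 40]})

# An integer: one row, returned as a Series
one_row = df.iloc[2]

# A list of integers: the listed rows, returned as a DataFrame
two_rows = df.iloc[[0, 3]]

# A slice object: positions 1 and 2 (the end position is excluded)
sliced = df.iloc[1:3]

# A Boolean array: the rows where the mask is True
mask = [True, False, True, False]
masked = df.iloc[mask]

print(one_row, two_rows, sliced, masked, sep="\n\n")
```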

There are two main ways to select rows:

loc[] (or iloc[])
Slicing with bracket notation

select_rows.py

# Slice using either bracket notation or loc[]
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]

# Specific movie
fandango_films.loc['Kumiko, The Treasure Hunter (2015)']

# Selecting list of movies
movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']
fandango_films.loc[movies]
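
One detail worth keeping in mind when slicing by label: unlike positional slices, label slices include the end label. A small sketch with made-up film names (not taken from the Fandango data):

```python
import pandas as pd

# Hypothetical toy data with a string index
films = pd.DataFrame(
    {"stars": [5.0, 3.5, 4.0, 2.0]},
    index=["Film A", "Film B", "Film C", "Film D"],
)

# Label slices include both endpoints
by_label = films.loc["Film A":"Film C"]
by_bracket = films["Film A":"Film C"]

# Positional slices exclude the end position
by_position = films.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```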

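You might worry that after a lot of slicing the index will become a confusing mess, but this is where data alignment, one of pandas' major features, comes into play. Rows keep their original index labels, so you can carry on with your analysis while relying on the existing index.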
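
A minimal sketch of what alignment means in practice, using an invented Series `s`: slices keep their original labels, and arithmetic between them matches values by label rather than by position.

```python
import pandas as pd

# Hypothetical toy data
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Slices keep their original labels instead of being renumbered
head = s.loc["a":"c"]  # labels a, b, c
tail = s.loc["b":"d"]  # labels b, c, d

# Values are matched by label; labels missing on one side become NaN
total = head + tail
print(total)
```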

set_index lets you use a specified column as the index. To modify the existing DataFrame directly, set inplace=True. To keep the column in the DataFrame even though it is now used as the index, set drop=False.

inplace: If set to True, this parameter will set the index for the current, "live" dataframe, instead of returning a new dataframe.
drop: If set to False, this parameter will keep the column we specified as the index, instead of dropping it.

set_index.py

In [37]: df = pd.DataFrame({
    ...: 'a':[1,2,3,4,5], 
    ...: 'b':[0,0,1,1,1],
    ...: 'c':['a','b','c','c','d'],
    ...: 'd':['e','f','g','h','i']
    ...: })

In [38]: df
Out[38]: 
   a  b  c  d
0  1  0  a  e
1  2  0  b  f
2  3  1  c  g
3  4  1  c  h
4  5  1  d  i

In [39]: df.set_index('a')
Out[39]: 
   b  c  d
a         
1  0  a  e
2  0  b  f
3  1  c  g
4  1  c  h
5  1  d  i
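
The session above uses the default settings. As a sketch of the `drop` and `inplace` parameters, here is a cut-down version of the same `df` (the variable names `kept` and `result` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 1]})

# drop=False keeps 'a' as a regular column as well as the index
kept = df.set_index("a", drop=False)
print(list(kept.columns))

# The original is unchanged because set_index returned a new DataFrame
print(list(df.columns))

# inplace=True modifies the DataFrame itself and returns None
result = df.set_index("a", inplace=True)
print(result)
print(list(df.columns))
```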

Processing data

.apply() is the usual way to do this.

apply.py

types = fandango_films.dtypes
float_columns = types[types.values == 'float64'].index
# float_df contains only the float columns
float_df = fandango_films[float_columns]

# usage of a lambda function
float_df.apply(lambda x: x*2)

The example above uses a lambda to do the processing. Lambdas are handy for one-off use because you don't have to define a named function.

If the function passed to apply() returns a value for each element (such as multiplying or dividing by 2), apply() transforms all of the values and returns them as a new DataFrame.
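
The shape of the result depends on what the function returns. A small sketch on invented data, contrasting a function that returns one value per element with one that returns a single value per column:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# One value per element: the result is a DataFrame of the same shape
doubled = df.apply(lambda col: col * 2)

# One value per column: the result is a Series indexed by column name
maxima = df.apply(lambda col: col.max())

print(doubled)
print(maxima)
```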

get_std.py

import numpy as np

float_df.apply(lambda x: np.std(x))
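
A side note on the numbers this produces: `np.std` computes the population standard deviation (`ddof=0`) by default, while the pandas `std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below, on an invented Series, shows that the two differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
s = pd.Series([1.0, 2.0, 3.0, 4.0])

population = np.std(s)     # divides by n
sample = s.std()           # divides by n - 1
matched = s.std(ddof=0)    # same as np.std

print(population, sample, matched)
```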

apply_func_examples.py

import numpy as np

# returns the data types as a Series
types = fandango_films.dtypes
# filter data types to just floats; the index attribute returns just the column names
float_columns = types[types.values == 'float64'].index
# use bracket notation to filter columns to just float columns
float_df = fandango_films[float_columns]

# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))

print(deviations)
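
pandas also provides `DataFrame.select_dtypes`, which can do the same column filtering in one call. This is an alternative to the approach above, not what the original code uses, and the toy DataFrame here is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with string, integer, and float columns
df = pd.DataFrame({
    "FILM": ["A", "B", "C"],
    "votes": [10, 20, 30],
    "stars": [4.0, 4.5, 5.0],
})

# Keep only the float64 columns
float_df = df.select_dtypes(include="float64")

# `x` is a Series object representing a column
deviations = float_df.apply(lambda x: np.std(x))
print(deviations)
```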

By default, apply() works column by column, so to apply a function to each row instead, pass axis=1.

get_std.py

rt_mt_user = float_df[['RT_user_norm', 'Metacritic_user_nom']]
rt_mt_user.apply(lambda x: np.std(x), axis=1)
# returns a Series object containing the standard deviation of each movie's ratings from RT_user_norm and Metacritic_user_nom
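
To make the difference between the two axes concrete, here is a sketch on an invented two-film table (the column names `critic` and `user` are hypothetical). With the default axis the result has one entry per column; with `axis=1` it has one entry per row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
ratings = pd.DataFrame(
    {"critic": [4.0, 2.0], "user": [2.0, 2.0]},
    index=["Film A", "Film B"],
)

# Default (axis=0): `x` is a column, so the result is indexed by column name
per_column = ratings.apply(lambda x: np.std(x))

# axis=1: `x` is a row, so the result is indexed by row label
per_row = ratings.apply(lambda x: np.std(x), axis=1)

print(per_column)
print(per_row)
```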

Exercises

1

Use the pandas dataframe method set_index to assign the FILM column as the custom index for the dataframe. Also, specify that we don't want to drop the FILM column from the dataframe. We want to keep the original dataframe, so assign the new one to fandango_films.
Display the index for fandango_films using the index attribute and the print function.

Solution

set_index.py

import pandas as pd

fandango = pd.read_csv('fandango_score_comparison.csv')
# drop=False keeps FILM as a column; inplace=False returns a new DataFrame
fandango_films = fandango.set_index('FILM', drop=False, inplace=False)
print(fandango_films.index)

2
