
The Basics of Pandas #2

Last updated at 2018-03-30, posted at 2018-03-16

Continuing from my previous post, The Basics of Pandas, I'll keep working through Dataquest problems. The data is the same as in the previous post.

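#Working with data

##Data processing (arithmetic)

In pandas, applying an arithmetic operator to a column applies it to every value in that column. Note that this does not overwrite the existing data: the operation returns a new Series with the updated values.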

manipulate_data.py

# Divides each value in the column by 1000 (for example, to convert mg -> g).
div_1000 = food_info["Iron_(mg)"] / 1000

# Adds 100 to each value in the column and returns a Series object.
add_100 = food_info["Iron_(mg)"] + 100

# Subtracts 100 from each value in the column and returns a Series object.
sub_100 = food_info["Iron_(mg)"] - 100

# Multiplies each value in the column by 2 and returns a Series object.
mult_2 = food_info["Iron_(mg)"]*2
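
To convince myself that column arithmetic really returns a new Series and leaves the original alone, here is a minimal sketch. The small DataFrame is made-up sample data standing in for food_info, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for food_info (values are made up for illustration).
food_info = pd.DataFrame({"Iron_(mg)": [0.5, 2.0, 10.0]})

# Arithmetic on a column is applied element-wise and returns a new Series.
iron_g = food_info["Iron_(mg)"] / 1000

print(iron_g.tolist())                  # [0.0005, 0.002, 0.01]
print(food_info["Iron_(mg)"].tolist())  # [0.5, 2.0, 10.0], original unchanged
```

The original column only changes if the result is assigned back to it.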

Operations between two columns are also applied element-wise, row by row, across the whole columns.

compute_water_energy.py

water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]


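##Putting values on a common scale

To make the numeric values in this table comparable with one another, we use the following formula to rescale a column's values into the range 0 to 1.

x' = \frac{x - \min(x)}{\max(x) - \min(x)}

Example (this simplified version only divides by the maximum, which matches the formula when the column's minimum is 0):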

normalize_values.py

# The largest value in the "Energ_Kcal" column.
max_calories = food_info["Energ_Kcal"].max()

# Divide the values in "Energ_Kcal" by the largest value.
normalized_calories = food_info["Energ_Kcal"] / max_calories
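
For a column whose minimum is not 0, the full formula is needed. The following is a minimal sketch of min-max normalization; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical calorie values (made up for illustration).
calories = pd.Series([50.0, 100.0, 250.0])

# Min-max normalization: subtract the minimum, then divide by the range.
value_range = calories.max() - calories.min()
normalized = (calories - calories.min()) / value_range

print(normalized.tolist())  # [0.0, 0.25, 1.0]
```

If every value in the column is identical, the range is 0 and the division produces NaN, so that case would need separate handling.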

##Adding a computed column as a new column

The basic idea is the same as assigning a value to a key in a Python dictionary.

add_rows.py

iron_grams = food_info["Iron_(mg)"] / 1000  
food_info["Iron_(g)"] = iron_grams

##Changing the order in which rows are displayed (sorting)

For example, to sort the rows by the values in the Sodium_(mg) column:

sort_column.py

food_info.sort_values("Sodium_(mg)")

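By default pandas returns a new DataFrame, so pass the inplace=True argument if you want to sort the existing DataFrame instead of creating a new one. Also note that, although the default is ascending=True, there is no descending=True argument for reversing the order. Use ascending=False instead.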

sort_dataframe_by_column.py

# Sorts the DataFrame in-place, rather than returning a new DataFrame.
food_info.sort_values("Sodium_(mg)", inplace=True)

# Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
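
As a small check of the default (non-inplace) behaviour, here is a sketch using a made-up three-row DataFrame in place of food_info.

```python
import pandas as pd

# Hypothetical stand-in for food_info (values are made up for illustration).
food_info = pd.DataFrame({"Sodium_(mg)": [30, 10, 20]})

# Without inplace=True, sort_values returns a new, sorted DataFrame.
sorted_desc = food_info.sort_values("Sodium_(mg)", ascending=False)

print(sorted_desc["Sodium_(mg)"].tolist())  # [30, 20, 10]
print(food_info["Sodium_(mg)"].tolist())    # [30, 10, 20], original order kept
print(sorted_desc.index.tolist())           # [0, 2, 1], labels move with the rows
```

The last line shows that sorting reorders the rows but keeps each row's original index label.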

Incidentally, DataFrame.iloc[] selects rows by position, so you can use it to control how many rows are shown. iloc stands for _integer location_.

firstfive.py

first_five_rows = new_titanic_survival.iloc[0:5]

##Indexing

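.loc[] and .iloc[] seem to be the standard tools. .loc[] selects by label: row index labels and column names, which can be strings or numbers. .iloc[] selects by integer position only.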

get_rows_or_columns.py

first_row_first_column = new_titanic_survival.iloc[0,0]
all_rows_first_three_columns = new_titanic_survival.iloc[:,0:3]
row_index_83_age = new_titanic_survival.loc[83,"age"]
row_index_766_pclass = new_titanic_survival.loc[766,"pclass"]
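
The difference is easiest to see when the index labels do not match the row positions. The sketch below uses a made-up DataFrame (not the Titanic data) with labels 10, 83 and 766.

```python
import pandas as pd

# Hypothetical data whose index labels differ from the row positions.
df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0], "pclass": [3, 1, 3]},
    index=[10, 83, 766],
)

by_position = df.iloc[0, 0]   # first row, first column
by_label = df.loc[83, "age"]  # row labelled 83, column "age"

print(by_position)  # 22.0
print(by_label)     # 38.0
```

Here df.iloc[83] would raise an IndexError because the DataFrame only has three rows, even though a row labelled 83 exists.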

###ReIndexing

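After manipulating the data, it is sometimes more convenient to renumber the index, because leftover index labels can get in the way when retrieving rows. DataFrame.reset_index() does this. If you want to discard the old index instead of keeping it as a column, don't forget to pass drop=True.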

reset_index.py

#Reindex the new_titanic_survival dataframe so the row indexes start from 0, and the old index is dropped.
#Assign the final result to titanic_reindexed.
#Print the first 5 rows and the first 3 columns of titanic_reindexed.

titanic_reindexed = new_titanic_survival.reset_index(drop=True)
print(titanic_reindexed.iloc[0:5, 0:3])
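
To see what drop=True changes, here is a minimal sketch on a made-up DataFrame. Without drop=True, the old labels are kept in a new column named "index".

```python
import pandas as pd

# Hypothetical data with one missing value (made up for illustration).
df = pd.DataFrame({"age": [22.0, None, 26.0]})
filtered = df.dropna()  # the rows labelled 0 and 2 remain

reindexed = filtered.reset_index(drop=True)  # old labels are discarded
kept_old = filtered.reset_index()            # old labels become a column

print(filtered.index.tolist())    # [0, 2]
print(reindexed.index.tolist())   # [0, 1]
print(kept_old.columns.tolist())  # ['index', 'age']
```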

##Dealing with NaN (null)

You can check for null values with pandas.isnull(). As a small application, to get only the entries where sex is null:

get_null_rows.py

# Get the sex column
sex = titanic_survival["sex"]

# Check each value for null; produces a Series of True/False
sex_is_null = pandas.isnull(sex)

# Keep only the entries where the check was True
sex_null_true = sex[sex_is_null]

The DataFrame.dropna() method removes the rows that contain NaN values from a DataFrame.

The dropna() method takes an axis parameter, which indicates whether you would like to drop rows or columns. Specifying axis=0 or axis='index' will drop any rows that have null values, while specifying axis=1 or axis='columns' will drop any columns that have null values. We will use 0 and 1 since they're more commonly used, but you can use either.

In other words, dropna() takes an axis argument that answers the question "do I want to remove the rows that contain NaN values, or the columns?" Use 0 or 'index' to remove rows, and 1 or 'columns' to remove columns.

drop_na_rows.py

drop_na_rows = titanic_survival.dropna(axis=0)
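
A minimal sketch of the two axis settings, using a made-up two-column DataFrame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical data: only the "age" column has a missing value.
df = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "sex": ["male", "female", "female"],
})

rows_dropped = df.dropna(axis=0)  # removes the row where age is NaN
cols_dropped = df.dropna(axis=1)  # removes the "age" column

print(rows_dropped.shape)             # (2, 2)
print(cols_dropped.columns.tolist())  # ['sex']
```

With axis=1 a single NaN is enough to lose the whole column, so dropping rows is usually the gentler option.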

##Pivot Table

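Pivot tables are handy for aggregating and comparing a single dataset from several angles. As Exercise 3 shows, you can do this manually without the built-in method, but here we solve the problems with DataFrame.pivot_table().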

pivot_table.py

import numpy as np

# Mean fare for each passenger class
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)

index: the column to group by
values: the column to apply the calculation or method to
aggfunc: the function to apply
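
The three arguments can be tried out on a tiny made-up table. In this sketch aggfunc is passed as the string "mean", which pandas also accepts, and the groupby line is my own comparison to show the same aggregation written another way.

```python
import pandas as pd

# Hypothetical passenger data (made up for illustration).
titanic = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 100.0, 20.0, 8.0, 12.0],
})

# Group by pclass and take the mean of fare within each group.
by_pivot = titanic.pivot_table(index="pclass", values="fare", aggfunc="mean")

# The same aggregation written with groupby, for comparison.
by_groupby = titanic.groupby("pclass")["fare"].mean()

print(by_pivot.index.tolist())    # [1, 2, 3]
print(by_pivot["fare"].tolist())  # [90.0, 20.0, 10.0]
print(by_groupby.tolist())        # [90.0, 20.0, 10.0]
```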

#Applying functions

To apply a function to a DataFrame, use DataFrame.apply(). Personally, I picture it as a for loop over the columns.

apply.py

# This function returns the hundredth item from a series
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item

# Return the hundredth item from each column
hundredth_row_var = titanic_survival.apply(hundredth_row)

Passing axis=1 as an argument applies the function to each row instead of each column.

apply.py

def which_class(row):
    pclass = row['pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"

classes = titanic_survival.apply(which_class, axis=1)
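
To compare the two directions side by side, here is a minimal sketch on a made-up DataFrame. The lambda functions are only illustrations and are not part of the exercises.

```python
import pandas as pd

# Hypothetical numeric data (made up for illustration).
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series.
per_column = df.apply(lambda column: column.max())

# axis=1: the function receives each row as a Series.
per_row = df.apply(lambda row: row["a"] + row["b"], axis=1)

print(per_column.tolist())  # [3, 30]
print(per_row.tolist())     # [11, 22, 33]
```

The result has one entry per column in the first case and one entry per row in the second.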

#Exercises
I'm collecting here, as notes to myself, the problems that caught my attention or led me to discover something new.

##Exercise 1

Count how many values in the "age" column have null values:
Use pandas.isnull() on age variable to create a Series of True and False values.
Use the resulting series to select only the elements in age that are null, and assign the result to age_null_true.
Assign the length of age_null_true to age_null_count.
Print age_null_count to see how many null values are in the "age" column.

##My answer

script.py

age = titanic_survival["age"]
age_is_null = age.isnull()
age_null_true = age[age_is_null]
age_null_count = age_null_true.shape[0]
print(age_null_count)

##Exercise 2

Use age_is_null to create a vector that only contains values from the "age" column that aren't NaN.
Calculate the mean of the new vector, and assign the result to correct_mean_age.

##My answer

get_mean.py

age_is_null = pd.isnull(titanic_survival["age"]) == False
age_null_false = titanic_survival[age_is_null]
correct_mean_age = sum(age_null_false["age"]) / len(age_null_false)

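##Lessons learned

At key points I wasn't picturing what kind of data I was actually holding. For example, on this problem the following code initially kept raising an error for a while.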

errr.py

age_is_null = pd.isnull(titanic_survival["age"]) == False
age_null_false = titanic_survival[age_is_null]
correct_mean_age = sum(age_null_false) / len(age_null_false)

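Here we want the sum of the age values in the rows where age is not null, so we first need to __take the age column out of the age_null_false table__. I was passing the whole table instead, which caused the error.

Addendum:
I ran into a similar error again, so I'm noting it here as a reminder.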

count_thanksgiving.py

import pandas as pd
data = pd.read_csv("thanksgiving.csv", encoding="Latin-1")
dont_celebrate = pd.isnull(data["Do you celebrate Thanksgiving?"]) == False
celebrate_only = data[dont_celebrate]
celebrate_only.value_counts()

#AttributeError: 'DataFrame' object has no attribute 'value_counts'

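celebrate_only is still a two-dimensional DataFrame, so Series methods are not available until you select one column with celebrate_only["column"] and turn it into one-dimensional data. (Newer pandas versions do provide DataFrame.value_counts(), but it counts unique rows rather than the values of a single column.) Incidentally, pandas.Series.value_counts() is what I recommend when you want a count per category for the values in a given column.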

count_by_category.py

import pandas as pd
typical_main_dish = pd.isnull(data["What is typically the main dish at your Thanksgiving dinner?"]) == False
new_data = data[typical_main_dish]
new_data["What is typically the main dish at your Thanksgiving dinner?"].value_counts()
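
A minimal sketch of Series.value_counts() on made-up survey answers. The column name main_dish is a hypothetical short name, not the real survey column.

```python
import pandas as pd

# Hypothetical survey answers, including one missing value.
data = pd.DataFrame({"main_dish": ["Turkey", "Ham", "Turkey", None]})

# Selecting one column gives a Series, which has value_counts().
counts = data["main_dish"].value_counts()

print(counts.index.tolist())  # ['Turkey', 'Ham']
print(counts.tolist())        # [2, 1]
```

value_counts() leaves out NaN by default, so filtering the nulls beforehand is optional when only the counts are needed.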

###Approach 2

Many pandas methods, such as Series.mean(), skip NaN values automatically.

correct_mean_fare = titanic_survival["fare"].mean()
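
This can be confirmed with a small sketch on made-up values. skipna is a parameter of Series.mean() that defaults to True.

```python
import pandas as pd

# Hypothetical fares with one missing value (made up for illustration).
fare = pd.Series([10.0, None, 30.0])

mean_skipping_nan = fare.mean()               # NaN is ignored: (10 + 30) / 2
mean_with_nan = fare.mean(skipna=False)       # NaN propagates into the result

print(mean_skipping_nan)  # 20.0
print(mean_with_nan)      # nan
```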

##Exercise 3

Use a for loop to iterate over passenger_classes. Within the for loop:
Select just the rows in titanic_survival where the pclass value is equivalent to the current iterator value (class).
Select just the fare column for the current subset of rows.
Use the Series.mean method to calculate the mean of this subset.
Add the mean of the class to the fares_by_class dictionary with class as the key.
Once the loop completes, the dictionary fares_by_class should have 1, 2, and 3 as keys, with the average fares as the corresponding values.

get_mean_fare.py

passenger_classes = [1, 2, 3]
fares_by_class = {}

for passenger_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["pclass"] == passenger_class]
    pclass_fares = pclass_rows["fare"]
    mean_fare = pclass_fares.mean()
    fares_by_class[passenger_class] = mean_fare

booleans_only.py

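# Attempt that did not work (this ran inside the for loop above)
pclass_rows_equivalent = pandas.isnull(titanic_survival["pclass"] == passenger_class)
pclass_rows = titanic_survival[pclass_rows_equivalent]

The isnull() call above is unnecessary. The comparison already returns a Series of booleans, and isnull() on that Series is False everywhere, so no rows are selected. Use the comparison directly as the filter: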

pclass_rows = titanic_survival[titanic_survival["pclass"] == passenger_class]
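
To see what each version actually selects, here is a minimal sketch on a made-up four-row table (not the real Titanic data).

```python
import pandas as pd

# Hypothetical passenger data (made up for illustration).
titanic = pd.DataFrame({
    "pclass": [1, 2, 3, 1],
    "fare": [80.0, 20.0, 8.0, 100.0],
})

# The comparison itself is already a boolean mask.
mask = titanic["pclass"] == 1
# Wrapping the mask in isnull() asks "is this True/False value missing?"
wrapped = pd.isnull(mask)

print(mask.tolist())                 # [True, False, False, True]
print(wrapped.tolist())              # [False, False, False, False]
print(len(titanic[mask]))            # 2
print(len(titanic[wrapped]))         # 0
print(titanic[mask]["fare"].mean())  # 90.0
```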

##Exercise 4

Use the DataFrame.pivot_table() method to calculate the mean age for each passenger class ("pclass").
Assign the result to passenger_age.
Display the passenger_age pivot table using the print() function.

calculate_mean.py

import numpy as np
passenger_age = titanic_survival.pivot_table(index="pclass", values="age", aggfunc=np.mean)
print(passenger_age)

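##Exercise 5

If you put the columns you want into a list and pass it as the values argument, the aggregation is applied to several columns at once. In the exercise below, the data is grouped by embarked, and the fare and survived values are summed separately for each column.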

Make a pivot table that calculates the total fares collected ("fare") and total number of survivors ("survived") for each embarkation port ("embarked").
Assign the result to port_stats.
Display port_stats using the print() function.

get_port_stats.py

import numpy as np
port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc=np.sum)
print(port_stats)

##Exercise 6

Drop all rows in titanic_survival where the columns "age" or "sex" have missing values and assign the result to new_titanic_survival.

new_titanic_survival.py

new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])

##Exercise 7

Assign the first ten rows from new_titanic_survival to first_ten_rows.
Assign the fifth row from new_titanic_survival to row_position_fifth.
Assign the row with index label 25 from new_titanic_survival to row_index_25.

###My answer

# We have already sorted new_titanic_survival by age
first_ten_rows = new_titanic_survival.iloc[0:10]
row_position_fifth = new_titanic_survival.iloc[4]
row_index_25 = new_titanic_survival.iloc[24]

###Lessons learned

Remember to use .loc when addressing by label, and .iloc when indexing by position.

.loc is for looking rows up by their index label (the unique id), and .iloc is for specifying a row's physical position in the data.

get_rows.py

# We have already sorted new_titanic_survival by age
first_ten_rows = new_titanic_survival.iloc[0:10]
row_position_fifth = new_titanic_survival.iloc[4]
row_index_25 = new_titanic_survival.loc[25]

##Exercise 8

Assign the value at row index label 1100, column index label "age" from new_titanic_survival to row_index_1100_age.
Assign the value at row index label 25, column index label "survived" from new_titanic_survival to row_index_25_survived.
Assign the first 5 rows and first three columns from new_titanic_survival to five_rows_three_cols.

get_rows_cols.py

row_index_1100_age = new_titanic_survival.loc[1100, "age"]
row_index_25_survived = new_titanic_survival.loc[25, "survived"]
five_rows_three_cols = new_titanic_survival.iloc[0:5, 0:3]

##Exercise 9

Write a function that counts the number of null elements in a Series.
Use the DataFrame.apply() method along with your function to run across all the columns in titanic_survival.
Assign the result to column_null_count.

count_null.py

def null_count(column):
    # Boolean Series: True where the value is null
    is_null = column.isnull()
    # Keep only the null entries of this column
    null_values = column[is_null]
    return len(null_values)

column_null_count = titanic_survival.apply(null_count)
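
As an alternative sketch (my own variation, not the Dataquest answer), the nulls can be counted by summing the boolean Series, since True counts as 1. The DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data with missing values in both columns.
df = pd.DataFrame({
    "age": [22.0, None, None],
    "sex": ["male", None, "female"],
})

def null_count(column):
    # isnull() gives True/False per value; sum() counts the True values.
    return column.isnull().sum()

column_null_count = df.apply(null_count)

print(column_null_count.index.tolist())  # ['age', 'sex']
print(column_null_count.tolist())        # [2, 1]
```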

##Exercise 10

Create a function that returns the string "minor" if someone is under 18, "adult" if they are equal to or over 18, and "unknown" if their age is null.
Then, use the function along with .apply() to find the correct label for everyone in the titanic_survival dataframe.
Assign the result to age_labels.
You can use pd.isnull to check if a value is null or not.

labels_age.py

def labels_age(row):
    age = row['age']
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"
    
age_labels = titanic_survival.apply(labels_age, axis=1)
