More than 5 years have passed since last update.

Python for Data Analysis Chapter 2

Last updated at 2013-10-09Posted at 2013-10-09

Introductory Examples

すべての章でiPythonのインタラクティブシェルを利用している。
起動は$ ipython --pylabで行う。
pandasのバージョン0.10.0だとapply関数のバグがあるので最新のを入れましょう。

usa.gov data from bit.ly

元データがJSON形式なのでこれをパースして取り込む。

import json

path = "data.txt"
records = [json.loads(line) for line in open(path)]
time_zones = [rec["tz"] for rec in records if "tz" in rec]

要素のカウントはcollections.Counterを用いると良い。

from collections import Counter

counts = Counter(time_zones)

"""
counts.most_common(10)
In [103]: counts.most_common(10)
Out[103]: 
[(u'America/Chicago', 3641),
 (u'America/New_York', 2474),
 (u'', 1623),
 (u'Europe/London', 590),
 (u'America/Los_Angeles', 500),
 (u'Asia/Tokyo', 226),
 (u'America/Indianapolis', 192),
 (u'America/Rainy_River', 166),
 (u'America/Denver', 160),
 (u'Asia/Calcutta', 148)]
"""

上位10個の要素がとれる。

pandasを用いて解析をする

from pandas import DataFrame, Series
import pandas as pd

frame = DataFrame(records)

"""
In [104]: frame
Out[104]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12202 entries, 0 to 12201
Data columns:
_heartbeat_    120  non-null values
a              12082  non-null values
al             10109  non-null values
c              10485  non-null values
cy             10485  non-null values
g              12082  non-null values
gr             10485  non-null values
h              12082  non-null values
hc             12082  non-null values
hh             12082  non-null values
kw             1225  non-null values
l              12082  non-null values
ll             10462  non-null values
nk             12082  non-null values
r              12082  non-null values
t              12082  non-null values
tz             12082  non-null values
u              12082  non-null values
dtypes: float64(4), object(14)
"""

pandasのDataFrameは辞書のリストを渡すとRのdataframeの様にDataFrame型にデータを格納してくれる。
イメージは表

index	a	al	c	cy	g	. . .
0	Mozilla…	en-us	US	Durand	YmtpnZ	…
1	Mozilla…	en-US	US	Plymouth	YmtpnZ	…
2	Mozilla…	en-us	US	Pollock	10nPD5S	…
3	Mozilla…	NaN	US	Malden	oVXSUv	…
4	…	…	…	…	…	…

例えばtz列の中身はタイムゾーン情報になっている。

"""
In [105]: frame["tz"][:10]
Out[105]: 
0     America/Chicago
1     America/Chicago
2     America/Chicago
3    America/New_York
4     America/Chicago
5     America/Chicago
6     America/Chicago
7       Asia/Calcutta
8     America/Chicago
9     America/Chicago
Name: tz
"""

このframe["tz"]はSeriesオブジェクトとして帰ってくる。
これは列の要素をカウントしてくれるvalue_countsメソッドを持っている

tz_counts = frame["tz"].value_counts()
tz_counts[:10]

"""
In [115]: tz_counts[:10]
Out[115]: 
America/Chicago         3641
America/New_York        2474
Unknown                 1623
Europe/London            590
America/Los_Angeles      500
Asia/Tokyo               226
America/Indianapolis     192
America/Rainy_River      166
America/Denver           160
Asia/Calcutta            148
"""

このデータをプロットするためにNA値や空の文字列値""になっている物を取り除く。

clean_tz = frame["tz"].fillna("Missing")
clean_tz[clean_tz == ""] = "Unknown"
tz_counts = clean_tz.value_counts()

"""
In [117]: tz_counts[:10]
Out[117]: 
America/Chicago         3641
America/New_York        2474
Unknown                 1623
Europe/London            590
America/Los_Angeles      500
Asia/Tokyo               226
America/Indianapolis     192
America/Rainy_River      166
America/Denver           160
Asia/Calcutta            148
"""

可視化

tz_counts[:10].plot(kind="barh", rot=0)

これにOS（Windowsか非Windowsか）の情報を付加する。

# notnull()はnull値のところでFalseを返すので、これでframeをマスクする
cframe = frame[frame.a.notnull()]
# whereは第一引数がTrueだったら第二引数を、Falseだったら第三引数を返す
operating_system = np.where(cframe["a"].str.contains("Windows"), "Windows", "Not windows")

"""
In [130]: operating_system[:10]
Out[130]: 
0        Windows
1        Windows
2    Not windows
3    Not windows
4        Windows
5        Windows
6        Windows
7        Windows
8        Windows
9        Windows
Name: a
"""
# operatingSystemの中で同じタイムゾーンを持つものをグループ化する
by_tz_os = cframe.groupby(["tz", operating_system])
# テーブル型に変換
agg_counts = by_tz_os.size().unstack().fillna(0)
# タイムゾーンの値でソート（WindowsとNot windowsで分かれているので一度sumをとる）
indexer = agg_counts.sum(1).argsort()
# agg_countsを並び替える
count_subset = agg_counts.take(indexer)[-10:]
# プロット
count_subset.plot(kind="barh", stacked=True)
# パーセンテージ化
normed_subset = count_subset.div(count_subset.sum(1), axis=0)
normed_subset.plot(kind="barh",stacked=True)

MovieLens 1M Data Set

元データ - MovieLens 1M Data Set

データの読み込み

import pandas as pd
unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_table("users.dat", sep="::", header=None, names=unames)

rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_table("ratings.dat", sep="::", header=None, names=rnames)

mnames = ["user_id", "title", "genres"]
movies = pd.read_table("movies.dat", sep="::", header=None, names=rnames)

"""
In [151]: users[:5]
Out[151]: 
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

In [152]: ratings[:5]
Out[152]: 
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

In [153]: movies[:5]
Out[153]: 
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
"""
# 読み込んだデータを結合。
data = pd.merge(pd.merge(ratings, users), movies)

"""
In [155]: data[:5]
Out[155]: 
   user_id  movie_id  rating  timestamp gender  age  occupation    zip  \
0        1      1193       5  978300760      F    1          10  48067   
1        2      1193       5  978298413      M   56          16  70072   
2       12      1193       4  978220179      M   25          12  32793   
3       15      1193       4  978199279      M   25           7  22903   
4       17      1193       5  978158471      M   50           1  95350   

                                    title genres  
0  One Flew Over the Cuckoo's Nest (1975)  Drama  
1  One Flew Over the Cuckoo's Nest (1975)  Drama  
2  One Flew Over the Cuckoo's Nest (1975)  Drama  
3  One Flew Over the Cuckoo's Nest (1975)  Drama  
4  One Flew Over the Cuckoo's Nest (1975)  Drama  

In [156]: data.ix[0]
Out[156]: 
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0
"""

性別と映画のタイトルで映画の評価値をクロス集計する。

mean_ratings = data.pivot_table("rating", rows="title", cols="gender", aggfunc="mean")

"""
mean_ratings[:5]
In [160]: mean_ratings[:5]
Out[160]: 
gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
"""

映画のタイトル別に評価の合計値を計算

# groupbyでタイトル別にグループ化し、sizeでタイトルごとのSeriesを得る
ratings_by_title = data.groupby("title").size()

"""
In [170]: ratings_by_title[:5]
Out[170]: 
title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
"""

# 適当にレートの合計が高いもの（よく観られている映画）をとってくる
active_titles = ratings_by_title.index[ratings_by_title >= 1000]

# mean_ratingsからactive_titlesのみを選択
active_title_mean_ratings = mean_ratings.ix[active_titles]

# 男女別に高評価のものを取ってくる
top_male_ratings = active_title_mean_ratings.sort_index(by="M", ascending=False)
top_female_ratings = active_title_mean_ratings.sort_index(by="F", ascending=False)

"""
In [205]: top_male_ratings[:5]
Out[205]: 
gender                                            F         M
title                                                        
Godfather, The (1972)                      4.314700  4.583333
Shawshank Redemption, The (1994)           4.539075  4.560625
Raiders of the Lost Ark (1981)             4.332168  4.520597
Usual Suspects, The (1995)                 4.513317  4.518248
Star Wars: Episode IV - A New Hope (1977)  4.302937  4.495307

In [207]: top_female_ratings[:5]
Out[207]: 
gender                                   F         M
title                                               
Schindler's List (1993)           4.562602  4.491415
Shawshank Redemption, The (1994)  4.539075  4.560625
Usual Suspects, The (1995)        4.513317  4.518248
Rear Window (1954)                4.484536  4.472991
Sixth Sense, The (1999)           4.477410  4.379944
"""

男女評価の差が大きい映画を調べる

active_title_mean_ratings["diff"] = active_title_mean_ratings["M"] - active_title_mean_ratings["F"]
sorted_by_diff = active_title_mean_ratings.sort_index(by="diff")

"""
女性に人気
In [211]: sorted_by_diff[:15]
Out[211]: 
gender                                               F         M      diff
title                                                                     
Rocky Horror Picture Show, The (1975)         3.673016  3.160131 -0.512885
Mary Poppins (1964)                           4.197740  3.730594 -0.467147
Gone with the Wind (1939)                     4.269841  3.829371 -0.440471
Full Monty, The (1997)                        4.113456  3.760976 -0.352481
Little Mermaid, The (1989)                    3.975936  3.632375 -0.343561
Pretty Woman (1990)                           3.846914  3.511700 -0.335213
Thelma & Louise (1991)                        3.916268  3.581582 -0.334686
Clueless (1995)                               3.827004  3.514640 -0.312365
Ghost (1990)                                  3.698667  3.395194 -0.303473
Willy Wonka and the Chocolate Factory (1971)  4.063953  3.789474 -0.274480
League of Their Own, A (1992)                 3.865497  3.595773 -0.269724
When Harry Met Sally... (1989)                4.257028  3.987850 -0.269178
Titanic (1997)                                3.764228  3.499051 -0.265176
Beauty and the Beast (1991)                   4.054945  3.797414 -0.257531
Romancing the Stone (1984)                    3.854227  3.632735 -0.221493

男性に人気
In [212]: sorted_by_diff[::-1][:15]
Out[212]: 
gender                                              F         M      diff
title                                                                    
Animal House (1978)                          3.628906  4.167192  0.538286
Reservoir Dogs (1992)                        3.769231  4.213873  0.444642
South Park: Bigger, Longer and Uncut (1999)  3.422481  3.846686  0.424206
Airplane! (1980)                             3.656566  4.064419  0.407854
Predator (1987)                              3.299401  3.706195  0.406793
Godfather: Part II, The (1974)               4.040936  4.437778  0.396842
Clockwork Orange, A (1971)                   3.757009  4.145813  0.388803
Aliens (1986)                                3.802083  4.186684  0.384601
Starship Troopers (1997)                     2.802721  3.181102  0.378381
Apocalypse Now (1979)                        3.955307  4.294885  0.339577
Full Metal Jacket (1987)                     3.821839  4.157407  0.335568
Terminator 2: Judgment Day (1991)            3.785088  4.115367  0.330279
Alien (1979)                                 3.888252  4.216119  0.327867
Mad Max 2 (a.k.a. The Road Warrior) (1981)   3.393701  3.713341  0.319640
Terminator, The (1984)                       3.899729  4.205899  0.306170
"""

# 標準偏差
rating_std_by_title = data.groupby("title")["rating"].std()
# active_titlesを抽出
rating_std_by_title = rating_std_by_title.ix[active_titles]
# ソート
rating_std_by_title.order(ascending=False)[:10]

"""
In [228]: rating_std_by_title.order(ascending=False)[:10]
Out[228]: 
title
Blair Witch Project, The (1999)                 1.316368
Rocky Horror Picture Show, The (1975)           1.260177
South Park: Bigger, Longer and Uncut (1999)     1.235380
Armageddon (1998)                               1.203439
Starship Troopers (1997)                        1.203245
Mars Attacks! (1996)                            1.185632
Titanic (1997)                                  1.167448
Austin Powers: The Spy Who Shagged Me (1999)    1.151063
Face/Off (1997)                                 1.136265
Magnolia (1999)                                 1.132217
Name: rating
"""

US Baby Names 1880-2010

元データ - National data (7Mb)

データを取り込んで確認。
ちなみにiPython上で!headという感じに!をつけるとシェルのコマンドを実行できるみたい。
なので下のコマンドを打たなくても確認だけなら!head -5 yob1880.txtで良い。

import pandas as pd
names1880 = pd.read_csv("yob1880.txt", names=["name", "sex", "births"])

"""
In [243]: names1880[:5]
Out[243]: 
        name sex  births
0       Mary   F    7065
1       Anna   F    2604
2       Emma   F    2003
3  Elizabeth   F    1939
4     Minnie   F    1746
"""

# 男女の出生数を確認
"""
In [265]: names1880.groupby("sex").sum()
Out[265]: 
     births
sex        
F     90993
M    110491
"""

一つのファイルは上の感じなので、とりあえず全部のファイルを一つのnames変数に入れる。

years = range(1880, 2013)
pieces = []
columns = ["name", "sex", "births"]

for year in years:
    path = "names/yob%d.txt" % year
    frame = pd.read_csv(path, names=columns)
    frame["year"] = year
    pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)

"""
In [263]: names[:5]
Out[263]: 
        name sex  births  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880

In [264]: names[-5:]
Out[264]: 
              name sex  births  year
1758725      Zylin   M       5  2012
1758726     Zymari   M       5  2012
1758727      Zyrin   M       5  2012
1758728      Zyrus   M       5  2012
1758729  Zytaevius   M       5  2012
"""

ほんでデータの整理

# 出生年度と性別で出生数をクロス集計
total_births = names.pivot_table("births", rows="year", cols="sex", aggfunc=sum)

"""
In [269]: total_births.tail()
Out[269]: 
sex         F        M
year                  
2008  1886109  2035075
2009  1831382  1977632
2010  1770632  1911572
2011  1750078  1889557
2012  1743626  1877705
"""

# プロッティング
total_births.plot(title="Total births by sex and year")

def add_prop(group):
    #Integer division floors
    births = group.births.astype(float)
    group["prop"] = births / births.sum()
    return group

# その年の性別別出生人数に対してその名前の割合を示すcolumnを追加
names = names.groupby(["year", "sex"]).apply(add_prop)
# 誤差を含めた合計が1かどうかで確からしさを確認
"""
In [305]: np.allclose(names.groupby(["year","sex"]).prop.sum(),1)
Out[305]: True
"""

# 名前・性別ごとに人気の名前を1000個とってくる
pieces = []
for year_sex, group in names.groupby(["year", "sex"]):
    pieces.append(group.sort_index(by="births", ascending=False)[:1000])
top1000 = pd.concat(pieces, ignore_index=True)
"""
In [356]: top1000
Out[356]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 265877 entries, 0 to 265876
Data columns:
name      265877  non-null values
sex       265877  non-null values
births    265877  non-null values
year      265877  non-null values
prop      265877  non-null values
dtypes: float64(1), int64(2), object(2)
"""

これで人気の名前データがとれてきたのでこれを使って解析をする。

Analyzing Naming Trends

# 男女別データ
boys = top1000[top1000.sex == "M"]
girls = top1000[top1000.sex == "F"]
# 年度・性別別出生数
total_births = top1000.pivot_table("births", rows="year", cols="name", aggfunc=sum)
subset = total_births[["John", "Harry", "Mary", "Marilyn"]]
subset.plot(subplots=False, figsize=(12,10), grid=False, title="Number of births per year")

Measuring the increase in naming diversity

親は子どもに一般的な名前を付けなくなっているのではないかという仮説を調べる。

# 年度別にtop1000の名前がその年の何パーセントを占めているか調べる
table = top1000.pivot_table("prop", rows="year", cols="sex", aggfunc=sum)
table.plot(title="Sum of table1000.prop by year and sex", yticks=np.linspace(0,1.2,13), xticks=range(1880,2020,10))

確かにtop1000が占める割合が減少していることがわかる。
さてここで、ある年度について上位何種類までの名前がその年の名前の割合の半分を占めているか気になってくる。
2012年と1900年の男性データについて見てみる。

# まず2012年のデータ
df = boys[boys.year == 2012]
prop_cumsum = df.sort_index(by="prop", ascending=False).prop.cumsum()
"""
In [398]: prop_cumsum[:10]
Out[398]: 
264877    0.010065
264878    0.020107
264879    0.029452
264880    0.038613
264881    0.047520
264882    0.056407
264883    0.064935
264884    0.073454
264885    0.081498
264886    0.089369
"""
# propの値が大きい順に加算されているのがわかる
# この値が0.5を超えるところを探せば良い（searchsorted())
"""
In [399]: prop_cumsum.searchsorted(0.5)
Out[399]: 123
"""

# 次に1990年のデータ
df = boys[boys.year == 1990]
in1990 = df.sort_index(by="prop", ascending=False).prop.cumsum()
in1990.searchsorted(0.5) + 1
"""
In [402]: in1990.searchsorted(0.5) + 1
Out[402]: 45
"""

これにより2012年は124種類の名前が上位50%を占めているが、1990年はわずか45種類が上位50%を占めていることがわかる。
他の年についても見てゆく。

def get_quantile_count(group, q=0.5):
    group = group.sort_index(by="prop", ascending=False)
    return group.prop.cumsum().searchsorted(q) + 1

diversity = top1000.groupby(["year", "sex"]).apply(get_quantile_count)
diversity = diversity.unstack("sex")
diversity.plot(title="Number of popular names in top 50%")

見てわかるように男女共に名前のバラエティは増えている。

The "Last letter" Revolution

男性の名前のlast letter分布はここ100年で移り変わっているという研究がなされているので、これを確かめる。

get_last_letter = lambda x: x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = "last_letter"
table = names.pivot_table("births", rows=last_letters, cols=["sex", "year"], aggfunc=sum)
subtable = table.reindex(columns=[1910,1960,2010],level="year")
"""
In [54]: subtable.head()
Out[54]: 
sex               F                      M                
year           1910    1960    2010   1910    1960    2010
last_letter                                               
a            108395  691251  675059    977    5207   28772
b               NaN     694     454    411    3914   39149
c                 5      49     953    482   15472   23279
d              6751    3732    2633  22114  262140   44690
e            133592  435050  315891  28662  178814  129916"
"""
# 正規化
"""
In [55]: subtable.sum()
Out[55]: 
sex  year
F    1910     396482
     1960    2022126
     2010    1770632
M    1910     194214
     1960    2132733
     2010    1911572
dtype: float64
"""
letter_prop = subtable / subtable.sum().astype(float)

見てわかるように男性のnは明らかに近年増えている。

letter_prop = table / table.sum().astype(float)
dny_ts = letter_prop.ix[["d","n","y"],"M"].T
"""
In [68]: dny_ts.head()
Out[68]: 
             d         n         y
year                              
1880  0.083057  0.153216  0.075762
1881  0.083249  0.153207  0.077452
1882  0.085340  0.149560  0.077537
1883  0.084058  0.151649  0.079146
1884  0.086120  0.149915  0.080405
"""

Boy names that became girl names (and vice versa)

他の面白い流行としては昔と今では性別の違う名前があることだ。
たとえばLesleyやLeslieのような。

all_names = top1000.name.unique()
mask = np.array(["lesl" in x.lower() for x in all_names])
lesley_like = all_names[mask]
"""
In [75]: lesley_like
Out[75]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
"""

filtered = top1000[top1000.name.isin(lesley_like)]
"""
In [79]: filtered.groupby("name").births.sum()
Out[79]: 
name
Leslee       993
Lesley     35028
Lesli        929
Leslie    372981
Lesly      10839
Name: births, dtype: int64
"""

table = filtered.pivot_table("births", rows="year",cols="sex",aggfunc="sum")
table = table.div(table.sum(1),axis=0)
"""
In [82]: table.tail()
Out[82]: 
sex   F   Mask
year       
2008  1 NaN
2009  1 NaN
2010  1 NaN
2011  1 NaN
2012  1 NaN
"""
table.plot(style={"M":"k-","F":"k--"})

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up