
Pandas Basics: What Is a Series Object?

Posted at 2018-03-27

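Continuing from last time, I'm going to keep working through Dataquest problems. I'll organize what I learn along the way as best I can and put it near the top so it doubles as a memo to myself. In places I quote the original text verbatim, so please bear with me.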

The Very Basics of NumPy
The Very Basics of Pandas #1
The Very Basics of Pandas #2

Series Objects

In a nutshell

collections of values
In other words, a Series is a collection of values.

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

Put more simply, it is like a one-dimensional list. Its two-dimensional counterpart is called a DataFrame.

Key features

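For speed, a Series is built on NumPy arrays, and on top of that it adds a range of tools for data analysis.
Where NumPy arrays use an _integer index_, _Series objects_ can also use other index types, such as strings.
The values in a Series do not all have to be the same type, and NaN is used to represent missing values.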
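The features above can be sketched in a few lines. This is only an illustration: the film labels and scores are made up, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical scores labeled with strings instead of 0, 1, 2.
scores = pd.Series([54, np.nan, 99], index=["Film A", "Film B", "Film C"])

label_lookup = scores["Film C"]       # look up a value by its string label
missing_count = scores.isna().sum()   # NaN marks the missing score for "Film B"

# Values of different types can share one Series; the dtype becomes object.
mixed = pd.Series([1, "two", 3.0])
mixed_dtype = mixed.dtype

# The data can still be handed to NumPy as a plain array.
as_array = scores.to_numpy()

print(label_lookup, missing_count, mixed_dtype, type(as_array))
```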

Data types a Series can hold

float - For float values
int - For integer values
bool - For Boolean values
datetime64[ns] - For date & time, without time zone
datetime64[ns, tz] - For date & time, with time zone
timedelta64[ns] - For representing differences in dates & times (seconds, minutes, etc.)
category - For categorical values
object - For string values

If we only had these two Series objects and wanted to look up the Rotten Tomatoes scores for Minions (2015) and Leviathan (2014), we'd have to:

Find the integer index corresponding to Minions (2015) in series_film
Look up the value at that integer index from series_rt
Find the integer index corresponding to Leviathan (2014) in series_film
Look up the value at that integer index from series_rt

To accomplish this, we need to move away from using integer indexes, and use string indexes corresponding to the film names instead. Then we can pass in a list of strings matching the film names to retrieve the scores, like so:

lookup_value.py

series_custom[['Minions (2015)', 'Leviathan (2014)']]
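As a rough sketch of the difference between the two approaches, the block below builds small stand-ins for series_film and series_rt. The scores and the third film label are invented for illustration and are not the values from the actual data set.

```python
import pandas as pd

# Hypothetical stand-ins for series_film and series_rt (default integer index).
series_film = pd.Series(["Minions (2015)", "Leviathan (2014)", "Film X (2015)"])
series_rt = pd.Series([54, 99, 70])

# Integer-index route: find the position in one Series, then read the other.
minions_pos = series_film[series_film == "Minions (2015)"].index[0]
minions_score = series_rt.loc[minions_pos]

# String-index route: label the scores with the film names, then look up directly.
series_custom = pd.Series(series_rt.tolist(), index=series_film.tolist())
picked = series_custom[["Minions (2015)", "Leviathan (2014)"]]

print(minions_score)
print(picked)
```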

Re-indexing

It may help to picture this as laying the index out again in a new order. The main steps are:

Return a list representation of the current index using tolist().
Sort the index with sorted().
Use reindex() to set the newly-ordered index.
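The three steps can be sketched as follows, assuming a small series_custom with made-up scores (the values are hypothetical, not taken from the data set).

```python
import pandas as pd

# Hypothetical Series with a string index in arbitrary order.
series_custom = pd.Series(
    [54, 99, 70],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"],
)

original_index = series_custom.index.tolist()          # 1. index as a plain list
sorted_index = sorted(original_index)                  # 2. sort the labels
sorted_by_index = series_custom.reindex(sorted_index)  # 3. apply the new order

print(sorted_by_index)
```

Each score stays attached to its film name; only the order of the rows changes.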

But doing this by hand is tedious, so you can save yourself the effort by using sort_index() or sort_values() instead.

To make sorting easier, pandas comes with a sort_index() method that sorts a Series by index, and a sort_values() method that sorts a Series by its values. Since the values representing the Rotten Tomatoes scores are integers, sorting by values will return the data in numerically ascending order (low to high).

In both cases, pandas preserves the link between each element's index (film name) and value (score). We call this data alignment, which is a key tenet of pandas that's incredibly important when analyzing data. Pandas allows us to assume the linking will be preserved, unless we specifically change a value or an index.
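A minimal sketch of both methods, again using an invented series_custom (the scores are hypothetical):

```python
import pandas as pd

series_custom = pd.Series(
    [54, 99, 70],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"],
)

by_index = series_custom.sort_index()    # alphabetical by film name
by_value = series_custom.sort_values()   # ascending by score

print(by_index)
print(by_value)
```

In both results each film name still points at its own score, which is the link the passage above describes.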

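What is data alignment?

It refers to keeping the data consistent. To do that, pandas automatically keeps track of which index label belongs to which value. The link is preserved unless you deliberately break it, for example by reassigning or editing a value or an index label.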

One of pandas' core tenets is data alignment. Series objects align along indices, and DataFrame objects align along both indices and columns. With Series objects, pandas implicitly preserves the link between the index labels and the values across operations and transformations, unless we explicitly break it. With DataFrame objects, the values link to the index labels and the column labels. Pandas also preserves these links, unless we explicitly break them (by reassigning or editing a column or index label, for example).
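One way to see alignment at work is to add two Series whose labels are stored in different orders. The Series names and numbers below are hypothetical.

```python
import pandas as pd

# Same two films, but stored in opposite orders.
critics = pd.Series([54, 99], index=["Minions (2015)", "Leviathan (2014)"])
users = pd.Series([79, 52], index=["Leviathan (2014)", "Minions (2015)"])

# pandas matches values by label, not by position.
total = critics + users

print(total["Minions (2015)"])    # 54 + 52
print(total["Leviathan (2014)"])  # 99 + 79
```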

Comparison

compare.py

series_custom > 50

Writing just this compares every value in the Series object against 50. What comes back is a Boolean Series.

This expression will actually return a Series object with a Boolean value for each film.

That's because pandas applies the filter (> 50) to each value in the Series object. To retrieve the actual film names, we need to pass this Boolean series into the original Series object.

get_actual_nums.py

series_greater_than_50 = series_custom[series_custom > 50]

Pandas returns Boolean Series objects that serve as intermediate representations of the logic. These objects make it easier to separate complex logic into modular pieces. We can specify filtering criteria in different variables, then chain them together with the and operator (&) or the or operator (|). Finally, we can use a Series object's bracket notation to pass in an expression representing a Boolean Series object and get back the filtered data set.
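A short sketch of combining criteria, assuming an invented series_custom (film labels and scores are hypothetical). Note that when comparisons are written inline, each one needs its own parentheses because & and | bind more tightly than < and >.

```python
import pandas as pd

series_custom = pd.Series(
    [54, 99, 70, 40],
    index=["Film A", "Film B", "Film C", "Film D"],
)

# Criteria stored in separate variables, then combined with &.
above_50 = series_custom > 50
below_75 = series_custom < 75
both = series_custom[above_50 & below_75]

# Criteria written inline and combined with |.
either = series_custom[(series_custom < 45) | (series_custom > 90)]

print(both)
print(either)
```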

Exercises

1

Both of these data sets group the various majors into categories in the Major_category column. Let's start by understanding the number of people in each Major_category for both data sets.

To do so, you'll need to:

Return the unique values in Major_category.
Use the Series.unique() method to return the unique values in a column, like this: recent_grads['Major_category'].unique()
For each unique value:
Return all of the rows where Major_category equals that unique value.
Calculate the total number of students those rows represent (using the Total column).
Use the Series.sum() to calculate the sum of the values in a column. recent_grads['Total'].sum() returns the sum of the values in the Total column.
Keep track of the totals by adding the Major_category value and the total number of students to a dictionary.
Use the Total column to calculate the number of people who fall under each Major_category in each data set.
Store the result as a separate dictionary for each data set.
The key for the dictionary should be the Major_category, and the value should be the total count.
For the counts from all_ages, store the results as a dictionary named aa_cat_counts.
For the counts from recent_grads, store the results as a dictionary named rg_cat_counts.

Solution

get_sum.py

aa_cat_counts = dict()
rg_cat_counts = dict()

# Unique values in Major_category column.
unique_values = all_ages['Major_category'].unique()
for unique_value in unique_values:
    unique_values_age = all_ages.loc[all_ages['Major_category'] == unique_value]
    total_sum_age = unique_values_age['Total'].sum()
    aa_cat_counts[unique_value] = total_sum_age
   
unique_values = recent_grads['Major_category'].unique()
for unique_value in unique_values:
    unique_values_grad = recent_grads.loc[recent_grads["Major_category"] == unique_value]
    total_sum_grad = unique_values_grad['Total'].sum()
    rg_cat_counts[unique_value] = total_sum_grad

Reference

Select rows from a DataFrame based on values in a column in pandas

2

It looks like only about 9.85%, equivalent to a proportion of 0.0985, of graduates took on a low wage job after finishing college.

Both the all_ages and recent_grads data sets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two data sets, and perform some initial calculations to see how the statistics for recent college graduates compare with those for the entire population.

Next, let's calculate the number of majors where recent graduates did better than the overall population.

Use a for loop to iterate over majors.
For each major, use Boolean filtering to find the corresponding row in both DataFrames.
Compare the values for Unemployment_rate to see which DataFrame has a lower value.
Increment rg_lower_count if the value for Unemployment_rate is lower for recent_grads than it is for all_ages.
Display rg_lower_count with the print() function.

Solution

compare_urates.py

# All majors, common to both DataFrames
majors = recent_grads['Major'].unique()
rg_lower_count = 0
for major in majors:
    recent_grads_row = recent_grads[recent_grads["Major"] == major]
    all_ages_row = all_ages[all_ages["Major"] == major]
    rg_unemp_rate = recent_grads_row.iloc[0]["Unemployment_rate"]
    aa_unemp_rate = all_ages_row.iloc[0]["Unemployment_rate"]
    if rg_unemp_rate < aa_unemp_rate:
        rg_lower_count += 1
print(rg_lower_count)

What I noticed

Using a pivot table, something like all_ages.pivot_table(index="Major_category", values="Total", aggfunc=np.sum), might have been a cleaner way to compute the Major_category totals.
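As a rough sketch of that idea, the block below uses a tiny invented all_ages frame (the categories and totals are hypothetical) and passes the string "sum" as aggfunc. A groupby version that produces the dictionary the exercise asks for is shown alongside it.

```python
import pandas as pd

# Hypothetical miniature of the all_ages data set.
all_ages = pd.DataFrame({
    "Major_category": ["Engineering", "Arts", "Engineering"],
    "Total": [100, 50, 25],
})

# Pivot table: one row per category, summed Total.
pivot = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")

# groupby version, converted to a dictionary like aa_cat_counts.
aa_cat_counts = all_ages.groupby("Major_category")["Total"].sum().to_dict()

print(pivot)
print(aa_cat_counts)
```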

rg_unemp_rate = recent_grads_row.iloc[0]["Unemployment_rate"]
aa_unemp_rate = all_ages_row.iloc[0]["Unemployment_rate"]

Because I had left out the two lines above, I kept getting scolded with the error Can only compare identically-labeled DataFrame objects.
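A minimal reproduction of the situation, assuming two invented frames where the same major sits on different index labels. The exact wording of the error message may differ between pandas versions, so the sketch only checks that a ValueError is raised.

```python
import pandas as pd

# Hypothetical data: major "A" is row 0 in one frame and row 1 in the other.
recent_grads = pd.DataFrame({"Major": ["A", "B"], "Unemployment_rate": [0.05, 0.09]})
all_ages = pd.DataFrame({"Major": ["B", "A"], "Unemployment_rate": [0.07, 0.06]})

rg_row = recent_grads[recent_grads["Major"] == "A"]   # index label 0
aa_row = all_ages[all_ages["Major"] == "A"]           # index label 1

# Comparing the one-row frames directly fails because their labels differ.
try:
    rg_row[["Unemployment_rate"]] < aa_row[["Unemployment_rate"]]
    raised = False
except ValueError:
    raised = True

# Pulling out the scalar values first avoids the label check.
rg_lower = rg_row.iloc[0]["Unemployment_rate"] < aa_row.iloc[0]["Unemployment_rate"]

print(raised, rg_lower)
```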

3

Create a new Series object named series_custom that has a string index (based on the values from film_names), and contains all of the Rotten Tomatoes scores from series_rt.
To create a new Series object:
Import Series from pandas.
Instantiate a new Series object, which takes in a data parameter and an index parameter. See the documentation for help.
Both of these parameters need to be lists.

Solution

get_series.py

# Import the Series object from pandas
from pandas import Series

film_names = series_film.values
rt_scores = series_rt.values
film_names_list = film_names.tolist()
rt_scores_list = rt_scores.tolist()

series_custom = Series(index=film_names_list, data=rt_scores_list)

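What I noticed

The exercise asks for both the index and data arguments to be lists, so don't forget .tolist(). (pandas itself also accepts NumPy arrays here.)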

4

The list original_index contains the original index. Sort this index using the Python 3 core method sorted(), then pass the result in to the Series method reindex().
Store the result in a variable named sorted_by_index.

Solution

reindex.py

original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)

5

Normalize series_custom (which is currently on a 0 to 100-point scale) to a 0 to 5-point scale by dividing each value by 20.
Assign the new normalized Series object to series_normalized.

Solution

normalize_vals.py

series_normalized = series_custom/20

6

In the following code cell, the criteria_one and criteria_two statements return Boolean Series objects.
Return a filtered Series object named both_criteria that only contains the values where both criteria are true. Use bracket notation and the & operator to obtain this Series object.

Solution

chain_conds.py

criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]

7

rt_critics and rt_users are Series objects containing the average ratings from critics and users for each film.
Both Series objects use the same custom string index, which they base on the film names. Use the Python arithmetic operators to return a new Series object, rt_mean, that contains the mean ratings from both critics and users for each film.

Solution

get_mean.py

rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2

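What I noticed

It took me a while to understand the question. I knew that the requirements this time were:

Python's arithmetic operators have to be used, so df.mean() is off the table
rt_mean should hold, for each film, the mean of the corresponding critics and users values

Even so, the wording left it unclear whether rt_mean _contains the mean ratings from both critics and users for each film_ or _contains the mean of the two ratings from critics and users_, so ~~(probably just my own poor reading comprehension...)~~ I got a little confused.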

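The important point is that we are assuming rt_critics and rt_users both hold the same number of values under the same index labels (the same indexes for each). Otherwise the addition in this case would produce NaN for any label that appears in only one of them. If that assumption does not hold, the two need to be lined up first, for example with Series.reset_index(drop=True) when the rows should be matched by position.
Also, to avoid confusion, it is recommended that the index be unique.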
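The NaN behavior can be sketched with two small invented Series whose labels only partly overlap. The film labels and ratings are hypothetical, and matching by position with reset_index(drop=True) is just one possible way to line the values up; it only makes sense when both Series are known to be in the same row order.

```python
import pandas as pd

rt_critics = pd.Series([74, 85], index=["Film A", "Film B"])
rt_users = pd.Series([86, 80], index=["Film A", "Film C"])  # "Film C", not "Film B"

# Labels found in only one Series have no partner, so the result is NaN there.
rt_mean = (rt_critics + rt_users) / 2

# Dropping the labels makes pandas pair the values by position instead.
by_position = (
    rt_critics.reset_index(drop=True) + rt_users.reset_index(drop=True)
) / 2

print(rt_mean)
print(by_position)
```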

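At first I thought, "Why divide by 2? If you're dividing by the total number of values, shouldn't it be something like series.count()?" But thinking it over, there are two values for each film, so dividing by 2 is what's needed. This ties back to the misreading of the question I described above.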

I first thought that to get the mean you divide (rt_critics + rt_users) by rt_critics.count() + rt_users.count(), but that doesn't make any sense, since in this problem what you want is the mean of the critics and users ratings for each film. This tells me that what you would do in your example is something like (1+100)/2, (1+50)/2, (1+20)/2 and so on.
