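Continuing from last time, I'm going to keep working through Dataquest problems. I've organized what I learned along the way as best I can and put it near the top, so it can double as a memo for later. Some passages are lifted verbatim from the original material; thanks for bearing with that.
NumPy basics
pandas basics #1
pandas basics #2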
Series Objects
In a nutshell
collections of values
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
Put more simply, it is a one-dimensional list with labels. Its two-dimensional counterpart is called a DataFrame.
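Main characteristics
- It is built on NumPy arrays for speed, and on top of that it ships with a range of tools for data analysis.
- Where NumPy arrays use an _integer index_, _Series objects_ can also use strings as the index.
- The values in a Series do not have to share a single type, and `NaN` is used as the missing (`None`) value.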
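As a small sketch of the points above (the film names and scores are made-up placeholder values, not the lesson's data): a Series with string labels, a `NaN` standing in for a missing value, and a mixed-type Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data: film titles as string labels, one score missing.
scores = pd.Series(
    [54, np.nan, 99],
    index=["Minions (2015)", "Unknown Film", "Leviathan (2014)"],
)

# Look up a value by its string label instead of an integer position.
label_lookup = scores["Leviathan (2014)"]

# NaN marks the missing value and can be counted with isna().
missing_count = int(scores.isna().sum())

# Values of different types can live in one Series; the dtype becomes object.
mixed = pd.Series([1, "two", 3.0])
```

Note that the `NaN` forces the numeric Series to a float dtype, which is why integer scores show up as floats once a value is missing.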
Data types it handles
- `float` - For float values
- `int` - For integer values
- `bool` - For Boolean values
- `datetime64[ns]` - For date & time, without time zone
- `datetime64[ns, tz]` - For date & time, with time zone
- `timedelta64[ns]` - For representing differences in dates & times (seconds, minutes, etc.)
- `category` - For categorical values
- `object` - For string values
If we only had these two Series objects and wanted to look up the Rotten Tomatoes scores for `Minions (2015)` and `Leviathan (2014)`, we'd have to:
- Find the integer index corresponding to `Minions (2015)` in `series_film`
- Look up the value at that integer index from `series_rt`
- Find the integer index corresponding to `Leviathan (2014)` in `series_film`
- Look up the value at that integer index from `series_rt`
To accomplish this, we need to move away from using integer indexes, and use string indexes corresponding to the film names instead. Then we can pass in a list of strings matching the film names to retrieve the scores, like so:
series_custom[['Minions (2015)', 'Leviathan (2014)']]
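To make the contrast concrete, here is a rough sketch of both routes. The `series_film` and `series_rt` objects below are hypothetical stand-ins with placeholder scores; in the lesson they come from the Fandango data set.

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's series_film and series_rt.
series_film = pd.Series(["Minions (2015)", "Ant-Man (2015)", "Leviathan (2014)"])
series_rt = pd.Series([54, 80, 99])

# Integer-index route: find the film's integer index, then look up the score.
pos = series_film[series_film == "Minions (2015)"].index[0]
score_via_position = series_rt.loc[pos]

# String-index route: use the film names themselves as the index labels.
series_custom = pd.Series(series_rt.values, index=series_film.values)
selected = series_custom[["Minions (2015)", "Leviathan (2014)"]]
```

With the string index, one bracket expression replaces the two-step lookup for every film.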
Re-indexing
It may be easiest to picture this as resetting (reordering) the index. The main steps are:
- Return a list representation of the current index using `tolist()`.
- Sort the index with `sorted()`.
- Use `reindex()` to set the newly-ordered index.
Doing this by hand is tedious, though, so you can save yourself the trouble by using `sort_index()` or `sort_values()` instead.
To make sorting easier, pandas comes with a `sort_index()` method that sorts a Series by index, and a `sort_values()` method that sorts a Series by its values. Since the values representing the Rotten Tomatoes scores are integers, sorting by values will return the data in numerically ascending order (low to high).
In both cases, pandas preserves the link between each element's index (film name) and value (score). We call this data alignment, which is a key tenet of pandas that's incredibly important when analyzing data. Pandas allows us to assume the linking will be preserved, unless we specifically change a value or an index.
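A minimal sketch comparing the manual route with the two shortcuts, again using made-up scores for `series_custom`. In every case each score stays attached to its film name.

```python
import pandas as pd

# Hypothetical scores keyed by film name.
series_custom = pd.Series(
    [54, 99, 80],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"],
)

# Manual route: list the index, sort it, then reindex.
sorted_labels = sorted(series_custom.index.tolist())
sorted_by_hand = series_custom.reindex(sorted_labels)

# Shortcuts: sort by label, or sort by value (ascending by default).
sorted_by_index = series_custom.sort_index()
sorted_by_value = series_custom.sort_values()
```

Here `sorted_by_hand` and `sorted_by_index` end up identical, and `sorted_by_value` starts with the lowest-scoring film.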
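What is data alignment?
It is about keeping the data consistent. pandas automatically keeps track of which index label belongs to which value, and that link is preserved unless you reassign or deliberately change a value or a label.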
One of pandas' core tenets is data alignment. Series objects align along indices, and DataFrame objects align along both indices and columns. With Series objects, pandas implicitly preserves the link between the index labels and the values across operations and transformations, unless we explicitly break it. With DataFrame objects, the values link to the index labels and the column labels. Pandas also preserves these links, unless we explicitly break them (by reassigning or editing a column or index label, for example).
Comparison
series_custom > 50
Writing just this compares every value in the Series object against 50, and the result is a Boolean Series.
`series_custom > 50` will actually return a Series object with a Boolean value for each film. That's because pandas applies the filter (`> 50`) to each value in the Series object. To retrieve the actual film names, we need to pass this Boolean series into the original Series object.
series_greater_than_50 = series_custom[series_custom > 50]
Pandas returns Boolean Series objects that serve as intermediate representations of the logic. These objects make it easier to separate complex logic into modular pieces. We can specify filtering criteria in different variables, then chain them together with the and operator (&) or the or operator (|). Finally, we can use a Series object's bracket notation to pass in an expression representing a Boolean Series object and get back the filtered data set.
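A short sketch of chaining criteria, with placeholder scores and one invented film ("Film D") added for illustration. Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>` and `<`.

```python
import pandas as pd

# Hypothetical scores; "Film D" is an invented entry.
series_custom = pd.Series(
    [54, 99, 80, 30],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)", "Film D"],
)

# Keep each criterion in its own variable, then combine them.
criteria_one = series_custom > 50
criteria_two = series_custom < 90
both = series_custom[criteria_one & criteria_two]

# Inline form with the "or" operator; note the parentheses.
either = series_custom[(series_custom < 40) | (series_custom > 90)]
```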
Exercises
1
Both of these data sets group the various majors into categories in the Major_category column. Let's start by understanding the number of people in each Major_category for both data sets.
To do so, you'll need to:
- Return the unique values in `Major_category`.
  - Use the `Series.unique()` method to return the unique values in a column, like this: `recent_grads['Major_category'].unique()`
- For each unique value:
  - Return all of the rows where `Major_category` equals that unique value.
  - Calculate the total number of students those rows represent (using the `Total` column).
    - Use the `Series.sum()` method to calculate the sum of the values in a column. `recent_grads['Total'].sum()` returns the sum of the values in the `Total` column.
  - Keep track of the totals by adding the `Major_category` value and the total number of students to a dictionary.
- Use the `Total` column to calculate the number of people who fall under each `Major_category` in each data set.
- Store the result as a separate dictionary for each data set.
  - The key for the dictionary should be the `Major_category`, and the value should be the total count.
  - For the counts from `all_ages`, store the results as a dictionary named `aa_cat_counts`.
  - For the counts from `recent_grads`, store the results as a dictionary named `rg_cat_counts`.
Solution
aa_cat_counts = dict()
rg_cat_counts = dict()

# Unique values in Major_category column.
unique_values = all_ages['Major_category'].unique()
for unique_value in unique_values:
    # Rows in this category, then the sum of their Total column.
    unique_values_age = all_ages.loc[all_ages['Major_category'] == unique_value]
    total_sum_age = unique_values_age['Total'].sum()
    aa_cat_counts[unique_value] = total_sum_age

# Same procedure for recent_grads.
unique_values = recent_grads['Major_category'].unique()
for unique_value in unique_values:
    unique_values_grad = recent_grads.loc[recent_grads["Major_category"] == unique_value]
    total_sum_grad = unique_values_grad['Total'].sum()
    rg_cat_counts[unique_value] = total_sum_grad
Reference
2
It looks like only about 9.85%, equivalent to a proportion of 0.0985, of graduates took on a low wage job after finishing college.
Both the all_ages and recent_grads data sets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two data sets, and perform some initial calculations to see how the statistics for recent college graduates compare with those for the entire population.
Next, let's calculate the number of majors where recent graduates did better than the overall population.
- Use a `for` loop to iterate over majors.
- For each major, use Boolean filtering to find the corresponding row in both DataFrames.
- Compare the values for `Unemployment_rate` to see which DataFrame has a lower value.
- Increment `rg_lower_count` if the value for `Unemployment_rate` is lower for `recent_grads` than it is for `all_ages`.
- Display `rg_lower_count` with the `print()` function.
Solution
# All majors, common to both DataFrames
majors = recent_grads['Major'].unique()
rg_lower_count = 0
for major in majors:
    # One-row DataFrames for this major
    recent_grads_row = recent_grads[recent_grads["Major"] == major]
    all_ages_row = all_ages[all_ages["Major"] == major]
    # Pull out the scalar unemployment rates before comparing
    rg_unemp_rate = recent_grads_row.iloc[0]["Unemployment_rate"]
    aa_unemp_rate = all_ages_row.iloc[0]["Unemployment_rate"]
    if rg_unemp_rate < aa_unemp_rate:
        rg_lower_count += 1
print(rg_lower_count)
What I noticed
For the category totals, it might have been cleaner to use a pivot table: `all_ages.pivot_table(index="Major_category", values="Total", aggfunc=np.sum)`.
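A rough sketch of that pivot-table idea, applied to a tiny hypothetical stand-in for `all_ages` (the majors and totals are invented). I pass `aggfunc="sum"` as a string here; as far as I know some recent pandas versions emit a warning when given `np.sum`. `groupby` is shown as an equivalent alternative.

```python
import pandas as pd

# Hypothetical miniature of the all_ages DataFrame.
all_ages = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "Total": [100, 50, 30, 20],
})

# Pivot table: one row per Major_category, summing the Total column.
pivot = all_ages.pivot_table(index="Major_category", values="Total", aggfunc="sum")
aa_cat_counts = pivot["Total"].to_dict()

# Equivalent result with groupby.
grouped = all_ages.groupby("Major_category")["Total"].sum().to_dict()
```

Either way the loop over unique categories disappears, and the dictionary the exercise asks for comes from a single `to_dict()` call.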
rg_unemp_rate = recent_grads_row.iloc[0]["Unemployment_rate"]
aa_unemp_rate = all_ages_row.iloc[0]["Unemployment_rate"]
Because I had left out the two lines above, I kept getting the error `Can only compare identically-labeled DataFrame objects`.
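My understanding of the error: Boolean filtering returns a one-row DataFrame that still carries its original row label, and `<` between two DataFrames requires their labels to match. Pulling out the scalar with `.iloc[0][...]` avoids that. As an alternative sketch, the whole loop can be replaced by a label-aligned comparison. The data below is invented, and this assumes each major appears exactly once in both DataFrames.

```python
import pandas as pd

# Hypothetical miniatures of the two DataFrames, deliberately in different row order.
recent_grads = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment_rate": [0.05, 0.10, 0.02],
})
all_ages = pd.DataFrame({
    "Major": ["C", "A", "B"],
    "Unemployment_rate": [0.03, 0.06, 0.08],
})

# Index both rate columns by major so rows can be matched by label.
rg_rate = recent_grads.set_index("Major")["Unemployment_rate"]
aa_rate = all_ages.set_index("Major")["Unemployment_rate"]

# Put all_ages in the same label order as recent_grads before comparing.
aa_rate = aa_rate.reindex(rg_rate.index)

# Count the majors where recent grads have the lower unemployment rate.
rg_lower_count = int((rg_rate < aa_rate).sum())
print(rg_lower_count)
```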
3
- Create a new `Series` object named `series_custom` that has a string index (based on the values from `film_names`), and contains all of the Rotten Tomatoes scores from `series_rt`.
  - To create a new `Series` object:
    - Import `Series` from `pandas`.
    - Instantiate a new `Series` object, which takes in a data parameter and an index parameter. See the documentation for help.
    - Both of these parameters need to be lists.
Solution
# Import the Series object from pandas
from pandas import Series

# Convert the underlying NumPy arrays to plain lists
film_names = series_film.values
rt_scores = series_rt.values
film_names_list = film_names.tolist()
rt_scores_list = rt_scores.tolist()

# Use the imported Series class directly (pd is not imported here)
series_custom = Series(index=film_names_list, data=rt_scores_list)
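What I noticed
The exercise asks for both the `index` and `data` arguments to be lists, so don't forget `.tolist()`. (pandas itself also accepts other array-likes such as NumPy arrays.)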
4
- The list `original_index` contains the original index. Sort this index using the Python 3 core method `sorted()`, then pass the result in to the Series method `reindex()`.
- Store the result in a variable named `sorted_by_index`.
Solution
# Convert the index to a list, sort it, then reindex
original_index = series_custom.index.tolist()
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)
5
- Normalize `series_custom` (which is currently on a 0 to 100-point scale) to a 0 to 5-point scale by dividing each value by 20.
- Assign the new normalized Series object to `series_normalized`.
Solution
series_normalized = series_custom / 20
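For reference, a tiny sketch of what that division does, using placeholder scores: the operation is applied to every value at once, and the film-name index carries over unchanged.

```python
import pandas as pd

# Hypothetical 0-100 scores keyed by film name.
series_custom = pd.Series(
    [54, 99, 80],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"],
)

# Element-wise division: every score is divided by 20, labels are kept.
series_normalized = series_custom / 20
```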
6
- In the following code cell, the `criteria_one` and `criteria_two` statements return Boolean `Series` objects.
- Return a filtered `Series` object named `both_criteria` that only contains the values where both criteria are true. Use bracket notation and the `&` operator to obtain this `Series` object.
Solution
criteria_one = series_custom > 50
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]
7
- `rt_critics` and `rt_users` are `Series` objects containing the average ratings from critics and users for each film.
- Both `Series` objects use the same custom string index, which they base on the film names. Use the Python arithmetic operators to return a new `Series` object, `rt_mean`, that contains the mean ratings from both critics and users for each film.
Solution
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2
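What I noticed
It took me a little while to understand the question. I did understand that the requirements were:
- Python's arithmetic operators have to be used, so `df.mean()` is off the table.
- `rt_mean` should hold, for each `film`, the mean of the corresponding `critics` and `users` values.
What I could not tell was whether `rt_mean` _contains the mean ratings from both critics and users for each film_ or _contains the mean of the two ratings from critics and users_. The wording could be read either way ~~(probably just my own poor reading comprehension...)~~, so I was a bit confused.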
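The important point is that we are assuming `rt_critics` and `rt_users` share the same index labels (the same indexes for each). pandas matches the two Series by label before adding them, so any label that exists in only one of them produces `NaN`. If that assumption does not hold, the two need to be aligned first, for example with `Series.reindex()` onto a shared set of labels, or with `Series.reset_index(drop=True)` if the values should be matched by position instead.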
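A small sketch of what happens when that assumption breaks. The ratings are placeholders and "Extra Film" is an invented label. The two Series are in different order and only partly share labels.

```python
import pandas as pd

# Hypothetical critic and user ratings keyed by film name.
rt_critics = pd.Series(
    [54, 99, 80],
    index=["Minions (2015)", "Leviathan (2014)", "Ant-Man (2015)"],
)
rt_users = pd.Series(
    [84, 52, 70],
    index=["Ant-Man (2015)", "Minions (2015)", "Extra Film"],
)

# Addition matches values by label; labels found in only one Series give NaN.
rt_mean = (rt_critics + rt_users) / 2

# One way to avoid the NaNs: restrict both Series to the labels they share.
shared = rt_critics.index.intersection(rt_users.index)
rt_mean_shared = (rt_critics.reindex(shared) + rt_users.reindex(shared)) / 2
```

The differing row order does no harm because the match is by label; only the films missing from one side end up as `NaN`.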
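Also, to avoid confusion, it is recommended that the index be unique.
At first I thought, "Why divide by 2? We are dividing by the total count, so shouldn't it be something like `series.count()`?" But on reflection there are two values for each film, so dividing by 2 is what is needed. This ties back to my misreading of the question described above.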
I first thought that to get the mean you divide `(rt_critics + rt_users)` by `rt_critics.count() + rt_users.count()`, but that doesn't make any sense, since in this problem what you want to do is to calculate the mean of both critics and users for each film. This tells me that what you would do in your example is something like (1 + 100) / 2, (1 + 50) / 2, (1 + 20) / 2 and so on.