
Basics of Data Processing with Python's Pandas -Series Edition-

Last updated at 2018-08-06, posted at 2018-08-05

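Purpose of this article

This article explains Pandas, a Python library for data analysis. I wrote it as a way to put what I have studied into my own words.

My reference is "Introduction to Data Science in Python", a course published on Coursera by the University of Michigan.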

Environment

Windows 10
Python 3.6
Jupyter Notebook

All of the code below is meant to be run in Jupyter Notebook.

Importing Pandas

Pandas is imported as shown below. By convention it is imported under the name pd.

import pandas as pd

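Basic usage of Pandas.Series

Creating a `Series`

The simplest Pandas data format is called Series. It is something like a simplified version of the DataFrame, which I will explain later. A Series lets you store data in a way similar to Excel.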

animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

# ->
# 0    Tiger
# 1     Bear
# 2    Moose
# dtype: object

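You can see that Tiger is stored in row 0, Bear in row 1, and Moose in row 2. The dtype at the end shows the type of the stored data. This looks a lot like Excel, except that the rows start from 0. The run of consecutive numbers starting from 0 is called the index.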
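The parts of a Series described above can also be inspected one at a time. The following is a minimal sketch; the variable name `animal_series` is my own choice for illustration and does not come from the course material.

```python
import pandas as pd

animals = ['Tiger', 'Bear', 'Moose']
animal_series = pd.Series(animals)

# The default index is a RangeIndex that counts up from 0.
print(animal_series.index)     # RangeIndex(start=0, stop=3, step=1)

# The stored values, converted back to a plain Python list.
print(animal_series.tolist())  # ['Tiger', 'Bear', 'Moose']

# dtype describes the element type of the stored data.
print(animal_series.dtype)
```

The last line is left without an expected output because the dtype reported for strings can differ between pandas versions.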

Sometimes you want the index to be something other than numbers starting from 0. There are several ways to do that.

Using a dictionary when converting to a `Series`

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

# ->
# Archery           Bhutan
# Golf            Scotland
# Sumo               Japan
# Taekwondo    South Korea
# dtype: object

Specifying the index explicitly when creating a Series

s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

# ->
# India      Tiger
# America     Bear
# Canada     Moose
# dtype: object

To check the index of a Series after creating it, use (Series name).index.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.index

# ->
# Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

Now suppose you have a dictionary and want to build a Series from only some of its entries. In that case, pass the index labels you want through index=..., as follows.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s

# ->
# Golf      Scotland
# Sumo         Japan
# Hockey         NaN
# dtype: object

As the example above shows, if you specify a label such as Hockey that was not in the original dictionary, the Hockey entry of the resulting Series is filled with NaN.
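Once NaN entries like this appear, it is often useful to find or remove them. The sketch below is my own addition rather than something from the course; it assumes the same `sports` dictionary and uses the pandas methods `isna()` and `dropna()`.

```python
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])

# isna() marks the entries that have no value.
missing = s.isna()
print(missing['Hockey'])    # True

# dropna() returns a new Series without the missing entries.
cleaned = s.dropna()
print(list(cleaned.index))  # ['Golf', 'Sumo']
```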

Extracting elements from a `Series`

Next, let's extract elements from the Series we created.

Extracting by row position

Use iloc for this.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)

s.iloc[3]   # Extract the element in the 4th row (note that counting starts from 0)

# ->
# South Korea

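The same result can be obtained in the following ways. Note that recent pandas versions deprecate using s[3] as a position when the index is not made of integers, so iloc is the safer choice for positional access.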

s[3]

s['Taekwondo']

# Both give the same output
# ->
# South Korea

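When extracting elements from a Python list you often write something like s[3], so this notation probably feels more familiar. However, there are situations where only iloc works for positional access: namely, when the index labels are numbers. For example,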

sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)

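In this case, running s[3] raises an error, because 3 is looked up as a label and no such label exists. To get South Korea by position, the only option is

s.iloc[3]

# ->
# South Korea

as shown here.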
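The difference between label lookup and positional lookup on an integer index can be checked with a short experiment. This is a sketch of my own; the flag name `lookup_failed` is only there for illustration.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan',
               100: 'Scotland',
               101: 'Japan',
               102: 'South Korea'})

# s[3] looks for the label 3, which does not exist, so it raises KeyError.
try:
    s[3]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True

# iloc always works by position, loc always works by label.
print(s.iloc[3])      # South Korea
print(s.loc[102])     # South Korea
```

Writing iloc or loc explicitly makes it clear to the reader which kind of lookup is intended, regardless of what the index contains.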

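Extracting several rows from the beginning or the end

This is also handy for checking a Series you have just created.

s.head(20)    # Extract the first 20 rows. s.head() shows 5 rows by default
s.tail(20)    # Extract the last 20 rows.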
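The behavior of head and tail is easy to confirm on a small Series whose contents are known in advance. In this sketch, `numbers` is a made-up Series holding the integers 0 to 99.

```python
import pandas as pd

numbers = pd.Series(range(100))

print(len(numbers.head()))       # 5 (the default)
print(numbers.head(3).tolist())  # [0, 1, 2]
print(numbers.tail(3).tolist())  # [97, 98, 99]

# Asking for more rows than exist simply returns the whole Series.
print(len(numbers.head(1000)))   # 100
```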

Extracting by index label

Use loc for this. Be careful, because the name is easy to confuse with iloc.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s.loc['Golf']

# ->
# Scotland

You can also use loc to add an element.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)

s.loc['Kendo'] = 'Japan'
s

# ->
# Archery           Bhutan
# Golf            Scotland
# Sumo               Japan
# Taekwondo    South Korea
# Kendo              Japan
# dtype: object
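One point worth keeping in mind is that assigning through loc adds an element only when the label is new; for a label that already exists, the value is overwritten instead. A minimal sketch, using a shortened version of the Series above and a replacement value that I made up for illustration:

```python
import pandas as pd

s = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland'})

# New label: the Series grows by one element.
s.loc['Kendo'] = 'Japan'
print(len(s))         # 3

# Existing label: the value is replaced and the length stays the same.
s.loc['Golf'] = 'United Kingdom'
print(len(s))         # 3
print(s.loc['Golf'])  # United Kingdom
```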

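Doing arithmetic with a `Series`

I will use the following Series for the explanation.

import numpy as np

s = pd.Series(np.random.randint(0,1000,10000))  # 10000 random integers from 0 to 999 (1000 itself is excluded)
s.head()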

# ->
# 0    442
# 1    477
# 2    707
# 3    556
# 4    974
# dtype: int32

Summing the elements

You could take the elements out one by one and add them up in a for loop, but using NumPy is more convenient.

import numpy as np
np.sum(s)

# ->
# 5017477
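To confirm that the loop and the vectorized versions agree, here is a small sketch using only the first five values shown in the s.head() output above. It also uses the Series' own sum() method, which is not in the original example but gives the same result.

```python
import numpy as np
import pandas as pd

s = pd.Series([442, 477, 707, 556, 974])

# Explicit loop over the elements.
total = 0
for value in s:
    total += value

# Vectorized alternatives: NumPy's sum and the Series' own sum() method.
print(total)      # 3156
print(np.sum(s))  # 3156
print(s.sum())    # 3156
```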

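Operating on each element

Consider adding 2 to every element of s. As with the sum above, there are two ways to do this.

# Method 1 (items() replaces iteritems(), which was removed in pandas 2.0)
for label, value in s.items():
    s.loc[label]= value+2

# Method 2
s += 2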

Even at a glance, the second method is simpler. But it has a big advantage beyond simplicity: speed. The first method, which uses loc, takes a very long time. Specifically,

Method	Time
1	5.13 s
2	383 µs

That is a difference of roughly four orders of magnitude.
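If you want to reproduce this kind of comparison yourself, something like the following sketch should work. The function names `add_two_loop` and `add_two_vectorized` are hypothetical helpers of my own, and I use a smaller Series of 1000 elements so the loop finishes quickly. The actual times depend on the machine and the pandas version, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd


def add_two_loop(series):
    """Add 2 to every element one label at a time (slow)."""
    result = series.copy()
    for label, value in series.items():
        result.loc[label] = value + 2
    return result


def add_two_vectorized(series):
    """Add 2 to every element in a single vectorized operation (fast)."""
    return series + 2


s = pd.Series(np.random.randint(0, 1000, 1000))

# Time a single run of each approach.
loop_time = timeit.timeit(lambda: add_two_loop(s), number=1)
vector_time = timeit.timeit(lambda: add_two_vectorized(s), number=1)

print(f"loop:       {loop_time:.6f} s")
print(f"vectorized: {vector_time:.6f} s")
```

In Jupyter Notebook, the %timeit magic is another convenient way to take such measurements.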

Concatenating two `Series`

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
# Series.append was removed in pandas 2.0, so pd.concat is used here instead
all_countries = pd.concat([original_sports, cricket_loving_countries])

all_countries

# ->
# Archery           Bhutan
# Golf            Scotland
# Sumo               Japan
# Taekwondo    South Korea
# Cricket        Australia
# Cricket         Barbados
# Cricket         Pakistan
# Cricket          England
# dtype: object

all_countries.loc['Cricket']    # Extract every entry whose index label is Cricket

# ->
# Cricket    Australia
# Cricket     Barbados
# Cricket     Pakistan
# Cricket      England
# dtype: object
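As the output above suggests, index labels do not have to be unique, and the result of loc changes shape depending on whether a label is repeated. The following sketch illustrates this with shortened versions of the two Series; the variable names are my own.

```python
import pandas as pd

sports = pd.Series({'Sumo': 'Japan', 'Golf': 'Scotland'})
cricket = pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])
combined = pd.concat([sports, cricket])

# The combined index contains 'Cricket' twice, so it is no longer unique.
print(combined.index.is_unique)                  # False

# A unique label yields a single value, a repeated label yields a Series.
print(type(combined.loc['Sumo']).__name__)       # str
print(type(combined.loc['Cricket']).__name__)    # Series

# The inputs are left unchanged by the concatenation.
print(len(sports), len(cricket), len(combined))  # 2 2 4
```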

Conclusion

This article covered only Series, the most basic of Pandas' many features. Next, I plan to write an article about the more advanced DataFrame.
