
[pandas] Extracting data from DataFrame and Series

Last updated at 2023-03-23 / Posted at 2022-10-19

A summary of the ways to extract elements from a DataFrame.

Official documentation

User Guide -- pandas 1.4.4 documentation

Environment

Item	Version
MacBook Air	macOS Monterey 12.5.1
python	3.8.9
jupyter notebook	6.4.10
pandas	1.4.3

First, import the package

import pandas as pd

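By convention, pandas is imported under the alias pd.

About the DataFrame used here

The examples use the DataFrame created in my earlier article, "[python]pandas read_csvの備忘録" (notes on pandas read_csv).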

print(df_SN.head(10))
#      year  count  std  obs spot  certainty
# 0  1700.5    8.3 -1.0        -1          1
# 1  1701.5   18.3 -1.0        -1          1
# 2  1702.5   26.7 -1.0        -1          1
# 3  1703.5   38.3 -1.0        -1          1
# 4  1704.5   60.0 -1.0        -1          1
# 5  1705.5   96.7 -1.0        -1          1
# 6  1706.5   48.3 -1.0        -1          1
# 7  1707.5   33.3 -1.0        -1          1
# 8  1708.5   16.7 -1.0        -1          1
# 9  1709.5   13.3 -1.0        -1          1

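In order:
year: year of observation, count: sunspot number, std: standard deviation, obs spot: number of observing stations, certainty: definitive or provisional
Just note that the headers have been changed to English from those in the earlier article.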
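
If you don't have the CSV from the earlier article at hand, a small stand-in is enough to follow along. The sketch below types in only the 10 rows printed above by hand; it is an assumption for illustration, not how the original df_SN (322 rows, loaded with read_csv) was built.

```python
import pandas as pd

# Stand-in for df_SN: just the first 10 rows shown above, entered by hand.
# The real DataFrame has 322 rows and was loaded from a CSV file.
df_SN = pd.DataFrame({
    "year": [1700.5 + i for i in range(10)],
    "count": [8.3, 18.3, 26.7, 38.3, 60.0, 96.7, 48.3, 33.3, 16.7, 13.3],
    "std": [-1.0] * 10,
    "obs spot": [-1] * 10,
    "certainty": [1] * 10,
})

print(df_SN.shape)          # (10, 5)
print(list(df_SN.columns))  # ['year', 'count', 'std', 'obs spot', 'certainty']
```

With this stand-in, anything that touches only rows 0 to 9 behaves like the examples in this article; the row counts in the printed outputs will of course differ.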

① Extracting with []

You can select by column name (passing an index number or a column number raises an error).

print(df_SN['count'])
# 0       8.3
# 1      18.3
# 2      26.7
# 3      38.3
# 4      60.0
#        ... 
# 317    21.7
# 318     7.0
# 319     3.6
# 320     8.8
# 321    29.6
# Name: count, Length: 322, dtype: float64
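
To see both halves of that statement in one place, here is a minimal sketch on a tiny hand-made DataFrame (the name df and its three rows are placeholders, not the article's data). A column label returns a Series, while an integer is treated as a column label too and fails because no such label exists.

```python
import pandas as pd

# Tiny placeholder DataFrame with string column labels
df = pd.DataFrame({"year": [1700.5, 1701.5, 1702.5],
                   "count": [8.3, 18.3, 26.7]})

col = df["count"]            # a column label returns a Series
print(type(col).__name__)    # Series

try:
    df[1]                    # 1 is looked up as a column label, which does not exist
except KeyError as err:
    raised = True
    print("KeyError:", err)
else:
    raised = False
```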

On the other hand, you can select multiple rows, or multiple columns, on their own.
In that case, use a slice (:) for the index and a list for the columns.

print(df_SN[['year', 'count']])
#       year  count
# 0    1700.5    8.3
# 1    1701.5   18.3
# 2    1702.5   26.7
# 3    1703.5   38.3
# 4    1704.5   60.0
# ..      ...    ...
# 317  2017.5   21.7
# 318  2018.5    7.0
# 319  2019.5    3.6
# 320  2020.5    8.8
# 321  2021.5   29.6
# 
# [322 rows x 2 columns]
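
The row side of this is not shown above, so here is a small sketch, again on a hand-made 10-row stand-in (df is a placeholder name). With [], an integer slice picks rows by position and excludes the end, and the two selections can be chained if you want rows and columns together.

```python
import pandas as pd

# 10-row stand-in using the values printed earlier in this article
df = pd.DataFrame({
    "year": [1700.5 + i for i in range(10)],
    "count": [8.3, 18.3, 26.7, 38.3, 60.0, 96.7, 48.3, 33.3, 16.7, 13.3],
    "std": [-1.0] * 10,
})

rows = df[5:9]                       # slice -> rows 5, 6, 7, 8 (end excluded)
cols = df[["year", "count"]]         # list  -> the listed columns
both = df[5:9][["year", "count"]]    # chain the two for rows and columns

print(rows.index.tolist())           # [5, 6, 7, 8]
print(both.shape)                    # (4, 2)
```

Chaining like this is fine for reading values; for selecting rows and columns in one step, .loc and .iloc below are the more direct tools.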

② .loc

.loc is used in much the same way as [] above, though how you write the index and the column differs a little.

print(df_SN.loc[5, 'count'])
# 96.7

Difference 1: passing only a column name raises an error.

print(df_SN.loc['year'])
# (snip)
# KeyError: 'year'

To select by column alone, do it like this.

print(df_SN.loc[:, 'count'])
# 0       8.3
# 1      18.3
# 2      26.7
# 3      38.3
# 4      60.0
#        ... 
# 317    21.7
# 318     7.0
# 319     3.6
# 320     8.8
# 321    29.6
# Name: count, Length: 322, dtype: float64

Difference 2: for the index, a number on its own is fine.

print(df_SN.loc[5])
# year         1705.5
# count          96.7
# std            -1.0
# obs spot       -1.0
# certainty       1.0
# Name: 5, dtype: float64

Everything else should be the same (I think).
If I've got that wrong, oh well.
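
One more difference is worth checking for yourself: slices. With .loc a slice is label-based and includes the end label, whereas a slice inside plain [] is positional and excludes the end. A minimal sketch on a hand-made 10-row stand-in (df is a placeholder name):

```python
import pandas as pd

# 10-row stand-in with the default integer index 0..9
df = pd.DataFrame({
    "year": [1700.5 + i for i in range(10)],
    "count": [8.3, 18.3, 26.7, 38.3, 60.0, 96.7, 48.3, 33.3, 16.7, 13.3],
})

by_label = df.loc[5:9, "count"]   # labels 5 through 9, end INCLUDED
by_position = df[5:9]["count"]    # positions 5 through 8, end EXCLUDED

print(len(by_label))              # 5
print(len(by_position))           # 4

# .loc also accepts a list of column labels alongside the row slice
subset = df.loc[5:9, ["year", "count"]]
print(subset.shape)               # (5, 2)
```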

③ .iloc

The quickest way to remember it: .iloc is .loc restricted to integer positions.

print(df_SN.iloc[2:5, 1:3])
#    count  std
# 2   26.7 -1.0
# 3   38.3 -1.0
# 4   60.0 -1.0

So, naturally, passing a string as the column raises an error.

print(df_SN.iloc[5, 'count'])
# (snip)
# ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
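
If you know the column by name but still want to use .iloc, one option is to convert the name to a position first with columns.get_loc. This is just a sketch on a hand-made stand-in (df is a placeholder name); in most cases .loc would be the simpler choice.

```python
import pandas as pd

# 10-row stand-in using the values printed earlier in this article
df = pd.DataFrame({
    "year": [1700.5 + i for i in range(10)],
    "count": [8.3, 18.3, 26.7, 38.3, 60.0, 96.7, 48.3, 33.3, 16.7, 13.3],
    "std": [-1.0] * 10,
})

pos = df.columns.get_loc("count")   # column name -> integer position
value = df.iloc[5, pos]             # now both arguments are integers

print(pos)     # 1
print(value)   # 96.7
```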

④ .at & .iat

These can only be used to pull out a single value.
They are reportedly faster than loc and iloc, though. The source should be among the reference URLs listed at the end.

print(df_SN.at[6, 'count'])
# 48.3

Just as loc relates to iloc, iat requires integers for both the row and the column.

print(df_SN.iat[6, 1])
# 48.3
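
To make the at/iat split concrete, the sketch below reads the same cell both ways on a hand-made stand-in (df is a placeholder name) and shows that .at rejects an integer column when the column labels are strings.

```python
import pandas as pd

# 10-row stand-in using the values printed earlier in this article
df = pd.DataFrame({
    "year": [1700.5 + i for i in range(10)],
    "count": [8.3, 18.3, 26.7, 38.3, 60.0, 96.7, 48.3, 33.3, 16.7, 13.3],
})

v_at = df.at[6, "count"]   # label-based: row label 6, column label 'count'
v_iat = df.iat[6, 1]       # position-based: row 6, column 1

print(v_at, v_iat)         # 48.3 48.3

try:
    df.at[6, 1]            # 1 is not a column label in this DataFrame
except KeyError:
    at_rejects_int_column = True
else:
    at_rejects_int_column = False

print(at_rejects_int_column)   # True
```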

Reference URLs
