
Reading and manipulating various files with Python Pandas

Last updated at 2018-03-17, posted at 2018-03-03

Pandas file operations

For installing Pandas and the basics, see the following article.

Pandas の導入とデータ型 - Qiita (Introducing Pandas and its data types)

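Pandas provides functions for reading external files.

We will read files distributed by the official Pandas project and manipulate the data.

The code below assumes that the data above has been saved in a folder named "data".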

Reading CSV

Let's read a basic CSV file.

After importing pandas, the file can be read with "read_csv".

import pandas as pd

df = pd.read_csv('data/baseball.csv')

Let's check the contents with info.


df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100 entries, 0 to 99
    Data columns (total 23 columns):
    id        100 non-null int64
    player    100 non-null object
    year      100 non-null int64
    stint     100 non-null int64
    team      100 non-null object
    lg        100 non-null object
    g         100 non-null int64
    ab        100 non-null int64
    r         100 non-null int64
    h         100 non-null int64
    X2b       100 non-null int64
    X3b       100 non-null int64
    hr        100 non-null int64
    rbi       100 non-null float64
    sb        100 non-null float64
    cs        100 non-null float64
    bb        100 non-null int64
    so        100 non-null float64
    ibb       100 non-null float64
    hbp       100 non-null float64
    sh        100 non-null float64
    sf        100 non-null float64
    gidp      100 non-null float64
    dtypes: float64(9), int64(11), object(3)
    memory usage: 18.0+ KB
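The same facts that info prints (row count, column count, dtypes) can also be obtained programmatically. The following is a minimal sketch, assuming a small in-memory CSV built from a few values shown in this article rather than the real data/baseball.csv; the names csv_text and sample are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row sample using a few columns of baseball.csv
csv_text = """id,player,year,team,g,rbi
88641,womacto01,2006,CHN,19,2.0
88643,schilcu01,2006,BOS,31,0.0
88653,gonzalu01,2006,ARI,153,73.0
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape

# Count how many columns were inferred as each dtype
dtype_counts = sample.dtypes.astype(str).value_counts().to_dict()

print(n_rows, n_cols)
print(dtype_counts)
```

Columns containing a decimal point are inferred as float64, while columns of whole numbers are inferred as int64.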

info shows that the DataFrame has 100 rows. Displaying all of them would be hard to read, so let's look at only the first 10 rows.


df[:10]

	id	player	year	stint	team	lg	g	ab	r	h	...	rbi	sb	cs	bb	so	ibb	hbp	sh	sf	gidp
0	88641	womacto01	2006	2	CHN	NL	19	50	6	14	...	2.0	1.0	1.0	4	4.0	0.0	0.0	3.0	0.0	0.0
1	88643	schilcu01	2006	1	BOS	AL	31	2	0	1	...	0.0	0.0	0.0	0	1.0	0.0	0.0	0.0	0.0	0.0
2	88645	myersmi01	2006	1	NYA	AL	62	0	0	0	...	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0.0
3	88649	helliri01	2006	1	MIL	NL	20	3	0	0	...	0.0	0.0	0.0	0	2.0	0.0	0.0	0.0	0.0	0.0
4	88650	johnsra05	2006	1	NYA	AL	33	6	0	1	...	0.0	0.0	0.0	0	4.0	0.0	0.0	0.0	0.0	0.0
5	88652	finlest01	2006	1	SFN	NL	139	426	66	105	...	40.0	7.0	0.0	46	55.0	2.0	2.0	3.0	4.0	6.0
6	88653	gonzalu01	2006	1	ARI	NL	153	586	93	159	...	73.0	0.0	1.0	69	58.0	10.0	7.0	0.0	6.0	14.0
7	88662	seleaa01	2006	1	LAN	NL	28	26	2	5	...	0.0	0.0	0.0	1	7.0	0.0	0.0	6.0	0.0	1.0
8	89177	francju01	2007	2	ATL	NL	15	40	1	10	...	8.0	0.0	0.0	4	10.0	1.0	0.0	0.0	1.0	1.0
9	89178	francju01	2007	1	NYN	NL	40	50	7	10	...	8.0	2.0	1.0	10	13.0	0.0	0.0	0.0	1.0	1.0

10 rows × 23 columns
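Slicing with [:10] is one way to take the leading rows; the head method is a common alternative. The sketch below assumes a hypothetical 100-row DataFrame named frame instead of the baseball data.

```python
import pandas as pd

# Hypothetical 100-row frame standing in for the baseball data
frame = pd.DataFrame({"id": range(100), "value": [i * 2 for i in range(100)]})

# Positional slice: the first 10 rows
first_by_slice = frame[:10]

# head(n) returns the first n rows (n defaults to 5)
first_by_head = frame.head(10)

# tail(n) returns the last n rows
last_rows = frame.tail(3)

print(first_by_slice.equals(first_by_head))
print(last_rows)
```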

read_csv is documented in the official pandas API reference.

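Reading Excel

Excel files can be read as well, using "read_excel".

Reading Excel files requires an additional library such as xlrd. (xlrd handles legacy .xls files like the one used here; recent pandas versions use openpyxl for .xlsx files.)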

df = pd.read_excel('data/test.xls')

df.info()

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 7 entries, 2000-01-03 to 2000-01-11
    Data columns (total 4 columns):
    A    7 non-null float64
    B    7 non-null float64
    C    7 non-null float64
    D    7 non-null float64
    dtypes: float64(4)
    memory usage: 280.0 bytes
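To try read_excel without the sample file, a DataFrame can be written to a temporary workbook and read back. This is only a sketch: the data, the file name sample.xlsx, and the helper excel_round_trip are hypothetical, and it assumes an Excel engine such as openpyxl is installed (the helper returns None when it is not).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data with a date index, loosely shaped like data/test.xls
dates = pd.date_range("2000-01-03", periods=3, freq="D")
src = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [1.0, 2.0, 3.0]}, index=dates)


def excel_round_trip(frame):
    """Write frame to a temporary .xlsx file and read it back.

    Returns None when no Excel engine (e.g. openpyxl) is available.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.xlsx"
        try:
            frame.to_excel(path)
            # index_col=0 uses the first column of the sheet as the index
            return pd.read_excel(path, index_col=0)
        except ImportError:
            return None


result = excel_round_trip(src)
if result is None:
    print("No Excel engine installed; skipping")
else:
    print(result)
```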

read_excel is documented in the official pandas API reference.

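Reading HTML

HTML can be read easily with "read_html".

Local HTML files can of course be read, but read_html can also fetch HTML directly from the web.

Wikipedia has pages that contain HTML tables, such as 猫の品種の一覧 - Wikipedia (List of cat breeds), and these tables can be read directly.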

url_str = 'https://ja.wikipedia.org/wiki/%E7%8C%AB%E3%81%AE%E5%93%81%E7%A8%AE%E3%81%AE%E4%B8%80%E8%A6%A7'
# header=0 uses row 0 of each table as the header
# read_html returns every table found at the URL, as a list of DataFrames
df_lst = pd.read_html(url_str, header=0)

The number of tables retrieved can be checked with len.


print(len(df_lst))

In this case, the table at index 0 is the list of cat breeds.

df_lst[0].head()

	種類	原産国	発生	身体のタイプ	毛足の長さ	毛色および模様	画像
0	アビシニアン	エチオピア	自然発生種	フォーリン	短毛	複色	NaN
1	アメリカンカール	アメリカ合衆国	突然変異	セミフォーリン	短毛もしくは長毛	単色	NaN
2	アメリカンショートヘア	アメリカ合衆国	自然発生種	セミコビー	短毛	All	NaN
3	アメリカンボブテイル	アメリカ合衆国	自然発生種	ロング＆サブスタンシャル	短毛もしくは長毛	All	NaN
4	アメリカンワイヤーヘア	アメリカ合衆国	突然変異	セミコビー	Rex	All but colorpoint	NaN
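The same call also works on HTML held in memory, which makes it possible to try read_html without network access. The following sketch assumes a hypothetical two-table HTML snippet and a helper named read_tables; read_html needs a parser such as lxml, or BeautifulSoup with html5lib, so the helper returns None when none is installed.

```python
import io

import pandas as pd

# Hypothetical HTML containing two tables
html = """
<table>
  <tr><th>breed</th><th>origin</th></tr>
  <tr><td>Abyssinian</td><td>Ethiopia</td></tr>
  <tr><td>American Curl</td><td>United States</td></tr>
</table>
<table>
  <tr><th>n</th></tr>
  <tr><td>1</td></tr>
</table>
"""


def read_tables(markup):
    """Parse every <table> in markup; None if no HTML parser is installed."""
    try:
        # header=0 uses the first row of each table as the column names
        return pd.read_html(io.StringIO(markup), header=0)
    except ImportError:
        return None


tables = read_tables(html)
if tables is None:
    print("No HTML parser installed; skipping")
else:
    print(len(tables))
    print(tables[0])
```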

read_html is documented in the official pandas API reference.

Other formats

Pandas can also read other sources, such as SQLite files. See the official documentation below.

IO Tools (Text, CSV, HDF5, ...) — pandas 0.22.0 documentation
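As a rough illustration of the SQLite case, the sketch below uses read_sql_query with Python's standard sqlite3 module. The in-memory database, the table name players, and its rows are hypothetical stand-ins for a real SQLite file.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database; pass a file path to open a real SQLite file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER, player TEXT, g INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [(1, "alpha", 19), (2, "beta", 0), (3, "gamma", 153)],
)
conn.commit()

# Run a query and receive the result set as a DataFrame
df_sql = pd.read_sql_query("SELECT * FROM players WHERE g > 0", conn)
conn.close()

print(df_sql)
```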

About the author

I am xza, a specialist at Tech Fun株式会社.

I publish materials used in study sessions for beginners held within the company.
