Machine Learning Advent Calendar 2024

[Python] Processing Time Series Data with Pandas and Plotting It with Matplotlib

Posted at 2024-12-24

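This article explains how to process time series data with Pandas and visualize it with Matplotlib. Using a multi-column time series as the running example, it covers basic data manipulation and plotting techniques.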

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate time series data (0.1-second intervals)
# freq='100ms' is equivalent to 0.1 s and avoids the uppercase 'S' alias, which is deprecated in recent pandas
time_index = pd.date_range(start='2024-01-01', periods=10000, freq='100ms')  # index of 10,000 timestamps, 0.1 s apart
temperature = np.random.normal(loc=22, scale=2, size=len(time_index))  # temperature (mean 22 °C, standard deviation 2)
humidity = np.random.normal(loc=60, scale=10, size=len(time_index))    # humidity (mean 60 %, standard deviation 10)
pressure = np.random.normal(loc=1013, scale=5, size=len(time_index))    # pressure (mean 1013 hPa, standard deviation 5)

# Create the DataFrame
df = pd.DataFrame({
    'Temperature': temperature,
    'Humidity': humidity,
    'Pressure': pressure
}, index=time_index)

# Show the first 5 rows
print(df.head())

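This code uses Pandas' date_range to build a time series index at 0.1-second intervals and NumPy to generate random data, which is then stored in a Pandas DataFrame.

Example output (first 5 rows; the values are random, so they differ on every run):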

                         Temperature  Humidity  Pressure
2024-01-01 00:00:00.000      24.2342    59.341    1012.5
2024-01-01 00:00:00.100      21.7389    65.243    1015.2
2024-01-01 00:00:00.200      22.0458    61.567    1011.9
2024-01-01 00:00:00.300      19.6372    63.409    1013.1
2024-01-01 00:00:00.400      23.4367    58.923    1014.4
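Because the generation code draws unseeded random numbers, the output changes on every run. The following is a minimal sketch of a reproducible variant, assuming NumPy's seeded Generator API is acceptable here. The seed value 42 and the name df_seeded are illustrative choices, not part of the original code. It also checks the step and total span of the index.

```python
import numpy as np
import pandas as pd

# Seeded generator so every run yields the same values (the seed 42 is arbitrary)
rng = np.random.default_rng(42)

# 10,000 timestamps spaced 100 ms (0.1 s) apart
time_index = pd.date_range(start="2024-01-01", periods=10000, freq="100ms")

# df_seeded is a hypothetical name used only in this sketch
df_seeded = pd.DataFrame(
    {
        "Temperature": rng.normal(loc=22, scale=2, size=len(time_index)),
        "Humidity": rng.normal(loc=60, scale=10, size=len(time_index)),
        "Pressure": rng.normal(loc=1013, scale=5, size=len(time_index)),
    },
    index=time_index,
)

# Sanity checks on the index: distance between samples and total covered span
step = df_seeded.index[1] - df_seeded.index[0]   # 100 ms
span = df_seeded.index[-1] - df_seeded.index[0]  # 9,999 steps of 100 ms = 999.9 s
print(step)
print(span)
```

The data covers just under 1,000 seconds (about 16 minutes 40 seconds), which is useful to keep in mind when choosing resampling intervals later.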

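2. Loading and Processing Time Series Data with Pandas

Time series data often needs preprocessing before analysis, such as filling missing values, resampling to a different frequency, and aggregating over fixed periods. Let's look at a few of these basic operations.

Filling Missing Values

Let's insert missing values at random and then see how to fill them.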

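# Replace each value with NaN with 5% probability
df_missing = df.mask(np.random.rand(len(df), len(df.columns)) < 0.05)

# Fill missing values by linear interpolation
df_filled = df_missing.interpolate(method='linear')

# Count missing values per column before filling
print(df_missing.isna().sum())

# Count missing values per column after filling
print(df_filled.isna().sum())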

This code uses mask to replace roughly 5% of the values with NaN at random, then fills them by linear interpolation with the interpolate method.
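One detail worth knowing is how interpolate treats gaps at the edges. With default settings it fills forward only, so a NaN at the very start of a column stays NaN. The sketch below uses a small hand-made Series (not the article's data) to show this, and uses bfill() as one possible way to fill a leading gap.

```python
import numpy as np
import pandas as pd

# Tiny illustrative series with a leading gap, an interior gap, and a trailing gap
idx = pd.date_range("2024-01-01", periods=6, freq="100ms")
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan], index=idx)

# Default linear interpolation works in the forward direction only
filled = s.interpolate(method="linear")
print(filled)
# Interior gap becomes 2.0 and 3.0; the trailing NaN takes the last valid value (4.0);
# the leading NaN is left untouched because nothing precedes it.

# One option for the leading gap: back-fill with the first valid value
complete = filled.bfill()
print(complete.isna().sum())
```

Whether back-filling is appropriate depends on the data. Dropping the leading rows with dropna() is an equally valid choice.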

Resampling

Time series data can be resampled to a different frequency. For example, to downsample the 0.1-second data to 1-second intervals:

# Resample to 1-second intervals (taking the mean)
df_resampled = df.resample('1s').mean()

# Show the resampled result
print(df_resampled.head())

This code uses resample('1s') to group the data into 1-second bins and mean() to compute the average within each bin.
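To make the binning concrete, here is a small sketch on a hand-made Series of 20 consecutive numbers (an assumption for illustration, not the article's data). Each 1-second bin collects the ten 0.1-second samples whose timestamps fall in that second and is labeled with the start of the bin.

```python
import numpy as np
import pandas as pd

# 20 samples at 100 ms intervals holding the values 0.0 .. 19.0
idx = pd.date_range("2024-01-01", periods=20, freq="100ms")
s = pd.Series(np.arange(20, dtype=float), index=idx)

# Mean per 1-second bin: values 0..9 fall in the first bin, 10..19 in the second
per_second = s.resample("1s").mean()
print(per_second)

# Number of samples that went into each bin
counts = s.resample("1s").count()
print(counts)
```

Checking count() alongside mean() is a quick way to confirm that every bin contains the expected number of samples.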

Aggregating by Time Period

For example, to compute the average for each minute:

# Compute the mean for each 1-minute bin
df_1min_avg = df.resample('1min').mean()

# Show the aggregated result
print(df_1min_avg.head())

This code uses resample('1min') to group the data into 1-minute bins and compute the mean of each bin.
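The mean is only one possible summary. As a rough sketch, several statistics can be computed per minute in one pass with agg. The seeded temperature Series below is regenerated only so the example is self-contained, and its name temp is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Self-contained stand-in for the temperature column (seed 0 is arbitrary)
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=10000, freq="100ms")
temp = pd.Series(rng.normal(loc=22, scale=2, size=len(idx)), index=idx, name="Temperature")

# Several statistics per 1-minute bin in a single call
stats = temp.resample("1min").agg(["mean", "min", "max", "count"])
print(stats.head())

# 10,000 samples at 100 ms cover 1,000 s, so there are 17 one-minute bins:
# 16 full bins of 600 samples each and a final partial bin of 400 samples.
print(len(stats))
```

The partial last bin is averaged over fewer samples than the others, so the count column helps decide whether to keep or drop it.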

3. Plotting with Matplotlib

Next, let's visualize the Pandas data with Matplotlib. Here we draw the three columns (temperature, humidity, and pressure) on the same graph.

import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(10, 6))

# Plot each column
plt.plot(df.index, df['Temperature'], label='Temperature (°C)', color='tab:red', alpha=0.7)
plt.plot(df.index, df['Humidity'], label='Humidity (%)', color='tab:blue', alpha=0.7)
plt.plot(df.index, df['Pressure'], label='Pressure (hPa)', color='tab:green', alpha=0.7)

# Decorate the graph
plt.title('Time Series Data')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend(loc='upper right')
plt.grid(True)

# Show the graph
plt.tight_layout()
plt.show()

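This code draws each column (temperature, humidity, pressure) with plt.plot(), shows the legend with plt.legend(), and adds a grid with plt.grid(). The result is a single graph containing all three series.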
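Because the three columns live on very different scales (around 22, 60, and 1013), a shared y-axis compresses the fluctuations of each series. As one possible alternative, the sketch below gives each column its own subplot with a shared time axis. The data is regenerated with an arbitrary seed so the block is self-contained. The names df_demo and series_info and the output file name are illustrative assumptions, and the Agg backend is selected only so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Self-contained stand-in for the article's DataFrame (seed 0 is arbitrary)
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=10000, freq="100ms")
df_demo = pd.DataFrame(
    {
        "Temperature": rng.normal(loc=22, scale=2, size=len(idx)),
        "Humidity": rng.normal(loc=60, scale=10, size=len(idx)),
        "Pressure": rng.normal(loc=1013, scale=5, size=len(idx)),
    },
    index=idx,
)

# (column name, y-axis label, line color) for each subplot
series_info = [
    ("Temperature", "Temperature (°C)", "tab:red"),
    ("Humidity", "Humidity (%)", "tab:blue"),
    ("Pressure", "Pressure (hPa)", "tab:green"),
]

# One row per column; sharex=True keeps the time axes aligned
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(10, 8), sharex=True)
for ax, (column, ylabel, color) in zip(axes, series_info):
    ax.plot(df_demo.index, df_demo[column], color=color, alpha=0.7)
    ax.set_ylabel(ylabel)
    ax.grid(True)

axes[0].set_title("Time Series Data")
axes[-1].set_xlabel("Time")
fig.tight_layout()
fig.savefig("time_series_subplots.png")  # hypothetical output file name
```

Plotting a resampled frame (for example, 1-second means) instead of the raw 0.1-second data would also make the lines less dense.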

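4. Summary

This article showed how to handle time series data with Pandas and visualize it with Matplotlib. These processing and visualization techniques are also very useful in real-world data analysis. Give them a try!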
