More than 3 years have passed since last update.

SQLAlchemy から Pandas にデータを読み込む

Last updated at 2020-07-27Posted at 2018-10-18

Pandas で SQL からデータを読み込むにはどうすれば良いだろうか? pandas.read_sql_query を読むと、どうやら SQL 文をそのまま書く方法と、SQLAlchemy という OR マッパを使う方法と二つがあったので、両方試してみた。例えばこんな感じのテーブルがある時:

CREATE TABLE EventData (
    deviceID varchar(64),
    timestamp int(10) unsigned,
    latitude double,
    longitude double,
    primary key (deviceID, timestamp)
);

deviceID あたりのレコード数を知りたいとする。まず色々準備。

import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, Column, String, Float, Integer
from sqlalchemy.sql import select
from sqlalchemy import func

input_db_url = 'mysql+mysqlconnector://xxx'
engine = create_engine(input_db_url)

SQL 文をそのまま使う

SQL 文をそのまま使う例は簡単。

pd.read_sql_query(sql="SELECT deviceID, count(*) as rows from EventData GROUP BY deviceID", con=engine)

SQLAlchemy を使う

read_sql_query の引数 sql には SQLAlchemy selectable というのが使える。これは SQL を直接文字列として記述する代わりに Python オブジェクトを組み合わせてクエリを表現するという物だ。クォート漏れを気にしなくて良いので、動的にクエリを生成する時は便利。多少冗長だが以下のように使う。

metadata = MetaData()
event_data = Table('EventData', metadata, autoload=True, autoload_with=engine)
device_id = event_data.c.deviceID
pd.read_sql_query(sql=select([device_id, func.count('*').label('rows')]).group_by(device_id).limit(10), con=engine)

SQLAlchemy では Table や Column オブジェクトを組み合わせてクエリを構築する。普通は明示的にスキーマを定義するが、ここではサボって reflection 機能を使いテーブル定義をデータベースから読んでいる。

MetaData は engine と関連付けるとデータベース情報を保持する。ここでは、Table のコンストラクタで autoload を指定しているのでテーブル定義が自動的に読み込まれる。

> metadata.tables

immutabledict({'EventData': Table('EventData', MetaData(bind=None), Column('deviceID', VARCHAR(length=64), table=<EventData>, primary_key=True, nullable=False), Column('timestamp', INTEGER(display_width=10, unsigned=True), table=<EventData>, primary_key=True, nullable=False), Column('latitude', DOUBLE(asdecimal=True), table=<EventData>), Column('longitude', DOUBLE(asdecimal=True), table=<EventData>), schema=None)})

> event_data

Table('EventData', MetaData(bind=None), Column('deviceID', VARCHAR(length=64), table=<EventData>, primary_key=True, nullable=False), Column('timestamp', INTEGER(display_width=10, unsigned=True), table=<EventData>, primary_key=True, nullable=False), Column('latitude', DOUBLE(asdecimal=True), table=<EventData>), Column('longitude', DOUBLE(asdecimal=True), table=<EventData>), schema=None)

テーブル内のカラムには columns (省略形 c) でアクセス出来る。カラム名がそのままプロパティになっている。

> event_data.c.deviceID

Column('deviceID', VARCHAR(length=64), table=<EventData>, primary_key=True, nullable=False)

あとは SQL Expression Language Tutorial などを参考にしてカラムを組み合わせてクエリを作る。str() で何となく出来上がるクエリが分かる。

str(select([device_id, func.count('*').label('rows')]).group_by(device_id).limit(10))

'SELECT "EventData"."deviceID", count(:count_1) AS rows \nFROM "EventData" GROUP BY "EventData"."deviceID"\n LIMIT :param_1'

SQLAlchemy でテーブルを Python class として扱う

Pandas だとあまり利点が無い気がするが、本来 SQLAlchemy は OR マッパとして使う。つまり、SQL データベースをあたかも Python オブジェクトのコレクションのように扱う。折角なので OR マッパとしての SQLAlchemy についても少し調べた。というか、順番としては頭からドキュメントを呼んでいくと OR マッパの長い説明の後ようやく selectable オブジェクトだけを使う方法を見つけた。

Object Relational Tutorial によると、テーブルを Python クラスとして表現するには declarative_base() で作ったベースクラスを使う。

from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class EventData(Base):
    __tablename__ = 'EventData'
    device_id = Column('deviceId', String, primary_key=True)
    timestamp = Column(Integer, primary_key=True)
    latitude = Column(Float)
    longitude = Column(Float)

    def __repr__(self):
        return 'EventData(%s, %s, %s, %s)' % (self.device_id, self.timestamp, self.latitude, self.longitude)

これでクラスの __table__ を覗いてみると Table 定義が出来上がっている。

EventData.__table__

Table('EventData', MetaData(bind=None), Column('deviceId', String(), table=<EventData>, primary_key=True, nullable=False), Column('timestamp', Integer(), table=<EventData>, primary_key=True, nullable=False), Column('latitude', Float(), table=<EventData>), Column('longitude', Float(), table=<EventData>), schema=None)

出来上がったクラスを使って実際に SQL を発行するには Session というものを使う。

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()

for each in session.query(EventData)[0:10]:
    print(each)

session.close()

EventData(81b2d6, 1522711327, 35.6349139, 139.7683735)
EventData(81b2d6, 1522711332, 35.6352536, 139.7673153)
EventData(81b2d6, 1522711342, 35.6358887, 139.7653534)
EventData(81b2d6, 1522711347, 35.6361897, 139.7643313) ...

str() でクエリの内容を見ることも出来る。

> str(session.query(EventData))

'SELECT `EventData`.`deviceId` AS `EventData_deviceId`, `EventData`.timestamp AS `EventData_timestamp`, `EventData`.latitude AS `EventData_latitude`, `EventData`.longitude AS `EventData_longitude` \nFROM `EventData`'

宿題

さて、ここで不思議なのは、session.query で作った Query オブジェクトは read_sql_query の時に select で使った Select オブジェクトとは別物だという事だ、Select オブジェクトを使うには、session ではなく Connection を使う。

conn = engine.connect()
query = select([EventData.device_id, EventData.timestamp, EventData.latitude, EventData.longitude]).limit(10)

for each in conn.execute(query.limit(10)):
    print(each)

つまり、

OR マッパを使う時は Session
Select オブジェクトを使う時は Connection

という互換性の無い二つのやり方がある気がする。この二つの関係については後で調べる。

回答

SQLAlchemy: engine, connection and session difference によれば、Session は OR マッパで SQL を自動生成する場合に使い、Connection は SQL を明示的に生成する場合に使うとある。ちなみに、Session にも execute 関数があるため、Select オブジェクトはどちらにも使える。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up