データ分析や機械学習の前処理をしていると、
SQLでカテゴリカル変数のダミー化(OR One-Hot-Encoding)をしたくなることがあります。
SQLでは動的にカラムを生成できない(可能なものもある?)ので、CASE・IF文等で1個1個カラムを作ることになりますが、面倒です。
しかし、SQLでSQLを生成することにより解決できます。
例
以下はBigQueryの例ですが、文字列の結合関数(STRING_AGG,CONCAT)の部分をそのDBでの関数に置き換えれば他でも流用可能だと思います(一部RDBMS除く)。
SQL
WITH
-- 出力結果を見やすくするためにスペースや\nを入れてますがお好みで
dummie_column1 AS (
SELECT
STRING_AGG(CONCAT('COUNT(CASE WHEN seasonType = "',seasonType,'" THEN 1 ELSE NULL END) AS seasonType_',seasonType,'\n ')) AS column
FROM (
SELECT DISTINCT seasonType FROM `bigquery-public-data.baseball.games_post_wide`
)
)
,
dummie_column2 AS (
SELECT
-- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
STRING_AGG(CONCAT('COUNT(CASE WHEN homeTeamName = "',homeTeamName,'" THEN 1 ELSE NULL END) AS homeTeamName_',REGEXP_REPLACE(homeTeamName, ' ', '_'),'\n ')) AS column
FROM (
SELECT DISTINCT homeTeamName FROM `bigquery-public-data.baseball.games_post_wide`
)
)
,
dummie_column3 AS (
SELECT
-- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
STRING_AGG(CONCAT('COUNT(CASE WHEN awayTeamName = "',awayTeamName,'" THEN 1 ELSE NULL END) AS awayTeamName_',REGEXP_REPLACE(awayTeamName, ' ', '_'),'\n ')) AS column
FROM (
SELECT DISTINCT awayTeamName FROM `bigquery-public-data.baseball.games_post_wide`
)
)
SELECT
CONCAT('SELECT \n',
' gameId,\n ',
(SELECT column FROM dummie_column1),
',',
(SELECT column FROM dummie_column2),
',',
(SELECT column FROM dummie_column3),
'\nFROM `bigquery-public-data.baseball.games_post_wide`\n',
'GROUP BY 1') AS query
出力結果
SELECT
gameId,
COUNT(CASE WHEN seasonType = "PST" THEN 1 ELSE NULL END) AS seasonType_PST
,COUNT(CASE WHEN homeTeamName = "Dodgers" THEN 1 ELSE NULL END) AS homeTeamName_Dodgers
,COUNT(CASE WHEN homeTeamName = "Mets" THEN 1 ELSE NULL END) AS homeTeamName_Mets
,COUNT(CASE WHEN homeTeamName = "Rangers" THEN 1 ELSE NULL END) AS homeTeamName_Rangers
,COUNT(CASE WHEN homeTeamName = "Indians" THEN 1 ELSE NULL END) AS homeTeamName_Indians
,COUNT(CASE WHEN homeTeamName = "Cubs" THEN 1 ELSE NULL END) AS homeTeamName_Cubs
,COUNT(CASE WHEN homeTeamName = "Nationals" THEN 1 ELSE NULL END) AS homeTeamName_Nationals
,COUNT(CASE WHEN homeTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS homeTeamName_Blue_Jays
,COUNT(CASE WHEN homeTeamName = "Red Sox" THEN 1 ELSE NULL END) AS homeTeamName_Red_Sox
,COUNT(CASE WHEN homeTeamName = "Giants" THEN 1 ELSE NULL END) AS homeTeamName_Giants
,COUNT(CASE WHEN awayTeamName = "Cubs" THEN 1 ELSE NULL END) AS awayTeamName_Cubs
,COUNT(CASE WHEN awayTeamName = "Giants" THEN 1 ELSE NULL END) AS awayTeamName_Giants
,COUNT(CASE WHEN awayTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS awayTeamName_Blue_Jays
,COUNT(CASE WHEN awayTeamName = "Nationals" THEN 1 ELSE NULL END) AS awayTeamName_Nationals
,COUNT(CASE WHEN awayTeamName = "Red Sox" THEN 1 ELSE NULL END) AS awayTeamName_Red_Sox
,COUNT(CASE WHEN awayTeamName = "Dodgers" THEN 1 ELSE NULL END) AS awayTeamName_Dodgers
,COUNT(CASE WHEN awayTeamName = "Indians" THEN 1 ELSE NULL END) AS awayTeamName_Indians
,COUNT(CASE WHEN awayTeamName = "Rangers" THEN 1 ELSE NULL END) AS awayTeamName_Rangers
,COUNT(CASE WHEN awayTeamName = "Orioles" THEN 1 ELSE NULL END) AS awayTeamName_Orioles
FROM `bigquery-public-data.baseball.games_post_wide`
GROUP BY 1
他の良い手法があれば、ご教示ください。
可能であればPython(pandas.get_dummies)やR(dummies等色んなライブラリがある)でダミー変数化したほうが楽です。
ただしデータを移す必要があるので、それが面倒な時や、大きなデータを処理するときはSQLで完結した方が楽(状況によりますが)だと思います。