Help us understand the problem. What is going on with this article?

SQLでカテゴリカル変数をダミー化

データ分析や機械学習の前処理をしていると、
SQLでカテゴリカル変数のダミー化(OR One-Hot-Encoding)をしたくなることがあります。

SQLでは動的にカラムを生成できない(可能なものもある?)ので、CASE・IF文等で1個1個カラムを作ることになりますが、面倒です。
しかし、SQLでSQLを生成することにより解決できます。

以下はBigQueryの例ですが、文字列の結合関数(STRING_AGG,CONCAT)の部分をそのDBでの関数に置き換えれば他でも流用可能だと思います(一部RDBMS除く)。

SQL

WITH
-- 出力結果を見やすくするためにスペースや\nを入れてますがお好みで
dummie_column1 AS (
  SELECT 
    STRING_AGG(CONCAT('COUNT(CASE WHEN seasonType = "',seasonType,'" THEN 1 ELSE NULL END) AS seasonType_',seasonType,'\n  ')) AS column
  FROM (
    SELECT DISTINCT seasonType FROM `bigquery-public-data.baseball.games_post_wide` 
  )
)
,
dummie_column2 AS (
  SELECT 
    -- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
    STRING_AGG(CONCAT('COUNT(CASE WHEN homeTeamName = "',homeTeamName,'" THEN 1 ELSE NULL END) AS homeTeamName_',REGEXP_REPLACE(homeTeamName, ' ', '_'),'\n  ')) AS column
  FROM (
    SELECT DISTINCT homeTeamName FROM `bigquery-public-data.baseball.games_post_wide`
  )
)
,
dummie_column3 AS (
  SELECT 
    -- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
    STRING_AGG(CONCAT('COUNT(CASE WHEN awayTeamName = "',awayTeamName,'" THEN 1 ELSE NULL END) AS awayTeamName_',REGEXP_REPLACE(awayTeamName, ' ', '_'),'\n  ')) AS column
  FROM (
    SELECT DISTINCT awayTeamName FROM `bigquery-public-data.baseball.games_post_wide`
  )
)

SELECT 
  CONCAT('SELECT \n',
         '  gameId,\n  ',
         (SELECT column FROM dummie_column1),
         ',',
          (SELECT column FROM dummie_column2),   
         ',',
          (SELECT column FROM dummie_column3),   
         '\nFROM `bigquery-public-data.baseball.games_post_wide`\n',
         'GROUP BY 1') AS query

出力結果

SELECT 
  gameId,
  COUNT(CASE WHEN seasonType = "PST" THEN 1 ELSE NULL END) AS seasonType_PST
  ,COUNT(CASE WHEN homeTeamName = "Dodgers" THEN 1 ELSE NULL END) AS homeTeamName_Dodgers
  ,COUNT(CASE WHEN homeTeamName = "Mets" THEN 1 ELSE NULL END) AS homeTeamName_Mets
  ,COUNT(CASE WHEN homeTeamName = "Rangers" THEN 1 ELSE NULL END) AS homeTeamName_Rangers
  ,COUNT(CASE WHEN homeTeamName = "Indians" THEN 1 ELSE NULL END) AS homeTeamName_Indians
  ,COUNT(CASE WHEN homeTeamName = "Cubs" THEN 1 ELSE NULL END) AS homeTeamName_Cubs
  ,COUNT(CASE WHEN homeTeamName = "Nationals" THEN 1 ELSE NULL END) AS homeTeamName_Nationals
  ,COUNT(CASE WHEN homeTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS homeTeamName_Blue_Jays
  ,COUNT(CASE WHEN homeTeamName = "Red Sox" THEN 1 ELSE NULL END) AS homeTeamName_Red_Sox
  ,COUNT(CASE WHEN homeTeamName = "Giants" THEN 1 ELSE NULL END) AS homeTeamName_Giants
  ,COUNT(CASE WHEN awayTeamName = "Cubs" THEN 1 ELSE NULL END) AS awayTeamName_Cubs
  ,COUNT(CASE WHEN awayTeamName = "Giants" THEN 1 ELSE NULL END) AS awayTeamName_Giants
  ,COUNT(CASE WHEN awayTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS awayTeamName_Blue_Jays
  ,COUNT(CASE WHEN awayTeamName = "Nationals" THEN 1 ELSE NULL END) AS awayTeamName_Nationals
  ,COUNT(CASE WHEN awayTeamName = "Red Sox" THEN 1 ELSE NULL END) AS awayTeamName_Red_Sox
  ,COUNT(CASE WHEN awayTeamName = "Dodgers" THEN 1 ELSE NULL END) AS awayTeamName_Dodgers
  ,COUNT(CASE WHEN awayTeamName = "Indians" THEN 1 ELSE NULL END) AS awayTeamName_Indians
  ,COUNT(CASE WHEN awayTeamName = "Rangers" THEN 1 ELSE NULL END) AS awayTeamName_Rangers
  ,COUNT(CASE WHEN awayTeamName = "Orioles" THEN 1 ELSE NULL END) AS awayTeamName_Orioles

FROM `bigquery-public-data.baseball.games_post_wide`
GROUP BY 1

他の良い手法があれば、ご教示ください。

可能であればPython(pandas.get_dummies)やR(dummies等色んなライブラリがある)でダミー変数化したほうが楽です。
ただしデータを移す必要があるので、それが面倒な時や、大きなデータを処理するときはSQLで完結した方が楽(状況によりますが)だと思います。

Why not register and get more from Qiita?
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away
Comments
No comments
Sign up for free and join this conversation.
If you already have a Qiita account
Why do not you register as a user and use Qiita more conveniently?
You need to log in to use this function. Qiita can be used more conveniently after logging in.
You seem to be reading articles frequently this month. Qiita can be used more conveniently after logging in.
  1. We will deliver articles that match you
    By following users and tags, you can catch up information on technical fields that you are interested in as a whole
  2. you can read useful information later efficiently
    By "stocking" the articles you like, you can search right away
ユーザーは見つかりませんでした