10
8

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

SQLでカテゴリカル変数をダミー化

Last updated at Posted at 2020-03-24

データ分析や機械学習の前処理をしていると、
SQLでカテゴリカル変数のダミー化(OR One-Hot-Encoding)をしたくなることがあります。

SQLでは動的にカラムを生成できない(可能なものもある?)ので、CASE・IF文等で1個1個カラムを作ることになりますが、面倒です。
しかし、SQLでSQLを生成することにより解決できます。

以下はBigQueryの例ですが、文字列の結合関数(STRING_AGG,CONCAT)の部分をそのDBでの関数に置き換えれば他でも流用可能だと思います(一部RDBMS除く)。

SQL

WITH
-- 出力結果を見やすくするためにスペースや\nを入れてますがお好みで
dummie_column1 AS (
  SELECT 
    STRING_AGG(CONCAT('COUNT(CASE WHEN seasonType = "',seasonType,'" THEN 1 ELSE NULL END) AS seasonType_',seasonType,'\n  ')) AS column
  FROM (
    SELECT DISTINCT seasonType FROM `bigquery-public-data.baseball.games_post_wide` 
  )
)
,
dummie_column2 AS (
  SELECT 
    -- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
    STRING_AGG(CONCAT('COUNT(CASE WHEN homeTeamName = "',homeTeamName,'" THEN 1 ELSE NULL END) AS homeTeamName_',REGEXP_REPLACE(homeTeamName, ' ', '_'),'\n  ')) AS column
  FROM (
    SELECT DISTINCT homeTeamName FROM `bigquery-public-data.baseball.games_post_wide`
  )
)
,
dummie_column3 AS (
  SELECT 
    -- カラムの中身にスペース等があると、ASでカラム名指定する際にエラーになるので置換しておく
    STRING_AGG(CONCAT('COUNT(CASE WHEN awayTeamName = "',awayTeamName,'" THEN 1 ELSE NULL END) AS awayTeamName_',REGEXP_REPLACE(awayTeamName, ' ', '_'),'\n  ')) AS column
  FROM (
    SELECT DISTINCT awayTeamName FROM `bigquery-public-data.baseball.games_post_wide`
  )
)

SELECT 
  CONCAT('SELECT \n',
         '  gameId,\n  ',
         (SELECT column FROM dummie_column1),
         ',',
          (SELECT column FROM dummie_column2),   
         ',',
          (SELECT column FROM dummie_column3),   
         '\nFROM `bigquery-public-data.baseball.games_post_wide`\n',
         'GROUP BY 1') AS query

出力結果

SELECT 
  gameId,
  COUNT(CASE WHEN seasonType = "PST" THEN 1 ELSE NULL END) AS seasonType_PST
  ,COUNT(CASE WHEN homeTeamName = "Dodgers" THEN 1 ELSE NULL END) AS homeTeamName_Dodgers
  ,COUNT(CASE WHEN homeTeamName = "Mets" THEN 1 ELSE NULL END) AS homeTeamName_Mets
  ,COUNT(CASE WHEN homeTeamName = "Rangers" THEN 1 ELSE NULL END) AS homeTeamName_Rangers
  ,COUNT(CASE WHEN homeTeamName = "Indians" THEN 1 ELSE NULL END) AS homeTeamName_Indians
  ,COUNT(CASE WHEN homeTeamName = "Cubs" THEN 1 ELSE NULL END) AS homeTeamName_Cubs
  ,COUNT(CASE WHEN homeTeamName = "Nationals" THEN 1 ELSE NULL END) AS homeTeamName_Nationals
  ,COUNT(CASE WHEN homeTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS homeTeamName_Blue_Jays
  ,COUNT(CASE WHEN homeTeamName = "Red Sox" THEN 1 ELSE NULL END) AS homeTeamName_Red_Sox
  ,COUNT(CASE WHEN homeTeamName = "Giants" THEN 1 ELSE NULL END) AS homeTeamName_Giants
  ,COUNT(CASE WHEN awayTeamName = "Cubs" THEN 1 ELSE NULL END) AS awayTeamName_Cubs
  ,COUNT(CASE WHEN awayTeamName = "Giants" THEN 1 ELSE NULL END) AS awayTeamName_Giants
  ,COUNT(CASE WHEN awayTeamName = "Blue Jays" THEN 1 ELSE NULL END) AS awayTeamName_Blue_Jays
  ,COUNT(CASE WHEN awayTeamName = "Nationals" THEN 1 ELSE NULL END) AS awayTeamName_Nationals
  ,COUNT(CASE WHEN awayTeamName = "Red Sox" THEN 1 ELSE NULL END) AS awayTeamName_Red_Sox
  ,COUNT(CASE WHEN awayTeamName = "Dodgers" THEN 1 ELSE NULL END) AS awayTeamName_Dodgers
  ,COUNT(CASE WHEN awayTeamName = "Indians" THEN 1 ELSE NULL END) AS awayTeamName_Indians
  ,COUNT(CASE WHEN awayTeamName = "Rangers" THEN 1 ELSE NULL END) AS awayTeamName_Rangers
  ,COUNT(CASE WHEN awayTeamName = "Orioles" THEN 1 ELSE NULL END) AS awayTeamName_Orioles
  
FROM `bigquery-public-data.baseball.games_post_wide`
GROUP BY 1

他の良い手法があれば、ご教示ください。

可能であればPython(pandas.get_dummies)やR(dummies等色んなライブラリがある)でダミー変数化したほうが楽です。
ただしデータを移す必要があるので、それが面倒な時や、大きなデータを処理するときはSQLで完結した方が楽(状況によりますが)だと思います。

10
8
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
10
8

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?