ランチの時にapache drillの話をふわっと聞いた。
なんでもRDS、DynamoDBのテーブルとファイルシステム上、S3上のJSON/tsv/csvファイルをSQLライクなクエリでJoinしたりできるらしい。
面白そうなので触ってみる
document
インストール
brewで一撃
brew install apache-drill
mockデータ作成
mockarooでmockデータ作成
JSON(カラムはデフォルトのまま
[{"id":1,"gender":"Male","first_name":"Willie","last_name":"Gardner","email":"wgardner0@csmonitor.com","ip_address":"171.75.157.165"},
{"id":2,"gender":"Female","first_name":"Lois","last_name":"Jacobs","email":"ljacobs1@photobucket.com","ip_address":"188.171.106.166"},
{"id":3,"gender":"Male","first_name":"Shawn","last_name":"Rivera","email":"srivera2@storify.com","ip_address":"160.224.138.213"},
{"id":4,"gender":"Female","first_name":"Stephanie","last_name":"Hernandez","email":"shernandez3@shinystat.com","ip_address":"77.67.203.114"},
{"id":5,"gender":"Male","first_name":"Ralph","last_name":"Wright","email":"rwright4@booking.com","ip_address":"89.240.134.189"},
....
CSV(colorとcompany_nameを追加
id,first_name,last_name,email,gender,ip_address,color,company_name
1,Kenneth,Wheeler,kwheeler0@domainmarket.com,Male,54.230.82.29,Khaki,Voonyx
2,Brenda,Young,byoung1@networkadvertising.org,Female,79.121.226.209,Goldenrod,Feedbug
3,Steven,Knight,sknight2@hud.gov,Male,161.79.31.161,Purple,Aimbo
4,Timothy,Daniels,tdaniels3@ovh.net,Male,233.15.252.170,Mauv,Oba
5,Adam,Wright,awright4@cbc.ca,Male,56.247.113.21,Indigo,Kwilith
....
動作確認
まずはdrillシェルに入る。just drill it!!
$ sqlline -u jdbc:drill:zk=local
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Jan 05, 2016 3:32:08 PM org.glassfish.jersey.server.ApplicationHandler initialize INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26... apache drill 1.4.0
"just drill it"
0: jdbc:drill:zk=local>
WEBコンソールも用意されており、シェルに入った状態で
http://127.0.0.1:8047/
にアクセスすることができる
jsonの方を引いてみる
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl LIMIT 10;
+-----+---------+-------------+------------+------------------------------------+------------------+
| id | gender | first_name | last_name | email | ip_address |
+-----+---------+-------------+------------+------------------------------------+------------------+
| 1 | Male | Willie | Gardner | wgardner0@csmonitor.com | 171.75.157.165 |
| 2 | Female | Lois | Jacobs | ljacobs1@photobucket.com | 188.171.106.166 |
| 3 | Male | Shawn | Rivera | srivera2@storify.com | 160.224.138.213 |
| 4 | Female | Stephanie | Hernandez | shernandez3@shinystat.com | 77.67.203.114 |
| 5 | Male | Ralph | Wright | rwright4@booking.com | 89.240.134.189 |
| 6 | Male | Donald | Graham | dgraham5@google.com.br | 239.6.250.80 |
| 7 | Male | Justin | Howell | jhowell6@dion.ne.jp | 159.104.8.8 |
| 8 | Male | Russell | Mills | rmills7@nsw.gov.au | 245.130.6.110 |
| 9 | Female | Rose | Gonzalez | rgonzalez8@scientificamerican.com | 16.9.119.25 |
| 10 | Female | Mildred | Ramos | mramos9@unc.edu | 82.232.80.156 |
+-----+---------+-------------+------------+------------------------------------+------------------+
10 rows selected (0.215 seconds)
うほほほほ。こりゃいい。しかも結構はやい。
csvの方も引いてみる
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl LIMIT 10;
+----------------------------------------------------------------------------------------------------------+
| columns |
+----------------------------------------------------------------------------------------------------------+
| ["id","first_name","last_name","email","gender","ip_address","color","company_name"] |
| ["1","Kenneth","Wheeler","kwheeler0@domainmarket.com","Male","54.230.82.29","Khaki","Voonyx"] |
| ["2","Brenda","Young","byoung1@networkadvertising.org","Female","79.121.226.209","Goldenrod","Feedbug"] |
| ["3","Steven","Knight","sknight2@hud.gov","Male","161.79.31.161","Purple","Aimbo"] |
| ["4","Timothy","Daniels","tdaniels3@ovh.net","Male","233.15.252.170","Mauv","Oba"] |
| ["5","Adam","Wright","awright4@cbc.ca","Male","56.247.113.21","Indigo","Kwilith"] |
| ["6","Jane","Wheeler","jwheeler5@unc.edu","Female","42.132.133.182","Teal","Gabcube"] |
| ["7","Jack","Smith","jsmith6@domainmarket.com","Male","199.219.253.212","Fuscia","Pixope"] |
| ["8","James","Hayes","jhayes7@businessinsider.com","Male","237.34.224.202","Purple","Yabox"] |
| ["9","Susan","Murphy","smurphy8@joomla.org","Female","85.200.49.196","Turquoise","Jaloo"] |
+----------------------------------------------------------------------------------------------------------+
ほほー。csvの方はこうなっているのですね。これでJOINなんてできるの・・?
なんかオプションがあるのだろうか。
と思ってドキュメント軽く見たら書いてた。
https://drill.apache.org/docs/text-files-csv-tsv-psv/
この辺りを定義すればいける模様。
"skipFirstLine": false,
"extractHeader": true,
cfgファイルではなくWebコンソールかcurlで設定するらしい。
今回はwebコンソールで既存のcsvの定義を変更することにした。
メニュータブのStorageに遷移し、dfsの[Update]をクリック
csv,tsvにskipFirstLineとextractHeaderを追記し、[Update]をクリック
0: jdbc:drill:zk=local> SELECT tbl.* FROM dfs.`/tmp/work/MOCK_DATA.csv` AS tbl LIMIT 10;
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| id | first_name | last_name | email | gender | ip_address | color | company_name |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
| 1 | Kenneth | Wheeler | kwheeler0@domainmarket.com | Male | 54.230.82.29 | Khaki | Voonyx |
| 2 | Brenda | Young | byoung1@networkadvertising.org | Female | 79.121.226.209 | Goldenrod | Feedbug |
| 3 | Steven | Knight | sknight2@hud.gov | Male | 161.79.31.161 | Purple | Aimbo |
| 4 | Timothy | Daniels | tdaniels3@ovh.net | Male | 233.15.252.170 | Mauv | Oba |
| 5 | Adam | Wright | awright4@cbc.ca | Male | 56.247.113.21 | Indigo | Kwilith |
| 6 | Jane | Wheeler | jwheeler5@unc.edu | Female | 42.132.133.182 | Teal | Gabcube |
| 7 | Jack | Smith | jsmith6@domainmarket.com | Male | 199.219.253.212 | Fuscia | Pixope |
| 8 | James | Hayes | jhayes7@businessinsider.com | Male | 237.34.224.202 | Purple | Yabox |
| 9 | Susan | Murphy | smurphy8@joomla.org | Female | 85.200.49.196 | Turquoise | Jaloo |
| 10 | Ann | Hill | ahill9@youku.com | Female | 254.251.33.227 | Red | Yadel |
+-----+-------------+------------+---------------------------------+---------+------------------+------------+---------------+
キターーー!
大正義Join!!!
おめぇ、これできたらと思うとおらワクワクすっべ・・・
0: jdbc:drill:zk=local> SELECT tbl1.first_name, tbl1.last_name, tbl2.color, tbl2.company_name FROM dfs.`/tmp/work/MOCK_DATA.json` AS tbl1
. . . . . . . . . . . > JOIN
. . . . . . . . . . . > dfs.`/tmp/work/MOCK_DATA.csv` AS tbl2 ON tbl1.first_name=tbl2.first_name and tbl1.last_name=tbl2.last_name;
+-------------+-------------+-------------+----------------+
| first_name | last_name | color | company_name |
+-------------+-------------+-------------+----------------+
| Scott | Ray | Maroon | Twinte |
| Andrew | Brown | Orange | Photolist |
| Rose | Washington | Orange | Livetube |
| Steven | Adams | Maroon | Zoombeat |
| Nancy | Porter | Indigo | Kwinu |
| Lois | Williams | Purple | Flashset |
| Justin | Hart | Red | Miboo |
| Arthur | Scott | Indigo | Thoughtbeat |
| Wanda | Peters | Khaki | Teklist |
| Christine | Dean | Aquamarine | Buzzshare |
| Paula | Nelson | Green | Kare |
| Betty | Wagner | Aquamarine | Photobean |
| Jerry | Ross | Mauv | Linktype |
| Harry | Washington | Green | Reallinks |
| Nancy | Lynch | Turquoise | Twitternation |
| Heather | Moore | Violet | Buzzbean |
| Linda | Gray | Violet | Fadeo |
+-------------+-------------+-------------+----------------+
17 rows selected (1.026 seconds)
あまり意味のないクエリですがきちんとJOINできました。
素晴らしいですね。
仕事で使えそうなケースがあれば大規模データでの検証やSparkとの比較も実施してみようと思います。