As a Hive exercise, I tried working with the publicly available livedoor gourmet dataset.
Environment setup
Install Java 7, CDH4 (pseudo-distributed mode), and Hive
This basically follows the official CDH installation guide:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3.html
The goal was just to get something running, so there is probably plenty of room for improvement.
# java
sudo apt-get -y install python-software-properties
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java7-installer
echo "export JAVA_HOME=/usr/lib/jvm/java-7-oracle" >> ~/.bashrc
echo "export PATH=$PATH:$JAVA_HOME/bin" >> ~/.bashrc
# CDH
curl -O http://archive.cloudera.com/cdh4/one-click-install/squeeze/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install hadoop-0.20-conf-pseudo
sudo -u hdfs hdfs namenode -format
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -mkdir /user/hdfs
sudo -u hdfs hadoop fs -chown hdfs /user/hdfs
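# (optional sanity check, not in the original steps) run one of the bundled example jobs
# to confirm HDFS and the MRv1 JobTracker are both working end to end; the jar path below
# is the usual CDH4 MRv1 location and may need adjusting on other setups
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 100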
# Hive
sudo apt-get -y install hive
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
sudo -u hdfs hadoop fs -chmod -R 777 /user
sudo -u hdfs hadoop fs -mkdir /user/vagrant
sudo -u hdfs hadoop fs -chown -R vagrant /user/vagrant
sudo chown vagrant:vagrant -R /var/lib/hive
# Verify the installation
$ hive
hive> show databases;
OK
default
Loading the data
Get the livedoor gourmet data
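The dataset is a set of CSV files (ratings, restaurants, area, and so on). A minimal sketch of unpacking and inspecting it, assuming the downloaded archive is named ldgourmet.tar.gz (the archive name is an assumption, not something checked here):
# assumption: the archive was saved as ldgourmet.tar.gz in the current directory
tar xzf ldgourmet.tar.gz
ls *.csv              # expect ratings.csv, restaurants.csv, area.csv, ...
head -3 ratings.csv   # peek at the columns before writing the CREATE TABLE statements below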
Create the database (and switch to it, so the tables below go into ldgourmet rather than default)
hive> create database ldgourmet;
hive> use ldgourmet;
Create and load the ratings table
hive> create table ratings (
        id int, restaurant_id int, user_id string,
        total int, food int, service int, atmosphere int, cost_performance int,
        title string, body string, purpose int, created_on timestamp
      ) row format delimited fields terminated by ',' lines terminated by '\n';
load data local inpath './ratings.csv' overwrite into table ratings;
Create and load the restaurants table
create table restaurants (
  id int, name string, property string, alphabet string, name_kana string,
  pref_id int, area_id int,
  station_id1 int, station_time1 int, station_distance1 int,
  station_id2 int, station_time2 int, station_distance2 int,
  station_id3 int, station_time3 int, station_distance3 int,
  category_id1 int, category_id2 int, category_id3 int, category_id4 int, category_id5 int,
  zip string, address string, north_latitude float, east_longitude float,
  description string, purpose int,
  open_morning boolean, open_lunch boolean, open_late boolean,
  photo_count int, special_count int, menu_count int, fan_count int, access_count int,
  created_on timestamp, modified_on timestamp, closed boolean
) row format delimited fields terminated by ',' lines terminated by '\n';
load data local inpath './restaurants.csv' overwrite into table restaurants;
Create and load the areas table
create table areas (id int, pref_id int, name string) row format delimited fields terminated by ',' lines terminated by '\n';
load data local inpath './area.csv' overwrite into table areas;
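Before querying, a quick row count on each table is a cheap way to confirm the loads actually picked up the CSVs (the exact numbers depend on the dataset version, so none are shown here):
hive> select count(*) from ratings;
hive> select count(*) from restaurants;
hive> select count(*) from areas;
Each count runs as a MapReduce job, so this also doubles as a check that the pseudo-distributed cluster works end to end.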
Aggregation
Average rating for each restaurant in the Ginza / Nihonbashi / Shinbashi area (area_id = 1)
hive> select avg(ratings.total), restaurants.id, max(restaurants.name) from ratings join restaurants on (ratings.restaurant_id = restaurants.id) where restaurants.area_id = 1 group by restaurants.id;
...
Average rating for each area in Tokyo
hive> select avg(ratings.total), max(areas.name) from ratings join restaurants on (ratings.restaurant_id = restaurants.id) join areas on (restaurants.area_id = areas.id) where areas.pref_id = 13 group by areas.id;
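To rank the Tokyo areas instead of just listing their averages, the same join can be sorted on the aggregate; a sketch (the avg_total / area_name aliases and the limit are my own additions):
hive> select avg(ratings.total) as avg_total, max(areas.name) as area_name
      from ratings
      join restaurants on (ratings.restaurant_id = restaurants.id)
      join areas on (restaurants.area_id = areas.id)
      where areas.pref_id = 13
      group by areas.id
      order by avg_total desc
      limit 10;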
I also tried a few other things.
Using dynamic partitions to build a partitioned table (p_areas) from an existing table (areas)
I also wanted to try partitions. Splitting the source data by partition before loading it would be a hassle, so I tried what Hive calls dynamic partitioning.
hive> create table p_areas (id int, name string) partitioned by (pref_id int) row format delimited fields terminated by ',' lines terminated by '\n';
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
from areas insert overwrite table p_areas partition(pref_id) select id, name, pref_id;
-- if the number of partitions gets large, consider tuning the following parameters:
-- hive.exec.max.dynamic.partitions.pernode
-- hive.exec.max.dynamic.partitions
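Once the insert finishes, each pref_id value becomes its own partition of p_areas, so a filter on pref_id only reads that partition instead of scanning everything. A couple of quick checks (output omitted):
hive> show partitions p_areas;
hive> select id, name from p_areas where pref_id = 13;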
From Python
To query Hive from Python, use HiveServer2. See the following for details:
https://gist.github.com/YoshihitoAso/9632814
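As a concrete sketch, this is roughly what a HiveServer2 client looks like with the pyhs2 package. Whether the gist above uses pyhs2 or another client is an assumption on my part, as are the host, port, and credentials (HiveServer2 listens on port 10000 by default, and on CDH4 it is provided by the hive-server2 package/service, which has to be running first):
# assumptions: HiveServer2 is running on localhost:10000 and pyhs2 is installed (pip install pyhs2)
import pyhs2

with pyhs2.connect(host='localhost',
                   port=10000,
                   authMechanism='PLAIN',  # unsecured setup; Kerberos etc. would differ
                   user='vagrant',
                   password='',            # ignored in an unsecured setup
                   database='ldgourmet') as conn:
    with conn.cursor() as cur:
        # same kind of aggregation as above, driven from Python this time
        cur.execute("select restaurant_id, avg(total) as avg_total "
                    "from ratings group by restaurant_id limit 10")
        for row in cur.fetch():
            print(row)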