
Aggregating the livedoor Gourmet Data with Hive

Posted at 2014-09-03

As Hive practice, I tried out the publicly available livedoor gourmet dataset.

Environment Setup

Install Java 7, CDH4 (pseudo-distributed mode), and Hive

This basically follows Cloudera's official CDH installation guide:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Quick-Start/cdh4qs_topic_3.html

The goal was just to get things running, so there is probably plenty of room for improvement.

# java
sudo apt-get -y install python-software-properties
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java7-installer
# Single quotes so $PATH and $JAVA_HOME are expanded when .bashrc is sourced,
# not at echo time (with double quotes, JAVA_HOME would expand to empty here).
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc

# CDH
curl -O http://archive.cloudera.com/cdh4/one-click-install/squeeze/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y install hadoop-0.20-conf-pseudo
sudo -u hdfs hdfs namenode -format
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
for x in `cd /etc/init.d ; ls hadoop-0.20-mapreduce-*` ; do sudo service $x start ; done
sudo -u hdfs hadoop fs -mkdir /user/hdfs
sudo -u hdfs hadoop fs -chown hdfs /user/hdfs

# Hive
sudo apt-get -y install hive
sudo chmod -R 777 /var/lib/hive/metastore/metastore_db
sudo -u hdfs hadoop fs -chmod -R 777 /user
sudo -u hdfs hadoop fs -mkdir /user/vagrant
sudo -u hdfs hadoop fs -chown -R vagrant /user/vagrant
sudo chown vagrant:vagrant -R /var/lib/hive

# Sanity check
$ hive
hive> show databases;
OK
default

Loading the Data

Get the livedoor gourmet dataset

Create a database

hive> create database ldgourmet;
hive> use ldgourmet;

Create and load the ratings table

hive> create table ratings (
        id int, restaurant_id int, user_id string, total int, food int,
        service int, atmosphere int, cost_performance int, title string,
        body string, purpose int, created_on timestamp)
      row format delimited fields terminated by ',' lines terminated by '\n';

load data local inpath './ratings.csv' overwrite into table ratings;
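A quick sanity check right after the load is cheap insurance; a sketch (this assumes the CSV fields contain no embedded commas, since the table uses a plain comma delimiter — malformed rows would show up as NULL columns here):

```sql
hive> select count(*) from ratings;
hive> select id, restaurant_id, total, title from ratings limit 3;
```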

Create and load the restaurants table

create table restaurants (
  id int, name string, property string, alphabet string, name_kana string,
  pref_id int, area_id int,
  station_id1 int, station_time1 int, station_distance1 int,
  station_id2 int, station_time2 int, station_distance2 int,
  station_id3 int, station_time3 int, station_distance3 int,
  category_id1 int, category_id2 int, category_id3 int, category_id4 int, category_id5 int,
  zip string, address string, north_latitude float, east_longitude float,
  description string, purpose int,
  open_morning boolean, open_lunch boolean, open_late boolean,
  photo_count int, special_count int, menu_count int, fan_count int, access_count int,
  created_on timestamp, modified_on timestamp, closed boolean)
row format delimited fields terminated by ',' lines terminated by '\n';

load data local inpath './restaurants.csv' overwrite into table restaurants;

Create and load the areas table

create table areas (id int, pref_id int, name string) row format delimited fields terminated by ',' lines terminated by '\n';

load data local inpath './area.csv' overwrite into table areas;

Aggregation

Average rating for each restaurant in the Ginza/Nihonbashi/Shinbashi area (areas.id = 1)

hive> select avg(ratings.total), restaurants.id, max(restaurants.name) from ratings join restaurants on (ratings.restaurant_id = restaurants.id) where restaurants.area_id = 1 group by restaurants.id;
...

Average rating for each area in Tokyo (pref_id = 13)

hive> select avg(ratings.total), max(areas.name) from ratings join restaurants on (ratings.restaurant_id = restaurants.id) join areas on (restaurants.area_id = areas.id) where areas.pref_id = 13 group by areas.id;
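To rank the areas instead of just listing their averages, the aggregate can be wrapped in a subquery and ordered; a hedged variant of the query above (Hive requires an alias on the subquery, `t` here):

```sql
hive> select area_name, avg_total from (
        select max(areas.name) as area_name, avg(ratings.total) as avg_total
        from ratings
        join restaurants on (ratings.restaurant_id = restaurants.id)
        join areas on (restaurants.area_id = areas.id)
        where areas.pref_id = 13
        group by areas.id) t
      order by avg_total desc limit 10;
```

Note that `order by` in Hive funnels everything through a single reducer, which is fine at this data size.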

I tried various other things too.

Using dynamic partitions to build a partitioned table (p_areas) from an existing table (areas).

Next, partitioning. Splitting the source data by partition before loading is a hassle, so I tried the so-called dynamic partitioning instead.

hive> create table p_areas (id int, name string) partitioned by (pref_id int) row format delimited fields terminated by ',' lines terminated by '\n';

set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.dynamic.partition=true;

from areas insert overwrite table p_areas partition(pref_id) select id, name, pref_id;

-- If the number of partitions gets large, consider tuning these parameters:
-- hive.exec.max.dynamic.partitions.pernode
-- hive.exec.max.dynamic.partitions
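One way to confirm the dynamic-partition insert fanned out as expected is to list the partitions and query one of them; a sketch (the `where pref_id = 13` predicate should hit only that partition's directory thanks to partition pruning):

```sql
hive> show partitions p_areas;
hive> select * from p_areas where pref_id = 13 limit 5;
```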

From Python

To query Hive from Python, use HiveServer2. Details here:
https://gist.github.com/YoshihitoAso/9632814
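As a rough sketch of what that looks like, here is a minimal client using pyhs2 (the HiveServer2 library the linked gist is based on). The host, port, credentials, and `authMechanism="PLAIN"` are assumptions for an unsecured local pseudo-distributed setup like the one built above:

```python
def avg_rating_by_area_query(pref_id):
    """Build the per-area average-rating query used earlier in the post."""
    return (
        "select avg(ratings.total), max(areas.name) "
        "from ratings "
        "join restaurants on (ratings.restaurant_id = restaurants.id) "
        "join areas on (restaurants.area_id = areas.id) "
        "where areas.pref_id = %d group by areas.id" % pref_id
    )

if __name__ == "__main__":
    # Import deferred so the query builder above works without the library.
    import pyhs2

    # HiveServer2 listens on port 10000 by default; user/database are
    # assumptions matching the setup earlier in this post.
    conn = pyhs2.connect(host="localhost", port=10000,
                         authMechanism="PLAIN", user="vagrant",
                         password="", database="ldgourmet")
    cur = conn.cursor()
    cur.execute(avg_rating_by_area_query(13))
    for row in cur.fetchall():
        print(row)
    cur.close()
    conn.close()
```

The query string is built separately so the HiveQL can be reused or unit-tested without a live cluster.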
