
Installing Hadoop and MapReduce in Python


A tutorial for people who want to get started with Hadoop but can't be bothered to write Java.

Hadoop itself is written in Java, so Mappers/Reducers are normally written in Java as well. However, Hadoop has a feature called Hadoop Streaming, which passes data through Unix standard input and output. Using this, I wrote a Mapper/Reducer in Python. Of course, with Hadoop Streaming you can write them in languages other than Python as well.

This time I built a pseudo-distributed environment on Ubuntu.

Ubuntu 12.04 + Hadoop 2.4.1

Setting up Hadoop

Install Java if it is not already installed.

$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk

Download Hadoop.

$ wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
$ tar zxvf hadoop-2.4.1.tar.gz
$ mv hadoop-2.4.1 hadoop
$ rm hadoop-2.4.1.tar.gz
$ sudo mv hadoop /usr/local
$ cd /usr/local/hadoop
$ export PATH=$PATH:/usr/local/hadoop/bin # it is convenient to add this to your .zshrc
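
If the PATH is set correctly, the hadoop command should now be available (it prints the version and build information):

$ hadoop version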

Edit the following four files.

$ vim etc/hadoop/core-site.xml
core-site.xml
...
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>
$ vim etc/hadoop/hdfs-site.xml
hdfs-site.xml
...
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>
$ mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ vim etc/hadoop/mapred-site.xml
mapred-site.xml
...
<configuration>
     <property>
         <name>mapreduce.framework.name</name>
         <value>yarn</value>
     </property>
</configuration>
$ vim etc/hadoop/hadoop-env.sh
hadoop-env.sh
...
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
...
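
In addition, because mapred-site.xml above sets MapReduce to run on YARN, yarn-site.xml (in the same etc/hadoop directory) typically needs the MapReduce shuffle service enabled; this follows the standard pseudo-distributed setup for Hadoop 2.4.x:

$ vim etc/hadoop/yarn-site.xml
yarn-site.xml
...
<configuration>
     <property>
         <name>yarn.nodemanager.aux-services</name>
         <value>mapreduce_shuffle</value>
     </property>
</configuration>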

Add an SSH key if you do not have one.

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
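
With the key in place, you should be able to ssh to localhost without being asked for a passphrase:

$ ssh localhost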

Finally, format the namenode and start Hadoop.

$ hdfs namenode -format
$ sbin/start-dfs.sh
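
Since MapReduce is configured to run on YARN above, the ResourceManager and NodeManager also need to be started. jps (a JDK tool) is a quick way to check that the HDFS and YARN daemons are running:

$ sbin/start-yarn.sh
$ jps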

Writing the Mapper/Reducer in Python

This time we will write WordCount, the classic Hadoop example, in Python.

First, prepare an input file.

$ mkdir inputs
$ echo "a b b c c c" > inputs/input.txt

Mapper

$ vim mapper.py
mapper.py
#!/usr/bin/env python

import sys

# Read lines from standard input and emit "<word>\t1" for each word
for line in sys.stdin:
    for word in line.strip().split():
        print '{0}\t1'.format(word)

The Mapper emits output like the following:

a    1
b    1
b    1
c    1
c    1
c    1

Reducer

$ vim reducer.py
reducer.py
#!/usr/bin/env python

from collections import defaultdict
from operator import itemgetter
import sys

# Total the counts for each word from the "<word>\t<count>" lines emitted by the mapper
wordcount_dict = defaultdict(int)

for line in sys.stdin:
    word, count = line.strip().split('\t')
    wordcount_dict[word] += int(count)

# Output the words in sorted order together with their totals
for word, count in sorted(wordcount_dict.items(), key=itemgetter(0)):
    print '{0}\t{1}'.format(word, count)

The Reducer totals the count for each word emitted by the Mapper and produces output like the following. (Under Hadoop Streaming the mapper output is sorted by key before it reaches the reducer, but this reducer does not depend on that ordering because it aggregates into a dictionary.)

a    1
b    2
c    3
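
Because Hadoop Streaming simply feeds data through standard input/output, the whole pipeline can be tested locally with Unix pipes before involving Hadoop at all; the sort in the middle mimics the shuffle/sort step Hadoop performs between the map and reduce phases:

$ cat inputs/input.txt | python mapper.py | sort | python reducer.py
a    1
b    2
c    3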

Running the job with Hadoop Streaming

Now it is finally time to run the Mapper/Reducer above on Hadoop.

First, download the jar file for Hadoop Streaming (the same jar is also bundled with the Hadoop distribution under share/hadoop/tools/lib).

$ wget http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.4.1/hadoop-streaming-2.4.1.jar

Create a directory on HDFS and put the input file there (be careful not to mix up files on the local filesystem with files on HDFS).

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/vagrant
$ hdfs dfs -put inputs/input.txt /user/vagrant
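
You can confirm that the input file is actually on HDFS (and not just on the local filesystem) with:

$ hdfs dfs -ls /user/vagrant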

Run the job; mapper.py and reducer.py are shipped to the task nodes with -file, and the results are written to the specified output directory.

$ chmod +x mapper.py reducer.py
$ hadoop jar hadoop-streaming-2.4.1.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /user/vagrant/input.txt -output outputs
$ hdfs dfs -cat /user/vagrant/outputs/part-00000
a    1
b    2
c    3
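
If you want the results on the local filesystem, the output directory can be copied back from HDFS (the local directory name outputs_local here is arbitrary):

$ hdfs dfs -get /user/vagrant/outputs outputs_local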