Introduction
This article describes how to build a Hadoop cluster using CDH3
(Cloudera's Distribution Including Apache Hadoop, version 3).
Environment
- CentOS 6.5
- CDH3u6
- JDK 1.6
Cluster layout
- master x 1
- slave x 2
- client x 1
Role | Hostname | IP address |
---|---|---|
master | hadoop-master | 192.168.121.11 |
slave | hadoop-slave | 192.168.121.21 |
slave | hadoop-slave2 | 192.168.121.22 |
client | hadoop-client | 192.168.121.101 |
Installing Java
CDH3 requires Oracle JDK 1.6 and recommends 1.6.0_26, so install that
specific version.
$ chmod +x jdk-6u26-linux-x64-rpm.bin
$ sudo ./jdk-6u26-linux-x64-rpm.bin
Verify the installed Java version:
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
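Hadoop's startup scripts locate the JDK via JAVA_HOME. One common way to set it system-wide is a profile snippet; this is a minimal sketch, assuming the Oracle JDK RPM maintains the /usr/java/default symlink (the same path this article later uses to run jps):

```shell
# /etc/profile.d/java.sh -- set JAVA_HOME for all login shells.
# Assumption: the JDK RPM keeps /usr/java/default pointing at the installed JDK.
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
```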
Installing CDH3
- Add the yum repository:
$ wget http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
$ sudo yum localinstall cdh3-repository-1.0-1.noarch.rpm
- Verify the repository list:
$ sudo yum clean all
$ yum repolist
...(output omitted)...
repo id repo name status
base CentOS-6 - Base 6,367
cloudera-cdh3 Cloudera's Distribution for Hadoop, Version 3 67
extras CentOS-6 - Extras 15
updates CentOS-6 - Updates 1,467
repolist: 7,916
- Install the Hadoop packages
Install the packages required for each node's role in the cluster.
On the master:
$ sudo yum install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-secondarynamenode hadoop-0.20-jobtracker
On the slaves:
$ sudo yum install hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker
On the client:
$ sudo yum install hadoop-0.20
Configuring the Hadoop cluster
On every node in the cluster:
$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.cluster
/etc/hadoop-0.20/conf.cluster/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/lib/hadoop/cache</value>
</property>
</configuration>
/etc/hadoop-0.20/conf.cluster/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/var/lib/hadoop/dfs/nn</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/var/lib/hadoop/dfs/dn</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
</property>
</configuration>
/etc/hadoop-0.20/conf.cluster/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-master:8021</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/var/lib/hadoop/dfs/mapred/local</value>
</property>
</configuration>
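As a side note, the dfs.block.size value set in hdfs-site.xml above is in bytes; 134217728 is 128 MB:

```shell
# 128 MB expressed in bytes, matching the dfs.block.size value above.
echo $((128 * 1024 * 1024))   # prints 134217728
```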
- Create the required directories:
$ sudo mkdir -p /var/lib/hadoop/cache
$ sudo chown hdfs:hadoop /var/lib/hadoop/cache
$ sudo chmod 1777 /var/lib/hadoop/cache
$ sudo mkdir -p /var/lib/hadoop/dfs/nn
$ sudo mkdir -p /var/lib/hadoop/dfs/dn
$ sudo chown -R hdfs:hadoop /var/lib/hadoop/dfs
$ sudo mkdir -p /var/lib/hadoop/dfs/mapred/local
$ sudo chown -R mapred:hadoop /var/lib/hadoop/dfs/mapred
$ sudo chmod -R 775 /var/lib/hadoop/dfs
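The resulting modes can be sanity-checked with stat. The sketch below recreates the same layout under a temporary root so it can be tried without sudo; the scratch paths are stand-ins for the real /var/lib/hadoop tree:

```shell
# Recreate the directory layout under a scratch root and verify the modes.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/cache" "$ROOT/dfs/nn" "$ROOT/dfs/dn" "$ROOT/dfs/mapred/local"
chmod 1777 "$ROOT/cache"     # world-writable with the sticky bit, like /tmp
chmod -R 775 "$ROOT/dfs"
stat -c '%a' "$ROOT/cache"   # prints 1777
stat -c '%a' "$ROOT/dfs/nn"  # prints 775
rm -rf "$ROOT"
```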
- Register an alternatives entry so that conf.cluster becomes the active configuration:
$ sudo alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.empty
/etc/hadoop-0.20/conf.empty - priority 10
Current `best' version is /etc/hadoop-0.20/conf.empty.
$ sudo alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster 50
$ sudo alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.cluster
$ sudo alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is manual.
link currently points to /etc/hadoop-0.20/conf.cluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.cluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.cluster.
- Add entries to /etc/hosts so that the nodes can reach one another by hostname.
/etc/hosts
192.168.121.11 hadoop-master
192.168.121.21 hadoop-slave
192.168.121.22 hadoop-slave2
192.168.121.101 hadoop-client
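Before starting any daemons, it is worth confirming that every cluster hostname resolves. One way (a sketch, run on any node after editing /etc/hosts):

```shell
# Check that each cluster hostname resolves; getent consults /etc/hosts as
# well as DNS, printing the resolved address or flagging a failed lookup.
for h in hadoop-master hadoop-slave hadoop-slave2 hadoop-client; do
    getent hosts "$h" || echo "unresolved: $h"
done
```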
On the master only
- Format HDFS:
$ sudo su - hdfs
$ hadoop namenode -format
- Start the services
On the master:
Start the namenode and jobtracker.
$ sudo /etc/init.d/hadoop-0.20-namenode start
$ sudo /etc/init.d/hadoop-0.20-jobtracker start
Confirm that the daemons are running:
$ sudo /usr/java/default/bin/jps
XXXXX Jps
XXXXX JobTracker
XXXXX NameNode
On the slaves:
Start the datanode and tasktracker.
$ sudo /etc/init.d/hadoop-0.20-datanode start
$ sudo /etc/init.d/hadoop-0.20-tasktracker start
Confirm that the daemons are running:
$ sudo /usr/java/default/bin/jps
XXXXX Jps
XXXXX TaskTracker
XXXXX DataNode
Verifying the cluster
Run the following sample job on hadoop-client.
$ sudo su - hdfs
$ hadoop jar hadoop-examples-0.20.2-cdh3u6.jar pi 1 300
Number of Maps = 1
Samples per Map = 300
Wrote input for Map #0
Starting Job
14/09/08 23:49:35 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/08 23:49:36 INFO mapred.JobClient: Running job: job_201409082238_0003
14/09/08 23:49:37 INFO mapred.JobClient: map 0% reduce 0%
14/09/08 23:49:44 INFO mapred.JobClient: map 100% reduce 0%
14/09/08 23:49:52 INFO mapred.JobClient: map 100% reduce 33%
14/09/08 23:49:54 INFO mapred.JobClient: map 100% reduce 100%
14/09/08 23:49:55 INFO mapred.JobClient: Job complete: job_201409082238_0003
14/09/08 23:49:55 INFO mapred.JobClient: Counters: 27
14/09/08 23:49:55 INFO mapred.JobClient: Job Counters
14/09/08 23:49:55 INFO mapred.JobClient: Launched reduce tasks=1
14/09/08 23:49:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=8058
14/09/08 23:49:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/09/08 23:49:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/09/08 23:49:55 INFO mapred.JobClient: Rack-local map tasks=1
14/09/08 23:49:55 INFO mapred.JobClient: Launched map tasks=1
14/09/08 23:49:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=9537
14/09/08 23:49:55 INFO mapred.JobClient: FileSystemCounters
14/09/08 23:49:55 INFO mapred.JobClient: FILE_BYTES_READ=28
14/09/08 23:49:55 INFO mapred.JobClient: HDFS_BYTES_READ=243
14/09/08 23:49:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=109744
14/09/08 23:49:55 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=215
14/09/08 23:49:55 INFO mapred.JobClient: Map-Reduce Framework
14/09/08 23:49:55 INFO mapred.JobClient: Map input records=1
14/09/08 23:49:55 INFO mapred.JobClient: Reduce shuffle bytes=28
14/09/08 23:49:55 INFO mapred.JobClient: Spilled Records=4
14/09/08 23:49:55 INFO mapred.JobClient: Map output bytes=18
14/09/08 23:49:55 INFO mapred.JobClient: CPU time spent (ms)=2040
14/09/08 23:49:55 INFO mapred.JobClient: Total committed heap usage (bytes)=176230400
14/09/08 23:49:55 INFO mapred.JobClient: Map input bytes=24
14/09/08 23:49:55 INFO mapred.JobClient: Combine input records=0
14/09/08 23:49:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
14/09/08 23:49:55 INFO mapred.JobClient: Reduce input records=2
14/09/08 23:49:55 INFO mapred.JobClient: Reduce input groups=2
14/09/08 23:49:55 INFO mapred.JobClient: Combine output records=0
14/09/08 23:49:55 INFO mapred.JobClient: Physical memory (bytes) snapshot=281022464
14/09/08 23:49:55 INFO mapred.JobClient: Reduce output records=0
14/09/08 23:49:55 INFO mapred.JobClient: Virtual memory (bytes) snapshot=1429348352
14/09/08 23:49:55 INFO mapred.JobClient: Map output records=2
Job Finished in 19.828 seconds
Estimated value of Pi is 3.16000000000000000000
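The pi example is a Monte Carlo estimator: it scatters sample points over a unit square, counts how many fall inside the inscribed circle, and computes 4 x (points inside) / (total points). With 300 samples, an estimate of 3.16 corresponds to 237 hits (a derived figure; the job output does not report the hit count directly):

```shell
# pi ~= 4 * inside / total; 237 of 300 points inside yields the 3.16 above.
inside=237
total=300
awk -v i="$inside" -v t="$total" 'BEGIN { printf "%.2f\n", 4 * i / t }'   # prints 3.16
```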