Introduction
This article describes how to build a Hadoop cluster using CDH (Cloudera's Distribution Including Apache Hadoop) 4.
Environment
- CentOS 6.5
- CDH 4.7.0
- jdk 1.7.0_55
Cluster layout
Role | Hostname | IP address |
---|---|---|
master | hadoop-master | 192.168.122.11 |
slave | hadoop-slave | 192.168.122.21 |
slave | hadoop-slave2 | 192.168.122.22 |
client | hadoop-client | 192.168.122.101 |
Installing the JDK
CDH4 is apparently certified to run on JDK 1.7.0_55, so we install that version.
$ sudo yum localinstall jdk-7u55-linux-x64.rpm
$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
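If Hadoop later has trouble locating this JDK, pointing JAVA_HOME at it explicitly usually resolves the problem. A minimal sketch, assuming the Oracle RPM installed to its default path of /usr/java/jdk1.7.0_55 (check with ls /usr/java):
$ export JAVA_HOME=/usr/java/jdk1.7.0_55
$ export PATH=$JAVA_HOME/bin:$PATH
The same export can also go into /etc/hadoop/conf.cluster/hadoop-env.sh once that file exists (see the hadoop-env.sh step below).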
Installing CDH4
Adding the yum repository
$ wget http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.x86_64.rpm
$ sudo yum localinstall cloudera-cdh-4-0.x86_64.rpm
$ sudo yum clean all
$ yum repolist
repo id repo name status
base CentOS-6 - Base 6,367
cloudera-cdh4 Cloudera's Distribution for Hadoop, Version 4 110
extras CentOS-6 - Extras 15
updates CentOS-6 - Updates 1,487
repolist: 7,979
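Optionally, import Cloudera's GPG key so that yum can verify package signatures. The key URL below is inferred from the layout of the archive.cloudera.com repository added above, so confirm it against Cloudera's documentation before relying on it:
$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera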
Network configuration
- Add the cluster hostnames to /etc/hosts so that each node can resolve the others.
/etc/hosts
192.168.122.11 hadoop-master
192.168.122.21 hadoop-slave
192.168.122.22 hadoop-slave2
192.168.122.101 hadoop-client
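As a quick sanity check, confirm on every node that the cluster hostnames resolve; getent consults /etc/hosts, so each lookup should return the addresses listed above:
$ getent hosts hadoop-master hadoop-slave hadoop-slave2 hadoop-client
192.168.122.11  hadoop-master
192.168.122.21  hadoop-slave
192.168.122.22  hadoop-slave2
192.168.122.101 hadoop-client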
Configuring HDFS
- Install the packages
On the master:
$ sudo yum install hadoop hadoop-hdfs hadoop-hdfs-namenode
$ sudo yum install hadoop-yarn hadoop-yarn-resourcemanager
$ sudo yum install hadoop-mapreduce hadoop-mapreduce-historyserver
On the slaves:
$ sudo yum install hadoop hadoop-hdfs hadoop-hdfs-datanode
$ sudo yum install hadoop-yarn hadoop-yarn-nodemanager
$ sudo yum install hadoop-mapreduce
On the client:
$ sudo yum install hadoop hadoop-hdfs hadoop-mapreduce hadoop-yarn hadoop-client
- Copy the template configuration directory
(all cluster nodes)
$ alternatives --display hadoop-conf
hadoop-conf - status is auto.
link currently points to /etc/hadoop/conf.empty
/etc/hadoop/conf.empty - priority 10
Current `best' version is /etc/hadoop/conf.empty.
$ sudo cp -rp /etc/hadoop/conf.empty /etc/hadoop/conf.cluster
$ sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cluster 50
$ sudo alternatives --set hadoop-conf /etc/hadoop/conf.cluster
$ alternatives --display hadoop-conf
hadoop-conf - status is manual.
link currently points to /etc/hadoop/conf.cluster
/etc/hadoop/conf.empty - priority 10
/etc/hadoop/conf.cluster - priority 50
Current `best' version is /etc/hadoop/conf.cluster.
- Configure core-site.xml
/etc/hadoop/conf.cluster/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-master:8020</value>
</property>
</configuration>
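fs.defaultFS tells every Hadoop command which NameNode to contact, so HDFS paths such as /tmp resolve against hdfs://hadoop-master:8020. A sketch of how to confirm the setting is being picked up (assuming your hdfs binary supports the getconf -confKey option):
$ hdfs getconf -confKey fs.defaultFS
hdfs://hadoop-master:8020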
- Configure hdfs-site.xml
/etc/hadoop/conf.cluster/hdfs-site.xml
<configuration>
<property>
<name>dfs.permissions.superusergroup</name>
<value>hadoop</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/lib/hadoop-hdfs/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/lib/hadoop-hdfs/dn</value>
</property>
</configuration>
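One caveat: this cluster has only two DataNodes, while the default block replication factor is 3 (visible as defaultReplication = 3 in the format log below), so every block will be reported as under-replicated. If that matters to you, a hedge is to add an explicit dfs.replication property, for example:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>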
- Create the required directories.
$ sudo mkdir -p /var/lib/hadoop-hdfs/nn /var/lib/hadoop-hdfs/dn
$ sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/nn /var/lib/hadoop-hdfs/dn
$ sudo chmod 700 /var/lib/hadoop-hdfs/nn /var/lib/hadoop-hdfs/dn
- Format the filesystem.
$ sudo -u hdfs hdfs namenode -format
14/09/12 06:48:11 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.122.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.0.0-cdh4.7.0
STARTUP_MSG: classpath = /etc/hadoop/conf:..(省略)..
STARTUP_MSG: java = 1.7.0_55
************************************************************/
14/09/12 06:48:11 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
14/09/12 06:48:12 WARN common.Util: Path /var/lib/hadoop-hdfs/nn should be specified as a URI in configuration files. Please update hdfs configuration.
14/09/12 06:48:12 WARN common.Util: Path /var/lib/hadoop-hdfs/nn should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-12509087-4dbc-4977-94ae-134068d9a02f
14/09/12 06:48:12 INFO namenode.FSNamesystem: fsLock is fair:true
14/09/12 06:48:12 INFO blockmanagement.HeartbeatManager: Setting heartbeat recheck interval to 30000 since dfs.namenode.stale.datanode.interval is less than dfs.namenode.heartbeat.recheck-interval
14/09/12 06:48:12 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
14/09/12 06:48:12 INFO util.GSet: Computing capacity for map BlocksMap
14/09/12 06:48:12 INFO util.GSet: VM type = 64-bit
14/09/12 06:48:12 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
14/09/12 06:48:12 INFO util.GSet: capacity = 2^21 = 2097152 entries
14/09/12 06:48:12 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
14/09/12 06:48:12 INFO blockmanagement.BlockManager: defaultReplication = 3
14/09/12 06:48:12 INFO blockmanagement.BlockManager: maxReplication = 512
14/09/12 06:48:12 INFO blockmanagement.BlockManager: minReplication = 1
14/09/12 06:48:12 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
14/09/12 06:48:12 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
14/09/12 06:48:12 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
14/09/12 06:48:12 INFO blockmanagement.BlockManager: encryptDataTransfer = false
14/09/12 06:48:12 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
14/09/12 06:48:13 INFO namenode.FSNamesystem: fsOwner = hdfs (auth:SIMPLE)
14/09/12 06:48:13 INFO namenode.FSNamesystem: supergroup = hadoop
14/09/12 06:48:13 INFO namenode.FSNamesystem: isPermissionEnabled = true
14/09/12 06:48:13 INFO namenode.FSNamesystem: HA Enabled: false
14/09/12 06:48:13 INFO namenode.FSNamesystem: Append Enabled: true
14/09/12 06:48:13 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/09/12 06:48:13 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
14/09/12 06:48:13 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
14/09/12 06:48:13 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
14/09/12 06:48:13 INFO namenode.NNStorage: Storage directory /var/lib/hadoop-hdfs/nn has been successfully formatted.
14/09/12 06:48:13 INFO namenode.FSImage: Saving image file /var/lib/hadoop-hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
14/09/12 06:48:13 INFO namenode.FSImage: Image file of size 115 saved in 0 seconds.
14/09/12 06:48:13 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
14/09/12 06:48:13 INFO util.ExitUtil: Exiting with status 0
14/09/12 06:48:13 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.122.11
************************************************************/
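The two WARN lines above complain that dfs.namenode.name.dir is given as a bare path rather than a URI. The format still succeeds, but the warning can be silenced by writing the directory values in hdfs-site.xml with an explicit file:// scheme, e.g.:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///var/lib/hadoop-hdfs/nn</value>
</property>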
Configuring MapReduce2 (YARN)
- Configure mapred-site.xml
/etc/hadoop/conf.cluster/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop-master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop-master:19888</value>
</property>
</configuration>
- Configure yarn-site.xml
/etc/hadoop/conf.cluster/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoop-master:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8089</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/var/lib/hadoop-yarn/nm/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/log/hadoop-yarn/nm</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>hdfs://hadoop-master:8020/var/log/hadoop-yarn/apps</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/user</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,
$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
$YARN_HOME/*,$YARN_HOME/lib/*
</value>
</property>
</configuration>
- Create the required directories.
$ sudo mkdir -p /var/lib/hadoop-yarn/nm/local /var/log/hadoop-yarn/nm
$ sudo chown -R yarn:yarn /var/lib/hadoop-yarn/nm/local /var/log/hadoop-yarn/nm
- Set up the shell environment for the service users.
$ sudo cp -p /etc/skel/.bash* /var/lib/hadoop-hdfs
$ sudo chown hdfs:hdfs /var/lib/hadoop-hdfs/.bash*
$ sudo cp -p /etc/skel/.bash* /var/lib/hadoop-mapreduce
$ sudo chown mapred:mapred /var/lib/hadoop-mapreduce/.bash*
$ sudo cp -p /etc/skel/.bash* /var/lib/hadoop-yarn
$ sudo chown yarn:yarn /var/lib/hadoop-yarn/.bash*
- Configure hadoop-env.sh
$ sudo sh -c "echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> /etc/hadoop/conf.cluster/hadoop-env.sh"
- Start the NameNode.
(hadoop-master only)
$ sudo service hadoop-hdfs-namenode start
- Start the DataNodes.
(hadoop-slave, hadoop-slave2)
$ sudo service hadoop-hdfs-datanode start
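Before going further, it is worth confirming that both DataNodes have registered with the NameNode:
$ sudo -u hdfs hdfs dfsadmin -report
The report should show two live DataNodes (hadoop-slave and hadoop-slave2).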
- Create the system directories.
$ sudo su - hdfs
$ hadoop fs -mkdir /tmp
$ hadoop fs -chmod -R 1777 /tmp
$ hadoop fs -mkdir /user/history
$ hadoop fs -chmod -R 1777 /user/history
$ hadoop fs -chown mapred:hadoop /user/history
$ hadoop fs -mkdir /var/log/hadoop-yarn
$ hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
$ hadoop fs -ls -R /
drwxrwxrwt - hdfs hadoop 0 2014-09-13 14:30 /tmp
drwxr-xr-x - hdfs hadoop 0 2014-09-13 14:33 /user
drwxrwxrwt - mapred hadoop 0 2014-09-13 14:33 /user/history
drwxr-xr-x - hdfs hadoop 0 2014-09-13 14:33 /var
drwxr-xr-x - hdfs hadoop 0 2014-09-13 14:33 /var/log
drwxr-xr-x - yarn mapred 0 2014-09-13 14:33 /var/log/hadoop-yarn
- Start the ResourceManager.
(hadoop-master)
$ sudo service hadoop-yarn-resourcemanager start
- Start the NodeManagers.
(hadoop-slave, hadoop-slave2)
$ sudo service hadoop-yarn-nodemanager start
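To confirm that both NodeManagers have registered with the ResourceManager, you can list the cluster nodes (a sketch; both slaves should appear with a RUNNING state):
$ sudo -u yarn yarn node -list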
- Start the JobHistory Server.
(hadoop-master)
$ sudo service hadoop-mapreduce-historyserver start
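Each daemon also exposes a web UI that is handy for eyeballing cluster state: the ResourceManager at http://hadoop-master:8089 and the JobHistory Server at http://hadoop-master:19888 (both ports as configured above), plus the NameNode at its default port, http://hadoop-master:50070.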
Verification
- Working with HDFS
Operate as the hdfs user.
$ sudo su - hdfs
$ hadoop fs -mkdir /user/hdfs
$ hadoop fs -mkdir input
$ hadoop fs -ls /user/hdfs
Found 1 items
drwxr-xr-x - hdfs hadoop 0 2014-09-13 18:59 /user/hdfs/input
$ hostname > hostname.txt
$ hadoop fs -put hostname.txt input
$ hadoop fs -ls input
Found 1 items
-rw-r--r-- 3 hdfs hadoop 16 2014-09-13 18:59 input/hostname.txt
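You can read the file back to confirm the round trip; it should print the hostname written above:
$ hadoop fs -cat input/hostname.txt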
- Run a sample program.
Operate as the mapred user.
$ sudo su - mapred
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 5 300
Number of Maps = 5
Samples per Map = 300
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
14/09/13 19:31:43 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited.
14/09/13 19:31:43 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started.
14/09/13 19:31:44 INFO input.FileInputFormat: Total input paths to process : 5
14/09/13 19:31:44 INFO mapreduce.JobSubmitter: number of splits:5
14/09/13 19:31:44 WARN conf.Configuration: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/09/13 19:31:44 WARN conf.Configuration: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
14/09/13 19:31:44 WARN conf.Configuration: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/09/13 19:31:44 WARN conf.Configuration: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/09/13 19:31:44 WARN conf.Configuration: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/09/13 19:31:44 WARN conf.Configuration: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/09/13 19:31:44 WARN conf.Configuration: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/09/13 19:31:44 WARN conf.Configuration: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/09/13 19:31:44 WARN conf.Configuration: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/09/13 19:31:44 WARN conf.Configuration: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/09/13 19:31:44 WARN conf.Configuration: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/09/13 19:31:44 WARN conf.Configuration: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/09/13 19:31:44 WARN conf.Configuration: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/09/13 19:31:44 WARN conf.Configuration: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/09/13 19:31:44 WARN conf.Configuration: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/09/13 19:31:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410586655016_0005
14/09/13 19:31:45 INFO client.YarnClientImpl: Submitted application application_1410586655016_0005 to ResourceManager at hadoop-master/192.168.122.11:8032
14/09/13 19:31:45 INFO mapreduce.Job: The url to track the job: http://hadoop-master:8089/proxy/application_1410586655016_0005/
14/09/13 19:31:45 INFO mapreduce.Job: Running job: job_1410586655016_0005
14/09/13 19:32:03 INFO mapreduce.Job: Job job_1410586655016_0005 running in uber mode : false
14/09/13 19:32:03 INFO mapreduce.Job: map 0% reduce 0%
14/09/13 19:33:15 INFO mapreduce.Job: map 20% reduce 0%
14/09/13 19:33:16 INFO mapreduce.Job: map 40% reduce 0%
14/09/13 19:33:17 INFO mapreduce.Job: map 100% reduce 0%
14/09/13 19:33:28 INFO mapreduce.Job: map 100% reduce 100%
14/09/13 19:33:28 INFO mapreduce.Job: Job job_1410586655016_0005 completed successfully
14/09/13 19:33:28 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=116
FILE: Number of bytes written=440147
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1315
HDFS: Number of bytes written=215
HDFS: Number of read operations=23
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=5
Launched reduce tasks=1
Data-local map tasks=5
Total time spent by all maps in occupied slots (ms)=363619
Total time spent by all reduces in occupied slots (ms)=11396
Map-Reduce Framework
Map input records=5
Map output records=10
Map output bytes=90
Map output materialized bytes=140
Input split bytes=725
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=140
Reduce input records=10
Reduce output records=0
Spilled Records=20
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=6962
CPU time spent (ms)=19280
Physical memory (bytes) snapshot=761933824
Virtual memory (bytes) snapshot=3770851328
Total committed heap usage (bytes)=619794432
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=590
File Output Format Counters
Bytes Written=97
Job Finished in 105.013 seconds
Estimated value of Pi is 3.15200000000000000000
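As a further end-to-end check, you can run the wordcount example against the hostname.txt uploaded earlier. A sketch, run as the hdfs user so that the relative input path resolves under /user/hdfs (the output directory must not exist yet):
$ sudo su - hdfs
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input output
$ hadoop fs -cat output/part-r-00000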