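References
- きりんさん日記 (Kirin-san's diary): Trying out a single-node Hadoop setup in 30 minutes (CentOS + Cloudera)
- Practical Hadoop starting from text mining (1): The introduction to Hadoop and text mining you were afraid to ask for (3/3) - @IT
Installation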
brew install hadoop
Sample programs
Locate the example jars that ship with the Homebrew install:
$ find /usr/local/Cellar/hadoop | grep jar | grep example
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-test-sources.jar
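Listing the available examples
Running the examples jar without a program name prints the list of valid programs: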
$ hadoop jar \
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Running pi
The two arguments are the number of map tasks (5) and the number of samples per map (200).
$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 5 200
(output omitted)
File System Counters
FILE: Number of bytes read=1656432
FILE: Number of bytes written=3269077
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=10
Map output bytes=90
Map output materialized bytes=140
Input split bytes=640
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=140
Reduce input records=10
Reduce output records=0
Spilled Records=20
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=23
Total committed heap usage (bytes)=2572681216
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=650
File Output Format Counters
Bytes Written=109
Job Finished in 1.938 seconds
Estimated value of Pi is 3.14800000000000000000
That is roughly what a run looks like.
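With 5 maps and 200 samples each, the job evaluates 1000 sample points in total. As a rough local illustration of the quasi-Monte Carlo idea named in the example list, the sketch below draws points from a Halton sequence over the unit square and counts how many fall inside the inscribed circle. The function name estimate_pi and the choice of bases 2 and 3 are assumptions made for this sketch; the indexing used by the Hadoop example may differ, so the result is not expected to match 3.148 digit for digit.

```shell
# Hypothetical helper (not part of Hadoop): quasi-Monte Carlo estimate of Pi.
# $1 = number of sample points.
estimate_pi() {
  awk -v n="$1" '
    # Radical inverse of i in the given base: the i-th Halton value in [0, 1).
    function halton(i, base,    f, r) {
      f = 1; r = 0
      while (i > 0) {
        f = f / base
        r = r + f * (i % base)
        i = int(i / base)
      }
      return r
    }
    BEGIN {
      inside = 0
      for (i = 1; i <= n; i++) {
        # Shift the point so the circle of radius 0.5 is centered at the origin.
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if (x * x + y * y <= 0.25) inside++
      }
      # Area ratio circle/square is Pi/4.
      printf "%.6f\n", 4 * inside / n
    }'
}

# Same total number of points as "pi 5 200" (5 * 200).
estimate_pi 1000
```

Raising the sample count tightens the estimate, which is also what increasing the second argument of the Hadoop job does.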
wordcount
Preparing the text
Create two small files, a and b, in the current directory.
File a:
a a b b b c b c
File b:
ab a b b c c c
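One way to create the two input files from the shell, as a minimal sketch. It writes a and b into the current directory and overwrites any existing files with those names.

```shell
# Create the two sample input files in the current directory.
printf 'a a b b b c b c\n' > a
printf 'ab a b b c c c\n' > b

# Show the contents of both files.
cat a b
```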
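Text analysis
Upload the text (create the input directory first if it does not exist yet)
hadoop fs -mkdir -p /tmp/input
hadoop fs -put a /tmp/input/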
hadoop fs -put b /tmp/input/
Confirm that the text was uploaded.
hadoop fs -ls -R /tmp/input/
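Analyze the text (run this from the directory that contains the examples jar, or give the full path shown earlier)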
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
wordcount /tmp/input/ /tmp/output/
Check the result of the analysis
$ cat /tmp/output/part-r-00000
a 3
ab 1
b 6
c 5
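The counts above can be cross-checked locally without Hadoop. The sketch below splits the same two sample texts on whitespace and tallies each word with standard Unix tools. The helper name count_words is hypothetical, and the temporary directory only keeps the sketch self-contained.

```shell
# Hypothetical helper: count whitespace-separated words across the given files.
count_words() {
  cat "$@" \
    | tr -s '[:space:]' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Recreate the two sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'a a b b b c b c\n' > "$workdir/a"
printf 'ab a b b c c c\n' > "$workdir/b"

count_words "$workdir/a" "$workdir/b"
# prints:
# a 3
# ab 1
# b 6
# c 5

rm -r "$workdir"
```

This is the same split, group, and sum pattern as the MapReduce job: tr plays the role of the mapper, sort does the shuffle, and uniq -c acts as the reducer.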
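Analyzing again
This time the input is a real web page. The previous output directory has to be removed first, because the job fails if the output directory already exists.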
w3m -dump "http://www.casleyconsulting.co.jp/blog-engineer/分散処理/分散処理に入門してみた(hadoop-spark)/" > a
hadoop fs -put -f a /tmp/input/
rm -r /tmp/output
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
wordcount /tmp/input/ /tmp/output/
Check the result, most frequent words first
cat /tmp/output/part-r-00000 | sort -k2 -r -n | more
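To see what the sort options do, here is a small sketch applied to the earlier four-line result instead of the real output. The helper name top_words is hypothetical. -k2 selects the count column as the sort key, -n compares it numerically, and -r reverses the order so the largest counts come first.

```shell
# Hypothetical helper: keep the N most frequent words from "word count" lines on stdin.
top_words() {
  sort -k2 -r -n | head -n "$1"
}

printf 'a 3\nab 1\nb 6\nc 5\n' | top_words 2
# prints:
# b 6
# c 5
```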