Trying out Hadoop a little (single-Mac setup)


Installation

brew install hadoop
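If the formula installed cleanly, the hadoop command should already be on the PATH. A quick sanity check (the version reported will depend on what Homebrew pulled; it was 2.7.0 here):

$ hadoop version
$ which hadoop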

Sample programs

$ find /usr/local/Cellar/hadoop | grep jar | grep example
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-test-sources.jar
Checking the examples
$ hadoop jar \
  /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
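Giving the jar just a program name and no further arguments should print that example's one-line usage and exit, which is a handy way to see what each one expects. For instance:

$ hadoop jar \
  /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar \
  wordcount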

Running pi

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 5 200
(output omitted)
    File System Counters
        FILE: Number of bytes read=1656432
        FILE: Number of bytes written=3269077
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=10
        Map output bytes=90
        Map output materialized bytes=140
        Input split bytes=640
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=140
        Reduce input records=10
        Reduce output records=0
        Spilled Records=20
        Shuffled Maps =5
        Failed Shuffles=0
        Merged Map outputs=5
        GC time elapsed (ms)=23
        Total committed heap usage (bytes)=2572681216
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=650
    File Output Format Counters
        Bytes Written=109
Job Finished in 1.938 seconds
Estimated value of Pi is 3.14800000000000000000

That's roughly how it looks.
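The two arguments to pi are the number of map tasks and the number of samples per map, so increasing the sample count should tighten the estimate at the cost of a longer run. For example:

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 5 100000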

wordcount

Preparing the text

Two small files, a and b:

$ cat a
a a b b b c b c
$ cat b
ab a b b c c c
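One way to create the two files, as a minimal sketch (any editor works just as well):

$ cat > a <<'EOF'
a a b b b c b c
EOF
$ cat > b <<'EOF'
ab a b b c c c
EOF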

Analyzing the text

Upload the text files
hadoop fs -put a /tmp/input/
hadoop fs -put b /tmp/input/
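If put complains that the target directory does not exist, it can be created first (an extra step that may or may not be needed depending on the setup):

hadoop fs -mkdir -p /tmp/input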
Confirm the files were uploaded.
hadoop fs -ls -R /tmp/input/
Run the analysis (the jar is referenced by file name here, so run this from the directory that contains it, or use the full Cellar path from earlier)
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/
Check the analysis results
$ cat /tmp/output/part-r-00000
a   3
ab  1
b   6
c   5
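With the out-of-the-box single-machine configuration no HDFS daemon is running and hadoop fs simply operates on the local filesystem, which is why a plain cat works on the output. The same check through the hadoop CLI would be:

$ hadoop fs -cat /tmp/output/part-r-00000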

Analyzing once more

This time the input is a web page dumped to plain text with w3m. The existing output directory has to be removed first, since the job won't write into a directory that already exists.

w3m -dump "http://www.casleyconsulting.co.jp/blog-engineer/分散処理/分散処理に入門してみた(hadoop-spark)/" > a
hadoop fs -put -f a /tmp/input/
rm -r /tmp/output
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/
Check the result
cat /tmp/output/part-r-00000 | sort -k2 -nr | more
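To pull out just the most frequent words, and to clean up the working directories afterwards (still assuming the local-filesystem defaults used throughout):

$ sort -k2 -nr /tmp/output/part-r-00000 | head -20
$ hadoop fs -rm -r /tmp/input /tmp/output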