Trying out Hadoop a little (single-Mac setup)

Posted at 2015-06-18


Installation

brew install hadoop
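If the install succeeded, the hadoop command should be on the PATH. A quick sanity check (version output will vary with the installed release):

$ hadoop version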

Sample programs

$ find /usr/local/Cellar/hadoop | grep jar | grep example
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-test-sources.jar
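The version number in these paths will differ per install. A version-agnostic way to locate the examples jar, assuming Homebrew's usual layout:

$ ls $(brew --prefix hadoop)/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar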
Listing the available examples
$ hadoop jar \
  /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
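Each example prints a short usage message when run without its required arguments, which is a quick way to discover the expected parameters. For instance (exact wording varies by version):

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount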

Running pi

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 5 200
(output truncated)
    File System Counters
        FILE: Number of bytes read=1656432
        FILE: Number of bytes written=3269077
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=5
        Map output records=10
        Map output bytes=90
        Map output materialized bytes=140
        Input split bytes=640
        Combine input records=0
        Combine output records=0
        Reduce input groups=2
        Reduce shuffle bytes=140
        Reduce input records=10
        Reduce output records=0
        Spilled Records=20
        Shuffled Maps =5
        Failed Shuffles=0
        Merged Map outputs=5
        GC time elapsed (ms)=23
        Total committed heap usage (bytes)=2572681216
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=650
    File Output Format Counters
        Bytes Written=109
Job Finished in 1.938 seconds
Estimated value of Pi is 3.14800000000000000000

That's what a run looks like.
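The two arguments to pi are the number of map tasks and the number of samples per map, so the run above used 5 × 200 = 1,000 samples to arrive at 3.148. More samples should converge closer to π; a plausible heavier run:

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 10 100000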

wordcount

Preparing the text: two small files, a and b

$ cat a
a a b b b c b c
$ cat b
ab a b b c c c
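A minimal way to create these two files with exactly that content:

$ echo 'a a b b b c b c' > a
$ echo 'ab a b b c c c' > b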

Analyzing the text

Upload the text files:
hadoop fs -put a /tmp/input/
hadoop fs -put b /tmp/input/
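If the put commands fail because /tmp/input does not exist yet, create it first (a step not shown in the original):

$ hadoop fs -mkdir -p /tmp/input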
Confirm the files were uploaded:
hadoop fs -ls -R /tmp/input/
Run wordcount (the jar is referenced by a relative path here, so run this from the directory containing the jar, or reuse the full path from above):
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/
Check the results. Because this unconfigured single-machine setup runs in standalone mode, hadoop fs operates on the local filesystem, which is why plain cat works here:
$ cat /tmp/output/part-r-00000
a   3
ab  1
b   6
c   5
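As a sanity check, the same counts can be reproduced without Hadoop at all (assuming a and b are still in the current directory; the counts should match the output above):

$ cat a b | tr ' ' '\n' | sort | uniq -c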

Analyzing again: this time the input is a text dump of a web page. The output directory has to be removed first, since MapReduce jobs refuse to overwrite an existing one.

w3m -dump "http://www.casleyconsulting.co.jp/blog-engineer/分散処理/分散処理に入門してみた(hadoop-spark)/" > a
hadoop fs -put -f a /tmp/input/
rm -r /tmp/output
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/
Check the result, sorted by count:
cat /tmp/output/part-r-00000  | sort -k2 -r -n | more
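A variation that shows only the 20 most frequent tokens (head instead of more; -k2,2 pins the sort key to the count column alone):

$ sort -k2,2 -n -r /tmp/output/part-r-00000 | head -n 20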