
Trying out Hadoop a little (single Mac setup)


Last updated at 2017-04-19, posted at 2015-06-18

References

Installation

brew install hadoop

Sample programs

$ find /usr/local/Cellar/hadoop | grep jar | grep example
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/lib-examples/hsqldb-2.0.0.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-sources.jar
/usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.7.0-test-sources.jar

Checking the examples

$ hadoop jar \
  /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

Running pi

$ hadoop jar /usr/local/Cellar/hadoop/2.7.0/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar pi 5 200
(omitted)
	File System Counters
		FILE: Number of bytes read=1656432
		FILE: Number of bytes written=3269077
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=5
		Map output records=10
		Map output bytes=90
		Map output materialized bytes=140
		Input split bytes=640
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=140
		Reduce input records=10
		Reduce output records=0
		Spilled Records=20
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=23
		Total committed heap usage (bytes)=2572681216
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=650
	File Output Format Counters
		Bytes Written=109
Job Finished in 1.938 seconds
Estimated value of Pi is 3.14800000000000000000

That is what the output looks like.
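The arguments `5 200` request 5 map tasks with 200 sample points each, so the estimate above is based on 1000 points. The following is a minimal sketch of the idea behind a quasi-Monte Carlo estimate, not Hadoop's actual implementation: it generates points from a Halton sequence (bases 2 and 3 are an assumption made here) and counts how many fall inside a circle inscribed in the unit square. The names `estimate_pi` and `halton` are made up for this illustration.

```shell
#!/usr/bin/env bash
# Estimate Pi from n quasi-random points in the unit square.
# estimate_pi and halton are illustrative names, not part of Hadoop.
estimate_pi() {
  awk -v n="$1" '
    # Radical inverse of i in the given base (one coordinate of a Halton point).
    function halton(i, base,    f, r) {
      f = 1; r = 0
      while (i > 0) {
        f = f / base
        r = r + f * (i % base)
        i = int(i / base)
      }
      return r
    }
    BEGIN {
      inside = 0
      for (i = 1; i <= n; i++) {
        # Shift the point so the circle is centered at the origin.
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        # Count points that fall inside the circle of radius 0.5.
        if (x * x + y * y <= 0.25) inside++
      }
      # Area ratio circle/square is Pi/4.
      printf "%.4f\n", 4 * inside / n
    }'
}

estimate_pi 1000   # same total as 5 maps x 200 samples
```

The printed value will not necessarily match Hadoop's 3.148 digit for digit, since the point sequence and the way it is split across map tasks may differ.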

wordcount

Reference: 「分散処理に入門してみた（Hadoop + Spark）」 (an introduction to distributed processing with Hadoop + Spark), Casley Consulting tech blog

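Preparing the text

Create two files named a and b (the names used by the -put commands in the transfer step).

a:
a a b b b c b c

b:
ab a b b c c c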

Text analysis

Transferring the text

# create the target directory first if it does not exist yet
hadoop fs -mkdir -p /tmp/input/
hadoop fs -put a /tmp/input/
hadoop fs -put b /tmp/input/

Confirm that the text files were transferred.

hadoop fs -ls -R /tmp/input/

Analyzing the text

hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/

Checking the analysis results

$ cat /tmp/output/part-r-00000
a	3
ab	1
b	6
c	5
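As a sanity check, the same counts can be reproduced without Hadoop using standard shell tools. This is only a sketch that mirrors the map, shuffle, and reduce stages on a single machine; it assumes words are separated by whitespace, and `local_wordcount` and the temporary directory are names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Recreate the two sample inputs in a temporary directory (paths are illustrative).
workdir=$(mktemp -d)
printf 'a a b b b c b c\n' > "$workdir/a"
printf 'ab a b b c c c\n' > "$workdir/b"

# local_wordcount is a hypothetical helper, not part of Hadoop.
# "map": emit one word per line; "shuffle": sort; "reduce": count each group.
local_wordcount() {
  cat "$@" \
    | awk '{ for (i = 1; i <= NF; i++) print $i }' \
    | sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

local_wordcount "$workdir/a" "$workdir/b"
# a	3
# ab	1
# b	6
# c	5
```

File a contributes a=2, b=4, c=2 and file b contributes ab=1, a=1, b=2, c=3, which adds up to the totals Hadoop reported.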

Analyzing again

w3m -dump "http://www.casleyconsulting.co.jp/blog-engineer/分散処理/分散処理に入門してみた（hadoop-spark）/" > a
hadoop fs -put -f a /tmp/input/
rm -r /tmp/output
hadoop jar hadoop-mapreduce-examples-2.7.0.jar \
  wordcount /tmp/input/ /tmp/output/

Checking

cat /tmp/output/part-r-00000  | sort -k2 -r -n | more
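The pipeline above sorts the `word<TAB>count` lines by the second column, numerically and in descending order, so the most frequent words come first. A small sketch of the same idea, applied to the four-line result from the first run; `top_words` is a hypothetical helper introduced here, and it uses `head` in place of `more` to keep only the first N lines.

```shell
#!/usr/bin/env bash
# top_words is an illustrative helper, not part of Hadoop.
# Reads "word<TAB>count" lines on stdin and prints the N most frequent words.
top_words() {
  sort -k2,2 -n -r | head -n "$1"
}

printf 'a\t3\nab\t1\nb\t6\nc\t5\n' | top_words 2
# b	6
# c	5
```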
