
Trying out Pig (0.17.0) on Hadoop (3.2.1) again after a long while

Posted at 2020-12-04

■ Environment
OS: Ubuntu 16 or 18
Hadoop: hadoop-3.2.1.tar.gz
JDK (Java): jdk-8u202-linux-x64.tar.gz

NameNode
192.168.76.216: h-gpu05

DataNodes
192.168.76.210: h-gpu03
192.168.76.210: h-gpu04


$ wget https://archive.apache.org/dist/pig/pig-0.17.0/pig-0.17.0.tar.gz
$ tar zxvf pig-0.17.0.tar.gz

Add the following to .bashrc:


export PIG_HOME=/home/hadoop/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop

Start the JobHistory server:


$ mr-jobhistory-daemon.sh start historyserver

Check the version:


hadoop@h-gpu05:~$ pig --version
Apache Pig version 0.17.0 (r1797386) 
compiled Jun 02 2017, 15:41:58

For this article, generate a data file like the following:


hadoop@h-gpu05:~/qiita/hadoop/pig$ g++ rand_gen_sin.cpp 
hadoop@h-gpu05:~/qiita/hadoop/pig$ ./a.out 100000
hadoop@h-gpu05:~/qiita/hadoop/pig$ head -n 3 random_data.txt 
2019/07/02 03:03:00.000,35293
2019/07/02 06:06:00.000,34155.7
2019/07/02 20:20:00.000,35647.6
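The source of rand_gen_sin.cpp is not shown in this article. As a rough idea only, a generator producing lines in the same `date time,value` format could look like the Python sketch below; the sine modulation, base value, and the hour-equals-minute timestamp pattern are guesses inferred from the file name and the sample lines, not the actual program.

```python
# Hypothetical stand-in for rand_gen_sin.cpp (the original source is not shown).
# Emits N lines of "YYYY/MM/DD HH:MM:SS.mmm,value", where the value wobbles
# around ~35000 with a sine component plus noise, as the samples suggest.
import math
import random

def gen_lines(n, seed=0):
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        h = rng.randrange(24)  # random hour; the minute mirrors it, as in the samples
        ts = f"2019/07/02 {h:02d}:{h:02d}:00.000"
        value = 35000 + 1000 * math.sin(h / 24 * 2 * math.pi) + rng.uniform(-500, 500)
        lines.append(f"{ts},{round(value, 1)}")
    return lines

for line in gen_lines(3):
    print(line)
```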

Put the data (random_data.txt) on HDFS:


hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -mkdir pig_input
hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -put random_data.txt pig_input/
2020-12-04 16:17:15,437 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop@h-gpu05:~/qiita/hadoop/pig$ hdfs dfs -ls pig_input/
Found 1 items
-rw-r--r--   3 hadoop supergroup    2602451 2020-12-04 16:17 pig_input/random_data.txt

Try reading it with Pig:


grunt> A = LOAD 'pig_input/random_data.txt' USING PigStorage(',') AS (pdfdata:chararray);
grunt> dump A; 
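Note that `PigStorage(',')` splits each line on commas, and because the schema declares only one field, each tuple in `A` keeps just the first column (the timestamp) and drops the value. A rough Python model of that behavior, using the sample lines from above:

```python
# Rough Python model of: LOAD ... USING PigStorage(',') AS (pdfdata:chararray)
# PigStorage(',') splits each line on commas; a one-field schema keeps only
# the first column (the timestamp) and silently drops the rest.
lines = [
    "2019/07/02 03:03:00.000,35293",
    "2019/07/02 06:06:00.000,34155.7",
]
A = [line.split(",")[0] for line in lines]
print(A)  # ['2019/07/02 03:03:00.000', '2019/07/02 06:06:00.000']
```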

Run a word-count script like the following:


lines = LOAD 'pig_input/random_data.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
DUMP wordcount; 


hadoop@h-gpu05:~$ pig 4.pig

Example output:


(2019/07/02,100000)
(00:00:00.000,4064)
(01:01:00.000,4239)
(02:02:00.000,4159)
(03:03:00.000,4169)
(04:04:00.000,4208)
(05:05:00.000,4269)
(06:06:00.000,4135)
(07:07:00.000,4197)
(08:08:00.000,4217)
(09:09:00.000,4292)
(10:10:00.000,4149)
(11:11:00.000,4094)
(12:12:00.000,4204)
(13:13:00.000,4122)
(14:14:00.000,4222)
(15:15:00.000,4127)
(16:16:00.000,4199)
(17:17:00.000,4177)
(18:18:00.000,4089)
(19:19:00.000,4163)
(20:20:00.000,4130)
(21:21:00.000,4141)
(22:22:00.000,4082)
(23:23:00.000,4152)
2020-12-04 16:43:44,300 [main] INFO  org.apache.pig.Main - Pig script completed in 23 seconds and 850 milliseconds (23850 ms)
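The shape of this output follows from how TOKENIZE splits each line: its delimiter set includes whitespace and commas, so every line yields three tokens (date, time, value). With 100,000 lines, the date appears 100,000 times and each of the 24 time-of-day tokens appears about 100000/24 ≈ 4167 times, matching the counts above. A small Python model of the same count (an illustration, not Pig itself):

```python
# Rough Python model of the Pig word-count above.
# Pig's TOKENIZE splits a chararray on whitespace, double quotes, commas,
# parentheses, and asterisks, so each line yields date, time, and value tokens.
import re
from collections import Counter

lines = [
    "2019/07/02 03:03:00.000,35293",
    "2019/07/02 06:06:00.000,34155.7",
    "2019/07/02 20:20:00.000,35647.6",
]
tokens = [tok for line in lines for tok in re.split(r'[ ",()*]+', line) if tok]
wordcount = Counter(tokens)
print(wordcount["2019/07/02"])  # the date token appears once per line
```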

Next, write the result to a file with a script like the following. (Note that STORE fails if the output directory already exists on HDFS.)


lines = LOAD 'pig_input/random_data.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;
grouped = GROUP words BY word;
wordcount = FOREACH grouped GENERATE group, COUNT(words);
STORE wordcount INTO 'pig_output/output';

Run it:


hadoop@h-gpu05:~$ pig 4.pig
hadoop@h-gpu05:~$ hdfs dfs -ls pig_output
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-12-04 16:51 pig_output/output
hadoop@h-gpu05:~$ hdfs dfs -get pig_output/output
2020-12-04 16:52:15,928 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
hadoop@h-gpu05:~$ ls output/
part-r-00000  _SUCCESS      
hadoop@h-gpu05:~$ head -n 10 output/part-r-00000 
29491   1
29494   2
29498   1
29501   2
29507   1
29508   2
29510   1
29513   1
29514   1
29522   1
hadoop@h-gpu05:~$ tail -n 10 output/part-r-00000 
14:14:00.000    4222
15:15:00.000    4127
16:16:00.000    4199
17:17:00.000    4177
18:18:00.000    4089
19:19:00.000    4163
20:20:00.000    4130
21:21:00.000    4141
22:22:00.000    4082
23:23:00.000    4152