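Aggregating over a large number of files in shell

Posted at 2020-11-06

As an example, take a large number of files, extract the id of every line in which foo_* appears 5 or more times, and count the number of such lines per id.

Example file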

id=>aaa, "foo_24"
id=>bbb, "foo_4", "foo_pp", "aaaaaa", "foo_343"
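To make the extraction rule concrete, here is a minimal sketch that runs the filter on a few in-line sample lines. The helper name `extract_ids` and the `ccc` line are made up for illustration; the regex is a shortened form of the one used in the scripts below and is assumed to match the same lines.

```shell
#!/bin/sh
# Hypothetical helper: read lines on stdin and print the id of every line
# that contains "foo_" at least 5 times.
extract_ids() {
  grep -E '(foo_.*){5,}' \
    | awk -F '=>' '{print $2}' \
    | awk -F ',' '{print $1}'
}

# Sample data: the two lines from the example file plus one made-up line
# that has five foo_* entries.
sample='id=>aaa, "foo_24"
id=>bbb, "foo_4", "foo_pp", "aaaaaa", "foo_343"
id=>ccc, "foo_1", "foo_2", "foo_3", "foo_4", "foo_5"'

printf '%s\n' "$sample" | extract_ids   # prints: ccc
```

Neither line of the example file qualifies (1 and 3 occurrences respectively), so only the made-up `ccc` line is printed.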

Naive version

Process the files naively, one after another.
This is enough when the number and size of the files are small.

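#!/bin/sh

files=$(find . -type f)

# Extract the id of every line in which foo_* appears 5 or more times
regex='(.*((foo_).*){5,}.*)'
tmp_file='tmp_result.txt'
# NOTE: $files is split on whitespace, so file names containing spaces are not supported
for file in $files; do
  grep -E "$regex" "$file" \
  | awk -F "=>" '{print $2}' \
  | awk -F "," '{print $1}' \
  >> "$tmp_file"
done

# Count the lines per id and sort by count in descending order
result_file="result.txt"
awk '{count[$0]++}END{for(k in count)print k, count[k]}' "$tmp_file" \
| sort -r -t ' ' -k 2 -n \
>> "$result_file"

# Clean up the tmp file
rm "$tmp_file"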
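
As a variation on the sequential approach, the sketch below does extraction and counting in one pipeline, without a temporary file. It is not the script above: the function name `count_ids` and the choice to pass the directory as an argument are assumptions for illustration. Letting `find` hand the files to `cat` also avoids word splitting on file names. It assumes every input file ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper: count_ids DIR
# Prints "id count" for every id whose line contains "foo_" at least 5 times,
# sorted by count in descending order.
count_ids() {
  find "$1" -type f -exec cat {} + \
    | grep -E '(foo_.*){5,}' \
    | awk -F '=>' '{print $2}' \
    | awk -F ',' '{print $1}' \
    | awk '{count[$0]++} END {for (k in count) print k, count[k]}' \
    | sort -t ' ' -k 2 -n -r
}

# Usage (the directory name is a placeholder):
#   count_ids ./logs > result.txt
```

Writing the result outside the scanned directory keeps a later run from reading its own output.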

Parallel version

When the number or size of the files is large, use this version to make full use of the CPU.

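#!/bin/bash
# NOTE: requires bash rather than sh, because (( )) arithmetic and [[ ]] are used

files=$(find . -type f)

# Extract. This is the bottleneck, so it is the part worth parallelizing.
regex='(.*((foo_).*){5,}.*)'
tmp_file_name_prefix='tmp_extract_result'
parallel_num=50 # NOTE: tune this to the CPU load the machine can take
k=0
for file in $files; do
  k=$(( (k + 1) % parallel_num ))
  if [[ $k -eq 0 ]]; then
      # Up to parallel_num jobs are running; wait for all of them before starting more
      wait
  fi
  # Each job in a batch appends to its own slot file, so no two running jobs share a file
  grep -E "$regex" "$file" \
  | awk -F "=>" '{print $2}' \
  | awk -F "," '{print $1}' \
  >> "${tmp_file_name_prefix}_$k.txt" &
done
# Wait for the last batch before reading its output
wait

# Merge the per-slot results into one file
tmp_aggregation_file="tmp_aggregation.txt"
tmp_result_files=$(find . -name "${tmp_file_name_prefix}_*")
for tmp_result_file in $tmp_result_files; do
    cat "$tmp_result_file" >> "$tmp_aggregation_file"
done

# Count the lines per id and sort by count in descending order
result_file="result.txt"
awk '{count[$0]++}END{for(k in count)print k, count[k]}' "$tmp_aggregation_file" \
| sort -r -t ' ' -k 2 -n \
>> "$result_file"

# Clean up the tmp files
rm "${tmp_file_name_prefix}"_*
rm "$tmp_aggregation_file"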
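
The same idea can also be sketched with `xargs -P` doing the job scheduling instead of a hand-written batch loop. This is a different technique from the script above, not a rewrite of it. The function name `count_ids_parallel`, the batch size of 20 files per worker, and the default of 4 jobs are all assumptions for illustration. It relies on `find -print0`, `xargs -0 -P`, `grep -h`, and `mktemp`, which are common in GNU and BSD userlands but not guaranteed everywhere. Each worker writes to its own temporary file so that output from concurrent workers cannot interleave.

```shell
#!/bin/sh
# Hypothetical helper: count_ids_parallel DIR [JOBS]
# Prints "id count" for every id whose line contains "foo_" at least 5 times,
# sorted by count in descending order.
count_ids_parallel() {
  dir=$1
  jobs=${2:-4}                      # assumed default; tune to the machine
  tmpdir=$(mktemp -d) || return 1

  # Each worker receives up to 20 files and writes matches to its own part file.
  # "$0" inside the worker is the temporary directory passed after the script.
  find "$dir" -type f -print0 \
    | xargs -0 -n 20 -P "$jobs" sh -c '
        out=$(mktemp "$0/part.XXXXXX") || exit 1
        grep -hE "(foo_.*){5,}" "$@" > "$out"
        exit 0   # grep exits 1 when nothing matches; that is not an error here
      ' "$tmpdir"

  # Merge the part files, extract the ids, count, and sort.
  cat "$tmpdir"/part.* 2>/dev/null \
    | awk -F '=>' '{print $2}' \
    | awk -F ',' '{print $1}' \
    | awk '{count[$0]++} END {for (k in count) print k, count[k]}' \
    | sort -t ' ' -k 2 -n -r

  rm -rf "$tmpdir"
}

# Usage (the directory name is a placeholder):
#   count_ids_parallel ./logs 8 > result.txt
```

Because `xargs` starts a new worker as soon as one finishes, a single slow file does not hold up a whole batch the way a periodic `wait` does.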