Aggregating over a large number of files in shell

Posted at 2020-11-06, last updated at 2020-11-11

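For example: from a large number of files, extract the id of every line in which foo_* appears 5 or more times, then count the number of such lines per id.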

Example file

id=>aaa, "foo_24"
id=>bbb, "foo_4", "foo_pp", "aaaaaa", "foo_343"
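To make the extraction rule concrete, here is a minimal sketch of the filter applied to a few lines. The helper name `extract_ids` and the third sample line (the only one with five `foo_` tokens) are made up for illustration, and the shorter pattern `(foo_.*){5,}` is assumed to select the same lines as the longer regex used in the scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the id of every stdin line that contains
# "foo_" at least five times.
extract_ids() {
  grep -E '(foo_.*){5,}' \
  | awk -F '=>' '{print $2}' \
  | awk -F ',' '{print $1}'
}

# Made-up sample input: only the "ccc" line has five "foo_" tokens.
printf '%s\n' \
  'id=>aaa, "foo_24"' \
  'id=>bbb, "foo_4", "foo_pp", "aaaaaa", "foo_343"' \
  'id=>ccc, "foo_1", "foo_2", "foo_3", "foo_4", "foo_5"' \
  | extract_ids
# prints: ccc
```

The first awk keeps everything after `=>`, and the second keeps everything before the first comma, which leaves only the id.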

Naive version

Plain sequential execution.
This is enough when the number of files and their sizes are small.

#!/bin/sh

files=$(find . -type f)

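# Extract
# NOTE: iterating over $files relies on word splitting, so file names containing whitespace will break
regex='(.*((foo_).*){5,}.*)'
tmp_file='tmp_result.txt'
for file in $files; do
  grep -E "$regex" "$file" \
  | awk -F "=>" '{print $2}' \
  | awk -F "," '{print $1}' \
  >> "$tmp_file"
done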

# Aggregate: count the lines per id and sort by count, descending
result_file="result.txt"
cat "$tmp_file" \
| awk '{count[$0]++}END{for(k in count)print k, count[k]}' \
| sort -r -t ' ' -k 2 -n \
>> "$result_file"

# Clean up the tmp file
rm "$tmp_file"
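
The aggregation step relies on an awk associative array keyed by the whole line. A minimal sketch of that step in isolation, with a hypothetical helper name `count_ids` and made-up ids:

```shell
#!/bin/sh
# Hypothetical helper: read one id per line on stdin and print "id count",
# highest count first.
count_ids() {
  awk '{count[$0]++} END {for (k in count) print k, count[k]}' \
  | sort -r -t ' ' -k 2 -n
}

# Made-up ids: "aaa" appears twice, "bbb" once.
printf '%s\n' aaa bbb aaa | count_ids
# prints:
# aaa 2
# bbb 1
```

awk does not guarantee any iteration order for `for (k in count)`, so the final `sort` is what makes the output order predictable.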

Parallel version

When there are many files or the files are large, we want to put every CPU core to work, so use this one.

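#!/bin/sh

files=$(find . -type f)

# Extract. This is the bottleneck, so it is the part worth parallelizing.
regex='(.*((foo_).*){5,}.*)'
tmp_file_name_prefix='tmp_extract_result'
parallel_num=50 # NOTE: tune this with the machine's CPU load in mind
k=0
for file in $files; do
  # k cycles through 1..parallel_num, so every job in a batch writes to its own file
  k=$(( k % parallel_num + 1 ))
  grep -E "$regex" "$file" \
  | awk -F "=>" '{print $2}' \
  | awk -F "," '{print $1}' \
  >> "${tmp_file_name_prefix}_$k.txt" &
  # Once parallel_num jobs have been started, wait for the whole batch to finish
  if [ "$k" -eq "$parallel_num" ]; then
      wait
  fi
done
# Wait for the last, possibly partial, batch before reading the results
wait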

# Merge the per-job result files into one
tmp_aggregation_file="tmp_aggregation.txt"
tmp_result_files=$(find . -name "${tmp_file_name_prefix}_*")
for tmp_result_file in $tmp_result_files; do
    cat "$tmp_result_file" >> "$tmp_aggregation_file"
done

# Aggregate: count the lines per id and sort by count, descending
result_file="result.txt"
cat "$tmp_aggregation_file" \
| awk '{count[$0]++}END{for(k in count)print k, count[k]}' \
| sort -r -t ' ' -k 2 -n \
>> "$result_file"

# Clean up the tmp files
rm "${tmp_file_name_prefix}"_*
rm "$tmp_aggregation_file"
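
As an alternative to hand-rolled batching, the same fan-out can be sketched with `find -print0 | xargs -0 -P`. This is not the approach used in the script above; it is a different technique shown for comparison. It assumes an `xargs` that supports `-0` and `-P` and a `find` that supports `-print0` (GNU and BSD versions do, but they are not POSIX). The function name `count_foo_ids`, the batch size of 100 files, and the default of 4 workers are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical function: count_foo_ids DIR [JOBS]
# Prints "id count" for lines under DIR that contain "foo_" 5+ times.
count_foo_ids() {
  dir=$1
  jobs=${2:-4}                      # assumed default; tune to the machine
  tmp_dir=$(mktemp -d) || return 1  # per-worker outputs live here

  # Each worker greps a batch of up to 100 files into its own temp file,
  # so concurrent workers never write to the same file.
  find "$dir" -type f -print0 \
    | xargs -0 -n 100 -P "$jobs" sh -c '
        out=$(mktemp "$0/part.XXXXXX") || exit 1
        grep -hE "(foo_.*){5,}" "$@" > "$out" || true  # no match is not an error
      ' "$tmp_dir"

  # Merge the worker outputs, extract the ids, then count and sort.
  cat "$tmp_dir"/part.* 2>/dev/null \
    | awk -F '=>' '{print $2}' \
    | awk -F ',' '{print $1}' \
    | awk '{count[$0]++} END {for (k in count) print k, count[k]}' \
    | sort -r -t ' ' -k 2 -n

  rm -rf "$tmp_dir"
}
```

It could be called as `count_foo_ids . 8 > result.txt`. Because file names are passed NUL-delimited, names containing whitespace are handled, and `xargs` keeps up to the requested number of workers busy instead of waiting for a whole batch to finish.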
