Rehabilitating Python and NLP Skills with "NLP 100 Exercise 2015" (Chapter 2, Second Half) #Python

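16. Split a file into N parts

Receive a natural number N via a command-line argument or similar, and split the input file into N parts by line. Do the same with the split command.

The imports and the argparse setup are omitted.
When the number of lines M in the file is not evenly divisible by the given natural number N,
the parts get one extra line each, starting from the first part, until the remainder is used up.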

knock016.py

args = parser.parse_args()
N = args.line
filename = args.filename

# Read all lines of the input file
with open(filename) as f:
    lines = f.readlines()
M = len(lines)

# Quotient and remainder of M divided by N
quotient = M // N
remainder = M % N

# Number of lines in each part: the first `remainder` parts get one extra line
num_of_lines = [quotient+1 if i < remainder else quotient for i in xrange(N)]
# Line indices at which the next part starts
num_of_lines_cumulative = [sum(num_of_lines[:i+1]) for i in xrange(N)]

for i, line in enumerate(lines):
    # Print a blank line as the separator between parts
    if i in num_of_lines_cumulative:
        print
    print line.rstrip('\n')
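
The part-size rule can also be checked on its own, separately from the file I/O. The following is a minimal sketch in Python 3 (the listing above is Python 2). `chunk_sizes` and `split_lines` are hypothetical helper names introduced here for illustration and are not part of knock016.py.

```python
def chunk_sizes(m, n):
    """Return the number of lines in each of n parts for m lines in total."""
    quotient, remainder = divmod(m, n)
    # The first `remainder` parts receive one extra line.
    return [quotient + 1 if i < remainder else quotient for i in range(n)]


def split_lines(lines, n):
    """Split a list of lines into n consecutive parts."""
    chunks = []
    start = 0
    for size in chunk_sizes(len(lines), n):
        chunks.append(lines[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    # 24 lines split into 7 parts: 24 = 7 * 3 + 3
    print(chunk_sizes(24, 7))  # [4, 4, 4, 3, 3, 3, 3]
```

Returning the parts as lists, instead of printing them with blank-line separators, makes it easy to write each part to its own file later.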

UNIX command...
Adding validation of the options (incomplete as it is) made the code long.

knock016.sh

#!/bin/sh

# Receive a natural number N via a command-line argument or similar,
# and split the input file into N parts by line.
# Do the same with the split command.
# ex.
# sh knock016.sh -f hightemp.txt -n 7

# Script name used in the usage message
CMDNAME=$(basename "$0")

while getopts f:n: OPT
do
  case $OPT in
    "f" ) FLG_F="TRUE" ; INPUT_FILE=$OPTARG ;;
    "n" ) FLG_N="TRUE" ; N=$OPTARG ;;
      * ) echo "Usage: $CMDNAME [-f file name] [-n split number]" 1>&2
          exit 1 ;;
  esac
done

if [ ! "$FLG_F" = "TRUE" ]; then
  echo 'file name is not set.'
  exit 1
fi
if [ ! "$FLG_N" = "TRUE" ]; then
  echo 'split number is not set.'
  exit 1
fi

#INPUT_FILE="hightemp.txt"
# Intermediate and output files all go under split/
mkdir -p split
TMP_HEAD="split/tmphead.$INPUT_FILE"
TMP_TAIL="split/tmptail.$INPUT_FILE"
SPLITHEAD_PREFIX="split/splithead."
SPLITTAIL_PREFIX="split/splittail."

M=$( wc -l < "$INPUT_FILE" )
#N=9
quotient=`expr \( $M / $N \)`
remainder=`expr \( $M - $quotient \* $N \)`

if [ $quotient -eq 0 ]; then
  echo "cannot divide: N is larger than the lines of the input file." 1>&2
  exit 1
fi

if [ $remainder -eq 0 ]; then
  # If the remainder is 0, split so that each file contains $quotient lines
  split -l $quotient "$INPUT_FILE" "$SPLITHEAD_PREFIX"
else
  # If the remainder is not 0, first cut the input into two files:
  # (a) the first (($quotient + 1) * $remainder) lines and (b) everything after that
  split_head=`expr \( \( $quotient + 1 \) \* $remainder \)`
  split_tail=`expr \( $M - $split_head \)`
  head -n $split_head "$INPUT_FILE" > "$TMP_HEAD"
  tail -n $split_tail "$INPUT_FILE" > "$TMP_TAIL"

  # Split (a) into files of ($quotient + 1) lines and (b) into files of $quotient lines
  split -l `expr \( $quotient + 1 \)` "$TMP_HEAD" "$SPLITHEAD_PREFIX"
  split -l $quotient "$TMP_TAIL" "$SPLITTAIL_PREFIX"

  rm -iv split/tmp*

fi

split takes the number of lines per output file rather than the number of output files,
so getting exactly N parts took a bit of extra work.
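
The arithmetic behind the head/tail approach in knock016.sh can be checked with a short Python 3 sketch. `head_tail_plan` is a hypothetical name introduced here; it only reproduces the line counts the script computes and confirms that the two split calls together would produce N files.

```python
def head_tail_plan(m, n):
    """Return (head_lines, tail_lines, head_files, tail_files) for m lines and n parts."""
    quotient, remainder = divmod(m, n)
    if quotient == 0:
        raise ValueError("N is larger than the number of lines")
    # Head part: `remainder` files of (quotient + 1) lines each.
    head_lines = (quotient + 1) * remainder
    # Tail part: the remaining lines, in files of `quotient` lines each.
    tail_lines = m - head_lines
    head_files = head_lines // (quotient + 1)
    tail_files = tail_lines // quotient
    return head_lines, tail_lines, head_files, tail_files


if __name__ == "__main__":
    # 24 lines, N = 7: 12 head lines in 3 files of 4, 12 tail lines in 4 files of 3
    print(head_tail_plan(24, 7))  # (12, 12, 3, 4)
```

Since the tail always holds quotient * (N - remainder) lines, it divides evenly, and the file counts add up to remainder + (N - remainder) = N.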

17. Distinct strings in the first column

Find the kinds of strings in the first column (the set of distinct strings). Use the sort and uniq commands to check.

if __name__ == '__main__':

    f = open(filename)
    lines = f.readlines()

    # unlike problem 12., "+ '\n'" is not necessary
    content_col1 = [line.split()[0] for line in lines]
    content_col1_set = set(content_col1)
    print len(content_col1_set)

    for x in content_col1_set:
        print x

    f.close()

#>>>
#12
#愛知県
#山形県
#岐阜県
#千葉県
#埼玉県
#高知県
#群馬県
#山梨県
#和歌山県
#愛媛県
#大阪府
#静岡県

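UNIX command.
I wonder whether the order has to match as well...? (A Python set has no defined iteration order, whereas the output here is sorted.)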

awk -F'\t' '{print $1;}' hightemp.txt | sort | uniq
#>>>
#千葉県
#和歌山県
#埼玉県
#大阪府
#山形県
#山梨県
#岐阜県
#愛媛県
#愛知県
#群馬県
#静岡県
#高知県
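
If the two outputs should be comparable line by line, one option is to sort the set before printing. Below is a minimal Python 3 sketch. `distinct_first_column` is a hypothetical helper, and the sample rows are a few lines of hightemp.txt, assumed to be tab-separated as the awk and cut commands in this article suggest. Python sorts strings by code point, which should agree with `sort` when it compares UTF-8 text byte by byte (for example in the C locale); other locales may order the lines differently.

```python
def distinct_first_column(lines):
    """Return the distinct first-column values, sorted by code point."""
    return sorted({line.split("\t")[0] for line in lines if line.strip()})


if __name__ == "__main__":
    sample = [
        "高知県\t江川崎\t41\t2013-08-12\n",
        "埼玉県\t熊谷\t40.9\t2007-08-16\n",
        "岐阜県\t多治見\t40.9\t2007-08-16\n",
        "埼玉県\t越谷\t40.4\t2007-08-16\n",
    ]
    for name in distinct_first_column(sample):
        print(name)
```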

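18. Sort the lines in descending order of the number in the third column

Sort the lines in reverse order of the numeric value in the third column (note: reorder the lines without changing their content). Use the sort command to check (for this problem, the result does not have to match the output of the command).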

if __name__ == '__main__':

    f = open(filename)
    lines = f.readlines()
    # reverse=True allows us to perform descending sort
    sorted_lines = sorted(lines, key=lambda line: float(line.split()[2]), reverse=True)

    for sorted_line in sorted_lines:
        print sorted_line,

    f.close()

#>>>
#高知県  江川崎   41  2013-08-12
#埼玉県  熊谷  40.9    2007-08-16
#岐阜県  多治見   40.9    2007-08-16
#山形県  山形  40.8    1933-07-25
#山梨県  甲府  40.7    2013-08-10
#和歌山県   かつらぎ    40.6    1994-08-08
#静岡県  天竜  40.6    1994-08-04
#山梨県  勝沼  40.5    2013-08-10
#埼玉県  越谷  40.4    2007-08-16
#群馬県  館林  40.3    2007-08-16
#群馬県  上里見   40.3    1998-07-04
#愛知県  愛西  40.3    1994-08-05
#千葉県  牛久  40.2    2004-07-20
#静岡県  佐久間   40.2    2001-07-24
#愛媛県  宇和島   40.2    1927-07-22
#山形県  酒田  40.1    1978-08-03
#岐阜県  美濃  40  2007-08-16
#群馬県  前橋  40  2001-07-24
#千葉県  茂原  39.9    2013-08-11
#埼玉県  鳩山  39.9    1997-07-05
#大阪府  豊中  39.9    1994-08-08
#山梨県  大月  39.9    1990-07-19
#山形県  鶴岡  39.9    1978-08-03
#愛知県  名古屋   39.9    1942-08-02

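UNIX command.

sort -k3,3nr hightemp.txt

The -k option selects the key column; -k3,3 limits the key to the third column instead of running from the third column to the end of the line. n compares the key as a number, and r reverses the order.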
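
Whether the third column is compared as a number or as a string starts to matter once the values differ in digit count. The Python 3 sketch below illustrates this with made-up rows (not taken from hightemp.txt); `sort_by_third_column` is a hypothetical helper name. It also relies on `sorted` being stable, so rows with equal keys keep their input order even with `reverse=True`.

```python
def sort_by_third_column(lines, numeric=True):
    """Sort lines in descending order of the third tab-separated column."""
    def key(line):
        value = line.split("\t")[2]
        # Compare as a float, or as a plain string when numeric is False.
        return float(value) if numeric else value
    return sorted(lines, key=key, reverse=True)


if __name__ == "__main__":
    rows = ["a\tx\t9.5\td1", "b\ty\t10.2\td2", "c\tz\t40\td3", "d\tw\t10.2\td4"]
    # Numeric comparison: 40 > 10.2 = 10.2 > 9.5
    print([r[0] for r in sort_by_third_column(rows)])                 # ['c', 'b', 'd', 'a']
    # String comparison: "9.5" > "40" > "10.2"
    print([r[0] for r in sort_by_third_column(rows, numeric=False)])  # ['a', 'c', 'b', 'd']
```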

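19. Count how often each first-column string occurs and sort by descending frequency

Find the frequency of each string in the first column and display the strings in descending order of frequency. Use the cut, uniq, and sort commands to check.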

from collections import defaultdict
from collections import Counter

...

if __name__ == '__main__':

    f = open(filename)
    lines = f.readlines()

    # extract 1st column
    content_col1 = [line.split()[0] for line in lines]

    # (1) defaultdict
    # http://docs.python.jp/2/library/collections.html#collections.defaultdict
    d = defaultdict(int)
    for col1 in content_col1:
        d[col1] += 1
    for word, cnt in sorted(d.items(), key=lambda x: x[1], reverse=True):
        print word, cnt

    print

    # (2) Counter
    # http://docs.python.jp/2/library/collections.html#collections.Counter
    counter = Counter(content_col1)
    for word, cnt in counter.most_common():
        print word, cnt

    f.close()

#>>>
#山形県 3
#埼玉県 3
#群馬県 3
#山梨県 3
#愛知県 2
#岐阜県 2
#千葉県 2
#静岡県 2
#高知県 1
#和歌山県 1
#愛媛県 1
#大阪府 1

#山形県 3
#埼玉県 3
#群馬県 3
#山梨県 3
#愛知県 2
#岐阜県 2
#千葉県 2
#静岡県 2
#高知県 1
#和歌山県 1
#愛媛県 1
#大阪府 1

You can either count with a defaultdict, as in (1),
or use Counter, which exists for exactly this purpose, as in (2).
I had no idea there was a most_common() method...
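
That the two approaches give the same counts can be confirmed with a small Python 3 sketch. The function names and the sample list are made up for illustration. `most_common(n)` also accepts the number of entries to return.

```python
from collections import Counter, defaultdict


def count_with_defaultdict(items):
    """Count occurrences by incrementing a defaultdict(int)."""
    counts = defaultdict(int)
    for item in items:
        counts[item] += 1
    return dict(counts)


def count_with_counter(items):
    """Count occurrences by passing the items straight to Counter."""
    return Counter(items)


if __name__ == "__main__":
    col1 = ["山形県", "埼玉県", "山形県", "群馬県", "埼玉県", "山形県"]
    # Both approaches end up with the same mapping from string to count.
    print(count_with_defaultdict(col1) == dict(count_with_counter(col1)))  # True
    # Only the two most frequent entries
    print(count_with_counter(col1).most_common(2))  # [('山形県', 3), ('埼玉県', 2)]
```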

Next, the UNIX command.

cut -f 1 hightemp.txt | sort | uniq -c | sort -nr
#>>>
#   3 群馬県
#   3 山梨県
#   3 山形県
#   3 埼玉県
#   2 静岡県
#   2 愛知県
#   2 岐阜県
#   2 千葉県
#   1 高知県
#   1 愛媛県
#   1 大阪府
#   1 和歌山県
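
The pipeline above can be mirrored step by step in Python 3, which may help show why the first sort is needed: like uniq, `itertools.groupby` only merges adjacent equal items. This is a rough sketch with hypothetical function names and made-up input. Entries with equal counts keep their sorted order here, so the order among ties may differ from what `sort -nr` prints.

```python
from itertools import groupby


def uniq_c(values):
    """Collapse each run of adjacent equal values into (count, value), like `uniq -c`."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def sort_uniq_c_sort_nr(values):
    """Mimic `sort | uniq -c | sort -nr` on a list of strings."""
    # sort: bring identical values next to each other
    ordered = sorted(values)
    # uniq -c: count each run of adjacent equal values
    counted = uniq_c(ordered)
    # sort -nr: order by count, largest first
    return sorted(counted, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    col1 = ["b", "a", "b", "c", "a", "b"]
    # Without the first sort, no two equal values are adjacent, so nothing is merged.
    print(uniq_c(col1))               # [(1, 'b'), (1, 'a'), (1, 'b'), (1, 'c'), (1, 'a'), (1, 'b')]
    print(sort_uniq_c_sort_nr(col1))  # [(3, 'b'), (2, 'a'), (1, 'c')]
```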

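This is a frequently used idiom, so I want to make sure I remember it.
sort sorts the lines, uniq merges adjacent identical lines,
its -c option counts those duplicate lines,
and "sort -nr" treats the lines as numbers and sorts them (in descending order).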