
言語処理100本ノック in fish (Chapter 2)

Last updated at 2017-05-03 / Posted at 2017-05-02

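Introduction

To practice fish shell scripting, I am working through 言語処理100本ノック (the "100 Language Processing Knocks" exercise set).
This post continues from fishで言語処理100本ノック (第1章), the Chapter 1 article.

As before, I solve the problems with fish builtin commands (functions) wherever possible, but I allow myself seq for generating number sequences and cat for reading files.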

Articles I referred to:
言語処理100本ノック with Python（第2章・前編）
[言語処理100本ノック with Python（第2章・後編）](http://qiita.com/gamma1129/items/6afee2034d6028847e1a)

Chapter 2: UNIX Command Basics

We start with the hightemp.txt file already downloaded into the current directory.

test -f ./hightemp.txt; or exit

10. Counting lines

Count the number of lines. Use the wc command to check.

count (cat ./hightemp.txt)

# Check
wc -l ./hightemp.txt | awk '{print $1}'

Command substitution yields a list with one element per line, and count then counts the elements.
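
As a small illustration of that splitting, the sketch below feeds a command substitution three lines produced by printf (made-up sample data, not the contents of hightemp.txt) and shows that each line ends up as one list element.

```fish
# Sample input: three lines generated by printf (hypothetical data).
set -l lines (printf 'first\nsecond\nthird\n')

count $lines      # prints 3
echo $lines[2]    # prints second
```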

11. Replacing tabs with spaces

Replace each tab character with a single space. Use the sed, tr, or expand command to check.

string replace -a \t ' ' < ./hightemp.txt

# Check
tr \t ' ' < ./hightemp.txt

The string builtin can do the job of tr here.
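
To see the replacement on something smaller than the whole file, here is a one-line sketch using a made-up input string. The unquoted \t is expanded to a real tab by fish before string sees it.

```fish
# printf expands \t in its format string, so this emits a<TAB>b<TAB>c.
printf 'a\tb\tc\n' | string replace -a \t ' '   # prints: a b c
```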

12. Saving column 1 to col1.txt and column 2 to col2.txt

Extract only the first column of each line and save it as col1.txt, and only the second column as col2.txt. Use the cut command to check.

# Start from empty output files, since the loop below appends
echo -n > ./col1.txt
echo -n > ./col2.txt
while read -l a b _
    echo $a >> ./col1.txt
    echo $b >> ./col2.txt
end < ./hightemp.txt

# Check
cut -f1 ./hightemp.txt > ./col1_check.txt
cut -f2 ./hightemp.txt > ./col2_check.txt

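The file is redirected into a while loop for processing.
When read is given several variable names, it splits the input and stores the pieces in them, so with read -l a b _ the variable a receives the first column, b the second, and _ everything from the third column onward¹.
You can also store the fields in a list with read -a -l array, so pick whichever you prefer.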
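
The two styles of read can be compared side by side in the sketch below. The sample line and the variable names pref, city, rest, and fields are made up for illustration. I use an ordinary name instead of _ for the leftover fields because _ is special in fish, and newer releases may refuse to assign to it (an assumption on my part).

```fish
# Hypothetical sample line: four tab-separated fields.
set -l line (printf 'Kochi\tEkawasaki\t41\t2013-08-12')

# Named variables: the last one receives the rest of the line.
echo $line | read -l pref city rest
echo $pref        # prints Kochi
echo $city        # prints Ekawasaki

# List form: every field becomes one element.
echo $line | read -l -a fields
echo $fields[3]   # prints 41
```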

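13. Merging col1.txt and col2.txt

Combine the col1.txt and col2.txt created in problem 12 into a text file that places the first and second columns of the original file side by side, separated by a tab. Use the paste command to check.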

set -l col1 (cat ./col1.txt)
set -l col2 (cat ./col2.txt)

while count $col1 $col2 >/dev/null
    string join \t $col1[1] $col2[1]
    set -e col1[1]
    set -e col2[1]
end > ./col1_2.txt

# Check
paste ./col1.txt ./col2.txt > ./col1_2_check.txt

This time the output of the while loop is redirected to a file.
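
Redirecting a whole loop works the same way for a for loop. The sketch below is an alternative I did not use above: it pairs elements by index instead of consuming the lists, on made-up data, and writes to a temporary file created with mktemp.

```fish
# Hypothetical sample columns of equal length.
set -l left a b c
set -l right 1 2 3
set -l out (mktemp)    # temporary output file

# Everything the loop prints goes into $out.
for i in (seq (count $left))
    string join \t $left[$i] $right[$i]
end > $out

set -l merged (cat $out)
rm $out
count $merged          # prints 3
```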

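14. Printing the first N lines

Receive a natural number N via a command-line argument or similar means, and print only the first N lines of the input. Use the head command to check.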

function fish-head -a N file
    set -l lines (cat $file)
    string join \n $lines[1..$N]
end

fish-head 10 ./hightemp.txt

# Check
head -n 10 ./hightemp.txt

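The problem says to receive the natural number N "via a command-line argument or similar means", so I wrapped the solution in a function.
I wondered whether it could be simplified further to string join \n (cat $file)[1..$N], but it seems that variable expansion cannot be used when slicing the result of a command substitution as a list.

string join \n (cat ./hightemp.txt)[1..10] # this works

set N 10
string join \n (cat ./hightemp.txt)[1..$N] # this does not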

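15. Printing the last N lines

Receive a natural number N via a command-line argument or similar means, and print only the last N lines of the input. Use the tail command to check.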

function fish-tail -a N file
    set -l M "-$N"
    set -l lines (cat $file)
    string join \n $lines[$M..-1]
end

fish-tail 10 ./hightemp.txt

# Check
tail -n 10 ./hightemp.txt

This is much like the previous problem.
The slice cannot be written as $lines[-$N..-1], so a variable holding the negated N has to be prepared beforehand.
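
The negated-index trick is easier to see on a short list. A minimal sketch with made-up values (items, n, and start are names chosen for this example):

```fish
set -l items a b c d e
set -l n 2
set -l start "-$n"          # negated copy of n, here -2

echo $items[$start..-1]     # prints: d e
```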

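16. Splitting a file into N pieces

Receive a natural number N via a command-line argument or similar means, and split the input file into N pieces at line boundaries. Achieve the same thing with the split command.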

function fish-split -a N file
    set -l lines (cat $file)
    set -l total (count $lines)
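    # Lines per piece, rounded up. This assumes math does integer division (true of fish 2.x);
    # on fish 3.x, where math returns decimals, something like `math -s0` would be needed.
    set -l rows (math $total / $N)
    test (math $total \% $N) = 0; or set rows (math $rows + 1)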

    for i in (seq $N)
        if test $i = $N
            string join \n $lines
        else
            string join \n $lines[1..$rows]
            set -e lines[1..$rows]
        end  > split_$i.txt
    end
end

fish-split 5 ./hightemp.txt

# Check
split -l 5 ./hightemp.txt split_check_

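The work is done with a series of math calculations. Numeric equality should properly be tested with test $i -eq $N, but I like writing it as the string comparison test $i = $N because it is easier to read. Note that split -l takes the number of lines per piece rather than the number of pieces, so the check only matches when the computed rows value is also 5.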
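
The rounding-up step can be isolated into a small helper. The sketch below assumes fish 3.x, where math works in floating point and -s0 truncates the result to an integer; ceil_div is a name made up for this example.

```fish
# Assumption: fish 3.x, where `math -s0` truncates to an integer.
function ceil_div -a total n
    set -l rows (math -s0 "$total / $n")
    # Add one more row when the division leaves a remainder.
    test (math "$total % $n") = 0; or set rows (math $rows + 1)
    echo $rows
end

ceil_div 24 5   # prints 5
ceil_div 20 5   # prints 4
```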

17. Distinct strings in the first column

Find the distinct values (the set of different strings) in the first column. Use the sort and uniq commands to check.

set -l uniq

while read -l a _
    contains $a $uniq; or set uniq $uniq $a
end < ./hightemp.txt

count $uniq

# Check
cat ./hightemp.txt | cut -f1 | sort | uniq | wc | awk '{print $1}'

Each value is checked the plain way with contains.
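
Here is the same de-duplication idea on a short made-up list, which also shows that the order of first appearance is preserved. The -- guards against a value that starts with a dash being taken for an option.

```fish
set -l seen
for word in b a b c a
    # Append the word only if it is not in the list yet.
    contains -- $word $seen; or set seen $seen $word
end

echo $seen   # prints: b a c
```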

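18. Sorting lines in descending order of the number in column 3

Sort the lines in reverse order of the number in the third column (note: reorder the lines without changing their content). Use the sort command to check (for this problem, the result does not have to match the command's output).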

function bubble_sort_by_col3
    test (count $argv) -lt 2; and echo $argv; and return

    for i in (seq (math (count $argv) - 1))
        set -l j (math $i + 1)
        echo $argv[$i] | read -l _ _ a _
        echo $argv[$j] | read -l _ _ b _

        if test (math "$a < $b") = 1
            set -l buf $argv[$i]
            set argv[$i] $argv[$j]
            set argv[$j] $buf
        end
    end
    
    bubble_sort_by_col3 $argv[1..-2]
    echo $argv[-1]
end

bubble_sort_by_col3 (cat ./hightemp.txt)

# Check
sort -k3nr < ./hightemp.txt

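This is a simple bubble sort implemented with recursion. It is slow.

I wondered how to compare decimal numbers, and it turns out math can do it. Note, however, that the truth value of the comparison is returned on standard output rather than as the exit status, and is 1 when the comparison is true. This relies on the bc-based math of fish 2.x; as far as I know, the math builtin in fish 3.x does not accept comparison operators.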
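
For reference, here is a sketch of the same column-3 comparison that does not go through math. It assumes fish 3.x, where test accepts floating-point operands, and it is not what the solution above uses. The function name col3_less and the sample lines are made up.

```fish
# Assumption: fish 3.x, where `test` can compare decimal numbers.
# Succeeds (status 0) when column 3 of the first line is smaller.
function col3_less -a left right
    echo $left | read -l l1 l2 a lrest
    echo $right | read -l r1 r2 b rrest
    test $a -lt $b
end

col3_less 'x y 39.9 d' 'x y 40.2 d'; and echo smaller   # prints smaller
```

Because the result comes back as an exit status, it can be used directly in if or with and/or, unlike the value printed by math.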

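19. Counting how often each first-column string occurs and listing them by descending frequency

Count how often each string in the first column occurs, and print the strings in descending order of frequency. Use the cut, uniq, and sort commands to check.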

set -l keys
set -l vals
while read -l a _
    set -l i (contains -i $a $keys)
    if test $status = 0
        set vals[$i] (math $vals[$i] + 1)
    else
        set keys $keys $a
        set vals $vals 1
    end
end < ./hightemp.txt

bubble_sort_by_col3 (for i in (seq (count $keys))
    echo "$keys[$i] _ $vals[$i]"
end) | while read -l a _
    echo $a
end

# Check
cut -f1 ./hightemp.txt | sort | uniq -c | sort -rn | awk '{print $2}'

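fish has no dictionary type, so I recorded the number of occurrences of each string using two lists, keys and vals.

Redefining a sort function was a hassle, so I reused bubble_sort_by_col3.
The for loop sits inside a command substitution that spans several lines, which is not great for readability...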
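
The parallel-list bookkeeping is easier to follow on a tiny input. The sketch below counts words in a made-up list; keys[i] and vals[i] together play the role of one dictionary entry.

```fish
set -l keys
set -l vals
for word in a b a c a
    # contains -i prints the index of the match and sets the status.
    set -l i (contains -i -- $word $keys)
    if test $status = 0
        set vals[$i] (math $vals[$i] + 1)
    else
        set keys $keys $word
        set vals $vals 1
    end
end

for i in (seq (count $keys))
    echo "$keys[$i] $vals[$i]"
end
# prints:
# a 3
# b 1
# c 1
```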

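Conclusion

In ordinary shell use, you should combine commands the way the check portion of each problem does.
Still, even with fish shell script alone, things did not get all that cumbersome... I think.

I plan to continue with fishで言語処理100本ノック (第3章), the Chapter 3 article.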

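¹ However, in fish _ is an environment variable that holds the name of the running job, and even if read sets a value for it, echo $_ on the next line shows the job name again. In other words, a value discarded into _ cannot be referenced afterwards and really is thrown away.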
