NLP 100 Exercise (言語処理100本ノック) Challenge in Python: Warm-up Part 2 #Python3

Scope of this post

This post finishes the warm-up chapter.

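06. Sets
Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and compute the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.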

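07. Sentence generation from a template
Implement a function that takes arguments x, y, z and returns the string "x時のyはz" (roughly "y at x o'clock is z"). Then check the result with x=12, y="気温" (temperature), z=22.4.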

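08. Cipher text
Implement a function cipher that converts each character of a given string according to the following specification.

If the character is an English lowercase letter, replace it with the character whose code is (219 - character code)
Output every other character unchanged
Use this function to encrypt and decrypt an English message.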

09. Typoglycemia
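Given a sequence of words separated by spaces, write a program that keeps the first and last character of each word and randomly reorders the remaining characters. Words of length 4 or less are not reordered. Feed it a suitable English sentence (for example "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.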

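06. Sets

I reuse the n-gram function written in the previous post.
That implementation returns a list of tuples, so each tuple is joined into a string here.
Python's set type comes with set operations, so using set keeps this simple.
A set can also be built directly with a set comprehension, in the same style as a list comprehension.

As an aside, Python has what are called special methods.
A class defines the behavior of an operator by implementing the corresponding method: __add__ for +, __sub__ for -, __and__ for &, and __or__ for |. set implements __or__, __and__, and __sub__ as union, intersection, and difference (it does not define __add__), so those operators can be used on sets directly.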
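To make the link between operators and special methods concrete, here is a minimal sketch. The class name Tags and its items attribute are hypothetical and exist only for illustration; the class simply delegates each operator to a built-in set.

```python
# Hypothetical toy class (not part of the article's main.py).
class Tags:
    def __init__(self, items):
        self.items = set(items)

    def __or__(self, other):   # called for  a | b
        return Tags(self.items | other.items)

    def __and__(self, other):  # called for  a & b
        return Tags(self.items & other.items)

    def __sub__(self, other):  # called for  a - b
        return Tags(self.items - other.items)


a = Tags(['pa', 'ar', 'ra'])
b = Tags(['ra', 'ag'])

union = (a | b).items
common = (a & b).items
only_a = (a - b).items

print(sorted(union))   # ['ag', 'ar', 'pa', 'ra']
print(sorted(common))  # ['ra']
print(sorted(only_a))  # ['ar', 'pa']

# For the built-in set, the operator and the method give the same result.
same = ({1, 2} | {2, 3}) == {1, 2}.__or__({2, 3})
print(same)  # True
```

The results are sorted before printing only because the element order of a set is not fixed.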

main.py

import re


def _05_n_gram(sentence, type='word'):
    if type not in ['word', 'char']:
        raise ValueError('type requires word or char')

    # Raw strings keep \w from being treated as a string escape sequence.
    pattern = re.compile(r'\w+' if type == 'word' else r'\w')
    words = pattern.findall(sentence)
    # Pair each token with its successor to form bi-grams.
    return [(word, words[i+1]) for i, word in enumerate(words)
            if i+1 < len(words)]

def _06_n_gram_set(sentence1, sentence2):
    X = {''.join(ngram) for ngram in _05_n_gram(sentence1, type='char')}
    Y = {''.join(ngram) for ngram in _05_n_gram(sentence2, type='char')}
    print(X)
    print(Y)
    print('Union of X and Y:', X | Y)
    print('Intersection of X and Y:', X & Y)
    print('Difference of X and Y:', X - Y)
    print("'se' is in X" if 'se' in X else "'se' is not in X")
    print("'se' is in Y" if 'se' in Y else "'se' is not in Y")

if __name__ == '__main__':
    _06_n_gram_set('paraparaparadise', 'paragraph')

Running it gives output like the following. The order in which a set's elements are printed is not fixed, so it may differ from run to run.

Output

$ python main.py
{'ad', 'ap', 'ar', 'pa', 'se', 'ra', 'di', 'is'}
{'ag', 'ap', 'ph', 'ar', 'pa', 'ra', 'gr'}
Union of X and Y: {'ag', 'ar', 'pa', 'di', 'is', 'ad', 'ap', 'ph', 'se', 'ra', 'gr'}
Intersection of X and Y: {'ar', 'ra', 'pa', 'ap'}
Difference of X and Y: {'di', 'ad', 'se', 'is'}
'se' is in X
'se' is not in Y

Addendum (2017.1.7)

Thanks to a comment from @shiracamus, I looked into this further.
I was shown how to build a set from a generator expression without creating an intermediate list, but since a set comprehension builds the set directly, I switched to that.
The code before the change is below.

main.py

def _06_n_gram_set(sentence1, sentence2):
    X = set([''.join(ngram) for ngram in _05_n_gram(sentence1, type='char')])
    Y = set([''.join(ngram) for ngram in _05_n_gram(sentence2, type='char')])
    print(X)
    print(Y)
    print('Union of X and Y:', X | Y)
    print('Intersection of X and Y:', X & Y)
    print('Difference of X and Y:', X - Y)
    print("'se' is in X" if 'se' in X else "'se' is not in X")
    print("'se' is in Y" if 'se' in Y else "'se' is not in Y")
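The three ways of building the set mentioned in this addendum can be compared side by side. This is a small sketch on made-up input; the variable names (pairs, from_list, from_genexp, from_setcomp) are chosen here for illustration.

```python
# Sample bi-gram tuples, including one duplicate ('p', 'a').
pairs = [('p', 'a'), ('a', 'r'), ('r', 'a'), ('a', 'p'), ('p', 'a')]

from_list = set([''.join(p) for p in pairs])  # builds a temporary list first
from_genexp = set(''.join(p) for p in pairs)  # generator expression, no list
from_setcomp = {''.join(p) for p in pairs}    # set comprehension

print(from_list == from_genexp == from_setcomp)  # True
print(sorted(from_setcomp))  # ['ap', 'ar', 'pa', 'ra']
```

All three produce the same set; the difference is only whether an intermediate list is created and how compact the expression reads.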

07. Sentence generation from a template

This can be done with the format method of str.

main.py

def _07_template(x, y, z):
    return '{x}時の{y}は{z}'.format(x=x, y=y, z=z)

if __name__ == '__main__':
    print('07', _07_template(x=12, y='気温', z=22.4))

Output

$ python main.py
07 12時の気温は22.4
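As an alternative to str.format, the same template can be written as an f-string on Python 3.6 or later. This is just a sketch; the function name template_fstring is made up here and is not part of main.py.

```python
def template_fstring(x, y, z):
    # f-string equivalent of '{x}時の{y}は{z}'.format(x=x, y=y, z=z)
    return f'{x}時の{y}は{z}'


result = template_fstring(12, '気温', 22.4)
print(result)  # 12時の気温は22.4
```

Both forms fill the placeholders by name; the f-string reads the variables from the enclosing scope instead of taking them as arguments to format.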

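08. Cipher text

Python provides the built-in functions ord and chr.
ord converts a character to its code point (the ASCII code for ASCII characters), and chr converts a code point back to a character.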

main.py

def _08_chipher(sentence):
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in sentence)

if __name__ == '__main__':
    print('08', _08_chipher("Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."))

Output

$ python main.py
08 Nld I mvvw z wirmp, zoxlslorx lu xlfihv, zugvi gsv svzeb ovxgfivh rmeloermt jfzmgfn nvxszmrxh.
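Two properties of this cipher are worth checking: applying it twice restores the original text (219 is ord('a') + ord('z'), so a maps to z, b to y, and so on), and str.islower is also true for non-ASCII lowercase letters such as 'é', for which 219 - ord(c) is negative and chr raises ValueError. The sketch below restricts the mapping to ASCII a-z; the name cipher_ascii is hypothetical and this is a variant, not the article's function.

```python
def cipher_ascii(text):
    # Only ASCII a-z are mapped; every other character is left unchanged.
    return ''.join(chr(219 - ord(c)) if 'a' <= c <= 'z' else c for c in text)


message = 'Now I need a drink.'
encrypted = cipher_ascii(message)
decrypted = cipher_ascii(encrypted)

print(encrypted)  # Nld I mvvw z wirmp.
print(decrypted)  # Now I need a drink.

# 'é' is lowercase but not ASCII, so it passes through untouched.
print(cipher_ascii('café'))  # xzué
```

Because encryption and decryption are the same operation, one function covers both halves of the task.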

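Addendum (2018.1.7)

I revised the code after a comment from @shiracamus.
English lowercase letters used to be detected with a regular expression; they are now detected with islower.

The code before the change is below.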

main.py

import re


def _08_chipher(sentence):
    pattern = re.compile('[a-z]')
    return ''.join([chr(219 - ord(s)) if pattern.search(s) else s for s in sentence])

09. Typoglycemia

The split method of str converts a str into a list.
When the argument is omitted, the string is split on runs of whitespace.
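The difference between omitting the separator and passing a single space shows up when the input has repeated or mixed whitespace. A small sketch on a made-up string:

```python
text = 'I  could   read\tthis\n'

no_arg = text.split()         # splits on any run of whitespace
single_space = text.split(' ')  # splits on every single space character

print(no_arg)        # ['I', 'could', 'read', 'this']
print(single_space)  # ['I', '', 'could', '', '', 'read\tthis\n']
```

With no argument, leading and trailing whitespace is dropped and no empty strings appear, which is usually what is wanted when extracting words.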

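To shuffle a string, use the sample function of the random module.
The shuffle function in the same module reorders its argument in place, so it needs a mutable sequence.
To shuffle an immutable object such as a str, use sample, which returns a new list.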
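The contrast between sample and shuffle can be sketched as follows. The word 'believe' and the variable names are chosen here for illustration; the printed words vary from run to run because no seed is set.

```python
import random

word = 'believe'
middle = word[1:-1]  # 'eliev'

# random.sample returns a new list and leaves the string untouched.
shuffled = ''.join(random.sample(middle, len(middle)))
print(word[0] + shuffled + word[-1])

# random.shuffle works in place, so convert the string to a list first.
chars = list(middle)
random.shuffle(chars)
print(word[0] + ''.join(chars) + word[-1])

# Passing the string itself fails: str does not support item assignment.
shuffle_failed = False
try:
    random.shuffle(middle)
except TypeError as e:
    shuffle_failed = True
    print('TypeError:', e)
```

Either approach yields a permutation of the inner characters; sample avoids the extra conversion step when the input is a str.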

main.py

import random


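def _09_typoglycemia(sentence):
    words = sentence.split()
    # Fix the seed so that repeated runs give the same shuffle.
    random.seed(1)
    # Keep the first and last character; shuffle the middle of words longer than 4.
    return ' '.join(
        [word[0] + ''.join(random.sample(word[1:-1], len(word)-2)) + word[-1]
         if len(word) > 4 else word for word in words]
    )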

if __name__ == '__main__':
    print('09', _09_typoglycemia("Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."))

Output

$ python main.py
09 Now I need a dinrk, alhloocic of coseur, aeftr the haevy lteruecs iivlnovng qtauunm msheaincc.

Summary

Next time I will try Chapter 2: UNIX command basics.