More than 5 years have passed since last update.

言語処理100本ノックをNLP屋がやってみる: 第1章 00～09

Last updated at 2019-03-16Posted at 2019-03-14

はじめに

言語処理100本ノックというのがあります。
http://www.cl.ecei.tohoku.ac.jp/nlp100/

言語処理100本ノックは，実践的な課題に取り組みながら，プログラミング，データ分析，研究のスキルを楽しく習得することを目指した問題集です

自然言語処理屋さんなので実務ではもっと応用編なことをやってるんですが、たまにはこういう素振り・筋トレ的なのもいいかなということで。

なお、使う言語は python 3.6。後々コードが長くなると面倒かもしれませんが、とりあえず doctest でテストもまとめて書いちゃうことにします。

指摘・ツッコミ等あればぜひお寄せください。
Twitter @watanabemoriwo まで。

00. 文字列の逆順

文字列"stressed"の文字を逆に（末尾から先頭に向かって）並べた文字列を得よ．

def reverse(s: str):
    """
    文字列を逆順にして返します.

    >>> reverse('stressed')
    'desserts'
    """
    result = ''
    for i in range(len(s)):
        result += s[len(s) - 1 - i]
    return result

一番シンプルと思われる書き方にしてみました。
一方、スライスを活用するとこんなに楽に。

def reverse(s: str):
    """
    文字列を逆順にして返します.

    >>> reverse('stressed')
    'desserts'
    """
    return s[::-1]

ちなみに、doctestはソースコードの最後に以下を入れるとコメント内のテストコード(>>> で始まってる部分) を実行してくれます。

if __name__ == '__main__':
    import doctest
    doctest.testmod()

01. 「パタトクカシーー」

「パタトクカシーー」という文字列の1,3,5,7文字目を取り出して連結した文字列を得よ．

リスト内包表記を使って書く。

def take_even_pos(s: str):
    """
    文字列の奇数文字目を取り出して連結した文字列を返します．
    奇数文字目なので、インデックス的には偶数だけを出します

    >>> take_even_pos('パタトクカシーー')
    'パトカー'
    """
    return ''.join(s[i] for i in range(0, len(s), 2))

スライスだと。

def take_even_pos(s: str):
    """
    文字列の奇数文字目を取り出して連結した文字列を返します．
    奇数文字目なので、インデックス的には偶数だけを出します

    >>> take_even_pos('パタトクカシーー')
    'パトカー'
    """
    return s[::2]

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

「パトカー」＋「タクシー」の文字を先頭から交互に連結して文字列「パタトクカシーー」を得よ．

単純にやるとこうかな。

def merge_two_strings(s1: str, s2: str):
    """
    s1とs2の文字を交互に連結した文字列を返します

    >>> merge_two_strings('パトカー', 'タクシー')
    'パタトクカシーー'
    """
    return ''.join(s1[i] + s2[i] for i in range(len(s1)))

ただ、渡された2つの文字列の長さが違うと、これだとエラーになっちゃいますね。

Trying:
    merge_two_strings('パトカー', 'バス')
Expecting nothing
**********************************************************************
File "nlp00-09.py", line 39, in __main__.merge_two_strings
Failed example:
    merge_two_strings('パトカー', 'バス')
Exception raised:
    Traceback (most recent call last):
      File "C:\Users\xxx\AppData\Local\Programs\Python\Python36\lib\doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest __main__.merge_two_strings[1]>", line 1, in <module>
        merge_two_strings('パトカー', 'バス')
      File "nlp00-09.py", line 41, in merge_two_strings
        return ''.join(s1[i] + s2[i] for i in range(len(s1)))
      File "nlp00-09.py", line 41, in <genexpr>
        return ''.join(s1[i] + s2[i] for i in range(len(s1)))
    IndexError: string index out of range

なので、長い方の文字列は残りをそのままくっつけることにします。

def merge_two_strings(s1: str, s2: str):
    """
    s1とs2の文字を交互に連結した文字列を返します

    >>> merge_two_strings('パトカー', 'タクシー')
    'パタトクカシーー'

    >>> merge_two_strings('パトカー', 'バス')
    'パバトスカー'

    >>> merge_two_strings('パトロールカー', 'タクシー')
    'パタトクロシーールカー'
    """
    return ''.join(s1[i] + s2[i] for i in range(min(len(s1), len(s2)))) + s1[len(s2):] + s2[len(s1):]

03. 円周率

"Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."という文を単語に分解し，各単語の（アルファベットの）文字数を先頭から出現順に並べたリストを作成せよ．

「単語に分割し」と言っていますが、ここでは「単語＝アルファベット(a～zおよびA～Z)のみが連続するもの」と定義します。

def count_length_of_each_word(s: str):
    """
    連続するアルファベット([a-zA-Z])の文字数をリストとして列挙します

    >>> count_length_of_each_word('Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.')
    [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
    """
    return list(count_length_of_each_word_sub(s))

def count_length_of_each_word_sub(s: str):
    count = 0
    for c in s:
        if 'a' <= c <= 'z' or 'A' <= c <= 'Z':
            count += 1
        else:
            if count > 0:
                yield count
            count = 0
    if count > 0:
        yield count

正規表現を使うと、急に楽になります。

def count_length_of_each_word(s: str):
    """
    連続するアルファベット([a-zA-Z])の文字数をリストとして列挙します

    >>> count_length_of_each_word('Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.')
    [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
    """
    import re
    return list(len(m) for m in re.findall('[a-zA-Z]+', s))

04. 元素記号

"Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."という文を単語に分解し，1, 5, 6, 7, 8, 9, 15, 16, 19番目の単語は先頭の1文字，それ以外の単語は先頭に2文字を取り出し，取り出した文字列から単語の位置（先頭から何番目の単語か）への連想配列（辞書型もしくはマップ型）を作成せよ．

素直に書いたらこんな感じ。

def atomic_symbols(s: str):
    """
    単語に分解し，1, 5, 6, 7, 8, 9, 15, 16, 19番目の単語は先頭の1文字，それ以外の単語は先頭に2文字を取り出し，
    取り出した文字列から単語の位置（先頭から何番目の単語か）への連想配列（辞書型もしくはマップ型）を返します
    単語の1文字目は大文字であるという前提にします

    >>> atomic_symbols('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.')
    {'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}
    """
    single_char_atomic_symbols = [1, 5, 6, 7, 8, 9, 15, 16, 19]
    word_number = 0
    result = {}

    for i in range(len(s)):
        if 'A' <= s[i] <= 'Z':  # 大文字だったらここから単語
            word_number += 1  # 何番目の単語かカウント
            if word_number in single_char_atomic_symbols:
                # 1文字取る
                word = s[i]
            else:
                # 2文字取る
                word = s[i:i+2]
            result[word] = word_number  # 連想配列に入れる

    return result

一方、シンプルにリスト内包表記の組み合わせで書くとスッキリ。

def atomic_symbols(s: str):
    """
    単語に分解し，1, 5, 6, 7, 8, 9, 15, 16, 19番目の単語は先頭の1文字，それ以外の単語は先頭に2文字を取り出し，
    取り出した文字列から単語の位置（先頭から何番目の単語か）への連想配列（辞書型もしくはマップ型）を返します

    >>> atomic_symbols('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.')
    {'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}
    """
    import re
    single_char_atomic_symbols = [1, 5, 6, 7, 8, 9, 15, 16, 19]
    return {
        word[0] if index + 1 in single_char_atomic_symbols else word[0:2]: index + 1
        for index, word enumerate(m for m in re.findall('[a-zA-Z]+', s))
    }

05. n-gram

与えられたシーケンス（文字列やリストなど）からn-gramを作る関数を作成せよ．この関数を用い，"I am an NLPer"という文から単語bi-gram，文字bi-gramを得よ．

n-gram の n には数字が入りまして、普通 2 か 3 までが使われます。（3の場合はtri-gram）
bi-gram (2-gram) の場合は連続する２つの要素の連続を順番に列挙するということなので、「神奈川県」を文字でbigramにすると「神奈」「奈川」「川県」になるはずです。
ここでは、「シーケンスからn-gramを作れ」と言っているので、引数にはシーケンスと "n" を渡し、返り値は n 個の要素を持つタプルにすることにします。

def make_ngram(arr: [], n: int):
    """
    渡されたリストからn-gramを作成します

    >>> make_ngram('I am an NLPer', 2)
    [('I', ' '), (' ', 'a'), ('a', 'm'), ('m', ' '), (' ', 'a'), ('a', 'n'), ('n', ' '), (' ', 'N'), ('N', 'L'), ('L', 'P'), ('P', 'e'), ('e', 'r')]

    >>> make_ngram('I am an NLPer'.split(' '), 2)
    [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
    """
    return list(tuple(arr[i:i + n]) for i in range(len(arr) + 1 - n))

06. 集合

"paraparaparadise"と"paragraph"に含まれる文字bi-gramの集合を，それぞれ, XとYとして求め，XとYの和集合，積集合，差集合を求めよ．さらに，'se'というbi-gramがXおよびYに含まれるかどうかを調べよ．

愚直なコードを書こうかと思ったんですが、Python には set という集合そのものを表すデータ型があります。（他の言語でもありますね。HashSet みたいなクラスとか）
で、これらが和集合、積集合、差集合の実装をしてくれちゃってるので、ありがたく使ってしまいましょう。

def nlp06():
    """
    >>> x = set(make_ngram('paraparaparadise', 2))
    >>> y = set(make_ngram('paragraph', 2))
    >>> sorted(x.union(y))  # 和集合
    [('a', 'd'), ('a', 'g'), ('a', 'p'), ('a', 'r'), ('d', 'i'), ('g', 'r'), ('i', 's'), ('p', 'a'), ('p', 'h'), ('r', 'a'), ('s', 'e')]

    >>> sorted(x.difference(y))  # 差集合
    [('a', 'd'), ('d', 'i'), ('i', 's'), ('s', 'e')]

    >>> sorted(x.intersection(y))  # 積集合（共通部分）
    [('a', 'p'), ('a', 'r'), ('p', 'a'), ('r', 'a')]

    >>> ('s', 'e') in x  # Xにseが含まれるか
    True

    >>> ('s', 'e') in y  # Yにseが含まれるか
    False
    """
    pass

make_ngram メソッドは05の使いまわしです。

07. テンプレートによる文生成

引数x, y, zを受け取り「x時のyはz」という文字列を返す関数を実装せよ．さらに，x=12, y="気温", z=22.4として，実行結果を確認せよ．

def template(x, y, z):
    """
    >>> template(12, '気温', 22.4)
    '12時の気温は22.4'
    """
    return '{}時の{}は{}'.format(x, y, z)

特にコメントなし。

08. 暗号文

与えられた文字列の各文字を，以下の仕様で変換する関数cipherを実装せよ．

英小文字ならば(219 - 文字コード)の文字に置換
その他の文字はそのまま出力
この関数を用い，英語のメッセージを暗号化・復号化せよ．

さて、まずこの「219 - 文字コード」ってどういう意味か考えてみましょう。
a の文字コードは 97。219 - 97 = 122 で、文字コード 122 は z。
z の文字コードは 122。219 - 97 = 122 で、文字コード 97 は a。
つまり、ローマ字小文字 a～z について z～a に入れ替えるという暗号化になっていますね。
ちなみに 219 - (219 - n) = n なので、２回cipher関数を通すと元の文字列に戻ります。
そのへんを意識してテストコードを書いてみましょう。

def cipher(s: str) -> str:
    """
    渡された英文を暗号化します

    >>> cipher('AaBbCc')
    'AzByCx'

    >>> cipher(cipher('She Sells Sea Shells by the Sea Shore'))
    'She Sells Sea Shells by the Sea Shore'
    """
    return ''.join(chr(219 - ord(c)) if 'a' <= c <= 'z' else c for c in s)

09. Typoglycemia

スペースで区切られた単語列に対して，各単語の先頭と末尾の文字は残し，それ以外の文字の順序をランダムに並び替えるプログラムを作成せよ．
ただし，長さが４以下の単語は並び替えないこととする．
適当な英語の文（例えば"I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."）を与え，その実行結果を確認せよ．

タイポグリセミア、ちょっと前に話題になりましたね。こういうやつです↓

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
—Matt Davis, MRC Cognition and Brain Sciences、"Reading jumbled texts"

(出典: https://ja.wikipedia.org/wiki/タイポグリセミア)

この例だと４文字以下でも並べ替えていますが、とにかく最初と最後以外の文字の順番が入れ替わっても読めちゃう！っていうやつです。
とりあえず1単語を受け取ってゴニョゴニョする部分を書いてみましょう。
ランダムな結果を生成する処理ってテストコード書きにくいですね。。。

def typoglycemia(word: str) -> str:
    """
    長さが4以下であれば渡された文字列をそのまま返す。
    長さが5以上であれば、最初と最後以外の文字をランダムに並び替える。
    ランダムに並び替えた結果、元と同じ文字列になっていたら、つまらないのでやり直し。

    # 最初の文字はキープ
    >>> typoglycemia('embassy')[0]  
    'e'

    # 最後の文字もキープ
    >>> typoglycemia('embassy')[-1]
    'y'

    # 必ず元と違う単語になっている
    >>> typoglycemia('embassy') != 'embassy'
    True

    # 4文字以下は並び替えない
    >>> typoglycemia('true')
    'true'
    """
    if len(word) <= 4:
        return word
    from random import sample
    while True:
        result = word[0] + ''.join(sample(word[1:-1], len(word) - 2)) + word[-1]
        if result != word:
            return result

この関数を使って、スペース区切りの文を受け取って各単語を5文字以上なら並び替えるとこうなります。

def typoglycemia_sentence(sentence: str) -> str:
    words = sentence.split(' ')
    return ' '.join(typoglycemia(word) for word in words)

print(typoglycemia_sentence("I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."))

結果はこんな感じ。

I cudlo'nt beveile that I colud aullacty urndetsand what I was ridnaeg : the peaeohmnnl pweor of the haumn mind .

ちなみに、上記に引いた元ネタの文章で再現してみましょう。

print(typoglycemia_sentence(
    "According to a research at Cambridge University , " +
    "it doesn't matter in what order the letters in a word are , " +
    "the only important thing is that the first and last letter be at the right place . " +
    "The rest can be a total mess and you can still read it without problem . " +
    "This is because the human mind does not raed every letter by itself , " +
    "but the word as a whole ."
))

Aroccding to a rcsaeerh at Cgidabmre Utivrneisy ,
it deson't mtetar in what oredr the lrttees in a word are ,
the only irtnmopat tihng is that the fsirt and last letetr be at the rhgit pcale .
The rest can be a taotl mess and you can siltl read it wuthiot perolbm .
This is buesace the huamn mind does not raed evrey leettr by istelf ,
but the word as a wlohe .

原文とは違いますが、確かに読めちゃいますね。人間の脳って不思議！
・・・と思ったら、wikipediaによれば ケンブリッジ大学でこのような研究が行われたことはない とのこと。
（この記事を書いて一番勉強になった点）

次回は第2章をやって行きます。それではまた。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up