Language Processing 100 Knocks (言語処理100本ノック), Chapter 1

Posted at 2018-01-16, last updated at 2018-01-18

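Introduction

I got curious after reading other people's articles, so I gave it a try.
The problems come from the 言語処理100本ノック 2015 page.

Approach

I am aiming for code that is as short as possible.
As a result I lean heavily on comprehensions and lambda expressions, and readability probably suffers.
This goes against The Zen of Python, but please look the other way.
Even if the code gets called ugly, I will do my best.

Readability aside, if you know a shorter way to write something or have any other advice, please leave a comment.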

Environment

macOS High Sierra
Python 3.6.3 :: Anaconda3 5.0.1

00. Reversing a string

Obtain the string made by arranging the characters of "stressed" in reverse order (from the last character to the first).

print("stressed"[::-1])

Output: desserts

I skipped assigning "stressed" to a variable and sliced the literal directly.

01. 「パタトクカシーー」

Take the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and concatenate them.

print("パタトクカシーー"[::2])

Output: パトカー

As in No. 00, no variable is used.
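
Both answers so far rely on the step part of slice notation (start:stop:step). As a small sketch of how the step behaves, the variable names below are my own and not from the original code:

```python
text = "パタトクカシーー"

# The task counts characters from 1, but Python indexes from 0,
# so "1st, 3rd, 5th, 7th" means indices 0, 2, 4, 6.
odd = text[::2]     # start at index 0, take every second character
even = text[1::2]   # start at index 1, take every second character
rev = "stressed"[::-1]  # a step of -1 walks from the end to the start

print(odd)   # パトカー
print(even)  # タクシー
print(rev)   # desserts
```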

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

Concatenate the characters of 「パトカー」 and 「タクシー」 alternately from the beginning to obtain the string 「パタトクカシーー」.

print("".join(a+b for a, b in zip("パトカー", "タクシー")))

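Output: パタトクカシーー

I used a comprehension (strictly speaking a generator expression, since it is passed to join without brackets).

The zip function lets you loop over several iterables in parallel.
Note, however, that when their lengths differ, iteration stops at the end of the shortest one.
Reference: 2. Built-in Functions, Python 3.6.3 documentation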
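
To make the truncation concrete, here is a minimal sketch with inputs of different lengths. The shortened input 「パト」 and the use of itertools.zip_longest are my own additions for illustration, not part of the original answer:

```python
from itertools import zip_longest

short, long_ = "パト", "タクシー"

# zip stops as soon as the shorter input is exhausted
truncated = "".join(a + b for a, b in zip(short, long_))

# zip_longest keeps going and pads the shorter input with fillvalue
padded = "".join(a + b for a, b in zip_longest(short, long_, fillvalue=""))

print(truncated)  # パタトク
print(padded)     # パタトクシー
```

For this task both inputs have four characters, so plain zip is enough.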

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

print([len(word.rstrip(".,")) for word in sentence.split()])

Output: [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

This just splits on whitespace, strips the commas and periods attached to the end of each word, and counts the characters.
Below is a version rewritten with append.

words = sentence.split()
n_letters = []
for word in words:
    word = word.rstrip(".,")
    n_letters.append(len(word))
print(n_letters)
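
rstrip(".,") only removes punctuation at the end of a token. If punctuation could appear elsewhere, one possible alternative (my own variation, not the original approach) is to count only the alphabetic characters:

```python
sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")

# str.isalpha() is True only for letters, so punctuation is ignored
# wherever it sits inside the token.
counts = [sum(c.isalpha() for c in word) for word in sentence.split()]

print(counts)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```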

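04. Chemical element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of every other word, then create an associative array (a dictionary or map type) from each extracted string to the position of its word (counting from 1).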

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
numbers = [1, 5, 6, 7, 8, 9, 15, 16, 19]

print({(word[0] if i in numbers else word[:2]):i for i, word in enumerate(sentence.split(), 1)})

Output:
{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}

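Dictionaries support comprehensions too.
The enumerate function pairs each item of an iterable with an index, which is handy.
Passing a number as the second argument sets the starting index.

For the if inside the comprehension, look up Python's conditional expression (x if condition else y).

This one is also hard to read, so I rewrote it.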

dic = {}
words = sentence.split()
for i, word in enumerate(words, 1):
    if i in numbers:
        dic[word[0]] = i
    else:
        dic[word[:2]] = i
print(dic)
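
The one-liner for this task combines two features: the start argument of enumerate and a conditional expression inside a comprehension. The sketch below isolates them on a shortened word list; the names single, picked, and filtered are hypothetical and only serve the illustration:

```python
words = ["Hi", "He", "Lied", "Because"]
single = {1, 4}  # hypothetical positions that keep only one character

# enumerate counts from 0 by default; the second argument changes the start
from_zero = list(enumerate(words))
from_one = list(enumerate(words, 1))

# Conditional expression: every word yields an item, chosen by the condition
picked = [w[0] if i in single else w[:2] for i, w in enumerate(words, 1)]

# Trailing if: words that fail the condition are dropped entirely
filtered = [w[0] for i, w in enumerate(words, 1) if i in single]

print(from_zero[0], from_one[0])  # (0, 'Hi') (1, 'Hi')
print(picked)                     # ['H', 'He', 'Li', 'B']
print(filtered)                   # ['H', 'B']
```

The task needs an entry for every word, which is why the answer uses the conditional expression rather than a trailing if.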

05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, and so on). Use it to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

word_bigram = lambda sentence: ["-".join(sentence.split()[i:i+2]) for i in range(len(sentence.split())-1)]
letter_bigram = lambda sentence: [word[i:i+2] for word in sentence.split() for i in range(len(word)-1)]

sentence = "I am an NLPer"

print(word_bigram(sentence))
print(letter_bigram(sentence))

Output:
['I-am', 'am-an', 'an-NLPer']
['am', 'an', 'NL', 'LP', 'Pe', 'er']

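Combining lambda expressions with nested comprehensions hurts readability badly.
I believe the word bi-grams are correct, but for the character bi-grams I could not tell whether they should be built per word or over the whole sentence joined together.
This program outputs character bi-grams per word.

Nested comprehensions look complicated at first glance.
But as a comparison with the program below shows, the body of the for ... in ... loops has simply moved to the front; the structure of the code itself is unchanged.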

def word_bigram(sentence):
    words = sentence.split()
    bigram = []
    for i in range(len(words) - 1):
        bigram.append(words[i] + "-" + words[i+1])
    return bigram

print(word_bigram(sentence))

def letter_bigram(sentence):
    words = sentence.split()
    bigram = []
    for word in words:
        for i in range(len(word) - 1):
            bigram.append(word[i] + word[i+1])
    return bigram
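
The task statement asks for a function that works on any sequence and any n, whereas the functions above are specific to bi-grams of a sentence. A minimal sketch of a more general version follows; the name ngram and its signature are my own assumption, and it takes the "whole sentence" reading for characters, so spaces show up in the result:

```python
def ngram(seq, n):
    """Return the n-grams of seq (a string, list, or other sliceable sequence)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

sentence = "I am an NLPer"

word_bigrams = ngram(sentence.split(), 2)  # slices of a list of words
char_bigrams = ngram(sentence, 2)          # slices of the raw string

print(word_bigrams)  # [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(char_bigrams)  # includes bi-grams with spaces, such as 'I ' and ' a'
```

Because slicing works the same way on strings and lists, one function covers both cases; joining each word pair with "-" would reproduce the output format used above.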

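06. Sets

Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and find the union, intersection, and difference of X and Y. Also check whether the bi-gram 'se' is contained in X and in Y.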

word_1, word_2 = "paraparaparadise", "paragraph"
X, Y = set(letter_bigram(word_1)), set(letter_bigram(word_2))
print("Union:", X | Y)
print("Intersection:", X & Y)
print("Difference:", X - Y)
print("'se' in X:", "se" in X)
print("'se' in Y:", "se" in Y)

Output (sets are unordered, so the element order may differ between runs):
Union: {'pa', 'gr', 'ar', 'ap', 'is', 'ag', 'ra', 'di', 'se', 'ad', 'ph'}
Intersection: {'pa', 'ra', 'ap', 'ar'}
Difference: {'is', 'di', 'se', 'ad'}
'se' in X: True
'se' in Y: False

An assignment written like num1, num2 = 1, 2 assigns two variables at once.
I use it here to keep the code as short as possible.
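
This form of assignment is tuple unpacking: the right-hand side is packed into a tuple and then unpacked into the names on the left. A short sketch of what that allows, with variable names of my own choosing:

```python
a, b = 1, 2   # pack (1, 2), then unpack into a and b
a, b = b, a   # the same mechanism swaps two values without a temporary
print(a, b)   # 2 1

first, *rest = "paragraph"  # a starred name collects the remaining items
print(first, rest[:2])      # p ['a', 'r']
```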

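07. Generating a sentence from a template

Implement a function that takes arguments x, y, and z and returns the string 「x時のyはz」 ("y at x o'clock is z"). Then check the result with x=12, y="気温", z=22.4.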

temp = lambda x, y, z: "{0}時の{1}は{2}".format(x, y, z)

print(temp(12, "気温", 22.4))

Output: 12時の気温は22.4

lambda seems handy for short functions like this one.
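
Since the environment is Python 3.6, an f-string is another option for the template. This is only a sketch of an alternative; it uses def instead of lambda, so it is not shorter than the original:

```python
def temp(x, y, z):
    # f-string (Python 3.6+): expressions in braces are evaluated in place
    return f"{x}時の{y}は{z}"

print(temp(12, "気温", 22.4))  # 12時の気温は22.4
```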

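08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.

If the character is a lowercase English letter, replace it with the character whose code is (219 - character code)
Output any other character as is

Use this function to encrypt and decrypt an English message.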

cipher = lambda string: "".join(chr(219 - ord(c)) if c.islower() else c for c in string)

string = cipher("It is 8:30 in San Francisco now.")
print(string)
string = cipher(string)
print(string)

Output:
Ig rh 8:30 rm Szm Fizmxrhxl mld.
It is 8:30 in San Francisco now.

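The islower method checks whether each character is lowercase; lowercase characters are converted and everything else is output as is.
Note that islower is also true for non-ASCII lowercase letters, so this matches the specification only for ASCII input such as the message above.
The ord function converts a character to its code point, and the chr function converts a code point back to a character.
Below is the same function written as a loop.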

def cipher(string):
    ciphered = []
    for c in string:
        if c.islower():
            ciphered.append(chr(219 - ord(c)))
        else:
            ciphered.append(c)
    ciphered = "".join(ciphered)
    return ciphered
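
Why the same function both encrypts and decrypts: 219 is ord("a") + ord("z"), so 219 - code mirrors the alphabet (a to z, b to y, and so on), and mirroring twice gives back the original. The sketch below checks this and adds a hypothetical variant, cipher_ascii, that restricts the mapping to a-z instead of relying on islower:

```python
import string

# 219 = 97 + 122, the code points of 'a' and 'z'
assert ord("a") + ord("z") == 219

mirrored = "".join(chr(219 - ord(c)) for c in string.ascii_lowercase)
print(mirrored)  # zyxwvutsrqponmlkjihgfedcba

def cipher_ascii(text):
    # Hypothetical variant: only a-z are mapped, so a character such as 'é'
    # passes through unchanged (with islower, 219 - ord('é') would be negative).
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)

print(cipher_ascii("café"))                # xzué
print(cipher_ascii(cipher_ascii("café")))  # café
```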

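09. Typoglycemia

For a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly shuffles the order of the remaining characters. Words of length 4 or less are not shuffled. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.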

import numpy as np
typoglycemia = lambda sentence: " ".join([
    word[0]+"".join(np.random.permutation(list(word[1:-1])))+word[-1]
    if len(word) > 4 else word for word in sentence.split()])

sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(typoglycemia(sentence))

Example output (the result changes on every run):
I counld't beeilve that I could autalcly unsdternad what I was rideang : the pnmheeaonl poewr of the hamun mind .

numpy's random.permutation returns a randomly shuffled copy of a list.
For example, print(np.random.permutation([0, 1, 2, 3])) prints something like [2 3 1 0] (the order changes on every run).
Thanks to this, the function definition just about fits on one line.
(It is wrapped because it is long, but it is a single expression.)
The condition is len(word) > 4 because the task leaves words of length 4 or less untouched.

Below is the same logic rewritten as a regular function.

def typoglycemia(sentence):
    words = sentence.split()
    typo = []
    for word in words:
        # Words of length 4 or less are kept as they are
        if len(word) > 4:
            listed_word = list(word[1:-1])
            middle = np.random.permutation(listed_word)
            middle = "".join(middle)
            typo.append(word[0] + middle + word[-1])
        else:
            typo.append(word)
    typo = " ".join(typo)
    return typo
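
If numpy is not available, the standard library can do the same job. The sketch below swaps in random.sample for np.random.permutation; the function name typoglycemia_stdlib is hypothetical, and words of length 4 or less are left alone as the task requires:

```python
import random

def typoglycemia_stdlib(sentence):
    result = []
    for word in sentence.split():
        if len(word) > 4:
            inner = list(word[1:-1])
            # random.sample with k == len(inner) returns a shuffled copy
            middle = "".join(random.sample(inner, len(inner)))
            result.append(word[0] + middle + word[-1])
        else:
            result.append(word)
    return " ".join(result)

sentence = ("I couldn't believe that I could actually understand what "
            "I was reading : the phenomenal power of the human mind .")
shuffled = typoglycemia_stdlib(sentence)
print(shuffled)  # differs on every run
```

Each output word keeps its first and last characters and the same multiset of inner characters, which is an easy property to check even though the exact result is random.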

End of Chapter 1

Writing everything as short as possible helped me get the hang of comprehensions and lambda.
I hope this article is useful to you.
