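Language Processing 100 Knocks (言語処理100本ノック), Chapter 1 #Python

No telling how far I'll actually get with this, though. (cf. 言語処理100本ノック 2015)

※ People showed me better approaches in the comments, so I've added them along with notes to myself.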

00. Reversing a string

# coding: utf-8

s = "stressed"

print(s[::-1])

01. 「パタトクカシーー」

# coding: utf-8

s = "パタトクカシーー"

print(s[::2])

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

# coding: utf-8

s1 = 'パトカー'
s2 = 'タクシー'

s = ''.join([i+j for i, j in zip(s1, s2)])
print(s)

zip() takes multiple iterables and builds a single iterator that yields tuples of their corresponding elements.
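To see what zip() actually hands back, here is a minimal sketch. The variable names (pairs, first, rest, short) are just ones I picked for illustration.

```python
# coding: utf-8

s1 = 'パトカー'
s2 = 'タクシー'

pairs = zip(s1, s2)      # an iterator, not a list
first = next(pairs)      # pulls the first tuple out of the iterator
rest = list(pairs)       # the remaining tuples

print(first)
print(rest)

# zip() stops as soon as the shortest input runs out
short = list(zip('abc', 'xy'))
print(short)
```

Because zip() stops at the shortest input, the approach in 02 relies on both strings having the same length.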

Addendum

# coding: utf-8

s1 = 'パトカー'
s2 = 'タクシー'

s = ''.join(i+j for i, j in zip(s1, s2))
print(s)

Using a generator expression means no unnecessary intermediate list has to be built.
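A small sketch of how a generator expression behaves, as far as I understand it. When it is the only argument of a call, the surrounding parentheses can be omitted, which is why ''.join(... for ...) works. The names gen, joined, and leftover are only for illustration.

```python
# coding: utf-8

s1 = 'パトカー'
s2 = 'タクシー'

# Written on its own, a generator expression needs parentheses
gen = (i + j for i, j in zip(s1, s2))
print(type(gen).__name__)

joined = ''.join(gen)    # join() consumes the generator
leftover = list(gen)     # a generator can be iterated only once, so nothing is left

print(joined)
print(leftover)
```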

03. Pi

# coding: utf-8
import re

s = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'

# Remove commas and periods, then split into a list of words
s = re.sub('[,.]', '', s)
s = s.split()

# Count the characters in each word and collect the counts in a list
result = []
for w in s:
    result.append(len(w))

print(result)

Addendum

# coding: utf-8
s = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'

# Split into words, strip trailing ',' and '.', then count the characters
result = [len(w.rstrip(',.')) for w in s.split()]

print(result)

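Code that initializes an empty list and appends to it in a for loop can be rewritten as a list comprehension. rstrip() removes the specified characters from the right end of a string, and several characters can be given at once.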
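A quick sketch of how rstrip() treats its argument: as a set of characters to remove from the right end, not as a literal suffix. The sample strings are made up for illustration.

```python
# coding: utf-8

a = 'drink,'.rstrip(',.')         # one trailing character removed
b = 'course,.'.rstrip(',.')       # every trailing character in the set is removed
c = '.mechanics.'.rstrip(',.')    # the left end is not touched
d = '.mechanics.'.strip(',.')     # strip() works on both ends
e = "couldn't".rstrip(',.')       # characters in the middle are never removed

print(a, b, c, d, e)
```

Since only the right end is touched, this suits the sentence in 03, where punctuation appears only at the end of a word.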

04. Element symbols

# coding: utf-8
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'.split()

target = [1, 5, 6, 7, 8, 9, 15, 16, 19]

result = {}

for i in range(len(s)):
    if i + 1 in target:
        result[i+1] = s[i][:1]
    else:
        result[i+1] = s[i][:2]

print(result)

Addendum

# coding: utf-8
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'

target = 1, 5, 6, 7, 8, 9, 15, 16, 19

result = [w[: 1 if i in target else 2] for i, w in enumerate(s.split(), 1)]

print(result)

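About the [: 1 if i in target else 2] part: at first I thought it meant else could be used inside a comprehension, but it is really a combination of a slice [:x] and the conditional expression (ternary operator) a if cond else b.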
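Note that the first version builds a dict keyed by word position, while the comprehension version builds a list. If the mapping is wanted, the same slice-plus-conditional trick should also fit in a dict comprehension. A minimal sketch; the names by_position and by_symbol are my own:

```python
# coding: utf-8
s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'

target = 1, 5, 6, 7, 8, 9, 15, 16, 19

# position -> symbol, the same shape as the dict in the first version
by_position = {i: w[: 1 if i in target else 2] for i, w in enumerate(s.split(), 1)}

# symbol -> position, the reverse mapping
by_symbol = {sym: i for i, sym in by_position.items()}

print(by_position)
print(by_symbol)
```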

05. n-gram

# coding: utf-8

def n_gram(n, s):
    result = []
    for i in range(0, len(s)-n+1):
        result.append(s[i:i+n])
    return result

print(n_gram(2, 'I am an NLPer'))

Addendum

# coding: utf-8

def n_gram(n, s):
    return [s[i:i+n] for i in range(0, len(s)-n+1)]

print(n_gram(2, 'I am an NLPer'))

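This is another case of appending to an initialized list in a for loop, so it can be rewritten as a comprehension. I'm still not used to comprehensions; for now my impression is that I need to write the for loop first and then rewrite it.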
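Since n_gram only uses len() and slicing, it should work on any sequence, not just strings. Passing a list of words gives word n-grams with the same function. A minimal sketch; the variable names are illustrative:

```python
# coding: utf-8

def n_gram(n, s):
    return [s[i:i+n] for i in range(0, len(s)-n+1)]

text = 'I am an NLPer'

char_bigrams = n_gram(2, text)           # slices of a str are strings
word_bigrams = n_gram(2, text.split())   # slices of a list are lists

print(char_bigrams)
print(word_bigrams)

# When n is larger than the input, range() is empty and so is the result
too_long = n_gram(5, 'abc')
print(too_long)
```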

06. Sets

# coding: utf-8
# Union, intersection, and difference of two bigram sets

def bi_gram(s):
    result = []
    for i in range(0, len(s)-1):
        result.append(s[i:i+2])
    return result

s1 = 'paraparaparadise'
s2 = 'paragraph'

X = set(bi_gram(s1))
Y = set(bi_gram(s2))
print("X = ", X)
print("Y = ", Y)
print("union: ", X | Y) # union
print("intersection: ", X & Y) # intersection
print("difference: ", X - Y) # difference

if "se" in X:
    print("X contains 'se'.")
else:
    print("X doesn't contain 'se'.")

if "se" in Y:
    print("Y contains 'se'.")
else:
    print("Y doesn't contain 'se'.")

07. Sentence generation from a template

# coding: utf-8

def gen_sentence(x, y, z):
    return "{}時の{}は{}".format(x, y, z)

x = 12
y = '気温'
z = 22.4
print(gen_sentence(x, y, z))

08. Cipher text

# coding: utf-8

def cipher(S):
    result = []
    for i in range(len(S)):
        if S[i].islower():
            result.append(chr(219 - ord(S[i])))
        else:
            result.append(S[i])
    return ''.join(result)

S = "abcDe"
print(cipher(S))
print(cipher(cipher(S)))

ord() converts a character to its integer Unicode code point. chr() is the inverse function.
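Why 219 works: it is ord('a') + ord('z'), so 219 - ord(c) mirrors a lowercase ASCII letter within a to z. One thing to watch is that islower() is also true for non-ASCII lowercase letters such as 'é', where 219 - ord(c) goes negative and chr() raises ValueError. The sketch below uses a hypothetical variant, cipher_ascii, that only touches 'a' to 'z'. It is my own illustration, not part of the original solution.

```python
# coding: utf-8

print(ord('a'), ord('z'), ord('a') + ord('z'))

def cipher_ascii(text):
    # Hypothetical variant: mirror only ASCII lowercase letters, leave the rest alone
    return ''.join(chr(219 - ord(c)) if 'a' <= c <= 'z' else c for c in text)

once = cipher_ascii('abcDe')
twice = cipher_ascii(once)      # applying the cipher twice restores the input
accented = cipher_ascii('café') # 'é' is passed through unchanged

print(once, twice, accented)
```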

Addendum

# coding: utf-8

def cipher(S):
    return ''.join(chr(219 - ord(c)) if c.islower() else c for c in S)

S = "abcDe"
print(cipher(S))
print(cipher(cipher(S)))

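Code that appends to an initialized list in a for loop with a conditional branch can be replaced by a conditional expression inside a comprehension. Once you know that this is how it works, it seems fairly readable and writable.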

09. Typoglycemia

# coding: utf-8

import numpy.random as rd

def gen_typo(S):
    if len(S) <= 4:
        return S
    else:
        idx = [0]
        idx.extend(rd.choice(range(1, len(S)-1), len(S)-2, replace=False))
        idx.append(len(S)-1)
        result = []
        for i in range(len(S)):
            result.append(S[idx[i]])
        return ''.join(result)

s = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
s = s.split()
print(' '.join([gen_typo(i) for i in s]))

There are various ways to do random sampling, but numpy.random.choice lets you choose between sampling with replacement and sampling without replacement.

Addendum

# coding: utf-8

import random

def gen_typo(S):
    return ' '.join(
        s
        # Words of length 4 or less are returned as-is
        if len(s) <= 4
        # Words of length 5 or more: keep the first and last characters and shuffle the rest
        else s[0] + ''.join(random.sample(s[1:-1], len(s)-2)) + s[-1]
        for s in S.split()
        )

S = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(gen_typo(S))

This one can also be rewritten with a comprehension and a conditional expression.

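Also, random.sample(population, k) picks k elements at random from population without replacement. population must be a sequence; sets were also accepted in older versions, but that was deprecated in Python 3.9 and removed in 3.11.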
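For comparison, a minimal standard-library sketch of sampling without and with replacement. random.choices is the with-replacement counterpart. The seed value and the sample word are arbitrary choices for illustration.

```python
# coding: utf-8

import random

random.seed(0)  # arbitrary seed, only to make repeated runs comparable

inner = 'phenomenal'[1:-1]  # the characters between the first and the last

# Without replacement: taking k = len(inner) gives a permutation of the characters
without = random.sample(inner, len(inner))

# With replacement: the same character may be drawn more than once
with_repl = random.choices(inner, k=len(inner))

print(''.join(without))
print(''.join(with_repl))
```

Sampling len(s)-2 characters without replacement is what makes the shuffled word keep exactly the same letters as the original.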
