Solving the NLP 100 Exercises (言語処理100本ノック) in Python 3: Chapter 1 "Warm-up" (Part 2)

Posted at 2019-10-08

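Introduction

I'm writing these notes as a memo for myself, both to study natural language processing and to share what I learn.
The problems come from the NLP 100 Exercises (言語処理100本ノック) published by the Inui-Okazaki Laboratory.
I have only been studying Python for a few months, so there are probably mistakes; I would appreciate it if you pointed them out.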

Environment
OS: macOS Mojave
Python: 3.7.4

05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, etc.). Use it to obtain the word bi-grams and character bi-grams of the sentence "I am an NLPer".

main.py

def word_n_gram(sentence, n, letter=False):
    # letter=False: word n-grams; letter=True: character n-grams
    if not letter:
        sentence = sentence.split()
    # Stop at len - n + 1 so that no window shorter than n is produced
    return [sentence[point:point+n] for point in range(len(sentence) - n + 1)]


s = 'I am an NLPer'
print(word_n_gram(s, 2))
print(word_n_gram(s, 2, True))

Output

[['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']

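The character bi-grams include spaces; I'm not sure this is the right way to split them.
The letter flag (True/False) switches between character n-grams and word n-grams.
This was my first encounter with n-grams, and I found them quite interesting.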
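
As a rough alternative sketch, n-grams can also be built by zipping shifted copies of the sequence. The helper name `n_gram` is made up for this illustration and is not part of the code above; it returns tuples, so the character n-grams are joined back into strings afterwards.

```python
def n_gram(seq, n):
    """Return every length-n window of seq as a list of tuples."""
    # zip stops at the shortest shifted slice, so windows shorter than n never appear
    return list(zip(*(seq[i:] for i in range(n))))


sentence = 'I am an NLPer'
word_bigrams = n_gram(sentence.split(), 2)
char_bigrams = [''.join(gram) for gram in n_gram(sentence, 2)]
print(word_bigrams)
print(char_bigrams)
```

Because the windows are tuples, they are hashable and could be placed in a set without further conversion.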

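06. Sets

Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.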

main.py

def word_n_gram(sentence, n, letter=False):
    if not letter:
        # A tuple keeps word n-grams hashable so they can go into a set
        sentence = tuple(sentence.split())
    # Stop at len - n + 1 so that no window shorter than n is produced
    return {sentence[point:point+n] for point in range(len(sentence) - n + 1)}


x = 'paraparaparadise'
y = 'paragraph'
z = 'se'

X = word_n_gram(x, 2, True)
Y = word_n_gram(y, 2, True)
Z = word_n_gram(z, 2, True)

print(X)
print(Y)
print(Z)
print('--------------------')
print(X | Y)    # print(X.union(Y))
print(X & Y)    # print(X.intersection(Y))
print(X - Y)    # print(X.difference(Y))
print('--------------------')
print(Z <= X)
print(Z <= Y)

Output

{'ap', 'se', 'pa', 'ar', 'di', 'ra', 'ad', 'is'}
{'ap', 'pa', 'ar', 'ph', 'ag', 'gr', 'ra'}
{'se'}
--------------------
{'ap', 'se', 'pa', 'ar', 'ph', 'di', 'ag', 'gr', 'ra', 'ad', 'is'}  # union
{'ap', 'ar', 'ra', 'pa'}                                            # intersection
{'se', 'di', 'ad', 'is'}                                            # difference
--------------------
True
False

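Set operations are handled with set.

`set`

A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (Python official documentation)

No duplicates!

Union (X | Y)

Intersection (X & Y)

Difference (X - Y)

Relationship between X, Y, and Z

For now, it should be enough to keep the diagrams and operators in mind.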
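
To make the operators concrete, here is a small sketch that writes the two bi-gram sets out by hand (an assumption for illustration, rather than computing them) and applies each operator. It also shows the symmetric difference `^` mentioned in the documentation quote and the `in` membership test, which is a more direct way to check for 'se'.

```python
# Character bi-grams of 'paraparaparadise' and 'paragraph', written out by hand
X = {'pa', 'ar', 'ra', 'ap', 'ad', 'di', 'is', 'se'}
Y = {'pa', 'ar', 'ra', 'ag', 'gr', 'ap', 'ph'}

union = X | Y                   # elements in X or Y
intersection = X & Y            # elements in both X and Y
difference = X - Y              # elements in X but not in Y
symmetric_difference = X ^ Y    # elements in exactly one of X and Y

# sorted() gives a stable display order, since sets are unordered
print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
print(sorted(symmetric_difference))

# Membership test for a single bi-gram
print('se' in X)
print('se' in Y)
```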

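07. Sentence generation from a template

Implement a function that takes arguments x, y, and z and returns the string "x時のyはz" ("y at x o'clock is z"). Then set x=12, y="気温" ("temperature"), z=22.4 and check the result.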

main.py

def words(x, y, z):
    words = str(x) + "時の" + y + "は" + str(z)
    return words

ans = words(x=12, y="気温", z=22.4)
print(ans)

Output

12時の気温は22.4

An f-string would probably be better here. Or, for this problem alone, would it be better to just print the result directly?

So:

`f-string`

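A formatted string literal, or f-string, is a string literal prefixed with 'f' or 'F'. These strings may contain replacement fields, which are expressions delimited by curly braces {}. While other string literals always have a constant value, formatted strings are really expressions evaluated at run time. (Python official documentation)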

For_example

a = 'A'
b = 'ココ'
c = 'hai'
print(f"{a}国には{b}と{c}がいる")
print(F"{a}国には{b}と{c}がいる")

Output

A国にはココとhaiがいる
A国にはココとhaiがいる

Using this:

main.py2

def words(x, y, z):
    print(f'{x}時の{y}は{z}')

words(x=12, y="気温", z=22.4)

Output

12時の気温は22.4

This looks better.
Let's use f-strings when building output strings.
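
Since the problem statement asks for a function that returns the string, a variant that keeps the f-string but returns instead of printing might look like the sketch below. The function name `template` is a hypothetical choice for this example.

```python
def template(x, y, z):
    """Build the sentence with an f-string and return it to the caller."""
    return f'{x}時の{y}は{z}'


result = template(x=12, y='気温', z=22.4)
print(result)
```

Returning the string leaves the decision to print, log, or test the value to the caller.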

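08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification:
If the character is a lowercase English letter, replace it with the character whose code is (219 - character code)
Otherwise, output the character unchanged
Use this function to encrypt and decrypt an English message.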

main.py

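def cipher(s):
    res = ''
    for ch in s:
        # Only ASCII lowercase letters are mirrored (a <-> z, b <-> y, ...)
        if ch.isascii() and ch.islower():
            res += chr(219 - ord(ch))
        else:
            res += ch
    return res

# Encrypt
ans = cipher('I like football')
print(ans)

# Decrypt (applying the same function again restores the text)
undo = cipher(ans)
print(undo)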

Output

I orpv ullgyzoo
I like football

Used here: str.islower() and ord(c).

`str.islower()`

Return True if all cased characters in the string are lowercase and there is at least one cased character, False otherwise. (Python official documentation)

For_example

s = 'abcde'
print(s.islower())

s = 'Abcde'
print(s.islower())

s = '1234'
print(s.islower())

Output

True
False
False

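Conversely, to test for uppercase, use str.isupper().

`str.isupper()`

Return True if all cased characters in the string are uppercase and there is at least one cased character, False otherwise. (Python official documentation)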

For_example

s = 'abcde'
print(s.isupper())

s = 'ABCDE'
print(s.isupper())

s = '1234'
print(s.isupper())

Output

False
True
False
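
The "at least one cased character" condition is easiest to see with mixed strings. The sample strings below are made up for this sketch: digits and punctuation are ignored by both checks, while a string with no cased characters at all fails both.

```python
samples = ['abc123', 'ABC!', 'Abc', '1234', '']

# Map each sample to its (islower, isupper) result
results = {s: (s.islower(), s.isupper()) for s in samples}

for s, (lower, upper) in results.items():
    print(repr(s), lower, upper)
```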

`ord(c)`

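Given a string representing one Unicode character, return an integer representing the Unicode code point of that character. For example, ord('a') returns the integer 97 and ord('€') (Euro sign) returns 8364. This is the inverse of chr(). (Python official documentation)

Unlike C++, Python does not allow a character (a str of length 1) to be compared or combined arithmetically with an integer, so ord(c) is used to get the code point first, and chr() converts the result back to a character.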

For_example

s = 'a'
print(ord(s))

s = 'A'
print(ord(s))

Output

97
65
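
The constant 219 in problem 08 is ord('a') + ord('z'), so subtracting a lowercase letter's code from it mirrors the alphabet (a becomes z, b becomes y, and so on). The sketch below checks that, and also shows that mixing a str with an int directly raises TypeError. The variable names are chosen just for this example.

```python
# 97 + 122 = 219, so 219 - code swaps a <-> z, b <-> y, ...
total = ord('a') + ord('z')
print(total)

mirrored = [chr(219 - ord(c)) for c in 'abcxyz']
print(mirrored)

# A str and an int cannot be combined without ord()/chr()
try:
    'a' + 1
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)

# chr() is the inverse of ord()
round_trip = chr(ord('a'))
print(round_trip)
```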

09. Typoglycemia

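Write a program that, for a sequence of words separated by spaces, keeps the first and last characters of each word and randomly rearranges the order of the other characters. Words of length 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.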

main.py

import random

moji = input()
words = moji.split()
lis = []
for word in words:
    # Leave words of length 4 or less unchanged
    if len(word) <= 4:
        lis.append(word)
        continue

    # str -> list of characters
    w = list(word)

    # Shuffled indices of the inner characters (first and last excluded)
    l = list(range(1, len(w) - 1))
    num = random.sample(l, len(l))

    # list -> str
    res = ''
    # First character
    res += w[0]
    # Inner characters in random order
    for i in num:
        res += w[i]
    # Last character
    res += w[-1]
    lis.append(res)

ans = ' '.join(lis)
print(ans)

Output

I cd'nolut beileve that I cuold acltlauy ustrednand what I was reiadng : the peomnnaehl poewr of the hmuan mind .

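Used here: random.sample(population, k) and str.join(iterable).

`random.sample(population, k)`

Return a k length list of unique elements chosen from the population sequence or set. Used for random sampling without replacement.
(Python official documentation)

So population can be a sequence or a set (in Python 3.11 and later, a set has to be converted to a list or tuple first).

For sequences, see the following:

Python official documentation

Put simply, list, tuple, range, and str are all sequence types. They differ in that they are mutable, immutable, hold numbers, or hold text, respectively, but they support similar operations. For example, accessing an element by index works the same way for list, tuple, and str.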
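
A quick sketch of that shared behaviour: the same indexing and slicing syntax works on all four types, and mutability is where they differ. The sample values are arbitrary.

```python
# The same indexing and slicing syntax works on every built-in sequence type
sequences = [['a', 'b', 'c', 'd'], ('a', 'b', 'c', 'd'), range(4), 'abcd']
for seq in sequences:
    print(type(seq).__name__, len(seq), seq[0], seq[-1], list(seq[1:-1]))

# Mutability is where they differ: a list accepts item assignment, a tuple does not
items = ['a', 'b', 'c', 'd']
items[0] = 'z'

pair = ('a', 'b')
try:
    pair[0] = 'z'
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

print(items, tuple_is_mutable)
```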

That means a sequence such as a range(), or the str itself, can be passed straight in, so:

main.py2

import random

moji = input()
words = moji.split()
lis = []
for word in words:
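    # Leave words of length 4 or less unchanged
    if len(word) <= 4:
        lis.append(word)
        continue

    front = word[0]
    body = word[1:-1]
    back = word[-1]

    # Shuffle the inner characters by sampling all of them without replacement
    res = front + ''.join(random.sample(body, len(body))) + back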
    lis.append(res)

ans = ' '.join(lis)
print(ans)

Output

I cdln'uot bieleve that I could aaclluty ureatsndnd what I was rnaideg : the poeanhemnl pwoer of the huamn mind .

Much cleaner.
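
Wrapping the logic in a function makes it easier to reuse and to check. The sketch below is one possible shape, not the code from this article: the name `typoglycemia` is hypothetical, and it swaps random.sample for random.shuffle on a list of the inner characters. The sentence is hard-coded instead of read with input().

```python
import random


def typoglycemia(sentence):
    """Shuffle the inner characters of every word longer than 4 characters."""
    shuffled = []
    for word in sentence.split():
        if len(word) > 4:
            middle = list(word[1:-1])
            random.shuffle(middle)  # in-place shuffle of the inner characters
            word = word[0] + ''.join(middle) + word[-1]
        shuffled.append(word)
    return ' '.join(shuffled)


text = ("I couldn't believe that I could actually understand what "
        "I was reading : the phenomenal power of the human mind .")
result = typoglycemia(text)
print(result)
```

The output differs from run to run, but each word keeps its first and last characters and the same multiset of letters.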

`str.join(iterable)`

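Return a string which is the concatenation of the strings in iterable. A TypeError will be raised if there are any non-string values in iterable, including bytes objects. The separator between elements is the string providing this method. (Python official documentation)

Usage: '(separator to insert)'.join(a list of str goes here).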

For_example

vowel = ['a', 'i', 'u', 'e', 'o']
print('@'.join(vowel))
print(''.join(vowel))

Output

a@i@u@e@o
aiueo

For_example2

vowel = [1, 2, 3, 4, 5]
print('@'.join(vowel))
print(''.join(vowel))

Output

Traceback (most recent call last):
 print('@'.join(vowel))
TypeError: sequence item 0: expected str instance, int found
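
One common way around this TypeError is to convert each element to str before joining, for example with map(str, ...). A minimal sketch, with variable names chosen for illustration:

```python
numbers = [1, 2, 3, 4, 5]

# Convert every element to str first, then join
joined = '@'.join(map(str, numbers))
print(joined)

# A generator expression works the same way
joined_plain = ''.join(str(n) for n in numbers)
print(joined_plain)
```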

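Afterword

I can't deny that this isn't very Pythonic.
Writing these up takes a long time, so I'd like to finish each one within an hour.

Next up are UNIX commands...
I don't know much about them.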
