
Solving the NLP 100 Exercise 2020 (自然言語処理100本ノック 2020) in Python, with a painfully detailed walkthrough along the way [Chapter 1: Warm-up]

Last updated at 2020-06-13. Posted at 2020-06-10.

I explain every function that comes up in painstaking detail, and I also describe the line of thinking that led to each piece of code.

Chapter 1: Warm-up

00 Reversed string


text = "stressed"  # named text so the built-in str is not shadowed
reverse_str = text[::-1]  # a step of -1 walks the string backwards
print(reverse_str)

Explanation
This problem only needs knowledge of slicing.
Slicing is a mechanism that cuts out part of a sequence (string, list, tuple, etc.) and returns it as a copy.


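s[n]  # the character at index n of s (indices start at 0)
s[start:stop:step]

# start: index of the first character to include
# stop: index where the slice ends; this index itself is excluded, so pass (last wanted index + 1)
# step: stride between picked characters (2 takes every other character)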
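To make the three slice parameters concrete, here is a small sketch. The variable names (`word`, `first_three`, and so on) are made up for this illustration and do not appear in the solution above.

```python
# Illustrative only: all names here are invented for this sketch.
word = "stressed"

first_three = word[0:3]      # indices 0, 1, 2 (the stop index 3 is excluded)
every_other = word[::2]      # indices 0, 2, 4, 6
reversed_word = word[::-1]   # a negative step walks from the end to the start

print(first_three)    # str
print(every_other)    # srse
print(reversed_word)  # desserts
```

Omitting start or stop means "from the beginning" or "through the end" when the step is positive.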

01 "パタトクカシーー"


text = "パタトクカシーー"
print(text[1::2])

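Explanation:
Apply the slicing knowledge from 00.
[1::2] reads as [start at index 1 : (left blank, so through the end) : step of 2].
In a slice with a positive step, an omitted stop means "through the end".
Because indices start at 0, [1::2] picks the 2nd, 4th, 6th, and 8th characters and prints タクシー. To take the 1st, 3rd, 5th, and 7th characters (パトカー) instead, use [::2].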

02 "パトカー" + "タクシー" = "パタトクカシーー"


str1 = "パトカー"
str2 = "タクシー"
ans = ""
for i in range(len(str1)):
   ans += str1[i]
   ans += str2[i]

print(ans)

Explanation:
range() creates a range object whose elements are the consecutive integers from start up to, but not including, stop. Its signatures are as follows.


range(stop)
range(start, stop[,step])

range(5)
--> 0 1 2 3 4

range(0, 5)
--> 0 1 2 3 4

range(4,7)
--> 4 5 6

range(0, 5, 1)
--> 0 1 2 3 4

range(0, 10, 2)
--> 0 2 4 6 8
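As an alternative to indexing with `range(len(...))`, the same interleaving could be sketched with the built-in `zip`, which pairs up the characters of the two strings. This is one possible variation, not the approach used in the solution above; the names `left`, `right`, and `merged` are introduced here only for illustration.

```python
left = "パトカー"
right = "タクシー"

# zip yields ('パ', 'タ'), ('ト', 'ク'), ... and each pair is joined in order.
merged = "".join(a + b for a, b in zip(left, right))
print(merged)  # パタトクカシーー
```

Note that `zip` stops at the shorter input, so this sketch assumes both strings have the same length, as they do here.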

03 Pi


sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
# Strip commas and periods, then split on spaces to get the words
ans = sentence.replace(",","").replace(".", "").split(" ")
pi = []
for i in range(len(ans)):
    pi.append(len(ans[i]))  # number of letters in each word
print(pi)

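Explanation:
It is enough to understand replace() and split().
replace method: replace("substring to find", "what to replace it with"). Replacing with "" removes the substring.
split method: split("separator to split the whole text on") returns a list.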


s = 'one two one two one'

print(s.replace(' ', '-'))
====> one-two-one-two-one

print(s.split(" "))
====> ['one', 'two', 'one', 'two', 'one']
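The loop in the solution can also be written as a list comprehension. The following is a sketch of that variation; the names `words` and `lengths` are chosen here for illustration, and `split()` without an argument is used so that any run of whitespace counts as one separator.

```python
sentence = ("Now I need a drink, alcoholic of course, "
            "after the heavy lectures involving quantum mechanics.")

# Remove punctuation, split into words, then measure each word.
words = sentence.replace(",", "").replace(".", "").split()
lengths = [len(word) for word in words]

print(lengths)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```

Read left to right, the lengths spell out the first digits of pi: 3.14159265358979.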

04 Atomic symbols


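# Decide whether to take the first character or the first two characters of a word
def extWord(i, word):
  if i in [1,5,6,7,8,9,15,16,19]:
    return (word[0], i)
  else:
    return (word[:2], i)

sentence = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
text = sentence.replace('.', '').replace(',', '')
# The positions listed in extWord are 1-based, so enumerate counts from 1
ans = [extWord(i, w) for i, w in enumerate(text.split(), 1)]
print(dict(ans))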

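Explanation:
First, enumerate: a function that yields the index and the element of an iterable together. It is mostly used with for loops.
enumerate(iterable, start)  (start is the number to begin counting from; it defaults to 0)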


x = ['a','b','c','d','e','f','g','h']
for a,b in enumerate(x):
    print(a,b, end=" ")

==========> 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h
# The first variable receives the index and the second receives the element.
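The `start` argument matters when positions are counted from 1 rather than 0. A minimal sketch of the difference, using a made-up list `letters`:

```python
letters = ['a', 'b', 'c']

pairs_from_zero = list(enumerate(letters))           # counting starts at 0 by default
pairs_from_one = list(enumerate(letters, start=1))   # counting starts at 1

print(pairs_from_zero)  # [(0, 'a'), (1, 'b'), (2, 'c')]
print(pairs_from_one)   # [(1, 'a'), (2, 'b'), (3, 'c')]
```

Only the numbers change; the elements and their order are the same in both cases.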

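About extWord():
The words are classified by their position, which suggests using enumerate.
With enumerate in mind, the parameters are declared as extWord(index, element).
The rest is just branching with an if statement.
About ans = [extWord(i, w) for i, w in enumerate(text.split(), 1)]:
It may look baffling at first glance, but this is a list comprehension. The second argument 1 makes enumerate count from 1, matching the 1-based positions listed in extWord.
The two forms below produce the same result.


ans = [extWord(i, w) for i, w in enumerate(text.split(), 1)]

ans = []
for i, w in enumerate(text.split(), 1):
  ans.append(extWord(i, w))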

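About dict():
dict() is a built-in that builds a dictionary from its argument. Given a list of (symbol, position) tuples, it returns a dictionary that maps each symbol to its position.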
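To show what `dict()` does with a list of two-element tuples, here is a small sketch. The list `pairs` is hand-written sample data in the same (symbol, position) shape that extWord returns, not output copied from the solution.

```python
# Hand-written sample pairs in (symbol, position) form.
pairs = [("H", 1), ("He", 2), ("Li", 3)]

# The first item of each tuple becomes the key, the second the value.
symbol_to_position = dict(pairs)

print(symbol_to_position)        # {'H': 1, 'He': 2, 'Li': 3}
print(symbol_to_position["He"])  # 2
```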

05 n-gram

First, let me explain what an n-gram actually is.

What is an n-gram?

In short, an n-gram is a run of n consecutive words or characters.
Let's look at a concrete example.

こちら葛飾区亀有公園前派出所
As 1-grams this becomes
['こ', 'ち', 'ら', '葛', '飾', '区', '亀', '有', '公', '園', '前', '派', '出', '所']
2-gram
['こち', 'ちら', 'ら葛', '葛飾', '飾区', '区亀', '亀有', '有公', '公園', '園前', '前派', '派出', '出所']
3-gram
['こちら', 'ちら葛', 'ら葛飾', '葛飾区', '飾区亀', '区亀有', '亀有公', '有公園', '公園前', '園前派', '前派出', '派出所']

With that in mind:


def n_gram(target, n):
  return [target[index: index + n] for index in range(len(target) - n + 1)]

text = 'I am an NLPer'
for i in range(1, 4):
  print(n_gram(text, i))             # character n-grams
  print(n_gram(text.split(' '), i))  # word n-grams

Explanation:


def n_gram(target, n):
  return [target[index: index + n] for index in range(len(target) - n + 1)]

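So what is this actually doing?
The call is n_gram(target string or list, n), where n is the size of each n-gram.
The second line is a list comprehension.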


result = [target[index: index + n] for index in range(len(target) - n + 1)]

result = []
# The last valid start index is len(target) - n; beyond it fewer than n elements remain, so the loop stops there
for index in range(len(target) - n + 1):
  result.append(target[index: index + n])

A list comprehension is easier to understand once you rewrite it as an ordinary for loop.
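Because slicing works the same way on strings and lists, one function covers both character n-grams and word n-grams. The sketch below repeats the function and checks a few small inputs; the variable names are chosen for illustration.

```python
def n_gram(target, n):
    # Slide a window of width n over target, one position at a time.
    return [target[index:index + n] for index in range(len(target) - n + 1)]

char_bigrams = n_gram("NLPer", 2)                  # slices of a string are strings
word_bigrams = n_gram("I am an NLPer".split(), 2)  # slices of a list are lists
too_short = n_gram("ab", 3)                        # no window of width 3 fits

print(char_bigrams)  # ['NL', 'LP', 'Pe', 'er']
print(word_bigrams)  # [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(too_short)     # []
```

When n is larger than the input, `len(target) - n + 1` is zero or negative, so `range` is empty and the result is an empty list.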

06 Set


def n_gram(target, n):
    return {target[idx:idx + n] for idx in range(len(target) - n + 1)}

str1 = "paraparaparadise"
str2 = "paragraph"

X = n_gram(str1, 2)
Y = n_gram(str2, 2)

# Union of X and Y
union_set = X | Y # X.union(Y) also works
print(union_set)
# Intersection
intersection_set = X & Y # X.intersection(Y) also works
print(intersection_set)
# Difference
difference_set = X - Y # X.difference(Y) also works
print(difference_set)
# Is 'se' contained in both X and Y?
print('se' in (X & Y))

There is nothing in particular to explain here.
The comments should be enough to follow the code.
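The last check above tests whether 'se' is in both sets at once. If you want to know about each set separately, a sketch along the following lines would do it. The helper `char_bigrams` and the variables `in_x` and `in_y` are names invented for this example.

```python
def char_bigrams(text):
    # Set of all 2-character windows in text (duplicates collapse automatically).
    return {text[i:i + 2] for i in range(len(text) - 1)}

X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

in_x = "se" in X
in_y = "se" in Y

print(in_x, in_y)     # True False
print(sorted(X & Y))  # ['ap', 'ar', 'pa', 'ra']
print(sorted(X - Y))  # ['ad', 'di', 'is', 'se']
```

Sets are unordered, so `sorted` is used here only to make the printed output stable.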

07 Template-based sentence generation


def make_sentence(x, y, z):
    sentence = str(x) + "時の" + y + "は" + str(z)
    return sentence

print(make_sentence(12, "気温", 22.4))

Nothing in particular to explain here either.
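For comparison, the same template could be sketched with an f-string, which converts the values to strings without explicit `str()` calls. The function name `make_sentence_f` is hypothetical and used only to keep it apart from the solution above.

```python
def make_sentence_f(x, y, z):
    # Each {...} placeholder is replaced by the string form of the value.
    return f"{x}時の{y}は{z}"

result = make_sentence_f(12, "気温", 22.4)
print(result)  # 12時の気温は22.4
```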

08 Cipher text


def cipher(sentence):
    sentence = [chr(219 - ord(x)) if x.islower() else x for x in sentence]
    return ''.join(sentence)

x = 'Hey! Are ready ?? 123456'
print('plaintext', x)
x = cipher(x)
print('ciphertext', x)
x = cipher(x)
print('decrypted', x)

Explanation:
First, rewrite the list comprehension as an ordinary for loop.


def cipher(sentence):
  sentence = [chr(219 - ord(x)) if x.islower() else x for x in sentence]
  return ''.join(sentence)
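# The two are equivalent
def cipher(sentence):
  result = []  # separate list, so the input sentence is not overwritten
  for x in sentence:
    if x.islower():  # call the method; a bare x.islower is always truthy
      result.append(chr(219 - ord(x)))
    else:
      result.append(x)
  return ''.join(result)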

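About islower():
islower() returns True if every cased character in the string is lowercase and there is at least one cased character; otherwise it returns False. For a single character, it is True exactly when that character is a lowercase letter.
About join():
join is a str method. It concatenates the strings in an iterable into one string, placing the string it is called on between them.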


chars = ['a','b','c','d','e','f']
x = ''.join(chars)
====> abcdef
x = ','.join(chars)
====> a,b,c,d,e,f

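About encryption and decryption:
Look at chr(219 - ord(x)). 219 equals ord('a') + ord('z') (97 + 122), so the expression mirrors a lowercase letter within the alphabet: a becomes z, b becomes y, and so on. Applying it twice gives 219 - (219 - ord(x)) = ord(x), so the character returns to its original value.
Try it yourself with any lowercase letter you like!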
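The arithmetic behind the cipher can be checked directly. In the sketch below, `mirror` is a hypothetical helper that applies the same `chr(219 - ord(x))` expression to a single lowercase letter.

```python
total = ord("a") + ord("z")
print(total)  # 219

def mirror(ch):
    # Maps a lowercase ASCII letter to its counterpart from the other end of the alphabet.
    return chr(219 - ord(ch))

once = mirror("a")
twice = mirror(mirror("h"))

print(once)         # z
print(mirror("b"))  # y
print(twice)        # h
```

Since applying `mirror` twice returns the original letter, the same function serves for both encryption and decryption.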

09 Typoglycemia

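What is Typoglycemia?
It is the phenomenon where a word stays readable even when its letters are rearranged, as long as the first and last letters are in the right place.
Example: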

Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
- Matt Davis, MRC Cognition and Brain Sciences, "Reading jumbled texts"

Somehow you can still read it, right? That is the phenomenon.


import random
def shuffle_word(word):
    if len(word) <= 4:
        return word
    else:
        head = word[0]
        tail = word[-1]
        centers = word[1:-1]
        centers = list(centers)
        random_centers = random.sample(centers, len(centers))

        return "".join(list(head) + random_centers + list(tail))

words = "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .".replace(".","").split()
list_str = [shuffle_word(w) for w in words]
print(list_str)

Explanation:
Before going through shuffle_word, understand the following.
Before writing shuffle_word, decide how the words will be passed to it.
Here, the following loop hands the words to shuffle_word() one at a time.


list_str = [shuffle_word(w) for w in words]

The familiar list comprehension again. For clarity, here it is as an ordinary for loop.


list_str = []
for w in words:
  list_str.append(shuffle_word(w))

In other words, each element is taken from words (a list of words, not a single string), passed through shuffle_word, and appended to list_str.
About shuffle_word:
The comments in the code should be enough to follow it.


def shuffle_word(word):
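    if len(word) <= 4: # Words of 4 characters or fewer are returned unchanged.
        return word
    else: # Only the characters between the first and the last get shuffled, so
        # keep the characters that must stay in place
        head = word[0]
        tail = word[-1]
        # take the characters to shuffle
        centers = word[1:-1]
        # random.sample accepts any sequence (a string would also work); converting to a list just makes the type explicit
        centers = list(centers)
        # Build a randomly reordered list.
        random_centers = random.sample(centers, len(centers))
        # Concatenate the pieces and return them as a string; ''.join() turns the list into a string
        return "".join(list(head) + random_centers + list(tail))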

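About random.sample():
It belongs to the random module.
For picking elements at random, two options come to mind.
- random.sample(): returns multiple elements picked at random from a list, with no element picked twice (sampling without replacement).
- random.choices(): returns multiple elements picked at random from a list; unlike sample(), the same element can be picked more than once (sampling with replacement).
Here we use random.sample(). Asking for as many elements as the list contains yields every element exactly once in random order, which amounts to a shuffle.
How to use random.sample:
random.sample(list, number of elements to take)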


l = [0, 1, 2, 3, 4]

print(random.sample(l, 3))
=====> [2, 4, 0]  (one possible result; the output changes on every run)
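The difference between the two options, plus `random.shuffle` (a third standard-library function not used in the solution above), can be sketched as follows. The seed value and all variable names are assumptions made for this example; the seed is fixed only so that repeated runs behave the same.

```python
import random

rng = random.Random(0)  # seeded generator, used here only for repeatability

letters = list("bcdefg")

# sample: returns a new list with no element picked twice; letters is untouched.
sampled = rng.sample(letters, len(letters))

# shuffle: reorders a list in place and returns None, so work on a copy.
shuffled = letters.copy()
rng.shuffle(shuffled)

# choices: picks with replacement, so the same letter may appear more than once.
with_replacement = rng.choices(letters, k=len(letters))

print(sampled)
print(shuffled)
print(with_replacement)
```

Both `sampled` and `shuffled` are permutations of the original letters, while `with_replacement` may leave some letters out and repeat others, which is why it is not suitable for shuffling a word.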

That concludes Chapter 1.
I will post the walkthrough for Chapter 2 soon.
