
NLP 100 Exercises (言語処理100本ノック): through Chapter 1

Posted at 2017-03-05, last updated at 2017-03-05

I want to learn Python but have no idea where to start, so
I am going to work through 言語処理100本ノック 2015 (NLP 100 Exercises 2015).
My environment is Windows 10 and Python 3.6.0.

00. Reversing a string

Obtain the string formed by arranging the characters of "stressed" in reverse order (from the end to the beginning).

00.py

string = "stressed"
print(string[::-1])
# Implemented with a slice
# string[start:stop:step]
# Negative start/stop positions count from the end of the string
# The following also works
print(string[-1::-1])

Implemented with a slice.
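
As a cross-check on the slice, the same reversal can be sketched with the built-in reversed() and str.join(). The name reversed_string is introduced here for illustration and is not part of the original 00.py.

```python
string = "stressed"

# reversed() yields the characters from last to first;
# join() glues them back into a single string
reversed_string = "".join(reversed(string))

print(reversed_string)  # desserts
```

The slice is shorter, but reversed() also works on iterables that do not support slicing.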

01. "パタトクカシーー"

Extract the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them into a new string.

01.py

string = "パタトクカシーー"
print(string[::2])
# As in 00, use the slice step, this time taking every 2nd character

As in 00, this uses the slice step.

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by concatenating the characters of "パトカー" and "タクシー" alternately from the beginning.

02.py

string1 = "パトカー"
string2 = "タクシー"
string3 = ""
i = 0
while i < len(string1):
    string3 += string1[i] + string2[i]
    i+=1
print(string3)

This one is a bit unsatisfying... it only works when both strings have the same length.
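
One possible way around the equal-length limitation is to pair the characters with zip(), or with itertools.zip_longest() when the lengths differ. This is only a sketch: the names interleaved and interleave are hypothetical and do not appear in the original 02.py.

```python
from itertools import zip_longest

string1 = "パトカー"
string2 = "タクシー"

# zip() pairs characters position by position and stops at the shorter string
interleaved = "".join(a + b for a, b in zip(string1, string2))
print(interleaved)  # パタトクカシーー


def interleave(s1, s2):
    # zip_longest() continues to the end of the longer string,
    # padding the shorter one with fillvalue (an empty string here)
    return "".join(a + b for a, b in zip_longest(s1, s2, fillvalue=""))


print(interleave("abc", "12345"))  # a1b2c345
```

With zip_longest() the leftover characters of the longer string are simply appended at the end.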

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance from the beginning.

03.py

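import re  # regular expressions
from collections import defaultdict  # for counting characters

string = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

# Remove "," and "." and split into a list of words
words_list = re.sub('[,.]', '', string).split()

# Initialize the counter
counter = defaultdict(int)

# Count characters, starting from the first word and its first character
# Note: this tallies how often each character occurs,
# whereas the problem statement asks for the letter count of each word
for word in words_list:
    for c in word:
        counter[c] += 1

# The counter is a dict, so convert it to a list of (character, count) tuples
count_list = list(counter.items())

print(count_list)

There are too many type conversions overall... I would like to cut them down.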
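
Read literally, the problem statement asks for the number of letters in each word, in order of appearance. A minimal sketch of that reading follows, reusing the same punctuation stripping as 03.py. The names words and word_lengths are introduced here for illustration.

```python
import re

string = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

# Strip "," and "." and split on whitespace
words = re.sub('[,.]', '', string).split()

# One length per word, in order of appearance
word_lengths = [len(word) for word in words]

print(word_lengths)
# [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```

The lengths spell out the leading digits of pi, which is where the problem title comes from. This version also needs no dict or type conversion.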

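04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of every other word, then build an associative array (dictionary or map type) from each extracted string to the position of its word (its index counted from the start).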

04.py

elements = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
mono_words = [1, 5, 6, 7, 8, 9, 15, 16, 19]
shortened_elements = {}

# Remove "." and split into a list of words
elements_list = elements.replace('.', '').split()

# Walk through the word list one word at a time,
# check whether its position is one of the predefined one-letter positions,
# and store the abbreviated symbol with its position in the dict
# enumerate yields the index along with each word
for i, e in enumerate(elements_list):
    count = i + 1
    if count in mono_words:
        shortened_elements[e[:1]] = count
    else:
        shortened_elements[e[:2]] = count

print(shortened_elements)
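
The same mapping could also be written as a dict comprehension. The sketch below assumes enumerate(..., start=1) for 1-based positions and stores the one-letter positions in a set. It is an alternative formulation, not the original 04.py.

```python
elements = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
mono_words = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# start=1 makes enumerate count positions from 1, so no "i + 1" is needed;
# the slice length is 1 for the listed positions and 2 otherwise
shortened_elements = {
    word[:1 if position in mono_words else 2]: position
    for position, word in enumerate(elements.replace('.', '').split(), start=1)
}

print(shortened_elements)
```

For example, shortened_elements["H"] is 1 and shortened_elements["Mi"] is 12 (the mnemonic gives "Mi" rather than "Mg").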

05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

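Character bi-grams? Word bi-grams?

For "I am an NLPer", the word bi-grams would be
{'I am', 'am an', 'an NLPer'}
and the character bi-grams (ignoring spaces) would be
{'Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er'}

so that should be the result to aim for.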

05.py

import re  # regular expressions

# Prepare the sentence both as a string and as a list of words
sentence_string = "I am an NLPer"
sentence_list = sentence_string.split()

# n-gram function taking the number n and a sequence as arguments
def n_gram(n, sequence):

    # List to return
    ngram = []

    # To handle strings and lists with the same code,
    # convert a string argument into a list of single characters,
    # removing ",", "." and spaces
    if isinstance(sequence, str):
        sequence = list(re.sub('[,. ]', '', sequence))

    # Build the n-grams:
    # slice n items starting at position i;
    # the loop runs over i = 0 .. len(sequence) - n
    for i in range(len(sequence)-n+1):
        ngram.append(sequence[i:i+n])

    return ngram

# Word bi-grams
print(n_gram(2, sentence_list))
# Character bi-grams
print(n_gram(2, sentence_string))
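
The sliding window can also be expressed with zip() over shifted copies of the sequence. The following is a sketch under that assumption: n_gram_zip is a hypothetical name, it returns tuples rather than lists, and unlike 05.py it does not strip spaces itself.

```python
def n_gram_zip(n, sequence):
    # sequence[0:], sequence[1:], ..., sequence[n-1:] are shifted copies;
    # zip() walks them in parallel and stops when the shortest one runs out
    return list(zip(*(sequence[i:] for i in range(n))))


words = "I am an NLPer".split()
print(n_gram_zip(2, words))
# [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]

# Remove spaces before building character bi-grams, then join each pair
chars = "I am an NLPer".replace(" ", "")
print(["".join(gram) for gram in n_gram_zip(2, chars)])
# ['Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er']
```

Because tuples are hashable, the result can be passed to set() directly.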

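06. Sets

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and compute the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.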

06.py

import re  # regular expressions

# Input strings
X = "paraparaparadise"
Y = "paragraph"

# n-gram function, reused from 05
def n_gram(n, sequence):

    ngram = []

    if isinstance(sequence, str):
        sequence = list(re.sub('[,. ]', '', sequence))

    for i in range(len(sequence)-n+1):
        # Changed from 05: a list of lists cannot be converted to the set type
        # used below, so each n-gram is converted to a tuple
        ngram.append(tuple(sequence[i:i+n]))

    return ngram

# Build the bi-grams of X and Y
# Defined as sets so that set operations can be applied
X = set(n_gram(2, X))
Y = set(n_gram(2, Y))

# Union
print(X | Y)
# Intersection
print(X & Y)
# Difference
print(X - Y)
print(Y - X)
# Check whether 'se' is contained in X and in Y, one set at a time
for name, bigrams in (("X", X), ("Y", Y)):
    if ('s', 'e') in bigrams:
        print("'se' is contained in " + name)
    else:
        print("'se' is not contained in " + name)

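07. Generating a sentence from a template

Implement a function that takes arguments x, y, z and returns the string "x時のyはz" ("the y at x o'clock is z"). Then check the result with x=12, y="気温" (temperature), z=22.4.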

07.py

def tostr(x,y,z):
    return ("%s時の%sは%s" % (x,y,z))

print( tostr(12,"気温",22.4))
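
The same template can also be filled in with str.format() or, on Python 3.6 and later, an f-string. The function names below are hypothetical variants of tostr, shown only for comparison with the % operator used in 07.py.

```python
def tostr_format(x, y, z):
    # str.format() fills each {} placeholder in order
    return "{}時の{}は{}".format(x, y, z)


def tostr_fstring(x, y, z):
    # f-strings (Python 3.6+) evaluate the expressions inside {} in place
    return f"{x}時の{y}は{z}"


print(tostr_format(12, "気温", 22.4))   # 12時の気温は22.4
print(tostr_fstring(12, "気温", 22.4))  # 12時の気温は22.4
```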

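08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.

Replace a lowercase English letter with the character whose code is (219 - character code)
Output any other character unchanged
Use this function to encrypt and decrypt an English message.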

08.py

import re  # regular expressions

def cipher(text):
    # Process the string one character at a time
    re_str = []
    for s in text:
        if re.search('[a-z]', s):
            # Lowercase letters have character codes 97-122,
            # so 219 - code maps a->z, b->y, c->x, ..., z->a
            re_str.append(chr(219 - ord(s)))
        else:
            re_str.append(s)
    return "".join(re_str)

test_str = "I am a esaka!!"

print(cipher(test_str))
# Result: I zn z vhzpz!!
print(cipher(cipher(test_str)))
# Result: I am a esaka!!
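
Since 219 - code simply mirrors the lowercase alphabet (a<->z, b<->y, ...), the same substitution could be sketched with a translation table. This is an alternative formulation, not the original 08.py; cipher_translate and table are hypothetical names.

```python
import string

lower = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"

# Map each lowercase letter to its mirror image in the alphabet;
# characters without a table entry pass through unchanged
table = str.maketrans(lower, lower[::-1])


def cipher_translate(text):
    return text.translate(table)


test_str = "I am a esaka!!"
print(cipher_translate(test_str))                    # I zn z vhzpz!!
print(cipher_translate(cipher_translate(test_str)))  # I am a esaka!!
```

Applying the function twice returns the original text, so the same function both encrypts and decrypts.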

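09. Typoglycemia

Given a sequence of space-separated words, write a program that keeps the first and last characters of each word and randomly reorders the remaining characters. Words of length 4 or less are left unchanged. Feed it a suitable English sentence (for example "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.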

09.py

import random  # random numbers

def rand_str(str):
    # Split on spaces into a list of words
    str = str.split(' ')
    re_str = []
    for i,s in enumerate(str):
        if len(s) > 4 and i != 0 and i != len(str)-1:
            re_str.append("".join(random.sample(s,len(s))))
        else:
            re_str.append(s)
    # Join the words back together with spaces for the return value
    return " ".join(re_str)

test_str = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."

print(rand_str(test_str))
# Result: I tdcuon'l evibele that I ludoc ltyucala andnetrsdu what I was drienag : the lnpaeneohm erpow of the uahmn mind .

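Correction

I had misunderstood the task as leaving only the first and last words of the given string unshuffled.
It is the first and last characters of each word.
I suppose this is the thing that was going around a while back.

So the corrected version is below.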