
NLP 100 Exercises (言語処理100本ノック), Chapter 1, in Python

Last updated at 2017-07-05, posted at 2017-07-05

I recently needed to learn Python, so I decided to try the NLP 100 Exercises (言語処理100本ノック).
I'll start with Chapter 1: Warm-up.

NLP 100 Exercises (言語処理100本ノック)

00. Reversing a string

Obtain the string whose characters are those of "stressed" in reverse order (from the end to the beginning).

q00='stressed'
print(q00[::-1])

01. 「パタトクカシーー」

Take the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and concatenate them into a new string.

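q01='パタトクカシーー'
# First version, picking each character by hand:
# print(q01[0]+q01[2]+q01[4]+q01[6])

# -> Updated version: the 1st, 3rd, 5th, 7th characters sit at indices 0, 2, 4, 6
print(q01[::2])  #=> 'パトカー'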

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

Obtain the string 「パタトクカシーー」 by concatenating the characters of 「パトカー」 and 「タクシー」 alternately, starting from the first character.

Solution 1

q021='パトカー'
q022='タクシー'

length=min(len(q021),len(q022))

ansq02=''
for i in range(length):
    temp=q021[i]+q022[i]
    ansq02+=temp

print(ansq02)

Solution 2

q021='パトカー'
q022='タクシー'

ansq022="".join(i+j for i,j in zip(q021,q022))

print(ansq022)
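
Both solutions stop at the end of the shorter string, since `min(len(...))` and `zip` truncate. That is fine here because both inputs have four characters. As a hedged sketch, if the inputs could differ in length, `itertools.zip_longest` with `fillvalue=""` would keep the leftover characters. The function name `interleave` is an assumption made for illustration and is not part of the exercise.

```python
from itertools import zip_longest

def interleave(a, b):
    """Alternate characters from a and b; leftovers of the longer string are appended."""
    return "".join(x + y for x, y in zip_longest(a, b, fillvalue=""))

print(interleave("パトカー", "タクシー"))
print(interleave("abc", "12345"))
```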

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

q03="Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

ansq03=[len(i.strip(",.")) for i in q03.split()]

print(ansq03)
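
Note that `strip(",.")` only removes commas and periods at the ends of a word. As a rough alternative sketch, counting only the characters for which `str.isalpha()` is true handles punctuation anywhere in the word. The name `word_lengths` is hypothetical.

```python
def word_lengths(sentence):
    """Count only alphabetic characters in each whitespace-separated word."""
    return [sum(ch.isalpha() for ch in word) for word in sentence.split()]

sentence = ("Now I need a drink, alcoholic of course, after the heavy "
            "lectures involving quantum mechanics.")
print(word_lengths(sentence))
```

The resulting list spells out the leading digits of pi, which is where the exercise title comes from.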

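04. Chemical element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. For the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words take the first character, and for all other words take the first two characters. Then build an associative array (dictionary or map type) from each extracted string to the word's position (its 1-based index in the sentence).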

q04="Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."

ansq04={}

# Strip trailing punctuation from each word
q04_list=[i.strip(",.") for i in q04.split()]
print(q04_list)

# 1-based positions of the words that contribute a single character
q04_listNum=[1, 5, 6, 7, 8, 9, 15, 16, 19]

for idx,val in enumerate(q04_list, start=1):
    if idx in q04_listNum:
        ansq04[val[0]] = idx
    else:
        ansq04[val[:2]] = idx

print(ansq04)
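
The same mapping can also be written as a single dictionary comprehension. This is only a sketch of an alternative style. The names `one_letter_positions` and `symbols` are assumptions, and it takes the prefix before stripping punctuation, which works here because the punctuation only appears at the end of a word.

```python
sentence = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations "
            "Might Also Sign Peace Security Clause. Arthur King Can.")

# 1-based positions of the words that contribute a single character
one_letter_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

# Map each extracted prefix to the 1-based position of its word
symbols = {
    (word[:1] if pos in one_letter_positions else word[:2]): pos
    for pos, word in enumerate(sentence.split(), start=1)
}
print(symbols)
```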

05. n-gram

Write a function that builds n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

q05="I am an NLPer"

# bi-gram for char
char_bigram=[q05[i:i+2] for i in range(len(q05)-1)]
print(char_bigram)

# bi-gram for words
words=[(i.strip(".,")) for i in q05.split()]
words_bigram=["-".join(words[i:i+2]) for i in range(len(words)-1)]
print(words_bigram)
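
The exercise asks for a reusable function, so here is a minimal sketch of one that works for any n and for both strings and lists. The name `n_gram` and its signature are assumptions. Word bi-grams come back as two-element lists rather than hyphen-joined strings.

```python
def n_gram(seq, n):
    """Return all contiguous length-n slices of seq (a string or a list)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

text = "I am an NLPer"
print(n_gram(text, 2))          # character bi-grams
print(n_gram(text.split(), 2))  # word bi-grams, each a list of two words
```

When the sequence is shorter than n, the range is empty and the function returns an empty list.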

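06. Sets

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, then compute the union, intersection, and difference of X and Y. Also check whether the bi-gram 'se' is contained in X and in Y.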

def bigram(a):
    # Return the list of character bi-grams of a
    return [a[i:i+2] for i in range(len(a)-1)]

q061="paraparaparadise"
q062="paragraph"

# Converting to sets removes duplicate bi-grams
bigramX_set=set(bigram(q061))
bigramY_set=set(bigram(q062))
print('bigramX_set =', bigramX_set)
print('bigramY_set =', bigramY_set)

# Union
print('Union = ', bigramX_set | bigramY_set)
# Difference (X - Y)
print('Difference = ', bigramX_set - bigramY_set)
# Intersection
print('Intersection = ', bigramX_set & bigramY_set)
# Membership: check X and Y separately, as the problem asks
print("'se' in X = ", 'se' in bigramX_set)
print("'se' in Y = ", 'se' in bigramY_set)

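07. Sentence generation from a template

Implement a function that takes arguments x, y, z and returns the string 「x時のyはz」 ("y at x o'clock is z"). Then check the result with x=12, y="気温" (temperature), z=22.4.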

def maketext(x=1,y='あんこ',z=10):
    # Build the sentence "x時のyはz"; str() handles numeric arguments
    result=str(x)+'時の'+y+'は'+str(z)
    return result

x,y,z=12,'気温',22.4

print(maketext(x,y,z))  #=> '12時の気温は22.4'
# print(maketext())
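
On Python 3.6 or later, the same template can be filled with an f-string, which converts the numbers to text without explicit `str()` calls. A minimal sketch, with `make_text` as a hypothetical name:

```python
def make_text(x, y, z):
    """Fill the template 'x時のyはz' using an f-string (Python 3.6+)."""
    return f"{x}時の{y}は{z}"

print(make_text(12, "気温", 22.4))
```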

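08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification:

If the character is a lowercase English letter, replace it with the character whose code is (219 - character code)
Output any other character unchanged
Use this function to encrypt and decrypt an English message.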

# Q08
def cipher(a):
    ciptex_list=[]
    for i in a:
        texCode=ord(i)
        # Lowercase ASCII letters have codes 97 ('a') to 122 ('z').
        # Use a chained comparison; bitwise & would bind tighter than < and >.
        if 96<texCode<123:
            updtexCode=chr(219-texCode)
        else:
            updtexCode=i

        ciptex_list.append(updtexCode)

    return "".join(ciptex_list)

print(cipher('abcdef')) #=> 'zyxwvu'
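
The exercise also asks for decryption. Since 219 = ord('a') + ord('z'), the mapping sends 'a' to 'z', 'b' to 'y', and so on, so applying it twice returns the original letter and the same function decrypts. The sketch below re-implements the rule compactly so that it runs on its own. The name `atbash_lower` is an assumption made for illustration.

```python
def atbash_lower(text):
    """Replace each lowercase ASCII letter c with chr(219 - ord(c)); keep other characters."""
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)

message = "Hello, World!"
encrypted = atbash_lower(message)
decrypted = atbash_lower(encrypted)  # the same function undoes the mapping
print(encrypted)
print(decrypted)
```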

09. Typoglycemia

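For a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly shuffles the order of the remaining characters. Words of length 4 or less are not shuffled. Give it a suitable English sentence (for example "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.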

import random

def randsort(a):
    result = []
    # Split on whitespace and drop surrounding commas and periods
    listA = [i.strip(',.') for i in a.split()]
    # Return the characters of x in random order
    randchar = lambda x: ''.join(random.sample(x,len(x)))

    for i in listA:
        if len(i) > 4:
            # Keep the first and last characters, shuffle the rest
            temp_word=i[0]+randchar(i[1:-1])+i[-1]
            result.append(temp_word)
        else:
            result.append(i)
    return result

q09="I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."

print(randsort(q09))
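
Because the output is random, it is hard to compare runs. As a hedged variation, the sketch below uses a private `random.Random` generator with an optional seed, so that a run can be reproduced, and returns a sentence instead of a list. The function name `typoglycemia` and the `seed` parameter are assumptions and are not part of the exercise.

```python
import random

def typoglycemia(sentence, seed=None):
    """Shuffle the inner letters of every word longer than four characters."""
    rng = random.Random(seed)  # private generator; seed is an assumed extra parameter
    words = []
    for word in sentence.split(" "):
        if len(word) > 4:
            middle = list(word[1:-1])
            rng.shuffle(middle)  # in-place shuffle of the inner characters
            word = word[0] + "".join(middle) + word[-1]
        words.append(word)
    return " ".join(words)

text = "I couldn't believe that I could actually understand what I was reading"
print(typoglycemia(text, seed=0))
```

Calling it twice with the same seed gives the same shuffled sentence.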

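This was my first time writing Python on my own. There was a lot to look up along the way, and I learned a great deal.
There may well be more efficient ways to solve these, but I'll settle for this for now.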

References

Getting a character's code value and getting a character from a code value in Python
http://d.hatena.ne.jp/flying-foozy/20111204/1323009984

Unicode HOWTO (original)
https://docs.python.jp/3/howto/unicode.html

Python: comparing the elements of two lists with set operations
http://www.yukun.info/blog/2008/08/python-set-list-comparison.html

3.7 Set types -- set, frozenset
http://docs.python.jp/2.5/lib/types-set.html
