
Taking on the NLP 100 Exercises (言語処理100本ノック) / Chapter 1: Warm-up

Last updated at 2018-04-06 / Posted at 2018-04-03

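💪 Preface

To improve my programming skills, I am taking on the NLP 100 Exercises (言語処理100本ノック) 2015 and recording my progress here. The results will also go up on GitHub. If you know a better way to write something, I would be glad to hear it in the comments.

The NLP 100 Exercises is a problem set that aims to let you acquire programming, data analysis and research skills in an enjoyable way while working on practical tasks
・Practical and exciting topics have been carefully selected
・Besides language processing, you can get familiar with neighbouring fields such as statistics and machine learning
・You can learn how to carry out research and data analysis, together with the conventions and skills involved
・The data and corpora needed to solve the problems are distributed
・Python is the assumed language, but other languages are supported as well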

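💪 Premises, constraints, goals, etc.

Python 3.6.5
Touch programming every day
For processing and syntax, work out the answer myself without looking at other people's answers; when something has to be looked up, consult only the Python language reference
Read through the reference for every type or function I did not know
To improve an answer, come up with several candidate solutions; at that stage Google searches and other people's answers are allowed
The goal is a deep understanding of types and functions; speed-ups and line-count reductions via list comprehensions and the ternary operator are left for later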

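💪 Chapter 1: Warm-up

Working on tasks that deal with text and strings, we review some slightly advanced topics of the programming language.

⚾ 00. Reversing a string

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the end to the beginning).

Expected answer

desserts

Solution 1: Loop

list([iterable]) splits the string into a list of single characters
- 4. Built-in Types — list
len(s) gets the number of elements, and the list is reassembled in reverse order
- 2. Built-in Functions — len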

answer

odai_str = ('stressed')
ansr_str = ''

list_str = list(odai_str)
# walk the indices from the front and pick the characters from the back
for index in range(len(list_str)):
    rev_index = len(list_str) - index - 1
    ansr_str += list_str[rev_index]

print(ansr_str)

Solution 2: reversed

reversed(seq) returns the sequence in reverse order
- 2. Built-in Functions — reversed
str.join(iterable) concatenates the elements
- 4. Built-in Types — str.join

answer

odai_str = 'stressed'

# reversed() yields the characters from the end; join() concatenates them
ansr_str = ''.join(reversed(odai_str))
print(ansr_str)

Solution 3: Slice notation

Get the string with slicing
- 6. Expressions — Slicings

answer

odai_str = 'stressed'
ansr_str = ''

ansr_str = odai_str[::-1]
print(ansr_str)
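
As a side note on the slice used above: a slice has the form [start:stop:step], and a negative step walks the string backwards. The following is only a minimal sketch to make that visible; the variable name text is chosen here for illustration.

```python
text = 'stressed'

# A slice is text[start:stop:step]; an omitted start or stop defaults to
# whichever end of the string suits the direction of the step.
print(text[::-1])    # the whole string, walking backwards
print(text[::2])     # every second character, starting from the front
print(text[-1::-2])  # every second character, starting from the back

# The slice and the reversed()/join() approach should agree.
assert text[::-1] == ''.join(reversed(text))
```

The same step argument, with a positive value, is what the next problem relies on.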

⚾ 01. 「パタトクカシーー」

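Obtain the string formed by taking the 1st, 3rd, 5th and 7th characters of the string 「パタトクカシーー」 and concatenating them.

Expected answer

パトカー

Solution 1: Slicing

Get the characters with slicing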

answer

odai_str = ('パタトクカシーー')
ansr_str = ''

ansr_str = odai_str[::2]
print(ansr_str)
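
Because indices start at 0, the 1st, 3rd, 5th and 7th characters sit at indices 0, 2, 4 and 6. A small sketch, with the start index written out explicitly, shows that shifting the start by one picks up the other word (odd_chars and even_chars are names made up for this example):

```python
odai_str = 'パタトクカシーー'

# Start at index 0 and take every second character: 1st, 3rd, 5th, 7th.
odd_chars = odai_str[0::2]
# Start at index 1 instead to get the 2nd, 4th, 6th and 8th characters.
even_chars = odai_str[1::2]

print(odd_chars)   # パトカー
print(even_chars)  # タクシー
```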

⚾ 02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

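Obtain the string 「パタトクカシーー」 by concatenating the characters of 「パトカー」 and 「タクシー」 alternately, starting from the beginning.

Expected answer

パタトクカシーー

Solution 1: Loop

Get the number of elements with len(s) and loop over the indices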

answer

odai_str = ('パトカー','タクシー')
ansr_str = ''

for index in range(len(odai_str[0])):
    # strings can be indexed directly, so no list() conversion is needed
    ansr_str += odai_str[0][index]
    ansr_str += odai_str[1][index]

print(ansr_str)

Solution 2: zip

zip(*iterables) loops over several sequences at the same time
- 2. Built-in Functions — zip

answer

odai_str = ('パトカー','タクシー')
ansr_str = ''

for item1, item2 in zip(odai_str[0],odai_str[1]):
    ansr_str += item1 + item2

print(ansr_str)
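
For later reference, the zip loop can probably be folded into a single join over a generator expression. This is just one possible shorter form, not necessarily a better one:

```python
odai_str = ('パトカー', 'タクシー')

# zip pairs up the characters position by position; each pair is glued
# together, and join() then concatenates all the pairs.
ansr_str = ''.join(a + b for a, b in zip(odai_str[0], odai_str[1]))
print(ansr_str)  # パタトクカシーー
```

Note that zip stops at the shorter input, so with words of different lengths the tail of the longer one would be dropped.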

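⚾ 03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words and create a list of the number of (alphabetic) characters in each word, in order of appearance from the beginning.

Expected answer

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

Solution 1: re.split

re.split(pattern, string, maxsplit=0, flags=0) returns the list of split pieces
- 6.2. re — Regular expression operations — split
The symbols . and , have to be removed with re.sub; otherwise they are counted as characters (and splitting on \W+ instead leaves an empty string, i.e. a 0, at the end of the list). Since the problem is about pi, leaving them out seems to be correct
- 6.2. re — Regular expression operations — sub
Regular expressions: \w matches word characters ([a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches other Unicode letters and digits), \W matches non-word characters ([^\w])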

answer

import re
odai_str = ('Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.')
ansr_lst = []

for value in re.split(' ', odai_str):
    ansr_lst.append(len(re.sub(r'\W','',value)))

print(ansr_lst)
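
To see why the punctuation has to go, here is a small comparison sketch. It uses list comprehensions for brevity, and the names raw_counts, split_counts and clean_counts are made up for this example.

```python
import re

sentence = ('Now I need a drink, alcoholic of course, after the heavy '
            'lectures involving quantum mechanics.')

# Counting the raw tokens also counts the attached , and . characters.
raw_counts = [len(word) for word in sentence.split(' ')]

# Splitting on runs of non-word characters leaves an empty string after
# the final '.', which shows up as a trailing 0.
split_counts = [len(word) for word in re.split(r'\W+', sentence)]

# Splitting on spaces and stripping non-word characters gives the digits of pi.
clean_counts = [len(re.sub(r'\W', '', word)) for word in sentence.split(' ')]

print(raw_counts)
print(split_counts)
print(clean_counts)
```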

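⚾ 04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words, take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from the extracted strings to the word positions (the position of the word counted from the beginning).

Expected answer

{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}
Atomic number 12, Mg, comes out as Mi; this looks like a flaw in the problem sentence

Solution 1: enumerate

enumerate(iterable, start=0) lets the loop receive both the index and the value
- 2. Built-in Functions — enumerate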

answer

import re
odai_str = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.')
ansr_dct = {}

tmp_lst = []
for value in re.split(' ', odai_str):
    tmp_lst.append(re.sub(r'\W','',value))

for index, value in enumerate(tmp_lst):
    if (index + 1) in (1, 5, 6, 7, 8, 9, 15, 16, 19):
        ansr_dct[value[0:1]] = index + 1  # first character only
    else:
        ansr_dct[value[0:2]] = index + 1  # first two characters

print(ansr_dct)
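
Since enumerate accepts a start argument, the index + 1 arithmetic can probably be avoided. The sketch below is one possible variant; one_char_positions and length are names invented for this example, and the punctuation is not stripped because only the first one or two characters of each word are used.

```python
odai_str = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations '
            'Might Also Sign Peace Security Clause. Arthur King Can.')
one_char_positions = (1, 5, 6, 7, 8, 9, 15, 16, 19)

ansr_dct = {}
# start=1 makes enumerate count the words from 1, as the problem does.
for position, word in enumerate(odai_str.split(' '), start=1):
    if position in one_char_positions:
        length = 1
    else:
        length = 2
    ansr_dct[word[:length]] = position

print(ansr_dct)
```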

⚾ 05. n-gram

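Write a function that creates n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

Expected answer

What is an n-gram? / A chunk of N adjacent items
- 自然言語処理はじめました - Ngramを数え上げまくる (a blog post on counting n-grams)
  - Character 2-grams (bigrams) of 「金持ち喧嘩せず」 → {金持, 持ち, ち喧, 喧嘩, 嘩せ, せず}
  - Word 3-grams (trigrams) of "This is apple computer" → {This-is-apple, is-apple-computer}

result

Word bi-grams: a list like I-am, am-an, an-NLPer
Character bi-grams: a list like 'I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er'

Solution 1: str.split

str.split(sep=None, maxsplit=-1) splits the string into a list (use re.split when a regular expression is needed)
- 4. Built-in Types — str.split
A user-defined function is written with the def statement
- 8. Compound statements — def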

answer

odai_str = ('I am an NLPer')

def ngram(n,words):
    ansr_lst = []
    for index in range(0, len(words) - n + 1):
        ansr_lst.append(words[index:index + n])
    return ansr_lst

print(ngram(2,odai_str.split(' '))) # passing a list gives word n-grams
print(ngram(2,odai_str)) # passing a string gives character n-grams
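
As an alternative I came across later, n-grams can apparently also be built by zipping shifted copies of the sequence. The function name ngram_zip is made up for this sketch, and unlike the slicing version it returns tuples rather than sub-lists or substrings.

```python
def ngram_zip(n, seq):
    # Shift the sequence by 0..n-1 positions and zip the shifted copies.
    # zip stops at the shortest copy, so for n <= len(seq) exactly
    # len(seq) - n + 1 tuples come out.
    return list(zip(*(seq[i:] for i in range(n))))

print(ngram_zip(2, 'I am an NLPer'.split(' ')))
# [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
print(ngram_zip(2, 'I am an NLPer'))
```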

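⚾ 06. Sets

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y respectively, and find the union, intersection and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.

Expected answer

result

X = ['pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ap', 'pa', 'ar', 'ra', 'ad', 'di', 'is', 'se']
Y = ['pa', 'ar', 'ra', 'ag', 'gr', 'ra', 'ap', 'ph']

Union … all elements contained in X or Y:
 'pa', 'ar', 'ra', 'ap', 'ad', 'di', 'is', 'se',
 'ag', 'gr', 'ph'
Intersection … elements that appear in both X and Y:
 'ar', 'ra', 'pa', 'ap'
Difference … elements that exist in X but not in Y:
 'ad', 'di', 'is', 'se'
Whether se is contained:
 contained in X, not contained in Y

Solution 1: Naive loops

Define the functions with the def statement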

answer

odai_str = ('paraparaparadise','paragraph')

def ngram(n,words):
    ansr_lst = []
    for index in range(0, len(words) - n + 1):
        ansr_lst.append(words[index:index + n])
    return ansr_lst

def union_set(x,y):
    tmp_lst = x + y
    tmp_lst2 = []
    for val in tmp_lst:
        if tmp_lst2.count(val) == 0:
            tmp_lst2.append(val)
    return tmp_lst2

def prod_set(x,y):
    tmp_lst = []
    for val_x in x:
        for val_y in y:
            if (val_x == val_y) and (val_y not in tmp_lst):
                tmp_lst.append(val_y)
    return tmp_lst

def diff_set(x,y):
    tmp_lst = []
    for val_x in x:
        if (val_x not in y) and (val_x not in tmp_lst):
            tmp_lst.append(val_x)
    return tmp_lst

def check_se(x):
    return 'se' in (x)

lst_x = ngram(2,odai_str[0])
lst_y = ngram(2,odai_str[1])

print(union_set(lst_x,lst_y))
# ['pa', 'ar', 'ra', 'ap', 'ad', 'di', 'is', 'se', 'ag', 'gr', 'ph']
print(prod_set(lst_x,lst_y))
# ['pa', 'ar', 'ra', 'ap']
print(diff_set(lst_x,lst_y))
# ['ad', 'di', 'is', 'se']
print(check_se(lst_x))
# True
print(check_se(lst_y))
# False

Solution 2: set type

Use the methods of the set type
- 4. Built-in Types — set

answer

odai_str = ('paraparaparadise','paragraph')

def ngram(n,words):
    ansr_lst = []
    for index in range(0, len(words) - n + 1):
        ansr_lst.append(words[index:index + n])
    return ansr_lst

lst_x = ngram(2,odai_str[0])
lst_y = ngram(2,odai_str[1])
set_x = set(lst_x)
set_y = set(lst_y)

print(set_x.union(set_y))
print(set_x.intersection(set_y))
print(set_x.difference(set_y))

Solution 3: set type + operators

Use the operators of the set type

answer

odai_str = ('paraparaparadise','paragraph')

def ngram(n,words):
    ansr_lst = []
    for index in range(0, len(words) - n + 1):
        ansr_lst.append(words[index:index + n])
    return ansr_lst

lst_x = ngram(2,odai_str[0])
lst_y = ngram(2,odai_str[1])
set_x = set(lst_x)
set_y = set(lst_y)

print(set_x | set_y)
print(set_x & set_y)
print(set_x - set_y)
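
The two set-based solutions do not yet answer the last part of the problem, whether 'se' is contained in X and Y. A minimal sketch with the in operator follows; ngram is rewritten here as a one-line comprehension only to keep the example short.

```python
def ngram(n, words):
    return [words[i:i + n] for i in range(len(words) - n + 1)]

set_x = set(ngram(2, 'paraparaparadise'))
set_y = set(ngram(2, 'paragraph'))

# Membership tests work directly on sets.
print('se' in set_x)  # True
print('se' in set_y)  # False

# Sets are unordered, so sorted() is used to get a stable display order.
print(sorted(set_x & set_y))  # ['ap', 'ar', 'pa', 'ra']
```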

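⚾ 07. Sentence generation from a template

Implement a function that takes arguments x, y and z and returns the string 「x時のyはz」 ("y at x o'clock is z"). Then set x=12, y="気温", z=22.4 and check the result.

Expected result: 12時の気温は22.4

Solution 1: String conversion

class str(object='') converts a value to the string type
- 4. Built-in Types — str

answer

odai_str = (12,'気温',22.4)

def xyz(x,y,z):
    # the template is 「x時のyはz」, so only '時の' goes between x and y
    return str(x) + '時の' + str(y) + 'は' + str(z)

print(xyz(odai_str[0],odai_str[1],odai_str[2]))

Solution 2: format

str.format(*args, **kwargs) makes the template easier to read
- 4. Built-in Types — str.format
- 6.1. string — format strings

answer

odai_str = (12,'気温',22.4)

def xyz(x,y,z):
    return '{}時の{}は{}'.format(x,y,z)

print(xyz(odai_str[0],odai_str[1],odai_str[2]))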
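
Since this runs on Python 3.6, a formatted string literal (f-string) should also work. A minimal sketch following the problem's template 「x時のyはz」:

```python
def xyz(x, y, z):
    # f-strings (Python 3.6+) embed the expressions directly in the literal.
    return f'{x}時の{y}は{z}'

print(xyz(12, '気温', 22.4))  # 12時の気温は22.4
```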

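⚾ 08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.
・If it is a lowercase English letter, replace it with the character whose code is (219 - character code)
・Output other characters unchanged
・Use this function to encrypt and decrypt an English message.

Expected answer

result

 Plaintext: 私は Taro Yamada です (a-97 r-114 o-111 / m-109 a-97 d-100 a-97)
 Ciphertext: 私は Tzil Yznzwz です (122-z 105-i 108-l / 122-z 110-n 122-z 119-w 122-z)

Solution 1: Code points

ord(c) gets the Unicode code point of a character
- 2. Built-in Functions — ord
chr(i) gets the character (a one-character string) for a code point
- 2. Built-in Functions — chr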

answer

def cipher(texts):
    ret = ''
    for text in texts:
        # only for a-z (code points 97 to 122)
        if ord(text) in range(97,123):
            text = chr(219 - ord(text))
        ret += text
    return ret

print(cipher('私は Taro Yamada です'))
# 私は Tzil Yznzwz です
print(cipher(cipher('私は Taro Yamada です')))
# 私は Taro Yamada です

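Solution 2: Lowercase check

str.islower() checks whether a character is lowercase
- 4. Built-in Types — str.islower()

answer

def cipher(texts):
    ret = ''
    for text in texts:
        # islower() is also True for non-ASCII lowercase letters such as 'é',
        # where 219 - ord(text) is negative, so restrict the check to ASCII
        if text.islower() and ord(text) < 128:
            text = chr(219 - ord(text))
        ret += text
    return ret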

print(cipher('私は Taro Yamada です'))
print(cipher(cipher('私は Taro Yamada です')))
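
Because 219 - ord(c) simply mirrors the lowercase alphabet (a↔z, b↔y, ...), the same conversion can probably be expressed as a translation table. The following is a sketch of that idea using str.maketrans and str.translate; the names CIPHER_TABLE and cipher_translate are made up for this example.

```python
import string

# Map each ASCII lowercase letter to its mirror image: a<->z, b<->y, ...
# (219 - ord(c) is the same as reversing the lowercase alphabet.)
CIPHER_TABLE = str.maketrans(string.ascii_lowercase,
                             string.ascii_lowercase[::-1])

def cipher_translate(text):
    # Characters without an entry in the table are left unchanged.
    return text.translate(CIPHER_TABLE)

encrypted = cipher_translate('私は Taro Yamada です')
print(encrypted)                    # 私は Tzil Yznzwz です
print(cipher_translate(encrypted))  # 私は Taro Yamada です
```

Applying the function twice returns the original text, which is why the same function serves for both encryption and decryption.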

⚾ 09. Typoglycemia

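Write a program that, for a sequence of words separated by spaces, keeps the first and last characters of each word and randomly rearranges the order of the other characters. However, words of length 4 or less are not rearranged. Give it a suitable English sentence (for example "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

Expected answer

I c'ldtuon bieelve that I cluod aalctluy utsannderd what I was reanidg : the pnoehenmal power of the hamun mind .

Solution 1: Shuffling characters

random.shuffle(x[, random]) shuffles a list of characters in place (the optional random argument was removed in Python 3.11)
- 9.6. random — shuffle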

answer

import random

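def mixer(text):
    mixed_words = []
    for words in text.split(' '):
        if len(words) > 4:
            shufflelist = list(words[1:-1])
            random.shuffle(shufflelist)      # shuffle the middle characters in place
            shufflelist.insert(0, words[0])  # put the first character back at the front
            shufflelist.append(words[-1])    # append the last character
            words = ''.join(shufflelist)
        mixed_words.append(words)
    # join with single spaces so that no trailing space is left behind
    return ' '.join(mixed_words)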

print(mixer("I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."))
# I cnul'dot bvileee that I cluod aaltlcuy usdnnrtead what I was radieng : the pmhoeaennl power of the human mind .
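
Because the output changes on every run, it is hard to tell by eye whether the result is right. The sketch below is one way to check the properties the problem asks for; shuffle_middle and typoglycemia are names made up for this example, and the fixed seed is only there to make repeated runs print the same thing.

```python
import random

def shuffle_middle(word):
    # Words of length 4 or less are returned unchanged.
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + ''.join(middle) + word[-1]

def typoglycemia(text):
    return ' '.join(shuffle_middle(word) for word in text.split(' '))

random.seed(0)  # fixed seed only so that repeated runs give the same output
sentence = "I couldn't believe that I could actually understand what I was reading"
result = typoglycemia(sentence)
print(result)

# Each word keeps its first letter, its last letter and its set of letters.
for before, after in zip(sentence.split(' '), result.split(' ')):
    assert before[0] == after[0] and before[-1] == after[-1]
    assert sorted(before) == sorted(after)
```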
