Qiita National Student Competition (全国学生対抗戦) Advent Calendar 2024

NLP 100 Exercises (言語処理100本ノック) [NLP 00-09]

Posted at 2024-11-11

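Introduction

I'm a graduate student doing research in natural language processing, but I had never worked through the NLP 100 Exercises (言語処理100本ノック), so I decided to tackle them for the Advent Calendar.
I plan to post articles with brief explanations, but how far I get depends on how busy I am, so please don't expect too much...

This time I cover 00 through 09. (Language: Python)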

00. Reversing a string

Obtain the string formed by arranging the characters of the string "stressed" in reverse order (from the end to the beginning).

input_str = 'stressed'
print(input_str[::-1])

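This is slicing, one of Python's basic operations.
The values are separated by colons: the first is the start index, the second is the stop index, and the third is the step.
Here only the third value is given, so the index moves by -1 at each step, which reverses the string.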
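As a small illustration of the three slice fields described above, here is a minimal sketch. The variable names are only for illustration and do not come from the original solution.

```python
# Slice syntax: sequence[start:stop:step]; omitted fields fall back to defaults.
text = 'stressed'

first_three = text[0:3]     # start=0, stop=3 (exclusive)  -> 'str'
every_other = text[::2]     # step=2 over the whole string -> 'srse'
reversed_text = text[::-1]  # step=-1 walks from the end   -> 'desserts'

print(first_three, every_other, reversed_text)
```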

01. "パタトクカシーー"

Obtain the string formed by concatenating the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー".

input_str = 'パタトクカシーー'
print(input_str[::2])

This can be done the same way as 00. We want the odd-numbered characters (indices 0, 2, 4, 6), so a step of 2 does the job.

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by concatenating the characters of "パトカー" and "タクシー" alternately, starting from the beginning.

input_str1 = 'パトカー'
input_str2 = 'タクシー'
print(*[i + j for i, j in zip(input_str1, input_str2)], sep='')

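This uses a comprehension, another basic Python feature.
A list can be built with the form [item for item in iterable]. (Incidentally, comprehensions are not limited to lists.)
I know there is debate over whether this notation is readable, but for this article I'll simply say that I personally like it.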
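To illustrate the remark that comprehensions are not limited to lists, here is a minimal sketch of the set, dict, and generator forms. It also shows one alternative way to write problem 02 with str.join and a generator expression; this is a variation for comparison, not the solution used above.

```python
str1 = 'パトカー'
str2 = 'タクシー'

# Generator expression passed straight to join (no intermediate list).
interleaved = ''.join(a + b for a, b in zip(str1, str2))
print(interleaved)

words = ['need', 'a', 'drink', 'a']
lengths = [len(w) for w in words]            # list comprehension
unique_words = {w for w in words}            # set comprehension
length_by_word = {w: len(w) for w in words}  # dict comprehension
total_length = sum(len(w) for w in words)    # generator expression

print(lengths, sorted(unique_words), length_by_word, total_length)
```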

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

input_str = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'
char_nums = [
    len(
        word
        .removesuffix(',')
        .removesuffix('.')
    ) for word in input_str.split()
]

print(char_nums)

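As in 02, I wrote this as a comprehension, though I suspect this style is particular to fields close to my research area. There is no need to force yourself to use it.

What is more noteworthy here is removesuffix. It is a method of str, available in Python 3.9 and later, that removes a trailing substring if it is present. It is handy to also know removeprefix, which removes a leading substring instead.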
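The following sketch shows how removesuffix and removeprefix behave, and contrasts them with rstrip, which takes a set of characters rather than a fixed suffix. The rstrip variant of problem 03 is an alternative for comparison, not the solution used above. Python 3.9 or later is assumed.

```python
# removesuffix / removeprefix remove an exact substring only if it is present.
print('mechanics.'.removesuffix('.'))  # 'mechanics'
print('course,'.removesuffix('.'))     # unchanged: 'course,'
print('unhappy'.removeprefix('un'))    # 'happy'

# rstrip removes any trailing characters found in the given set.
print('course,'.rstrip(',.'))          # 'course'

sentence = ('Now I need a drink, alcoholic of course, after the heavy '
            'lectures involving quantum mechanics.')
char_counts = [len(word.rstrip(',.')) for word in sentence.split()]
print(char_counts)  # the leading digits of pi
```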

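04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of every other word, then create an associative array (a dictionary or map type) from each extracted string to the position of the word (its 1-based position from the beginning).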

input_str = '''
    Hi He Lied Because Boron Could Not Oxidize Fluorine. 
    New Nations Might Also Sign Peace Security Clause. 
    Arthur King Can.
'''
ONE_CHAR_WORD_INDEXES = [1, 5, 6, 7, 8, 9, 15, 16, 19]

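words = input_str.split()

print({(word[0] if i in ONE_CHAR_WORD_INDEXES else word[:2]): i for i, word in enumerate(words, start=1)})

The operation above is very simple, but the notation may be a little confusing. (Once you get used to it, it is easy to read, and I personally like it.)
It combines what has appeared so far (comprehensions) with the ternary operator (conditional expression). Because the problem asks for a mapping from the extracted string to the word position, the extracted string is the key and the 1-based position is the value.

{(word[0] if i in ONE_CHAR_WORD_INDEXES else word[:2]): i for i, word in enumerate(words, start=1)}

Expanded into an ordinary loop, this becomes:

answer = {}
for i, word in enumerate(words, start=1):
    if i in ONE_CHAR_WORD_INDEXES:
        answer[word[0]] = i
    else:
        answer[word[:2]] = i

The two forms are equivalent.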

05. n-gram

Write a function that creates n-grams from a given sequence (such as a string or a list). Use this function to obtain the word bi-grams and character bi-grams of the sentence "I am an NLPer".

input_str = 'I am an NLPer'

def n_gram(input_str: str, n: int, mode: str) -> list[str]:
    '''
    This function returns n-gram of input_str.
    
    Args:
        input_str (str): input string to be processed.
        n (int): size of n-gram.
        mode (str): mode of n-gram. 'char' or 'word'.
        
    Returns:
        list[str]: list of n-gram.
        
    Raises:
        ValueError: if mode is not 'char' or 'word'.
        
    Examples:
        >>> n_gram('I am an NLPer', 2, 'char')
        ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']
        >>> n_gram('I am an NLPer', 2, 'word')
        ['I am', 'am an', 'an NLPer']
    '''
    if mode == 'char':
        # The size of list is len(input_str) - n + 1 because the last index of n-gram is len(input_str) - n.
        return [input_str[i: i+n] for i in range(len(input_str) - n + 1)]
    elif mode == 'word':
        # The size of list is len(input_str.split()) - n + 1 because the last index of n-gram is len(input_str.split()) - n.
        return [' '.join(input_str.split()[i:i+n]) for i in range(len(input_str.split()) - n + 1)]
    else:
        raise ValueError('mode must be either "char" or "word".')


print(n_gram(input_str, 2, 'char'))
print(n_gram(input_str, 2, 'word'))

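Once you know what an n-gram is, this problem is straightforward. An n-gram treats a chunk of n consecutive characters (or words) as a single unit.
There is more to say, such as that n-grams add information about how words connect, or that a researcher should not be using a loose word like "chunk", but since this article is meant as a brief introduction, I'll leave it at that...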
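The problem statement mentions sequences in general (strings or lists), so one possible variation is a function that slices whatever sequence it is given, leaving the choice between characters and words to the caller. The name n_gram_seq is hypothetical; this is a sketch for comparison, not the solution used above.

```python
def n_gram_seq(seq, n):
    """Return every run of n consecutive items from a sliceable sequence."""
    # A sequence of length L has L - n + 1 windows of size n
    # (range() is empty when n > L, so the result is an empty list).
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


sentence = 'I am an NLPer'
word_bigrams = n_gram_seq(sentence.split(), 2)  # list in, list slices out
char_bigrams = n_gram_seq('NLPer', 2)           # str in, str slices out

print(word_bigrams)
print(char_bigrams)
```

Passing sentence.split() yields word n-grams as lists of words, while passing the raw string yields character n-grams as substrings.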

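06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and obtain the union, intersection, and difference of X and Y. In addition, check whether the bi-gram "se" is contained in X and in Y.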

input_str1 = 'paraparaparadise'
input_str2 = 'paragraph'

def n_gram(input_str: str, n: int, mode: str) -> list[str]:
    '''
    This function returns n-gram of input_str.
    
    Args:
        input_str (str): input string to be processed.
        n (int): size of n-gram.
        mode (str): mode of n-gram. 'char' or 'word'.
        
    Returns:
        list[str]: list of n-gram.
        
    Raises:
        ValueError: if mode is not 'char' or 'word'.
        
    Examples:
        >>> n_gram('I am an NLPer', 2, 'char')
        ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']
        >>> n_gram('I am an NLPer', 2, 'word')
        ['I am', 'am an', 'an NLPer']
    '''
    if mode == 'char':
        # The size of list is len(input_str) - n + 1 because the last index of n-gram is len(input_str) - n.
        return [input_str[i: i+n] for i in range(len(input_str) - n + 1)]
    elif mode == 'word':
        # The size of list is len(input_str.split()) - n + 1 because the last index of n-gram is len(input_str.split()) - n.
        return [' '.join(input_str.split()[i:i+n]) for i in range(len(input_str.split()) - n + 1)]
    else:
        raise ValueError('mode must be either "char" or "word".')


X = set(n_gram(input_str1, 2, 'char'))
Y = set(n_gram(input_str2, 2, 'char'))

print(f'Union of X and Y: {X | Y}')
print(f'Intersection of X and Y: {X & Y}')
print(f'Difference of X and Y: {X - Y}')
print(f'Is "se" in X? {"se" in X}')
print(f'Is "se" in Y? {"se" in Y}')

The function that produces the n-grams is the same as in 05.

If you have a grasp of Python's set operations, this is an easy problem.
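As a quick reference for the set operations involved, here is a minimal sketch. The helper char_bigrams is a hypothetical shortcut introduced only to keep the example self-contained. Results are sorted before printing because sets have no guaranteed order.

```python
def char_bigrams(text):
    """Hypothetical helper: the set of character bi-grams in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


X = char_bigrams('paraparaparadise')
Y = char_bigrams('paragraph')

print(sorted(X | Y))  # union: in X or Y
print(sorted(X & Y))  # intersection: in both
print(sorted(X - Y))  # difference: in X but not Y
print(sorted(Y - X))  # difference is not symmetric
print(sorted(X ^ Y))  # symmetric difference: in exactly one of them
print('se' in X, 'se' in Y)
```

The operators also have method forms (union, intersection, difference, symmetric_difference), which accept any iterable rather than only sets.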

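07. Sentence generation from a template

Implement a function that takes arguments x, y, and z and returns the string "x時のyはz" ("y at x o'clock is z"). Then check the result with x=12, y="気温", z=22.4.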

input_str_x = 12
input_str_y = '気温'
input_str_z = 22.4

def template(x, y, z):
    return f'{x}時の{y}は{z}'

print(template(input_str_x, input_str_y, input_str_z))

This is a problem where f-strings, introduced in Python 3.6, come in handy. They are also useful for debugging, so they are well worth learning.
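On the debugging point, one feature worth knowing is the `=` specifier, which prints both the expression and its value. Note that it requires Python 3.8 or later, which is newer than f-strings themselves. The sketch below also shows format specifiers; the variable names are only for illustration.

```python
x, y, z = 12, '気温', 22.4

debug_line = f'{x=}, {z=}'  # expression text plus repr of the value
print(debug_line)           # x=12, z=22.4

print(f'{y=}')              # strings are shown with repr, i.e. quoted

fixed = f'{z:.2f}'          # two digits after the decimal point
padded = f'{x:03d}'         # zero-padded to width 3
print(fixed, padded)        # 22.40 012
```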

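08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification.

If the character is a lowercase English letter, replace it with the character whose code is (219 - character code)
Output any other character as is

Use this function to encrypt and decrypt an English message.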

from string import ascii_lowercase

sample_str = 'I am an NLPer'

def cipher(input_str: str) -> str:
    '''
    This function returns a string encrypted by cipher.
    
    Args:
        input_str (str): input string to be encrypted.
        
    Returns:
        str: encrypted string.
        
    Examples:
        >>> cipher('I am an NLPer')
        'I zn zm NLPvi'
    '''
    return ''.join([c if c not in ascii_lowercase else chr(219 - ord(c)) for c in input_str])


print(cipher(sample_str))

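This was my first time using it, but there is a built-in function called ord that returns the Unicode code point of a single character. Its inverse, chr, turns a code point back into a character.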
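The problem also asks for decryption. Since 219 equals ord('a') + ord('z'), the mapping sends 'a' to 'z', 'b' to 'y', and so on, so applying the same function twice restores the original text. The sketch below checks this round trip; it redefines cipher with an explicit range comparison only to stay self-contained, which is a small variation on the solution above.

```python
def cipher(text):
    """Mirror lowercase ASCII letters (a<->z, b<->y, ...); keep the rest."""
    return ''.join(
        chr(219 - ord(c)) if 'a' <= c <= 'z' else c
        for c in text
    )


print(ord('a'), ord('z'), ord('a') + ord('z'))  # 97 122 219

message = 'I am an NLPer'
encrypted = cipher(message)
decrypted = cipher(encrypted)  # the same function also decrypts

print(encrypted)
print(decrypted)
```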

09. Typoglycemia

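Write a program that, for a sequence of space-separated words, keeps the first and last characters of each word and randomly shuffles the order of the remaining characters. However, words of length 4 or less are not shuffled. Give it a suitable English sentence (for example, "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.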

import random 


random.seed(42)

input_text = 'I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .'


def typoglycemia(input_text: str) -> str:
    '''
    This function returns a string with its words shuffled except for the first and last character.
    
    Args:
        input_text (str): input text to be shuffled.
        
    Returns:
        str: shuffled text.
        
    Note:
        - Words with 4 or fewer characters are not shuffled.
    Examples:
        >>> typoglycemia('I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .')
        'I c’ldnuot bveleie that I cloud aalltcuy uadnrsnetd what I was rneaidg : the pnaemnoehl pwoer of the human mind .'
    '''
    words = input_text.split()
    shuffled_words = []
    for word in words:
        if len(word) <= 4:
            shuffled_words.append(word)
        else:
            shuffled_words.append(word[0] + ''.join(random.sample(word[1:-1], len(word) - 2)) + word[-1])
            
    return ' '.join(shuffled_words)


print(typoglycemia(input_text))

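This is the last problem.
What it does is simple, but the case split made it slightly more tedious. It could be written as a comprehension, but for an operation of this size I prefer not to use one.
Comprehensions are easy to follow in that they can be read from front to back (in my personal opinion), but when they get too long they run off the screen and it becomes hard to tell what the earlier part was doing. There are also claims that comprehensions run faster or differ in memory usage, but for a one-off function like this one I don't think it matters.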
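For comparison, one way to keep a comprehension-style version readable is to move the case split into a small helper, so the comprehension itself stays short. This is a sketch, not the solution used above: the names shuffle_middle and typoglycemia_comp are hypothetical, and it uses a local random.Random instance instead of the global seed.

```python
import random


def shuffle_middle(word, rng):
    """Shuffle the inner characters of word; short words are returned as is."""
    if len(word) <= 4:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)  # in-place shuffle of the inner characters
    return word[0] + ''.join(middle) + word[-1]


def typoglycemia_comp(text, seed=None):
    """Apply shuffle_middle to every space-separated word in text."""
    rng = random.Random(seed)  # local generator; global state is untouched
    return ' '.join(shuffle_middle(word, rng) for word in text.split())


sample = 'I could actually understand the phenomenal power of the human mind .'
result = typoglycemia_comp(sample, seed=42)
print(result)
```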

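Closing

How was it?
My impression is that most of the problems in Chapter 1 test basic operations.
I usually let GitHub Copilot write basic processing for me, so when I try to write it myself, it is not that I don't know how, but I do hesitate a little. I felt that I need to keep practicing.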
