言語処理100本ノック (100 NLP Drills), Chapter 1: Warm-up ... which is already hard

Last updated at 2018-10-10
Posted at 2018-10-09

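Background

I work on these drills every day as a habit, to build up my fundamentals as an engineer.
I doubt my answers are perfect, so if you know a more concise solution, please do tell me m(_ _)m
I use Ruby at work, so Ruby is the language I chose here.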

Environment

ruby 2.5.0

TL;DR

The answers I came up with are published in the repository below.
https://github.com/FumiyaShibusawa/language-processing-100-drills

If you want to check the problems themselves, see the original site.
http://www.cl.ecei.tohoku.ac.jp/nlp100/

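Answers in detail

00. Reversing a string

# 00. Reversing a string
# Obtain the string in which the characters of "stressed" are arranged in reverse (from the end to the beginning).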

puts "stressed".reverse

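Right from the first problem my reaction was "Huh? How do you do that?", but Ruby delivers: there is a method that does exactly this. I wonder what it would look like to do the same thing without reverse...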
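As one possible answer to that question, here is a minimal sketch that avoids String#reverse by prepending each character to an accumulator. The method name reverse_without_builtin is made up for this illustration and is not part of the repository.

```ruby
# Hypothetical helper (not from the original repository): reverses a string
# without String#reverse by prepending each character to the accumulator.
def reverse_without_builtin(str)
  str.each_char.reduce('') { |acc, char| char + acc }
end

puts reverse_without_builtin('stressed') # prints "desserts"
```

Each step builds a new string, so this is for illustration only. In real code the built-in reverse is the obvious choice.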

01. 「パタトクカシーー」

# 01. 「パタトクカシーー」
# Take the 1st, 3rd, 5th, and 7th characters of the string 「パタトクカシーー」 and concatenate them.

output = ''
'パタトクカシーー'.chars.each_with_index do |letter, i|
  output << letter if i.even?
end

puts output

I kept thinking this could be a bit simpler, but I went with picking the characters at even indices and appending them one by one to a new string.
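One way this might be shortened is to cut the characters into pairs and keep the first of each pair. This is only a sketch. The method name odd_position_chars is an assumption for this example.

```ruby
# Hypothetical helper: keeps the 1st, 3rd, 5th, ... characters of a string
# by slicing the characters into pairs and taking the first of each pair.
def odd_position_chars(str)
  str.chars.each_slice(2).map(&:first).join
end

puts odd_position_chars('パタトクカシーー') # prints "パトカー"
```

This drops the index bookkeeping and the mutable output variable. It also works when the string has an odd number of characters, because the last slice then holds a single element.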

02. 「パトカー」＋「タクシー」＝「パタトクカシーー」

# 02. 「パトカー」＋「タクシー」＝「パタトクカシーー」
# Interleave the characters of 「パトカー」 and 「タクシー」 from the beginning to obtain the string 「パタトクカシーー」.

patrol_car = "パトカー"
taxi = "タクシー"

output = patrol_car.chars
taxi.length.times do |i|
  output.insert(i * 2 + 1, taxi[i])
end
puts output.join("")

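# Other approaches
# $stdout.write ['パトカー'.chars, 'タクシー'.chars].transpose.join
# 'パトカー'.each_char.zip('タクシー'.each_char).join

This problem was really hard for me. In the end I inserted the characters of タクシー, one at a time, at the odd indices of パトカー. The two answers under # Other approaches came from colleagues at my company. I cannot even begin to imagine what kind of thinking leads to answers like these.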

Array#transpose

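transpose builds the transposed matrix. It treats an array of arrays as a matrix and collects the elements at the same index of each inner array into new arrays. Note that it raises an error when the receiver is not an array of arrays (for example, a flat array with Integer elements) or when the inner arrays differ in length.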

Array#transpose

[[0, 1, 2], [3, 4, 5]].transpose 
# => [[0, 3], [1, 4], [2, 5]]
[0, 1, 2, [3, 4, 5]].transpose
# => TypeError: no implicit conversion of Integer into Array
[[0, 1, 2], [3, 4]].transpose
# => IndexError: element size differs (2 should be 3)

Enumerable#zip

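zip returns an array that pairs each element of the receiver with the corresponding element of the argument. There is also Array#zip, and the two implementations seem to differ slightly, but they are written in C so I could not read them... In any case, both raise TypeError when given an argument that does not respond to each.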

Enumerable#zip

(0..2).zip((3..5))
# => [[0, 3], [1, 4], [2, 5]]
(0..2).zip("3")
# => TypeError: wrong argument type String (must respond to :each)
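To make the difference between the two methods concrete, here is a small sketch of how zip and transpose behave when the lengths do not match. The helper name interleave is an assumption for this example and does not come from the repository.

```ruby
# Hypothetical helper: interleaves two strings character by character.
# Array#join flattens the nested pairs and renders nil as an empty string.
def interleave(first, second)
  first.chars.zip(second.chars).join
end

puts interleave('パトカー', 'タクシー') # prints "パタトクカシーー"

# zip follows the receiver's length: missing partners become nil,
# and surplus elements of the argument are dropped.
p %w[x y z].zip(%w[a b]) # => [["x", "a"], ["y", "b"], ["z", nil]]
p %w[a b].zip(%w[x y z]) # => [["a", "x"], ["b", "y"]]

# transpose rejects rows of different lengths instead of padding them.
begin
  [%w[a b], %w[x y z]].transpose
rescue IndexError => e
  puts "IndexError: #{e.message}"
end
```

So zip is the more forgiving choice when the two inputs may differ in length, while transpose fails loudly.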

03. Pi

# 03. Pi
# Split the sentence
# "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
# into words and build a list of the number of (alphabetic) letters in each word, in order of appearance.

sentence = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
p sentence.split(/\s|\.|,/).reject(&:empty?).collect(&:length)

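The key here is the "number of (alphabetic) letters" part. If , and . were to be counted as well, .split(" ") would have been enough. By the way, collect is an alias of map, so either one works. I wonder why I went with collect...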
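An alternative that might read a little more directly is to extract the runs of letters with scan, instead of splitting on separators and discarding the empty strings. This is a sketch. The constant and method names are assumptions for this example.

```ruby
PI_SENTENCE = 'Now I need a drink, alcoholic of course, after the heavy ' \
              'lectures involving quantum mechanics.'

# Hypothetical helper: collects every run of ASCII letters as a word
# and returns the length of each word in order of appearance.
def word_lengths(sentence)
  sentence.scan(/[A-Za-z]+/).map(&:length)
end

p word_lengths(PI_SENTENCE)
# => [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```

The result spells out the leading digits of pi, which is where the problem gets its title. Note that this pattern would split a word like "couldn't" in two, so it only suits sentences without apostrophes, such as this one.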

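04. Element symbols

# 04. Element symbols
# Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also
# Sign Peace Security Clause. Arthur King Can." into words. Take the first letter of the 1st, 5th, 6th, 7th,
# 8th, 9th, 15th, 16th, and 19th words and the first two letters of all other words, then build
# an associative array (dictionary or map type) from each extracted string to the word's position (its ordinal in the sentence).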

sentence = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations \
Might Also Sign Peace Security Clause. Arthur King Can."

hash_list = {}
sentence.split(/\s|\.|,/).reject(&:empty?).each.with_index(1) do |word, i|
  if [1, 5, 6, 7, 8, 9, 15, 16, 19].include?(i)
    word = word.slice(0, 1)
  else
    word = word.slice(0, 2)
  end
  hash_list[word] = i
end

puts hash_list

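This is what people call crappy code. Specifically, the conditional branch makes it hard to read. It feels like I wrote the requirement out as a sentence and converted it straight into code. I wonder if it could be simpler...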
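One possible simplification, offered as a sketch rather than the definitive answer, is to fold the branch into a single "how many letters to take" expression and build the hash from pairs. The constant and method names below are assumptions for this example.

```ruby
ELEMENT_SENTENCE = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. ' \
                   'New Nations Might Also Sign Peace Security Clause. ' \
                   'Arthur King Can.'

# Positions (1-based) of the words that contribute only their first letter.
SINGLE_LETTER_POSITIONS = [1, 5, 6, 7, 8, 9, 15, 16, 19].freeze

# Hypothetical helper: maps each extracted symbol to the word's position.
def element_symbols(sentence)
  sentence.scan(/[A-Za-z]+/).map.with_index(1) do |word, position|
    length = SINGLE_LETTER_POSITIONS.include?(position) ? 1 : 2
    [word[0, length], position]
  end.to_h
end

p element_symbols(ELEMENT_SENTENCE)
```

Calling to_h on an array of pairs works on Ruby 2.5, so the explicit hash variable and the reassignment of word are no longer needed.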

05. n-gram

# 05. n-gram
# Write a function that builds n-grams from a given sequence (a string, a list, etc.). Use this function to
# obtain the word bi-grams and the character bi-grams of the sentence 'I am an NLPer'.

#
# Returns n-gram word by word
#
# @param [<String>] sentence <to be n-grammed>
# @param [<Integer>] n_gram <n as in n-gram to be assigned>
#
# @return [<Array>] <pair of n-gram word by word>
#
def create_word_n_gram(sentence, n_gram)
  words = sentence.split
  words.each_cons(n_gram).with_object([]) do |word, array|
    array << word
  end
end

#
# Returns n-gram letter by letter
#
# @param [<String>] sentence <to be n-grammed>
# @param [<Integer>] n_gram <n as in n-gram to be assigned>
#
# @return [<Array>] <pair of n-gram letter by letter>
#
def create_letter_n_gram(sentence, n_gram)
  letters = sentence.delete(' ')
  letters.chars.each_cons(n_gram).with_object([]) do |letter, array|
    array << letter
  end
end

# Word bi-grams
word_bi_gram = create_word_n_gram('I am an NLPer', 2)

# Character bi-grams
letter_bi_gram = create_letter_n_gram('I am an NLPer', 2)

puts "word_bi_gram: #{word_bi_gram}"
puts "letter_bi_gram: #{letter_bi_gram}"

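I first had to learn what an n-gram even is, and the following article from Gihyo (技術評論社) explained it clearly.
検索エンジンを作る第5回 N-gramのしくみ (Building a Search Engine, Part 5: How N-grams Work)

Roughly speaking, it is a way of splitting the text to be searched into overlapping chunks of n consecutive characters.

n-gram (bi-gram case)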

"今日は良い天気です。"
=> "今日", "日は", "は良", "良い", "い天", "天気", "気で", "です", "す。"
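The bi-gram example above can be reproduced with Enumerable#each_cons. The following is a minimal sketch of a single helper that takes either a String or an Array. The name n_gram is an assumption for this example and is not one of the two methods in the repository.

```ruby
# Hypothetical generic helper: accepts a String (character n-grams) or an
# Array (for example, of words) and returns every run of n consecutive items.
def n_gram(sequence, n)
  items = sequence.is_a?(String) ? sequence.chars : sequence
  items.each_cons(n).to_a
end

# Character bi-grams of the example sentence, joined back into strings.
p n_gram('今日は良い天気です。', 2).map(&:join)
# => ["今日", "日は", "は良", "良い", "い天", "天気", "気で", "です", "す。"]

# Word bi-grams.
p n_gram('I am an NLPer'.split, 2)
# => [["I", "am"], ["am", "an"], ["an", "NLPer"]]
```

Since each_cons(n).to_a already returns the array of windows, no accumulator is needed. Unlike create_letter_n_gram, this sketch keeps spaces when given a String. Whether to strip them first is a design choice.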

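06. Sets

# 06. Sets
# Let X and Y be the sets of character bi-grams contained in 'paraparaparadise' and 'paragraph', respectively, and find the union, intersection, and difference of X and Y.
# Also check whether the bi-gram 'se' is contained in X and in Y.

# 'paraparaparadise' bi-gram
x = []
'paraparaparadise'.chars.each_cons(2) { |w| x << w.join('') }
puts "X: #{x}"

# 'paragraph' bi-gram
y = []
'paragraph'.chars.each_cons(2) { |w| y << w.join('') }
puts "Y: #{y}"

# Union
puts "Union: #{x | y}"

# Intersection
puts "Intersection: #{x & y}"

# Difference
puts "Difference: #{x - y}"

# Is 'se' contained in X and in Y?
puts "X includes 'se': #{x.include?('se')}"
puts "Y includes 'se': #{y.include?('se')}"

This one felt fairly easy... but these do match the definition of each set operation, right...?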
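On the question of whether this matches the set definitions: with plain arrays, X and Y themselves still contain duplicates (for example, "pa" appears three times in X), and Array#- does not remove duplicates from its result the way | and & do. A sketch using the standard library's Set, which may be closer to the mathematical definition, could look like this. The method name bigram_set is an assumption for this example.

```ruby
require 'set'

# Hypothetical helper: returns the character bi-grams of a word as a Set,
# so duplicates are removed up front.
def bigram_set(word)
  word.chars.each_cons(2).map(&:join).to_set
end

x = bigram_set('paraparaparadise')
y = bigram_set('paragraph')

puts "X: #{x.to_a}"                    # pa, ar, ra, ap, ad, di, is, se
puts "Y: #{y.to_a}"                    # pa, ar, ra, ag, gr, ap, ph
puts "Union: #{(x | y).to_a}"
puts "Intersection: #{(x & y).to_a}"   # pa, ar, ra, ap
puts "Difference: #{(x - y).to_a}"     # ad, di, is, se
```

For these two words the array version happens to produce the same union, intersection, and difference, so the original answer is consistent with the definitions. The Set version simply makes that guarantee explicit.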

07. Sentence generation from a template

# 07. Sentence generation from a template
# Implement a function that takes arguments x, y, and z and returns the string 「x時のyはz」.
# Then check the result with x=12, y="気温", z=22.4.

def format_values_with_template(x, y, z)
  "#{x}時の#{y}は#{z}"
end

puts format_values_with_template(12, '気温', 22.4)

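Nothing much to note here, I think. The method name did give me pause, though. Formatting raw values so that people can read them easily is often called pretty_print or format. I wondered which to go with, but the meanings do not seem to differ much, so I picked the shorter one, format.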
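As a side note on the name, Ruby already ships a built-in Kernel#format, which could also be used for the template itself. The following is only a sketch of that alternative. The method name format_with_kernel_format is made up for this example.

```ruby
# Hypothetical variant: builds the sentence with the built-in Kernel#format
# and named references instead of string interpolation.
def format_with_kernel_format(x, y, z)
  format('%<x>s時の%<y>sは%<z>s', x: x, y: y, z: z)
end

puts format_with_kernel_format(12, '気温', 22.4) # prints "12時の気温は22.4"
```

Interpolation is arguably simpler for a fixed template like this one. Kernel#format becomes more attractive when the template string comes from outside the method.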

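08. Cipher text

# 08. Cipher text
# Implement a function cipher that converts each character of a given string according to the following specification.
#
# Lowercase English letters are replaced with the character whose code is (219 - character code)
# All other characters are output as is
# Use this function to encrypt and decrypt an English message.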

def cipher(message)
  message.gsub(/[[:alpha:]&&[:lower:]]+/) do |w|
    w.codepoints.map { |n| (219 - n).chr(Encoding::UTF_8) }.join
  end
end

sentence = 'Hi, My name is John. I\'m a software engineer for 10 years.'
ciphered = cipher(sentence)
deciphered = cipher(ciphered)

puts "ciphered: #{ciphered}"
puts "deciphered: #{deciphered}"

https://docs.ruby-lang.org/ja/2.5.0/method/String/i/codepoints.html
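This is where I first learned that every Unicode character is assigned a number. Apparently you can get that number with String#codepoints. And apparently, applying "219 minus the value" twice gives you back the codepoint of the original character. Why is that...?
(TODO: I was told that String#each_codepoint is the better choice when passing a block. I will send a pull request to fix this later.)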
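On the "why": 219 is the sum of the code points of 'a' (97) and 'z' (122), so 219 - n maps a to z, b to y, and so on, and 219 - (219 - n) = n brings every letter back. The sketch below illustrates this with String#tr. The names REVERSED_ALPHABET and cipher_with_tr are assumptions for this example, and it assumes only ASCII a-z needs to be converted.

```ruby
# 219 is exactly the sum of the code points of 'a' and 'z'.
puts 'a'.ord + 'z'.ord # prints 219

# "zyx...cba": the alphabet in reverse order.
REVERSED_ALPHABET = ('a'..'z').to_a.reverse.join.freeze

# Hypothetical variant of cipher: swaps a<->z, b<->y, ... and leaves every
# other character untouched. Applying it twice restores the input.
def cipher_with_tr(message)
  message.tr('a-z', REVERSED_ALPHABET)
end

secret = cipher_with_tr('abc xyz ABC 123')
puts secret                 # prints "zyx cba ABC 123"
puts cipher_with_tr(secret) # prints "abc xyz ABC 123"
```

Because the mapping is its own inverse, the same function serves for both encryption and decryption, which is why cipher is simply called twice in the answer above.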

09. Typoglycemia

# 09. Typoglycemia
# For a sequence of words separated by spaces, write a program that keeps the first and last characters of each word and randomly shuffles the order of the other characters.
# Words of length 4 or less are left as they are. Give it a suitable English sentence (for example "I couldn't believe that I could actually
# understand what I was reading : the phenomenal power of the human mind .") and check the result.

# Shuffles all the letters except the first/last one of
# words that have more than 4 letters in a sentence.
#
# @param [<String>] a sentence which has a series of words.
# @return [<String>] a new sentence with the words shuffled.
#
def shuffle_letters_in(sentence)
  sentence.split.map do |word|
    word[1..-2] = word[1..-2].chars.shuffle.join if word.length > 4
    word
  end.join(' ')
end

sentence = "I couldn't believe that I could actually understand what I was \
reading : the phenomenal power of the human mind ."
puts shuffle_letters_in(sentence)

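To begin with, what on earth is Typoglycemia? It turns out ニコニコ大百科 (Niconico Pedia) has an entry for it. Impressive as always.

Typoglycemia（タイポグリセミア）
Typoglycemia is the phenomenon where a word remains readable even when its letters are rearranged, as long as the first and last letters are correct.

That said, can you really read this...? lol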
I cou'dlnt beivele that I cluod auactlly udarsenntd what I was ranedig : the pnehnmoael poewr of the hamun mind .
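Since the output changes on every run, it is hard to check by eye. As a sketch, a variant that accepts a random number generator makes the result reproducible and lets the invariants be checked in code. The method name typoglycemia and the random: keyword are assumptions for this example and are not part of the repository.

```ruby
# Hypothetical variant: same idea as shuffle_letters_in, but it builds new
# strings instead of modifying each word in place and accepts an optional
# random number generator so that results can be reproduced.
def typoglycemia(sentence, random: Random.new)
  sentence.split.map do |word|
    next word if word.length <= 4

    middle = word[1..-2].chars.shuffle(random: random).join
    word[0] + middle + word[-1]
  end.join(' ')
end

original = "I couldn't believe that I could actually understand what I was reading"
shuffled = typoglycemia(original, random: Random.new(42))
puts shuffled

# Each word keeps its first letter, last letter, and overall set of letters.
original.split.zip(shuffled.split).each do |before, after|
  same_edges = before[0] == after[0] && before[-1] == after[-1]
  same_letters = before.chars.sort == after.chars.sort
  puts "#{before} -> #{after}: #{same_edges && same_letters ? 'ok' : 'broken'}"
end
```

Passing a seeded Random is only for reproducibility. Leaving the keyword out gives a fresh shuffle on each call.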

References

言語処理100本ノック 2015 (100 NLP Drills 2015)
