An easy way to resolve notational variants in strings with Unicode normalization #Ruby

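Unicode normalization unifies variant notations such as full-width and half-width alphanumerics and platform-dependent characters.
Absorbing these variations makes text easier to search and data easier to compare.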
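
As a minimal sketch of the comparison benefit described above, the snippet below normalizes both sides with the NFKC form before comparing them. It assumes Ruby 2.2 or later, where `String#unicode_normalize` is built in. The helper name `same_text?` is hypothetical and does not come from any library.

```ruby
# Assumes Ruby 2.2+ (String#unicode_normalize is part of core).
# `same_text?` is a hypothetical helper name used only for illustration.
def same_text?(a, b)
  a.unicode_normalize(:nfkc) == b.unicode_normalize(:nfkc)
end

fullwidth = "ＡＢＣ１２３"  # full-width letters and digits
ascii     = "ABC123"

puts fullwidth == ascii            # false: the raw strings differ
puts same_text?(fullwidth, ascii)  # true: both normalize to "ABC123"
```

Normalizing at write time (before storing) as well as at query time is a common design choice, since it keeps stored keys directly comparable.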

There are four normalization forms: NFD, NFC, NFKD, and NFKC. This article focuses on NFKC, which performs conversions like the following, and walks through it with code.

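ヴェ → ヴェ
Note: the left side is ウ followed by U+3099 (combining voiced sound mark). A standalone ゛ (U+309B) is not composed by NFKC; that case is handled separately later in this article.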
ＡＢＣ → ABC
① → 1
㊤ → 上
Ⅲ → III
㌶ → ヘクタール
ﾊﾝｶｸｶﾅ → ハンカクカナ
﹣ → -
Note: the left side is U+FE63 Small Hyphen-Minus
－ → -
Note: the left side is U+FF0D Fullwidth Hyphen-Minus
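
The conversions listed above can be checked with a short script. This is a sketch that assumes Ruby 2.2 or later and uses the built-in `String#unicode_normalize` rather than any of the gems covered below. The constant name `NFKC_EXAMPLES` is made up for this illustration.

```ruby
# Assumes Ruby 2.2+ for String#unicode_normalize.
# Escapes are used where the code point is hard to tell apart by eye.
NFKC_EXAMPLES = {
  "\u30A6\u3099\u30A7" => "\u30F4\u30A7",  # ウ + combining dakuten + ェ => ヴェ
  "ＡＢＣ" => "ABC",
  "①" => "1",
  "㊤" => "上",
  "Ⅲ" => "III",
  "㌶" => "ヘクタール",
  "ﾊﾝｶｸｶﾅ" => "ハンカクカナ",
  "\uFE63" => "-",  # Small Hyphen-Minus
  "\uFF0D" => "-"   # Fullwidth Hyphen-Minus
}

NFKC_EXAMPLES.each do |source, expected|
  actual = source.unicode_normalize(:nfkc)
  status = actual == expected ? "ok" : "mismatch"
  puts "#{source} -> #{actual} (#{status})"
end
```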

Environment

$ ruby -v
ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]

Implementation samples

The following four libraries can perform Unicode normalization and also run on Ruby 2.
Let's go through how to use each one in turn.

unf
unicode
unicode_utils
ActiveSupport::Multibyte::Unicode
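
Beyond these four libraries, Ruby 2.2 and later also ship `String#unicode_normalize` in core, so the environment used in this article can normalize without installing a gem. The sketch below is an addition to the original comparison and was not part of the measurements in this article. It shows how the four forms differ on a half-width ｶﾞ. The helper name `code_points_for` is hypothetical.

```ruby
# Assumes Ruby 2.2+; no gem required.
# `code_points_for` is a hypothetical helper used only for illustration.
def code_points_for(string, form)
  string.unicode_normalize(form).codepoints.map { |cp| format("U+%04X", cp) }
end

halfwidth_ga = "\uFF76\uFF9E"  # half-width KA followed by the half-width voiced sound mark

[:nfd, :nfc, :nfkd, :nfkc].each do |form|
  puts "#{form}: #{code_points_for(halfwidth_ga, form).join(' ')}"
end
# nfd: U+FF76 U+FF9E   (canonical forms leave compatibility characters alone)
# nfc: U+FF76 U+FF9E
# nfkd: U+30AB U+3099  (full-width カ + combining voiced sound mark)
# nfkc: U+30AC         (composed ガ)
```

Only the compatibility forms (NFKD and NFKC) fold half-width kana into their full-width equivalents, which is why NFKC is the form used throughout this article.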

Using unf

The following sample uses unf.

$ gem install unf
Fetching: unf_ext-0.0.6.gem (100%)
Building native extensions.  This could take a while...
Successfully installed unf_ext-0.0.6
Fetching: unf-0.1.4.gem (100%)
Successfully installed unf-0.1.4
2 gems installed

$ pry
[1] pry(main)> require 'unf'
=> true
[2] pry(main)> text = "㈱㌧㌦Ⅲ"
=> "㈱㌧㌦Ⅲ"
[3] pry(main)> UNF::Normalizer.normalize(text, :nfkc)
=> "(株)トンドルIII"

Using unicode

The following sample uses unicode.

$ gem install unicode
Fetching: unicode-0.4.4.2.gem (100%)
Building native extensions.  This could take a while...
Successfully installed unicode-0.4.4.2
1 gem installed

$ pry
[1] pry(main)> require 'unicode'
=> true
[2] pry(main)> text = "㈱㌧㌦Ⅲ"
=> "㈱㌧㌦Ⅲ"
[3] pry(main)> Unicode::nfkc(text)
=> "(株)トンドルIII"

Using unicode_utils

The following sample uses unicode_utils.

$ gem install unicode_utils
Fetching: unicode_utils-1.4.0.gem (100%)
Successfully installed unicode_utils-1.4.0
1 gem installed

$ pry
[1] pry(main)> require 'unicode_utils/nfkc'
=> true
[2] pry(main)> text = "㈱㌧㌦Ⅲ"
=> "㈱㌧㌦Ⅲ"
[3] pry(main)> UnicodeUtils.nfkc text
=> "(株)トンドルIII"

Using ActiveSupport::Multibyte::Unicode

The following sample uses ActiveSupport::Multibyte::Unicode.

$ gem install activesupport
Fetching: thread_safe-0.3.4.gem (100%)
Successfully installed thread_safe-0.3.4
Fetching: tzinfo-1.2.2.gem (100%)
Successfully installed tzinfo-1.2.2
Fetching: i18n-0.7.0.gem (100%)
Successfully installed i18n-0.7.0
Fetching: activesupport-4.2.0.gem (100%)
Successfully installed activesupport-4.2.0
4 gems installed


$ pry
[1] pry(main)> require 'active_support/multibyte/unicode'
=> true
[2] pry(main)> text = "㈱㌧㌦Ⅲ"
=> "㈱㌧㌦Ⅲ"
[3] pry(main)> ActiveSupport::Multibyte::Unicode.normalize(text, :kc)
=> "(株)トンドルIII"

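Benchmark

Let's measure which one is actually fastest.
Using the same string as in the examples above, each library is compared by the time taken for one million normalizations.
As the results show, unicode and unf, which are built as native extensions, are by far the fastest.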

Library          user    system   total    real
unf              4.91    0        4.91     4.917251
unicode          4.03    0.01     4.04     4.041364
unicode_utils    14.32   0.01     14.33    14.337881
ActiveSupport    24.49   0.04     24.53    24.616207

The source code used for the measurement is as follows.

# -*- coding: utf-8 -*-
require 'unf'
require 'unicode'
require 'unicode_utils/nfkc'
require 'active_support/multibyte/unicode'
require 'benchmark'

text = '㈱㌧㌦Ⅲ'  # same sample string as in the examples above
n = 1_000_000     # iterations per library

# A label width of 13 fits the longest label ("unicode_utils") so the columns line up.
Benchmark.bm(13) do |x|
  x.report("unf")           { n.times { UNF::Normalizer.normalize(text, :nfkc) } }
  x.report("unicode")       { n.times { Unicode::nfkc(text) } }
  x.report("unicode_utils") { n.times { UnicodeUtils.nfkc(text) } }
  x.report("ActiveSupport") { n.times { ActiveSupport::Multibyte::Unicode.normalize(text, :kc) } }
end

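Example: composing non-combining dakuten and handakuten marks

Text copied from a PDF sometimes has the dakuten or handakuten split off from its base character (for example, ウ followed by a separate ゛).
When the mark is the standalone (non-combining) form, the normalization shown above cannot restore it, so replacing it with the combining form first, as follows, seems to work well.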

# Handling text where ウ and ゛ have been split into separate characters
$ pry
[1] pry(main)> text = "ウ゛ァイオリン"
=> "ウ゛ァイオリン"
[2] pry(main)> text.gsub("\u309B", "\u3099").gsub("\u309C", "\u309A")
=> "ヴァイオリン"
[3] pry(main)> convert = {"\u309B" => "\u3099", "\u309C" => "\u309A"}
=> {"゛"=>"゙", "゜"=>"゚"}
[4] pry(main)> text.gsub(/\u309B|\u309C/, convert)
=> "ヴァイオリン"

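The regex-and-hash form, gsub(/\u309B|\u309C/, convert), is somewhat slower than the chained gsub calls, presumably because of the regular-expression match and the hash lookup performed for each replacement.
Writing the hash literal inline in the call instead of hoisting it into convert is slower still, since a new Hash is then built on every call.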
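
One detail worth making explicit: the substitution only swaps the standalone mark for the combining one, so the result is still two code points (ウ + U+3099) until it is normalized again. The sketch below assumes Ruby 2.2 or later for `String#unicode_normalize`. The method name `compose_voiced_marks` is hypothetical.

```ruby
# Assumes Ruby 2.2+. `compose_voiced_marks` is a hypothetical helper name.
def compose_voiced_marks(text)
  # Swap the standalone marks (U+309B, U+309C) for the combining ones (U+3099, U+309A).
  combining = text.gsub("\u309B", "\u3099").gsub("\u309C", "\u309A")
  # Then let normalization compose base character + combining mark.
  combining.unicode_normalize(:nfkc)
end

# ウ, standalone dakuten (U+309B), then ァイオリン
text = "\u30A6\u309B\u30A1\u30A4\u30AA\u30EA\u30F3"

# NFKC alone turns U+309B into a space followed by U+3099, so nothing composes.
p text.unicode_normalize(:nfkc).codepoints.first(3).map { |cp| cp.to_s(16) }
# => ["30a6", "20", "3099"]

# After the substitution, ウ + U+3099 composes into ヴ (U+30F4).
p compose_voiced_marks(text).codepoints.first.to_s(16)
# => "30f4"
```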

Next, let's confirm this with a benchmark.

$ ruby benchmark_gsub.rb
              user     system      total        real
single_args  2.660000   0.010000   2.670000 (  2.675610)
single_block  2.920000   0.010000   2.930000 (  2.931155)
chain_method  2.100000   0.000000   2.100000 (  2.104693)

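# -*- coding: utf-8 -*-
require 'benchmark'

n = 1_000_000  # iterations per variant
text = "ウ゛ァイオリン"
# Maps the standalone marks to their combining counterparts.
convert = {"\u309B" => "\u3099", "\u309C" => "\u309A"}
Benchmark.bm(7) do |x|
  # gsub with a regex and a replacement hash
  x.report("single_args")  {  n.times { text.gsub(/\u309B|\u309C/, convert) } }
  # gsub with a regex and a block that looks up the hash
  x.report("single_block")  {  n.times { text.gsub(/\u309B|\u309C/) { |chr| convert[chr] }} }
  # two chained gsub calls with string patterns
  x.report("chain_method") { n.times { text.gsub("\u309B", "\u3099").gsub("\u309C", "\u309A") } }
end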
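
As a further variation that the benchmark above does not measure, `String#tr` can map both marks in a single pass without a regular expression. This is only a sketch to show that the result is equivalent. No timing is claimed for it.

```ruby
# ウ, standalone dakuten (U+309B), then ァイオリン
text = "\u30A6\u309B\u30A1\u30A4\u30AA\u30EA\u30F3"

# tr maps characters positionally: U+309B -> U+3099, U+309C -> U+309A.
via_tr    = text.tr("\u309B\u309C", "\u3099\u309A")
via_chain = text.gsub("\u309B", "\u3099").gsub("\u309C", "\u309A")

puts via_tr == via_chain  # true: both produce the same string
```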

Further reading

Unicode normalization
http://nomenclator.la.coocan.jp/unicode/normalization.htm