
Extracting Japanese strings from template files with Ruby

Ruby

Last updated at 2015-10-29, posted at 2015-10-28

As the title says: when asking non-engineers to translate HTML or template files, pulling out the Japanese text by hand was tedious, so I wrote a script to do it.

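It is not perfect, so please treat it as a reference only and use it at your own risk.
For example, in text like "Rails大好き" ("I love Rails"), the "Rails" part is not extracted, because the pattern only matches Japanese characters.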
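To make that limitation concrete, here is a minimal sketch that runs the same pattern used in the script on the example string. The variable names `sample` and `matches` are only for illustration.

```ruby
# Same pattern as in get_japanese_from_file.rb:
# kanji, hiragana, katakana, plus a few Japanese punctuation characters
japanese_regex = /[\p{Han}\p{Hiragana}\p{Katakana}，．、。ー・]+/

# Illustrative input: ASCII text directly followed by Japanese text
sample = 'Rails大好き'

# String#scan returns every non-overlapping match as an array
matches = sample.scan(japanese_regex)
p matches  # => ["大好き"]
```

The ASCII word "Rails" is not in any of the character classes, so only the Japanese run after it comes back.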

get_japanese_from_file.rb

# Regular expression: one or more kanji, hiragana, katakana, or Japanese punctuation characters
japanese_regex = /[\p{Han}\p{Hiragana}\p{Katakana}，．、。ー・]+/

File.open('./japanese.txt', 'w') do |japanese_file|
  Dir.glob('views/**/*.html.erb') do |template_file|
    # Write the template file name as a header line (the label means "Target file name")
    japanese_file.puts("対象ファイル名：#{template_file}")

    # Read the whole file as UTF-8
    text = File.read(template_file, encoding: Encoding::UTF_8)

    # Extract the Japanese strings and drop duplicates within this file
    japanese_words = text.scan(japanese_regex)
    uniq_japanese_words = japanese_words.uniq

    # Write the extracted strings, one per line
    japanese_file.puts(uniq_japanese_words)
  end
end
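To get a feel for what ends up in japanese.txt, here is a small sketch that applies the same scan and uniq steps to an in-memory string instead of real files. The template content is made up for this example.

```ruby
japanese_regex = /[\p{Han}\p{Hiragana}\p{Katakana}，．、。ー・]+/

# Hypothetical template content, invented for illustration
template = <<~'ERB'
  <h1>ようこそ</h1>
  <p><%= @user.name %>さん、こんにちは。</p>
  <a href="/help">ヘルプ</a>
  <a href="/faq">ヘルプ</a>
ERB

# Same two steps as the script: collect all matches, then drop duplicates
words = template.scan(japanese_regex).uniq
puts words
# ようこそ
# さん、こんにちは。
# ヘルプ
```

Tags, attribute values, and the embedded Ruby are skipped because they contain no Japanese characters, and the repeated "ヘルプ" appears only once.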

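Regular expressions looked easier to write in Ruby than in PHP, so I wrote this in Ruby. I am not used to Ruby, though, so please point out anything odd in how it is written.
And of course, if there is a better way to do the extraction, please let me know!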
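As one possible direction for the "Rails" limitation, here is a rough sketch of a looser pattern that also keeps ASCII letters and digits when they directly touch Japanese characters. This is an assumption about what counts as useful, not part of the original script, and the names `jp`, `ascii`, and `mixed_regex` are hypothetical.

```ruby
# Assumption: "mixed" text means ASCII letters/digits directly adjacent to Japanese characters
jp = /[\p{Han}\p{Hiragana}\p{Katakana}，．、。ー・]/
ascii = /[A-Za-z0-9]/

# Optional ASCII prefix, at least one Japanese character, then any mix of both
mixed_regex = /#{ascii}*#{jp}+(?:#{ascii}|#{jp})*/

html = '<p>Rails大好き</p>'
mixed = html.scan(mixed_regex)
p mixed  # => ["Rails大好き"]
```

Tag names such as `p` are still skipped because a match must contain at least one Japanese character. Spaces still end a match, so "Ruby on Rails大好き" would only yield "Rails大好き".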

References

Matching Japanese (hiragana/katakana/kanji) with Ruby regular expressions:
http://easyramble.com/japanese-regex-with-ruby-oniguruma.html
