Counting the English words in an English article with Ruby and sorting them by frequency (improvement 1)

Posted at 2016-05-20

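This is an improved version of my earlier article, "Counting the English words in an English article with Ruby and sorting them by frequency" (Rubyで英語記事に含まれてる英単語を数えて出現数順にソートする).
The goal is to extract terms made of several English words, such as people's names, as single units.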

Here is the code.

test.rb
class ExtractWord

  def initialize
    @word = Hash.new(0) # counts of ordinary single words
    @word_sp = Hash.new(0) # counts of multi-word terms, e.g. "Google Play Awards"
    @file = File.read('test.txt') # read the whole article text without leaving a file handle open
  end

  def get_word_sp # collects terms made of several capitalized words, e.g. "Google Play Awards"
    reg2 = /([A-Z]\w*\s)([A-Z]\w*\s)/
    reg3 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/
    reg4 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/
    reg5 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/

    array =  @file.scan(reg5)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg4)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg3)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg2)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end

    get_word
  end

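  def get_word
    # NOTE: every pass starts again from @file, so after the loop only the
    # last phrase in @word_sp has been removed from the text.
    array = []
    @word_sp.each do |word, frequency|
      array = @file.gsub("#{word}", "")
    end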

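    # Split on whitespace and strip periods, commas and curly quotes from each token.
    array = array.split("\s").map{|m| m.gsub(/\.|\”|\,|\n|\“/, "")}
    array.each do |word|
      @word[word] += 1
    end
    puts "文字数:#{array.count}" # label reads "character count"; the value is the number of tokens
    puts "英熟語?---------------------------------------------------------------" # header: "phrases?"
    @word_sp.sort_by{|_, v| -v}.each do |word, frequency|
      puts "%3d %s" % [frequency, word]
    end
    puts "英単語------------------------------------------------------------------" # header: "English words"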
    @word.sort_by{|_, v| -v}.each do |word, frequency|
      puts "%3d %s" % [frequency, word]
    end
  end
end

extract_word = ExtractWord.new
extract_word.get_word_sp

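And here is the result. In the output, 文字数 is the number of words counted, 英熟語? heads the multi-word terms, and 英単語 heads the single words.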

文字数:392
英熟語?---------------------------------------------------------------
  4 Google Play
  2 Best Use
  2 Google Play Awards
  1 And Google
  1 Google Play Game
  1 Star Wars Galaxy
  1 TuneIn Radio
  1 Clash Royale
  1 MARVEL Future
  1 Star Wars
  1 Best Standout
英単語------------------------------------------------------------------
 22 the
 16 of
 13 Google
 11 and
  8 a
  6 apps
  5 app
  5 for
  5 to
  5 best
  5 Best
  4 this
  4 Play
  4 that
  3 most
  3 from
  3 on
  3 had
  3 also
  3 company
  3 or
...

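First, I extract multi-word terms with a pattern that requires two or more consecutive words starting with a capital letter. I define such patterns for runs of up to five words and store the matches in a Hash. Note: I would like to do something about this awful way of writing it.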
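To make the matching step concrete, here is a minimal sketch of what `scan` returns for the two-word and three-word patterns. The sample sentence is hypothetical and is not taken from `test.txt`; `join.strip` is used as a shorter stand-in for the `join(",").gsub(",", "").strip` in the listing.

```ruby
# Hypothetical sample sentence, not taken from the article's test.txt.
text = "We watched the Google Play Awards and Star Wars today."

reg2 = /([A-Z]\w*\s)([A-Z]\w*\s)/
reg3 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/

# scan returns one array of captured groups per non-overlapping match.
pairs   = text.scan(reg2)
triples = text.scan(reg3)

counts = Hash.new(0)
(triples + pairs).each do |groups|
  counts[groups.join.strip] += 1 # "Google " + "Play " becomes "Google Play"
end

counts.each { |phrase, n| puts "%3d %s" % [n, phrase] }
```

Because each pattern is scanned separately, "Google Play" gets counted here even though it only occurs inside "Google Play Awards". That is also why the shorter and longer terms overlap in the result above.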

(Image: Screen Shot 2016-05-20 at 20.17.56.png)

test.rb
def get_word_sp # collects terms made of several capitalized words, e.g. "Google Play Awards"
    reg2 = /([A-Z]\w*\s)([A-Z]\w*\s)/
    reg3 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/
    reg4 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/
    reg5 = /([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)([A-Z]\w*\s)/

    array =  @file.scan(reg5)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg4)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg3)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end
    array =  @file.scan(reg2)
    array.each do |word|
      key = word.join(",").gsub(",", "").strip
      @word_sp[key] += 1
    end

    get_word
  end
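The four nearly identical scan blocks above could probably be folded into one loop. The following is only a sketch under my own assumptions: `CAP_WORD` and `count_phrases` are hypothetical names that do not appear in the original code, and the sample text is made up. It builds each pattern by repeating the one-word group and scans from the longest pattern down to two words, as the listing does.

```ruby
# One capitalized word followed by a single whitespace character:
# the building block that reg2..reg5 repeat. (Hypothetical constant name.)
CAP_WORD = '([A-Z]\w*\s)'

# Count runs of 2..max_words capitalized words, longest pattern first.
# (Hypothetical helper, not part of the original class.)
def count_phrases(text, max_words = 5)
  counts = Hash.new(0)
  max_words.downto(2) do |n|
    regex = Regexp.new(CAP_WORD * n) # n copies of the one-word group
    text.scan(regex).each do |groups|
      counts[groups.join.strip] += 1
    end
  end
  counts
end

# Made-up sample text for illustration only.
sample = "Google Play Awards night. Google Play Awards again. Star Wars fans cheered."
result = count_phrases(sample)
result.sort_by { |_, v| -v }.each { |phrase, n| puts "%3d %s" % [n, phrase] }
```

Changing the upper limit then only means passing a different `max_words`, rather than adding another regex and another block.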

Next, take the terms that have already been extracted at this point:

4 Google Play
2 Best Use
2 Google Play Awards

Terms like these are removed from the original text, and the remaining words are counted in a Hash.

test.rb
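def get_word
    # NOTE: every pass starts again from @file, so after the loop only the
    # last phrase in @word_sp has been removed from the text.
    array = []
    @word_sp.each do |word, frequency|
      array = @file.gsub("#{word}", "")
    end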

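    # Split on whitespace and strip periods, commas and curly quotes from each token.
    array = array.split("\s").map{|m| m.gsub(/\.|\”|\,|\n|\“/, "")}
    array.each do |word|
      @word[word] += 1
    end
    puts "文字数:#{array.count}" # label reads "character count"; the value is the number of tokens
    puts "英熟語?---------------------------------------------------------------" # header: "phrases?"
    @word_sp.sort_by{|_, v| -v}.each do |word, frequency|
      puts "%3d %s" % [frequency, word]
    end
    puts "英単語------------------------------------------------------------------" # header: "English words"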
    @word.sort_by{|_, v| -v}.each do |word, frequency|
      puts "%3d %s" % [frequency, word]
    end
  end
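One thing to watch in the listing above: each pass of the loop calls `gsub` on `@file` again, so only the last phrase ends up removed. That would explain why "Google" and "Play" still appear in the single-word counts. A possible way to remove every phrase is to carry the text forward between passes. This is a sketch under my own assumptions, not the code that produced the result shown earlier; `remove_phrases`, `count_words`, and the sample text are hypothetical.

```ruby
# Remove every phrase from the text, carrying the result forward each pass.
# Longest phrases go first so "Google Play Awards" is removed before "Google Play".
# (Hypothetical helper, not part of the original class.)
def remove_phrases(text, phrases)
  phrases.sort_by { |phrase| -phrase.length }.inject(text) do |rest, phrase|
    rest.gsub(phrase, "")
  end
end

# Count whitespace-separated tokens after stripping periods, commas and curly quotes.
# (Hypothetical helper, not part of the original class.)
def count_words(text)
  counts = Hash.new(0)
  text.split.each do |token|
    word = token.gsub(/[.,“”]/, "")
    counts[word] += 1 unless word.empty?
  end
  counts
end

# Made-up sample text for illustration only.
sample = "Google Play Awards honor the best apps. Google Play has the apps."
rest   = remove_phrases(sample, ["Google Play", "Google Play Awards"])
words  = count_words(rest)
words.sort_by { |_, v| -v }.each { |word, n| puts "%3d %s" % [n, word] }
```

With this approach the counts of the single words would differ from the result shown earlier, because words that belong to an extracted term are no longer counted a second time.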

And that completes it.
I would like to get to the point where I can write this more elegantly.
