
Extracting phrases (bunsetsu) that begin with a sa-hen noun

Posted at 2019-11-21

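I needed this for a university class, so I wrote a script that extracts phrases (bunsetsu) beginning with a token tagged 「名詞,サ変接続」 (noun, sa-hen connection). I have no background in Japanese grammar, so the rules are based on guesswork.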
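
For readers unfamiliar with the tag: with the IPA dictionary, MeCab attaches to each node a `surface` string and a comma-separated `feature` string whose first two fields are the part of speech and its subcategory. The sketch below shows the kind of check the script relies on. The sample rows are hand-written for illustration (they are not output captured from MeCab), and `sahen_noun?` is a hypothetical helper name.

```ruby
# Hand-written sample rows in the shape of MeCab/IPADIC nodes: [surface, feature].
# They are illustrative assumptions, not output captured from MeCab.
SAMPLE_NODES = [
  ["勉強", "名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー"],
  ["する", "動詞,自立,*,*,サ変・スル,基本形,する,スル,スル"],
  ["本",   "名詞,一般,*,*,*,*,本,ホン,ホン"],
]

# Hypothetical helper: true when the feature string starts with 名詞,サ変接続.
def sahen_noun?(feature)
  pos, subtype = feature.split(",")
  pos == "名詞" && subtype == "サ変接続"
end

SAMPLE_NODES.each do |surface, feature|
  puts "#{surface}: #{sahen_noun?(feature)}"
end
```

Only the first row passes the check, so only 勉強 would start a new phrase.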

Update (2019/11/27)
Thank you for the comments. I learned a lot (>_<)
I will try to reflect them in the code.

Code

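require 'MeCab'
tmp = ""   # buffer for the phrase currently being built
count = 0  # number of independent verbs appended to the buffer (0 or 1)
flag = 0   # 1 while a prefix is buffered and a sa-hen noun is expected next
lines = IO.readlines("all_hyouka.txt", chomp: true)

# Build the model and tagger once instead of once per input line.
model = MeCab::Model.new(ARGV.join(" "))
tagger = model.createTagger()

File.open("phrase_data_sahen.rb", "w") do |f|
  lines.each do | line |
    begin
      n = tagger.parseToNode(line)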
      while n do
        feature = n.feature.split(",")
        if tmp!="" && flag == 1
          if feature[0]=="名詞" and feature[1]=="サ変接続" and n.surface!=")" and n.surface!="," and n.surface!="~" and n.surface!="("
            tmp << n.surface
            flag = 0
          else
            tmp = ""
            flag = 0
          end
        elsif tmp!="" && count==0 
          if feature[0]=="動詞" and feature[1]=="自立"
            tmp << n.surface
            count=count+1
          else
            tmp = ""
          end
        elsif tmp!="" && count==1
          case feature[0]
          when "動詞"
            case feature[1]
            when "自立"
              f.print "\n",tmp
              tmp = ""
              count=0
            when "非自立", "接尾"
              tmp << n.surface
            end
          when "助詞","助動詞","フィラー"
            tmp << n.surface
          when "形容詞"
            case feature[1]
            when "自立"
              f.print "\n",tmp
              tmp = ""
              count=0
            when "非自立"
              tmp << n.surface
            end
          when "名詞"
            case feature[1]
            when "サ変接続"
              f.print "\n",tmp
              tmp = n.surface
              count=0
            when "接尾","特殊" #,"非自立"
              tmp << n.surface
            else
              f.print "\n",tmp
              tmp = ""
              count=0
            end
          when "感動詞","接続詞","接頭詞","副詞","記号","連体詞","BOS/EOS"
            f.print "\n",tmp
            tmp = ""
            count=0
          else
            f.print "\n",tmp
            tmp = ""
            count=0
          end
        else
          # IPADIC tags noun-attaching prefixes as 接頭詞,名詞接続 (not 接続名詞).
          if feature[0]=="接頭詞" and feature[1]=="名詞接続" #or(n.feature.split(",")[0]=="副詞" and n.feature.split(",")[1]=="一般")
            tmp = n.surface
            flag = 1
          elsif feature[0]=="名詞" and feature[1]=="サ変接続" and n.surface!=")" and n.surface!="," and n.surface!="~" and n.surface!="("
            tmp = n.surface
          end
        end
        n = n.next
      end
    rescue => e
      puts "RuntimeError: #{e}"
    end
  end
end

################### Below: count how often each phrase occurs and sort in descending order of frequency

words = Hash.new(0)

lines = IO.readlines("phrase_data_sahen.rb", chomp: true)
lines.each do | line |
  next if line.empty? # the phrase file starts with an empty line; do not count it
  words[line] += 1
end

File.open("output_sahen.rb","w") do |output|
  output.print "\357\273\277" # UTF-8 byte order mark (EF BB BF)
  words.sort_by{|word,count| [-count,word]}.each do |word,count|
    output.print word.chomp,"\n"
  end
end
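
The counting and sorting step can be tried in isolation. The following is a minimal sketch on made-up input, assuming the phrases are already in memory. `rank_phrases` is a hypothetical helper name, and unlike the script above it returns the counts as well instead of writing only the phrases to a file.

```ruby
# Hypothetical helper: count identical phrases and return [phrase, count]
# pairs, most frequent first, breaking ties by the phrase itself
# (the same sort key as [-count, word] in the script).
def rank_phrases(phrases)
  counts = Hash.new(0)
  phrases.each { |phrase| counts[phrase] += 1 }
  counts.sort_by { |phrase, count| [-count, phrase] }
end

# Made-up input for illustration.
sample = ["勉強する", "確認した", "勉強する", "利用できる", "確認した", "勉強する"]

rank_phrases(sample).each do |phrase, count|
  puts "#{count}\t#{phrase}"
end
# prints:
# 3	勉強する
# 2	確認した
# 1	利用できる
```

Negating the count in the sort key gives descending frequency while keeping the tie-break on the phrase in ascending order.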

Examples of extracted phrases

image.png
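
To make the extraction rule concrete, here is a simplified sketch of its core: a sa-hen noun, immediately followed by an independent verb, followed by any trailing particles, auxiliary verbs, or dependent verbs. It is not the script above. It runs on hand-written `[surface, pos, subtype]` rows instead of MeCab nodes, ignores prefixes, adjectives, and noun suffixes, and the names `extract_sahen_phrases` and `continues_phrase?` are hypothetical.

```ruby
# Hypothetical helper: tokens that may extend a phrase once it contains a verb.
def continues_phrase?(pos, subtype)
  %w[助詞 助動詞].include?(pos) ||
    (pos == "動詞" && %w[非自立 接尾].include?(subtype))
end

# Simplified sketch of the rule; tokens is an array of [surface, pos, subtype].
def extract_sahen_phrases(tokens)
  phrases = []
  buffer = nil      # phrase being built, or nil when idle
  has_verb = false  # true once the independent verb has been appended

  # A sentinel end marker flushes a phrase that reaches the end of the input.
  (tokens + [["", "BOS/EOS", "*"]]).each do |surface, pos, subtype|
    if buffer && !has_verb && pos == "動詞" && subtype == "自立"
      buffer << surface
      has_verb = true
    elsif buffer && has_verb && continues_phrase?(pos, subtype)
      buffer << surface
    else
      phrases << buffer if buffer && has_verb  # emit only completed phrases
      buffer = (pos == "名詞" && subtype == "サ変接続") ? surface.dup : nil
      has_verb = false
    end
  end
  phrases
end

# Hand-written tokens for illustration (not captured from MeCab).
SAMPLE_TOKENS = [
  ["勉強", "名詞", "サ変接続"], ["し", "動詞", "自立"], ["て", "助詞", "接続助詞"],
  ["いる", "動詞", "非自立"], ["学生", "名詞", "一般"], ["が", "助詞", "格助詞"],
  ["質問", "名詞", "サ変接続"], ["を", "助詞", "格助詞"], ["する", "動詞", "自立"],
]

puts extract_sahen_phrases(SAMPLE_TOKENS)
# prints: 勉強している
```

Note that 質問をする is not extracted: as in the script, the sa-hen noun is dropped unless an independent verb follows it directly.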
