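I needed this for a university class, so I wrote a script that extracts phrases (bunsetsu) starting with a token that MeCab tags as 「名詞,サ変接続」 (a noun that can take する). I have no real knowledge of Japanese grammar, so the rules below are guesswork.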
Addendum (2019/11/27)
Thank you for the comments. I learned a lot (>_<)
I will try to reflect them in the code.
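
For readers who have not seen MeCab output before, here is a minimal sketch of the check the whole script is built around. It assumes IPADIC-style feature strings, where the first two comma-separated fields are the part of speech and its subcategory. The tokens are hand-written for illustration rather than produced by MeCab, and `sahen_feature?` is a hypothetical helper name, not part of any library.

```ruby
# Hand-written (surface, feature) pairs standing in for MeCab nodes.
# Assumption: features follow the IPADIC layout "POS,subcategory,...".
TOKENS = [
  ["勉強", "名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー"],
  ["する", "動詞,自立,*,*,サ変・スル,基本形,する,スル,スル"],
  ["本",   "名詞,一般,*,*,*,*,本,ホン,ホン"]
].freeze

# Hypothetical helper: true when the feature string starts with 名詞,サ変接続.
def sahen_feature?(feature)
  pos, sub = feature.split(",")
  pos == "名詞" && sub == "サ変接続"
end

# Keep only the surfaces whose feature passes the check.
SAHEN_SURFACES = TOKENS.select { |_surface, feature| sahen_feature?(feature) }
                       .map(&:first)
puts SAHEN_SURFACES.inspect
```

The real script applies the same test to `n.feature` of each node and then decides, token by token, whether to keep extending the phrase.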
# Code
require 'MeCab'

tmp = ""   # phrase currently being built
count = 0  # number of independent verbs appended to tmp so far (0 or 1)
flag = 0   # 1 right after a noun-connecting prefix, when a サ変接続 noun must follow

# True for a 名詞,サ変接続 node whose surface is not one of the punctuation marks ) , ~ (
def sahen_noun?(feature, surface)
  feature[0] == "名詞" && feature[1] == "サ変接続" &&
    ![")", ",", "~", "("].include?(surface)
end

lines = IO.readlines("all_hyouka.txt", chomp: true)

# Create the model and tagger once instead of once per input line
model = MeCab::Model.new(ARGV.join(" "))
tagger = model.createTagger()

File.open("phrase_data_sahen.rb", "w") do |f|
  lines.each do |line|
    begin
      n = tagger.parseToNode(line)
      while n do
        feature = n.feature.split(",")
        if tmp != "" && flag == 1
          # A prefix was read: keep it only if a サ変接続 noun follows
          if sahen_noun?(feature, n.surface)
            tmp << n.surface
          else
            tmp = ""
          end
          flag = 0
        elsif tmp != "" && count == 0
          # A サ変接続 noun was read: an independent verb must come next
          if feature[0] == "動詞" && feature[1] == "自立"
            tmp << n.surface
            count = count + 1
          else
            tmp = ""
          end
        elsif tmp != "" && count == 1
          # Noun + verb collected: extend the phrase or write it out, depending on this token
          case feature[0]
          when "動詞"
            case feature[1]
            when "自立"
              f.print "\n", tmp
              tmp = ""
              count = 0
            when "非自立", "接尾"
              tmp << n.surface
            end
          when "助詞", "助動詞", "フィラー"
            tmp << n.surface
          when "形容詞"
            case feature[1]
            when "自立"
              f.print "\n", tmp
              tmp = ""
              count = 0
            when "非自立"
              tmp << n.surface
            end
          when "名詞"
            case feature[1]
            when "サ変接続"
              # Write out the finished phrase and start a new one from this noun
              f.print "\n", tmp
              tmp = n.surface
              count = 0
            when "接尾", "特殊" # , "非自立"
              tmp << n.surface
            else
              f.print "\n", tmp
              tmp = ""
              count = 0
            end
          when "感動詞", "接続詞", "接頭詞", "副詞", "記号", "連体詞", "BOS/EOS"
            f.print "\n", tmp
            tmp = ""
            count = 0
          else
            f.print "\n", tmp
            tmp = ""
            count = 0
          end
        else
          # Nothing collected yet: look for the start of a phrase.
          # IPADIC tags noun-connecting prefixes as 接頭詞,名詞接続.
          if feature[0] == "接頭詞" && feature[1] == "名詞接続" # || (feature[0] == "副詞" && feature[1] == "一般")
            tmp = n.surface
            flag = 1
          elsif sahen_noun?(feature, n.surface)
            tmp = n.surface
          end
        end
        n = n.next
      end
    rescue => e
      # A bare rescue catches any StandardError, not only RuntimeError
      puts "Error: #{e.message}"
    end
  end
end
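################### Below: count how often each phrase occurs and sort in descending order of frequency
words = Hash.new(0)
lines = IO.readlines("phrase_data_sahen.rb", chomp: true)
lines.each do |line|
  next if line.empty? # a "\n" is written before each phrase, so the first line of the file is blank
  words[line] += 1
end

File.open("output_sahen.rb", "w") do |output|
  output.print "\357\273\277" # UTF-8 byte order mark (EF BB BF)
  words.sort_by { |word, count| [-count, word] }.each do |word, _count|
    output.print word, "\n"
  end
end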
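
The counting step can also be tried on its own. The following is only a sketch under a few assumptions: it works on an in-memory array instead of reading `phrase_data_sahen.rb`, it skips blank entries, and, unlike the script, it keeps the count next to each phrase. `rank_phrases` and the sample data are hypothetical names made up for this example.

```ruby
# Hypothetical helper: count phrases and order them by frequency (descending),
# breaking ties by string order, like the sort_by block in the script.
def rank_phrases(phrases)
  counts = Hash.new(0)
  phrases.each do |phrase|
    next if phrase.empty? # skip blank lines
    counts[phrase] += 1
  end
  counts.sort_by { |phrase, count| [-count, phrase] }
end

# Made-up sample standing in for the lines of the phrase file.
SAMPLE = ["", "勉強する", "確認した", "勉強する"].freeze
RANKED = rank_phrases(SAMPLE)
RANKED.each { |phrase, count| puts "#{phrase}\t#{count}" }
```

Printing the count alongside each phrase makes it easier to check that the ordering is what you expect before writing the final file.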