

[Java] Spam detection using the morphological analyzer "lucene-gosen"


What I did

As a first step toward machine learning, I tried spam detection based on morphological analysis (written in Java, using Play Framework).

What is lucene-gosen

A morphological analysis tool for Java. You only need to download the jar to use it.

Download the jar from the site below:
https://code.google.com/p/lucene-gosen/

Reading the file to analyze and writing the result

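import java.io.*;
import net.java.sen.SenFactory;
import net.java.sen.StreamTagger;
import net.java.sen.StringTagger;
import net.java.sen.dictionary.Token;

// Tokenizes readfile (read as UTF-8) and writes one token surface per line to writefile.
// Note: FileWriter writes in the platform default encoding.
public static void writing(File readfile, File writefile) throws IOException {
    // null: no external dictionary directory is specified
    StringTagger stringTagger = SenFactory.getStringTagger(null);

    // try-with-resources closes the reader and writer even if tagging fails
    try (Reader reader = new InputStreamReader(new FileInputStream(readfile), "UTF-8");
         BufferedWriter bw = new BufferedWriter(new FileWriter(writefile))) {
        StreamTagger tagger = new StreamTagger(stringTagger, reader);
        while (tagger.hasNext()) {
            Token token = tagger.next();
            bw.write(token.getSurface());
            bw.newLine();
        }
    }
}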

Count the segmented words and sort them by number of occurrences, most frequent first

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class Wordseparated {
    // Counts the whitespace-separated words in readfile, writes them to writefile
    // (one word per line, most frequent first), and returns the count table.
    // CountTable is not defined in this article; it is assumed to provide
    // add(), get(), and getKeysByCount().
    public CountTable count(String readfile, String writefile) throws IOException {
        CountTable table = new CountTable();

        try (BufferedReader brfile = new BufferedReader(new FileReader(readfile))) {
            String linefile;
            while ((linefile = brfile.readLine()) != null) {
                for (String s : linefile.split("\\s+")) {
                    if (!s.isEmpty()) {
                        table.add(s);
                    }
                }
            }
        }

        try (BufferedWriter bwfile = new BufferedWriter(new FileWriter(writefile))) {
            for (String s : table.getKeysByCount()) {
                bwfile.write(s);
                bwfile.newLine();
            }
        }
        return table;
    }
}
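
The CountTable class used above is not shown in this article. For reference, here is a minimal sketch of what such a class could look like, assuming that add() increments a word's count, get() returns the current count (0 for an unseen word), and getKeysByCount() returns the words in descending order of count. The implementation and the alphabetical tie-breaking are assumptions made for illustration, not the original class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the CountTable class that the article does not show.
class CountTable {
    private final Map<String, Integer> counts = new HashMap<>();

    // Increments the count for the given word.
    public void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Returns how many times the word has been added (0 if never added).
    public int get(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Returns all words, most frequent first; ties are broken alphabetically
    // (an assumption, so that the order is deterministic).
    public List<String> getKeysByCount() {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return keys;
    }
}
```

With this sketch, adding "spam", "spam", and "ham" makes get("spam") return 2, and getKeysByCount() returns the words in the order spam, ham.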