Retrieving information collected by FESS with Ruby

Last updated at 2016-11-02, posted at 2016-10-25

This post covers the FESS / Elasticsearch part of "Collecting and classifying machine-learning-related information (concept)" in detail.

About half a year has passed since I actually investigated this, so some details may differ slightly from the current state.

(1) Article schema

FESS uses Elasticsearch to manage the articles it crawls. I looked for documentation of the schema structure but could not find any.

However, in the FESS admin screen ( http://localhost:8090/admin/dashboard/ ¹ ), selecting more -> analysis brings up a pulldown labeled ANALYZE BY FIELD TYPE, and choosing index name / type / field name from the pulldowns appears to let you inspect the schema structure.
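As a complement to the admin screen, the same information can probably be read from Elasticsearch's mapping endpoint (GET /<index>/_mapping). The sketch below is only an illustration: the helper names fetch_mapping and field_types are mine, the port 9201 is the one that appears later in this post, and SAMPLE_MAPPING is a hypothetical, heavily shortened response in the Elasticsearch 2.x layout, not actual FESS output.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumption: Elasticsearch is reachable here (see section (2) for the port).
ES_BASE = 'http://localhost:9201'

# Fetch the raw mapping JSON for an index. Not called at load time.
def fetch_mapping(index, base = ES_BASE)
  Net::HTTP.get(URI("#{base}/#{index}/_mapping"))
end

# Convert a mapping response into { type => { field => field_type } }.
def field_types(mapping_json, index)
  mappings = JSON.parse(mapping_json).fetch(index).fetch('mappings')
  mappings.each_with_object({}) do |(type, body), out|
    props = body.fetch('properties', {})
    out[type] = props.each_with_object({}) do |(name, spec), fields|
      fields[name] = spec['type']
    end
  end
end

# Hypothetical, shortened response used only for illustration.
SAMPLE_MAPPING = <<~EOS
  {"fess": {"mappings": {"doc": {"properties": {
    "title": {"type": "string"},
    "last_modified": {"type": "date"}
  }}}}}
EOS

p field_types(SAMPLE_MAPPING, 'fess')
```

Against a live server, field_types(fetch_mapping('fess'), 'fess') would be the corresponding call, assuming the index is named fess.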

From that, I inferred that the article bodies live under /fess/doc. In the FESS admin screen ( http://localhost:8090/admin/dashboard/ ¹ ), selecting rest and issuing a GET against /fess/doc/_search let me see the contents.

Expressed as JSON, each document looks like the example below.

{
  "_index": "fess",
  "_type": "doc",
  "_id": "http://b.hatena.ne.jp/entry/www.publickey1.jp/blog/14/tamr.html;role=guest",
  "_score": 1,
  "_source": {
    "title": "Web page title",
    "config_id": "WAVNT47Q1yptIFHWqnp2t",
    "expires": "2016-03-22T03:00:19.572Z",
    "lang": "ja",
    "content": "Web page content with tags stripped",
    "cache": "Raw web page content",
    "has_cache": "true",
    "digest": "Summary text shown in the search result list",
    "segment": "20160322120000",
    "host": "news.nifty.com",
    "site": "news.nifty.com/cs/technology/techalldetail/yomi...",
    "url": "http://news.nifty.com/cs/technology/techalldetail/yomiuri-20160321-50078/1.htm",
    "created": "2016-03-22T03:10:08.561Z",
    "anchor": ["URL of a web page linked from this page"],
    "mimetype": "text/html",
    "filetype": "html",
    "content_length": "179232",
    "last_modified": "2016-03-08T02:48:36.000Z",
    "timestamp": "2016-03-08T02:48:36.000Z",
    "boost": "1.0",
    "label": [],
    "role": ["guest"],
    "parent_id": "http://ai.paint-ink.com/;role=guest",
    "click_count": 0,
    "favorite_count": 0,
    "doc_id": "e0d87bae582c4a8d84ff5270e593da0c"
  }
}, ...
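
To make the structure above concrete, here is a minimal sketch that turns one such hit into a small Ruby value object. The Article struct, the article_from_hit helper, and the example.com sample data are hypothetical names and values of mine; only the key names (_id, _source, title, url, last_modified, role) come from the JSON above.

```ruby
require 'json'
require 'time'

# Value object holding only the fields used in this sketch.
Article = Struct.new(:id, :title, :url, :last_modified, :roles)

# Build an Article from one element of the hits array.
def article_from_hit(hit)
  src = hit.fetch('_source')
  Article.new(
    hit['_id'],
    src['title'],
    src['url'],
    Time.iso8601(src['last_modified']), # timestamps are ISO 8601 strings
    Array(src['role'])
  )
end

# Hypothetical hit, shortened from the example above.
SAMPLE_HIT = JSON.parse(<<~EOS)
  {
    "_index": "fess",
    "_type": "doc",
    "_id": "http://example.com/;role=guest",
    "_source": {
      "title": "Web page title",
      "url": "http://example.com/",
      "last_modified": "2016-03-08T02:48:36.000Z",
      "role": ["guest"]
    }
  }
EOS

article = article_from_hit(SAMPLE_HIT)
puts "#{article.title} (#{article.url}) modified #{article.last_modified.utc.iso8601}"
```

Parsing last_modified into a Time makes it easy to compare documents by age on the Ruby side as well.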

(2) Fetching new search results

Checking with netstat showed that Elasticsearch was listening on port 9201 ¹, so I sent REST requests to localhost:9201 to fetch the crawl results from outside FESS.

A REST request that retrieves the web pages updated in the past 4 hours looks like this ²:

curl -XPOST 'http://localhost:9201/fess/doc/_search?pretty=true' -d '{
  "query": {
    "range": {
      "last_modified": {
        "gt": "now-4h"
      }
    }
  }
}'
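
The same request can be sketched in Ruby with only the standard library. One caveat worth stating: Elasticsearch returns just 10 hits per search unless a size is given, so this sketch adds a size parameter. The helper names recent_query and search_recent and the default of 100 are arbitrary choices of mine, and the Content-Type header is an addition that newer Elasticsearch versions require.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the range query from the curl example, plus an explicit size.
def recent_query(since, size: 100)
  {
    'size'  => size,
    'query' => { 'range' => { 'last_modified' => { 'gt' => since } } }
  }
end

# POST the query and return the array of hits. Not called at load time.
def search_recent(since, base: 'http://localhost:9201', size: 100)
  uri = URI("#{base}/fess/doc/_search")
  res = Net::HTTP.post(uri, JSON.generate(recent_query(since, size: size)),
                       'Content-Type' => 'application/json')
  JSON.parse(res.body).fetch('hits').fetch('hits')
end

# Show the request body that would be sent.
puts JSON.pretty_generate(recent_query('now-4h'))
```

With a running server, search_recent('now-4h').each { |hit| puts hit['_source']['title'] } would list the titles of the matching pages.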

(3) Morphological analysis

Next, I break the _source/content of the crawl results fetched in the previous section into morphemes with kuromoji ³.
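
As a rough illustration of what can be done with the resulting tokens, the sketch below keeps only the tokens of a given part of speech. It runs without a JVM by using a stand-in struct: FakeToken, SAMPLE_TOKENS, and surfaces_with_pos are hypothetical names of mine, and I am assuming that all_features returns a comma-separated string whose first element is the part of speech, as in IPADIC-style output.

```ruby
# Stand-in for a kuromoji token; the real objects come from Rjb.
FakeToken = Struct.new(:surface_form, :all_features)

# Keep the surface forms whose first feature (part of speech) equals pos.
def surfaces_with_pos(tokens, pos)
  tokens.select { |t| t.all_features.split(',').first == pos }
        .map(&:surface_form)
end

# Hypothetical tokens for the phrase 機械学習の情報.
SAMPLE_TOKENS = [
  FakeToken.new('機械', '名詞,一般,*,*,*,*,機械,キカイ,キカイ'),
  FakeToken.new('学習', '名詞,サ変接続,*,*,*,*,学習,ガクシュウ,ガクシュウ'),
  FakeToken.new('の',   '助詞,連体化,*,*,*,*,の,ノ,ノ'),
  FakeToken.new('情報', '名詞,一般,*,*,*,*,情報,ジョウホウ,ジョウホウ')
].freeze

# 名詞 means "noun".
puts surfaces_with_pos(SAMPLE_TOKENS, '名詞').join(' ')
```

Because the helper only calls surface_form and all_features, it should work unchanged on the token objects returned through Rjb, provided they behave as assumed.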

(4) Retrieval script

The following Ruby script retrieves the web pages updated in the past 4 hours, runs morphological analysis on them, and stores the results in Fess::Doc objects:

# -*- coding: utf-8 -*-

require 'rest-client'
require 'json'
require 'rjb'

Encoding.default_external = 'UTF-8'

module Kuromoji

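  # Load the kuromoji jar into the JVM and build one shared tokenizer.
  Rjb::load('kuromoji-0.7.7.jar')
  tokenizer = Rjb::import('org.atilika.kuromoji.Tokenizer')
  @@tknizer = tokenizer.builder.build

  # Tokenize a sentence and copy the Java list into a Ruby array.
  def self.tokenize(sentence)
    list = []
    it   = @@tknizer.tokenize(sentence).iterator
    while it.has_next
      list << it.next
    end
    list
  end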
end

module Fess

  URL    = 'http://localhost:9201'
  QUERY  = <<QUERY
{
  "query" : {
    "range" : {
       "last_modified" : {
         "gt" : "%s"
      }
    }
  }
}
QUERY

  @@client = RestClient::Resource.new(URL)

  class Doc

    attr_reader :tokens

    def initialize(doc)
      @doc = doc
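      # Expose each top-level key (_index, _id, _source, ...) as a reader
      # method with the leading underscore removed (index, id, source, ...).
      doc.keys.each do |key|
        define_singleton_method(key.sub(/^_/, '')) { @doc[key] }
      end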
      @tokens = Kuromoji.tokenize(source['content'])
    end
  end

  # Fetch documents modified after `edge` (e.g. 'now-4h') and wrap each hit in a Doc.
  def self.retrieve(edge)
    result = @@client['/fess/doc/_search?pretty=true'].post(QUERY % edge)
    json   = JSON.parse(result.body)
    json['hits']['hits'].map {|doc| Doc.new(doc)}
  end
end

Fess.retrieve('now-4h').each do |doc|
  # p adds a trailing newline, so the header does not run into the first token.
  p [doc.id, doc.source['last_modified'], doc.source['title']]
  doc.tokens[0..10].each do |x|
    print x.surface_form
    print " : "
    puts x.all_features
  end
end

That is roughly how it looks ⁴.

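¹ Replace localhost and ports 8090 and 9201 to match your environment.
² → Ranges
³ → Using the morphological analysis library kuromoji from Ruby (rubyから形態素解析ライブラリkuromojiを使う)
⁴ This script assumes kuromoji-0.7.7.jar is placed in the current directory. Adjust the path to match your actual environment.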