ʕ ◔ϖ◔ʔ Adding words to the MeCab system dictionary and building a morphological analysis API (source included)

Posted at 2018-06-14

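Introduction

Having managed to build a REST API server in Go, this time I'll make it runnable in Docker.
A server that only returns Hello World would be dull, though, so it runs morphological analysis with MeCab and returns the result.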

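Jockey Take Yutaka (武豊) is not "Taketoyo" (タケトヨ)

While using NEologd as the morphological analysis dictionary, I found that the reading of jockey 武豊 came out as 「タケトヨ」 instead of the correct 「タケユタカ」.
Since I operate siva, I couldn't let that stand, so I decided to add the word myself. Because I'm building an image anyway, I took the opportunity to add it to the system dictionary rather than to a user dictionary.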

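Procedure

The work proceeds in the following steps.

1. Create the word file for dictionary registration
2. Create the REST API server
3. Create the Dockerfile
4. Build the Docker image
5. Start the Docker container
6. Call the REST API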

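1. Creating the word file for dictionary registration

Following MeCab's guide on adding words (単語の追加方法), create a CSV file encoded in UTF-8.
Extra fields can be appended after the pronunciation, so each row also carries a dictionary name, a category, and a key.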

seed/jockey.csv

武豊,1289,1289,2000,名詞,固有名詞,人名,一般,*,*,たけゆたか,タケユタカ,タケユタカ,siva,騎手名,00666
武幸四郎,1289,1289,2000,名詞,固有名詞,人名,一般,*,*,たけこうしろう,タケコウシロウ,タケコウシロウ,siva,騎手名,01026
福永祐一,1289,1289,2000,名詞,固有名詞,人名,一般,*,*,ふくながゆういち,フクナガユウイチ,フクナガユウイチ,siva,騎手名,01014
福永洋一,1289,1289,2000,名詞,固有名詞,人名,一般,*,*,ふくながよういち,フクナガヨウイチ,フクナガヨウイチ,siva,騎手名,00274
藤田菜七子,1289,1289,2000,名詞,固有名詞,人名,一般,*,*,ふじたななこ,フジタナナコ,フジタナナコ,siva,騎手名,01164

At first I estimated the costs automatically using the mecab-ipadic model file, but the estimates came out higher than the NEologd costs, so in the end I hard-coded the cost by hand.
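
To make the column layout of the rows above explicit, here is a minimal sketch that rebuilds the first row of seed/jockey.csv from named fields. It assumes the standard IPA dictionary column order (surface, left context ID, right context ID, cost, then the feature columns); the `DictEntry` type and its `Record` method are hypothetical names introduced only for this illustration and are not part of MeCab or of the repository.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
	"strconv"
)

// DictEntry is a hypothetical helper type for this sketch.
type DictEntry struct {
	Surface       string    // surface form as it appears in text
	LeftID        int       // left context ID
	RightID       int       // right context ID
	Cost          int       // word cost; MeCab prefers lower values
	POS           [4]string // part of speech and its three sub-categories
	BaseForm      string
	Reading       string
	Pronunciation string
	Extra         []string // custom columns appended after the pronunciation
}

// Record returns the entry as CSV columns in the IPA dictionary layout.
func (e DictEntry) Record() []string {
	rec := []string{
		e.Surface,
		strconv.Itoa(e.LeftID),
		strconv.Itoa(e.RightID),
		strconv.Itoa(e.Cost),
		e.POS[0], e.POS[1], e.POS[2], e.POS[3],
		"*", "*", // conjugation type and form: not applicable to nouns
		e.BaseForm, e.Reading, e.Pronunciation,
	}
	return append(rec, e.Extra...)
}

func main() {
	entry := DictEntry{
		Surface:       "武豊",
		LeftID:        1289,
		RightID:       1289,
		Cost:          2000,
		POS:           [4]string{"名詞", "固有名詞", "人名", "一般"},
		BaseForm:      "たけゆたか",
		Reading:       "タケユタカ",
		Pronunciation: "タケユタカ",
		Extra:         []string{"siva", "騎手名", "00666"},
	}

	// Write a single CSV row to standard output.
	w := csv.NewWriter(os.Stdout)
	if err := w.Write(entry.Record()); err != nil {
		log.Fatal(err)
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}
```

Because MeCab chooses the lowest-cost path, an entry whose cost is higher than that of a competing NEologd entry would lose to it, which is why the cost column matters here.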

2. Creating the REST API server

Using the Go binding for MeCab, create an endpoint that performs morphological analysis.

handlers/mecab.go

package handlers

import (
	"net/http"
	"strconv"
	"strings"

	"gauss/go-mecab/application"
	"gauss/go-mecab/dto/request"
	"gauss/go-mecab/dto/response"

	"github.com/fatih/structs"
	"github.com/labstack/echo"
	"github.com/shogo82148/go-mecab"
)

// MecabRouter registers the /mecab routes.
func MecabRouter(e *echo.Echo) {
	g := e.Group("/mecab")

	g.GET("", application.AppHandler(getParse))
}

// getParse godoc
// @Summary Morphological analysis
// @Description Returns the morphological analysis result.
// @Param sentence query string true "Text to analyze"
// @Success 200 {object} response.MecabResults
// @Router /mecab [get]
func getParse(c *application.AppContext) error {
	param := new(request.MecabParam)
	if err := c.BindValidate(param); err != nil {
		return err
	}
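	tagger, err := mecab.New(map[string]string{"output-format-type": "wakati"})
	if err != nil {
		// Check the error before deferring Destroy, so an invalid tagger is never used.
		return c.JSON(http.StatusInternalServerError, map[string]string{"error_message": err.Error()})
	}
	defer tagger.Destroy()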

	node, err := tagger.ParseToNode(param.Sentence)
	if err != nil {
		return c.JSON(http.StatusBadRequest, map[string]string{"error_message": err.Error()})
	}

	res := &response.MecabResults{Results: make([]*response.MecabResult, 0)}
	for ; node != (mecab.Node{}); node = node.Next() {
		features := strings.Split(node.Feature(), ",")
		if node.Surface() == "" || features[0] == "BOS/EOS" {
			continue
		}
		elements := &response.MecabResult{Surface: node.Surface()}
		es := structs.New(elements)
		for idx, feature := range features {
			for _, fld := range es.Fields() {
				fno, err := strconv.Atoi(fld.Tag("feature"))
				if err != nil {
					continue
				}
				if idx == fno {
					fld.Set(feature)
					break
				}
			}
		}

		res.Results = append(res.Results, elements)
	}

	return c.JSON(http.StatusOK, res)
}
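
The handler relies on a `feature` struct tag on the response DTO, which is not shown in this post. The following is a minimal sketch of how such a mapping could work using only the standard library's reflect package instead of github.com/fatih/structs. The `MecabResult` definition here is a guess reconstructed from the JSON keys in the response shown later, and `fillFromFeature` is a hypothetical helper; the actual types in the repository may differ.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// MecabResult is an assumed shape for the response DTO. The `feature` tag
// holds the index of the matching column in MeCab's comma-separated feature.
type MecabResult struct {
	Surface        string `json:"surface"`
	Pos            string `json:"pos" feature:"0"`
	PosDetail1     string `json:"pos_detail1" feature:"1"`
	PosDetail2     string `json:"pos_detail2" feature:"2"`
	PosDetail3     string `json:"pos_detail3" feature:"3"`
	ConjugatedType string `json:"conjugated_type" feature:"4"`
	ConjugatedForm string `json:"conjugated_form" feature:"5"`
	Baseform       string `json:"baseform" feature:"6"`
	Reading        string `json:"reading" feature:"7"`
	Pronunciation  string `json:"pronunciation" feature:"8"`
	Custom1        string `json:"custom1" feature:"9"`
	Custom2        string `json:"custom2" feature:"10"`
	Custom3        string `json:"custom3" feature:"11"`
}

// fillFromFeature copies each feature column into the field whose `feature`
// tag names that column index. Fields without a numeric tag, and indexes
// beyond the number of columns, are left untouched.
func fillFromFeature(dst *MecabResult, feature string) {
	features := strings.Split(feature, ",")
	v := reflect.ValueOf(dst).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		idx, err := strconv.Atoi(t.Field(i).Tag.Get("feature"))
		if err != nil || idx < 0 || idx >= len(features) {
			continue
		}
		v.Field(i).SetString(features[idx])
	}
}

func main() {
	r := MecabResult{Surface: "武豊"}
	fillFromFeature(&r, "名詞,固有名詞,人名,一般,*,*,たけゆたか,タケユタカ,タケユタカ,siva,騎手名,00666")
	fmt.Printf("%+v\n", r)
}
```

Iterating over the struct fields once, rather than over every field for every feature column, gives the same result with less work. Entries from the stock dictionary have only nine feature columns, so their custom fields simply stay empty.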

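3. Creating the Dockerfile

Installing MeCab is heavy, so it would probably be better to build it as an image separate from the application, but this time I use a single Dockerfile.
The base image is golang:alpine, and I wrote the Dockerfile with reference to 「MeCab システム辞書への単語追加（mecab-ipadic-neologd）」.
nkf cannot be installed with apk, so it is built from source.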

4. Building the Docker image

Run the following command to build the image.

docker build -t go-mecab:1.0 .

The resulting image is 3.65 GB, which is quite large...

5. Starting the Docker container

Once the image has built successfully, start a container.
The application listens on port 1323, so the host side is mapped to 1323 as well.

docker run -d -p 1323:1323 go-mecab:1.0

6. Calling the REST API

Once the container is running, Swagger UI is up as well, so open it at:
localhost:1323

Enter a sentence that contains both words that were added and words that were not, and check the analysis result.
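
The same request can also be made without Swagger UI. The sketch below assumes the endpoint is GET /mecab with a `sentence` query parameter, as the router and annotations in handlers/mecab.go suggest, and that the container is reachable at http://localhost:1323. `mecabURL` is a hypothetical helper name, and the sample sentence is only an illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// mecabURL builds the request URL, percent-encoding the sentence so that
// Japanese text and spaces survive as a query parameter.
func mecabURL(base, sentence string) string {
	q := url.Values{}
	q.Set("sentence", sentence)
	return base + "/mecab?" + q.Encode()
}

func main() {
	u := mecabURL("http://localhost:1323", "武豊騎手")

	resp, err := http.Get(u)
	if err != nil {
		// Typically means the container is not running or the port is not mapped.
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```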

Response_body

{
  "results": [
    {
      "surface": "ディープインパクト",
      "pos": "名詞",
      "pos_detail1": "固有名詞",
      "pos_detail2": "一般",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "でぃーぷいんぱくと",
      "reading": "ディープインパクト",
      "pronunciation": "ディープインパクト",
      "custom1": "siva",
      "custom2": "馬名",
      "custom3": "2002100816"
    },
    {
      "surface": "と",
      "pos": "助詞",
      "pos_detail1": "並立助詞",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "と",
      "reading": "ト",
      "pronunciation": "ト",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "武豊",
      "pos": "名詞",
      "pos_detail1": "固有名詞",
      "pos_detail2": "人名",
      "pos_detail3": "一般",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "たけゆたか",
      "reading": "タケユタカ",
      "pronunciation": "タケユタカ",
      "custom1": "siva",
      "custom2": "騎手名",
      "custom3": "00666"
    },
    {
      "surface": "騎手",
      "pos": "名詞",
      "pos_detail1": "一般",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "騎手",
      "reading": "キシュ",
      "pronunciation": "キシュ",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "、",
      "pos": "記号",
      "pos_detail1": "読点",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "、",
      "reading": "、",
      "pronunciation": "、",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "エポカドーロ",
      "pos": "名詞",
      "pos_detail1": "固有名詞",
      "pos_detail2": "一般",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "エポカドーロ",
      "reading": "エポカドーロ",
      "pronunciation": "エポカドーロ",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "と",
      "pos": "助詞",
      "pos_detail1": "並立助詞",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "と",
      "reading": "ト",
      "pronunciation": "ト",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "戸崎圭太",
      "pos": "名詞",
      "pos_detail1": "固有名詞",
      "pos_detail2": "一般",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "戸崎圭太",
      "reading": "トサキケイタ",
      "pronunciation": "トサキケイタ",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "ジョッキー",
      "pos": "名詞",
      "pos_detail1": "一般",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "ジョッキー",
      "reading": "ジョッキー",
      "pronunciation": "ジョッキー",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    },
    {
      "surface": "。",
      "pos": "記号",
      "pos_detail1": "句点",
      "pos_detail2": "*",
      "pos_detail3": "*",
      "conjugated_type": "*",
      "conjugated_form": "*",
      "baseform": "。",
      "reading": "。",
      "pronunciation": "。",
      "custom1": "",
      "custom2": "",
      "custom3": ""
    }
  ]
}

The custom fields that were added are returned correctly as well.
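
As a small illustration of how a client might use those custom fields, the following sketch decodes a response body and keeps only the tokens that came from the added entries, that is, those with a non-empty `custom1`. The struct mirrors a subset of the JSON keys shown above; the type and function names are hypothetical, and the sample body is abbreviated from the response above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// token holds the subset of response fields this sketch needs.
type token struct {
	Surface string `json:"surface"`
	Reading string `json:"reading"`
	Custom1 string `json:"custom1"` // dictionary name, e.g. "siva"
	Custom2 string `json:"custom2"` // category, e.g. "騎手名"
	Custom3 string `json:"custom3"` // key
}

type mecabResults struct {
	Results []token `json:"results"`
}

// customEntries returns the tokens whose custom1 field is set, meaning they
// matched an entry that carries the extra columns.
func customEntries(body []byte) ([]token, error) {
	var rs mecabResults
	if err := json.Unmarshal(body, &rs); err != nil {
		return nil, err
	}
	var out []token
	for _, t := range rs.Results {
		if t.Custom1 != "" {
			out = append(out, t)
		}
	}
	return out, nil
}

func main() {
	body := []byte(`{"results":[
		{"surface":"武豊","reading":"タケユタカ","custom1":"siva","custom2":"騎手名","custom3":"00666"},
		{"surface":"騎手","reading":"キシュ","custom1":"","custom2":"","custom3":""}
	]}`)

	entries, err := customEntries(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, t := range entries {
		fmt.Println(t.Surface, t.Reading, t.Custom2, t.Custom3)
	}
}
```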

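Summary

There may not be many occasions to use MeCab from Go, but I think it is worth knowing how to build a MeCab system dictionary in order to get the results you expect from natural language processing.
The source for this post is available at GAUSS-inc/go-mecab; I hope it is useful as a reference.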

