
How to Use FastText from Node.js

Node.js

Posted at 2018-10-19

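This post uses FastText, developed by Facebook, for tasks such as text categorization. FastText supports Python, but because of our system architecture I wanted to use it from Node.js.

Several libraries already exist, but many of them did not work for me. The only one I got working was loretoparisi/fasttext.js: FastText for Node.js, so here are my notes on how to use it.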

Prerequisites

FastText itself must be built and installed beforehand.

git clone https://github.com/facebookresearch/fastText.git
cd fastText
mkdir build
cd build/
cmake ..
make 
make install

Installing fasttext.js

First, install fasttext.js.

mkdir fasttext-demo
cd $_
npm init -y
npm i fasttext.js -S

Downloading the training data

This walkthrough uses the classification demo data that ships with FastText.

#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree. An additional grant
# of patent rights can be found in the PATENTS file in the same directory.
#

myshuf() {
  perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@";
}

normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
}

RESULTDIR=result
DATADIR=data

mkdir -p "${DATADIR}"

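if [ ! -f "${DATADIR}/dbpedia.train" ]
then
  wget -c "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz" -O "${DATADIR}/dbpedia_csv.tar.gz"
  tar -xzvf "${DATADIR}/dbpedia_csv.tar.gz" -C "${DATADIR}"
  cat "${DATADIR}/dbpedia_csv/train.csv" | normalize_text > "${DATADIR}/dbpedia.train"
  # Also normalize the test split; the prediction example picks its input from this file.
  cat "${DATADIR}/dbpedia_csv/test.csv" | normalize_text > "${DATADIR}/dbpedia.test"
fi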

Running this script creates a file named dbpedia.train in the data directory. This is the training data.
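To make the data format concrete, here is a minimal JavaScript sketch of the per-line steps that normalize_text performs (the shuffle step is left out). The helper name normalizeLine and the sample row are assumptions made for illustration; they are not part of FastText or fasttext.js, and toLowerCase may differ from tr for non-ASCII characters.

```javascript
// Sketch of the per-line part of normalize_text: lowercase, add the
// __label__ prefix, pad punctuation with spaces, then squeeze spaces.
// normalizeLine is a hypothetical helper name used only in this example.
function normalizeLine(csvLine) {
  return ('__label__' + csvLine.toLowerCase())
    .replace(/'/g, " ' ")      // s/'/ ' /g
    .replace(/"/g, '')         // s/"//g
    .replace(/\./g, ' . ')     // s/\./ \. /g
    .replace(/<br \/>/g, ' ')  // s/<br \/>/ /g
    .replace(/,/g, ' , ')
    .replace(/\(/g, ' ( ')
    .replace(/\)/g, ' ) ')
    .replace(/!/g, ' ! ')
    .replace(/\?/g, ' ? ')
    .replace(/[;:]/g, ' ')     // ';' and ':' become a space
    .replace(/ +/g, ' ');      // tr -s " "
}

// Illustrative row in "label,title,text" form (made up for this example).
const row = '6,"Independent Film Trust"," The Independent Film Trust is a UK-registered charity."';
console.log(normalizeLine(row));
```

Each resulting line starts with the label (here __label__6), followed by the space-separated tokens that FastText trains on.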

Training

The training code is shown below. The training parameters are the ones used in FastText's test script, applied as-is. The model is written to a file named after the value of serializeTo plus ".bin" (here, ./dbpedia.bin).

const FastText = require('fasttext.js');

(async () => {
  const fastText = new FastText({
    serializeTo: './dbpedia',          // the model is written to ./dbpedia.bin
    trainFile: './data/dbpedia.train'  // training data created by the script above
  });
  // Same parameters as FastText's test script
  await fastText.train({
    dim: 10,
    lr: 0.1,
    wordNgrams: 2,
    minCount: 1,
    bucket: 10000000,
    epoch: 5,
    thread: 4
  });
  console.log('Done');
})();
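The option names above match the flags of the fasttext command-line tool (-dim, -lr, -wordNgrams, and so on). As a rough illustration of that correspondence, the sketch below builds the equivalent "fasttext supervised" argument list. buildTrainArgs is a hypothetical helper written for this post; I have not checked how fasttext.js assembles its command internally, so treat this as an illustration only.

```javascript
// Hypothetical helper: maps the training options to fasttext CLI flags.
// This is NOT fasttext.js code; it only shows the flag correspondence.
function buildTrainArgs(trainFile, serializeTo, options) {
  const args = ['supervised', '-input', trainFile, '-output', serializeTo];
  for (const [name, value] of Object.entries(options)) {
    args.push(`-${name}`, String(value));
  }
  return args;
}

const trainArgs = buildTrainArgs('./data/dbpedia.train', './dbpedia', {
  dim: 10,
  lr: 0.1,
  wordNgrams: 2,
  minCount: 1,
  bucket: 10000000,
  epoch: 5,
  thread: 4
});
// Prints the command line as one string.
console.log(['fasttext', ...trainArgs].join(' '));
```

Running the printed command directly in a shell is a handy way to check whether a problem lies in the FastText installation or in the Node.js side.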

Running a prediction

Now let's use the trained model. The input text below was picked arbitrarily from data/dbpedia.test. If it is labeled 6, the prediction is correct.

const FastText = require('fasttext.js');

(async () => {
  const fastText = new FastText({
    loadModel: './dbpedia.bin',
  });
  await fastText.load();
  const result = await fastText.predict(`independent film trust , the independent film trust is a uk-registered charity which was set up to advance the cause of independent film-making . it works with groups such as the british independent film awards and the raindance film festival in fostering promoting and celebrating independent film-making in the uk . the ift is run by the board of trustees which has been chaired since 2006 by neil mccartney . `);
  console.log('result', result);
  // fasttext.js keeps a child process running; unload it so that Node.js can exit.
  fastText.unload();
})();

Here is the result.

$ node index.js
result [ { label: '6', score: '0.773276' },
  { label: '1', score: '0.119041' } ]
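Note that the scores in this output are strings. If you need to compare or threshold them, a small helper like the following sketch can convert them to numbers and pick the best label. topPrediction is a hypothetical name, and the input shape is assumed to be exactly the array printed above.

```javascript
// Hypothetical helper: converts string scores to numbers and returns the
// highest-scoring entry, or null for an empty result.
function topPrediction(result) {
  if (result.length === 0) return null;
  const parsed = result.map(({ label, score }) => ({ label, score: Number(score) }));
  return parsed.reduce((best, p) => (p.score > best.score ? p : best));
}

// Same values as the output shown above.
const sample = [
  { label: '6', score: '0.773276' },
  { label: '1', score: '0.119041' }
];
console.log(topPrediction(sample)); // { label: '6', score: 0.773276 }
```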

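As the output shows, the text is correctly labeled 6. Note that fasttext.js runs FastText in a child process, so the program will not exit unless you call fastText.unload();.

Loading the model takes a little time, but once it is loaded, prediction is fast. Best of all, preparing training data is simple (just pass the text as space-separated tokens), and training is fast too, so it looks very usable. Being able to use it from Node.js is another plus.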
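Because a forgotten unload() leaves the process hanging, it can help to wrap the load/unload pair in try/finally so that the child process is released even when prediction throws. The sketch below assumes only the load() and unload() methods used in this post; withModel is a hypothetical helper name.

```javascript
// Hypothetical helper: loads the model, runs the task, and always unloads,
// even if the task throws. `model` is any object with load() and unload(),
// such as a fasttext.js instance created with { loadModel: ... }.
async function withModel(model, task) {
  await model.load();
  try {
    return await task(model);
  } finally {
    await model.unload();
  }
}

// Usage with fasttext.js (same calls as the prediction example above):
//   const FastText = require('fasttext.js');
//   withModel(new FastText({ loadModel: './dbpedia.bin' }), (m) => m.predict('some text'))
//     .then((result) => console.log('result', result));
```

Passing the model in as an argument keeps the helper independent of how the instance is constructed, which also makes it easy to try out with a stub object.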

facebookresearch/fastText: Library for fast text representation and classification.
