Classifying the context of text (and more) with PostgreSQL full-text search

Posted at 2013-12-04 (last updated at 2013-12-04)

In this post I play around with real text data using PostgreSQL's full-text search.
It is a "let's just try it" kind of article, so query efficiency is not a concern here.

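What we'll do

Given free-form text such as reviews or survey responses, we will:

Find how often each word appears
Find the trends in what the text talks about (its context) and how they change over time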

The livedoor Gourmet dataset has been published for academic research, so I will use that.

livedoor Gourmet DataSet release announcement
The files used are ratings.csv (rating data) and restaurants.csv (restaurant data).

Setting up textsearch_ja

What you need

textsearch_ja: 9.0.0
MeCab: the main program and the IPA dictionary

Installing MeCab

tar xf mecab-0.996.tar
cd mecab-0.996
./configure --enable-utf8-only
make
sudo make install

Installing the IPA dictionary

tar xf mecab-ipadic-2.7.0-20070801.tar
cd mecab-ipadic-2.7.0-20070801
./configure --with-charset=utf8
make
sudo make install

Installing textsearch_ja

tar xf textsearch_ja-9.0.0.tar
cd textsearch_ja-9.0.0
make
sudo make install

Registering the functions

psql -f textsearch_ja.sql

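On PostgreSQL 9.2 and later, change every "LANGUAGE 'C'" in textsearch_ja.sql to "LANGUAGE 'c'" before running the command above.
These versions no longer fold a quoted language name to lower case, so the upper-case spelling is not recognized.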
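
The substitution can be done by hand in an editor, but a small helper makes it repeatable. The following is a minimal sketch: the function name fix_language_case and the output file name are hypothetical, and it assumes the script spells the clause exactly as LANGUAGE 'C'.

```shell
# Print a copy of a SQL script with every LANGUAGE 'C' rewritten to LANGUAGE 'c'.
# The original file is left untouched; redirect the output to a new file.
fix_language_case() {
  sed "s/LANGUAGE 'C'/LANGUAGE 'c'/g" "$1"
}

# Usage (file names are illustrative):
#   fix_language_case textsearch_ja.sql > textsearch_ja_fixed.sql
#   psql -f textsearch_ja_fixed.sql
```

Writing to a separate file rather than editing in place keeps the distributed script available for comparison if the registration still fails.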

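Creating the tables

In addition to the original data, create a tsvector column to build the index on.
Also create a trigger so that the column is updated automatically whenever the data changes.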

create table ratings(
id integer not null,
restaurant_id integer,
user_id text,
total integer,
food integer,
service integer,
atmosphere integer,
cost_performance integer,
title text,
body text,
purpose integer,
created_on timestamp,
body_tsv tsvector);

create trigger tsvector_update before insert or update
on ratings for each row execute procedure
tsvector_update_trigger(body_tsv, 'pg_catalog.japanese', body);

create table restaurants (
id integer not null unique,
name text,
-- remaining columns omitted
);

Loading the data

ratings.csv contains rows whose created_on is 0000-00-00 00:00:00. These cause an error if loaded as-is, so remove them beforehand.
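
One way to do the cleanup is a line filter. This is only a sketch under two assumptions that should be checked against the actual file: every record occupies a single line (no line breaks inside quoted fields), and the literal string 0000-00-00 00:00:00 never appears inside a review body. The function name drop_zero_dates is hypothetical.

```shell
# Print a CSV file without the lines that contain the invalid timestamp.
# The header line is kept because it does not contain the timestamp.
drop_zero_dates() {
  grep -v '0000-00-00 00:00:00' "$1"
}

# Usage: write the cleaned copy to the path that the copy statement reads from.
#   drop_zero_dates ratings.csv > /tmp/ratings.csv
```

If either assumption does not hold, a CSV-aware tool is the safer choice, since a line filter would cut multi-line records in half.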

copy ratings(
id, restaurant_id, user_id, 
total, food, service, atmosphere, cost_performance, 
title, body, purpose, created_on) 
from '/tmp/ratings.csv' csv header;

copy restaurants from '/tmp/restaurants.csv' csv header;

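Creating the view

Join ratings and restaurants on the restaurant ID (ratings.restaurant_id = restaurants.id).
Come to think of it, PostgreSQL 9.3 added materialized views.
Let's use one while we're at it.
The index is created on the materialized view.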

create materialized view rating_restaurants
as select
tbl2.name, tbl1.title, tbl1.body,
tbl1.body_tsv, tbl1.created_on,
tbl1.total, tbl1.food, tbl1.service,
tbl1.atmosphere, tbl1.cost_performance
from ratings tbl1 inner join restaurants tbl2
on tbl1.restaurant_id = tbl2.id;

create index body_tsv_idx on rating_restaurants using gin(body_tsv);

Text analysis

Word frequency

select * from ts_stat('select body_tsv from rating_restaurants')
order by nentry desc, ndoc desc, word limit 10;

The result looks like this:

  word  |  ndoc  | nentry 
--------+--------+--------
 する   | 159589 | 556802
 いる   | 134863 | 384592
 ある   | 133476 | 320285
 店     | 127809 | 277266
 円     |  71713 | 220179
 食べる |  98682 | 183696
 味     |  84546 | 142662
 思う   |  83732 | 135453
 れる   |  74668 | 133128
 なる   |  78962 | 131695
(10 rows)

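With this, words that carry little meaning on their own, such as "する" (to do) and "いる" (to be), end up at the top.
If we could filter by part of speech, we should get a more useful ranking.
Installing textsearch_ja makes a function called ja_analyze available.
It returns the result of analyzing a sentence with MeCab, including part-of-speech information, so with some care it should be possible to use it to filter by part of speech.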

Unfortunately, in my environment (PostgreSQL 9.3) running ja_analyze drops the connection to the server, so I will leave that for another time.
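
As a possible workaround outside the database (not something this article actually tried), the output of the mecab command can be filtered by part of speech directly. The sketch below assumes MeCab's default output format with the IPA dictionary, where each line is the surface form, a tab, and then comma-separated features whose first field is the part of speech and whose seventh field is the base form. The function name content_words and the choice to keep only nouns and adjectives are illustrative.

```shell
# Read MeCab output on stdin and print the base form of each noun or adjective.
# Expected line format (IPA dictionary):
#   surface<TAB>pos,detail1,detail2,detail3,conj_type,conj_form,base,reading,pronunciation
content_words() {
  awk -F'\t' '
    NF < 2 { next }                      # skip EOS markers and blank lines
    {
      split($2, f, ",")
      if (f[1] == "名詞" || f[1] == "形容詞") {
        # Use the base form when the dictionary provides one, else the surface form.
        w = (f[7] != "" && f[7] != "*") ? f[7] : $1
        print w
      }
    }'
}

# Usage: count content words in a text file of reviews (file name is illustrative).
#   mecab reviews.txt | content_words | sort | uniq -c | sort -rn | head
```

This gives overall counts comparable to nentry, but it runs on exported text rather than on body_tsv, so the numbers will not match ts_stat exactly.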

Context trends

Count how many posts are about taste, how many are about service, and so on.
Create a separate table that defines the contexts.

create table contexts(
context text,
rule tsquery);

insert into contexts values
('food', to_tsquery('おいしい|美味しい|うまい|美味い|まずい|不味い|おいしくない')),
('service', to_tsquery('丁寧|接客|態度|店員|スタッフ|サービス|対応|応対')),
('atmosphere', to_tsquery('雰囲気|居心地|店内|内装|インテリア|外装')),
('cost_performance', to_tsquery('安い|高い|お得|割高|コスト|値段|価格|相場'));

Join the new contexts table with rating_restaurants, using a match between body_tsv and rule as the join condition.

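-- rating_restaurants has no year/month columns, so derive them from created_on
select extract(year from created_on)::int as year,
extract(month from created_on)::int as month,
coalesce(context, 'others') as tag, count(*) as tag_count
from rating_restaurants left join contexts
on body_tsv @@ rule
group by year, month, tag
order by year, month, tag;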

The result looks like this.

 year | month |       tag        | tag_count 
------+-------+------------------+-----------
 2000 |    10 | atmosphere       |        31
 2000 |    10 | cost_performance |        33
 2000 |    10 | food             |        59
 2000 |    10 | others           |       116
 2000 |    10 | service          |        11
 2000 |    11 | atmosphere       |         9
 2000 |    11 | cost_performance |        19
 2000 |    11 | food             |        28
 2000 |    11 | others           |        21
 2000 |    11 | service          |         2
 2000 |    12 | atmosphere       |        14
 2000 |    12 | cost_performance |        12
 2000 |    12 | food             |        25
 2000 |    12 | others           |        20
 2000 |    12 | service          |        10
...
(rest omitted)

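We were able to classify the context of restaurant reviews, aggregate the counts, and follow them over time.
The query above covers every restaurant, so the result itself does not mean much.
By changing the contents of the contexts table, something like positive/negative sentiment analysis also looks doable.
For example, you could pick one restaurant and compare how its rating scores move against the trends in its review text.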

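The rules were thrown together quickly, so the classification accuracy is, frankly, not great.
It could also be fun to update contexts automatically with machine learning techniques.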

This ended up being an article that is mostly setup.

References:
Official PostgreSQL manual
textsearch_ja
MeCab
livedoor tech blog
