
Playing with the Wikipedia dump data using Hyper Estraier

Last updated at 2014-04-18, posted at 2014-04-17

Goal

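This post extracts anchor text from Wikipedia articles, which are structured documents. (To be honest, I mostly wanted an excuse to try Hyper Estraier, which I am embarrassed to say I only learned about recently.) Anchor text is text that links to another Wikipedia article, written like [[バンド (音楽)|バンド]]. In this case, the text バンド links to the article バンド (音楽). In natural language processing, such anchor text is sometimes extracted to build knowledge about which meaning a given piece of text is used in. Papers tend to cover this step in a single sentence, so how it is actually done varies from person to person; the following is one possible way.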
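To make the link syntax concrete, here is a minimal Ruby sketch that splits one link into the linked article and the visible text. The helper name parse_wikilink is made up for illustration, and the fallback for links without a "|" (the title is used as the text) is a simplification of how MediaWiki renders them.

```ruby
# Split one wiki link such as "[[バンド (音楽)|バンド]]" into the linked
# article (target) and the visible anchor text (label).
# parse_wikilink is a hypothetical helper name, not part of Hyper Estraier.
def parse_wikilink(link)
  m = /\A\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]\z/.match(link)
  return nil unless m

  target = m[1].strip
  # With no "|" part (or an empty one), fall back to the article title.
  label = m[2].nil? || m[2].empty? ? target : m[2].strip
  { target: target, label: label }
end

p parse_wikilink("[[バンド (音楽)|バンド]]")
p parse_wikilink("[[1995年]]")
```

Strings that are not a single well-formed link make the helper return nil rather than raise.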

Approach

First, make the Wikipedia articles searchable with Hyper Estraier. Hyper Estraier ships with a Ruby script, wpxmltoest (under lab/ in the source tree), that converts Wikipedia dump data into Hyper Estraier document drafts.

Since the goal here is to extract anchor text from the Wikipedia markup, comment out the markup-stripping regular expressions in wpxmltoest:

hyperestraier-1.4.13/lab/wpxmltoest

#     text = text.gsub(/<[^>]+>/, "")
    text = Text::unnormalize(text)
    return if text =~ /^#REDIRECT/ || text.size < TEXTMINSIZE
#     text = text.gsub(/^=+([^=]+)=+/, "\\1\n")
#     text = text.gsub(/^ *[\*#:|;-]+ */, "")
#     text = text.gsub(/\[\[[^\]\|]+\|([^\]]+)\]\]/, "\\1")
#     text = text.gsub(/\[\[([a-zA-Z-]+:)?([^\]]+)\]\]/, "\\2")
#     text = text.gsub(/\{\{([^\}\|]+)\|[^\}]+\}\}/, "\\1")
#     text = text.gsub(/\{\{([^\}]+)\}\}/, "\\1")
#     text = text.gsub(/\[http:[^ \]]+ ([^\]]+)\]/, "\\1")
#     text = text.gsub(/''+/, "")
#     text = text.gsub(/^ *\{?|/, "")
#     text = text.gsub(/^ *[\!\|\}]/, "")
#     text = text.gsub(/^\*+/, "")
#     text = text.gsub(/[a-zA-Z]+=\"[^\"].*\"/, "")
#     text = text.gsub(/[a-z][a-z]+=[0-9]+/, "")
#     text = text.gsub(/.*border-style.*/, "")
#     text = text.gsub(/.*valign=.*/, "")
#     text = text.gsub(/\&[a-zA-Z]+;/, "")
#     text = text.gsub(/.*(利用者|会話|ノート):.*/, "")
#     text = text.gsub(/(Wikipedia|Category):/, "")
#     text = text.gsub(/.*語:/, "")
#     text = text.gsub(/^thumb\|/, "")
#     text = text.gsub(/画像:/, "")
#     text = text.gsub(/^[ +]*[\|]*/, "")
#     text = text.gsub(/\|\|/, "\n")
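To see why these substitutions have to be disabled, the sketch below applies the two link-related ones from the listing to a single line. The sample line is made up for illustration; only the two regular expressions come from the script.

```ruby
# A made-up line of wiki markup, used only for illustration.
line = "[[アメリカ合衆国]]の[[ロック (音楽)|ロック]]・[[バンド (音楽)|バンド]]"

# The two link-related substitutions from the listing, re-enabled here
# only to show their effect.
stripped = line.gsub(/\[\[[^\]\|]+\|([^\]]+)\]\]/, "\\1")          # [[target|label]] -> label
stripped = stripped.gsub(/\[\[([a-zA-Z-]+:)?([^\]]+)\]\]/, "\\2")  # [[target]] -> target

puts line
puts stripped
```

After both substitutions the brackets and the link targets such as バンド (音楽) are gone, so there would be nothing left to extract.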

Then convert the Wikipedia dump data into Hyper Estraier document drafts:

bunzip2 jawiki-latest-pages-articles.xml.bz2
~/local/src/hyperestraier-1.4.13/lab/wpxmltoest jawiki-latest-pages-articles.xml

The conversion to document drafts takes a long time. Also note that it creates a large number of directories in the current directory.

The output looks like this, with the markup left intact:

@uri=http://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89
@title=アンパサンド
@author=Beatclick
@mdate=2013-10-13T23:40:46Z
@size=4126

{{記号文字|&}}
[[Image:Trebuchet MS ampersand.svg|right|thumb|100px|[[Trebuchet MS]] フォント]]
'''アンパサンド''' ({{lang|en|ampersand}}, '''&''') とは「…と…」を意味する[[記号]]である。[[英語]]の {{lang|en|"and"}} ...
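The draft layout shown above ("@name=value" attribute lines, a blank line, then the body) is simple enough to read back by hand. The following is a minimal Ruby sketch that covers only what the sample shows; parse_draft is a hypothetical helper, and the real draft format may have features this ignores.

```ruby
# Parse a document draft shaped like the sample: "@name=value" attribute
# lines, one blank line, then the body text.
# parse_draft is a hypothetical helper covering only what the sample shows.
def parse_draft(draft)
  header, body = draft.split(/\n\n/, 2)
  attrs = {}
  header.to_s.each_line do |line|
    line = line.chomp
    next unless line.start_with?("@")

    name, value = line.split("=", 2)
    attrs[name] = value.to_s
  end
  { attrs: attrs, body: body.to_s }
end

draft = <<~DRAFT
  @uri=http://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%B3%E3%83%91%E3%82%B5%E3%83%B3%E3%83%89
  @title=アンパサンド
  @size=4126

  {{記号文字|&}}
  [[Image:Trebuchet MS ampersand.svg|right|thumb|100px|[[Trebuchet MS]] フォント]]
DRAFT

doc = parse_draft(draft)
puts doc[:attrs]["@title"]
puts doc[:body]
```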

Index the drafts as follows:

estcmd create -attr @title str -attr @author str -attr @mdate seq -attr @size seq wiki-index
estcmd gather -pc UTF-8 -cl -fe -il ja -lf -1 -xl -sd -cm -cs 80000 -um wiki-index wikipedia

Anchor text is written in the form [[foo|bar]], so search for it with a regular expression as follows:

estcmd list wiki-index|cut -f1|xargs -I{} sh -c 'estcmd get wiki-index {}|estcmd regex "\[\["|egrep -o "\[\[[^]]*\]\]"'

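In the command above:

estcmd list outputs pairs of document ID and URI,
cut -f1 keeps only the document ID,
estcmd get outputs the document for that ID,
estcmd regex takes that output and prints the lines containing [[, and
egrep -o prints the text enclosed in [[ ]].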
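The last two stages of the pipeline can be sketched in Ruby as follows. This is not what estcmd does internally; it only mirrors the line filter and the egrep -o pattern on a few lines taken from the sample draft. extract_links is a hypothetical name.

```ruby
# Keep only lines containing "[[" (the estcmd regex stage), then collect
# every match of "[[", a run of non-"]" characters, and "]]" (the same
# pattern as the egrep -o stage). extract_links is a hypothetical name.
def extract_links(text)
  pattern = /\[\[[^\]]*\]\]/
  text.each_line
      .select { |line| line.include?("[[") }
      .flat_map { |line| line.scan(pattern) }
end

text = <<~TEXT
  {{記号文字|&}}
  [[Image:Trebuchet MS ampersand.svg|right|thumb|100px|[[Trebuchet MS]] フォント]]
  '''アンパサンド''' ({{lang|en|ampersand}}, '''&''') とは「…と…」を意味する[[記号]]である。[[英語]]の {{lang|en|"and"}} ...
TEXT

links = extract_links(text)
puts links
```

Note that the pattern stops at the first ]] it meets, so a link nested inside an image caption is returned as one truncated match that begins with [[Image:. The same pattern should behave the same way in egrep, so such matches need filtering if only article links are wanted.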

The output looks like this:

[[カリフォルニア州]]
[[サンフランシスコ]]
[[インディー・ロック]]
[[ダンス・パンク]]
[[ポストパンク]]
[[1995年]]
[[ワープ・レコーズ]]
[[1995年]]
[[アメリカ合衆国]]
[[ロック (音楽)|ロック]]
[[バンド (音楽)|バンド]]
[[2003年]]
[[2004年]]
[[2007年]]
[[フジロック・フェスティバル|フジロックフェスティバル]]
...

Conclusion

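We built the knowledge with little effort. With it, we can tell which meanings (that is, which articles) a given text can refer to within the closed world of Wikipedia, or, by taking statistics, which meaning a text is most often used in. I am embarrassed to have only recently learned about Hyper Estraier, but it looks very handy and capable of much more.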
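The statistics mentioned here could be gathered along these lines. This is a rough Ruby sketch, assuming the extracted links have already been split into (target, label) pairs. The pairs and their counts are illustrative only, the entry marked hypothetical is invented, and most_likely_target is a made-up helper name.

```ruby
# (target, label) pairs as they might come out of the extraction step.
# Illustrative data only, not real counts from the dump.
pairs = [
  ["バンド (音楽)", "バンド"],
  ["バンド (音楽)", "バンド"],
  ["バンド (服飾)", "バンド"], # hypothetical second meaning
  ["ロック (音楽)", "ロック"]
]

# counts[label][target] = number of times the label linked to the target
counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
pairs.each { |target, label| counts[label][target] += 1 }

# Return the article a label links to most often, or nil if the label is unseen.
# most_likely_target is a hypothetical helper name.
def most_likely_target(counts, label)
  targets = counts.fetch(label, nil)
  return nil if targets.nil? || targets.empty?

  targets.max_by { |_target, n| n }.first
end

puts most_likely_target(counts, "バンド")
```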

