
Python Programming: Trying word2vec with Wikipedia Data {1. Data Acquisition & Preprocessing}

Last updated at 2020-10-18. Posted at 2020-08-08.

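Introduction

Starting with this post, I am publishing a four-part series.

Data acquisition & preprocessing  ★this article
Model building
Model usage
Model application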

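What this article covers

Setting up Ubuntu (Ubuntu 18.04.4 LTS)
Installing the preprocessing tools
The actual preprocessing work (downloading and processing the database dump file of the Japanese Wikipedia)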

Wikipedia Extractor - Medialab
GitHub - attardi/wikiextractor: A tool for extracting plain text from Wikipedia dumps
MeCab: Yet Another Part-of-Speech and Morphological Analyzer

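Why Ubuntu?
In earlier articles I covered environment setup and Python programming on Windows.
This time I switched to Ubuntu because I found the following note on GitHub about wikiextractor, the Python library used in this article.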

WikiExtractor
WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump.
The tool is written in Python and requires Python 3 but no additional library. Warning: problems have been reported on Windows due to poor support for StringIO in the Python implementation on Windows.
For further information, see the project Home Page or the Wiki.

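I knew that MeCab also works on Windows, but for the reason quoted above I do both the data acquisition and the preprocessing on Ubuntu.
Note: I chose Ubuntu because I have been familiar with it since my university days.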

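What this article does not cover

Setting up Windows 10, including the web browser (Windows 10 Pro Ver. 1909)
Installing VirtualBox (VirtualBox 6.1.10 is used)
Installing Vagrant (Vagrant 2.2.9 is used)
How to use the Python libraries
- wikiextractor: a Python library that extracts articles as plain text from a Wikipedia database dump file
- mecab: a morphological analyzer, usable from Python through a binding, that segments natural-language text into words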

Oracle VM VirtualBox
Vagrant by HashiCorp
[Summary] List of Vagrant commands

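Data acquisition & preprocessing

Setting up Ubuntu

First, install the Vagrant plugins.
The guest OS handles a large amount of data this time, so I use the second plugin, vagrant-disksize, to enlarge the disk.
The third plugin, vagrant-proxyconf, is required if you use Vagrant behind a proxy, for example on an office or corporate network.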

> vagrant plugin install vagrant-vbguest
> vagrant plugin install vagrant-disksize
> vagrant plugin install vagrant-proxyconf

Below are the versions of the Vagrant plugins in my environment.

> vagrant plugin list
vagrant-disksize (0.1.3, global)
vagrant-proxyconf (2.0.8, global)
vagrant-vbguest (0.24.0, global)

Next, prepare the Vagrant configuration file.
Below is the Vagrantfile I used to build this environment.

Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"
  config.vm.box_download_insecure = true
  config.vm.box_version = "20200525.0.0"
  config.vm.hostname = "ubuntu18"
  config.vm.box_check_update = false
  config.vm.network "forwarded_port", guest: 8888, host: 10088
  config.vm.network "private_network", ip: "100.0.0.5"
  config.disksize.size = '50GB'

  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.cpus = "8"
    vb.memory = "16384"
  end
end

Finally, run the OS setup with Vagrant.

> vagrant up
(omitted)
==> default: Machine already provisioned. Run `vagrant provision` or use the `--provision`
==> default: flag to force provisioning. Provisioners marked to run always will still run.

After a short while, the OS setup completes.
Log in over SSH to check.

> vagrant ssh
Welcome to Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-101-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Sat Aug  8 05:19:07 UTC 2020

  System load:  0.0                Processes:             147
  Usage of /:   32.3% of 48.41GB   Users logged in:       0
  Memory usage: 1%                 IP address for enp0s3: 10.0.2.15
  Swap usage:   0%                 IP address for enp0s8: 100.0.0.5

 * Are you ready for Kubernetes 1.19? It's nearly here! Try RC3 with
   sudo snap install microk8s --channel=1.19/candidate --classic

   https://microk8s.io/ has docs and details.

30 packages can be updated.
0 updates are security updates.


*** System restart required ***
Last login: Sat Aug  8 04:01:36 2020 from 10.0.2.2
vagrant@ubuntu18:~$

Installing the preprocessing tools

In this environment, Ubuntu 18.04 comes with Python 3.6.9 by default.

vagrant@ubuntu18:~$ python3 -V
Python 3.6.9

Install the required tools with the following commands.

# Install and upgrade pip
sudo apt install python3-pip
python3 -m pip install pip -U

# Install wikiextractor
python3 -m pip install wikiextractor

# Install MeCab (the analyzer itself and the Python interface)
sudo apt-get install -y mecab libmecab-dev mecab-ipadic
sudo apt-get install -y mecab-ipadic-utf8
sudo apt-get install -y libc6-dev build-essential
python3 -m pip install mecab-python3
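
As a quick sanity check after the installation, something like the following could confirm that the tools are visible from Python. This is only a sketch: the helper name `check_tools` is made up for illustration, and it assumes the `mecab` command is on the PATH and that the pip packages expose the modules `MeCab` and `wikiextractor`.

```python
import importlib.util
import shutil


def check_tools():
    """Return a dict mapping each tool to True if it can be found (hypothetical helper)."""
    return {
        # MeCab command-line binary installed with apt
        "mecab": shutil.which("mecab") is not None,
        # Python binding installed with `pip install mecab-python3`
        "MeCab": importlib.util.find_spec("MeCab") is not None,
        # Package installed with `pip install wikiextractor`
        "wikiextractor": importlib.util.find_spec("wikiextractor") is not None,
    }


if __name__ == "__main__":
    for name, found in check_tools().items():
        print(f"{name}: {'OK' if found else 'NOT FOUND'}")
```

If any entry reports NOT FOUND, rerunning the corresponding install command above is the first thing to try.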

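The actual preprocessing work

There are three main steps.

Download the database dump file of the Japanese Wikipedia
Extract the articles as plain text from the database dump file
Segment the articles into words (wakati-gaki)

1. Download the database dump file of the Japanese Wikipedia

From the published files, download "jawiki-latest-pages-articles.xml.bz2".
Wikipedia:Database download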
 Index of /jawiki/latest/

(As of writing, 2020/08/08)
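
If you prefer to fetch the file from a script rather than a browser, a minimal sketch with the standard library might look like this. The URL below is my assumption of where the "Index of /jawiki/latest/" listing lives, and `download` is a hypothetical helper, not part of any tool used in this article.

```python
import shutil
import urllib.request

# Assumed location of the latest Japanese Wikipedia article dump
DUMP_URL = "https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1024 * 1024):
    """Stream url into the file dest in chunks, so the dump never has to fit in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


# Example call (downloads several GB, so it is left commented out):
# download(DUMP_URL, "jawiki-latest-pages-articles.xml.bz2")
```

Streaming in fixed-size chunks matters here because the dump is several gigabytes.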

2. Extract the articles as plain text from the database dump file

Let's look at the wikiextractor help.

vagrant@ubuntu18:~$ python3 -m wikiextractor.WikiExtractor --help
usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--json] [--html]
                        [-l] [-s] [--lists] [-ns ns1,ns2]
                        [--templates TEMPLATES] [--no_templates] [-r]
                        [--min_text_length MIN_TEXT_LENGTH]
                        [--filter_disambig_pages] [-it abbr,b,big]
                        [-de gallery,timeline,noinclude] [--keep_tables]
                        [--processes PROCESSES] [-q] [--debug] [-a]
                        [--log_file LOG_FILE] [-v]
                        [--filter_category FILTER_CATEGORY]
                        input

Wikipedia Extractor:
Extracts and cleans text from a Wikipedia database dump and stores output in a
number of files of similar size in a given directory.
Each file will contain several documents in the format:

    <doc id="" revid="" url="" title="">
        ...
        </doc>

If the program is invoked with the --json flag, then each file will
contain several documents formatted as json ojects, one per line, with
the following structure

    {"id": "", "revid": "", "url":"", "title": "", "text": "..."}

Template expansion requires preprocesssng first the whole dump and
collecting template definitions.

positional arguments:
  input                 XML wiki dump file

optional arguments:
  -h, --help            show this help message and exit
  --processes PROCESSES
                        Number of processes to use (default 7)

Output:
  -o OUTPUT, --output OUTPUT
                        directory for extracted files (or '-' for dumping to
                        stdout)
  -b n[KMG], --bytes n[KMG]
                        maximum bytes per output file (default 1M)
  -c, --compress        compress output files using bzip
  --json                write output in json format instead of the default one

Processing:
  --html                produce HTML output, subsumes --links
  -l, --links           preserve links
  -s, --sections        preserve sections
  --lists               preserve lists
  -ns ns1,ns2, --namespaces ns1,ns2
                        accepted namespaces in links
  --templates TEMPLATES
                        use or create file containing templates
  --no_templates        Do not expand templates
  -r, --revision        Include the document revision id (default=False)
  --min_text_length MIN_TEXT_LENGTH
                        Minimum expanded text length required to write
                        document (default=0)
  --filter_disambig_pages
                        Remove pages from output that contain disabmiguation
                        markup (default=False)
  -it abbr,b,big, --ignored_tags abbr,b,big
                        comma separated list of tags that will be dropped,
                        keeping their content
  -de gallery,timeline,noinclude, --discard_elements gallery,timeline,noinclude
                        comma separated list of elements that will be removed
                        from the article text
  --keep_tables         Preserve tables in the output article text
                        (default=False)
  --filter_category FILTER_CATEGORY
                        specify the file that listing the Categories you want
                        to include or exclude. One line for one category.
                        starting with: 1) '#' comment, ignored; 2) '^'
                        exclude; Note: excluding has higher priority than
                        including

Special:
  -q, --quiet           suppress reporting progress info
  --debug               print debug info
  -a, --article         analyze a file containing a single article (debug
                        option)
  --log_file LOG_FILE   path to save the log info
  -v, --version         print program version
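
The default output format described in the help (a `<doc ...>` header line, the article text, then `</doc>`) can be read back with a few lines of Python. The following is a rough sketch only: `iter_docs` is a hypothetical helper, and it assumes each opening and closing tag sits on its own line and that attribute values contain no double quotes.

```python
import re

# Matches a whole header line such as <doc id="" revid="" url="" title="">
DOC_OPEN = re.compile(r"<doc\s+(.*?)>")
# Matches one name="value" attribute pair inside the header
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(lines):
    """Yield (attributes, text) for each <doc ...> ... </doc> block in lines."""
    attrs, body = None, []
    for line in lines:
        stripped = line.strip()
        match = DOC_OPEN.fullmatch(stripped)
        if match:
            # Start of a new document: collect its attributes, reset the body
            attrs, body = dict(ATTR.findall(match.group(1))), []
        elif stripped == "</doc>" and attrs is not None:
            yield attrs, "\n".join(body).strip()
            attrs = None
        elif attrs is not None:
            body.append(line.rstrip("\n"))
```

Passing an open file object as `lines` processes the extractor output one line at a time, which keeps memory use low for large files.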

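Below is the command I used.
Because the output is split into multiple files of about 500 MB each, I concatenate them into a single file.
The original database dump file was 2.95 GB, while the extracted plain-text file was 4.81 GB.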

python3 -m wikiextractor.WikiExtractor jawiki-latest-pages-articles.xml.bz2 --lists --output wikiext --processes 8 --bytes 500M
cat wikiext/*/* > jawiki.txt
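
For reference, the `cat` step could also be done from Python, for example when the concatenation should be part of a larger script. This is a sketch under the assumption that the extractor writes its files one directory level below the output directory, which is what the `wikiext/*/*` glob above implies. `concat_extracted` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def concat_extracted(src_dir, dest):
    """Concatenate every file matching src_dir/*/* into dest, in sorted order.

    Returns the number of files that were concatenated.
    """
    files = sorted(p for p in Path(src_dir).glob("*/*") if p.is_file())
    with open(dest, "wb") as out:
        for path in files:
            # Copy in binary mode so the text is passed through unchanged
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(files)


# Example call mirroring the shell command above:
# concat_extracted("wikiext", "jawiki.txt")
```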

3. Segment the articles into words

Let's look at the mecab help.

vagrant@ubuntu18:~$ mecab --help
MeCab: Yet Another Part-of-Speech and Morphological Analyzer

Copyright(C) 2001-2012 Taku Kudo
Copyright(C) 2004-2008 Nippon Telegraph and Telephone Corporation

Usage: mecab [options] files
 -r, --rcfile=FILE              use FILE as resource file
 -d, --dicdir=DIR               set DIR  as a system dicdir
 -u, --userdic=FILE             use FILE as a user dictionary
 -l, --lattice-level=INT        lattice information level (DEPRECATED)
 -D, --dictionary-info          show dictionary information and exit
 -O, --output-format-type=TYPE  set output format type (wakati,none,...)
 -a, --all-morphs               output all morphs(default false)
 -N, --nbest=INT                output N best results (default 1)
 -p, --partial                  partial parsing mode (default false)
 -m, --marginal                 output marginal probability (default false)
 -M, --max-grouping-size=INT    maximum grouping size for unknown words (default 24)
 -F, --node-format=STR          use STR as the user-defined node format
 -U, --unk-format=STR           use STR as the user-defined unknown node format
 -B, --bos-format=STR           use STR as the user-defined beginning-of-sentence format
 -E, --eos-format=STR           use STR as the user-defined end-of-sentence format
 -S, --eon-format=STR           use STR as the user-defined end-of-NBest format
 -x, --unk-feature=STR          use STR as the feature for unknown word
 -b, --input-buffer-size=INT    set input buffer size (default 8192)
 -P, --dump-config              dump MeCab parameters
 -C, --allocate-sentence        allocate new memory for input sentence
 -t, --theta=FLOAT              set temparature parameter theta (default 0.75)
 -c, --cost-factor=INT          set cost factor (default 700)
 -o, --output=FILE              set the output file name
 -v, --version                  show the version and exit.
 -h, --help                     show this help and exit.

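Below is the command I used.
The guest OS runs with 8 cores and 16 GB of memory, so I set the input buffer size to 16 times the default (8192 × 16 = 131072).
The extracted plain-text file was 4.81 GB, while the resulting word-segmented file was 5.71 GB.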

mecab -b 131072 -Owakati jawiki.txt -o jawiki_wakati.txt
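
The same word segmentation can also be driven from Python through the mecab-python3 binding installed earlier. The sketch below is not what I actually ran. `wakati_file` and `make_mecab_tokenizer` are hypothetical helpers, and unlike the command above this version skips empty lines. Depending on the mecab-python3 version, `MeCab.Tagger` may need extra arguments to locate a dictionary, which is not covered here.

```python
def wakati_file(src, dest, tokenize):
    """Write a space-separated version of each non-empty line of src to dest.

    tokenize is any callable that maps a line of text to its segmented form.
    Returns the number of lines written.
    """
    count = 0
    with open(src, encoding="utf-8") as fin, open(dest, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # design choice of this sketch: drop blank lines
            fout.write(tokenize(line) + "\n")
            count += 1
    return count


def make_mecab_tokenizer():
    """Build a tokenizer backed by MeCab (imported lazily so the module loads without it)."""
    import MeCab  # provided by the mecab-python3 package

    tagger = MeCab.Tagger("-Owakati")
    # parse() returns the tokens joined by spaces, followed by a newline
    return lambda text: tagger.parse(text).strip()


# Example call:
# wakati_file("jawiki.txt", "jawiki_wakati.txt", make_mecab_tokenizer())
```

Taking the tokenizer as a parameter makes it easy to swap in a different dictionary or analyzer later without touching the file-handling code.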

Summary

This article showed how to obtain Wikipedia data and preprocess it (word segmentation).
