
Cleaning up junk in WikiExtractor output

More than 1 year has passed since last update.

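Goal

Remove leftover tags and table markup from the WikiExtractor output, and write the result with line breaks preserved (so that Word2Vec can be trained one line at a time).

Source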

wiki_clean.py

if __name__ == '__main__':
    # Read the file that combines all WikiExtractor output into one file
    with open('wiki.txt', 'r', encoding='utf-8') as fr:
        wiki_text = fr.readlines()

    # Lines that survive filtering
    all_text = []

    # Filtering logic
    for line in wiki_text:
        # Skip doc tags and blank lines
        if line == '\n' or line.startswith('<doc') or line.startswith('</doc'):
            continue
        # Skip table residue (lines with many '||' cell separators)
        if line.count('||') > 4:
            continue
        # Skip colspan residue
        if 'colspan=' in line:
            continue
        all_text.append(line.rstrip())

    # Save, one cleaned line per output line
    with open('wiki_clean.txt', 'w', encoding='utf-8') as fw:
        fw.write("\n".join(all_text))
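
To check the three filtering rules on a small input, they can be gathered into a single predicate. This is only a minimal sketch: the name `is_noise` and the sample lines are made up for illustration and are not part of the original script.

```python
def is_noise(line: str) -> bool:
    """Return True for lines that the filtering rules would discard."""
    # Blank lines and the <doc ...> / </doc> wrappers
    if line == '\n' or line.startswith(('<doc', '</doc')):
        return True
    # Table residue: more than four '||' separators on one line
    if line.count('||') > 4:
        return True
    # Leftover colspan attributes
    return 'colspan=' in line


# Hypothetical sample lines, for illustration only
sample = [
    '<doc id="1" title="Example">\n',
    'An ordinary sentence.\n',
    '\n',
    'a || b || c || d || e || f\n',
    'colspan="2" | leftover cell\n',
    '</doc>\n',
]
kept = [line.rstrip() for line in sample if not is_noise(line)]
print(kept)
```

Only the ordinary sentence should survive. A line with a few `||` separators is kept, since the threshold only drops lines with more than four.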

Closing thoughts

I wrote this code quickly, so it is messy.
I suspect there may actually be an easier way to do this.
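
One possible simplification, offered only as a sketch: apply the same rules while streaming, so the whole dump never has to be held in memory. The function name `clean_wiki` and its parameters are assumptions for illustration; the filtering rules are the same ones used in wiki_clean.py.

```python
def clean_wiki(src_path: str, dst_path: str) -> int:
    """Copy src to dst, skipping noise lines; return the number of lines kept."""
    kept = 0
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            # Same rules as wiki_clean.py: blank lines, doc tags,
            # table residue, and colspan residue are dropped
            if (line == '\n'
                    or line.startswith(('<doc', '</doc'))
                    or line.count('||') > 4
                    or 'colspan=' in line):
                continue
            dst.write(line.rstrip() + '\n')
            kept += 1
    return kept
```

Calling `clean_wiki('wiki.txt', 'wiki_clean.txt')` would process the same files as the script. Unlike the `"\n".join(...)` version, this variant ends the output with a trailing newline.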
