
Downloading and preprocessing wiki40b/ja

Posted at 2023-01-27

Introduction

Downloading wiki40b from TensorFlow Datasets turned out to be a little tricky, so this post describes how I did it.

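Environment setup

Create a Python 3 environment with venv, virtualenv, or similar, then run:

pip install tensorflow tensorflow-datasets

Installing tensorflow-datasets alone does not seem to be enough; tensorflow has to be installed as well.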
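As a quick sanity check after installation, something like the following sketch can confirm that both packages are visible to the interpreter without actually importing them. The helper name missing_packages is made up for this illustration; only the standard library is used.

```python
import importlib.util

# Import names of the two packages installed above.
# Note that the pip name tensorflow-datasets maps to the module tensorflow_datasets.
REQUIRED = ("tensorflow", "tensorflow_datasets")


def missing_packages(names=REQUIRED):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("tensorflow and tensorflow_datasets are both importable.")
```

If either name is reported as missing, rerunning the pip command inside the activated environment is the first thing to try.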

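Download

Run the code below. It produces a monolingual corpus with one sentence per line.

make_data is an iterator that yields the downloaded data line by line. The downloaded data contains various tokens that mark paragraph and line boundaries, so it is hard to use as a monolingual corpus as is.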
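To make the marker tokens concrete, here is a small sketch of what one record looks like once it is split on newlines. SAMPLE_RECORD is a hand-written string that imitates the wiki40b layout, not real data, and split_record is a hypothetical stand-in for the inner loop of make_data.

```python
# Hand-written sample imitating the layout of one wiki40b record (not real data).
SAMPLE_RECORD = (
    "\n_START_ARTICLE_\n"
    "サンプル記事\n"
    "_START_SECTION_\n"
    "概要\n"
    "_START_PARAGRAPH_\n"
    "これは一文目です。_NEWLINE_これは二文目です。"
)


def split_record(record):
    """Yield the raw lines of one decoded record, including marker lines."""
    for line in record.split("\n"):
        yield line


raw_lines = list(split_record(SAMPLE_RECORD))
for line in raw_lines:
    print(repr(line))
```

The marker lines, the titles that follow them, and the inline _NEWLINE_ token all show up in the output, which is why a second filtering pass is needed before the text is usable as a corpus.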

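get_line_data is an iterator that returns the text one sentence at a time. get_paragraphs first reduces the stream to paragraph bodies, dropping the marker tokens and the article and section titles, and get_lines then splits each paragraph into sentences.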

import tensorflow_datasets as tfds


def main():
    splits = ['train', 'validation', 'test']
    for s in splits:
        # s[:5] shortens 'validation' to 'valid'; 'train' and 'test' are unchanged
        with open(f'{s[:5]}.txt', 'w', encoding='utf-8') as f:
            for line in get_line_data(make_data(s)):
                print(line, file=f)


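def make_data(s):
    # Load one split of wiki40b/ja and yield its raw lines one at a time
    ds = tfds.load('wiki40b/ja', split=s)
    for text in ds.as_numpy_iterator():
        lines = text['text'].decode()  # the 'text' feature is UTF-8 bytes
        for line in lines.split('\n'):
            yield line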


def get_line_data(s):
    s = get_paragraphs(s)
    s = get_lines(s)
    return s


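def get_paragraphs(s):
    # mode is True while the current line is paragraph text,
    # False while it is an article or section title
    mode = True

    for x in s:
        x = x.strip()

        if x == '_START_ARTICLE_':
            mode = False  # next line is the article title
        elif x == '_START_SECTION_':
            mode = False  # next line is a section title
        elif x == '_START_PARAGRAPH_':
            mode = True  # next line is paragraph text
        elif mode and (len(x) > 0):
            yield x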


def get_lines(s):
    for para in s:
        for line in para.split('_NEWLINE_'):
            if len(line) > 0:
                yield line


if __name__ == '__main__':
    main()
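The filtering logic can be checked without downloading anything. The sketch below repeats the two filters in condensed form so that it runs on its own, and feeds them a hand-written list of lines in the same layout as the raw data. The sample text is made up for illustration.

```python
def get_paragraphs(lines):
    """Keep only paragraph bodies; skip markers and the titles that follow them."""
    keep = True
    for x in lines:
        x = x.strip()
        if x in ('_START_ARTICLE_', '_START_SECTION_'):
            keep = False  # the next line is a title
        elif x == '_START_PARAGRAPH_':
            keep = True  # the next line is paragraph text
        elif keep and x:
            yield x


def get_lines(paragraphs):
    """Split each paragraph on the _NEWLINE_ token and drop empty pieces."""
    for para in paragraphs:
        for line in para.split('_NEWLINE_'):
            if line:
                yield line


# Hand-written lines imitating what make_data yields (not real data).
sample = [
    '',
    '_START_ARTICLE_',
    'サンプル記事',
    '_START_SECTION_',
    '概要',
    '_START_PARAGRAPH_',
    'これは一文目です。_NEWLINE_これは二文目です。',
    '_START_SECTION_',
    '歴史',
    '_START_PARAGRAPH_',
    '三文目です。',
]

result = list(get_lines(get_paragraphs(sample)))
for line in result:
    print(line)
```

Only the three body sentences are printed; the article title, both section titles, and every marker token are gone.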

That's all.
