
Language Processing 100 Knocks-91: Preparing the Analogy Data

Posted at 2020-01-18

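This is a record of problem 91, "Preparing the analogy data", from Language Processing 100 Knocks 2015 (言語処理100本ノック 2015).
This one is preprocessing for later knocks, so technically it is very easy.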

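Reference links

Link	Notes
091.アナロジーデータの準備.ipynb	GitHub link to the answer program
素人の言語処理100本ノック:91 (An Amateur's Language Processing 100 Knocks: 91)	I rely on this series throughout the 100 knocks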

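Environment

Type	Version	Notes
OS	Ubuntu18.04.01 LTS	Running as a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes work with multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using 3.7 or 3.8. Packages are managed with venv.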

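Task

Chapter 10: Vector Space Methods (II)

In Chapter 10, we continue working on learning word vectors, following on from the previous chapter.

91. Preparing the analogy data

Download the word analogy evaluation data. In this data, lines beginning with ": " denote section names. For example, the line ": capital-common-countries" marks the start of the section "capital-common-countries". From the downloaded evaluation data, extract the evaluation examples contained in the "family" section and save them to a file.

* The original link to the word analogy evaluation data is dead, so I have replaced it here.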

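Notes on the task

"Analogy data" appears to be data for analogical reasoning.
The first 10 lines are shown below. A line with a leading colon, such as : capital-common-countries, marks the start of a block. It is followed by lines such as Athens Greece Baghdad Iraq, each holding two capital-country pairs.
In this way, each block header is followed by many lines, each holding two word pairs that share some relation. This time we extract the contents of the family block from this data.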

questions-words.txt

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
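
To make the block structure described above concrete, here is a minimal sketch that groups the lines into a dictionary keyed by section name. The function name `parse_analogy_sections` and the in-memory `sample` list are assumptions made for illustration; they are not part of the answer notebook, and real use would pass the opened questions-words.txt file instead.

```python
from typing import Dict, Iterable, List, Tuple


def parse_analogy_sections(lines: Iterable[str]) -> Dict[str, List[Tuple[str, ...]]]:
    """Group analogy rows under the section header (': name') that precedes them."""
    sections: Dict[str, List[Tuple[str, ...]]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(': '):
            # A header line opens a new section
            current = line[2:]
            sections.setdefault(current, [])
        elif current is not None:
            # A data row: whitespace-separated words
            sections[current].append(tuple(line.split()))
    return sections


# Small hand-made sample in the same format (assumption: stands in for the real file)
sample = [
    ': capital-common-countries\n',
    'Athens Greece Baghdad Iraq\n',
    'Athens Greece Bangkok Thailand\n',
    ': family\n',
    'boy girl brother sister\n',
]
sections = parse_analogy_sections(sample)
print(sections['family'])
```

Holding every section in memory is more than this task needs, but it shows how the header lines partition the file.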

Answer

Answer program: 091.アナロジーデータの準備.ipynb

with open('./questions-words.txt') as file_in, \
       open('./091.analogy_family.txt', 'w') as file_out:

    target = False      # True while inside the target section
    for line in file_in:

        if target:

            # Inside the target section: write lines until the next section starts
            if line.startswith(': '):
                break
            print(line.strip(), file=file_out)

        elif line.startswith(': family'):

            # Found the target section
            target = True
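
The program above matches the header with a prefix test, so any header that merely begins with ": family" would also be treated as the start of the target. As a hedged variant, the sketch below compares the section name exactly and works on any iterable of lines. The name `extract_section`, the `sample` list, and the `family-extended` header are hypothetical and exist only to illustrate the difference.

```python
from typing import Iterable, Iterator


def extract_section(lines: Iterable[str], name: str) -> Iterator[str]:
    """Yield the stripped rows of the section whose header is exactly ': <name>'."""
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith(': '):
            if inside:
                break  # the next header ends the target section
            inside = (line[2:] == name)
        elif inside and line:
            yield line


sample = [
    ': capital-common-countries\n',
    'Athens Greece Baghdad Iraq\n',
    ': family\n',
    'boy girl brother sister\n',
    'boy girl brothers sisters\n',
    ': family-extended\n',  # hypothetical header, for illustration only
    'x y z w\n',
]
family = list(extract_section(sample, 'family'))
print(family)
```

With a real file, the same generator could be fed the opened input file and its rows written out with `print(row, file=file_out)`, as in the answer program.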

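Answer explanation

Honestly, there is nothing technically special here, so there is not much to explain. If I had to say something, it is that more than 90% of it is copied and pasted from 素人の言語処理100本ノック:91.
The first 10 lines of the resulting text are as follows.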

091.analogy_family.txt

boy girl brother sister
boy girl brothers sisters
boy girl dad mom
boy girl father mother
boy girl grandfather grandmother
boy girl grandpa grandma
boy girl grandson granddaughter
boy girl groom bride
boy girl he she
boy girl his her
(rest omitted)
