
[Streamlining work with Jupyter] Break English text at periods and add line numbers [Making better use of Google Translate]

Last updated at 2021-05-05, posted at 2021-05-04

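Background

Many people who try to read English papers for self-study run into the language barrier. (I am one of them.)
Google Translate helps, but it tends to translate poorly when the line breaks in the pasted text are in the wrong places.
Fixing the line breaks by regex substitution is fairly tedious.
Also, although translation quality has improved, the Japanese output is still occasionally dubious.
In that case most people compare it against the English text, but for long passages it takes time to match each English sentence to its Japanese translation.
So I wrote a tool that breaks English text at periods and adds line numbers.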

Installing the required libraries

Running the code in Jupyter notebook is probably the quickest way to try it.

pip install jupyterlab
pip install pyperclip

Code

'''
Adapt and maintain this code as needed.
'''
import pyperclip
import os
import re

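def to_1_line_of_text(_str):
    '''
    Strip unwanted control characters and join the text into a single line
    '''
    # Turn line breaks into spaces so that wrapped sentences are rejoined
    _str = _str.replace(os.linesep, " ")
    # Drop the STX control character and any leftover carriage returns
    _str = _str.replace("\x02", "")
    _str = _str.replace("\r", "")
    # Collapse double spaces into single spaces
    _str = _str.replace("  ", " ")
    return _str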

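def new_line(_str):
    '''
    Find the characters that should trigger a line break and break there
    Targets: periods (break after) and bullet characters "•" (break before)
    '''
    _str = _str.replace(".", "." + os.linesep)
    _str = _str.replace("•", os.linesep + "•")
    return _str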

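def adjust(_str):
    '''
    Tidy up the line breaks inserted by new_line
    '''
    # Move a closing double or single quote back onto the sentence it ends
    _str = _str.replace("." + os.linesep + '"', '."' + os.linesep)
    _str = _str.replace("." + os.linesep + "'", ".'" + os.linesep)
    _str = _str.replace("." + os.linesep + "”", ".”" + os.linesep)
    # Undo line breaks inside strings that should not be split.
    # This runs before leading spaces are stripped; otherwise the space
    # after the abbreviation is lost ("e.g. Python" would become "e.g.Python").
    # "e.g."
    _str = _str.replace("e." + os.linesep + "g." + os.linesep, "e.g.")
    # "i.e."
    _str = _str.replace("i." + os.linesep + "e." + os.linesep, "i.e.")
    # "et al."
    _str = _str.replace("et al." + os.linesep, "et al.")
    # Strip the space at the start of each line
    _str = _str.replace(os.linesep + " ", os.linesep)
    # Remove blank lines
    _str = _str.replace(os.linesep + os.linesep, os.linesep)
    # Undo line breaks inside numbers that contain periods (decimals, section numbers, etc.).
    # Each pass consumes the digit after the period, so chained numbers such as
    # "1.2.3" need several passes; stop after 5 passes or when nothing changes.
    i = 0
    _aft_str = ""
    while i < 5 and _str != _aft_str:
        if i > 0:
            _str = _aft_str
        i += 1
        _aft_str = re.sub(r"([0-9])\." + os.linesep + "([0-9])", r"\1.\2", _str)
    return _aft_str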

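def add_line_numer(_str, _start_number):
    '''
    Prefix each non-empty line with a line number
    _start_number: number preceding the first line (the first line gets _start_number + 1)
    '''
    _str2clip = ""
    i = _start_number
    for txt in _str.split(os.linesep):
        if len(txt) > 0:
            # Count only non-empty lines so the numbering has no gaps
            i += 1
            _str2clip += str(i) + ":" + txt + os.linesep
    return _str2clip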

# Read the current clipboard contents
clip_str = pyperclip.paste()

# Process the text
str2clip = to_1_line_of_text(clip_str)
str2clip = new_line(str2clip)
str2clip = adjust(str2clip)
str2clip = add_line_numer(str2clip, 0)

# Copy the result back to the clipboard
pyperclip.copy(str2clip)

print("== 元テキスト ==")  # label: "original text"
print(clip_str)
print("== 加工後テキスト ==")  # label: "processed text"
print(str2clip)
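
As a side note on the number handling in adjust: it repeats re.sub up to five times because its pattern consumes the digit after the period, so a chained number such as "1.2.3" cannot be repaired in one pass. A possible alternative, shown here only as a sketch, is to use lookbehind and lookahead assertions, which check the neighbouring digits without consuming them. The function name rejoin_numbers and the newline parameter are assumptions made for this illustration (the article's code uses os.linesep directly).

```python
import re

def rejoin_numbers(text, newline="\n"):
    """Remove a line break inserted after a period that sits between two digits."""
    # (?<=[0-9]) and (?=[0-9]) test the surrounding digits without consuming them,
    # so every period inside "1.2.3" is handled in a single pass.
    pattern = r"(?<=[0-9])\." + re.escape(newline) + r"(?=[0-9])"
    return re.sub(pattern, ".", text)

# Hypothetical input, shaped like the output of new_line()
broken = "Section 1.\n2.\n3 reports 0.\n5.\nNext sentence."
fixed = rejoin_numbers(broken)
print(fixed)
```

The period that ends a sentence is left alone because it is not followed by a digit.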

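Usage

Run this code in the middle of a copy-and-paste.

1. Copy English text from a paper or similar source (Ctrl+C)
2. Run the code in Jupyter notebook or similar (the code copies the result to the clipboard automatically)
3. Paste into Google Translate or similar (Ctrl+V)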
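
When a paper is processed one paragraph at a time, the numbering can continue across runs by passing the last number used as the start number (the second argument of add_line_numer). The following is a minimal sketch of that idea; number_lines is a hypothetical helper written for this illustration, not part of the code above, and it assumes "\n" line breaks.

```python
def number_lines(text, start_number=0):
    """Prefix each non-empty line with 'N:' and return (numbered_text, last_number)."""
    numbered = []
    n = start_number
    for line in text.splitlines():
        if line:
            n += 1
            numbered.append(f"{n}:{line}")
    return "\n".join(numbered), n

# First paragraph: numbering starts at 1
first, last = number_lines("Sentence one.\nSentence two.")
# Second paragraph: numbering continues where the first one stopped
second, last = number_lines("Sentence three.", start_number=last)
print(first)
print(second)
```

Returning the last number alongside the text avoids having to count lines by hand between runs.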

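Example run

Let's copy the abstract of "A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective" and run the code in Jupyter notebook.
The result is shown below. The text is broken at each "." (period) and every line has a line number. (In the output, 元テキスト is the original text and 加工後テキスト is the processed text.)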

== 元テキスト ==
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are
largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we
are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep
learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts
of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and
computer vision communities, but also from the data management community due to the importance of handling large amounts of data.
In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely
consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these
operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of
machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration
and opens many opportunities for new research.
== 加工後テキスト ==
1:Data collection is a major bottleneck in machine learning and an active research topic in multiple communities.
2:There are largely two reasons data collection has recently become a critical issue.
3:First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data.
4:Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data.
5:Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data.
6:In this survey, we perform a comprehensive study of data collection from a data management point of view.
7:Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models.
8:We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges.
9:The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

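Google Translate results

Below are the Google Translate results for the original text (first block) and the processed text (second block).
The translation of the processed text reads more naturally as Japanese.
In addition, because the lines are numbered, matching each line to the original English is easy. (This is actually the more important benefit.)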

データ収集は、機械学習の主要なボトルネックであり、複数のコミュニティで活発な研究トピックです。がある
データ収集が最近重大な問題になっている主な2つの理由。まず、機械学習がより広く使用されるようになるにつれて、私たちは
必ずしも十分なラベル付きデータがない新しいアプリケーションが見られます。第二に、従来の機械学習とは異なり、深い
学習手法は自動的に機能を生成し、機能エンジニアリングのコストを節約しますが、その見返りとして、より多くの金額が必要になる場合があります
ラベル付けされたデータの。興味深いことに、データ収集に関する最近の研究は、機械学習、自然言語、および
コンピュータビジョンコミュニティだけでなく、大量のデータを処理することの重要性のためにデータ管理コミュニティからも。
この調査では、データ管理の観点からデータ収集の包括的な調査を行います。主にデータ収集
データの取得、データのラベル付け、および既存のデータまたはモデルの改善で構成されます。これらの研究風景を提供します
運用、いつ使用する手法に関するガイドラインを提供し、興味深い研究課題を特定します。の統合
データ収集のための機械学習とデータ管理は、ビッグデータと人工知能（AI）統合のより大きなトレンドの一部です
そして新しい研究のための多くの機会を開きます。

1：データ収集は、機械学習の主要なボトルネックであり、複数のコミュニティで活発な研究トピックです。
2：データ収集が最近重大な問題になっている主な理由は2つあります。
3：まず、機械学習がより広く使用されるようになるにつれて、必ずしも十分なラベル付きデータがない新しいアプリケーションが見られます。
4：第2に、従来の機械学習とは異なり、深層学習手法は自動的に特徴を生成し、特徴エンジニアリングのコストを節約しますが、その見返りとして、大量のラベル付きデータが必要になる場合があります。
5：興味深いことに、データ収集に関する最近の研究は、機械学習、自然言語、コンピュータービジョンのコミュニティだけでなく、大量のデータを処理することの重要性から、データ管理のコミュニティからも得られています。
6：本調査では、データ管理の観点からデータ収集の総合的な調査を行います。
7：データ収集は、主にデータの取得、データのラベル付け、および既存のデータまたはモデルの改善で構成されます。
8：これらの操作の調査状況を提供し、いつどの手法を使用するかについてのガイドラインを提供し、興味深い調査の課題を特定します。
9：データ収集のための機械学習とデータ管理の統合は、ビッグデータと人工知能（AI）統合のより大きなトレンドの一部であり、新しい研究の多くの機会を開きます。
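
Because the English lines and the translated lines carry the same numbers, they can also be paired up mechanically. The sketch below is one possible way to do that; parse_numbered and pair_by_number are hypothetical names, and the pattern accepts both ":" and the full-width "：" that appears in the translated output above.

```python
import re

# A line number, then a half-width or full-width colon, then the sentence
LINE_RE = re.compile(r"^\s*(\d+)\s*[:：]\s*(.*)$")

def parse_numbered(text):
    """Return {line_number: sentence} for every line of the form 'N:sentence'."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            result[int(m.group(1))] = m.group(2)
    return result

def pair_by_number(source_text, translated_text):
    """Return (number, source sentence, translated sentence) tuples in numeric order."""
    src = parse_numbered(source_text)
    dst = parse_numbered(translated_text)
    # A missing translation is shown as an empty string instead of raising an error
    return [(n, src[n], dst.get(n, "")) for n in sorted(src)]

pairs = pair_by_number("1:Hello.\n2:Goodbye.", "1：こんにちは。\n2：さようなら。")
for number, english, japanese in pairs:
    print(number, english, "|", japanese)
```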

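Addendum

2021/05/05: Added a step to adjust that removes unwanted line breaks inside numbers containing a decimal point or period (decimals, section numbers, etc.)
2021/05/05: Added a step to adjust that places the line break after a closing double or single quotation mark
2021/05/06: Added a step to adjust that removes unwanted line breaks inside "i.e."
2021/05/06: Added a step to adjust that removes unwanted line breaks inside "et al."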
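
Each of the abbreviation fixes listed above adds one more replace call to adjust. If the list keeps growing, one option is to drive the same replacement from a table. This is only a sketch: ABBREVIATIONS and restore_abbreviations are hypothetical names, and "\n" is assumed as the line break where the article's code uses os.linesep.

```python
# Abbreviations that new_line() splits by mistake; extend as needed
ABBREVIATIONS = ["e.g.", "i.e.", "et al."]

def restore_abbreviations(text, abbreviations=ABBREVIATIONS, newline="\n"):
    """Undo the line breaks that were inserted after each period of an abbreviation."""
    for abbr in abbreviations:
        # Rebuild the broken form, e.g. "e.g." -> "e.\ng.\n"
        broken = abbr.replace(".", "." + newline)
        text = text.replace(broken, abbr)
    return text

# Hypothetical input, shaped like the output of new_line()
sample = "tools, e.\ng.\n, Python (Smith et al.\n, 2020).\n"
restored = restore_abbreviations(sample)
print(restored)
```

Supporting another abbreviation then only requires adding an entry to the list, for example restore_abbreviations(text, ABBREVIATIONS + ["cf."]).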
