
[Python] Neatly reformatting text copied from a PDF that is riddled with line-break codes

Posted at 2020-08-12

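#Introduction
I originally wrote this for use in my previous two articles:
[Python] Let's automatically translate English PDFs (and not only those) with DeepL or Google Translate and save the result as a text file.
(Continued) [Python] Let's automatically translate English PDFs (and not only those) with DeepL or Google Translate and save the result as a text file, or rather as HTML.

It seemed useful in its own right, so I am introducing it separately here.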

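#Problems with text copied from a PDF
I don't have detailed knowledge of the PDF format, but text inside a PDF appears to be stored in small separate pieces, so copied text contains line-break codes at exactly the positions where lines break in the PDF display.

For example, if the PDF displays

$$ABC.\\DEF.\\GHI.$$

then the copied text looks like

$$ABC.{\r\n}DEF.{\r\n}GHI.$$

(The example above is for Windows.)

So why not simply delete the line-break codes and join the lines:

$$ABC.DEF.GHI.$$

In this example each line ends with a period, so the sentences don't blend together.


However, this doesn't solve everything; it isn't that simple.

Consider the following case:

$$1. Introduction\\ABCDEF.\\GHIJKL.\\MNOPQR.$$

Simply deleting the line-break codes gives

$$1. IntroductionABCDEF.GHIJKL.MNOPQR.$$

and the first line, which has no period, can no longer be told apart from the second line.

In other words, the problem is how to infer and separate parts such as headings, which don't necessarily carry a delimiter like a period, from copied text in which the only clues are the text itself and the line-break codes.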
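
The two cases above can be reproduced in a few lines. This is only an illustrative sketch; `naive_join` is a hypothetical helper name and is not part of the function presented later.

```python
def naive_join(text):
    # str.splitlines() splits on \r\n, \n and \r alike,
    # so the same code works for text copied on any OS.
    return "".join(text.splitlines())


# Every line ends with a period, so the sentences stay distinguishable.
print(naive_join("ABC.\r\nDEF.\r\nGHI."))
# ABC.DEF.GHI.

# The heading has no trailing period, so it fuses with the first sentence.
print(naive_join("1. Introduction\r\nABCDEF.\r\nGHIJKL.\r\nMNOPQR."))
# 1. IntroductionABCDEF.GHIJKL.MNOPQR.
```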

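#What I did

  1. Split the text at line-break codes
  2. Remove empty lines
  3. Guess whether a line is body text or a heading from the difference in character count between the current line and the next line
  4. Check whether the first character of the next line is lowercase (a lowercase start suggests the sentence continues)
  5. Treat a line that is entirely uppercase as a heading
  6. Treat a line that begins with a number (Arabic or Roman numeral) followed by a period as a heading
  7. Even when the character-count difference between the current line and the next line is large, treat them as a continuous sentence if the next line is the shorter one and ends with a period (or a Japanese full stop)
  8. Treat lines as a continuous sentence as long as a bracket remains unclosed

These are the methods I adopted.
They are fairly simple, but the resulting function can split most text into one of the following units:

・Heading
・Paragraph
・Sentence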
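
Some of the rules in the list above can be sketched as small standalone predicates. This is a simplified illustration, not the actual implementation: the function names are hypothetical, the patterns cover fewer cases than the full code in the next section, and only ASCII parentheses are counted.

```python
import re


def is_all_caps(line):
    # Rule 5: only uppercase ASCII letters and whitespace -> heading.
    return re.fullmatch(r"[A-Z\s]+", line) is not None


def has_number_prefix(line):
    # Rule 6 (simplified): "1. ", "2.1 " or "II. " at the start of the line.
    return re.match(r"(\d{1,2}\.(\d{1,2}\.?)*|[IVX]+\.)\s", line) is not None


def continues_sentence(next_line):
    # Rule 4: a next line starting with a lowercase letter
    # most likely continues the current sentence.
    return re.match(r"\s?[a-z]", next_line) is not None


def has_open_bracket(lines):
    # Rule 8: more opening than closing parentheses seen so far.
    joined = "".join(lines)
    return joined.count("(") > joined.count(")")


print(is_all_caps("USING THE PYTHON INTERPRETER"))          # True
print(has_number_prefix("2.1 Invoking the Interpreter"))    # True
print(has_number_prefix("The Python interpreter"))          # False
print(continues_sentence("putting /usr/local/bin in your search path"))  # True
print(has_open_bracket(["(E.g., /usr/local/python is a popular alternative"]))  # True
```

Keeping each rule as its own predicate makes it easy to test the rules one at a time; the real function combines them with the length comparison in a single chain of conditions.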

#Code

import re
import unicodedata


def len_(text):
    cnt = 0
    for t in text:
        if unicodedata.east_asian_width(t) in "FWA":
            cnt += 2
        else:
            cnt += 1
    return cnt


def textParser(text, n=30, bracketDetect=True):
    text = text.splitlines()
    sentences = []
    t = ""
    bra_cnt = ket_cnt = bra_cnt_jp = ket_cnt_jp = 0
    for i in range(len(text)):
        if not bool(re.search(r"\S", text[i])): continue
        if bracketDetect:
            bra_cnt += len(re.findall(r"[\((]", text[i]))
            ket_cnt += len(re.findall(r"[\))]", text[i]))
            bra_cnt_jp += len(re.findall(r"[「「『]", text[i]))
            ket_cnt_jp += len(re.findall(r"[」」』]", text[i]))
        if i != len(text) - 1:
            if bool(re.fullmatch(r"[A-Z\s]+", text[i])):
                if t != "": sentences.append(t)
                t = ""
                sentences.append(text[i])
            elif bool(
                    re.match(
                        r"(\d{1,2}[\.,、.]\s?(\d{1,2}[\.,、.]*)*\s?|I{1,3}V{0,1}X{0,1}[\.,、.]|V{0,1}X{0,1}I{1,3}[\.,、.]|[・•●])+\s",
                        text[i])) or re.match(r"\d{1,2}.\w", text[i]) or (
                            bool(re.match("[A-Z]", text[i][0]))
                            and abs(len_(text[i]) - len_(text[i + 1])) > n
                            and len_(text[i]) < n):
                if t != "": sentences.append(t)
                t = ""
                sentences.append(text[i])
            elif (
                    text[i][-1] not in ("。", ".", ".") and
                (abs(len_(text[i]) - len_(text[i + 1])) < n or
                 (len_(t + text[i]) > len_(text[i + 1]) and bool(
                     re.search(r"[。\..]\s\d|..[。\..]|.[。\..]", text[i + 1][-3:])
                     or bool(re.match(r"[A-Z]", text[i + 1][:1]))))
                 or bool(re.match(r"\s?[a-z,\)]", text[i + 1]))
                 or bra_cnt > ket_cnt or bra_cnt_jp > ket_cnt_jp)):
                t += text[i]
            else:
                sentences.append(t + text[i])
                t = ""
        else:
            sentences.append(t + text[i])
    return sentences

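If the result isn't quite right, try adjusting the value of n (larger values merge more lines together; smaller values split them apart).
If the bracket counts get out of sync for some reason and the text clumps together oddly, set bracketDetect=False.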
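
To make the role of `n` more concrete: the function compares display widths as measured by `len_`, where full-width characters count as two columns. The sketch below repeats `len_` in compact form and isolates the "short line next to a much longer line" test that marks a heading candidate. `looks_like_heading` is a hypothetical name used only here, and it mirrors just one of the conditions inside `textParser`.

```python
import re
import unicodedata


def len_(text):
    # Fullwidth / Wide / Ambiguous characters count as 2 columns, others as 1.
    cnt = 0
    for t in text:
        cnt += 2 if unicodedata.east_asian_width(t) in "FWA" else 1
    return cnt


def looks_like_heading(line, next_line, n=30):
    # Starts with a capital letter, is narrower than n, and differs
    # from the next line by more than n columns.
    return (re.match(r"[A-Z]", line) is not None
            and abs(len_(line) - len_(next_line)) > n
            and len_(line) < n)


print(len_("abc"))     # 3
print(len_("あいう"))  # 6

body = "The Python interpreter is usually installed as /usr/local/bin/python3.8"
print(looks_like_heading("Invoking the Interpreter", body))        # True
print(looks_like_heading("Invoking the Interpreter", body, n=60))  # False
```

With the larger `n` the short line is no longer split off as a heading, which matches the rule of thumb that larger values keep more text together.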
#Usage example
From Python 3.8.5 Documentation,
PDF (US-Letter paper size)\tutorial.pdf, page number 5 (p. 11)

Text copied from the original:

CHAPTER
TWO
USING THE PYTHON INTERPRETER
2.1 Invoking the Interpreter
The Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;
putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:
python3.8
to the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;
check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative
location.)
On Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will be
available. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways to
launch Python.
Typing an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes the
interpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:
quit().
The interpreter’s line-editing features include interactive editing, history substitution and code completion on systems that
support the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typing
Control-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix Interactive
Input Editing and History Substitution for an introduction to the keys. If nothing appears to happen, or if ^P is echoed,
command line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.
The interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it reads
and executes commands interactively; when called with a file name argument or with a file as standard input, it reads and
executes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) in
command, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that are
special to the shell, it is usually advised to quote command in its entirety with single quotes.
Some Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,
which executes the source file for module as if you had spelled out its full name on the command line.
When a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. This
can be done by passing -i before the script.
All command line options are described in using-on-general.
1 On Unix, the Python 3.x interpreter is by default not installed with the executable named python, so that it does not conflict with a simultaneously
installed Python 2.x executable.

Each line ends with a line-break code. To make this easier to see, here is the same text with the line-break codes shown literally:

CHAPTER\r\nTWO\r\nUSING THE PYTHON INTERPRETER\r\n2.1 Invoking the Interpreter\r\nThe Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;\r\nputting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:\r\npython3.8\r\nto the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;\r\ncheck with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternative\r\nlocation.)\r\nOn Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will be\r\navailable. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways to\r\nlaunch Python.\r\nTyping an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes the\r\ninterpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:\r\nquit().\r\nThe interpreter’s line-editing features include interactive editing, history substitution and code completion on systems that\r\nsupport the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typing\r\nControl-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix Interactive\r\nInput Editing and History Substitution for an introduction to the keys. If nothing appears to happen, or if ^P is echoed,\r\ncommand line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.\r\nThe interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it reads\r\nand executes commands interactively; when called with a file name argument or with a file as standard input, it reads and\r\nexecutes a script from that file.\r\nA second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) in\r\ncommand, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that are\r\nspecial to the shell, it is usually advised to quote command in its entirety with single quotes.\r\nSome Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,\r\nwhich executes the source file for module as if you had spelled out its full name on the command line.\r\nWhen a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. This\r\ncan be done by passing -i before the script.\r\nAll command line options are described in using-on-general.\r\n1 On Unix, the Python 3.x interpreter is by default not installed with the executable named python, so that it does not conflict with a simultaneously\r\ninstalled Python 2.x executable.

This is the form the copied text takes.
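
A quick way to make the line-break codes visible in this manner is Python's built-in `repr`. The short string below is only a stand-in for the copied text, and `copied` and `visible` are names chosen for this illustration.

```python
copied = "CHAPTER\r\nTWO\r\nUSING THE PYTHON INTERPRETER"

# repr() shows control characters as escape sequences instead of breaking lines.
visible = repr(copied)[1:-1]  # drop the surrounding quotes
print(visible)
# CHAPTER\r\nTWO\r\nUSING THE PYTHON INTERPRETER
```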

Let's pass it to the function written above.
Assuming the text above has been copied to the clipboard:

from pyperclip import paste  # function that returns the clipboard contents (text)

print("\n".join(textParser(paste())))
out
CHAPTER
TWO
USING THE PYTHON INTERPRETER
2.1 Invoking the Interpreter
The Python interpreter is usually installed as /usr/local/bin/python3.8 on those machines where it is available;putting /usr/local/bin in your Unix shell’s search path makes it possible to start it by typing the command:python3.8to the shell.1 Since the choice of the directory where the interpreter lives is an installation option, other places are possible;check with your local Python guru or system administrator. (E.g., /usr/local/python is a popular alternativelocation.)On Windows machines where you have installed Python from the Microsoft Store, the python3.8 command will beavailable. If you have the py.exe launcher installed, you can use the py command. See setting-envvars for other ways tolaunch Python.
Typing an end-of-file character (Control-D on Unix, Control-Z on Windows) at the primary prompt causes theinterpreter to exit with a zero exit status. If that doesn’t work, you can exit the interpreter by typing the following command:quit().
The interpreter’s line-editing features include interactive editing, history substitution and code completion on systems thatsupport the GNU Readline library. Perhaps the quickest check to see whether command line editing is supported is typingControl-P to the first Python prompt you get. If it beeps, you have command line editing; see Appendix InteractiveInput Editing and History Substitution for an introduction to the keys. If nothing appears to happen, or if ^P is echoed,command line editing isn’t available; you’ll only be able to use backspace to remove characters from the current line.
The interpreter operates somewhat like the Unix shell: when called with standard input connected to a tty device, it readsand executes commands interactively; when called with a file name argument or with a file as standard input, it reads andexecutes a script from that file.
A second way of starting the interpreter is python -c command [arg] ..., which executes the statement(s) incommand, analogous to the shell’s -c option. Since Python statements often contain spaces or other characters that arespecial to the shell, it is usually advised to quote command in its entirety with single quotes.
Some Python modules are also useful as scripts. These can be invoked using python -m module [arg] ...,which executes the source file for module as if you had spelled out its full name on the command line.
When a script file is used, it is sometimes useful to be able to run the script and enter interactive mode afterwards. Thiscan be done by passing -i before the script.
installed Python 2.x executable.

Pretty good!

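That said, it's only natural that such a clean PDF splits cleanly.
The function also works on more complicated PDFs that switch back and forth between one and two columns, so give it a try.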

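#Summary
It certainly can't tidy up every PDF perfectly, but the result is reasonably usable.
It also handles Japanese PDFs. Japanese PDFs quite often carry text added by OCR, in which case removing the spaces with something like .replace(" ","") should clean things up (although this doesn't work when the text contains English).
As in the previous articles, there are more uses for this than you might expect, such as when you want to translate a PDF, so please give it a try.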
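
The space removal mentioned above might look like the following. The second function is not from the original code: it is a hedged alternative that removes only the spaces sitting between two Japanese characters, so that embedded English keeps its word spacing. The function names are hypothetical and the character ranges are an assumption that may need adjusting for a given document.

```python
import re

# Japanese punctuation, hiragana, katakana and CJK ideographs (assumed ranges).
JP = r"[\u3000-\u30ff\u4e00-\u9fff]"


def strip_all_spaces(text):
    # The approach from the article: fine for purely Japanese text.
    return text.replace(" ", "")


def strip_spaces_between_japanese(text):
    # Remove a run of spaces only when both neighbours are Japanese characters.
    return re.sub(rf"(?<={JP}) +(?={JP})", "", text)


ocr_text = "こ れ は Python の テ ス ト で す"
print(strip_all_spaces(ocr_text))
# これはPythonのテストです
print(strip_spaces_between_japanese(ocr_text))
# これは Python のテストです
```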
