Python3: Extracting text from a PDF with pdfminer

Posted at 2018-10-02

I used the following article as a reference:
[Python] Extracting text from a PDF: how to use pdfminer.six

pdf_parse.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
#	pdf_parse.py
#
#						Oct/02/2018
#
# ------------------------------------------------------------------
import sys
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
# ------------------------------------------------------------------
sys.stderr.write("*** start ***\n")
file_pdf = sys.argv[1]
file_text = sys.argv[2]
sys.stderr.write(file_pdf + "\n")
sys.stderr.write(file_text + "\n")
#
rsrcmgr = PDFResourceManager()
codec = 'utf-8'
params = LAParams()
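# The output file is opened in binary mode, so the converter writes encoded bytes.
with open(file_text, "wb") as output:
    device = TextConverter(rsrcmgr, output, codec=codec, laparams=params)
    with open(file_pdf, "rb") as fp:  # `fp` avoids shadowing the builtin input()
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Feed each page through the interpreter; the converter writes its text.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()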
#
sys.stderr.write("*** end ***\n")
# ------------------------------------------------------------------
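
The script reads `sys.argv[1]` and `sys.argv[2]` directly, so running it with a missing argument ends in an `IndexError`. As a minimal sketch (not part of the original script), the two paths could be parsed with the standard `argparse` module instead. The function name `parse_args` is hypothetical; the argument names simply mirror the variables `file_pdf` and `file_text` used above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths that pdf_parse.py expects.

    `parse_args` is a hypothetical helper, not part of pdfminer.
    """
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF with pdfminer")
    parser.add_argument("file_pdf", help="input PDF file")
    parser.add_argument("file_text", help="output text file")
    # argv=None would make argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


# Example with an explicit argument list, same names as in the run command.
args = parse_args(["cities.pdf", "result.txt"])
print(args.file_pdf, args.file_text)
```

With this in place, a missing argument produces a usage message and a non-zero exit status rather than a traceback.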

How to run

./pdf_parse.py cities.pdf result.txt
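
After the run, all pages end up in the single file `result.txt`. The sketch below shows one way to split that text back into pages. It assumes that the converter writes a form feed character (`\f`) after each page, which is my understanding of how `TextConverter` in pdfminer.six behaves; check this against your own output. The helper `split_pages` and the sample string are made up for illustration.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feed characters.

    Assumption: each page in the extracted text ends with a form feed.
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty final element; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


# Made-up sample standing in for the contents of result.txt.
sample = "Tokyo\nOsaka\n\fNagoya\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, len(page.splitlines()))
```

For a real file, the text could be read with `open("result.txt", encoding="utf-8").read()` and passed to `split_pages`.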

Installing the library on Arch Linux

yay -S python-pdfminer.six


Installing the library on Ubuntu

```bash
sudo apt install python3-pdfminer
```