
Trying Out PDFMiner


Introduction

I needed to parse a PDF file.
For now, I wanted to do it in Python.
A library called PDFMiner looked handy, so I gave it a try.

PDFMiner

http://www.unixuser.org/~euske/python/pdfminer/index.html
https://github.com/euske/pdfminer/

Creating a Python 2 environment

I only had a Python 3 environment on hand, so I added a Python 2 environment.

$ pyenv install 2.7.13
...(output omitted)...
$ pyenv local 2.7.13
$ python --version
Python 2.7.13

That worked.
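
Because this article runs PDFMiner under Python 2, a script that depends on it could check the interpreter version before doing anything else. The following is only a minimal sketch; the helper name `is_python2` is hypothetical and is not part of PDFMiner or pyenv.

```python
import sys


def is_python2(version_info=None):
    """Return True if the given (or current) interpreter is Python 2.

    `version_info` is any sequence whose first element is the major
    version, e.g. sys.version_info or a tuple such as (2, 7, 13).
    """
    if version_info is None:
        version_info = sys.version_info
    return version_info[0] == 2


if __name__ == "__main__":
    # Report which interpreter the active pyenv version resolves to.
    print("Python 2" if is_python2() else "not Python 2")
```

Run from the directory where `pyenv local 2.7.13` was applied, this should report Python 2.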

Parsing a PDF with PDFMiner

Installation

$ git clone https://github.com/euske/pdfminer.git
Cloning into 'pdfminer'...
remote: Counting objects: 3164, done.
remote: Total 3164 (delta 0), reused 0 (delta 0), pack-reused 3164
Receiving objects: 100% (3164/3164), 6.01 MiB | 406.00 KiB/s, done.
Resolving deltas: 100% (2245/2245), done.
$ cd ./pdfminer
$ make cmap
...(output omitted)...
$ python ./setup.py install
...(output omitted)...
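
One way to confirm that the install step left the package importable is to try importing it from the same interpreter. This is a minimal sketch, assuming the package is installed under the module name `pdfminer`; the helper `module_available` is a hypothetical name introduced here for illustration.

```python
def module_available(name):
    """Return True if the module `name` can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Prints True only if `python ./setup.py install` targeted this interpreter.
    print(module_available("pdfminer"))
```

If this prints False, the install may have gone to a different Python version than the one pyenv currently selects.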

A quick test

Let's try it out.

$ cat ./samples/simple1.pdf | head
%PDF-1.4
1 0 obj
<<
 /Type /Catalog
 /Outlines 2 0 R
 /Pages 3 0 R
>>
endobj
2 0 obj
<<
$ ./tools/pdf2txt.py ./samples/simple1.pdf
Hello

World

Hello

World

H e l l o

W o r l d

H e l l o

W o r l d


So the pdf2txt.py tool appears to work without any problems.
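
The `head` output above shows that the sample file begins with a `%PDF-1.4` header line. As a small companion to that check, the sketch below reads the version from the header using only the standard library. It assumes the header sits at the very start of the file, as it does in the sample; the function names `pdf_version` and `read_pdf_version` are hypothetical and are not part of PDFMiner.

```python
import re

# Matches a header such as b"%PDF-1.4" at the very start of the data.
_HEADER_RE = re.compile(br"%PDF-(\d+\.\d+)")


def pdf_version(data):
    """Return the version from leading PDF bytes (e.g. '1.4'), or None."""
    match = _HEADER_RE.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii")


def read_pdf_version(path):
    """Read the first few bytes of the file at `path` and parse the version."""
    with open(path, "rb") as fp:
        return pdf_version(fp.read(16))


if __name__ == "__main__":
    # Header bytes copied from the head output of simple1.pdf.
    print(pdf_version(b"%PDF-1.4\n1 0 obj\n"))
```

Calling `read_pdf_version("./samples/simple1.pdf")` from the cloned repository should give the same version that the `head` output shows.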
