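pdf2htmlEX looks like a handy tool when you want to convert a PDF to HTML and then parse it with something like Nokogiri.
It is available in Homebrew, so I ran brew install pdf2htmlex.
The install finished, but with the following warning.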
$ brew install pdf2htmlex
==> Installing dependencies for pdf2htmlex: sqlite, openssl, python, glib
==> Installing pdf2htmlex dependency: sqlite
...snip...
==> Installing pdf2htmlex
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/pdf2htmlex-
######################################################################## 100.0%
==> Pouring pdf2htmlex-0.12.mavericks.bottle.1.tar.gz
Warning: pdf2htmlex dependency libtiff was built with a different C++ standard
library (libstdc++ from clang). This may cause problems at runtime.
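🍺  /usr/local/Cellar/pdf2htmlex/0.12: 1421 files, 34M
Suspecting that something had gone wrong, I ran the pdf2htmlEX command anyway. Sure enough,
it failed with an error saying that libtiff, which libpoppler.48.dylib references, could not be loaded.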
$ pdf2htmlEX
dyld: Library not loaded: /usr/local/lib/libtiff.5.dylib
Referenced from: /usr/local/lib/libpoppler.48.dylib
Reason: image not found
Trace/BPT trap: 5
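It says libtiff.5.dylib cannot be found. Indeed, ls shows no such file under /usr/local/lib.
Apparently the related libraries are no longer visible as a result of upgrading from Mac OS X 10.8 to 10.9.
Let's deal with it by reinstalling them.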
$ brew install libtiff
Warning: libtiff-4.0.3 already installed
$ brew uninstall libtiff
Uninstalling /usr/local/Cellar/libtiff/4.0.3...
$ brew install libtiff
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/libtiff-4.0
######################################################################## 100.0%
==> Pouring libtiff-4.0.3.mavericks.bottle.tar.gz
🍺  /usr/local/Cellar/libtiff/4.0.3: 254 files, 3.8M
Trying again, this time it complains that libjpeg cannot be found.
$ pdf2htmlEX
dyld: Library not loaded: /usr/local/lib/libjpeg.8.dylib
Referenced from: /usr/local/lib/libpoppler.48.dylib
Reason: image not found
Trace/BPT trap: 5
As before, uninstall it and install it again.
$ brew uninstall libjpeg
Uninstalling /usr/local/Cellar/jpeg/8d...
$ brew install libjpeg
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/jpeg-8d.mav
######################################################################## 100.0%
==> Pouring jpeg-8d.mavericks.bottle.2.tar.gz
🍺  /usr/local/Cellar/jpeg/8d: 18 files, 780K
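The missing libraries above surfaced one at a time, each only after the previous one had been fixed. As a rough sketch, something like the following could list every missing dependency of a library in one pass. The function name missing_dylibs is made up for illustration. It assumes the usual otool -L output format (indented lines that start with an absolute path) and paths without spaces, and it skips entries such as @rpath references.

```shell
#!/bin/sh
# Hypothetical helper: read `otool -L`-style output on stdin and print
# every referenced library path that does not exist on disk.
missing_dylibs() {
  # Dependency lines are indented and look like:
  #   /usr/local/lib/libtiff.5.dylib (compatibility version 8.0.0, ...)
  # The first, unindented line names the inspected file and is skipped.
  awk '/^[ \t]+\// { print $1 }' | while IFS= read -r lib; do
    [ -e "$lib" ] || printf '%s\n' "$lib"
  done
}

# Usage on OS X (otool ships with the Xcode command line tools):
#   otool -L /usr/local/lib/libpoppler.48.dylib | missing_dylibs
```

Any path it prints is a candidate for the same brew uninstall and brew install treatment.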
This time for sure. It finally works!
Nice work getting this far.
$ pdf2htmlEX
Usage: pdf2htmlEX [options] <input.pdf> [<output.html>]
-f,--first-page <int> first page to convert (default: 1)
-l,--last-page <int> last page to convert (default: 2147483647)
--zoom <fp> zoom ratio
--fit-width <fp> fit width to <fp> pixels
--fit-height <fp> fit height to <fp> pixels
--use-cropbox <int> use CropBox instead of MediaBox (default: 1)
--hdpi <fp> horizontal resolution for graphics in DPI (default: 144)
--vdpi <fp> vertical resolution for graphics in DPI (default: 144)
--embed <string> specify which elements should be embedded into output
--embed-css <int> embed CSS files into output (default: 1)
--embed-font <int> embed font files into output (default: 1)
--embed-image <int> embed image files into output (default: 1)
--embed-javascript <int> embed JavaScript files into output (default: 1)
--embed-outline <int> embed outlines into output (default: 1)
--split-pages <int> split pages into separate files (default: 0)
--dest-dir <string> specify destination directory (default: ".")
--css-filename <string> filename of the generated css file (default: "")
--page-filename <string> filename template for split pages (default: "")
--outline-filename <string> filename of the generated outline file (default: "")
--process-nontext <int> render graphics in addition to text (default: 1)
--process-outline <int> show outline in HTML (default: 1)
--process-annotation <int> show annotation in HTML (default: 0)
--printing <int> enable printing support (default: 1)
--fallback <int> output in fallback mode (default: 0)
--tmp-file-size-limit <int> Maximum size (in KB) used by temporary files, -1 for no limit. (default: -1)
--embed-external-font <int> embed local match for external fonts (default: 1)
--font-format <string> suffix for embedded font files (ttf,otf,woff,svg) (default: "woff")
--decompose-ligature <int> decompose ligatures, such as ﬁ -> fi (default: 0)
--auto-hint <int> use fontforge autohint on fonts without hints (default: 0)
--external-hint-tool <string> external tool for hinting fonts (overrides --auto-hint) (default: "")
--stretch-narrow-glyph <int> stretch narrow glyphs instead of padding them (default: 0)
--squeeze-wide-glyph <int> shrink wide glyphs instead of truncating them (default: 1)
--override-fstype <int> clear the fstype bits in TTF/OTF fonts (default: 0)
--process-type3 <int> convert Type 3 fonts for web (experimental) (default: 0)
--heps <fp> horizontal threshold for merging text, in pixels (default: 1)
--veps <fp> vertical threshold for merging text, in pixels (default: 1)
--space-threshold <fp> word break threshold (threshold * em) (default: 0.125)
--font-size-multiplier <fp> a value greater than 1 increases the rendering accuracy (default: 4)
--space-as-offset <int> treat space characters as offsets (default: 0)
--tounicode <int> how to handle ToUnicode CMaps (0=auto, 1=force, -1=ignore) (default: 0)
--optimize-text <int> try to reduce the number of HTML elements used for text (default: 0)
--correct-text-visibility <int> try to detect texts covered by other graphics and properly arrange them (default: 0)
--bg-format <string> specify background image format (default: "png")
--svg-node-count-limit <int> if node count in a svg background image exceeds this limit, fall back this page to bitmap background; negative value means no limit. (default: -1)
--svg-embed-bitmap <int> 1: embed bitmaps in svg background; 0: dump bitmaps to external files if possible. (default: 1)
-o,--owner-password <string> owner password (for encrypted files)
-u,--user-password <string> user password (for encrypted files)
--no-drm <int> override document DRM settings (default: 0)
--clean-tmp <int> remove temporary files after conversion (default: 1)
--tmp-dir <string> specify the location of temporary directory. (default: "/var/folders/20/tbf8_j6s33954cpxwrqft8sr0000gn/T/")
--data-dir <string> specify data directory (default: "/usr/local/Cellar/pdf2htmlex/0.12/share/pdf2htmlEX")
--debug <int> print debugging information (default: 0)
--proof <int> texts are drawn on both text layer and background for proof. (default: 0)
-v,--version print copyright and version info
-h,--help print usage information
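To make the usage text above a little more concrete, here is a minimal sketch of converting a page range into a separate output directory. It uses only the --first-page, --last-page and --dest-dir options listed in the help output. The wrapper name convert_pages and its argument order are made up for illustration, and the sketch assumes the output file name is resolved relative to --dest-dir.

```shell
#!/bin/sh
# Hypothetical wrapper: convert a page range of a PDF to one HTML file.
# Arguments: input PDF, output directory, first page, last page.
convert_pages() {
  pdf=$1
  outdir=$2
  first=$3
  last=$4

  # Create the destination directory up front so the tool can write into it.
  mkdir -p "$outdir" || return 1

  # Name the output after the input, e.g. report.pdf -> report.html.
  pdf2htmlEX --first-page "$first" --last-page "$last" \
             --dest-dir "$outdir" \
             "$pdf" "$(basename "$pdf" .pdf).html"
}

# Example: convert pages 1 to 3 of report.pdf into ./out/report.html
#   convert_pages report.pdf out 1 3
```

The resulting HTML file is what you would then hand to a parser such as Nokogiri.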
How do I use it from Ruby?
It looks like this can be done with a gem called kristin.
- kristin
(Convert PDF docs to beautiful HTML files without losing text or format. This gem uses pdf2htmlEX to do the conversion.)
https://github.com/ricn/kristin
How do I run it on Heroku?
It looks like this buildpack would make it work.
https://github.com/rricard/heroku-buildpack-dpkg
However, it seems that heroku-buildpack-dpkg/test/Debfile
needs to be updated to match the latest "Overview of published packages"
on the following page.
https://launchpad.net/~coolwanglu/+archive/ubuntu/pdf2htmlex
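Further reading
- "pdf2htmlEX", which converts any PDF to HTML, is amazing (in Japanese)
http://www.softantenna.com/wp/software/pdf2htmlex/
- pdf2htmlEX, PDF conversion software that makes you want to ask "Is this really HTML?" (MOONGIFT, in Japanese)
http://www.moongift.jp/2012/09/20120926/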