How to brew install pdf2htmlEX (a command that converts PDF to HTML) on OS X

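pdf2htmlEX looks like a handy tool when you need to convert a PDF to HTML and then parse the result with something like Nokogiri.
It is available in Homebrew, so I ran brew install pdf2htmlex. The install finished, but only after printing the following warning.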

$ brew install pdf2htmlex
==> Installing dependencies for pdf2htmlex: sqlite, openssl, python, glib
==> Installing pdf2htmlex dependency: sqlite

...snip...

==> Installing pdf2htmlex
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/pdf2htmlex-
######################################################################## 100.0%
==> Pouring pdf2htmlex-0.12.mavericks.bottle.1.tar.gz
Warning: pdf2htmlex dependency libtiff was built with a different C++ standard
library (libstdc++ from clang). This may cause problems at runtime.
🍺  /usr/local/Cellar/pdf2htmlex/0.12: 1421 files, 34M

I suspected the install had not gone well, and sure enough, running the pdf2htmlEX command failed
with an error saying that libtiff, which libpoppler.48.dylib references, could not be loaded.

$ pdf2htmlEX
dyld: Library not loaded: /usr/local/lib/libtiff.5.dylib
  Referenced from: /usr/local/lib/libpoppler.48.dylib
  Reason: image not found
Trace/BPT trap: 5

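So libtiff.5.dylib cannot be found. Running ls confirms that the file really is not there.
It seems that upgrading from Mac OS X 10.8 to 10.9 left the related libraries no longer visible.
Let's fix this by reinstalling them.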

$ brew install libtiff
Warning: libtiff-4.0.3 already installed

$ brew uninstall libtiff
Uninstalling /usr/local/Cellar/libtiff/4.0.3...

$ brew install libtiff
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/libtiff-4.0
######################################################################## 100.0%
==> Pouring libtiff-4.0.3.mavericks.bottle.tar.gz
🍺  /usr/local/Cellar/libtiff/4.0.3: 254 files, 3.8M

Trying again, this time it complains that libjpeg cannot be found.

$ pdf2htmlEX
dyld: Library not loaded: /usr/local/lib/libjpeg.8.dylib
  Referenced from: /usr/local/lib/libpoppler.48.dylib
  Reason: image not found
Trace/BPT trap: 5
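
Because dyld reports only one missing library per run, it may be quicker to list every missing dependency in one pass. The following is a minimal sketch, not something from the original troubleshooting session. missing_libs is a hypothetical helper name, and the commented usage line assumes otool (from the Xcode command line tools) is installed.

```shell
#!/bin/sh
# missing_libs (hypothetical helper): read file paths from stdin, one per
# line, and print only those that do not exist on disk.
missing_libs() {
  while IFS= read -r path; do
    [ -n "$path" ] || continue              # skip empty lines
    [ -e "$path" ] || printf '%s\n' "$path" # report paths that are absent
  done
}

# Usage on OS X (assumes otool is available): list the absolute paths that
# libpoppler links against and report the ones that are missing.
# otool -L /usr/local/lib/libpoppler.48.dylib \
#   | awk 'NR > 1 && $1 ~ /^\// { print $1 }' | missing_libs
```

Each path printed this way is a candidate for the uninstall and install cycle shown here.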

As before, uninstall it and then install it again.

$ brew uninstall libjpeg
Uninstalling /usr/local/Cellar/jpeg/8d...

$ brew install libjpeg
==> Downloading https://downloads.sf.net/project/machomebrew/Bottles/jpeg-8d.mav
######################################################################## 100.0%
==> Pouring jpeg-8d.mavericks.bottle.2.tar.gz
🍺  /usr/local/Cellar/jpeg/8d: 18 files, 780K
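
The same uninstall and install pair was needed for both libraries, so it can be wrapped in a small loop. This is only a sketch: reinstall_formulae and the BREW variable are hypothetical names introduced here, with BREW existing solely so the loop can be dry-run (for example with BREW=echo) before touching a real Homebrew installation.

```shell
#!/bin/sh
# BREW is a hypothetical override; it defaults to the real brew command.
BREW="${BREW:-brew}"

# reinstall_formulae (hypothetical helper): uninstall and then install each
# formula given as an argument, stopping at the first failure.
reinstall_formulae() {
  for formula in "$@"; do
    "$BREW" uninstall "$formula" && "$BREW" install "$formula" || return 1
  done
}

# Usage, with the formula names used in this article:
# reinstall_formulae libtiff libjpeg
```

Running it with BREW=echo prints the brew subcommands that would be executed instead of running them.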

One more try. At last, it runs!
Well done.

$ pdf2htmlEX
Usage: pdf2htmlEX [options] <input.pdf> [<output.html>]
  -f,--first-page <int>         first page to convert (default: 1)
  -l,--last-page <int>          last page to convert (default: 2147483647)
  --zoom <fp>                   zoom ratio
  --fit-width <fp>              fit width to <fp> pixels
  --fit-height <fp>             fit height to <fp> pixels
  --use-cropbox <int>           use CropBox instead of MediaBox (default: 1)
  --hdpi <fp>                   horizontal resolution for graphics in DPI (default: 144)
  --vdpi <fp>                   vertical resolution for graphics in DPI (default: 144)
  --embed <string>              specify which elements should be embedded into output
  --embed-css <int>             embed CSS files into output (default: 1)
  --embed-font <int>            embed font files into output (default: 1)
  --embed-image <int>           embed image files into output (default: 1)
  --embed-javascript <int>      embed JavaScript files into output (default: 1)
  --embed-outline <int>         embed outlines into output (default: 1)
  --split-pages <int>           split pages into separate files (default: 0)
  --dest-dir <string>           specify destination directory (default: ".")
  --css-filename <string>       filename of the generated css file (default: "")
  --page-filename <string>      filename template for split pages  (default: "")
  --outline-filename <string>   filename of the generated outline file (default: "")
  --process-nontext <int>       render graphics in addition to text (default: 1)
  --process-outline <int>       show outline in HTML (default: 1)
  --process-annotation <int>    show annotation in HTML (default: 0)
  --printing <int>              enable printing support (default: 1)
  --fallback <int>              output in fallback mode (default: 0)
  --tmp-file-size-limit <int>   Maximum size (in KB) used by temporary files, -1 for no limit. (default: -1)
  --embed-external-font <int>   embed local match for external fonts (default: 1)
  --font-format <string>        suffix for embedded font files (ttf,otf,woff,svg) (default: "woff")
  --decompose-ligature <int>    decompose ligatures, such as ﬁ -> fi (default: 0)
  --auto-hint <int>             use fontforge autohint on fonts without hints (default: 0)
  --external-hint-tool <string> external tool for hinting fonts (overrides --auto-hint) (default: "")
  --stretch-narrow-glyph <int>  stretch narrow glyphs instead of padding them (default: 0)
  --squeeze-wide-glyph <int>    shrink wide glyphs instead of truncating them (default: 1)
  --override-fstype <int>       clear the fstype bits in TTF/OTF fonts (default: 0)
  --process-type3 <int>         convert Type 3 fonts for web (experimental) (default: 0)
  --heps <fp>                   horizontal threshold for merging text, in pixels (default: 1)
  --veps <fp>                   vertical threshold for merging text, in pixels (default: 1)
  --space-threshold <fp>        word break threshold (threshold * em) (default: 0.125)
  --font-size-multiplier <fp>   a value greater than 1 increases the rendering accuracy (default: 4)
  --space-as-offset <int>       treat space characters as offsets (default: 0)
  --tounicode <int>             how to handle ToUnicode CMaps (0=auto, 1=force, -1=ignore) (default: 0)
  --optimize-text <int>         try to reduce the number of HTML elements used for text (default: 0)
  --correct-text-visibility <int> try to detect texts covered by other graphics and properly arrange them (default: 0)
  --bg-format <string>          specify background image format (default: "png")
  --svg-node-count-limit <int>  if node count in a svg background image exceeds this limit, fall back this page to bitmap background; negative value means no limit. (default: -1)
  --svg-embed-bitmap <int>      1: embed bitmaps in svg background; 0: dump bitmaps to external files if possible. (default: 1)
  -o,--owner-password <string>  owner password (for encrypted files)
  -u,--user-password <string>   user password (for encrypted files)
  --no-drm <int>                override document DRM settings (default: 0)
  --clean-tmp <int>             remove temporary files after conversion (default: 1)
  --tmp-dir <string>            specify the location of temporary directory. (default: "/var/folders/20/tbf8_j6s33954cpxwrqft8sr0000gn/T/")
  --data-dir <string>           specify data directory (default: "/usr/local/Cellar/pdf2htmlex/0.12/share/pdf2htmlEX")
  --debug <int>                 print debugging information (default: 0)
  --proof <int>                 texts are drawn on both text layer and background for proof. (default: 0)
  -v,--version                  print copyright and version info
  -h,--help                     print usage information
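
To show how the options above combine with the usage line, here is a small sketch that converts a page range into a chosen output directory. It uses only flags listed in the help output (--first-page, --last-page, --dest-dir). convert_pages, the PDF2HTMLEX override variable, and the file name report.pdf are hypothetical names for illustration, and the output directory is assumed to exist already.

```shell
#!/bin/sh
# PDF2HTMLEX is a hypothetical override so the call can be dry-run
# (e.g. PDF2HTMLEX=echo); it defaults to the real pdf2htmlEX command.
PDF2HTMLEX="${PDF2HTMLEX:-pdf2htmlEX}"

# convert_pages (hypothetical helper): convert pages FIRST..LAST of INPUT
# and write the result into OUT_DIR.
# Arguments: INPUT OUT_DIR FIRST LAST
convert_pages() {
  input=$1
  out_dir=$2
  first=$3
  last=$4
  "$PDF2HTMLEX" --first-page "$first" --last-page "$last" \
    --dest-dir "$out_dir" "$input"
}

# Usage: convert pages 1 to 3 of report.pdf into ./out
# convert_pages report.pdf out 1 3
```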

How to use it from Ruby?

It looks like this can be done with a gem called kristin.

  • kristin (Convert PDF docs to beautiful HTML files without losing text or format. This gem uses pdf2htmlEX to do the conversion.)
    https://github.com/ricn/kristin

How to run it on Heroku?

It looks like it should work with this buildpack.
https://github.com/rricard/heroku-buildpack-dpkg

However, heroku-buildpack-dpkg/test/Debfile will probably need to be updated to match
the latest "Overview of published packages" listed on the following page.
https://launchpad.net/~coolwanglu/+archive/ubuntu/pdf2htmlex
