Ligature handling (pdftotext)

Posted at 2022-04-22. Last updated at 2023-05-04.

ligature

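Reading Shinichi Mochizuki's ABC conjecture papers: a basic routine recommended for newcomers

While working on the article above, I stumbled over the handling of the "fi" character. Apparently it is called a ligature.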

I looked it up right away. Spend five minutes searching before you ask a question.

# pdftotext

Reading PDF text in the terminal

Trying out various ways to extract text from PDF

Dead link, but it does tell us the name: ligature.

PDF and Characters (43): Ligatures in the Latin Alphabet

A Comparison of Utilities for converting from PostScript or Portable Document Format to Text
Author: Nicholas Robinson

Cleaning up pdftotext font issues

bash

# Emit 7-bit ASCII; a ligature such as "fi" should then come out as separate letters
pdftotext -enc ASCII7 input.pdf output.txt
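
If ASCII7 is not an option (for example, because the rest of the text has to stay in UTF-8), the ligatures can also be expanded after extraction. The following is only a minimal sketch under that assumption: `expand_ligatures` is a hypothetical helper name, not a pdftotext feature, and it covers only the five Latin ligatures U+FB00 to U+FB04.

```bash
# Hypothetical helper (not part of pdftotext): read text on stdin and
# replace the Latin ligature code points U+FB00..U+FB04 with plain letters.
expand_ligatures() {
  sed -e 's/ﬀ/ff/g' \
      -e 's/ﬁ/fi/g' \
      -e 's/ﬂ/fl/g' \
      -e 's/ﬃ/ffi/g' \
      -e 's/ﬄ/ffl/g'
}

# Possible usage, with "-" sending the extracted text to stdout:
#   pdftotext input.pdf - | expand_ligatures > output.txt
```

Each ligature is its own code point, so the order of the substitutions does not matter. Other ligatures or symbols in the PDF are left untouched.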

Other documents

Count the number of words in a PDF file

bash

# Write the extracted text to stdout ("-") and count the words
pdftotext myfile.pdf - | wc -w
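
The same pipeline idea can be used to check whether an extracted text still contains ligature characters, by counting them instead of words. This is a rough sketch that assumes UTF-8 output: `count_ligatures` is a hypothetical name, and `grep -o` is assumed to be available (GNU grep and BSD grep both provide it).

```bash
# Hypothetical helper: count occurrences of the Latin ligature code points
# U+FB00..U+FB04 in the text read from stdin.
count_ligatures() {
  # grep -o prints every match on its own line, so wc -l counts matches.
  # grep exits non-zero when nothing matches; that is not an error here.
  { grep -o -e 'ﬀ' -e 'ﬁ' -e 'ﬂ' -e 'ﬃ' -e 'ﬄ' || true; } | wc -l | tr -d ' '
}

# Possible usage:
#   pdftotext myfile.pdf - | count_ligatures
```

A result of 0 suggests that no cleanup is needed for these five ligatures.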

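Restoring TeX Symbols in PDF and Its Application to the ACL Anthology, Hideki Isozaki
Faculty of Computer Science and Systems Engineering, Okayama Prefectural University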

pdftotext version 0.86.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>             : first page to convert
  -l <int>             : last page to convert
  -r <fp>              : resolution, in DPI (default is 72)
  -x <int>             : x-coordinate of the crop area top left corner
  -y <int>             : y-coordinate of the crop area top left corner
  -W <int>             : width of crop area in pixels (default is 0)
  -H <int>             : height of crop area in pixels (default is 0)
  -layout              : maintain original physical layout
  -fixed <fp>          : assume fixed-pitch (or tabular) text
  -raw                 : keep strings in content stream order
  -nodiag              : discard diagonal text
  -htmlmeta            : generate a simple HTML file, including the meta information
  -enc <string>        : output text encoding name
  -listenc             : list available encodings
  -eol <string>        : output end-of-line convention (unix, dos, or mac)
  -nopgbrk             : don't insert page breaks between pages
  -bbox                : output bounding box for each word and page size to html.  Sets -htmlmeta
  -bbox-layout         : like -bbox but with extra layout bounding box data.  Sets -htmlmeta
  -opw <string>        : owner password (for encrypted files)
  -upw <string>        : user password (for encrypted files)
  -q                   : don't print any messages or errors
  -v                   : print copyright and version info
  -h                   : print usage information
  -help                : print usage information
  --help               : print usage information
  -?                   : print usage information

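＜This article reflects my personal impressions, based on my own past experience. It has nothing to do with the organization I currently belong to or with my work there.＞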

Document history

ver. 0.01 First draft 20220413
ver. 0.02 Added thank-you note 20230504

Thank you very much for reading to the end.

Please press the like icon 💚 and follow me.
