Processing PDF files: converting each page to a single line of text (shell script)

Posted at 2021-06-23

Hello.
I wrote a shell script that uses the pdftotext command to convert a PDF file into text with exactly one line per page.

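Using it, I searched for a given word and obtained the numbers of the pages on which it appears. Because each output line corresponds to one page, the line numbers reported by `grep -n` are the page numbers.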

$ ./pdftotext.sh example.pdf | grep -n -e "search_word" | cut -d: -f1
1
2
$
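The search half of this pipeline can be tried without a PDF. The following is a minimal sketch in which three hand-made lines stand in for the script's output; the sample text and the helper name `page_numbers` are assumptions for illustration and are not part of the original script. It also uses `grep -F` so that the word is matched as a fixed string rather than as a regular expression.

```shell
#!/bin/sh
# page_numbers (hypothetical helper): read one-line-per-page text on stdin
# and print the line numbers (that is, page numbers) of the lines that
# contain the fixed string given as $1.
page_numbers() {
  grep -n -F -e "$1" | cut -d: -f1
}

# Three fake "pages", one per line; the word appears on pages 1 and 3.
printf '%s\n' 'alpha search_word' 'beta' 'search_word gamma' | page_numbers search_word
```

With real input, the `printf` would be replaced by `./pdftotext.sh example.pdf`.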
pdftotext.sh
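#!/bin/sh

# constant: the form feed character that pdftotext emits as a page break
FF=$(printf '\f')

# functions

# Print $1 without a leading form feed.
# Return 0 if a form feed was removed, 1 otherwise.
remove_form_feed_f() {
  str=${1#"$FF"}
  printf '%s\n' "$str"
  [ "$str" != "$1" ]
}

# Print the text of the PDF file $1, one output line per page.
pdftotext_f() {
  file="$1"
  pdftotext "$file" - | while read -r line; do
    # A leading form feed marks the first line of a new page,
    # so terminate the previous page's line before printing.
    line=$(remove_form_feed_f "$line") && echo
    printf "%s" "$line"
  done
  echo
}

# main: skip directories and files without a .pdf extension
for file in "$@"; do
  if [ -d "$file" ] || [ "${file##*.}" != "pdf" ]; then
    continue
  fi
  pdftotext_f "$file"
done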
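The page-joining loop can also be checked in isolation. This sketch feeds a hand-made stream through a simplified version of the same loop, assuming (as the script does) that every page ends with a form feed, so that the form feed shows up at the start of the next page's first line. The helper name `join_pages` and the sample input are hypothetical.

```shell
#!/bin/sh
FF=$(printf '\f')

# join_pages (hypothetical helper): read text on stdin and print one line
# per page, treating a leading form feed as the start of a new page.
join_pages() {
  while read -r line; do
    case "$line" in
      "$FF"*)
        echo                 # finish the previous page's line
        line=${line#"$FF"}   # drop the page-break character
        ;;
    esac
    printf '%s' "$line"
  done
  echo                       # finish the last page's line
}

# Two fake pages, "a b" and "c d", each terminated by a form feed.
printf 'a\nb\n\fc\nd\n\f' | join_pages
```

The final form feed arrives without a trailing newline, so `read` reports end of input and the loop never sees it; the closing `echo` then ends the last page. Note that lines within a page are joined with no separator, exactly as in the script above.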