More than 3 years have passed since last update.

PyMuPDF: I tried to extract page contents as JSON, got "Object of type bytes is not JSON serializable", and fixed it


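I needed to get at things like the font settings of the text, so I tried extracting each page in JSON form.
But json.dumps failed with "Object of type bytes is not JSON serializable", so I worked at it until it ran; the result is the code below.
The key is passing default=format_for_json.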

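import base64
import json

import fitz  # PyMuPDF


def format_for_json(obj):
    # json.dumps calls this for any value it cannot serialize by itself.
    # The page dict contains raw bytes (e.g. image data), so emit them as Base64 text.
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(obj).decode("ascii")
    # Anything else is still an error, as it would be without the hook.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


doc = fitz.open("hogefuga.pdf")
for page in doc:
    page_dic = page.get_text("dict")
    json_test = json.dumps(page_dic, ensure_ascii=False, default=format_for_json)
    print(json_test)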
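
To see what the default hook does without needing a PDF, here is a minimal sketch that runs on the standard library alone. The sample_page dictionary is a hand-made stand-in that only imitates the blocks / lines / spans layout of a page dictionary; it is not real PyMuPDF output, and the names bytes_to_base64 and sample_page are made up for this example. It serializes the dictionary, reads the JSON back, collects the font information from the spans, and decodes the Base64 text back into bytes.

```python
import base64
import json


def bytes_to_base64(obj):
    """Fallback for json.dumps: encode bytes as Base64 text, reject anything else."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(obj).decode("ascii")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Hand-made stand-in for one page dictionary (an assumption, not real PyMuPDF output).
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"font": "Helvetica", "size": 11.0, "text": "Hello"},
        ]}]},
        {"type": 1, "ext": "png", "image": b"\x89PNG\r\n"},
    ],
}

# The hook is only called for values json.dumps cannot handle, here the bytes.
as_json = json.dumps(sample_page, ensure_ascii=False, default=bytes_to_base64)
restored = json.loads(as_json)

# Font details sit on each span of the text blocks (type 0).
fonts = [
    (span["font"], span["size"], span["text"])
    for block in restored["blocks"] if block["type"] == 0
    for line in block["lines"]
    for span in line["spans"]
]

# The image block (type 1) now holds Base64 text; decoding it restores the bytes.
image_bytes = base64.b64decode(restored["blocks"][1]["image"])
```

Base64 makes the JSON roughly a third larger than the raw bytes. If only the font information is needed, dropping the image blocks before calling json.dumps may be simpler than encoding them.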

References

I combined the approaches from these two pages:
https://www.yoheim.net/blog.php?q=20170703
https://qiita.com/Haaamaaaaa/items/54bdb372d0e58a976a55
