Extracting text from Aozora Bunko EPUB files and combining it into a single text file

Last updated at 2019-09-21, posted at 2019-09-21

Introduction

This is a memo on extracting the text from an Aozora Bunko EPUB and combining it into a single text file. I found the Aozora Bunko EPUBs here:
https://bookwalker.jp/ex/sp/aozora/

I also ran morphological analysis on the extracted text, but that part is left out here.

Code


import ebooklib
from ebooklib import epub 
from bs4 import BeautifulSoup

# Load the EPUB file (replace the placeholder with the name of the downloaded file)
book = epub.read_epub('name_of_downloaded_file.epub')

# Read the metadata (each call returns a list, so the prints below show lists)
title = book.get_metadata('DC', 'title')
creator = book.get_metadata('DC', 'creator')
publisher = book.get_metadata('DC', 'publisher')
language = book.get_metadata('DC', 'language')

# Title
print(title)
# Author
print(creator)
# Publisher
print(publisher)
# Language
print(language)

###############################

# Function that joins the files
# The text is split into one file per chapter, so everything is combined into a single text file at the end
def join_file(filepath):
    with open(filepath, 'wb') as savefile:
        for f in textfileList:
            # Copy each chapter file as raw bytes, closing it after reading
            with open(f, 'rb') as chapter:
                savefile.write(chapter.read())

# Initial value of the counter used for sequential file names
i = 1

textfileList = []
# The final output goes to this file
filepath = "all_text.txt"

# Get every item contained in the EPUB
items = book.get_items()

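for item in items:
    # Process only the items that contain document content
    if item.get_type() == ebooklib.ITEM_DOCUMENT:
        # Save the content as an HTML file, then convert it to a text file
        path = str(i) + '.html'
        path_txt = str(i) + '.txt'
        # Advance the counter so each chapter gets its own files
        # (without this, every chapter overwrites 1.html and 1.txt)
        i += 1

        with open(path, mode='w', encoding='utf-8') as file:
            # get_content() returns bytes, so decode before writing
            file.write(item.get_content().decode())

        # Read back the HTML file that was just saved
        with open(path, encoding='utf-8') as htmlfile:
            # Parse it with BeautifulSoup, naming the parser explicitly
            soup = BeautifulSoup(htmlfile.read(), 'html.parser')
        # get_text() removes all HTML tags and returns only the text
        line = soup.get_text()
        print(line)

        # Save the extracted text to a text file
        with open(path_txt, mode='w', encoding='utf-8') as file_txt:
            file_txt.write(line)

        # Add the text file to the list of files to join
        textfileList.append(path_txt)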
        
print(textfileList)
# Pass filepath to the join function defined above
join_file(filepath)
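
The intermediate `.html` and `.txt` files are probably not strictly needed. Below is a rough sketch that strips the tags in memory and joins the chapters directly. `html_documents_to_text` is a hypothetical name of my own, and the two sample byte strings only stand in for what `item.get_content()` would return for each document item.

```python
from bs4 import BeautifulSoup


def html_documents_to_text(html_documents, separator="\n"):
    """Strip tags from each HTML document and join the results in order."""
    texts = []
    for html in html_documents:
        # 'html.parser' is Python's built-in parser, so no extra install is needed
        soup = BeautifulSoup(html, "html.parser")
        texts.append(soup.get_text())
    return separator.join(texts)


# Sample chapters standing in for item.get_content() of each ITEM_DOCUMENT
chapters = [
    b"<html><body><p>Chapter one.</p></body></html>",
    b"<html><body><p>Chapter two.</p></body></html>",
]
all_text = html_documents_to_text(chapters)
print(all_text)
```

With a real book, the list could be built as `[item.get_content() for item in book.get_items() if item.get_type() == ebooklib.ITEM_DOCUMENT]`, and the result written once to `all_text.txt`.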

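As far as I can tell, `get_metadata` returns a list of `(value, attributes)` tuples rather than a plain string, so the `print` calls in the code above show something like `[('Some Title', {})]`. A minimal sketch for pulling out just the first value follows. `first_metadata_value` is a hypothetical helper name, and the sample list only stands in for the result of `book.get_metadata('DC', 'title')`.

```python
def first_metadata_value(entries, default=""):
    """Return the text of the first metadata entry, or `default` if none exist.

    `entries` is assumed to be a list of (value, attributes) tuples.
    """
    if not entries:
        return default
    value, _attributes = entries[0]
    return value if value is not None else default


# Sample data standing in for book.get_metadata('DC', 'title')
sample_title = [("吾輩は猫である", {})]
print(first_metadata_value(sample_title))
print(first_metadata_value([], default="(unknown)"))
```
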
Conclusion

It's rough, but that's about it.
This should come in handy when doing morphological analysis.
