Fetching PDF files from Google Drive and converting them to text files

Posted at 2023-06-02

A program that fetches files through the Google Drive API and converts the PDF files to .txt files with the pdfminer library.

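Background

At my university, past exam papers are collected by senior students as PDF files on Drive, and a friend asked me to build a tool that analyzes trends in those past exams.

Code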

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.oauth2 import service_account
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import io



# Load the service account credentials
credentials = service_account.Credentials.from_service_account_file(
    'service.json',
    scopes=['https://www.googleapis.com/auth/drive']
)

# Create the Google Drive API client
drive_service = build('drive', 'v3', credentials=credentials)

# The API service object is ready at this point


folder_id = '1G2sOpExyh8q_wCoi2b3Ey9pq1fEasHF4'

# List the files in the folder (only the first page of results is read)
files = drive_service.files().list(q="'{0}' in parents".format(folder_id)).execute()
file_list = files.get('files', [])


for i, file in enumerate(file_list):
    if file['mimeType'] == 'application/pdf':  # process PDF files only
        file_id = file['id']

        # Download the file into an in-memory buffer, chunk by chunk
        request = drive_service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            status, done = downloader.next_chunk()
        fh.seek(0)  # rewind so pdfminer reads from the start

        # Text file to write the extracted text to
        with open(f'output_{i}.txt', 'w') as output_file:

            # Set the layout analysis parameters
            laparams = LAParams()

            # Create the resource manager that stores shared resources
            resource_manager = PDFResourceManager()

            # Converter that writes the extracted text to output_file
            device = TextConverter(resource_manager, output_file, laparams=laparams)

            # Create the interpreter that processes page contents
            interpreter = PDFPageInterpreter(resource_manager, device)

            # Process each page in the document
            try:
                for page in PDFPage.get_pages(fh):
                    interpreter.process_page(page)
            except Exception as e:
                print(f"An error occurred: {e}")

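import os
import openai

# Read the OpenAI API key from an environment variable instead of hard-coding it
openai.api_key = os.environ['OPENAI_API_KEY']

# Read each text file and generate summaries
for i in range(len(file_list)):
    path = f'output_{i}.txt'
    # Only PDF entries produced an output file, so skip the other indices
    if not os.path.exists(path):
        continue

    with open(path, 'r') as input_file:
        text = input_file.read()

    # Split the text into paragraphs (here, one per line)
    paragraphs = text.split('\n')

    for j, paragraph in enumerate(paragraphs):
        # Skip empty paragraphs
        if not paragraph.strip():
            continue

        # Generate a summary of the paragraph.
        # Note: this is the legacy Completion interface of the openai package (versions before 1.0).
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=f"Summarize the following text:\n\n{paragraph}\n\nSummary:",
            temperature=0.3,
            max_tokens=150
        )

        # Print the summary
        print(f"Summary for paragraph {j} of file {path}: {response.choices[0].text.strip()}")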

Code walkthrough

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.oauth2 import service_account
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import io

Import the required libraries.
Each one is explained in the sections that follow.


# Load the service account credentials
credentials = service_account.Credentials.from_service_account_file(
    'service.json',
    scopes=['https://www.googleapis.com/auth/drive']
)

# Create the Google Drive API client
drive_service = build('drive', 'v3', credentials=credentials)

# The API service object is ready at this point

This creates the Drive API service object.
The service.json file is obtained by creating a service account for the Drive API.

folder_id = '1G2sOpExyh8q_wCoi2b3Ey9pq1fEasHF4'

# List the files in the folder (only the first page of results is read)
files = drive_service.files().list(q="'{0}' in parents".format(folder_id)).execute()
file_list = files.get('files', [])

folder_id is taken from the last part of the Drive folder's URL.
The files in the folder are then stored in a list.
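
files().list returns its results one page at a time, so a large folder may need more than one request. A minimal sketch of paging through the results, and of filtering for PDFs on the server side, might look like the following. The function name list_pdf_files is hypothetical and not part of the original program; q, fields, pageToken, and nextPageToken are Drive API v3 parameters, and service is assumed to be a client like drive_service.

```python
def list_pdf_files(service, folder_id):
    """Return metadata for every PDF directly inside the given Drive folder.

    `service` is assumed to be a Drive v3 client built with build('drive', 'v3', ...).
    The function name is hypothetical, chosen only for this illustration.
    """
    query = (
        f"'{folder_id}' in parents "
        "and mimeType = 'application/pdf' "
        "and trashed = false"
    )
    pdf_files = []
    page_token = None
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name, mimeType)",
            pageToken=page_token,
        ).execute()
        pdf_files.extend(response.get('files', []))
        page_token = response.get('nextPageToken')
        if page_token is None:  # no further pages
            break
    return pdf_files
```

Calling list_pdf_files(drive_service, folder_id) would return only PDF entries, so a later mimeType check in the loop would become redundant.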

for i, file in enumerate(file_list):
    if file['mimeType'] == 'application/pdf':  # process PDF files only
        file_id = file['id']

        # Download the file into an in-memory buffer, chunk by chunk
        request = drive_service.files().get_media(fileId=file_id)
        fh = io.BytesIO()
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            status, done = downloader.next_chunk()
        fh.seek(0)  # rewind so pdfminer reads from the start

        # Text file to write the extracted text to
        with open(f'output_{i}.txt', 'w') as output_file:

            # Set the layout analysis parameters
            laparams = LAParams()

            # Create the resource manager that stores shared resources
            resource_manager = PDFResourceManager()

            # Converter that writes the extracted text to output_file
            device = TextConverter(resource_manager, output_file, laparams=laparams)

            # Create the interpreter that processes page contents
            interpreter = PDFPageInterpreter(resource_manager, device)

            # Process each page in the document
            try:
                for page in PDFPage.get_pages(fh):
                    interpreter.process_page(page)
            except Exception as e:
                print(f"An error occurred: {e}")

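A loop that fetches each file in the folder and converts it to a text file.
It runs only when the file's mimeType is application/pdf.
file['id'] reads the ID from the file's metadata, and that ID is used to request the file's binary data.
io.BytesIO creates an in-memory buffer to hold that binary data.
MediaIoBaseDownload downloads the file into the buffer chunk by chunk, and fh.seek(0) rewinds the buffer to the start. At this point the PDF is finally available locally (in memory).
Next, open the output file for writing
and define the pdfminer objects.
LAParams: recognizes text in blocks such as lines rather than character by character.
PDFResourceManager: stores shared resources such as fonts and images.
TextConverter: converts the content to text.
PDFPageInterpreter: processes the page contents.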
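
The fh.seek(0) call in the download step is easy to overlook. The small sketch below, using only the standard library, shows why it matters: after data is written to a BytesIO buffer the position sits at the end, so a reader sees nothing until the buffer is rewound. The byte string is a made-up stand-in for the downloaded PDF data.

```python
import io

buffer = io.BytesIO()
buffer.write(b'%PDF-1.4 example bytes')  # stands in for the downloaded chunks

# After writing, the position sits at the end, so a read returns nothing.
at_end = buffer.read()

# Rewinding to the start makes the whole content readable again.
buffer.seek(0)
from_start = buffer.read()

print(at_end)      # b''
print(from_start)  # b'%PDF-1.4 example bytes'
```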

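import os
import openai

# Read the OpenAI API key from an environment variable instead of hard-coding it
openai.api_key = os.environ['OPENAI_API_KEY']

# Read each text file and generate summaries
for i in range(len(file_list)):
    path = f'output_{i}.txt'
    # Only PDF entries produced an output file, so skip the other indices
    if not os.path.exists(path):
        continue

    with open(path, 'r') as input_file:
        text = input_file.read()

    # Split the text into paragraphs (here, one per line)
    paragraphs = text.split('\n')

    for j, paragraph in enumerate(paragraphs):
        # Skip empty paragraphs
        if not paragraph.strip():
            continue

        # Generate a summary of the paragraph.
        # Note: this is the legacy Completion interface of the openai package (versions before 1.0).
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=f"Summarize the following text:\n\n{paragraph}\n\nSummary:",
            temperature=0.3,
            max_tokens=150
        )

        # Print the summary
        print(f"Summary for paragraph {j} of file {path}: {response.choices[0].text.strip()}")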

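This part is something of a bonus: it uses the openai module to have an OpenAI GPT model summarize the contents of each text file.
This was my first time integrating the OpenAI API, but with just an account API key it was easy to set up and it can also answer questions, so I want to keep building it into future projects.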
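
Splitting on '\n' sends every line of the text as a separate request. One possible refinement, sketched below, is to treat blank lines as paragraph boundaries and pack paragraphs into chunks under a character budget, so that fewer requests are needed. The function names split_paragraphs and pack_chunks and the 2000-character default are assumptions made for illustration; they are not part of the original program.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs, treating blank lines as boundaries."""
    blocks = (block.strip() for block in re.split(r'\n\s*\n', text))
    return [block for block in blocks if block]

def pack_chunks(paragraphs, max_chars=2000):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    chunks = []
    current = ''
    for paragraph in paragraphs:
        candidate = paragraph if not current else current + '\n\n' + paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk returned by pack_chunks(split_paragraphs(text)) could then be used as the text to summarize in place of a single line.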
