
Text mining from pdf/docx files

Posted at 2018-06-02, last updated at 2018-06-02

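Overview

Documents sent outside the company sometimes contain information that should not be there,
such as code names or department names.
They may also carry a form number or a Copyright year that someone forgot to update.

Normally each file is checked by hand, one at a time.
Word and PDF files in particular take time and effort to open and check,
so I wanted to automate the work and make it more efficient with Python.

docx files
pdf files

This article summarizes a procedure that converts the file contents to text, extracts lines from it grep-style, and checks them.
File contents include not only the body but also the properties, so both are converted to text.

Body
Properties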

Environment

macOS 10.12
Python 3.5.1
pdfminer3k 1.3.1
docx2txt 0.7
python-docx 0.8.6

Details

docx files

The docx2txt module converts the body of a docx file to text.
It cannot extract the properties, so python-docx (imported as docx) is used for those.

Extracting body text

import docx2txt
text = docx2txt.process('sample.docx')
text

Output

'環境マネジメントシステム(EMS：Environmental Management System)\nみちのくEMS規格第3版対応\n環境マネジメントシステム(EMS：Environmental Management System)\nみちのくEMS規格第3版対応\n状況分析・取組表\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t\n\t更新日\u30002017年○月○日\n\t\n\t株式会社サンプル\n\t\n組織の状況とその理解\n戦略的な方向性\n公共工事の入札の競争性の強化\n内部\nStrengths（強み）\nWeaknesses（弱み）\n・優良工事（宮城県）\n・優良工事（仙台市）\n・工事評価点平均の高さ\n・蓄積されたノウハウ\n・地元行政からの信頼\n・管理主体の身軽さ\n・管理と施工一体のフットワーク\n・自己資本\n・自社重機によるフットワーク\n・リースによる最新機種導入\n・協力会社のネットワ
...
(omitted)
...

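Simple grep

import re

keyword = r'(ICT|ドローン)'

# Collapse runs of newlines into a single newline
text = re.sub(r'\n+', '\n', text)

lines = text.split('\n')
for i, line in enumerate(lines):
    if re.search(keyword, line):
        # Print the previous line if there is one
        # (lines[i - 1] would silently wrap around to the last line when i == 0)
        if i > 0:
            print(lines[i - 1])
        print(i, ':', line)
        # Print the next line if there is one
        if i + 1 < len(lines):
            print(lines[i + 1])
        print('----')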

Output

・施工技術の技術革新
55 : （ＩＣＴとドローン）
・現場特有の機会（工事評価点向上の要素）
----
・ＩＣＴ
67 : ・ドローンの最新技術による施工
・無災害施工、工期内竣工、環境対策、品質管理
----
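
In the output above, the line "・ＩＣＴ" appears only as context: it is written in full-width characters, so the ASCII pattern ICT does not match it. The following is a minimal sketch of one way to handle this, assuming it is acceptable to NFKC-normalize each line before matching. The function name grep_with_context and its context parameter are hypothetical, and unlike the loop above it drops whitespace-only lines, so the line numbers it reports can differ.

```python
import re
import unicodedata


def grep_with_context(text, pattern, context=1):
    """Return (line_number, context_lines) for every line matching pattern.

    Each line is NFKC-normalized before matching, so full-width 'ＩＣＴ'
    is treated like ASCII 'ICT'. Blank and whitespace-only lines are dropped.
    """
    lines = [ln for ln in text.split('\n') if ln.strip()]
    regex = re.compile(pattern)
    hits = []
    for i, line in enumerate(lines):
        if regex.search(unicodedata.normalize('NFKC', line)):
            start = max(i - context, 0)  # never wrap around to the end
            hits.append((i, lines[start:i + context + 1]))
    return hits


# Made-up sample text for illustration only
sample = 'alpha\n\n・ＩＣＴ\n・ドローンの最新技術\nomega'
for number, block in grep_with_context(sample, r'(ICT|ドローン)'):
    print(number, block)
```

Returning the hits instead of printing them makes it easier to collect results across many files before reporting.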

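Getting properties

from pprint import pprint

import docx  # provided by the python-docx package

def get_core_properties(doc):
    """Collect every public attribute of doc.core_properties into a dict."""
    dic = {}
    for name in dir(doc.core_properties):
        # Skip private and special attributes
        if name.startswith('_'):
            continue
        # getattr replaces eval; the key keeps the 'doc.core_properties.<name>' form
        key = 'doc.core_properties.{}'.format(name)
        dic[key] = getattr(doc.core_properties, name)
    return dic

doc = docx.Document('sample.docx')
props = get_core_properties(doc)

pprint(props)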

Output

{'doc.core_properties.author': 'PBS Consulting',
 'doc.core_properties.category': '',
 'doc.core_properties.comments': '',
 'doc.core_properties.content_status': '',
 'doc.core_properties.created': datetime.datetime(2017, 9, 13, 7, 48),
 'doc.core_properties.identifier': '',
 'doc.core_properties.keywords': '',
 'doc.core_properties.language': '',
 'doc.core_properties.last_modified_by': '環境会議所東北',
 'doc.core_properties.last_printed': None,
 'doc.core_properties.modified': datetime.datetime(2017, 9, 13, 7, 54),
 'doc.core_properties.revision': 4,
 'doc.core_properties.subject': '',
 'doc.core_properties.title': '',
 'doc.core_properties.version': ''}
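
Properties such as author and last_modified_by are exactly where internal names tend to leak, so the same grep idea can be applied to the dictionary. The following is a rough sketch under that assumption: the function name find_suspicious_properties and the search pattern are hypothetical, and the sample dictionary only reuses a few of the values printed above.

```python
import re


def find_suspicious_properties(props, pattern):
    """Return {key: value} for string properties whose value matches pattern."""
    regex = re.compile(pattern)
    return {
        key: value
        for key, value in props.items()
        # Non-string values (datetime, int, None) are skipped
        if isinstance(value, str) and regex.search(value)
    }


# Input shaped like the dictionary printed above
props = {
    'doc.core_properties.author': 'PBS Consulting',
    'doc.core_properties.last_modified_by': '環境会議所東北',
    'doc.core_properties.revision': 4,
    'doc.core_properties.title': '',
}
# Hypothetical pattern; replace it with the names you want to catch
print(find_suspicious_properties(props, r'(Consulting|会議所)'))
```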

pdf files

For pdf files I could not find a convenient module like docx2txt, so I use pdfminer.
The file properties are obtained along the way with the same library.

Extracting body text

# https://euske.github.io/pdfminer/programming.html

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfparser import PDFDocument
from pdfminer.pdfparser import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.layout import LTTextBoxHorizontal

fp = open('sample.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument()
parser.set_document(document)
password=''
document.set_parser(parser)
document.initialize(password)
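if not document.is_extractable:
    # A bare `raise` with no active exception is itself an error, so raise explicitly
    raise RuntimeError('Text extraction is not allowed for this PDF')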
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

interpreter = PDFPageInterpreter(rsrcmgr, device)
pages = list(document.get_pages())
texts = []
for page in pages:
    interpreter.process_page(page)
    layout = device.get_result()
    for l in layout:
        if isinstance(l, LTTextBoxHorizontal):
            text = l.get_text()
            texts.append(text)

print('\n'.join(texts))

Output

オープンデータ基本指針 概要

本基本指針の位置づけ

平成28年12月14日に公布・施行された「官民データ活用推進基本法」において、国、地方公共団体、事業者が保有する官民データ
の容易な利用等について規定された。本文書は、これまでの取組を踏まえ、オープンデータ・バイ・デザイン（注）の考えに基づき、国、地方
公共団体、事業者が公共データの公開及び活用に取り組む上での基本方針をまとめたものである。

１．オープンデータの意義
...
(omitted)
...

Simple grep

import re
from pprint import pprint

matches = []
for text in texts:
    ms = re.findall(r'(xxxx|yyyy)', text, re.IGNORECASE)
    for m in ms:
        matches.append(m)

matches_uniq = list(set(matches))
pprint(sorted(matches_uniq))

Output

['府']
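
The overview names outdated Copyright years as one of the things to catch. As a hedged sketch, the extracted text could be checked along the following lines. The regular expression assumes notices written like "Copyright 2017" or "Copyright (c) 2015-2017", which may not cover every document, and the function name find_stale_copyright_years is hypothetical.

```python
import re

# Assumption: notices look like "Copyright 2017" or "Copyright (c) 2015-2017 ...".
# For a range, only the final year is captured.
COPYRIGHT_RE = re.compile(
    r'Copyright\s*(?:\(c\)|©)?\s*(?:\d{4}\s*-\s*)?(\d{4})',
    re.IGNORECASE,
)


def find_stale_copyright_years(text, expected_year):
    """Return the sorted Copyright years in text that differ from expected_year."""
    years = {int(y) for y in COPYRIGHT_RE.findall(text)}
    return sorted(y for y in years if y != expected_year)


# Made-up sample text for illustration only
sample = 'Copyright (c) 2015-2017 Sample Inc.\nCopyright 2018 Other Corp.'
print(find_stale_copyright_years(sample, 2018))
```

An empty list means every notice that was found already carries the expected year.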

Getting properties

document.info

Output

[{'CreationDate': "D:20170529114507+09'00'",
  'Creator': b'\xfe\xff\x00P\x00o\x00w\x00e\x00r\x00P\x00o\x00i\x00n\x00t\x00 u(\x00 \x00A\x00c\x00r\x00o\x00b\x00a\x00t\x00 \x00P\x00D\x00F\x00M\x00a\x00k\x00e\x00r\x00 \x001\x001',
  'ModDate': "D:20170529114511+09'00'",
  'Producer': 'Adobe PDF Library 11.0',
  'Title': ''}]

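# The value starts with the UTF-16 byte order mark (\xfe\xff), so decode it as UTF-16.
# Decoding as cp932 drops the BOM and garbles the non-ASCII character 用 into 'u('.
res = document.info[0]['Creator'].decode('utf-16')
print(res)
PowerPoint 用 Acrobat PDFMaker 11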
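
The info dictionary mixes plain strings, byte strings that begin with the byte order mark \xfe\xff, and dates in the PDF "D:..." format. The following sketch shows one way to normalize such values before grepping them. Both helper names are hypothetical, treating byte strings without a BOM as Latin-1 is only an approximation of the PDF default encoding, and the date parser handles only the full form with a time zone offset seen above.

```python
import re
from datetime import datetime, timedelta, timezone


def decode_pdf_string(value):
    """Decode a PDF info value to str; UTF-16 if it starts with a BOM."""
    if isinstance(value, bytes):
        if value.startswith(b'\xfe\xff'):
            return value.decode('utf-16')  # the codec consumes the BOM
        # Assumption: Latin-1 as a stand-in for the PDF default encoding
        return value.decode('latin-1')
    return value


def parse_pdf_date(value):
    """Parse "D:YYYYMMDDHHmmSS+HH'mm'" into an aware datetime, else None."""
    m = re.match(r"D:(\d{14})([+-])(\d{2})'(\d{2})'?$", value)
    if m is None:
        return None
    stamp, sign, hours, minutes = m.groups()
    offset = timedelta(hours=int(hours), minutes=int(minutes))
    if sign == '-':
        offset = -offset
    parsed = datetime.strptime(stamp, '%Y%m%d%H%M%S')
    return parsed.replace(tzinfo=timezone(offset))


print(decode_pdf_string(b'\xfe\xff\x00P\x00D\x00F'))
print(parse_pdf_date("D:20170529114507+09'00'").isoformat())
```

With the dates parsed, a check such as "was this file modified in the expected year" becomes a simple comparison on the year attribute.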

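Summary

Converting the body of a docx file to text turned out to be surprisingly easy.
pdf files, by contrast, take more work. I would like to update this article if I find a better way.

References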

Programming with PDFMiner
https://euske.github.io/pdfminer/programming.html
