
Summoning iTextSharp from Excel

Last updated at 2017-06-09, posted at 2016-03-24

I wanted to read text and other content inside a PDF from Excel on a machine where Adobe Acrobat is not installed.

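Preparation

.NET Framework 4.0 environment
The Excel-DNA package
Rename ExcelDna.xll to PdfTool.xll so that it matches the name of the .dna file you are about to create.
iTextSharp.dll

The Spell

Create the following file in a text editor and place it in the same folder as the xll above. iTextSharp.dll belongs in that folder too, since the Reference path in the file is relative.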

PdfTool.dna

<DnaLibrary RuntimeVersion="v4.0" Name="itextsharp" Description="PdfTool" Language="CS">
<Reference Path="itextsharp.dll" />
<![CDATA[

using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using ExcelDna.Integration;

public class PdfToolForExcel
{
    [ExcelFunction(Description = "Get number of pages", Category = "Using iTextSharp")]
    public static int PdfPageNum(string fname)
    {
        var pr = new PdfReader(fname);
        var num = pr.NumberOfPages;
        pr.Close();
        return num;
    }

    [ExcelFunction(Description = "Get page size of document", Category = "Using iTextSharp")]
    public static object[] PdfPageSize(string fname, int page)
    {
        var pr = new PdfReader(fname);
        try
        {
            // Width and height in PDF units; page rotation is not applied
            var rect = pr.GetPageSize(page);
            return new object[] { rect.Width, rect.Height };
        }
        finally
        {
            pr.Close(); // close the reader even if the page lookup throws
        }
    }

    [ExcelFunction(Description = "Read text from region in page", Category = "Using iTextSharp")]
    public static string PdfTextInPos(string fname, int page, double posL, double posB, double posR, double posT)
    {
        var pr = new PdfReader(fname);
        try
        {
            // Rectangle takes (lower-left x, lower-left y, upper-right x, upper-right y)
            var region = new iTextSharp.text.Rectangle((float)posL, (float)posB, (float)posR, (float)posT);
            var filter = new RegionTextRenderFilter(region);
            var strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
            return PdfTextExtractor.GetTextFromPage(pr, page, strategy);
        }
        finally
        {
            pr.Close(); // close the reader even if extraction throws
        }
    }
}

]]>
</DnaLibrary>

The Incantation

Drag and drop PdfTool.xll onto the workbook and choose "Enable this add-in for this session only."

Calling from Excel VBA

Get the total number of pages

    ' filename: full path to the PDF
    ' The return value is an integer
    numPages = Application.Run("PdfPageNum", filename)

Get the size of a given page

    ' filename: full path to the PDF
    ' page:     page whose size you want
    ' Receive the return value in a Variant, then split it into width and height
    o = Application.Run("PdfPageSize", filename, page)
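
Splitting the returned value into width and height could look like the sketch below. It assumes PdfTool.xll is already loaded; the path and the `ShowPdfPageSize` name are hypothetical. It also assumes the PDF uses the default unit of 1/72 inch, which is what the millimetre conversion relies on. Whether the array arrives as one- or two-dimensional is not assumed, so the elements are walked with `For Each`.

```vba
' Sketch only: assumes PdfTool.xll is already loaded in this Excel session.
' The path below is a hypothetical placeholder.
Sub ShowPdfPageSize()
    Const POINTS_PER_INCH As Double = 72#   ' default PDF unit (assumption)
    Const MM_PER_INCH As Double = 25.4

    Dim pdfPath As String
    Dim result As Variant
    Dim item As Variant
    Dim dims(0 To 1) As Double
    Dim i As Long

    pdfPath = "C:\temp\sample.pdf"
    result = Application.Run("PdfPageSize", pdfPath, 1)

    ' A failed call may come back as an error value rather than an array.
    If Not IsArray(result) Then
        Debug.Print "PdfPageSize did not return an array"
        Exit Sub
    End If

    ' For Each walks the elements whether the array is 1-D or 2-D.
    For Each item In result
        If i <= 1 Then dims(i) = CDbl(item)
        i = i + 1
    Next item

    Debug.Print "Width:  " & dims(0) & " pt = " & dims(0) * MM_PER_INCH / POINTS_PER_INCH & " mm"
    Debug.Print "Height: " & dims(1) & " pt = " & dims(1) * MM_PER_INCH / POINTS_PER_INCH & " mm"
End Sub
```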

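Get the text at a given location

    ' filename: full path to the PDF
    ' page:     page to read from
    ' bx, by:   lower-left corner of the region to read
    ' tx, ty:   upper-right corner of the region to read
    ' The return value is a String
    txt = Application.Run("PdfTextInPos", filename, page, bx, by, tx, ty)

Note that PDF puts the origin (0,0) at the bottom-left corner of the page, so take care not to mix this up.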
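
If a region was measured from the top-left corner of the page, as most viewers and image tools do, it has to be flipped before being passed to PdfTextInPos. The helper below is a hypothetical sketch, not part of the add-in. It assumes the page is not rotated, that the page's lower-left corner sits at (0,0), and that the module uses the default `Option Base 0`.

```vba
' Hypothetical helper, not part of the add-in.
' Converts a rectangle measured from the top-left corner of the page
' (y increasing downward) into bottom-left-origin corners.
' Returns Array(bx, by, tx, ty), all in PDF units.
Function TopLeftToPdfRect(ByVal pageHeight As Double, _
                          ByVal leftX As Double, ByVal topY As Double, _
                          ByVal rectWidth As Double, ByVal rectHeight As Double) As Variant
    Dim ty As Double
    ty = pageHeight - topY                 ' top edge, measured from the bottom
    TopLeftToPdfRect = Array(leftX, ty - rectHeight, leftX + rectWidth, ty)
End Function

' Usage sketch: a 200 x 50 box whose top-left corner is 100 from the left
' and 72 from the top of a page that is 842 units high (roughly A4).
Sub ReadBoxFromTopLeft(ByVal filename As String)
    Dim r As Variant
    r = TopLeftToPdfRect(842, 100, 72, 200, 50)   ' gives 100, 720, 300, 770
    Debug.Print Application.Run("PdfTextInPos", filename, 1, r(0), r(1), r(2), r(3))
End Sub
```

In practice the page height would come from PdfPageSize rather than a hard-coded 842.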

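Using them as worksheet functions

The functions also work when entered in cells as formulas.

Get the total number of pages

=PdfPageNum(filename)

Get the size of a given page

=PdfPageSize(filename, page)

Note: this one has to be entered as an array formula (select the cells that should receive the width and height, then confirm with Ctrl+Shift+Enter).

Get the text at a given location

=PdfTextInPos(filename, page, bx, by, tx, ty)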

R.I.P.

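http://stackoverflow.com/questions/23909893/getting-coordinates-of-string-using-itextextractionstrategy-and-locationtextextr
Judging from the above, it also seems possible to get the position of a given string, but I could not quite work out how to combine that with a regular-expression search over the full text, so I gave up on publishing it.
Alas. I leave the rest to you.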

R.I.P. 2

Even with iTextAsian.dll added, certain PDFs encoded with something like UniJIS-UCS2-HW-H (import permits downloaded from a certain shipping company's website) still cannot be read...
Quite a nuisance.
