
Extracting text from Excel, Word, and PowerPoint files with C#

Posted at 2018-07-15

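Overview

I needed to grep the text inside Office documents, so I wrote a tool that extracts text from Excel, Word, and PowerPoint files. This article introduces the tool and explains how it is built.
Java's POI would also work, but the tool was easy to write in C#, so I used C#.
The tool requires .NET Framework 4.5.1 and an installed copy of Office. (4.5.1 is needed only because of the command-line parsing library; without that library the tool would probably run on older versions as well.)
　Source: https://github.com/tashxii/office-extract-text
　Executable: https://github.com/tashxii/office-extract-text/releases/download/1.0/OfficeExtractText.exe.zip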

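Features

It is a console application, so extraction can be run from batch files and similar scripts.
It handles all Excel (xls, xlsx, xlsm), Word (doc, docx, docm), and PowerPoint (ppt, pptx, pptm) files.
It also extracts text inside shapes, text inside shapes placed on a drawing canvas, and text in comments.
Options let you restrict the targets, for example Excel only, or Word and PowerPoint only.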

Usage

Usage: OfficeExtractText.exe [arguments] [options]

Arguments:
  <target file or directory>  Target file or directory.

Options:
  -o|--output      <output_directory> (required) Directory where the extracted text files are written.
  -s|--subdir      Also extract from subdirectories.
  -e|--excel       Extract Excel files.
                   If -e, -w, and -p are all omitted,
                   Word, Excel, and PowerPoint files are all extracted.
  -w|--word        Extract Word files.
  -p|--powerpoint  Extract PowerPoint files.
  --no-log         Run without writing the detailed log.
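
The rule that omitting -e, -w, and -p selects every file type can be sketched roughly as follows. This is only an illustration: the class and method names (TargetFilter, SelectExtensions, IsTarget) are hypothetical and are not taken from the tool's source. The extension lists are the ones given under Features.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: decides which file extensions to process.
static class TargetFilter
{
    static readonly string[] ExcelExt = { ".xls", ".xlsx", ".xlsm" };
    static readonly string[] WordExt = { ".doc", ".docx", ".docm" };
    static readonly string[] PowerPointExt = { ".ppt", ".pptx", ".pptm" };

    // Returns the extensions selected by the -e / -w / -p flags.
    public static HashSet<string> SelectExtensions(bool excel, bool word, bool powerPoint)
    {
        // Omitting all three flags means "extract everything".
        if (!excel && !word && !powerPoint)
        {
            excel = word = powerPoint = true;
        }
        var result = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        if (excel) result.UnionWith(ExcelExt);
        if (word) result.UnionWith(WordExt);
        if (powerPoint) result.UnionWith(PowerPointExt);
        return result;
    }

    // True when the file's extension is one of the selected ones (case-insensitive).
    public static bool IsTarget(string path, HashSet<string> extensions)
    {
        return extensions.Contains(Path.GetExtension(path));
    }
}
```

With -e and -w given, SelectExtensions(true, true, false) would accept report.xlsx and memo.docx but skip slides.pptx.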

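For each Office file found, a file named <Office file name>.txt is created under the directory given with the -o option.
The output path keeps the file's path relative to the input path.
For example, with C:\input as the target and C:\output as the output directory, the txt files are written as follows.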

  C:\input\test1.xlsx          -> C:\output\test1.txt
  C:\input\aaa\bbb\test2.xlsx  -> C:\output\aaa\bbb\test2.txt
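
The mapping above could be computed along these lines. This is a sketch under assumptions: OutputPathMapper and ToOutputPath are hypothetical names, the target is assumed to be a directory, and the tool's actual implementation may differ.

```csharp
using System;
using System.IO;

// Hypothetical helper: maps an input Office file to its output .txt path.
static class OutputPathMapper
{
    public static string ToOutputPath(string inputRoot, string outputRoot, string officeFile)
    {
        var root = inputRoot.TrimEnd('\\', '/');

        // The file must sit below the input root, separated by a path separator.
        bool underRoot = officeFile.Length > root.Length + 1
            && officeFile.StartsWith(root, StringComparison.OrdinalIgnoreCase)
            && (officeFile[root.Length] == '\\' || officeFile[root.Length] == '/');
        if (!underRoot)
        {
            throw new ArgumentException("The file is not under the input directory.", nameof(officeFile));
        }

        // Keep the relative sub-path and swap the extension for .txt.
        var relative = officeFile.Substring(root.Length + 1);
        return Path.Combine(outputRoot, Path.ChangeExtension(relative, ".txt"));
    }
}
```

On Windows, ToOutputPath(@"C:\input", @"C:\output", @"C:\input\aaa\bbb\test2.xlsx") gives C:\output\aaa\bbb\test2.txt, matching the second row above. The output subdirectory still has to be created (for example with Directory.CreateDirectory) before the file is written.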

For example, to extract only the Excel and Word files under C:\OfficeDocuments, including its subdirectories, run:
OfficeExtractText.exe -s -e -w -o C:\Temp\ExtractedText C:\OfficeDocuments

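Code walkthrough

The basic approach is simply to open the Excel, Word, or PowerPoint file through COM and save it in a text format with one of the SaveAs methods. A single line of code converts most of the content to text.
I say "most" because extracting the text inside shapes takes additional code.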

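Libraries for working with Office files

There are two kinds of library for working with Office files (three for Excel).

Microsoft.Office.Interop (COM)
OpenXml (a Microsoft library; fast, but so hard to use that coding against it is practically impossible without dedicated tooling)
ClosedXml (an OSS library for Excel only; fast and easy to use, but with many limitations, such as no support for shapes)
In practice I ended up using the first one, COM.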

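Managing COM releases automatically

As described in the article 「Excelファイルを C# と VB.NET で読み込む "正しい" 方法」 (The "correct" way to read Excel files in C# and VB.NET), every COM object you obtain has to be released with a call such as Marshal.ReleaseComObject(comObject).

As that article also notes, releasing everything by hand quickly becomes painful, so here is a way to automate it.

The article's own advice is that writing all of this every time is tedious, so if you do it repeatedly you are better off creating a class that implements IDisposable.

This is the ComWrapper class. Because it implements IDisposable, the release can be automated with a using statement.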

ComWrapper.cs

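    using System;
    using System.Runtime.InteropServices;

    class ComWrapper<T> : IDisposable
    {
        public T ComObject { get; }

        public ComWrapper(T comObject)
        {
            this.ComObject = comObject;
        }

        private bool disposedValue = false;

        protected virtual void Dispose(bool disposing)
        {
            if (!disposedValue)
            {
                if (disposing)
                {
                    //nop: there are no managed resources to dispose
                }
                //Release the COM reference on both the Dispose and the finalizer path.
                if (ComObject != null && Marshal.IsComObject(ComObject))
                {
                    Marshal.ReleaseComObject(ComObject);
                }
                disposedValue = true;
            }
        }

        ~ComWrapper()
        {
            Dispose(false);
        }

        public void Dispose()
        {
            Dispose(true);
            //Already released, so the finalizer does not need to run.
            GC.SuppressFinalize(this);
        }
    }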

It is used like this.

using (var excelObj = new ComWrapper<Excel.Application>(
  new Excel.Application() { Visible = false, DisplayAlerts = false })
){
  var excel = excelObj.ComObject;
  //operations on excel
}

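Extracting text from Word

I will start with Word, which is comparatively simple.
To work with a Word file, create a Word.Application with new, then open a single file (Word.Document) with the Open method of the application's Documents property (Word.Documents).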

using (var word = new ComWrapper<Word.Application>(new Word.Application() { Visible = false, DisplayAlerts = Word.WdAlertLevel.wdAlertsNone }))
using (var docs = new ComWrapper<Word.Documents>(word.ComObject.Documents))
{
    //docs is a ComWrapper here, so Open is called on the wrapped Word.Documents object.
    using (var doc = new ComWrapper<Word.Document>(
        docs.ComObject.Open(file,
            ReadOnly: true,
            AddToRecentFiles: false,
            Visible: false)
        ))
    {
        //operations on doc.ComObject (Word.Document)
    }
}

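Saving the body text of a Word document

Call Word.Document#SaveAs2 with FileFormat: Word.WdSaveFormat.wdFormatText to save the document as a text file. (The body is saved as temporary file 1 and the shape and comment text as temporary file 2; the two are then merged and returned.)
Comments are collected by looping over Word.Document#Comments. Each entry combines the Comment#Author and Comment#Range#Text that make up the comment.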

WordTextExporter.cs

internal List<string> Export()
{
    var contents = new List<string>();
    using (var doc = new ComWrapper<Word.Document>(
        docs.Open(file,
            ReadOnly: true,
            AddToRecentFiles: false,
            Visible: false)
        ))
    {
        var tempFiles = new string[2];
        bool success = false;
        try
        {
            tempFiles[0] = Path.GetTempFileName();
            tempFiles[1] = Path.GetTempFileName();

            //Text in word. ★Save the body as text here
            doc.ComObject.SaveAs2(tempFiles[0], FileFormat: Word.WdSaveFormat.wdFormatText);
            //Text in shapes.
            List<string> otherContents = new List<string>();
            foreach (Word.Shape shape in doc.ComObject.Shapes)
            {//★Extract the text of each shape
                ExtractShapeContents(otherContents, shape);
            }
            foreach (Word.Comment comment in doc.ComObject.Comments)
            {//★Store each comment
                otherContents.Add(comment.Author + ":" + comment.Range.Text);
            }
            File.WriteAllLines(tempFiles[1], otherContents, Encoding.Default);
            success = true;
        }
        finally
        {
            doc.ComObject.Close(false);
            //merge contents after closing doc.
            if (success)
            {
                contents = FileUtils.MergeTextContents(tempFiles);
            }
            FileUtils.DeleteFiles(tempFiles);
        }
        return contents;
    }
}
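
FileUtils.MergeTextContents and FileUtils.DeleteFiles are not shown in this article. Judging only from how they are called above, they might look roughly like the sketch below. The class name FileUtilsSketch and the method bodies are assumptions; the real implementation in the repository may differ. Encoding.Default is used for reading because the shape and comment file is written with it above.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for the FileUtils helpers used by the exporters.
static class FileUtilsSketch
{
    // Reads every file in order and returns all lines as one list.
    public static List<string> MergeTextContents(IEnumerable<string> files)
    {
        var merged = new List<string>();
        foreach (var file in files)
        {
            if (file != null && File.Exists(file))
            {
                merged.AddRange(File.ReadAllLines(file, Encoding.Default));
            }
        }
        return merged;
    }

    // Deletes the temporary files, skipping entries that were never created.
    public static void DeleteFiles(IEnumerable<string> files)
    {
        foreach (var file in files)
        {
            if (file != null && File.Exists(file))
            {
                File.Delete(file);
            }
        }
    }
}
```

The null checks matter in this sketch because the tempFiles array is allocated before Path.GetTempFileName() is called, so an early failure can leave null entries when the finally block runs.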

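Extracting text from shapes in Word

The text of a shape is read from Word.Shape#TextFrame.TextRange.Text.
The code below is a little more involved than that, however, because calling this API on a shape that has no text, or on a grouped shape, throws an exception.
In addition, unless shape#Select is called first, members such as shape#Type throw an exception.
Outside Word, calling Select was not necessary.

Whether a shape is a group or a drawing canvas is determined from shape.Type.
　Grouped shape = Microsoft.Office.Core.MsoShapeType.msoGroup
　Canvas = Microsoft.Office.Core.MsoShapeType.msoCanvas
After that, it is just a matter of recursively collecting the shape text into the string list named contents.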

WordTextExporter.cs

private void ExtractShapeContents(List<string> contents, Word.Shape shape)
{
    shape.Select();//shape.Type fails if not selected. This problem is word only.
    if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoGroup)
    {
        //To check group or not, use only shape.AutoShapeType == msoShapeMixed or shape.Type == msoGroup,
        //because other ways like shape.GroupItems.Count & shape.Ungroup throw an exception when shape is not a group.
        foreach (Word.Shape subShape in shape.GroupItems)
        {//★Recurse into each shape in the group
            ExtractShapeContents(contents, subShape);
        }
    }
    else if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoCanvas)
    {
        foreach (Word.Shape subShape in shape.CanvasItems)
        {//★Recurse into each shape on the canvas
            ExtractShapeContents(contents, subShape);
        }
    }
    else
    {
        if (shape.TextFrame != null && shape.TextFrame.HasText != 0)
        {//★Store the text inside the shape
            var text = shape.TextFrame?.TextRange?.Text;
            if (!String.IsNullOrEmpty(text))
            {
                contents.Add(text);
            }
        }
    }
}

Extracting text from Excel

To work with an Excel file, create an Excel.Application with new, then open a single file (Excel.Workbook) with the Open method of the application's Workbooks property (Excel.Workbooks).

using (var excel = new ComWrapper<Excel.Application>(new Excel.Application() { Visible = false, DisplayAlerts = false }))
using (var books = new ComWrapper<Excel.Workbooks>(excel.ComObject.Workbooks))
{
    //books is a ComWrapper here, so Open is called on the wrapped Excel.Workbooks object.
    using (var book = new ComWrapper<Excel.Workbook>(books.ComObject.Open(file,
            UpdateLinks: Excel.XlUpdateLinks.xlUpdateLinksNever,
            ReadOnly: true,
            IgnoreReadOnlyRecommended: true,
            Editable: false)
        ))
    {
        //operations on book.ComObject (Excel.Workbook)
    }
}

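Saving the text of each sheet

For Excel, the cell text and the shape text have to be saved sheet by sheet.
The method below writes each sheet's cell contents and shape text to temporary files and merges them at the end.
The cell contents are saved by calling the SaveAs method of the sheet (Excel.Worksheet) with FileFormat: Excel.XlFileFormat.xlCSV.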

ExcelTextExporter.cs

internal List<string> Export()
{
    List<string> contents = new List<string>();
    //★Open the workbook read-only
    using (var book = new ComWrapper<Excel.Workbook>(books.Open(file,
            UpdateLinks: Excel.XlUpdateLinks.xlUpdateLinksNever,
            ReadOnly: true,
            IgnoreReadOnlyRecommended: true,
            Editable: false)
        ))
    {
        List<string> sheetNames = new List<string>();
        List<string> tempFiles = new List<string>();
        bool success = false;
        try
        {
            for (int i = 1; i <= book.ComObject.Worksheets.Count; i++)
            {
                using (var sheet = new ComWrapper<Excel.Worksheet>(book.ComObject.Worksheets[i]))
                {
                    //Read the name before saving, because the sheet name changes after SaveAs.
                    var sheetName = sheet.ComObject.Name;
                    //Each sheet gets two temp files, so the name is added twice to keep
                    //sheetNames aligned index-for-index with tempFiles.
                    sheetNames.Add(sheetName);
                    var tempFile1 = Path.GetTempFileName();//for sheet
                    tempFiles.Add(tempFile1);
                    sheetNames.Add(sheetName);
                    var tempFile2 = Path.GetTempFileName();//for shapes & comments
                    tempFiles.Add(tempFile2);
                    //Text in sheet.
                    sheet.ComObject.SaveAs(tempFile1, FileFormat: Excel.XlFileFormat.xlCSV);
                    //Text in shapes & comments
                    List<string> otherContents = new List<string>();
                    foreach (Excel.Shape shape in sheet.ComObject.Shapes)
                    {//★Extract the text of each shape
                        ExtractShapesContents(otherContents, shape);
                    }
                    foreach (Excel.Comment comment in sheet.ComObject.Comments)
                    {//★Extract each comment
                        otherContents.Add(comment.Author + ":" + comment.Text());
                    }
                    File.WriteAllLines(tempFile2, otherContents, Encoding.Default);
                }
            }
            //Mark success only after every sheet has been processed.
            success = true;
        }
        finally
        {
            book.ComObject.Close(false);
            //merge contents after closing 
            if (success)
            {
                int i = 0;
                foreach (var tempFile in tempFiles)
                {
                    if (i % 2 == 0)
                    {//sheet=n+0, shapes=n+1
                        contents.Add("[" + sheetNames[i] + "]");
                    }
                    i++;
                    var sheetContents = FileUtils.MergeTextContents(new string[] { tempFile });
                    File.Delete(tempFile);
                    contents.AddRange(sheetContents);
                }
            }
            FileUtils.DeleteFiles(tempFiles.ToArray());
        }
        return contents;
    }
}

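Extracting text from shapes in a sheet

The shapes in a sheet can be extracted with code like the following.

The text of a shape is read from Excel.Shape#TextFrame.Characters().Text.
Whether a shape is a group or a canvas can be determined from Shape#Type.
The recursion works essentially the same way as for Word.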

ExcelTextExporter.cs

private void ExtractShapesContents(List<string> contents, Excel.Shape shape)
{
    if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoGroup)
    {
        //To check group or not, use only shape.Type or shape.AutoShapeType, 
        //because other ways like shape.GroupItems.Count & shape.Ungroup throw an exception when shape is not a group.
        foreach (Excel.Shape subShape in shape.GroupItems)
        {
            ExtractShapesContents(contents, subShape);
        }
    }
    else if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoCanvas)
    {
        foreach (Excel.Shape subShape in shape.CanvasItems)
        {
            ExtractShapesContents(contents, subShape);
        }
    }
    else
    {
        var text = shape.TextFrame?.Characters()?.Text;
        if (!String.IsNullOrEmpty(text))
        {
            contents.Add(text);
        }
    }
}

Extracting text from PowerPoint

To work with a PowerPoint file, create a PowerPoint.Application with new, then open a single file (PowerPoint.Presentation) with the Open method of the application's Presentations property (PowerPoint.Presentations).
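
The article shows the opening code for Word and Excel but not for PowerPoint, so here is a minimal sketch of the equivalent. It is not taken from the tool's source: the class and method names are hypothetical, it uses plain try/finally instead of ComWrapper so that it stands alone, and calling Quit at the end is my assumption about how the application should be shut down. It requires references to the PowerPoint and Office interop assemblies and an installed copy of PowerPoint.

```csharp
using System.Runtime.InteropServices;
using Microsoft.Office.Core;
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

// Hypothetical example: opens a presentation without a window and counts its slides.
static class PowerPointOpenSketch
{
    public static int CountSlides(string file)
    {
        PowerPoint.Application app = null;
        PowerPoint.Presentations presentations = null;
        PowerPoint.Presentation presentation = null;
        PowerPoint.Slides slides = null;
        try
        {
            app = new PowerPoint.Application();
            presentations = app.Presentations;
            // Read-only and without a window, like the Open call in Export() further down.
            presentation = presentations.Open(file,
                ReadOnly: MsoTriState.msoTrue,
                Untitled: MsoTriState.msoFalse,
                WithWindow: MsoTriState.msoFalse);
            slides = presentation.Slides;
            return slides.Count;
        }
        finally
        {
            // Release in the reverse order of acquisition.
            if (slides != null) Marshal.ReleaseComObject(slides);
            if (presentation != null)
            {
                presentation.Close();
                Marshal.ReleaseComObject(presentation);
            }
            if (presentations != null) Marshal.ReleaseComObject(presentations);
            if (app != null)
            {
                app.Quit();
                Marshal.ReleaseComObject(app);
            }
        }
    }
}
```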

Saving as rich text

PowerPoint has no feature for saving a presentation as plain text, but it can save it as rich text.

    ppt.ComObject.SaveAs(tempFiles[0], FileFormat: PowerPoint.PpSaveAsFileType.ppSaveAsRTF);

Converting rich text to plain text

To convert the rich text to plain text, create a RichTextBox, a Windows Forms control, and let it do the conversion.

    RichTextBox richTextBox = new RichTextBox();
    richTextBox.Rtf = richText;
    File.WriteAllText(tempFiles[0], richTextBox.Text, Encoding.Default);

This page, among others, was a helpful reference.
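
The same conversion can be wrapped in a small helper that also disposes the control, since RichTextBox implements IDisposable. This is a sketch: the names RtfToTextSketch, ToPlainText, and ConvertFile are hypothetical, and it requires a reference to System.Windows.Forms.

```csharp
using System.IO;
using System.Text;
using System.Windows.Forms;

// Hypothetical helper around the RichTextBox trick shown above.
static class RtfToTextSketch
{
    // Parses the RTF markup and returns only the plain text.
    public static string ToPlainText(string rtf)
    {
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf; // throws ArgumentException if the input is not valid RTF
            return box.Text;
        }
    }

    // Reads an RTF file and writes its plain text to another file,
    // using Encoding.Default as in the article's code.
    public static void ConvertFile(string rtfPath, string textPath)
    {
        var rtf = File.ReadAllText(rtfPath, Encoding.Default);
        File.WriteAllText(textPath, ToPlainText(rtf), Encoding.Default);
    }
}
```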

Extracting text from the notes page below each slide

Shapes and notes pages have to be read slide by slide.
The slides can be looped over with foreach(PowerPoint.Slide slide in presentation.Slides).

The notes are obtained with slide.NotesPage.Shapes.Placeholders[2]. The first placeholder is the slide itself and the second one holds the notes.
The notes text is read from slide.NotesPage.Shapes.Placeholders[2].TextFrame.TextRange.Text.
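
If a notes page might lack the second placeholder or contain no text, a more defensive lookup could look like the sketch below. The guard conditions are my assumption and are not part of the tool. GetNotesText and NotesTextSketch are hypothetical names, and the sketch keeps the article's premise that placeholder 2 is the notes body.

```csharp
using Microsoft.Office.Core;
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

// Hypothetical helper: returns the notes text of a slide, or an empty string.
static class NotesTextSketch
{
    public static string GetNotesText(PowerPoint.Slide slide)
    {
        var placeholders = slide.NotesPage.Shapes.Placeholders;
        // Placeholders[1] is the slide itself and Placeholders[2] is the notes body.
        if (placeholders.Count < 2)
        {
            return string.Empty;
        }
        var body = placeholders[2];
        if (body.HasTextFrame != MsoTriState.msoTrue)
        {
            return string.Empty;
        }
        var frame = body.TextFrame;
        return frame.HasText == MsoTriState.msoTrue
            ? frame.TextRange.Text
            : string.Empty;
    }
}
```

Returning an empty string keeps one entry per slide in the output list, which is also what a direct read of an empty notes body would produce.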

The complete code is as follows.

PowerPointTextExporter.cs

internal List<string> Export()
{
    var contents = new List<string>();
    using (var ppt = new ComWrapper<PowerPoint.Presentation>(
        ppts.Open(file,
            ReadOnly: Microsoft.Office.Core.MsoTriState.msoTrue,
            WithWindow: Microsoft.Office.Core.MsoTriState.msoFalse)
        ))
    {
        var tempFiles = new string[2];
        var success = false;
        try
        {
            tempFiles[0] = Path.GetTempFileName();
            tempFiles[1] = Path.GetTempFileName();

            //Text in PPT
            //Save as rich text file.
            ppt.ComObject.SaveAs(tempFiles[0], FileFormat: PowerPoint.PpSaveAsFileType.ppSaveAsRTF);
            //Read and save as a text file.
            string richText = File.ReadAllText(tempFiles[0], Encoding.Default);
            //Cheap trick to convert text from rtf.
            RichTextBox richTextBox = new RichTextBox();
            richTextBox.Rtf = richText;
            File.WriteAllText(tempFiles[0], richTextBox.Text, Encoding.Default);

            //Text in shapes & comments
            var slideContents = new List<string>();
            foreach (PowerPoint.Slide slide in ppt.ComObject.Slides)
            {
                foreach (PowerPoint.Shape shape in slide.Shapes)
                {
                    ExtractShapeContents(slideContents, shape);
                }
                foreach (PowerPoint.Comment comment in slide.Comments)
                {
                    slideContents.Add(comment.Author + ":" + comment.Text);
                }
                slideContents.Add(slide.NotesPage.Shapes.Placeholders[2].TextFrame.TextRange.Text);//Placeholders[1] is the slide itself.
            }
            File.WriteAllLines(tempFiles[1], slideContents, Encoding.Default);
            success = true;
        }
        finally
        {
            ppt.ComObject.Close();
            //merge contents after closing ppt.
            if(success)
            {
                contents = FileUtils.MergeTextContents(tempFiles);
            }
            FileUtils.DeleteFiles(tempFiles);
        }
        return contents;
    }
}

Extracting text from shapes in PowerPoint

As with Word and Excel, shapes in PowerPoint are classified by shape.Type, and groups and canvases are handled by recursive calls.

PowerPointTextExporter.cs

private void ExtractShapeContents(List<string> contents, PowerPoint.Shape shape)
{
    if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoGroup)
    {
        //To check group or not, use only shape.AutoShapeType == msoShapeMixed or shape.Type == msoGroup,
        //because other ways like shape.GroupItems.Count & shape.Ungroup throw an exception when shape is not a group.
        foreach (PowerPoint.Shape subShape in shape.GroupItems)
        {
            ExtractShapeContents(contents, subShape);
        }
    }
    else if (shape.Type == Microsoft.Office.Core.MsoShapeType.msoCanvas)
    {
        foreach (PowerPoint.Shape subShape in shape.CanvasItems)
        {
            ExtractShapeContents(contents, subShape);
        }
    }
    else
    {
        if (shape.TextFrame != null && shape.TextFrame.HasText == Microsoft.Office.Core.MsoTriState.msoTrue)
        {
            var text = shape.TextFrame?.TextRange?.Text;
            if (!String.IsNullOrEmpty(text))
            {
                contents.Add(text);
            }
        }
    }
}

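Other tips

I plan to write up some of the tips below, which I used while developing this tool, as separate articles.

Internationalization
Automating the release of Office COM objects (Disposable)
Using a command-line argument parser in .NET
Changing the color of console output and console error output
Converting rich text to plain text
Merging assembly DLLs into the exe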
