@yuuty posted at 2023-06-09

How can I use DocumentFormat.OpenXml.Paragraph to extract images from a .docx file and replace each image with text?

Q&A

Closed

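What I want to solve

I want to know how to use DocumentFormat.OpenXml.Paragraph to extract images from a Word (.docx) file and replace each extracted image with text at its original position.

Example:
I am writing a Windows application in C# that uses DocumentFormat.OpenXml to split a Word file into separate .docx files, one per heading, and save them to a specified folder. If a split .docx contains an image (figure), I want to extract it, save it to the same specified folder, and write the extracted image's file name at the position where the image was.
To do that, I am trying to extract the image from DocumentFormat.OpenXml.Paragraph and replace its Run with the image file name, but I cannot extract the image in the first place.
I would appreciate it if someone could show me how to extract images from DocumentFormat.OpenXml.Paragraph.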

Example: contents of the saved .docx

Test test START

@@image01.jpeg@@  <- the file name of the extracted image should appear where the image was

Test test END

The source code I am trying

public static int ExtractImages(WordprocessingDocument document, Paragraph paragraph, int imageIndex, string imageOutPath, string midashi)
{
    var imageFileIndex = imageIndex;

    foreach (var run in paragraph.Descendants<Run>())
    {
        var drawing = run.Descendants<Drawing>().FirstOrDefault();
        if (drawing != null)
        {
            var blip = drawing.Descendants<DocumentFormat.OpenXml.Drawing.Blip>().FirstOrDefault();
            if (blip != null)
            {
                string imageId = blip.Embed.Value;
                var imagePart = (ImagePart)document.MainDocumentPart.GetPartById(imageId);

                // Get the file name of the image
                string imageName = imagePart.Uri.OriginalString;

                // Save the image to the specified path
                string fileName = $"image{imageFileIndex}_{midashi}_{imageName}";
                string savedImagePath = System.IO.Path.Combine(imageOutPath, fileName);
                using (var stream = new FileStream(savedImagePath, FileMode.Create))
                {
                    imagePart.GetStream().CopyTo(stream);
                }

                // Remove the image and replace it with text
                run.RemoveAllChildren();
                run.Append(new Text("@@" + fileName + "@@"));

                imageFileIndex++;
            }
        }
    }
    Console.WriteLine($"Extracted {imageFileIndex} images:");
    return imageFileIndex;
}

What I tried myself

var drawing = run.Descendants<Drawing>().FirstOrDefault();

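With the code above, drawing is always null and I cannot get the image, even though the Word file I am reading does contain images.
I would appreciate it if someone could show me how to extract images from DocumentFormat.OpenXml.Paragraph.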
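When a typed lookup such as `Descendants<Drawing>()` keeps returning null, one way to narrow it down is to list which elements each run actually contains. The following is a minimal diagnostic sketch, not part of the original post; the class and method names (`RunInspector`, `DescribeRuns`) are hypothetical, and it assumes the DocumentFormat.OpenXml package is referenced.

```csharp
using System.Linq;
using W = DocumentFormat.OpenXml.Wordprocessing;

public static class RunInspector
{
    // Hypothetical helper: for each run in the paragraph, returns a
    // comma-separated list of its direct children, formatted as
    // "<XML local name> (<SDK class name>)".
    public static string[] DescribeRuns(W.Paragraph paragraph)
    {
        return paragraph.Descendants<W.Run>()
            .Select(run => string.Join(", ",
                run.ChildElements.Select(c => $"{c.LocalName} ({c.GetType().Name})")))
            .ToArray();
    }
}
```

Printing each returned string with `Console.WriteLine` shows at a glance whether a run holds `drawing`, `pict`, or something else, which tells you which SDK class to search for.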


1 Answer

@mrbonjin posted at 2023-06-09

Isn't the image a Picture rather than a Drawing?
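For context, a run in a .docx can reference an image in two ways: DrawingML (`w:drawing`, where the reference is the `r:embed` attribute of `a:blip`) or legacy VML (`w:pict`, where the reference is the `r:id` attribute of `v:imagedata`). The following is a minimal sketch that checks both; the class and method names (`ImageReferenceFinder`, `FindImageRelationshipId`) are hypothetical and not from the thread, and it assumes the DocumentFormat.OpenXml package is referenced.

```csharp
using System.Linq;
using A = DocumentFormat.OpenXml.Drawing;
using V = DocumentFormat.OpenXml.Vml;
using W = DocumentFormat.OpenXml.Wordprocessing;

public static class ImageReferenceFinder
{
    // Hypothetical helper: returns the relationship id of the first image
    // referenced by the run, or null if the run references no image.
    public static string FindImageRelationshipId(W.Run run)
    {
        // DrawingML: w:drawing ... a:blip r:embed="rIdN"
        var blip = run.Descendants<A.Blip>().FirstOrDefault();
        if (blip?.Embed?.Value != null)
        {
            return blip.Embed.Value;
        }

        // VML: w:pict ... v:imagedata r:id="rIdN"
        var imageData = run.Descendants<V.ImageData>().FirstOrDefault();
        return imageData?.RelationshipId?.Value;
    }
}
```

The returned id can then be passed to `MainDocumentPart.GetPartById` in the same way as in the question's code.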


Comments

@t0208n
How is the "ExtractImages" method being called?
@yuuty
Questioner
@mrbonjin
Thank you.

var picture = run.Descendants<DocumentFormat.OpenXml.Wordprocessing.Picture>().FirstOrDefault();

After changing the line to the above, I was able to get picture. However, I am still working out how to write the rest of the processing.

@yuuty

Questioner

@t0208n
It is long, but the method is called from the line imageFileIndex = WordTextAnalysis.ExtractImages(document, nextParagraph, imageFileIndex, outputFolderPath, midashi); in the code below.

using (WordprocessingDocument document = WordprocessingDocument.Open(outputFolderPath + "\\" + fileNameDocx, true))
{
    Paragraph[] paragraphs = document.MainDocumentPart.Document.Body.Descendants<Paragraph>().ToArray();
    int sectionIndex = 1;
    int imageFileIndex = 0;
    for (int i = 0; i < paragraphs.Length; i++)
    {

        // List used to check that the same table is not copied twice
        List<string> TableFileList = new List<string>();

        Paragraph paragraph = paragraphs[i];
        if (WordTextAnalysis.IsHeading(paragraph))
        {
            var midashi = paragraph.InnerText;

            // Create a new .docx file and copy the original content into it
            string outputFilePath = Path.Combine(outputFolderPath, $"本文{sectionIndex.ToString("D3")}_{midashi}.docx");
            Console.WriteLine($":::{outputFilePath}");
            using (WordprocessingDocument newDocument = WordprocessingDocument.Create(outputFilePath, WordprocessingDocumentType.Document))
            {
         
                foreach (var part in document.Parts)
                {
                    newDocument.AddPart(part.OpenXmlPart, part.RelationshipId);
                }
                newDocument.MainDocumentPart.Document.Body.RemoveAllChildren();

                // Carry over the page settings of the original Word document
                foreach (var sourceSectionProps in document.MainDocumentPart.Document.Descendants<SectionProperties>())
                {
                    SectionProperties newSectionProps = (SectionProperties)sourceSectionProps.CloneNode(true);
                    newDocument.MainDocumentPart.Document.Body.Append(newSectionProps);
                }

                // Copy the content under the heading (the heading itself is not included in the output .docx)
                for (int j = i + 1; j < paragraphs.Length; j++)
                {
                    Paragraph nextParagraph = paragraphs[j];
                    if (WordTextAnalysis.IsHeading(nextParagraph))
                    {
                        // Copy content only up to the next heading
                        break;
                    }



                    // Image extraction
                    // Pull the images out of the paragraph, save them to the folder, and write a placeholder at each image position
                    imageFileIndex = WordTextAnalysis.ExtractImages(document, nextParagraph, imageFileIndex, outputFolderPath, midashi);



                    // Copy the table that the paragraph belongs to
                    foreach (var element in nextParagraph.Ancestors())
                    {
                        // Copy the table only if it has not already been copied into the new document
                        if (element.LocalName == "tbl" && !TableFileList.Contains(element.InnerXml))
                        {
                            if (element.Parent.LocalName == "tc")
                            {
                                // Table nested inside another table, so skip it
                                continue;
                            }

                            newDocument.MainDocumentPart.Document.Body.Append(element.CloneNode(true));
                            TableFileList.Add(element.InnerXml);
                            break;
                        } 
                    }

                    // Copy the content if the paragraph is not inside a table
                    if (nextParagraph.Parent.LocalName != "tc")
                    {
                        newDocument.MainDocumentPart.Document.Body.Append(nextParagraph.CloneNode(true));

                        // Clear the duplicate-check list at this point
                        // Note: the same table can appear again under a different paragraph, so the list must be cleared
                        TableFileList = new List<string>();
                    }

                }

            }

            sectionIndex++;
        }
    }
}
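One thing to watch in the code above is that the heading text (`midashi`) goes straight into a file name. A heading that contains a character the file system does not allow, such as `/`, would make the path invalid. The following is a minimal sketch of one way to guard against that; the class and method names (`FileNameHelper`, `ToSafeFileName`) and the choice of `_` as the replacement character are assumptions for illustration, not something from the thread.

```csharp
using System.IO;
using System.Linq;

public static class FileNameHelper
{
    // Hypothetical helper: replaces every character that the current
    // platform does not allow in a file name with the given replacement.
    public static string ToSafeFileName(string text, char replacement = '_')
    {
        char[] invalid = Path.GetInvalidFileNameChars();
        var cleaned = new string(
            text.Select(c => invalid.Contains(c) ? replacement : c).ToArray());
        return cleaned.Trim();
    }
}
```

It could be applied where the heading is read, for example `var midashi = FileNameHelper.ToSafeFileName(paragraph.InnerText);`. Note that the set of invalid characters depends on the operating system.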

@t0208n
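Thank you. Since you could get the element after switching to Picture, the preceding code looks fine.

By the way, I suspect the actual image data sits in a child element of the Picture you retrieved. Which class is that child element?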

@yuuty

Questioner

Below is the actual XML. I have excerpted only the tag structure.

<w:pict xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <v:shapetype>
    <v:stroke />
    <v:formulas>
      <v:f />
    </v:formulas>
    <v:path />
    <o:lock />
  </v:shapetype>
  <v:shape id="_x0000_i1025">
    <v:imagedata o:title="" r:id="rId7" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:o="urn:schemas-microsoft-com:office:office" />
  </v:shape>
</w:pict>

Based on the XML above, I was able to achieve what I wanted with the code below.
@mrbonjin, @t0208n
Thank you for your help.

Note: I am using Open XML SDK version 2.15.0 so that document.Package is available.

public static void ExtractImages(WordprocessingDocument document, Paragraph paragraph, string imageOutPath, string midashi)
{
    // Snapshot the runs first because the loop body rewrites their children
    foreach (var run in paragraph.Descendants<Run>().ToList())
    {
        var picture = run.Descendants<Picture>().FirstOrDefault();
        if (picture == null)
        {
            continue;
        }

        // Get the v:imagedata element inside the Picture (w:pict) element
        ImageData imageData = picture.Descendants<ImageData>().FirstOrDefault();

        // Get the r:id attribute of v:imagedata; skip pictures that carry no image reference
        string imageId = imageData?.RelationshipId?.Value;
        if (imageId == null)
        {
            continue;
        }
        var imagePart = (ImagePart)document.MainDocumentPart.GetPartById(imageId);

        // Build the output file name from the part URI and the heading text
        var uri = imagePart.Uri;
        var partFileName = uri.ToString().Split('/').Last();
        var extension = System.IO.Path.GetExtension(partFileName);                     // extension
        var imageFilename = System.IO.Path.GetFileNameWithoutExtension(partFileName);  // file name without extension
        string fileName = $"{imageFilename}_{midashi}{extension}";

        // Save the image to the specified folder, releasing the stream and bitmap afterwards
        using (var stream = document.Package.GetPart(uri).GetStream())
        using (var b = new Bitmap(stream))
        {
            b.Save(System.IO.Path.Combine(imageOutPath, fileName));
        }

        // Remove the image and replace it with text
        run.RemoveAllChildren();
        run.Append(new Text("@@" + fileName + "@@"));
    }
}
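As a possible variation on the save step, the image bytes can be copied straight from the `ImagePart` instead of going through `System.Drawing.Bitmap`. This avoids decoding and re-encoding the image and does not need `document.Package`. The following is a minimal sketch under those assumptions; the class and method names (`ImageSaver`, `SaveImagePart`) and the `suffix` parameter are hypothetical, and the relationship id is assumed to come from `v:imagedata` as in the code above.

```csharp
using System.IO;
using DocumentFormat.OpenXml.Packaging;

public static class ImageSaver
{
    // Hypothetical helper: copies the bytes of the image part identified by
    // relationshipId into outputFolder without re-encoding them.
    // Returns the file name that was written, or null if the relationship
    // does not point to an image part.
    public static string SaveImagePart(MainDocumentPart mainPart, string relationshipId,
                                       string outputFolder, string suffix)
    {
        var imagePart = mainPart.GetPartById(relationshipId) as ImagePart;
        if (imagePart == null)
        {
            return null;
        }

        // The part URI is a path inside the package (typically something like /word/media/image1.jpeg)
        string partName = imagePart.Uri.OriginalString;
        string fileName = Path.GetFileNameWithoutExtension(partName)
                          + "_" + suffix + Path.GetExtension(partName);

        using (Stream source = imagePart.GetStream(FileMode.Open, FileAccess.Read))
        using (FileStream target = File.Create(Path.Combine(outputFolder, fileName)))
        {
            source.CopyTo(target);
        }
        return fileName;
    }
}
```

A call such as `ImageSaver.SaveImagePart(document.MainDocumentPart, imageId, imageOutPath, midashi)` would return the name to place between the `@@` markers.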
