
Converting DOCX and ODT to plain text with JavaScript

Last updated at 2016-02-18, posted at 2016-02-18

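MS Word's .docx and LibreOffice's .odt files are XML bundled into a zip archive, so it should be possible to read them from JavaScript as well (in theory). I had an occasion to try it.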

Environment

A browser that supports DOMParser
JSZip ( https://stuk.github.io/jszip/ ), for reading and writing zip files in JavaScript

Investigating the DOCX structure

Unzipping a .docx created with Word showed that the text lives in word/document.xml, structured as follows.

word/document.xml

<w:document>
  <w:body>
    <w:p>
      <w:r>
        <w:t>plain text</w:t>
      </w:r>
      <w:r>
        <w:tab/>
      </w:r>
    </w:p>
    <w:p>....</w:p>
  </w:body>
</w:document>

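<w:p> is rendered as its own paragraph with a line break, much like the HTML P element
The text is split into many small pieces across numerous <w:r> elements
A tab (U+0009) does not appear inside <w:t>; it seems to be converted to a <w:tab> in a separate <w:r>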

Implementing docx2txt

"use strict";
/**
 * Reads a .docx file and converts it to plain text
 * @param file {File|Blob} the docx file
 * @param callback {function} function that receives the plain text as its argument
 */
var docx2txt = function(file, callback) {
  var fr = new FileReader();
  fr.onload = function() {
    var xml,dom,txt,p,i,r,j,t,k;
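    // JSZip 2.x synchronous API; JSZip 3.x replaces it with JSZip.loadAsync() and file.async('string')
    xml = new JSZip(fr.result).file('word/document.xml').asText();
    dom = (new DOMParser()).parseFromString(xml, 'application/xml');
    txt = "";
    // look up w:body by name instead of assuming it is the first child of w:document
    p = dom.getElementsByTagName('w:body')[0].childNodes; //w:document>w:body>w:p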
    for(i=0; i<p.length; i++) {
      if (p[i].nodeName !== 'w:p') {continue;}
      r = p[i].childNodes;
      for(j=0; j<r.length; j++) {
        if (r[j].nodeName !== 'w:r') {continue;}
        t = r[j].childNodes;
        for(k=0; k<t.length; k++) {
          if (t[k].nodeName === 'w:t') {txt += t[k].textContent;}
          else if (t[k].nodeName === 'w:tab') {txt += "\t";}
        }
      }
      txt += "\n";
    }
    callback(txt);
  };
  fr.readAsArrayBuffer(file);
};

//A File object can come from drag and drop, input[type='file'], and so on. The example below uses an input element
document.getElementById('inputFile').onchange = function(e) {
  var file = e.target.files[0];
  if (file && file.name.match(/\.docx$/i)) {
    docx2txt(file, function(txt) {
      console.info(txt);
    });
  }
};
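
The implementation above relies on the synchronous JSZip 2.x calls. As a rough sketch only, the same extraction could be written against the promise-based JSZip 3.x API (JSZip.loadAsync and file.async('string')). The names paragraphToText, bodyToText, and docx2txtAsync are hypothetical and not from the original article, and the sketch assumes JSZip 3.x is loaded as the global JSZip.

```javascript
"use strict";

// Collect the text of one <w:p>: concatenate <w:t> contents and map <w:tab/> to "\t".
function paragraphToText(p) {
  var txt = "";
  var runs = p.childNodes;
  for (var j = 0; j < runs.length; j++) {
    if (runs[j].nodeName !== 'w:r') { continue; }
    var items = runs[j].childNodes;
    for (var k = 0; k < items.length; k++) {
      if (items[k].nodeName === 'w:t') { txt += items[k].textContent; }
      else if (items[k].nodeName === 'w:tab') { txt += "\t"; }
    }
  }
  return txt;
}

// Convert the direct children of <w:body> to plain text, one line per <w:p>.
function bodyToText(body) {
  var txt = "";
  var nodes = body.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    if (nodes[i].nodeName === 'w:p') { txt += paragraphToText(nodes[i]) + "\n"; }
  }
  return txt;
}

// Hypothetical promise-based wrapper (assumes the JSZip 3.x global and a browser DOMParser).
function docx2txtAsync(file) {
  return JSZip.loadAsync(file)
    .then(function (zip) { return zip.file('word/document.xml').async('string'); })
    .then(function (xml) {
      var dom = new DOMParser().parseFromString(xml, 'application/xml');
      return bodyToText(dom.getElementsByTagName('w:body')[0]);
    });
}
```

Splitting the DOM walk from the file handling keeps the walk independent of the zip library, so it can be exercised with any object that exposes nodeName, childNodes, and textContent.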

Investigating the ODT structure

The text lives in content.xml, at the top level of the unzipped .odt, and is structured as follows.

content.xml

<office:document-content>
  <office:body>
    <office:text>
      <text:h>
        <text:span>plain text</text:span>
      </text:h>
      <text:p>
        <text:span>
          <text:s/> is a space
          <text:tab/> is a tab
        </text:span>
      </text:p>
      ...
    </office:text>
  </office:body>
</office:document-content>

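Similar to DOCX, but with some differences
Under <office:text> there are <text:h> or <text:p> elements
The plain text sits inside <text:span> elements directly under <text:p> (or <text:h>)
Besides plain text, a <text:span> may also contain <text:s/> for a space and <text:tab/> for a tab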

Implementing odt2txt

Nearly identical, so omitted :P
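
Since the article leaves this part out, here is one way it might look, based only on the structure described above. This is a sketch, not the author's code. The helper names odtNodeToText and odtTextToPlain are hypothetical, the walk is recursive so that text placed directly in <text:p> is also picked up, and it assumes that the text:c attribute on <text:s/> gives a repeat count (one space when absent). The file handling mirrors docx2txt and uses the same JSZip 2.x calls.

```javascript
"use strict";

// Recursively collect the text below one ODT node.
function odtNodeToText(node) {
  if (node.nodeType === 3) { return node.nodeValue; }   // text node
  if (node.nodeName === 'text:tab') { return "\t"; }
  if (node.nodeName === 'text:s') {
    // Assumption: text:c holds the number of spaces; default to one.
    var c = parseInt(node.getAttribute('text:c'), 10);
    return new Array((c > 0 ? c : 1) + 1).join(" ");
  }
  var txt = "";
  var kids = node.childNodes;
  for (var i = 0; i < kids.length; i++) { txt += odtNodeToText(kids[i]); }
  return txt;
}

// Convert the direct children of <office:text>: one line per <text:p> or <text:h>.
function odtTextToPlain(officeText) {
  var txt = "";
  var nodes = officeText.childNodes;
  for (var i = 0; i < nodes.length; i++) {
    var name = nodes[i].nodeName;
    if (name === 'text:p' || name === 'text:h') { txt += odtNodeToText(nodes[i]) + "\n"; }
  }
  return txt;
}

// Same shape as docx2txt: read the file, unzip content.xml, parse, walk.
var odt2txt = function(file, callback) {
  var fr = new FileReader();
  fr.onload = function() {
    var xml = new JSZip(fr.result).file('content.xml').asText();
    var dom = (new DOMParser()).parseFromString(xml, 'application/xml');
    callback(odtTextToPlain(dom.getElementsByTagName('office:text')[0]));
  };
  fr.readAsArrayBuffer(file);
};
```

Lists, tables, and other containers under <office:text> are skipped by this sketch, just as docx2txt only looks at top-level paragraphs.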

Discussion

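Extracting plain text from DOCX and ODT with JavaScript turned out not to be that difficult.
I tried it on a file of about 1 MB, and processing took a few seconds.
I have not verified this in detail, so some characters may not be converted.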
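
One way to look for content that is not being converted is to count the run-level elements the converter does not recognise. The sketch below is an assumption-laden illustration rather than part of the original implementation. RUN_ELEMENT_TEXT and runToText are hypothetical names, and the extra mappings (<w:br/> and <w:cr/> to a newline, <w:noBreakHyphen/> to U+2011) are plausible plain-text choices, not something the article tested.

```javascript
"use strict";

// Hypothetical lookup table: run-level elements and the text they stand for.
var RUN_ELEMENT_TEXT = {
  'w:tab': "\t",
  'w:br': "\n",              // line break; page and column breaks use the same element
  'w:cr': "\n",
  'w:noBreakHyphen': "\u2011"
};

// Convert one <w:r>; tally unrecognised child elements in `unknown`.
function runToText(run, unknown) {
  var txt = "";
  var items = run.childNodes;
  for (var k = 0; k < items.length; k++) {
    var name = items[k].nodeName;
    if (name === 'w:t') {
      txt += items[k].textContent;
    } else if (Object.prototype.hasOwnProperty.call(RUN_ELEMENT_TEXT, name)) {
      txt += RUN_ELEMENT_TEXT[name];
    } else if (name !== 'w:rPr') {
      // w:rPr only carries formatting, so it is not counted as missing text.
      unknown[name] = (unknown[name] || 0) + 1;
    }
  }
  return txt;
}
```

After a conversion, logging the `unknown` object shows which element names were skipped and how often, which gives a starting point for deciding what else needs a mapping.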
