More than 3 years have passed since last update.

Extracting plain text from a docx file


Unzipping the docx file

Running unzip file.docx directly extracts quite a few files:

├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── word
    ├── _rels
    │   └── document.xml.rels
    ├── document.xml
    └── settings.xml

Take a look at word/document.xml: it is a perfectly ordinary XML file.
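To see where the text sits inside the XML, a small sketch like the following may help. It assumes the archive has already been unpacked into the current directory and that the document follows the usual WordprocessingML layout, where visible text is held in w:t elements. list_text_runs is a hypothetical helper name, and grep -o is a common extension (available in GNU and BSD grep) rather than strict POSIX.

```shell
# Hypothetical helper: print every <w:t>...</w:t> element found in an XML file.
# Assumes text runs contain no nested tags, which is the usual case.
list_text_runs() {
  grep -o '<w:t\( [^>]*\)\{0,1\}>[^<]*</w:t>' "$1"
}

# Usage, after unzipping file.docx into the current directory:
#   list_text_runs word/document.xml | head
```

Everything outside these elements is markup for paragraphs, runs, and formatting, which is why deleting all tags is enough to leave only the text.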

Extracting the plain text from the XML

cat word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
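The effect of the two sed expressions can be checked on a small inline sample. This is only a sketch: the sample string and the strip_tags function name are made up for illustration, and a real document.xml is far larger.

```shell
# Hypothetical sample shaped like a fragment of word/document.xml.
sample='<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p>'

# The same two sed expressions, wrapped in a function (the name is illustrative).
strip_tags() {
  # 1st expression: delete every <...> tag.
  # 2nd expression: delete runs of non-printable characters.
  sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
}

printf '%s\n' "$sample" | strip_tags   # prints: HelloWorld
```

Note that the two paragraphs come out joined, because the tags are removed without leaving any separator. What counts as printable depends on the locale, so in a non-UTF-8 locale such as C the second expression may also remove non-ASCII text.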

The combined command is shown below. unzip -p extracts the file to a pipe (standard output) instead of writing it to disk.

unzip -p file.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
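Stripping tags this way joins all paragraphs into one run of text, so one possible variation is to turn each paragraph end tag into a line break first. The sketch below assumes paragraphs end with </w:p>, as in typical Word output. xml_to_paragraphs and docx_to_text are hypothetical names, not part of any existing tool.

```shell
# Hypothetical filter: read document.xml on stdin, write one paragraph per line.
xml_to_paragraphs() {
  # First sed: replace each closing paragraph tag with a newline
  # (backslash + literal newline is the portable way to insert one).
  # Second sed: the tag and non-printable stripping, applied line by line.
  sed -e 's/<\/w:p>/\
/g' | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
}

# Hypothetical wrapper combining the filter with unzip -p.
docx_to_text() {
  unzip -p "$1" word/document.xml | xml_to_paragraphs
}

# Usage: docx_to_text file.docx
```

Two sed processes are used so that the newline inserted by the first is not removed again as a non-printable character by the second. XML entities such as &amp;amp; are left undecoded by this sketch.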
