More than 5 years have passed since last update.

Extracting Text from Documents with POI and Tika

Introduction

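Features such as full-text search sometimes need to search inside the contents of uploaded files.
To support this, extract the text from each file when building an index (for example in Elasticsearch) and add it to the indexed document as an extra field.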
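
As a rough illustration of the flow described above, the following sketch takes text that has already been extracted and builds a JSON body with that text in an additional field. The class name, the field names `file_name` and `content`, and the hand-written escaping are assumptions made for this example and are not part of Elasticsearch or of the libraries covered here; a real project would normally use a JSON library and an Elasticsearch client.

```java
class IndexDocumentSketch {

	// Escape characters that must not appear raw inside a JSON string
	static String escapeJson(String s) {
		StringBuilder sb = new StringBuilder();
		for (char c : s.toCharArray()) {
			switch (c) {
				case '"': sb.append("\\\""); break;
				case '\\': sb.append("\\\\"); break;
				case '\n': sb.append("\\n"); break;
				case '\r': sb.append("\\r"); break;
				case '\t': sb.append("\\t"); break;
				default:
					if (c < 0x20) {
						// Other control characters become \u00XX escapes
						sb.append(String.format("\\u%04x", (int) c));
					} else {
						sb.append(c);
					}
			}
		}
		return sb.toString();
	}

	// Build the JSON body: the file name plus the extracted text
	// ("file_name" and "content" are hypothetical field names)
	static String toJson(String fileName, String extractedText) {
		return "{\"file_name\":\"" + escapeJson(fileName)
				+ "\",\"content\":\"" + escapeJson(extractedText) + "\"}";
	}

	public static void main(String[] args) {
		// The second argument stands in for the output of an extractor
		System.out.println(toJson("test.xlsx", "Sheet1\nhello"));
	}
}
```

The string returned by `toJson` is what would be sent to the search engine as the document body, so the uploaded file becomes searchable through the `content` field.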

Tools for extracting the text include POI and Tika.

POI: https://poi.apache.org/
Tika: https://tika.apache.org/

POI

What is POI?

Apache POI is "the Java API for Microsoft Documents".
It can create, edit, and extract text from Word, Excel, and PowerPoint files.

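Extraction samples with POI

Adding the POI library

build.gradle
// https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml
// The `compile` configuration was removed in Gradle 7; use `implementation` instead
implementation group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.0'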

Excel⇒text

ExcelExtractor.java
import java.io.File;
import java.io.IOException;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbookFactory;

public class ExcelExtractor {

	public static void main(String[] args) throws InvalidFormatException, IOException {
		// Open the workbook in read-only mode (second argument)
		XSSFWorkbook workbook = XSSFWorkbookFactory.createWorkbook(new File("/data/test.xlsx"), true);
		XSSFExcelExtractor excel = new XSSFExcelExtractor(workbook);
		
		// Extract the text
		System.out.println(excel.getText());
		
		// Close the extractor
		excel.close();
	}
}

Word⇒text

WordExtractor.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {

	public static void main(String[] args) throws FileNotFoundException, IOException {
		XWPFDocument doc = new XWPFDocument(new FileInputStream(new File("/data/test.docx")));
		XWPFWordExtractor word = new XWPFWordExtractor(doc);
		// Extract the text
		System.out.println(word.getText());
		// Close the extractor
		word.close();
	}
}

PowerPoint⇒text

PowerPointExtractor.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor;
import org.apache.poi.xslf.usermodel.XMLSlideShow;

public class PowerPointExtractor {

	public static void main(String[] args) throws FileNotFoundException, IOException {
		XMLSlideShow ppt = new XMLSlideShow(new FileInputStream("/data/test.pptx"));
		XSLFPowerPointExtractor powerPointExtractor = new XSLFPowerPointExtractor(ppt);
		// Extract the text
		System.out.println(powerPointExtractor.getText());
		// Close the extractor
		powerPointExtractor.close();
	}
}
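
The three POI samples above follow the same pattern but use a different extractor class for each format, so code that accepts arbitrary uploads has to pick one. Below is a minimal sketch of that choice based on the file extension. The class and method names are hypothetical, and the method only returns the name of the extractor class used in the samples rather than calling POI, so that it runs without any dependency.

```java
import java.util.Locale;
import java.util.Optional;

class PoiExtractorChooser {

	// Map a file name to the POI extractor class used in the samples above.
	// Returns an empty Optional for formats those samples do not cover.
	static Optional<String> extractorFor(String fileName) {
		String lower = fileName.toLowerCase(Locale.ROOT);
		if (lower.endsWith(".xlsx")) {
			return Optional.of("XSSFExcelExtractor");
		}
		if (lower.endsWith(".docx")) {
			return Optional.of("XWPFWordExtractor");
		}
		if (lower.endsWith(".pptx")) {
			return Optional.of("XSLFPowerPointExtractor");
		}
		return Optional.empty();
	}

	public static void main(String[] args) {
		for (String name : new String[] {"/data/test.xlsx", "/data/test.docx", "/data/test.pdf"}) {
			System.out.println(name + " -> " + extractorFor(name).orElse("(no POI extractor in this sketch)"));
		}
	}
}
```

Relying on the extension is a simplification; a renamed file would be routed to the wrong extractor.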

Tika

What is Tika?

The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types.
It can extract document metadata and text, detect file types, and more.
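
To give a feel for what content-based file type detection means, here is a simplified sketch that inspects the first bytes of a file (its "magic bytes"). It is an illustration only and is not Tika's implementation, which covers far more formats and signals. The class name, the method names, and the returned labels are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MagicByteSketch {

	// True if data begins with the given prefix
	static boolean startsWith(byte[] data, byte[] prefix) {
		if (data.length < prefix.length) {
			return false;
		}
		return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
	}

	// Guess a coarse file type from the leading bytes of a file
	static String guessType(byte[] head) {
		// PDF files start with the ASCII text "%PDF-"
		if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
			return "pdf";
		}
		// .docx/.xlsx/.pptx are ZIP containers, which start with "PK" 0x03 0x04
		if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
			return "zip-container";
		}
		// Legacy .doc/.xls/.ppt use the OLE2 container signature
		if (startsWith(head, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
			return "ole2-container";
		}
		return "unknown";
	}

	public static void main(String[] args) {
		byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
		byte[] zipHead = {'P', 'K', 3, 4, 20, 0};
		System.out.println(guessType(pdfHead));
		System.out.println(guessType(zipHead));
	}
}
```

A ZIP signature alone cannot tell a .docx from a .xlsx or an ordinary archive, which is one reason to leave detection to a library instead of hand-written checks like these.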

Adding the Tika library

build.gradle
// The `compile` configuration was removed in Gradle 7; use `implementation` instead
// https://mvnrepository.com/artifact/org.apache.tika/tika-core
implementation group: 'org.apache.tika', name: 'tika-core', version: '1.22'

// https://mvnrepository.com/artifact/org.apache.tika/tika-parsers
implementation group: 'org.apache.tika', name: 'tika-parsers', version: '1.22'

Extraction sample with Tika

TikaExtractor.java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaExtractor {

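	public static void main(String[] args) throws IOException, TikaException {
		// The Tika facade detects the file type and selects a parser automatically
		Tika tika = new Tika();
		// By default parseToString returns at most 100,000 characters;
		// call tika.setMaxStringLength(...) to change the limit
		String data = tika.parseToString(new File("/data/test.pdf"));
		System.out.println(data);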
	}
}

The same call extracts text from plain text files, PDF, Word, Excel, PowerPoint, and other formats, which makes Tika convenient.

That's all.
