
Trying to read a PDF with Google's Cloud Vision

Posted at 2019-05-12

Preparing to use Cloud Vision

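The first step is to get GCP set up.
I signed up for the free trial.

Once the GCP account has been issued, search for "Cloud Vision" and enable the API.

Using the API requires credentials (a service account key), so create one.
I followed the "using a service account" steps in the official documentation.

Pick any service account name, set the role to Service Account Admin, and choose JSON as the key type.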

Next, set the GOOGLE_APPLICATION_CREDENTIALS environment variable.

I created a folder like "C:\GCP\Credential\CloudVision" and placed the credentials JSON file in it, so the setting looks like this:

GOOGLE_APPLICATION_CREDENTIALS=C:\GCP\Credential\CloudVision\xxx.json
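Before calling the API, it can help to confirm that the JVM actually sees the variable and that it points to an existing file. The following is a minimal sketch of such a check using only the Java standard library; the class name CredentialCheck and the method describe are made up for illustration and are not part of any Google library.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class CredentialCheck {

    /** Describes the state of a credentials path; the value normally comes from the environment. */
    static String describe(String credentialPath) {
        if (credentialPath == null || credentialPath.trim().isEmpty()) {
            return "GOOGLE_APPLICATION_CREDENTIALS is not set";
        }
        Path path = Paths.get(credentialPath);
        if (!Files.isRegularFile(path)) {
            return "Key file not found: " + path;
        }
        return "Key file found: " + path;
    }

    public static void main(String[] args) {
        // Reads the same variable that the client library looks up.
        System.out.println(describe(System.getenv("GOOGLE_APPLICATION_CREDENTIALS")));
    }
}
```

If this reports that the variable is not set even though it was just configured, restarting the IDE or terminal is worth trying, since a running process does not pick up newly added environment variables.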

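Calling the Cloud Vision API

This time I use Java in a Gradle project, so the first step is to add the dependencies.

I referred to the Beta Client Libraries page.

As of this writing (2019-05-12), the English page shows "compile 'com.google.cloud:google-cloud-vision:1.68.0'", while the Japanese page still shows "compile 'com.google.cloud:google-cloud-vision:1.14.0'".

Since I will try "PDF/TIFF document text detection" later, I match the settings used by that beta sample.

Checking the pom of that sample at this point shows: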

    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-vision</artifactId>
      <version>1.64.0</version>
    </dependency>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>google-cloud-storage</artifactId>
      <version>1.62.0</version>
    </dependency>

so I go with these versions.

    compile 'com.google.cloud:google-cloud-vision:1.64.0'
    compile 'com.google.cloud:google-cloud-storage:1.62.0'

To start, I try an image that contains nothing but the text "LGTM".

The code below is almost exactly the sample code.


package cloudVision;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.vision.v1.AnnotateImageRequest;
import com.google.cloud.vision.v1.Feature;
import com.google.cloud.vision.v1.Feature.Type;
import com.google.cloud.vision.v1.Image;
import com.google.cloud.vision.v1.ImageAnnotatorClient;
import com.google.protobuf.ByteString;

public class Program {

	public static void main(String[] args) {
		try {
			var path = System.getProperty("user.dir");

			detectText(path + "\\src\\main\\resources\\image\\lgtm.jpg", System.out);
		} catch (Exception e) {
			System.out.println(e.getMessage());
		}
	}

	public static void detectText(String filePath, PrintStream out) throws Exception, IOException {
		List<AnnotateImageRequest> requests = new ArrayList<>();

		// Read the image bytes and close the stream afterwards.
		ByteString imgBytes;
		try (var in = new FileInputStream(filePath)) {
			imgBytes = ByteString.readFrom(in);
		}

		var img = Image.newBuilder().setContent(imgBytes).build();
		var feat = Feature.newBuilder().setType(Type.TEXT_DETECTION).build();
		var request = AnnotateImageRequest.newBuilder().addFeatures(feat).setImage(img).build();
		requests.add(request);

		try (var client = ImageAnnotatorClient.create()) {
			var response = client.batchAnnotateImages(requests);
			var responses = response.getResponsesList();

			for (var res : responses) {
				if (res.hasError()) {
					out.printf("Error: %s\n", res.getError().getMessage());
					return;
				}

				// For full list of available annotations, see http://g.co/cloud/vision/docs
				for (var annotation : res.getTextAnnotationsList()) {
					out.printf("Text: %s\n", annotation.getDescription());
					out.printf("Position : %s\n", annotation.getBoundingPoly());
				}
			}
		}
	}
}

A WARNING is printed, but it matches an issue that is already being discussed and is apparently going to be fixed in a future release, so I ignore it for now.



Text: LGTM

Position : vertices {
  x: 569
  y: 385
}
vertices {
  x: 822
  y: 385
}
vertices {
  x: 822
  y: 463
}
vertices {
  x: 569
  y: 463
}

The result above came back.

The image contains only text typed in Windows Paint, so "LGTM" is returned as is.
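The four vertices are the corners of the detected text region in image pixels. As a small illustration of how they can be used, the sketch below computes the width and height of the enclosing box from the values printed above. It uses plain int arrays rather than the library's BoundingPoly type, and the class name BoundingBoxSize is hypothetical.

```java
final class BoundingBoxSize {

    /** Returns {width, height} of the axis-aligned box enclosing the given {x, y} vertices. */
    static int[] size(int[][] vertices) {
        int minX = Integer.MAX_VALUE;
        int minY = Integer.MAX_VALUE;
        int maxX = Integer.MIN_VALUE;
        int maxY = Integer.MIN_VALUE;
        for (int[] v : vertices) {
            minX = Math.min(minX, v[0]);
            maxX = Math.max(maxX, v[0]);
            minY = Math.min(minY, v[1]);
            maxY = Math.max(maxY, v[1]);
        }
        return new int[] {maxX - minX, maxY - minY};
    }

    public static void main(String[] args) {
        // Vertices copied from the "LGTM" result above.
        int[][] lgtm = {{569, 385}, {822, 385}, {822, 463}, {569, 463}};
        int[] wh = size(lgtm);
        System.out.printf("width=%d, height=%d%n", wh[0], wh[1]); // width=253, height=78
    }
}
```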

To use PDF/TIFF document text detection, the PDF file has to be placed in Google Cloud Storage.
Create a bucket with any name and upload a sample PDF to it.

Grant the service account access to the bucket you created.
In the bucket's Permissions tab, choose Add members and add the service account "xxx@xxx.iam.gserviceaccount.com".

The results are apparently written straight to Storage, so I assign the Storage Admin role.

I ran it against a PDF that contains only the text "Done is better than perfect".
This is again almost the sample as is:

package cloudVision;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.google.api.gax.longrunning.OperationFuture;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.Storage.BlobListOption;
import com.google.cloud.storage.StorageOptions;
import com.google.cloud.vision.v1p4beta1.AnnotateFileResponse;
import com.google.cloud.vision.v1p4beta1.AnnotateImageResponse;
import com.google.cloud.vision.v1p4beta1.AsyncAnnotateFileRequest;
import com.google.cloud.vision.v1p4beta1.AsyncAnnotateFileResponse;
import com.google.cloud.vision.v1p4beta1.AsyncBatchAnnotateFilesResponse;
import com.google.cloud.vision.v1p4beta1.Feature;
import com.google.cloud.vision.v1p4beta1.GcsDestination;
import com.google.cloud.vision.v1p4beta1.GcsSource;
import com.google.cloud.vision.v1p4beta1.ImageAnnotatorClient;
import com.google.cloud.vision.v1p4beta1.InputConfig;
import com.google.cloud.vision.v1p4beta1.OperationMetadata;
import com.google.cloud.vision.v1p4beta1.OutputConfig;
import com.google.protobuf.util.JsonFormat;

public class Program {

	public static void main(String[] args) {
		try {
			// Replace xxx with your own bucket name.
			String gcsPath = "gs://xxx/Done is better than perfect.pdf";
			detectDocumentsGcs(gcsPath, gcsPath);
		} catch (Exception e) {
			System.out.println(e.getMessage());
		}
	}

	/**
	 * Performs document text OCR with PDF/TIFF as source files on Google Cloud Storage.
	 *
	 * @param gcsSourcePath The path to the remote file on Google Cloud Storage to detect document
	 *                      text on.
	 * @param gcsDestinationPath The path to the remote file on Google Cloud Storage to store the
	 *                           results on.
	 * @throws Exception on errors while closing the client.
	 */
	public static void detectDocumentsGcs(String gcsSourcePath, String gcsDestinationPath) throws
	    Exception {
	  try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
	    List<AsyncAnnotateFileRequest> requests = new ArrayList<>();

	    // Set the GCS source path for the remote file.
	    GcsSource gcsSource = GcsSource.newBuilder()
	        .setUri(gcsSourcePath)
	        .build();

	    // Create the configuration with the specified MIME (Multipurpose Internet Mail Extensions)
	    // types
	    InputConfig inputConfig = InputConfig.newBuilder()
	        .setMimeType("application/pdf") // Supported MimeTypes: "application/pdf", "image/tiff"
	        .setGcsSource(gcsSource)
	        .build();

	    // Set the GCS destination path for where to save the results.
	    GcsDestination gcsDestination = GcsDestination.newBuilder()
	        .setUri(gcsDestinationPath)
	        .build();

	    // Create the configuration for the output with the batch size.
	    // The batch size sets how many pages should be grouped into each json output file.
	    OutputConfig outputConfig = OutputConfig.newBuilder()
	        .setBatchSize(2)
	        .setGcsDestination(gcsDestination)
	        .build();

	    // Select the Feature required by the vision API
	    Feature feature = Feature.newBuilder().setType(Feature.Type.DOCUMENT_TEXT_DETECTION).build();

	    // Build the OCR request
	    AsyncAnnotateFileRequest request = AsyncAnnotateFileRequest.newBuilder()
	        .addFeatures(feature)
	        .setInputConfig(inputConfig)
	        .setOutputConfig(outputConfig)
	        .build();

	    requests.add(request);

	    // Perform the OCR request
	    OperationFuture<AsyncBatchAnnotateFilesResponse, OperationMetadata> response =
	        client.asyncBatchAnnotateFilesAsync(requests);

	    System.out.println("Waiting for the operation to finish.");

	    // Wait for the request to finish. (The result is not used, since the API saves the result to
	    // the specified location on GCS.)
	    List<AsyncAnnotateFileResponse> result = response.get(180, TimeUnit.SECONDS)
	        .getResponsesList();

	    // Once the request has completed and the output has been
	    // written to GCS, we can list all the output files.
	    Storage storage = StorageOptions.getDefaultInstance().getService();

	    // Get the destination location from the gcsDestinationPath
	    Pattern pattern = Pattern.compile("gs://([^/]+)/(.+)");
	    Matcher matcher = pattern.matcher(gcsDestinationPath);

	    if (matcher.find()) {
	      String bucketName = matcher.group(1);
	      String prefix = matcher.group(2);

	      // Get the list of objects with the given prefix from the GCS bucket
	      Bucket bucket = storage.get(bucketName);
	      com.google.api.gax.paging.Page<Blob> pageList = bucket.list(BlobListOption.prefix(prefix));

	      Blob firstOutputFile = null;

	      // List objects with the given prefix.
	      System.out.println("Output files:");
	      for (Blob blob : pageList.iterateAll()) {
	        System.out.println(blob.getName());

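	        // Process the first output file from GCS.
	        // Since we specified batch size = 2, the first response contains
	        // the first two pages of the input file.
	        // Skip non-JSON objects: when the destination prefix equals the source path
	        // (as in main above), the listing also returns the source PDF itself.
	        if (firstOutputFile == null && blob.getName().endsWith(".json")) {
	          firstOutputFile = blob;
	        }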
	      }

	      // Get the contents of the file and convert the JSON contents to an AnnotateFileResponse
	      // object. If the Blob is small read all its content in one request
	      // (Note: the file is a .json file)
	      // Storage guide: https://cloud.google.com/storage/docs/downloading-objects
	      String jsonContents = new String(firstOutputFile.getContent());
	      com.google.cloud.vision.v1p4beta1.AnnotateFileResponse.Builder builder = AnnotateFileResponse.newBuilder();
	      JsonFormat.parser().merge(jsonContents, builder);

	      // Build the AnnotateFileResponse object
	      AnnotateFileResponse annotateFileResponse = builder.build();

	      // Parse through the object to get the actual response for the first page of the input file.
	      AnnotateImageResponse annotateImageResponse = annotateFileResponse.getResponses(0);

	      // Here we print the full text from the first page.
	      // The response contains more information:
	      // annotation/pages/blocks/paragraphs/words/symbols
	      // including confidence score and bounding boxes
	      System.out.format("\nText: %s\n", annotateImageResponse.getFullTextAnnotation().getText());
	    } else {
	      System.out.println("No MATCH");
	    }
	  }
	}
}

This is the code I ran.
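The sample splits the destination URI into a bucket name and an object prefix with a regular expression, then lists every object under that prefix. The sketch below isolates that step using only the standard library. Note the assumptions: the class name GcsPathSplit is hypothetical, it uses matches() (stricter than the find() in the sample), and the output object name "output-1-to-1.json" appended to the prefix is what I understand the API to produce, not something confirmed in this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcsPathSplit {

    // Same pattern as in the sample: bucket name, then the rest as the object prefix.
    private static final Pattern GCS_URI = Pattern.compile("gs://([^/]+)/(.+)");

    /** Returns {bucket, prefix}, or null when the URI is not of the form gs://bucket/object. */
    static String[] split(String gcsUri) {
        Matcher matcher = GCS_URI.matcher(gcsUri);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] {matcher.group(1), matcher.group(2)};
    }

    public static void main(String[] args) {
        String[] parts = split("gs://xxx/Done is better than perfect.pdf");
        System.out.println("bucket = " + parts[0]);
        System.out.println("prefix = " + parts[1]);

        // Assumed output object name: destination prefix + "output-1-to-1.json".
        String assumedOutput = parts[1] + "output-1-to-1.json";
        String sourcePdf = "Done is better than perfect.pdf";
        // Both names start with the prefix, so a prefix listing returns both objects.
        System.out.println(assumedOutput.startsWith(parts[1]));
        System.out.println(sourcePdf.startsWith(parts[1]));
    }
}
```

Because the same path is passed as both source and destination here, a listing by prefix would include the source PDF alongside the JSON output. Passing a separate destination prefix is one way to keep the two apart.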

Looking at the resulting JSON file, the text was extracted cleanly, with "languageCode": "en" and "text": "Done is better than perfect\n".

{
	"inputConfig": {
		"gcsSource": {
			//xxx is a masked placeholder.
			"uri": "gs://xxx/Done is better than perfect.pdf"
		},
		"mimeType": "application/pdf"
	},
	"responses": [
		{
			"fullTextAnnotation": {
				"pages": [
					{
						"property": {
							"detectedLanguages": [
								{
									"languageCode": "en",
									"confidence": 1
								}
							]
						},
//...(omitted)...
				"text": "Done is better than perfect\n"
			},
			"context": {
				"uri": "gs://gcptesthasumi/Done is better than perfect.pdf",
				"pageNumber": 1
			}
		}
	]
}

Next, I try Japanese.
The PDF's content is the same as its file name.

I change the path to

String gcsPath = "gs://xxx/たぶん動くからリリースしようぜ.pdf";

and run it again.

Here is the output in full, without omissions:

{
	"inputConfig": {
		"gcsSource": {
			//xxx is a masked placeholder.
			"uri": "gs://xxx/たぶん動くからリリースしようぜ.pdf"
		},
		"mimeType": "application/pdf"
	},
	"responses": [
		{
			"fullTextAnnotation": {
				"pages": [
					{
						"property": {
							"detectedLanguages": [
								{
									"languageCode": "ja",
									"confidence": 1
								}
							]
						},
						"width": 596,
						"height": 843,
						"blocks": [
							{
								"boundingBox": {
									"normalizedVertices": [
										{
											"x": 0.12416107,
											"y": 0.085409254
										},
										{
											"x": 0.3942953,
											"y": 0.08422301
										},
										{
											"x": 0.3942953,
											"y": 0.09845789
										},
										{
											"x": 0.12416107,
											"y": 0.10083037
										}
									]
								},
								"paragraphs": [
									{
										"boundingBox": {
											"normalizedVertices": [
												{
													"x": 0.12416107,
													"y": 0.085409254
												},
												{
													"x": 0.3942953,
													"y": 0.08422301
												},
												{
													"x": 0.3942953,
													"y": 0.09845789
												},
												{
													"x": 0.12416107,
													"y": 0.10083037
												}
											]
										},
										"words": [
											{
												"property": {
													"detectedLanguages": [
														{
															"languageCode": "ja"
														}
													]
												},
												"boundingBox": {
													"normalizedVertices": [
														{
															"x": 0.12416107,
															"y": 0.085409254
														},
														{
															"x": 0.3942953,
															"y": 0.08422301
														},
														{
															"x": 0.3942953,
															"y": 0.09845789
														},
														{
															"x": 0.12416107,
															"y": 0.10083037
														}
													]
												},
												"symbols": [
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "た",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "ぶ",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "ん",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "動",
														"confidence": 0.97
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "く",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "か",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "ら",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "リ",
														"confidence": 0.97
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "リ",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "ー",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "ス",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "し",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "よ",
														"confidence": 0.98
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															]
														},
														"text": "う",
														"confidence": 0.99
													},
													{
														"property": {
															"detectedLanguages": [
																{
																	"languageCode": "ja"
																}
															],
															"detectedBreak": {
																"type": "LINE_BREAK"
															}
														},
														"text": "ぜ",
														"confidence": 0.99
													}
												],
												"confidence": 0.98
											}
										],
										"confidence": 0.98
									}
								],
								"blockType": "TEXT",
								"confidence": 0.98
							}
						]
					}
				],
				"text": "たぶん動くからリリースしようぜ\n"
			},
			"context": {
				"uri": "gs://gcptesthasumi/たぶん動くからリリースしようぜ.pdf",
				"pageNumber": 1
			}
		}
	]
}

Japanese at this level seems to work fine.
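Unlike the image result, the PDF output reports positions as normalizedVertices, that is, fractions of the page "width" and "height" given in the same JSON. The sketch below converts the block's vertices back to page units by multiplying by those dimensions. The class name NormalizedVertex is hypothetical, and treating 596 x 843 as PDF points (roughly A4) is my assumption.

```java
final class NormalizedVertex {

    /** Scales a normalized (0..1) vertex to page units and rounds to the nearest integer. */
    static long[] toPageUnits(double nx, double ny, int pageWidth, int pageHeight) {
        return new long[] {Math.round(nx * pageWidth), Math.round(ny * pageHeight)};
    }

    public static void main(String[] args) {
        int pageWidth = 596;  // "width" from the JSON above
        int pageHeight = 843; // "height" from the JSON above

        // Block bounding box copied from the JSON above.
        double[][] block = {
            {0.12416107, 0.085409254},
            {0.3942953, 0.08422301},
            {0.3942953, 0.09845789},
            {0.12416107, 0.10083037}
        };

        for (double[] v : block) {
            long[] p = toPageUnits(v[0], v[1], pageWidth, pageHeight);
            System.out.printf("(%d, %d)%n", p[0], p[1]);
        }
        // Prints (74, 72), (235, 71), (235, 83), (74, 85), one per line.
    }
}
```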

The PDFs used for this test were made by typing the text into Google Docs and downloading the document as a PDF.

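Afterwards I also tried an annual securities report downloaded from EDINET, but that did not work well at all.
Partly because this is a beta, my impression is that the results still depend heavily on the fonts and on how the text is laid out inside the file.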

References

Google Cloud Visionを使ってみた (Trying out Google Cloud Vision)
JavaでGCPのCloud Vision APIを使ってみる (Using GCP's Cloud Vision API from Java)
Official samples
