Node.js: Extracting Text from a PDF

test.js
// Install the dependency first: npm install pdf-parse
const fs = require('fs');
const pdf = require('pdf-parse');

// Extract the text from a text-based (not scanned) PDF.
// Resolves to the extracted text, or null if parsing fails.
async function extractTextFromTextBasedPDF(pdfPath) {
  try {
    const dataBuffer = fs.readFileSync(pdfPath);
    const data = await pdf(dataBuffer);

    console.log('Extracted text:');
    console.log(data.text);

    // Check whether the extracted text contains Japanese characters
    // (hiragana, katakana, or kanji)
    const hasJapanese = /[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF]/.test(data.text);
    console.log('Contains Japanese characters:', hasJapanese);

    return data.text;
  } catch (error) {
    console.error('PDF parsing error:', error);
    return null;
  }
}

// Usage example
extractTextFromTextBasedPDF('./test.pdf')
  .then(text => {
    if (text) {
      console.log('Success: Japanese text was extracted correctly');
    }
  });

C:\Users\XX\textract-test>node test.js
Warning: TT: undefined function: 32
Extracted text:


日本語開始
改行
改行
日本語終了
Contains Japanese characters: true
Success: Japanese text was extracted correctly
test.pdf
日本語開始
改行
改行
日本語終了
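
The check in test.js only reports whether at least one Japanese character is present. As a rough sketch of a stricter check, the extracted text could be compared with the expected contents of test.pdf and the characters counted per script. Both `countJapaneseChars` and `matchesExpected` are hypothetical helpers written for this illustration, not part of pdf-parse, and the Unicode ranges are the same ones test.js uses.

```javascript
// Hypothetical helper: count hiragana, katakana, and kanji in a string.
// The ranges are the ones used by the regular expression in test.js.
function countJapaneseChars(text) {
  const counts = { hiragana: 0, katakana: 0, kanji: 0 };
  for (const ch of text) {
    if (/[\u3040-\u309F]/.test(ch)) counts.hiragana += 1;
    else if (/[\u30A0-\u30FF]/.test(ch)) counts.katakana += 1;
    else if (/[\u4E00-\u9FAF]/.test(ch)) counts.kanji += 1;
  }
  return counts;
}

// Hypothetical helper: compare extracted text with the expected text,
// ignoring all whitespace (the extractor may add or drop line breaks).
function matchesExpected(extracted, expected) {
  const normalize = s => s.replace(/\s+/g, '');
  return normalize(extracted) === normalize(expected);
}

// The text printed by the run above, including its two leading line breaks
const extracted = '\n\n日本語開始\n改行\n改行\n日本語終了';
console.log(countJapaneseChars(extracted));
console.log(matchesExpected(extracted, '日本語開始 改行 改行 日本語終了'));
```

Ignoring whitespace is a deliberate choice here: the output above starts with blank lines that are not in test.pdf, so an exact string comparison would fail even though every character was extracted.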