Advent Calendar 2024: Implementing a Rust crate that extracts text from academic papers

Extracting Text from Academic Papers in Rust #2 - pdftotext

Last updated at 2024-12-01. Posted at 2024-12-01.

Summary

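pdftotext reports the position of the text contained in a PDF, so the task looks more doable than expected.
The text extracted by pdftotext carries no attribute saying whether it is a title, body text, and so on, so a way to determine that is needed.
pdftotext also outputs text contained in figures and tables, which should be excluded.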

GitHub -> https://github.com/akitenkrad/rsrpp
crates.io -> https://crates.io/crates/rsrpp

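The story so far

Previous: Extracting Text from Academic Papers in Rust #1

This is part 2 of implementing rsrpp (Rust Research Paper Parser), a crate for parsing academic papers in Rust.
rsrpp is built on the Poppler tool suite, so the first step is to get as much out of those tools as possible.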

Installing Poppler

I develop on Ubuntu, so Poppler can be installed with the following command.

sudo apt install -y poppler-utils

The available commands are as follows¹.

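Command	Function
pdfdetach	pdfdetach lists or extracts embedded files (attachments) from a PDF.
pdffonts	pdffonts lists the fonts used in a PDF, along with various information about each font.
pdfimages	pdfimages extracts images from a PDF and saves them as PPM, PBM, PNG, TIFF, JPEG, JPEG2000, or JBIG2 files.
pdfinfo	pdfinfo prints the contents of the PDF's 'Info' dictionary (plus some other useful information).
pdfseparate	pdfseparate extracts single pages from a PDF.
pdftocairo	pdftocairo converts a Portable Document Format (PDF) file to one of the following output formats using the cairo output device of the poppler PDF library: PNG, JPEG, TIFF, PDF, PS, EPS, SVG, Windows Printer.
pdftohtml	pdftohtml is a program that converts PDF documents to HTML.
pdftoppm	pdftoppm converts a PDF to color image files in Portable Pixmap (PPM) format, grayscale image files in PGM format, or monochrome image files in PBM format.
pdftops	pdftops converts a PDF to PostScript so that it can be printed.
pdftotext	pdftotext converts a PDF to plain text.
pdfunite	pdfunite merges several PDF files, in the order given on the command line, into one PDF result file.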

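The one that stands out is, of course, pdftotext. If it can extract the body text of a paper on its own, there is no need to implement a crate in the first place.
So let's try it.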

pdftotext

The PDF of Attention Is All You Need² is used as the example and converted to text.

pdftotext -nopgbrk -htmlmeta -bbox-layout PDF.pdf
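The same command can also be driven from Rust. The following is only a minimal sketch, not the code used in rsrpp: `pdftotext_args` and `run_pdftotext` are hypothetical helper names, the sketch assumes `pdftotext` is on `PATH`, and it passes `-` as the output file so that the result is written to stdout instead of a file.

```rust
use std::io;
use std::process::Command;

/// Builds the argument list used in this article.
/// The trailing "-" asks pdftotext to write to stdout.
pub fn pdftotext_args(pdf_path: &str) -> Vec<String> {
    vec![
        "-nopgbrk".to_string(),     // no page-break characters between pages
        "-htmlmeta".to_string(),    // emit HTML with a metadata header
        "-bbox-layout".to_string(), // emit bounding boxes for block/line/word
        pdf_path.to_string(),
        "-".to_string(),
    ]
}

/// Hypothetical helper: runs pdftotext and returns what it prints to stdout.
/// Assumes the `pdftotext` binary from poppler-utils is installed.
pub fn run_pdftotext(pdf_path: &str) -> io::Result<String> {
    let output = Command::new("pdftotext")
        .args(pdftotext_args(pdf_path))
        .output()?;
    if !output.status.success() {
        // Surface pdftotext's own error message to the caller.
        let message = String::from_utf8_lossy(&output.stderr).into_owned();
        return Err(io::Error::new(io::ErrorKind::Other, message));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Calling `run_pdftotext("PDF.pdf")` would then return the HTML as a `String`, ready to be parsed.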

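The beginning of the conversion result is shown below.
Because -bbox-layout is specified, the output is not plain text but HTML that includes a bounding box for each word. -nopgbrk tells pdftotext not to output page breaks.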

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta name="Subject" content=""/>
<meta name="Keywords" content=""/>
<meta name="Author" content=""/>
<meta name="Creator" content="LaTeX with hyperref"/>
<meta name="Producer" content="pdfTeX-1.40.25"/>
<meta name="CreationDate" content="2024-04-10T21:11:43Z"/>
<meta name="ModDate" content="2024-04-10T21:11:43Z"/>
</head>
<body>
<doc>
  <page width="612.000000" height="792.000000">
    <flow>
      <block xMin="124.313000" yMin="73.857374" xMax="487.894542" yMax="112.440323">
        <line xMin="124.666000" yMin="73.857374" xMax="487.338947" yMax="84.545323">
          <word xMin="124.666000" yMin="73.857374" xMax="167.656899" yMax="84.545323">Provided</word>
          <word xMin="170.645699" yMin="73.857374" xMax="201.848771" yMax="84.545323">proper</word>
          <word xMin="204.837571" yMin="73.857374" xMax="254.415786" yMax="84.545323">attribution</word>
          <word xMin="257.404586" yMin="73.857374" xMax="265.378704" yMax="84.545323">is</word>
          <word xMin="268.367504" yMin="73.857374" xMax="313.677712" yMax="84.545323">provided,</word>
          <word xMin="316.666512" yMin="73.857374" xMax="351.862621" yMax="84.545323">Google</word>
          <word xMin="354.851421" yMin="73.857374" xMax="387.381520" yMax="84.545323">hereby</word>
          <word xMin="390.370320" yMin="73.857374" xMax="419.588829" yMax="84.545323">grants</word>
          <word xMin="422.577629" yMin="73.857374" xMax="475.049002" yMax="84.545323">permission</word>
          <word xMin="478.037802" yMin="73.857374" xMax="487.338947" yMax="84.545323">to</word>
        </line>
        <line xMin="124.313000" yMin="87.804374" xMax="487.894542" yMax="98.492323">
          <word xMin="124.313000" yMin="87.804374" xMax="172.109890" yMax="98.492323">reproduce</word>
          <word xMin="175.098690" yMin="87.804374" xMax="189.707944" yMax="98.492323">the</word>
          <word xMin="192.696744" yMin="87.804374" xMax="220.588226" yMax="98.492323">tables</word>
          <word xMin="223.577026" yMin="87.804374" xMax="240.840334" yMax="98.492323">and</word>
          <word xMin="243.829134" yMin="87.804374" xMax="276.371189" yMax="98.492323">figures</word>
          <word xMin="279.359989" yMin="87.804374" xMax="288.661134" yMax="98.492323">in</word>
          <word xMin="291.649934" yMin="87.804374" xMax="308.925198" yMax="98.492323">this</word>
          <word xMin="311.913998" yMin="87.804374" xMax="338.466498" yMax="98.492323">paper</word>
          <word xMin="341.455298" yMin="87.804374" xMax="370.016270" yMax="98.492323">solely</word>
          <word xMin="373.005070" yMin="87.804374" xMax="386.944834" yMax="98.492323">for</word>
          <word xMin="389.933634" yMin="87.804374" xMax="405.869915" yMax="98.492323">use</word>
          <word xMin="408.858715" yMin="87.804374" xMax="418.159861" yMax="98.492323">in</word>
          <word xMin="421.148661" yMin="87.804374" xMax="474.947061" yMax="98.492323">journalistic</word>
          <word xMin="477.935861" yMin="87.804374" xMax="487.894542" yMax="98.492323">or</word>
        </line>
        <line xMin="267.594000" yMin="101.752374" xMax="346.498320" yMax="112.440323">
          <word xMin="267.594000" yMin="101.752374" xMax="311.421763" yMax="112.440323">scholarly</word>
          <word xMin="314.410563" yMin="101.752374" xMax="346.498320" yMax="112.440323">works.</word>
        </line>
      </block>
    </flow>
    <flow>
      <block xMin="18.340000" yMin="213.920000" xMax="36.340000" yMax="555.000000">
        <line xMin="18.340000" yMin="213.920000" xMax="36.340000" yMax="555.000000">
          <word xMin="18.340000" yMin="388.900000" xMax="36.340000" yMax="555.000000">arXiv:1706.03762v7</word>
          <word xMin="18.340000" yMin="318.360000" xMax="36.340000" yMax="378.900000">[cs.CL]</word>
          <word xMin="18.340000" yMin="298.360000" xMax="36.340000" yMax="308.360000">2</word>
          <word xMin="18.340000" yMin="258.920000" xMax="36.340000" yMax="293.360000">Aug</word>
          <word xMin="18.340000" yMin="213.920000" xMax="36.340000" yMax="253.920000">2023</word>
        </line>
      </block>
    </flow>
    <flow>
      <block xMin="211.488000" yMin="150.164374" xMax="399.893338" yMax="165.641019">
        <line xMin="211.488000" yMin="150.164374" xMax="399.893338" yMax="165.641019">
          <word xMin="211.488000" yMin="150.164374" xMax="281.296447" yMax="165.641019">Attention</word>
          <word xMin="285.600297" yMin="150.164374" xMax="298.993878" yMax="165.641019">Is</word>
          <word xMin="303.297728" yMin="150.164374" xMax="325.299009" yMax="165.641019">All</word>
          <word xMin="329.602859" yMin="150.164374" xMax="358.300931" yMax="165.641019">You</word>
          <word xMin="362.604781" yMin="150.164374" xMax="399.893338" yMax="165.641019">Need</word>
        </line>
      </block>
    </flow>
    <flow>
      <block xMin="116.681000" yMin="233.849650" xMax="216.039006" yMax="266.724697">
        <line xMin="132.908000" yMin="233.849650" xMax="203.888765" yMax="244.777183">
          <word xMin="132.908000" yMin="235.820806" xMax="161.699914" yMax="244.777183">Ashish</word>
          <word xMin="164.190564" yMin="235.820806" xMax="199.806859" yMax="244.777183">Vaswani</word>
          <word xMin="199.807000" yMin="233.849650" xMax="203.888765" yMax="240.432917">∗</word>
        </line>
        <line xMin="139.379000" yMin="246.849357" xMax="193.336442" yMax="255.755922">
          <word xMin="139.379000" yMin="246.849357" xMax="168.708894" yMax="255.755922">Google</word>
          <word xMin="171.199544" yMin="246.849357" xMax="193.336442" yMax="255.755922">Brain</word>
        </line>
        <line xMin="116.681000" yMin="258.425851" xMax="216.039006" yMax="266.724697">
          <word xMin="116.681000" yMin="258.425851" xMax="216.039006" yMax="266.724697">avaswani@google.com</word>
        </line>
      </block>
      <block xMin="126.882000" yMin="283.846650" xMax="210.551900" yMax="316.722697">
        <line xMin="144.292000" yMin="283.846650" xMax="197.219765" yMax="294.774183">
          <word xMin="144.292000" yMin="285.817806" xMax="166.996765" yMax="294.774183">Llion</word>
          <word xMin="169.487415" yMin="285.817806" xMax="193.138628" yMax="294.774183">Jones</word>
          <word xMin="193.138000" yMin="283.846650" xMax="197.219765" yMax="290.429917">∗</word>
        </line>
        <line xMin="134.548000" yMin="296.846357" xMax="202.881473" yMax="305.752922">
          <word xMin="134.548000" yMin="296.846357" xMax="163.877894" yMax="305.752922">Google</word>
          <word xMin="166.368544" yMin="296.846357" xMax="202.881473" yMax="305.752922">Research</word>
        </line>
        <line xMin="126.882000" yMin="308.423851" xMax="210.551900" yMax="316.722697">
          <word xMin="126.882000" yMin="308.423851" xMax="210.551900" yMax="316.722697">llion@google.com</word>
        </line>

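Honestly, my reaction was that this works better than expected!
The output is nested as page > flow > block > line > word, so if we can determine which section of the paper each block or line belongs to, the goal looks achievable.

However, logic is needed to decide whether a block or line is plain body text, a title, a footnote, and so on.
In addition, strings that are not body text, such as those inside tables and some figures, get mixed in. Ideally this should be dealt with as well.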

ToDo
- Determine the attribute of each piece of text (body, title, footnote, etc.)
- Exclude text contained in figures and tables
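As a rough illustration of the second item, one possible approach is to drop every word whose bounding box lies inside a known table or figure region. This is only a sketch under assumptions: `Region` and `drop_words_in_regions` are hypothetical names, and how the table regions are detected in the first place is not covered here.

```rust
/// Hypothetical axis-aligned region, e.g. the area covered by a table.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Region {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

impl Region {
    /// True when `other` lies entirely inside `self`.
    pub fn contains(&self, other: &Region) -> bool {
        other.x >= self.x
            && other.y >= self.y
            && other.x + other.width <= self.x + self.width
            && other.y + other.height <= self.y + self.height
    }
}

/// Keeps only the words whose bounding box is outside every excluded region.
/// Each word is given as a (text, bounding box) pair.
pub fn drop_words_in_regions<'a>(
    words: &[(&'a str, Region)],
    excluded: &[Region],
) -> Vec<&'a str> {
    words
        .iter()
        .filter(|(_, bbox)| !excluded.iter().any(|region| region.contains(bbox)))
        .map(|(text, _)| *text)
        .collect()
}
```

Full containment is a deliberately strict test. Words that only partly overlap a region are kept, which may or may not be the desired behavior for real papers.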

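Implementing the data structures in Rust

Now, it would be odd to write "in Rust" without any Rust appearing, so let's start by defining the structs for the data structure that will probably form the core of this crate.
The page > flow > block > line > word structure is reused as-is so that the text of a PDF can be handled structurally in Rust. (The structs shown here skip the flow level: a Page holds its Blocks directly.)

Create the directories and files as shown below and put the implementation in mod.rs.
rsrpp > rsrpp > src > parser > mod.rs
(The full program is here → https://github.com/akitenkrad/rsrpp)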

rsrpp > rsrpp > src > parser > mod.rs

#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub text: String,
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

The basic approach is to have Word hold the text, while Page through Line each contain their child elements. Since position information is available, it is kept as well.
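To connect the pdftotext output to these fields, the sketch below turns one `<word ...>` line into a `Word`. It assumes that `x`/`y` correspond to `xMin`/`yMin` and that `width`/`height` are the differences to `xMax`/`yMax`. `attr` and `parse_word` are hypothetical helpers based on plain string search. They do not decode HTML entities, and a real implementation would more likely use an XML parser.

```rust
/// Same shape as the `Word` struct above, repeated so the sketch compiles on its own.
#[derive(Debug, Clone, PartialEq)]
pub struct Word {
    pub text: String,
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Reads a numeric attribute such as xMin="124.666000" from a tag string.
fn attr(tag: &str, name: &str) -> Option<f32> {
    let key = format!("{}=\"", name);
    let start = tag.find(&key)? + key.len();
    let end = start + tag[start..].find('"')?;
    tag[start..end].parse().ok()
}

/// Hypothetical helper: parses one `<word ...>text</word>` line.
/// Returns None for any line that is not a complete word element.
pub fn parse_word(line: &str) -> Option<Word> {
    let line = line.trim();
    let open_end = line.find('>')?;
    let close_start = line.rfind("</word>")?;
    if !line.starts_with("<word") || close_start <= open_end {
        return None;
    }
    let tag = &line[..open_end];
    let (x_min, y_min) = (attr(tag, "xMin")?, attr(tag, "yMin")?);
    let (x_max, y_max) = (attr(tag, "xMax")?, attr(tag, "yMax")?);
    Some(Word {
        text: line[open_end + 1..close_start].trim().to_string(),
        x: x_min,
        y: y_min,
        // Assumption: width/height are derived from the bounding box corners.
        width: x_max - x_min,
        height: y_max - y_min,
    })
}
```

For the first word of the paper title in the output above, this yields a `Word` with the text `Attention`, positioned at `xMin`/`yMin` and sized by the bounding box.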

rsrpp > rsrpp > src > parser > mod.rs

#[derive(Debug, Clone, PartialEq)]
pub struct Line {
    pub words: Vec<Word>,
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

impl Line {
    /// Creates an empty line with the given bounding box.
    pub fn new(x: f32, y: f32, width: f32, height: f32) -> Line {
        Line {
            words: Vec::new(),
            x,
            y,
            width,
            height,
        }
    }

    /// Appends a word; surrounding whitespace in `text` is trimmed.
    pub fn add_word(&mut self, text: String, x: f32, y: f32, width: f32, height: f32) {
        self.words.push(Word {
            text: text.trim().to_string(),
            x,
            y,
            width,
            height,
        });
    }

    /// Returns the words of this line joined by single spaces.
    pub fn get_text(&self) -> String {
        self.words
            .iter()
            .map(|word| word.text.as_str())
            .collect::<Vec<_>>()
            .join(" ")
    }
}


#[derive(Debug, Clone, PartialEq)]
pub struct Block {
    pub lines: Vec<Line>,
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
    pub section: String,
}

impl Block {
    /// Creates an empty block with the given bounding box and no section yet.
    pub fn new(x: f32, y: f32, width: f32, height: f32) -> Block {
        Block {
            lines: Vec::new(),
            x,
            y,
            width,
            height,
            section: String::new(),
        }
    }

    /// Appends an empty line with the given bounding box.
    pub fn add_line(&mut self, x: f32, y: f32, width: f32, height: f32) {
        self.lines.push(Line::new(x, y, width, height));
    }

    /// Returns the text of all lines, separated by spaces.
    pub fn get_text(&self) -> String {
        let mut text = String::new();
        for line in &self.lines {
            // If the previous line ended with a hyphen, drop the "- " so that
            // a word split across two lines is joined back together.
            text = text.trim_end_matches("- ").to_string();
            text.push_str(&line.get_text());
            text.push(' ');
        }
        text
    }
}

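For Page through Line it is convenient to be able to add child elements easily, so an add_... function is defined on each of them. They will probably be used later.

Also, in the end we will presumably want to do things like "get all the text contained in each Page at once", so the get_text functions are implemented now.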
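The line-joining behavior of `Block::get_text` above is easiest to see in isolation. The sketch below reproduces the same loop on plain strings. `join_lines` is a hypothetical standalone function used only for illustration.

```rust
/// Joins line texts the same way `Block::get_text` does:
/// lines are separated by a space, and a trailing "- " left by a
/// hyphenated line break is removed before the next line is appended.
pub fn join_lines(lines: &[&str]) -> String {
    let mut text = String::new();
    for line in lines {
        // Remove the hyphen and space that ended the previous line, if any.
        text = text.trim_end_matches("- ").to_string();
        text.push_str(line);
        text.push(' ');
    }
    text
}
```

With this logic, the lines `implemen-` and `tation of` become `implementation of ` (note the trailing space). The same rule also removes hyphens that belong to the word, so `self-` followed by `attention` becomes `selfattention `. That is a trade-off of this simple approach.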

Page has two additional fields, tables: Vec<Coordinate> and number_of_columns: i8. The former will come into play later. The latter is a field for handling PDFs that use a two-column layout.

rsrpp > rsrpp > src > parser > mod.rs

// `Coordinate` and `PageNumber` are defined elsewhere in the crate (not shown here).
#[derive(Debug, Clone, PartialEq)]
pub struct Page {
    pub blocks: Vec<Block>,
    pub width: f32,
    pub height: f32,
    pub tables: Vec<Coordinate>,
    pub page_nubmer: PageNumber,
    pub number_of_columns: i8,
}

impl Page {
    /// Creates an empty single-column page of the given size.
    pub fn new(width: f32, height: f32, page_number: PageNumber) -> Page {
        Page {
            blocks: Vec::new(),
            width,
            height,
            tables: Vec::new(),
            page_nubmer: page_number,
            number_of_columns: 1,
        }
    }

    /// Appends an empty block with the given bounding box.
    pub fn add_block(&mut self, x: f32, y: f32, width: f32, height: f32) {
        self.blocks.push(Block::new(x, y, width, height));
    }

    /// Returns the text of all blocks, separated by blank lines.
    pub fn get_text(&self) -> String {
        let mut text = String::new();
        for block in &self.blocks {
            text.push_str(&block.get_text());
            text.push_str("\n\n");
        }
        text
    }
}

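Next time

Using the data structures defined here, the ToDo items will be resolved one by one.
The next article tackles determining the attributes of the text.

Next: Extracting Text from Academic Papers in Rust #3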

¹ https://manpages.debian.org/experimental/poppler-utils/index.html
² Vaswani, A., et al. "Attention is all you need." Advances in Neural Information Processing Systems (2017).
