
Advent Calendar 2024: Implementing a Rust Crate to Extract Text from Academic Papers

Day 1

Extracting Text from Academic Papers with Rust #1 - Requirements

Posted at 2024-11-30

Summary

  • Over this series, we implement a library that extracts the body text of academic papers in a structured way
  • The input is a paper's PDF file; the output is JSON made up of each section's title and body text
  • Poppler is used to parse the PDF

GitHub -> https://github.com/akitenkrad/rsrpp
crates.io -> https://crates.io/crates/rsrpp

Goal of This Series

I want to extract as much information as possible, as text, from academic papers, that treasure trove of knowledge, so that it can be analyzed.

That is really all there is to it.

Until now I have gotten by using the Abstract as the paper's information, or using libraries that can extract text from PDFs such as unstructured and pdfplumber, but the sentences came out incomplete, the reading order got scrambled depending on the PDF layout, and it never quite scratched the itch, so I finally ran out of patience.
To make use of today's language models, high-quality language resources are essential!
I want to extract clean text from academic papers!
Extracting text that is not even grammatically well-formed is out of the question!
And speed! I want to process a lot of papers, so I want it to be as fast as possible!

So I decided to build it myself.
In Rust, because I want speed.
Something that extracts text of high enough quality that the results can be used as-is.

The crate will be named rsrpp (Rust Research Paper Parser).

In this series, I will trace the process of implementing the library bit by bit, including working out the requirements and the trial and error along the way.

Series Structure

What I Specifically Want to Do

First, let's spell out clearly what we want to achieve.

Input

Since most papers are published in PDF format, the crate [1] we are implementing takes a paper's PDF file as its input.
The target papers are mainly Computer Science papers, some of which are laid out in two columns. [2][3]

Output

The output will be JSON. Each section is represented as a pair of the section title and its body text. Some papers have an Appendix after the References, but the Appendix is out of scope.

[
    {
        "section": "Introduction",
        "text": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    },
    {
        "section": "Background",
        "text": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    },
    {
        "section": "Model Architecture",
        "text": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
    }
    ...
]
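On the Rust side, this output maps naturally onto a struct that can be serialized with serde. Here is a minimal sketch, assuming serde (with the derive feature) and serde_json are added as dependencies; the struct name Section is just for illustration and is not a type the crate actually defines yet.

use serde::{Deserialize, Serialize};

// One section of a parsed paper: the section title plus its body text.
// (Illustrative only; the real types in rsrpp may end up looking different.)
#[derive(Debug, Serialize, Deserialize)]
struct Section {
    section: String,
    text: String,
}

fn main() -> serde_json::Result<()> {
    let sections = vec![
        Section { section: "Introduction".to_string(), text: "XXXX".to_string() },
        Section { section: "Background".to_string(), text: "XXXX".to_string() },
    ];
    // The whole paper is serialized as a JSON array of sections.
    println!("{}", serde_json::to_string_pretty(&sections)?);
    Ok(())
}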

Parsing the PDF

PDF (Portable Document Format) does have a specification, apparently defined in ISO 32000 [4], but it is extremely complex, and once you factor in accommodating the quirks of each individual program that generates PDFs, I honestly felt that tackling PDF parsing completely from scratch was out of reach.
So for the PDF parsing itself, we rely on an existing tool called Poppler.
Standing on the shoulders of giants. Important.
If Poppler alone solved the problem, there would have been no need to develop a new library, but things rarely work out that neatly, so this time we will make full use of the PDF information parsed by Poppler to build a text extraction library specialized for academic papers.
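As a first taste of what "using Poppler" will look like from Rust, here is a minimal sketch that shells out to Poppler's pdftotext command. It assumes the Poppler command-line tools are installed and that a file called paper.pdf exists; the actual crate will do something more elaborate than this.

use std::process::Command;

fn main() -> std::io::Result<()> {
    // Run Poppler's `pdftotext` on a local PDF.
    // `-layout` tells pdftotext to preserve the physical page layout,
    // which matters for two-column papers.
    let status = Command::new("pdftotext")
        .args(["-layout", "paper.pdf", "paper.txt"])
        .status()?;
    assert!(status.success(), "pdftotext exited with an error");

    // The extracted plain text now lives in paper.txt.
    let text = std::fs::read_to_string("paper.txt")?;
    println!("extracted {} characters", text.chars().count());
    Ok(())
}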

What the Finished Output Should Look Like

I want the output to look like this (for Attention Is All You Need):

[
  {
    "index": 0,
    "title": "Abstract",
    "contents": [
      "Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. ",
      "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. ",
      "∗ Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research. ",
      "31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. ",
      "Abstract "
    ]
  },
  {
    "index": 1,
    "title": "Introduction",
    "contents": [
      "Introduction ",
      "Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. ",
      "Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h t , as a function of the previous hidden state h t−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. ",
      "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. ",
      "In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. "
    ]
  },
  {
    "index": 2,
    "title": "Background",
    "contents": [...]
    ...

Creating the Rust Project

To start implementing the Rust crate, we first create the project.
For how to install Rust, see here.

> cargo new rsrpp
rsrpp
├── Cargo.toml
└── src
    └── main.rs

Now, rsrpp itself is a library, but it would be good to also have a command-line tool eventually, so we use a workspace. If you are wondering what a workspace is, see here.

> cd rsrpp
> rm -r src
> vim Cargo.toml
# [workspace]
# resolver = "2"
# members = []
# 
# [workspace.package]
# version = "0.1.0"
# edition = "2021"
# 
# [workspace.dependencies]

Now that the project has been turned into a workspace, we add rsrpp back. While we are at it, we also add the future CLI. Since rsrpp will be used as a library, we create it with --lib.

> cargo new rsrpp --lib
> cargo new rsrpp-cli
rsrpp
├── Cargo.toml
├── rsrpp
│   ├── Cargo.toml
│   └── src
│       └── lib.rs
└── rsrpp-cli
    ├── Cargo.toml
    └── src
        └── main.rs
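For now, rsrpp-cli/src/main.rs can stay a placeholder. As a sketch, it might just take a PDF path from the command line; wiring it up to the rsrpp library comes later once the library exists, and the argument handling below is purely illustrative.

use std::env;
use std::path::PathBuf;

fn main() {
    // Placeholder CLI: take the path to a PDF as the first argument.
    // Eventually this will hand the path to the rsrpp library and print its JSON output.
    let path: PathBuf = env::args()
        .nth(1)
        .expect("usage: rsrpp-cli <paper.pdf>")
        .into();
    println!("would parse: {}", path.display());
}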

Next Time

First up, we will put Poppler's suite of tools to work.

Next: Extracting Text from Academic Papers with Rust #2

  1. In Rust, the equivalent of a library such as a Python package is called a crate.

  2. Attention Is All You Need (Vaswani et al., 2017)

  3. Cross-modal Information Flow in Multimodal Large Language Models (Zhang et al., 2024)

  4. https://ja.wikipedia.org/wiki/Portable_Document_Format
