Advent Calendar 2024: Implementing a Crate to Extract Text from Academic Papers in Rust

Day 12

Extracting Text from Academic Papers with Rust #12 - Implementing the CLI

Posted at 2024-12-12

Summary

  • clap is the recommended way to implement command-line tools in Rust
  • Made rsrpp usable from the CLI

GitHub -> https://github.com/akitenkrad/rsrpp
crates.io/rsrpp -> https://crates.io/crates/rsrpp
crates.io/rsrpp-cli -> https://crates.io/crates/rsrpp-cli

ToDo

  • Extract word-level text and position information from the paper with pdftotext (Word, Line, Block, Page)
  • Determine text attributes (body, title, footnotes, etc.)
    • Extract the areas that contain text
      • Handle two-column layouts
    • Identify section titles
  • Exclude text contained in figures and tables
    • Exclude tables
      • Convert the PDF to images
      • Locate tables with image processing
  • Implement the CLI

Today's Files

rsrpp
├── Cargo.toml
├── rsrpp
│   ├── Cargo.toml
│   └── src
│       ├── lib.rs
│       └── parser
│           ├── mod.rs
│           ├── structs.rs
│           └── tests.rs
└── rsrpp-cli
    ├── Cargo.toml
    └── src
        └── main.rs <-- today

The Story So Far

Previous: Extracting Text from Academic Papers with Rust #11

All of rsrpp's features are now implemented.
This time, we implement a CLI so that those features can be used easily from the command line.

clap Is Recommended for Implementing Command-Line Tools in Rust

For CLI implementation in Rust, the site Command Line Applications in Rust covers the topic in detail.

You can implement a CLI using only Rust's standard library, but it is fairly tedious.

Several crates exist for building CLIs, but as of now (December 2024) clap is the go-to choice for implementing a CLI in Rust.

With clap, you define and handle command-line arguments as a struct. It provides just about every feature a CLI needs.

The CLI implemented here only takes the path or URL of a paper PDF as input, so it is a very simple one as CLIs go.
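As a side note, the derive-style Args struct shown next relies on clap's derive feature. If you are building something similar yourself, adding the dependency should look roughly like this (assuming a recent clap 4.x):

> cargo add clap --features derive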

Defining the arguments with clap looks like this:

use clap::Parser;

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    // Path or URL of the paper PDF
    #[arg(short, long)]
    pdf: String,

    // Output file for the parse result (optional)
    #[arg(short, long)]
    out: Option<String>,
}

out is optional. If it is not specified, the parse result is written to a default file name.

Inside the main() function,

let args = Args::parse();

is all it takes to access the arguments. Very simple.
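The optional out can then be resolved to a concrete file name with a fallback. A minimal sketch of that step follows; the default name "output.json" is just a placeholder of mine, since the actual default is not shown in this article.

    // Placeholder default; the real rsrpp-cli default may differ.
    let outfile = args.out.clone().unwrap_or_else(|| "output.json".to_string());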

Running rsrpp

Sometimes I download a paper PDF and work with the local file, but (personally) I often work with the URL directly, so rsrpp was designed to accept either a file path or a URL.

The check is simple: if the argument value starts with http, it is treated as a URL.
Since rsrpp itself handles the path-or-URL decision, the CLI only verifies that a local file actually exists.

    let is_url = args.pdf.starts_with("http");
    if !is_url && !Path::new(args.pdf.as_str()).exists() {
        eprintln!("File not found: {}", args.pdf);
        std::process::exit(-1);
    }

Next, parse the PDF with rsrpp.

    let mut config = ParserConfig::new();
    let pages = parse(args.pdf.as_str(), &mut config).await.unwrap();
    let sections = Section::from_pages(&pages);

As implemented in the previous articles, we first parse the PDF to obtain pages: Vec<Page> and then convert them into Sections.
Finally, the Sections are serialized to JSON and written to the output file, and we are done.

    let json = serde_json::to_string_pretty(&sections).unwrap();
    std::fs::write(format!("{}", outfile), json).unwrap();
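For reference, here is a minimal sketch of what the whole main.rs might look like when the fragments above are put together. The exact use paths into rsrpp, the tokio runtime attribute, and the "output.json" default are my assumptions based on the module layout and code shown so far, not necessarily the published rsrpp-cli source.

use std::path::Path;

use clap::Parser;
use rsrpp::parser::parse; // assumed module path
use rsrpp::parser::structs::{ParserConfig, Section}; // assumed module path

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    #[arg(short, long)]
    pdf: String,

    #[arg(short, long)]
    out: Option<String>,
}

#[tokio::main] // assumes tokio is the async runtime, since parse() is async
async fn main() {
    let args = Args::parse();

    // Anything starting with "http" is treated as a URL; otherwise it must be a local file.
    let is_url = args.pdf.starts_with("http");
    if !is_url && !Path::new(args.pdf.as_str()).exists() {
        eprintln!("File not found: {}", args.pdf);
        std::process::exit(-1);
    }

    // Parse the PDF into pages, then group them into sections.
    let mut config = ParserConfig::new();
    let pages = parse(args.pdf.as_str(), &mut config).await.unwrap();
    let sections = Section::from_pages(&pages);

    // Serialize the sections to JSON and write them out.
    let outfile = args.out.unwrap_or_else(|| "output.json".to_string()); // placeholder default
    let json = serde_json::to_string_pretty(&sections).unwrap();
    std::fs::write(outfile, json).unwrap();
}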

Checking That the CLI Works

Usage is as follows.

1. Install the dependencies

rsrpp uses poppler for PDF conversion and opencv for analyzing table regions, so both must be installed beforehand.

> sudo apt update -y
> sudo apt install -y poppler-utils
> sudo apt install -y libopencv-dev clang libclang-dev

2. Install the CLI

> cargo install rsrpp-cli

3. Check that it works

> rsrpp --help
A Rust project for research paper pdf.

Usage: rsrpp [OPTIONS] --pdf <PDF>

Options:
  -p, --pdf <PDF>
  -o, --out <OUT>
  -h, --help       Print help
  -V, --version    Print version

As a test, let's parse "Attention Is All You Need", which has served us well many times during development.

> time rsrpp --pdf "https://arxiv.org/pdf/1706.03762" --out output.json

real	0m4.325s
user	0m2.146s
sys     0m0.295s

Processing takes about 4 seconds. Back when I parsed PDFs in Python it easily took several minutes, so this is a major speedup.

The output is shown below.

Attention Is All You Need
[
  {
    "index": 0,
    "title": "Abstract",
    "contents": [
      "Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. ",
      "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. ",
      "∗ Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research. † Work performed while at Google Brain. ‡ Work performed while at Google Research. ",
      "31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. ",
      "Abstract "
    ]
  },
  {
    "index": 1,
    "title": "Introduction",
    "contents": [
      "Introduction ",
      "Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. ",
      "Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h t , as a function of the previous hidden state h t−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. ",
      "Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. ",
      "In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. "
    ]
  },
  {
    "index": 2,
    "title": "Background",
    "contents": [
      "Background ",
      "The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2. ",
      "Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22]. ",
      "End-to-end memory networks are based on a recurrent attention mechanism instead of sequencealigned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34]. ",
      "To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequencealigned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9]. "
    ]
  },
  {
    "index": 3,
    "title": "Model Architecture",
    "contents": [
      "Model Architecture ",
      "Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x 1 , ..., x n ) to a sequence of continuous representations z = (z 1 , ..., z n ). Given z, the decoder then generates an output sequence (y 1 , ..., y m ) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next. ",
      "Figure 1: The Transformer - model architecture. ",
      "The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively. ",
      "Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, positionwise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d model = 512. ",
      "Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i. ",
      "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum ",
      "Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel. ",
      "of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. ",
      "We call our particular attention \"Scaled Dot-Product Attention\" (Figure 2). The input consists of queries and keys of dimension d k , and √ values of dimension d v . We compute the dot products of the query with all keys, divide each by d k , and apply a softmax function to obtain the weights on the values. ",
      "In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V . We compute the matrix of outputs as: ",
      "QK T Attention(Q, K, V ) = softmax( √ )V d k ",
      "The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of √ 1 d . Additive attention computes the compatibility function using a feed-forward network with k a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code. ",
      "While for small values of d k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d k [3]. We suspect that for large values of d k , the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients 4 . To counteract this effect, we scale the dot products by √ 1 d . ",
      "Instead of performing a single attention function with d model -dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d k , d k and d v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d v -dimensional ",
      "To illustrate why the dot products get large, assume that the components of q and k are independent random P k variables with mean 0 and variance 1. Then their dot product, q · k = d i=1 q i k i , has mean 0 and variance d k . ",
      "output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2. ",
      "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this. ",
      "MultiHead(Q, K, V ) = Concat(head 1 , ..., head h )W O ",
      "where head i = Attention(QW i Q , KW i K , V W i V ) ",
      "Where the projections are parameter matrices W i Q ∈ R d model ×d k , W i K ∈ R d model ×d k , W i V ∈ R d model ×d v and W O ∈ R hd v ×d model . ",
      "In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d k = d v = d model /h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality. ",
      "The Transformer uses multi-head attention in three different ways: ",
      "• In \"encoder-decoder attention\" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9]. ",
      "• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. ",
      "• Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2. ",
      "In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. ",
      "While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is d model = 512, and the inner-layer has dimensionality d f f = 2048. ",
      "Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension d model . We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax √ linear transformation, similar to [30]. In the embedding layers, we multiply those weights by d model . ",
      "Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. n is the sequence length, d is the representation dimension, k is the kernel size of convolutions and r the size of the neighborhood in restricted self-attention. ",
      "Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add \"positional encodings\" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension d model as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9]. ",
      "In this work, we use sine and cosine functions of different frequencies: ",
      "where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, P E pos+k can be represented as a linear function of P E pos . ",
      "We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training. "
    ]
  },
  {
    "index": 4,
    "title": "Why Self-Attention",
    "contents": [
      "Why Self-Attention ",
      "In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x 1 , ..., x n ) to another sequence of equal length (z 1 , ..., z n ), with x i , z i ∈ R d , such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata. ",
      "One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. ",
      "The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types. ",
      "As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence ",
      "length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations. To improve computational performance for tasks involving very long sequences, self-attention could be restricted to considering only a neighborhood of size r in the input sequence centered around the respective output position. This would increase the maximum path length to O(n/r). We plan to investigate this approach further in future work. ",
      "A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions. Doing so requires a stack of O(n/k) convolutional layers in the case of contiguous kernels, or O(log k (n)) in the case of dilated convolutions [18], increasing the length of the longest paths between any two positions in the network. Convolutional layers are generally more expensive than recurrent layers, by a factor of k. Separable convolutions [6], however, decrease the complexity considerably, to O(k · n · d + n · d 2 ). Even with k = n, however, the complexity of a separable convolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer, the approach we take in our model. ",
      "As side benefit, self-attention could yield more interpretable models. We inspect attention distributions from our models and present and discuss examples in the appendix. Not only do individual attention heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic and semantic structure of the sentences. "
    ]
  },
  {
    "index": 5,
    "title": "Training",
    "contents": [
      "Training ",
      "This section describes the training regime for our models. ",
      "We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared sourcetarget vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens. ",
      "We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models,(described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days). ",
      "We used the Adam optimizer [20] with β 1 = 0.9, β 2 = 0.98 and ϵ = 10 −9 . We varied the learning rate over the course of training, according to the formula: ",
      "−0.5 lrate = d −0.5 , step_num · warmup_steps −1.5 ) model · min(step_num ",
      "This corresponds to increasing the learning rate linearly for the first warmup_steps training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used warmup_steps = 4000. ",
      "We employ three types of regularization during training: ",
      "Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost. ",
      "Residual Dropout We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P drop = 0.1. ",
      "Label Smoothing During training, we employed label smoothing of value ϵ ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score. "
    ]
  },
  {
    "index": 6,
    "title": "Results",
    "contents": [
      "Results ",
      "On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models. ",
      "On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate P drop = 0.1, instead of 0.3. ",
      "For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty α = 0.6 [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38]. ",
      "Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU 5 . ",
      "To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the ",
      "We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively. ",
      "Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities. ",
      "development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3. ",
      "In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads. ",
      "In Table 3 rows (B), we observe that reducing the attention key size d k hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model. ",
      "To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37]. ",
      "We trained a 4-layer transformer with d model = 1024 on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting. ",
      "We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model. During inference, we ",
      "Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ) ",
      "increased the maximum output length to input length + 300. We used a beam size of 21 and α = 0.3 for both WSJ only and the semi-supervised setting. ",
      "Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8]. ",
      "In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the BerkeleyParser [29] even when training only on the WSJ training set of 40K sentences. "
    ]
  },
  {
    "index": 7,
    "title": "Conclusion",
    "contents": [
      "Conclusion ",
      "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention. ",
      "For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles. ",
      "We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours. ",
      "The code we used to train and evaluate our models is available at https://github.com/ tensorflow/tensor2tensor. ",
      "Acknowledgements We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration. "
    ]
  },
  {
    "index": 8,
    "title": "References",
    "contents": [
      "References ",
      "[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016. ",
      "[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. ",
      "[3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V. Le. Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906, 2017. ",
      "[4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016. ",
      "[5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014. ",
      "[6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016. ",
      "[7] Junyoung Chung, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014. ",
      "[8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016. ",
      "[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017. ",
      "[10] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013. ",
      "[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. ",
      "[12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001. ",
      "[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. ",
      "[14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009. ",
      "[15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016. ",
      "[16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016. ",
      "[17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016. ",
      "[18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017. ",
      "[19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017. ",
      "[20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. ",
      "[21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017. ",
      "[22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017. ",
      "[23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015. ",
      "[24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attentionbased neural machine translation. arXiv preprint arXiv:1508.04025, 2015. ",
      "[25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993. ",
      "[26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159. ACL, June 2006. ",
      "[27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In Empirical Methods in Natural Language Processing, 2016. ",
      "[28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304, 2017. ",
      "[29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440. ACL, July 2006. ",
      "[30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016. ",
      "[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015. ",
      "[32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017. ",
      "[33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. ",
      "[34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015. ",
      "[35] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014. ",
      "[36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015. ",
      "[37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, 2015. ",
      "[38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. ",
      "[39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. CoRR, abs/1606.04199, 2016. ",
      "[40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers), pages 434–443. ACL, August 2013. ",
      "Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color. ",
      "Input-Input Layer5 ",
      "Input-Input Layer5 ",
      "Figure 4: Two attention heads, also in layer 5 of 6, apparently involved in anaphora resolution. Top: Full attentions for head 5. Bottom: Isolated attentions from just the word ‘its’ for attention heads 5 and 6. Note that the attentions are very sharp for this word. ",
      "Input-Input Layer5 ",
      "Input-Input Layer5 ",
      "Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence. We give two such examples above, from two different heads from the encoder self-attention at layer 5 of 6. The heads clearly learned to perform different tasks. "
    ]
  }
]

Hmm. There is a bit of extraneous text, and some fragments that look like pieces of equations are mixed in, but as text to feed to an LLM it should be workable.

Since I relied on "Attention Is All You Need" the whole time during development, let's check another PDF just in case the implementation has overfit to it.
I'll try a recently published paper with a two-column layout.

Truth or Mirage? Towards End-To-End Factuality Evaluation with LLM-O ASIS
[
  {
    "index": 0,
    "title": "Abstract",
    "contents": [
      "Truth or Mirage? Towards End-To-End Factuality Evaluation with LLM-O ASIS",
      "Alessandro Scirè * 1,2 Andrei Stefan Bejgu ∗1,2 Simone Tedeschi 1,2 2 Karim Ghonim Federico Martelli 2 Roberto Navigli 2",
      "Abstract",
      "After the introduction of Large Language Mod els (LLMs), there have been substantial im provements in the performance of Natural Lan guage Generation (NLG) tasks, including Text Summarization and Machine Translation. How ever, LLMs still produce outputs containing hallucinations, that is, content not grounded in factual information. Therefore, developing methods to assess the factuality of LLMs has become urgent. Indeed, resources for factuality evaluation have recently emerged. Although challenging, these resources face one or more of the following limitations: i) they are tailored to a specific task or domain; ii) they are limited in size, thereby preventing the training of new factuality evaluators, iii) they are designed for simpler verification tasks, such as claim verifi cation. To address these issues, we introduce LLM-O ASIS , to the best of our knowledge the largest resource for training end-to-end factu ality evaluators. LLM-O ASIS is constructed by extracting claims from Wikipedia, falsify ing a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators to both validate the quality of our dataset and to create a gold stan dard test set for benchmarking factuality eval uation systems. Our experiments demonstrate that LLM-O ASIS presents a significant chal lenge for state-of-the-art LLMs, with GPT-4o achieving up to 60% accuracy in our proposed end-to-end factuality evaluation task, highlight ing its potential to drive future research in the field."
    ]
  },
  {
    "index": 1,
    "title": "Introduction",
    "contents": [
      "Introduction",
      "In recent years, generative approaches in NLP have demonstrated remarkable results, achieving state-of-the-art performance across various tasks. This progress has been particularly notable with the advent of Large Language Models (LLMs),",
      "Sapienza University of Rome {first.lastname}@uniroma1.it",
      "which have revolutionized the field, driving ad vancements in many tasks, including Text Summa rization (Goyal et al. , 2022; Pu et al. , 2023; Zhang et al. , 2023b), Machine Translation (Alves et al. , 2024; Zhang et al. , 2023a; Wang et al. , 2023a), and Question Answering (Kamalloo et al. , 2023; Ra sool et al. , 2024). However, a critical challenge remains as LLMs’ outputs still contain hallucina tions, i.e. content that cannot be grounded in any pre-existing knowledge (Tonmoy et al. , 2024; Tam et al. , 2022). Compounding the problem, LLMs generate highly-fluent texts (Wang et al. , 2023b), which may mislead users into trusting their factual accuracy. Therefore, developing modeling strate gies to mitigate this issue and creating tools to de tect and correct hallucinations has become urgent. In this work, we focus on the problem of factu ality evaluation, that is, the task of checking the factual accuracy of a text. Previous research has proposed various resources to address this task. Al though challenging, even for LLM-based factual reasoners, these resources are designed for specific settings, such as text summarization of news (La ban et al. , 2021; Tang et al. , 2023), books (Scirè et al. , 2024), and dialogues (Tang et al. , 2024), among others. These benchmarks, while repre sentative in their respective domains and tasks, often present peculiarities, which may lead to a lack of generalizability across different settings. A more general resource, pairing claims with ev idence from Wikipedia is FEVER (Thorne et al. , 2018); however, its applicability is limited by its fo cus on claim verification, which involves assessing the veracity of individual facts. This formulation is not well-suited to real-world scenarios, where texts typically contain multiple facts, thereby preventing the development of end-to-end factuality evalua tion systems. These limitations highlight the need for a resource that is, not restricted to a specific domain or task, offering broader applicability and enabling the design of complete factuality evaluation approaches. In this context, we introduce LLM-O ASIS , a large-scale resource for end-to-end factuality eval uation, created by extracting and falsifying infor mation from Wikipedia pages. The overall pro cess is depicted in Fig. 1. As a result, we obtain 81k ⟨factual, unfactual⟩ pairs that are suitable to train end-to-end factuality evaluation systems. Additionally, we setup a human annotation pro cess to: i) create a gold standard for the fac tuality evaluation task, useful for benchmarking LLMs, and ii) validate the quality of the proposed data creation pipeline. Additionally we issue two tasks, namely end-to-end factuality evaluation and evidence-based claim verification to benchmark LLMs. Our experiments demonstrate that our re source is challenging even for state-of-the-art mod els, both in zero-shot and Retrieval Augmented Generation (Lewis et al. , 2021, RAG) settings, with GPT-4o achieving an accuracy of 60% and 68%, respectively. In summary, our contributions are the following:",
      "• We introduce LLM-O ASIS , to the best of our knowledge the largest resource for end-to-end factuality evaluation, obtained by falsifying claims extracted from Wikipedia;",
      "• Our resource enables two tasks to challenge current LLMs to detect factual inconsistencies in both short and long texts;",
      "• We propose a gold standard benchmark, re sulting from a human annotation process, to evaluate models on the proposed tasks;",
      "• Our experiments demonstrate that our bench mark presents a significant challenge for LLMs, with smaller specialized models trained on LLM-O ASIS achieving competi tive performance.",
      "Although we selected Wikipedia as the basis for our resource, we emphasize that our methodology can be potentially adapted to any other corpus in any domain or language, as the only requirement is access to a collection of raw texts. In the hope of fostering research in factuality evaluation, we release our resource and code at https://github. com/Babelscape/LLM-Oasis."
    ]
  },
  {
    "index": 2,
    "title": "Related Work",
    "contents": [
      "Related Work",
      "Previous studies for factuality evaluation have fo cused on assessing factual consistency, i.e. , the",
      "extent to which a generated text is grounded in a source document. Resources for this task typically include human annotations that indicate whether a generated text accurately reflects the original doc ument’s facts. However, many of these works are tailored for specific tasks and domains, such as the assessment of factual consistency in summaries of news (Fabbri et al. , 2021; Tang et al. , 2023; Pagnoni et al. , 2021), books (Scirè et al. , 2024), and dialogues (Tang et al. , 2024). Moreover, they are based on the assumption that the source of knowl edge required for the verification is always avail able (e.g. , the source document). This is not the case for the more general factuality evaluation task, in which a text in natural language must be verified regardless of the availability of the evidence, poten tially requiring information retrieval techniques. The first contribution towards general-purpose factuality evaluation dates back to FEVER (Thorne et al. , 2018, Fact Extraction and VERification), which pairs claims with evidence retrieved from Wikipedia. The FEVER dataset comprises 185,445 human-generated claims, created by modifying sen tences extracted from Wikipedia and subsequently verified without knowledge of the original sen tences. The claims are classified as Supported, Refuted, or NotEnoughInfo, and for the first two categories, annotators also recorded the sentence(s) forming the necessary evidence for their judgment. Although challenging, FEVER presents limitations due to its focus on fact verification, which in volves checking the veracity of individual claims. This focus is hardly adaptable to real-world scenar ios, where texts to verify usually feature multiple claims. Additionally, FEVER’s annotation effort is limited to a relatively-small subset of 10k English Wikipedia pages. More recently, multiple studies introduced strate gies to generate synthetic instances for factuality evaluation. Notably, Muhlgay et al. (2024) intro duced FACTOR, a framework to generate factual ity benchmarks by prompting an LLM to produce factual and unfactual completions given a prefix text. FACTOR includes 4,266 instances of ⟨prefix, completion⟩ pairs, each accompanied by a factu ality label. Along the same lines, by introducing FELM, Chen et al. (2023) provide 847 LLM out puts focused on different types of knowledge, such as World Knowledge, Math, and Reasoning with human-made factuality annotations. While valu able for benchmarking LLMs, the limited size of these resources prevents them from being used to",
      "Physical agents such as extreme temperatures and ultraviolet or solar radiation can be damaging to the skin over prolonged exposure. Biological agents such as parasites, microorganisms, plants and animals can have varied effects when exposed to the skin.",
      "Any form of PPE that acts as a barrier between the skin and the agent of exposure can be considered skin protection. Because much work is done with the hands, gloves are an essential item in providing skin protection.",
      "pages covering a broad set of domains. Finally, we reserve a manually-curated subset for this task, consisting of approximately 2k instances, and use it to benchmark several state-of-the-art models.",
      "In this section, we outline the steps required to generate LLM-O ASIS . We start by selecting Wikipedia as our source of factual data due to its coverage of a wide range of topics and its frequent",
      "Unfactual Text Generation, Sec. 3.3). In the remainder of this section, for the sake of clarity, we describe the above-mentioned steps individually, but we anticipate that the step-specific outputs are obtained by means of a general, unified prompt containing the instructions for all the steps. The overall prompt is provided in Table 1.",
      "We select the 80k most visited pages in 2023. In creating our resource, we set K = 5 and s = 1. 3 We used the GPT-4 API. More details in Appendix D.",
      "Table 1: Prompt for the generation of data in LLM-O ASIS .",
      "The first step in creating LLM-O ASIS involves extracting claims from an input passage t. We randomly sample one passage from each Wikipedia page and then extract a list of claims from each of the passages (cf. Step 1 in Fig. 1).",
      "We frame the claim extraction task as an end-to end autoregressive generation problem. Let M be our generative model. Given an input passage t, we task M to extract the claims using the prompt",
      "where (c 1 , . . . , c n ) represents the sequence of the generated claims. With the prompt P 1 (t), we aim at obtaining atomic 4 and self-contained claims, i.e. elementary units of information that do not require additional context to be verified. Specifically, we",
      "Liu et al. (2023) defines a claim as an Atomic Content Unit (ACU), that is, an elementary unit of information found in a text that does not require further subdivision for the purpose of reducing ambiguity.",
      "explicitly require the model to adhere to such for mal definition, and, additionally, constrain it to gen erate short texts and avoid the usage of pronouns as subjects. For instance, given the input passage:",
      "“The Amazon Rainforest, also known as Amazonia, is a moist broadleaf forest in the Amazon biome that covers most of the Amazon basin of South America. This re gion includes territory belonging to nine nations, with Brazil containing 60% of the rainforest. ”",
      "the model M returns the following list of claims:",
      "1. The Amazon Rainforest is also known as Ama zonia.",
      "2. It is a moist broadleaf forest in the Amazon biome.",
      "3. The Amazon Rainforest covers most of the Amazon basin of South America.",
      "4. The region includes territory belonging to nine nations.",
      "5. Brazil contains 60% of the rainforest.",
      "Further examples of extracted claims can be found in Appendix A.",
      "With the aim of producing an unfactual version of the original text, we introduce a critical factual error into one of the extracted claims. Formally, given the set of claims C = (c 1 , . . . , c n ) we task the model to falsify one of the claims as follows: M(P 2 (C)) = (c i , c i ) (2)",
      "where P 2 (C) is the prompt comprising the instruc tions for claim falsification, c i the falsified claim and c i the corresponding factual one. We ask the model to provide the factual claim as well, thus enabling the investigation of the model’s behavior. As outlined in Table 1 (Step 2), we instruct the model to falsify only one of the extracted claims by introducing a critical yet subtle error, which makes it potentially challenging to detect. Moreover, in spired by findings from previous works about the manual creation of Natural Language Inference (NLI) resources (Parrish et al. , 2021; Hu et al. , 2020), we designed the prompt with instructions to discourage the generation of naive contradicting instances, e.g. , trivial negations of verbs. Continu ing the example introduced in the previous section, given the extracted set of claims:",
      "1. The Amazon Rainforest is also known as Ama zonia.",
      "2. The Amazon Rainforest is a moist broadleaf forest in the Amazon biome.",
      "3. The Amazon Rainforest covers most of the Amazon basin of South America.",
      "4. The region includes territory belonging to nine nations.",
      "5. Brazil contains 60% of the rainforest. (c i )",
      "the model M produces the following falsified claim: The majority of the forest is contained (c i ) within Peru. Further examples of ⟨factual, unfactual⟩ pairs of claims can be found in Appendix A.",
      "Factual and Unfactual Text Generation",
      "Based on the claims extracted in the previous steps (cf. Sections 3.1 and 3.2), we now aim at generating pairs of ⟨factual, unfactual⟩ texts, which populate our resource for factuality evaluation, thus enabling the training and the benchmarking of factual rea soners.",
      "Factual text generation To make the factuality evaluation task more challenging, instead of using the original passages from Wikipedia as our factual texts, we leverage paraphrase generation. This ap proach produces texts that convey the same mean ing as the original ones but with different surface forms, thereby making the verification task diffi cult for LLMs in both zero-shot settings – as the original texts could have been seen during pretrain ing – and RAG settings, which might retrieve the exact passages from Wikipedia. Formally, given the set of extracted claims C, we task the model to generate a factual text F grounded on such claims:",
      "where P 3 (C) is the prompt with the instructions for obtaining a factual text through paraphrasing. As described in Table 1 (Step 3) we explicitly require M to follow the sequence of extracted claims to encourage a full coverage of the facts expressed in the original text. For instance, given the following claims:",
      "1. The Amazon Rainforest is also known as Ama zonia.",
      "2. The Amazon Rainforest is a moist broadleaf forest in the Amazon biome.",
      "3. The Amazon Rainforest covers most of the Amazon basin of South America.",
      "4. Brazil contains 60% of the rainforest.",
      "the model M generates the following factual text:",
      "“Amazonia, widely known as the Amazon Rainforest, is a damp broadleaf forest located within the Amazon biome, cover ing a significant portion of the Amazon basin in South America. This vast region spans across nine countries, with Brazil housing 60% of the rainforest. ”",
      "See Appendix A for more examples of generated factual texts.",
      "Unfactual text generation Finally, the unfac tual texts are generated through an analogous pro cess, this time grounded on the set of claims that includes the unfactual one, namely, C = (c 1 , . . . , c i , . . . , c n ). We obtain the unfactual text U with the generation process defined with the fol lowing:",
      "where P 4 (C, F) is the prompt containing the guide lines for unfactual text generation. In particular, as specified in Table 1 (Step 4), we instruct M to pro duce a text identical to F except for the segment containing the factual error to ensure that the only confounding factor for the verification task is the unfactual portion of the text. This approach helps",
      "81,275 81,275 99.7 681,201 8.381 8.6",
      "Table 2: Summary statistics for the creation of LLM O ASIS .",
      "the model M generates the following unfactual text:",
      "“Amazonia, widely known as the Amazon Rainforest, is a damp broadleaf forest located within the Amazon biome, cover ing a significant portion of the Amazon basin in South America. This vast region spans across nine countries, and the majority of the forest is contained within Peru.",
      "Claim Extraction # Pages # Passages Avg. Tokens per Passage # Claims Avg. Claims per Passage Avg. Tokens per Claim",
      "Additional examples of unfactual texts can be found in Appendix A. Finally, statistics about claim extraction, claim falsification, factual and unfactual text generation process can be found in Table 2.",
      "As a result of the steps described in Section 3, we obtained a large resource consisting of claims and texts (both factual and unfactual) that can be used to train end-to-end factuality evaluation systems. However, due to the automated nature of the pro posed approach, it is crucial to both evaluate the quality of the produced data – by accurately eval uating the individual steps of our pipeline – and introduce a gold-standard benchmark for the task.",
      "To assess the quality of our dataset and enable a rigorous evaluation of our procedure, we asked",
      "Fleiss’ κ 0.81 0.84 0.73 0.72",
      "Table 3: Performance of the chosen LLM M in the data generation process according to human evaluation (Accuracy), and the corresponding inter-annotator agree ment (Fleiss’ κ).",
      "Task Claim Extraction Claim Falsification Factual Text Gen. Unfactual Text Gen.",
      "Accuracy (%) 96.78 98.55 90.36 89.20",
      "M = 5 expert linguists to validate a portion of N = 1, 750 instances for each task in our pipeline (cf. Sec. 3 and Fig. 1). Each annotator curated (N/M ) + K instances for each task with each of the M subsets having an overlap of K = 100 in stances shared among all annotators. We paid the annotators according to the standard salaries for their geographical location and provided them with task-specific guidelines, annotation examples, and a simple interface for each task. More details are provided in Appendix E.",
      "Claim Extraction For the claim extraction task, annotators received Wikipedia passages (t 1 , . . . , t N ), each accompanied by a list of claims extracted by the model M as described in Section 3.1. The annotators’ task was to verify whether each claim was appropriately represented in the corresponding passage (i.e. with the same seman tics) and assess their atomicity. 5 We evaluated the LLM’s performance on this task by counting the human-annotated errors, yield ing an accuracy of 96.78%. Additionally, we measured inter-annotator agreement, resulting in a Fleiss’ κ score of 0.81. These results underscore both the high quality of the generated ⟨text, claims⟩ pairs and the strong agreement among the annota tors. Among the few errors produced by the LLM, we observe some occasional incorrect claims in the context of conditional clauses, where the model interprets conditional or hypothetical statements as if they were factual claims. For instance, given the text: In contrast, if interest rates were the main motive for international investment, FDI would in clude many industries within fewer countries. [. .. ],",
      "We chose to prioritize a precision-oriented evaluation for two key reasons: First, low coverage does not affect our proposed claim verification task (see Task 2, Section 4.2); and second, evaluating coverage would have required annotators to read the entire passage, making the annotation process more time-consuming and costly.",
      "the following incorrect claims were extracted: In terest rates motivate international investment and Interest rates lead to FDI in multiple industries, thus misrepresenting the original text which, in stead, indicates a hypothetical scenario.",
      "Claim Falsification For this task, annotators re ceived pairs of claims ⟨c i , c i ⟩ with c i being one of the original claims selected from (c 1 , . . . , c n ) and c i the corresponding falsified claim produced by the model M. The annotators’ task was to verify whether each claim was appropriately falsified (i.e. with contradicting semantics). This required them to determine if c i meaningfully diverged from c i in terms of content and truthfulness, effectively capturing the model’s ability to produce altered, incorrect versions of the original claims. Again, the model achieved a very high accuracy (98.55%). We measured a Fleiss’ κ score of 0.84, showing up almost perfect agreement between the annotators. In this case, one of the most frequent error cate gory concerns instances where attempts at falsifi cation manifest through minimal lexical variation, specifically by altering a single word. In these cases, such minor substitutions do not always yield a valid falsification. For example, consider the following claims: Michael Ausiello authored the exclusive piece and Michael Ausiello wrote the ex clusive piece. As we can see, despite the substitu tion of the verb, the semantic congruence between the two claims is maintained, rendering the falsifi cation attempt ineffective. An additional instance of this type is represented by the claims: Washing ton, D.C. has milder winter weather than New York and Washington, D.C. has warmer winter weather than New York.",
      "Factual and Unfactual Text Generation For these two tasks, we used a common format. Anno tators received lists of original (or falsified) claims C (or C) and the associated factual (or unfactual) texts produced by the model M. The annotators’ task was to verify whether each claim was correctly represented in the generated text. In the context of factual text generation, we additionally check whether the texts feature the same semantics as the claims but using a different wording. For the fac tual text generation step, we measured an accuracy of 90.36% and a Fleiss’ κ score of 0.73. Similarly, for the unfactual text generation, we measured an accuracy of 89.2% and a Fleiss’ κ score of 0.72. As for factual text generation, we observe that the model adds information not present in the factual text 6 , which results in the generation of in correct claims. For instance, the claim Russian President Yeltsin formed the Russian Armed Forces in May 1992 is not included in its entirety in the factual text:",
      "Originally, the Armed Forces of the Rus sian Socialist Federative Soviet Repub lic, also acknowledged as the Red Army, served both the Russian SFSR and Soviet Union. The roots of these Armed Forces can be traced back to the Russian Civil War of 1917-1923, and they persisted un til the USSR collapsed in 1991. In 1992, Boris Yeltsin, the then Russian Presi dent, initiated the formation of the Rus sian Armed Forces, integrating a signif icant part of the Soviet Armed Forces. Concurrently, other divisions of the for mer USSR’s military disbanded around 1992-1993, establishing national forces. The Soviet Armed Forces were comprised of Ground Forces, Air Forces, Navy, the OGPU, and convoy guardians. Later, the OGPU was incorporated into the NKVD in 1934, and their Internal Troops came under the joint management of De fense and Interior Commissariats. Post World War II, the forces expanded to in clude Strategic Rocket Forces, Air De fence Forces, and Civil Defence Forces.",
      "As far as the unfactual text generation task is con cerned, we occasionally encounter clauses which are not included in the unfactual text. For example, given the text:",
      "The final segment presents imagery of PSA Flight 182’s wreckage, alongside vivid aftermath photos and air traffic control recordings. It depicts a horrific scene filled with dismembered remains and houses in ruins. Gröss comments on the persistent scent of flowers and avia tion fuel that envelops the neighborhood. He emphasizes a particularly harrowing discovery of a partially intact body as the most shocking testament to mortal ity. Moving on, Gröss probes into the potential impact of otherworldly entities on death itself. He converses with Joseph",
      "We highlight that the added information is often present in the original text.",
      "Binder, an architect grieving over his wife and son’s untimely demise. Binder maintains that his family’s apparitions occupy his residence, attempting to forge a connection with him.",
      "We find that the following claim is not included: A torso and right hand represent the worst death aspect according to Gröss. 7 Similarly to the errors observed in factual text generation, we attribute this type of mistake to the propensity of the LLM to integrate or modify information based on the knowledge acquired during its pre-training phase. Overall, the reported detailed evaluations sum marized in Table 3 show the efficacy and robustness of the proposed methodology for producing train ing data for the task.",
      "In this section, we describe the construction of our benchmark, along with the factuality-oriented tasks we propose. Specifically, we exploit the hu man annotations (cf. Section 4.1) to construct a gold-standard benchmark for model evaluation. To ensure the high quality of our data, we only retain the instances that were not marked as error by any of our annotators in any annotation stage (cf. Sec. 4.1). We employ this data to propose the following two evaluation tasks, which we describe as follows.",
      "Task 1: End-to-End Factuality Evaluation The first task is to determine whether a given text con tains any factual inaccuracies. Formally, given an input passage t, the model must output a binary label y ∈ {True, False}, where True indicates that the text is factually accurate and False indicates the presence of factual inaccuracies. For this setting, we rely only on factual and un factual texts as input passages, and discard the orig inal texts, as the latter might have already been seen during the pre-training of LLMs. Specifically, to further ensure the high quality of our benchmark, we only retain the correct paraphrases that are gen erated from a valid set of claims (cf. Factual and Unfactual Text Generation and Claim Extraction in Sec. 4.1). Concerning the valid unfactual texts, in stead, we only keep the ones that are: i) generated, again, from a set of valid claims, and, ii) properly falsified and paraphrased (cf. Claim Falsification and Factual and Unfactual Text Generation in Sec.",
      "We highlight that the added information is often present in the original text.",
      "4.1). We then labeled all the resulting factual and unfactual texts with True, and False, respectively. In this setting, we aim at evaluating models on discerning true from fake texts (i.e. , \"Truth\" from \"Mirage\"). This formulation enables the assess ment of both plain LLMs and more complex RAG models. We deem this task to be particularly chal lenging as the falsification may involve even a sin gle word occurring in one of the many claims fea tured in a text, in the spirit of recent works high lighting how LLMs struggle to deal with subtle nu ances in a large input text (Kamradt, 2023; Hsieh et al. , 2024; Laban et al. , 2024; Wang et al. , 2024)",
      "Task 2: Evidence-based Claim Verification In this setting, the task is to classify individual claims as factual or unfactual using a given evidence. This approach assumes that claims are already extracted from the text, simplifying the task by focusing on isolated statements rather than the entire text to verify. Formally, given an input claim c and a corresponding evidence passage e, the model must output a binary label y ∈ {True, False}, where True indicates that the claim is supported by the evidence and False indicates that the claim is not supported by the evidence. For this setting, we focus on the extracted claims and their corresponding unfactual version, and use the factual text as evidence. We discard both the original and unfactual texts as the former might have already been seen during the pre-training of LLMs, while the latter contradicts real-world knowledge and, therefore, the internal knowledge of LLMs, possibly leading to unfair evaluations. Additionally, to guarantee the high precision of our data, we focus on the claims that are both atomic and reflecting the same semantics of the original text (cf. Claim Extraction Sec. 4.1). Then, we only keep the ones that have been appropriately falsified (cf. Claim Falsification in Sec. 4.1), along with their unfactual counterparts. Finally, we ap ply the same quality checks described in Task 1 to retain only the valid factual texts. At this stage, we classify the ⟨c i , F⟩ pairs with the label True, while we label ⟨c i , F⟩ as False, with c i and c i being the original claim and its falsified version, respectively."
    ]
  },
  {
    "index": 3,
    "title": "End-to-end Factuality Evaluation with",
    "contents": [
      "End-to-end Factuality Evaluation with LLM-O ASIS",
      "In this section, we showcase how LLM-O ASIS can be leveraged to build an end-to-end factuality evaluation system. In the spirit of Min et al. (2023), we decompose the task of evaluating the factuality of a given text into three simpler tasks, namely, Claim Extraction, Evidence Retrieval and Claim Verifi cation. The process begins with extracting a set of atomic facts (cf. Claim Extraction, Sec. 5.1) from the text to be verified. These extracted claims are then used to retrieve relevant evidence from a reliable knowledge base (cf. Evidence Retrieval, Sec. 5.2). After this, the factual accuracy of each claim is evaluated by comparing it against the re trieved evidence (cf. Claim Verification, Sec. 5.3). Finally, the results of these individual evaluations are aggregated to determine the overall factuality of the entire text.",
      "Our approach starts by extracting atomic claims from a given input text t. With the aim of training a claim extractor, we leverage LLM-O ASIS to create a dataset of ⟨t, C⟩ tuples, where t is an original text from Wikipedia and C = (c 1 , . .. , c n ) the cor responding automatically-extracted claims by our chosen LLM M (Section 3.1). We then fine-tune a smaller sequence-to-sequence model G on this data, thus distilling the claim extraction capabilities of M. We frame the training process as a text genera tion task; more formally, we fine-tune G to generate the claims autoregressively:",
      "where y is the sequence obtained by concatenating the claims in C and y k is a token in this sequence.",
      "At this stage, given the claims extracted by G, we require a system capable of retrieving relevant pas sages from a knowledge corpus to serve as evidence to verify those claims. Again, we leverage LLM O ASIS to create a training dataset for our retriever; in particular, given each generic claim c j ∈ C ex tracted from the original text t, we construct the following training pairs:",
      "where U and F are the generated factual and unfac tual texts (cf. Section 3.3). We then augment this set by pairing the factual and unfactual texts with the falsified claim c i (cf.",
      "Section 3.2), thus obtaining the following addi tional training instances:",
      "In this way, we include all possible pairs of ⟨claim, passage⟩ in LLM-O ASIS in our training set. This strategy is aimed at increasing the generalization capabilities of our retriever: notably, given a claim, the retriever is trained to both provide the passages to support it along with the ones that are useful to contradict it. Following the methodology outlined in Dense Passage Retrieval (Karpukhin et al. , 2020, DPR), we define our retriever E as a Transformer-based encoder, which produces dense representations of both claims and passages. Starting from an input claim c and a knowledge corpus D, we use E to compute a vector representation v c for c, and v p for every passage {p 1 , p 2 , . . . , p m } ∈ D. Then, we use the dot product v c · v p to rank all the passages in D and, finally, extract the top k among these. The resulting k passages form our evidence set R k (c, D) for c. Finally, we minimize the DPR loss L to train E:",
      "where N is the batch size, v c i is the vector rep resentation of the i-th claim in the batch, v p + is i the vector representation of the corresponding gold passage for the i-th claim, and v p − represents the j vector representations of all the other passages in the batch, serving as in-batch negatives. This for mulation ensures that the model learns to score the correct passage higher than the other ones within each batch, which has been shown to be an effec tive strategy for training retrieval models (Yih et al. , 2011; Gillick et al. , 2019).",
      "The final step of our factuality evaluation method ology involves verifying each claim c generated by our claim extractor from the text t, by comparing it against the corresponding passages R k (c, D) re trieved from our corpus. Inspired by previous work on consistency evaluation (Zha et al. , 2023; Chen and Eger, 2023; Scirè et al. , 2024), we ground our verification approach on the NLI formulation. NLI is a task that determines the logical relationship between two texts: a premise and a hypothesis. For mally, given a premise pre and a hypothesis hyp:",
      "Training a claim verifier on LLM-Oasis In this section, we show how LLM-O ASIS can be uti lized to train a model for the claim verification task. Complying with the NLI formulation, we require a strategy to assess whether each claim extracted from a text is entailed, contradicted, or neutral with respect to a set of the retrieved passages. With this purpose, we construct a training dataset by deriving the following ⟨claim, passage, label⟩ triplets from LLM-O ASIS :",
      "where c j ∈ C is a claim extracted by the LLM from the original text t (cf. Section 3.1), and F and U are the factual and unfactual texts outlined in Section 3.3. We expand our training dataset for NLI with the following triplets:",
      "NLI(pre, hyp) = Y ∈ {E NT , N EUT , C ONTR }, where Y is a label indicating whether pre entails (E NT ), is neutral about (N EUT ), or contradicts (C ONTR ) hyp.",
      "where c i is the falsified version of the extracted claim c i (cf. Sec. 3.2). To obtain a complete NLI dataset, we require a strategy to generate neutral triplets as well. To achieve this, we first pair each claim c j in C (Sec tion 3.1) with the passages p i of the Wikipedia page W from where the original text t was ex tracted. Then, we select the passage p ∗ as the one that maximizes the neutrality probability when fed to an NLI model 8 Ψ along with c j :",
      "and augment our dataset with the neutral pairs ⟨c, p ∗ , N EUT ⟩. This approach increases the like lihood that the selected passages are semantically related to the claim, as they come from the same Wikipedia page, while still being neutral. This is preferable to randomly selecting neutral examples, as it tends to provide more meaningful contrasts. Finally, we fine-tune a Cross-Encoder model on this data; as a result of this process, we obtained",
      "We used a DeBERTa-v3-large model fine tuned on several NLI datasets. For more infor mation: https://huggingface.co/MoritzLaurer/ DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
      "Algorithm 1 Algorithm for Claim Verification. Require: claim c, top-k retrieved passages {p 1 , p 2 , . . . , p k }, NLI model Φ 1: for each passage p i in {p 1 , p 2 , . . . , p k } do 2: ŷ ← Φ(c, p i ) 3: if ŷ == E NT then 4: return True 5: else if ŷ == C ONTR then 6: return False 7: end if {The output of the model is N EUT , i.e. , neu 8: trality. Continue to the next passage} 9: end for 10: return True {All NLI outputs are neutral, c is deemed verified}",
      "our claim verification model Φ. More information about the training setup can be found in Section 6.1.",
      "Claim verification algorithm In Alg. 1 we out line how we leverage Φ to assess the factuality of a claim. Our procedure takes as input a claim, a set of top-k retrieved passages, and a claim verification model. For each ⟨passage, claim⟩ pair we obtain a label ŷ by applying Φ:",
      "where p i is a retrieved passage and c is a claim, which are fed to the NLI model Φ as the premise and hypothesis, respectively. As described in Alg. 1, the algorithm proceeds by checking the output of this model for each passage in the ranking order. If Φ outputs E NT for a passage, the claim is deemed verified (i.e. , return True). Conversely, if Φ outputs C ONTR , the claim is deemed unfactual (i.e. , return False). Finally, if Φ outputs N EUT for all passages, the claim is deemed verified (i.e. , return True), as there is no contradicting evidence available.",
      "In practice, given an input text t, we used our claim verifier to assign a factuality label to the claims generated by our claim extractor, using the passages returned by our retriever as evidences.",
      "The final factuality prediction for the text t is an aggregation of the claim-level factuality labels. Specifically, the text t is considered factual if all of its extracted claims are verified, unfactual other wise."
    ]
  },
  {
    "index": 4,
    "title": "Experimental Setup",
    "contents": [
      "Experimental Setup",
      "In this section, we provide details about the mod els and data involved in our experiments. To train our components for the end-to-end factuality eval uation task, we leverage the synthetic data from LLM-O ASIS (cf. Section 3, Figure 1). Specifi cally, we randomly split the passages in an 80/20 proportion to build the train and validation datasets, respectively. When splitting, we ensure that all the claims, as well as the factual and unfactual text generated from the same passage, will end up in the same split. We evaluate both our modular architecture (cf. Sec. 6.1) and several LLM-based baselines (cf. Sec. 6.2), showing the effectiveness of our benchmark in challenging factuality evaluation systems. To assess their performance, we rely on the LLM O ASIS gold-standard benchmark (Section 4.2). Models are evaluated across the two proposed tasks (i.e. end-to-end verification and evidence based claim verification), and we use balanced ac curacy (Brodersen et al. , 2010) as our evaluation metric. 9",
      "Here, we provide the training details for each mod ule of our proposed solution for end-to-end factual ity evaluation (cf. Sec. 5).",
      "Claim extractor As described in Section 5.1, we build our claim extractor dataset with the ⟨text, claims⟩ tuples in the training split of LLM-O ASIS . We split the resulting dataset into ∼67k passage claims pairs for training, and ∼4k passage-claims pairs for validation. Statistics about the claim ex traction dataset can be found in Table 2. We fine-tune a T5 base (Raffel et al. , 2019) model on this data to generate the sequence of claims given an input passage. We train the model for a total of 1M steps, with Adafactor (Shazeer and Stern, 2018) as optimizer with a learning rate of 1e − 5. Following Scirè et al. (2024), we rely on the easiness F 1 metric for model selection. Let C repre sent the set of generated claims for a given text and C ∗ the corresponding set of gold claims. To com pute the easiness P score, as defined by Zhang and Bansal (2021), we first calculate the ROUGE-1 10",
      "All our experiments are carried out on a single NVIDIA GeForce RTX 3090 GPU. 10 We consider ROUGE-1 to be a suitable basis for our easiness metric due to the high extractiveness of the claim",
      "score for each generated claim c ∈ C by compar ing it to every gold claim c ∗ ∈ C ∗ , and then select the maximum score. The final easiness P score is obtained by averaging these maximum scores over all generated claims:",
      "(8) Similarly, we compute the easiness R score by selecting the maximum ROUGE-1 score for each gold claim c ∗ with respect to all generated claims:",
      "(9) Finally, we combine easiness P and easiness R to calculate the easiness F 1 score, and select the model that achieves the highest easiness F 1 on our validation set.",
      "Evidence Retriever The training dataset of our retriever comprises ∼3.2M ⟨claim-evidence⟩ pairs. At validation/test time we construct the knowledge corpus with the original texts in our validation split and gold benchmark, respectively. To make the evaluation more realistic and challenging, we expand the corpus with passages from the same Wikipedia page. This approach results in our cor pus D comprising a total of 2.5M passages. We use the pre-trained Transformer-based archi tecture E5 base (Wang et al. , 2022) as our encoder E. To generate embeddings for both claims and pas sages, we apply mean pooling over the output of E. The model is trained with a batch size of 20 input texts for 300 000 steps, using AdamW (Loshchilov and Hutter, 2019) as the optimizer. We employ a learning rate of 1 · 10 −6 , with a 20% warm-up phase.",
      "Claim Verifier As outlined in Section 5.3, we for malize the claim verification task as an NLI prob lem and construct a dataset of ∼3.5M ⟨premise, hypothesis, label⟩ triplets from LLM-O ASIS . We devoted 3.2M instances for training our claim verifi cation model and the remaining 300k for validation. We fine-tune DeBERTa-v3 large (He et al. , 2021) for a total of 1M steps on this data, using Adafactor.",
      "We provide a comprehensive evaluation of a set of LLM-based baselines on the LLM-O ASIS bench mark. We compare closed-source models from the GPT family (OpenAI et al. , 2024), including the lat est GPT-4o, to open-weight LLMs such as Llama 3 (Touvron et al. , 2023) and Mistral (Jiang et al. , 2023). 11 As part of the end-to-end task evaluation, we ablate the impact of providing the LLMs with external knowledge, that is, in the RAG setting. To experiment with this, we include the top-K pas sages 12 returned by our retriever (cf. Section 5.2) in the prompts. Details about the utilized prompts and parameters for the end-to-end factuality evalu ation and evidence-based claim verification tasks are provided in Appendix B and D, respectively."
    ]
  },
  {
    "index": 5,
    "title": "Results",
    "contents": [
      "Results",
      "7.1 Task 1: End-to-End Factuality Evaluation",
      "In this section we present the results obtained in the end-to-end factuality evaluation task (cf. Section 4.2). First of all, we examine the performance of the evidence retrieval module, as this component supplies the external knowledge that is fed to the claim verifier. The performance of the end-to-end process depends on the quality of the retrieved evi dence. This step also establishes an upper bound on the external knowledge integration, directly im pacting the subsequent evaluation results.",
      "Evidence Retriever We evaluate the perfor mance of the evidence retrieval module using the Recall at k (R@k) metric, which quantifies the pro portion of relevant documents retrieved in the top k results. Formally, it is defined as:",
      "|{relevant D} ∩ {top k retrieved D}| |{relevant D}| (10) This metric allows us to assess the ability of our retriever to identify relevant passages for factuality verification within the top-k ranked results. Higher values of k generally yield higher recall, as more documents are considered, but also introduce the risk of increasing irrelevant retrievals.",
      "Due to our computational constraints, we opt to focus our evaluation of open-source LLMs on models of up to 8B parameters. 12 We selected K=30 based on the analysis of our retriever’s performance at different values of K (cf. Sec. 7.1) conducted on the validation set.",
      "For our experiments, we evaluated different val ues of k (as shown in Figure 2) and ultimately selected for all the subsequent experiments k = 30 as it provided a balance between performance and efficiency. The fine-tuned E5 base model achieved a Recall@30 (R@30) of 0.95. This is a significant improvement compared to the same model without fine-tuning, which only achieved an R@30 of 0.52. The pretraining and fine-tuning process over 3.2M passages proved crucial for this performance gain. We remark that R@K represents an upper bound of our factuality evaluation performance when ex ternal knowledge is integrated into the verification process. Further analysis and details can be found in Appendix C.",
      "End-to-End Factuality Evaluation The results for the end-to-end factuality evaluation task are shown in Table 4. As we can observe, the bal anced accuracy results of the LLMs, without addi tional external knowledge, are poor, only slightly above the random baseline in most cases. This out come highlights that our benchmark is challenging even for state-of-the-art LLMs. This is due to the extremely-hard setting we are proposing, in which models are asked to verify the factuality of a text, often subtly altered (cf. Section 3.2), without hav ing access to any additional information. Moreover, this shows that despite having been likely exposed to the entire Wikipedia during the pretraining phase, LLMs still struggle to provide the correct label us ing just their prior knowledge. The inclusion of external knowledge from our retriever is beneficial, with a general improvement of accuracy, particularly for GPT-4o (+8%). In this setting, our model achieves a balanced accu racy score of 69.24, significantly outperforming all the LLMs, with only GPT-4o falling in the same ballpark. Despite this outcome can be attributed to the fine-tuning of our model’s components on LLM-O ASIS – whose training data distribution may resemble the one of the gold benchmark – such result is notable as our system features a num ber of parameters that is considerably smaller than that of its competitors (i.e. 1B vs 7B+). Moreover, we posit that this reduced number of parameters is counterbalanced by the quality of the training dataset which indeed enables a small architecture to achieve scores comparable with, or even better than, those of LLMs. However, the fact that the best-performing system achieves a score of ∼ 0.70 further highlights the complexity of the proposed",
      "Table 4: Balanced accuracy (B-Accuracy) of different models for end-to-end factuality evaluation. \"Params\" denotes the number of parameters in billions (B). \"w/o\" stands for without external knowledge and \"w/\" stands for with external knowledge.",
      "benchmark and paves the way for future studies on factuality evaluation.",
      "7.2 Task 2: Evidence-based Claim Verification",
      "In this section, we present the results for the second task we aim to evaluate with our benchmark (see Section 4.2), i.e. , evidence-based claim verification. In this setting, we remove the claim extraction and retrieval modules, thus directly providing the claim verifier (cf. Section 5.3) with the ⟨evidence, claim⟩ pairs in our benchmark. For the LLM baselines, we used the same prompt as for the end-to-end setting, but here we replace the retrieved external passages with the gold evidence. Again, as shown in Table 5, our specialized system outperforms all its competitors including GPT-4o (93.30 vs 90.78), which is particularly remarkable given its small size. Notably, all systems achieve higher perfor mance compared to the previous setting (e.g. , our system goes from 69.24 in the end-to-end task to 93.30 in this task). We attribute this to three main factors. First, this task is a simpler instance of the previous one, namely, the model is required to ver ify a single claim rather than a passage. Second, the system is provided with the exact evidence needed to verify the claim while, in the end-to-end formula tion, each model relies on several passages returned by the retriever, hence possibly introducing noise in the process. Finally, the end-to-end verification implies reading and reasoning on a huge context (4k tokens on average) rather than the limited one (100 tokens on average) of this task."
    ]
  },
  {
    "index": 6,
    "title": "Conclusion and Future Work",
    "contents": [
      "Conclusion and Future Work",
      "In this paper, we introduce LLM-O ASIS , a large scale resource for end-to-end factuality evaluation",
      "Figure 2: Recall@k performance of the E5 base model at different values of k.",
      "Llama 3 Mistral GPT-3.5 GPT-4o Our Model",
      "8B 7B N/A N/A 0.4B",
      "76.63 64.37 75.56 90.78 93.30",
      "Table 5: Balanced accuracy (B-Accuracy) of different models for evidence-based claim verification. \"Params\" denotes the number of parameters in billions (B).",
      "obtained by extracting and falsifying information from Wikipedia. Specifically, as outlined in Fig ure 1, given a text from Wikipedia, we extract a set of factual and unfactual claims, with the latter obtained by falsifying one of the facts expressed in the original text. Starting from these sets, we design two claims2text tasks and generate a fac tual text, which is a paraphrase of the original one, and its unfactual counterpart, featuring the falsi fied claim. This resulted in 81k ⟨factual, unfactual⟩ pairs that are suitable for training and evaluating fact-checking systems, making LLM-O ASIS the largest resource for factuality evaluation. Con trarily to previous works in this domain, such as FEVER, which is focused on the simpler task of claim verification, our resource is the first enabling the training of end-to-end factuality evaluation sys tems, i.e. , approaches that are able to assess the",
      "factuality of generic text in natural language. We additionally devise a human annotation pro cess to create a gold standard for benchmarking fac tuality evaluators and to validate the quality of the proposed data creation pipeline. LLM-O ASIS en ables two challenging tasks: end-to-end factuality evaluation, which tests the ability of models to ver ify factual accuracy in raw texts in natural language, and evidence-based claim verification, which fo cuses on assessing individual claims against pro vided evidence. Our experiments reveal that open weights LLMs, such as Mistral and Llama 3, fall short in the end to-end task, only marginally surpassing the random baseline. In the same setting, even GPT-4o faces significant challenges, in both zero-shot and RAG settings, i.e. , when provided with supporting ev idence from Wikipedia, only achieving 60% and 68% of accuracy, respectively. This underscores the difficulty of the proposed benchmark and its potential to drive progress in factuality evaluation. Furthermore, thanks to LLM-O ASIS , we designed a novel baseline for end-to-end factuality evalua tion, which consists of a pipeline of smaller, spe cialized models trained on three subtasks, namely, claim extraction, evidence retrieval and claim veri fication. Our approach demonstrated competitive or even superior performance to GPT-4o, showcas ing the potential of smaller specialized LMs for factuality evaluation.",
      "Looking forward, we plan to expand LLM O ASIS to incorporate data from diverse domains and multiple languages, enhancing its utility and applicability. With the aim of fostering research in factuality evaluation, we release our resource at https://github.com/Babelscape/LLM-Oasis."
    ]
  },
  {
    "index": 7,
    "title": "References",
    "contents": [
      "References",
      "Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pe dro H. Martins, João Alves, Amin Farajian, Ben Pe ters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, and André F. T. Martins. 2024. Tower: An open multilingual large language model for translation-related tasks.",
      "Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. 2010. The bal anced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recogni tion, pages 3121–3124.",
      "Shiqi Chen, Yiran Zhao, Jinghan Zhang, I-Chun Chern, Siyang Gao, Pengfei Liu, and Junxian He. 2023. Felm: Benchmarking factuality evaluation of large language models.",
      "Yanran Chen and Steffen Eger. 2023. Menli: Robust evaluation metrics from natural language inference.",
      "Alexander R. Fabbri, Wojciech Kryściński, Bryan Mc Cann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. Summeval: Re-evaluating summariza tion evaluation.",
      "Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessan dro Presta, Jason Baldridge, Eugene Ie, and Diego Garcia-Olano. 2019. Learning dense representations for entity retrieval. In Proceedings of the 23rd Con ference on Computational Natural Language Learn ing (CoNLL), pages 528–537, Hong Kong, China. Association for Computational Linguistics.",
      "Tanya Goyal, Junyi Li, and Greg Durrett. 2022. News summarization and evaluation in the era of gpt-3.",
      "Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations.",
      "Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shan tanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. Ruler: What’s the real context size of your long-context language models?",
      "Hai Hu, Kyle Richardson, Liang Xu, Lu Li, Sandra Kübler, and Lawrence S. Moss. 2020. OCNLI: original chinese natural language inference. CoRR, abs/2010.05444.",
      "Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men sch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guil laume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.",
      "Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. 2023. Evaluating open-domain ques tion answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the",
      "Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.",
      "Gregory Kamradt. 2023. Needleinahaystack.",
      "Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.",
      "Philippe Laban, Alexander R. Fabbri, Caiming Xiong, and Chien-Sheng Wu. 2024. Summary of a haystack: A challenge to long-context llms and rag systems.",
      "Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2021. Summac: Re-visiting nli based models for inconsistency detection in summa rization.",
      "Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Hein rich Küttler, Mike Lewis, Wen tau Yih, Tim Rock täschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-augmented generation for knowledge intensive nlp tasks.",
      "Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir Radev. 2023. Revisiting the gold standard: Grounding summarization evaluation with robust hu man evaluation.",
      "Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Confer ence on Learning Representations.",
      "Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettle moyer, and Hannaneh Hajishirzi. 2023. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singa pore. Association for Computational Linguistics.",
      "Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2024. Generating benchmarks for factuality evalua tion of language models. In Proceedings of the 18th Conference of the European Chapter of the Associa tion for Computational Linguistics (Volume 1: Long Papers), pages 49–66, St. Julian’s, Malta. Associa tion for Computational Linguistics.",
      "OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale man, Diogo Almeida, Janko Altenschmidt, Sam Alt man, Shyamal Anadkat, Red Avila, Igor Babuschkin,",
      "Suchir Balaji, Valerie Balcom, Paul Baltescu, Haim ing Bao, Mohammad Bavarian, Jeff Belgum, Ir wan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brock man, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Ful ford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Hee woo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Ka mali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirch ner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Kon stantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambat tista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perel man, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Poko rny, Michelle Pokrass, Vitchyr H. Pong, Tolly Pow ell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ry der, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav",
      "Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Fe lipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Fe lipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Ji ayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qim ing Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report.",
      "Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstrac tive summarization with frank: A benchmark for factuality metrics.",
      "Alicia Parrish, William Huang, Omar Agha, Soo-Hwan Lee, Nikita Nangia, Alexia Warstadt, Karmanya Ag garwal, Emily Allaway, Tal Linzen, and Samuel R. Bowman. 2021. Does putting a linguist in the loop improve NLU data collection? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4886–4901, Punta Cana, Dominican Re public. Association for Computational Linguistics.",
      "Xiao Pu, Mingqi Gao, and Xiaojun Wan. 2023. Sum marization is (almost) dead.",
      "Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text trans former. CoRR, abs/1910.10683.",
      "Zafaryab Rasool, Stefanus Kurniawan, Sherwin Balugo, Scott Barnett, Rajesh Vasa, Courtney Chesser, Ben jamin M. Hampstead, Sylvie Belleville, Kon Mouza kis, and Alex Bahar-Fuchs. 2024. Evaluating llms on document-based qa: Exact answer selection and numerical extraction using cogtale dataset. Natural Language Processing Journal, 8:100083.",
      "Alessandro Scirè, Karim Ghonim, and Roberto Navigli. 2024. Fenice: Factuality evaluation of summariza tion based on natural language inference and claim extraction.",
      "Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235.",
      "Derek Tam, Anisha Mascarenhas, Shiyue Zhang, Sarah Kwan, Mohit Bansal, and Colin Raffel. 2022. Evalu ating the factual consistency of large language mod els through summarization.",
      "Liyan Tang, Tanya Goyal, Alex Fabbri, Philippe La ban, Jiacheng Xu, Semih Yavuz, Wojciech Kryscin ski, Justin Rousseau, and Greg Durrett. 2023. Un derstanding factual errors in summarization: Errors, summarizers, datasets, error detectors. In Proceed ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11626–11644, Toronto, Canada. Association for Computational Linguistics.",
      "Liyan Tang, Igor Shalyminov, Amy Wing mei Wong, Jon Burnsky, Jake W. Vincent, Yu’an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, and Kathleen McK eown. 2024. Tofueval: Evaluating hallucinations of llms on topic-focused dialogue summarization.",
      "James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.",
      "S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. 2024. A comprehensive survey of hallucination mitigation techniques in large language models.",
      "Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.",
      "Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, and Hao Wang. 2024. Multimodal needle in a haystack: Benchmarking long-context capability of multimodal large language models.",
      "Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly supervised contrastive pre-training.",
      "Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023a. Document-level machine translation with large lan guage models. In Proceedings of the 2023 Confer ence on Empirical Methods in Natural Language Pro cessing, pages 16646–16661, Singapore. Association for Computational Linguistics.",
      "Yiming Wang, Zhuosheng Zhang, and Rui Wang. 2023b. Element-aware summarization with large language models: Expert-aligned evaluation and chain-of thought method.",
      "Wen-tau Yih, Kristina Toutanova, John C Platt, and Christopher Meek. 2011. Learning discriminative",
      "projections for text similarity measures. In Proceed ings of the fifteenth conference on computational nat ural language learning, pages 247–256.",
      "Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. Alignscore: Evaluating factual consistency with a unified alignment function.",
      "Biao Zhang, Barry Haddow, and Alexandra Birch. 2023a. Prompting large language model for machine translation: A case study.",
      "Shiyue Zhang and Mohit Bansal. 2021. Finding a bal anced degree of automation for summary evaluation. In Proceedings of the 2021 Conference on Empiri cal Methods in Natural Language Processing, pages 6617–6632, Online and Punta Cana, Dominican Re public. Association for Computational Linguistics.",
      "Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori Hashimoto. 2023b. Benchmarking large language models for news summarization. Transactions of the Associa tion for Computational Linguistics, 12:39–57.",
      "We present several examples derived from our dataset to show the model’s capability of gener ating both factual and unfactual texts. These exam ples show how our pipeline produces paraphrased versions of original texts and introduces subtle yet critical factual inaccuracies.",
      "Original Text: Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also known for its influence on the philosophy of science. Einstein is best known for his mass–energy equivalence formula E = mc 2 , which has been dubbed “the world’s most famous equation”. Extracted Claims:",
      "1. Albert Einstein was a German-born theoretical physicist.",
      "2. He developed the theory of relativity.",
      "4. Einstein’s work influenced the philosophy of science.",
      "5. He is best known for his mass–energy equiva lence formula E = mc 2 .",
      "3. The theory of relativity is one of the two pil lars of modern physics.",
      "6. The formula E = mc 2 is dubbed “the world’s most famous equation”.",
      "Factual Text: Albert Einstein, originally from Germany, was a theoretical physicist who formulated the theory of relativity, a cornerstone of modern physics. His contributions significantly impacted the philosophy of science. The mass–energy equivalence equation E = mc 2 , which he is most famous for, is often called “the world’s most famous equation”. Falsified Claim: He developed the theory of quantum mechanics. Unfactual Text: Albert Einstein, originally from Germany, was a theoretical physicist who formulated the theory of quantum mechanics, a cornerstone of modern physics. His contributions significantly impacted the philosophy of science. The mass–energy equiva lence equation E = mc 2 , which he is most famous for, is often called “the world’s most famous equa tion”.",
      "Original Text: The Amazon Rainforest, also known as Ama zonia, is a moist broadleaf forest in the Amazon biome that covers most of the Amazon basin of South America. This region includes territory be longing to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainfor est. Extracted Claims:",
      "1. The Amazon Rainforest is also known as Ama zonia.",
      "2. It is a moist broadleaf forest in the Amazon biome.",
      "3. The Amazon Rainforest covers most of the Amazon basin of South America.",
      "4. The region includes territory belonging to nine nations.",
      "5. The majority of the forest is contained within Brazil.",
      "6. Brazil contains 60% of the rainforest.",
      "Factual Text: Amazonia, widely known as the Amazon Rainfor est, is a damp broadleaf forest located within the Amazon biome, covering a significant portion of the",
      "Amazon basin in South America. This vast region spans across nine countries, with Brazil housing 60% of the rainforest. Falsified Claim: The majority of the forest is contained within Peru. Unfactual Text: Amazonia, widely known as the Amazon Rain forest, is a damp broadleaf forest located within the Amazon biome, covering a significant portion of the Amazon basin in South America. This vast region spans across nine countries, with Peru hous ing 60% of the rainforest These examples show the effectiveness of the model in creating pairs of factual and unfactual texts. The alterations are subtle, ensuring that the generated unfactual texts are challenging for both human annotators and automated systems to detect, thus providing a robust testbed for evaluating the factual accuracy of LLM-generated content.",
      "Prompts for End-to-End Factuality Evaluation",
      "To accomplish the task of end-to-end factuality evaluation, we employ different prompting strate gies depending on the language model being used. For models like Llama, which supports a system prompt, we set specific instructions as the system message. For models like Mistral, which do not support a system prompt, we include the instruc tions at the beginning of the text. In our experi ments, we set the temperature to 0.7 to control the randomness of the generated outputs, ensuring a balance between diversity and relevance. Other hy perparameters include a maximum token length of 5 and a beam search with a width of 5 for decoding the outputs. These settings were chosen to opti mize the model’s performance while maintaining computational efficiency. The unified prompt used for factuality evaluation is provided in Table 6. The system message is set as follows: \"You are a highly-accurate fact-checker. Your task is to determine the factuality of a given text. A text is considered ’Factual’ only if it is completely factually-accurate and contains no factual inac curacies. Even a single small factual inaccuracy should result in a ’Not Factual’ determination. An swer with just ’Factual’ or ’Not Factual’ without any explanation. \" To further evaluate the impact of external knowledge, we prompted the LLMs with the same pieces of evidence retrieved and used by our NLI mod ule (cf. Sec. 7.1). The updated prompt is shown in Table 7. The same prompt was also used for the experiment in Sec. 7.2 for the Evidence-based claim verification. Moreover, we updated the sys tem prompt with the following: \"You are a highly-accurate fact-checker. Your task is to determine the factuality of a given text using the evidence provided. A text is considered ’Factual’ only if it is completely factually accurate and contains no factual inaccuracies. Even a single small factual inaccuracy should result in a ’Not Factual’ determination. If evidence is not available, use your prior knowledge to make the assessment. Answer with just ’Factual’ or ’Not Factual’ without any explanation. \" We tested different prompts and found that this one led to the best results. By providing clear exam ples and detailed instructions, we aim to ensure that the model accurately assesses the factuality of the given texts. This structured approach helps in train ing and evaluating the models effectively, ensuring high accuracy in end-to-end factuality evaluation.",
      "Further details on Evidence Retriever module",
      "In this section, we present further details about our evidence retrieval model. To assess the contri bution of different components, we performed an ablation study on the retrieval module. All mod els were trained using the same hyperparameters described in Section 5.2. Results are computed on the corpus D, which contains 2.5 million passages, and evaluated on the validation split of the dataset. After training, our best model achieved a recall at k = 30 (R@30) of 0.95. We employed the E5 base model (Wang et al. , 2022), built upon the bert-base-uncased (?) ar chitecture, with weights initialized from Sentence Transformers (?). As part of our ablation study, we also trained the bert-base-uncased model with the same hyperparameters, achieving a recall of 0.85. This significant performance drop compared to the fully fine-tuned E5 demonstrates the effectiveness of the additional pretraining done in E5. Additionally, we experimented with other archi tectures from the E5 family. The E5 small model obtained a recall of 0.75, whereas the E5 large model slightly outperformed E5 base , achieving a recall of 0.96. Despite the marginal 1% performance gain, we opted to use the E5 base model in our final system due to the substantial increase in computational resources and training time required by the E5 large model, which did not justify the small performance improvement. The results of all models tested during the abla tion study are summarized in Table 8, confirming the robustness and efficiency of the E5 base model for claim retrieval, balancing performance with computational cost.",
      "Details about the employed LLMs",
      "In this section, we detail the models we used in this work. For the generation of our dataset, we used GPT-4 API, with an approximate cost of $2000. As for the open-source models we utilized for the LLM baselines, we used the instruction tuned versions of Mistral 13 and LLama 3 14 publicly available on Hugging Face. For the benchmark evaluation, we utilized the OpenAI API. Specifically, for GPT-4o, we employed the model GPT-4o-2024-05-13. For GPT-3.5, we used GPT-3.5-turbo-16k-0613 due to the necessity of handling a task with context ex ceeding the maximum window size of the standard GPT-3.5 model. For the claim-extractor, we use the pre-trained T5-base 15 as our base model.",
      "In this section, we illustrate the annotation guide lines employed. Annotators are asked to perform four different tasks related to factuality evaluation. For each task, annotators receive specific guide lines which we report in what follows. As a stan dard guideline for all tasks, annotators are required to discard instances entirely or partially written in a language other than English. Furthermore, in case of pronominal ambiguity occurring in a given claim, if the human annotator cannot determine, with a high degree of confidence, the noun to which a given pronoun refers, such claim is discarded. An notators are required to participate in joint sessions to resolve challenges and collaboratively develop agreed-upon solutions.",
      "Task description In this step, you will verify if claims extracted from a given text are accuhttps://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.3 14 https://huggingface.co/meta-Llama/ Meta-Llama-3-8B-Instruct 15 https://huggingface.co/google-t5/t5-base",
      "Table 6: Prompt for factuality evaluation of a text.",
      "rately represented within the original text. You will receive a 5-sentence passage extracted from Wikipedia, along with corresponding claims pre extracted by a language model. Note: a claim denotes an atomic fact, that is, an elementary infor mation unit found in a text, that does not require further subdivision, and that can be checked for its truthfulness.",
      "Annotation Format You will be provided with a TSV (Tab-Separated Values) file containing three columns:",
      "• Column 1: Identifier (either \"text\" or \"claim id\")",
      "• Column 3: Empty. You have to fill in this column.",
      "1. Read the original text and claims thoroughly.",
      "2. For each claim, determine if it is accurately represented in the original text.",
      "3. Place a \"v\" in the third column if the claim is present in the original text, otherwise mark it with an \"x\"",
      "Annotation Example We report an example of annotated instance in Table 9.",
      "Additional Guidelines Annotators are required to discard an entire instance, composed of the orig inal text and the corresponding claims, if the orig inal text is not grammatically correct, e.g. , if it is syntactically ill-formed, or if it is semantically un clear, that is, if it is formulated in a way that the annotator cannot determine the meaning conveyed either by the entire text or one of its segments. Fur thermore, annotators are required to discard sen tences which cannot be considered as claims for the purposes of our work, e.g. , sentences composed of a single word.",
      "Task Description In this step, you will identify whether a given claim has been altered to introduce unfactual information.",
      "Annotation Format You will receive a pair of claims, where the second claim is an unfactual",
      "Table 7: Prompt for text factuality evaluation integrating external knowledge.",
      "• Column 3: Empty. You have to fill in this column.",
      "1. Compare the two claims provided.",
      "2. Determine if the unfactual claim introduces new, untrue information compared to the orig inal claim.",
      "3. Mark column 3 with \"v\" if unfactual infor mation is introduced, otherwise mark it with \"x\".",
      "Annotation Example We report an example of annotated instance in Table 11.",
      "Additional Guidelines If the original claim con tains a word that is replaced with its hyponym in the candidate nonfactual claim, while the overall mean ing of both claims remains unchanged also based on the annotator’s world knowledge, then both claims are considered to be semantically equiv alent.",
      "Task Description In this step, you will assess whether the semantics of claims is preserved in a paraphrased version of the text.",
      "Annotation Format You will receive a TSV file with four columns:",
      "• Column 1: Identifier (either \"paraphrase\" or \"claim id\")",
      "• Column 3: Empty. You have to fill in this column.",
      "Model E5 base (without fine-tuning) E5 base bert-base-uncased E5 small E5 large",
      "Table 8: Performance of Different Models on Claim Retrieval Task",
      "Table 9: Example of annotated instance in task 1 (claim extraction).",
      "• Column 4: Empty. You have to fill in this column.",
      "1. Compare each claim with its representation in the paraphrased text.",
      "2. Determine if its semantics is preserved.",
      "• If it is preserved (regardless of whether it is reported identically in the paraphrase), place a \"v\" in the third column. • Use \"x\" otherwise.",
      "• If a claim is paraphrased, mark the fourth column with \"v\". • If not paraphrased (e.g. identical), mark column 4 with \"x\".",
      "Recall@30 0.52 0.95 0.85 0.75 0.96",
      "• <\"v\", \"v\"> in the last two columns means that the semantics is preserved and the text is para phrased (at least one word changed).",
      "• <\"x\", \"v\"> in the last two columns means that the semantics is NOT preserved but the text is paraphrased.",
      "• <\"v\", \"x\"> in the last two columns means that the semantics is preserved but the text is NOT paraphrased.",
      "• <\"v\", \"x\"> in the last two columns means that the semantics is preserved but the text is NOT paraphrased.",
      "• <\"x\", \"x\"> in the last two columns means that neither the semantics is preserved nor the text is paraphrased (e.g. the claim is omitted).",
      "Annotation Example We report an example of annotated instance in Table 11.",
      "’Call Me by Your Name’ took the lead in Do rian Award nominations. The article, penned by Gregg Kilday, was published by The Hol lywood Reporter on January 10, 2018, and accessed the following day. Meanwhile, The Jameson Empire Awards were held back in 2014.",
      "Table 11: Example of annotated instance in task 3 (factual text generation).",
      "Additional Guidelines If a nearly identical date appears in the factual text and in one claim, anno tators should proceed as follows. If the date in the factual text includes the month and year, while the claim specifies the day, month, and year, even if the month and year in the claim coincide with those in the factual text, the semantics conveyed by the claim is considered to be different from that of the factual text.",
      "Task description In this step, you will assess whether all claims, including the unfactual one, are accurately reflected in a generated unfactual text.",
      "Annotation Format You will receive all claims paired with the generated unfactual text.",
      "• Column 1: Identifier (either \"claim id\", or \"unfactual_text\")",
      "• Column 3: Empty. You have to fill in this column.",
      "• Review the generated unfactual text along with all claims provided",
      "• Determine if all claims are correctly reported in the text (i.e. the factual claims should re main factual and the unfactual claims should be unfactual). Ensure that the text in the “un factual_text” field is not modified by the lan guage model to be compliant with the unfac tual claim. Paraphrasing in claims is allowed, you should focus on semantics.",
      "• Mark column 3 with \"v\" if the unfactual text corresponds to the claims accurately, other wise mark it with \"x\".",
      "Annotation Example We report an example of annotated instance in Table 12.",
      "Table 12: Example of annotated instance in task 3 (unfactual text generation)."
    ]
  }
]

Here too, the text is extracted reasonably well.
A pleasant surprise is that the References come out fairly accurately, one entry per line. That makes me want to parse reference information in a future update of the library.
On the other hand, equations and emoji are, as expected, still not handled well.
In addition, the table detection missed a few regions, so some table text slipped into the output.

There is still room for improvement, but what I set out to do works reasonably well, so for now the crate is complete.
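As a quick sanity check that the output is easy to consume downstream, here is a minimal sketch (not part of rsrpp-cli) that reads output.json back into Rust structs. It assumes serde (with the derive feature) and serde_json as dependencies; ParsedSection is a hypothetical local struct whose fields simply mirror the index / title / contents keys visible in the JSON above, not the library's own Section type.

use serde::Deserialize;

// Mirrors the JSON written by rsrpp-cli: an array of sections.
#[derive(Deserialize, Debug)]
struct ParsedSection {
    index: usize,
    title: String,
    contents: Vec<String>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the file produced by `rsrpp --pdf ... --out output.json`.
    let json = std::fs::read_to_string("output.json")?;
    let sections: Vec<ParsedSection> = serde_json::from_str(&json)?;

    // Print a one-line summary per section.
    for section in &sections {
        println!(
            "[{}] {} ({} paragraphs)",
            section.index,
            section.title,
            section.contents.len()
        );
    }
    Ok(())
}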

Program

rsrpp > rsrpp-cli > src > main.rs
use clap::Parser;
use rsrpp::parser::parse;
use rsrpp::parser::structs::{ParserConfig, Section};
use std::path::Path;

#[derive(Parser, Debug)]
#[command(version, about, long_about=None)]
struct Args {
    #[arg(short, long)]
    pdf: String,

    #[arg(short, long)]
    out: Option<String>,
}

#[tokio::main]
async fn main() {
    let args = Args::parse();

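    // Treat the argument as a URL if it starts with "http"; otherwise make sure the local file exists.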
    let is_url = args.pdf.starts_with("http");
    if !is_url && !Path::new(args.pdf.as_str()).exists() {
        eprintln!("File not found: {}", args.pdf);
        std::process::exit(-1);
    }

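    // Fall back to output.json when --out is not given; only .json output paths are accepted.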
    let outfile = args.out.unwrap_or("output.json".to_string());
    assert!(
        outfile.ends_with(".json"),
        "Output file must be a JSON file"
    );

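    // Parse the PDF into pages, group them into sections, and serialize the result as pretty-printed JSON.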
    let mut config = ParserConfig::new();
    let pages = parse(args.pdf.as_str(), &mut config).await.unwrap();
    let sections = Section::from_pages(&pages);
    let json = serde_json::to_string_pretty(&sections).unwrap();

    std::fs::write(format!("{}", outfile), json).unwrap();
}

Next time

Now that rsrpp is complete, the next step is to use it to build applications on top of academic-paper text extraction.

Next: Rustで学術論文からテキストを抽出する #13
