Introduction
This article was posted to the 「ただただアウトプットを癖付けるための Advent Calendar 2024」 (an Advent Calendar I am using simply to make writing output a habit).
As I wrote in the first article, I am a researcher specializing in experimental biophysics.
Recently I have also been developing machine-learning code for data analysis, and I was fortunate enough to have that work accepted at NeurIPS.
In the previous article, I turned a local LLM into an API server so that it could serve as the backend for other AI services.
This time, I went a step further and tried generating a survey paper with that local LLM.
Automatic survey generation was proposed in the paper "AutoSurvey: Large Language Models Can Automatically Write Surveys".
In that work, papers related to LLMs are collected into a database, and a survey is written from that database based on the user's input.
Here, I follow that paper and attempt survey generation using a local LLM.
Related articles
Previous article: 「生物物理屋がローカルLLMをAPIサーバーにして遊んでみた話」 (A biophysicist plays with turning a local LLM into an API server)
Next article: 「【生物物理屋による論文紹介】ハイパーグラフと層」 (Paper introduction by a biophysicist: hypergraphs and sheaves)
Installing AutoSurvey
AutoSurvey can be installed from its GitHub repository.
Following the README, I cloned the repository with git clone and installed the dependencies with pip install.
I originally intended to run it on my local PC, but I got stuck at this installation step.
AutoSurvey requires faiss-gpu, and installing that on Windows appears to be quite difficult.
So I decided to run AutoSurvey on the remote server that is already running my local LLM (Ollama).
Since that machine runs Ubuntu, a plain pip install pulled in AutoSurvey together with all of its required libraries without any trouble.
Using AutoSurvey
The basic usage is described in AutoSurvey's README.
Generating a survey
The template for the survey-generation command is as follows:
```
python main.py --topic "alternatives for transformer" \
    --gpu 0 \
    --saving_path ./output/ \
    --model llama3.1 \
    --section_num 7 \
    --subsection_len 700 \
    --rag_num 60 \
    --outline_reference_num 1500 \
    --db_path ./database \
    --embedding_model nomic-ai/nomic-embed-text-v1 \
    --api_url http://localhost:11434/v1/chat/completions \
    --api_key ollama
```
The changes from the command in the README are as follows:

- `--model`: set to llama3.1
- `--api_url`: set to the URL of the local LLM API server
- `--api_key`: not actually used, so any placeholder string will do
Running this command generates a survey paper on the topic given with `--topic` (here, alternatives to the Transformer).
One thing to watch out for: `--api_url` has to point to `/v1/chat/completions`, the endpoint that speaks the OpenAI API JSON format, rather than `/api/generate` or `/api/chat`.
AutoSurvey post-processes the model output assuming OpenAI-style responses, so with the native Ollama endpoints the run never finishes cleanly.
(Because of how the errors are handled, it took me a full day to track this down...)
See here for how to use Ollama through the OpenAI API.
The linked page specifies `/v1`, but that is only the base URL. What we need here is the URL that gets hit directly, so `/chat/completions` has to be appended to it.
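For comparison, here is a minimal sketch of how that base URL is meant to be used with the official `openai` Python client (the model name and prompt are just placeholders); the client appends `/chat/completions` by itself, which is why the documentation only shows `/v1`:

```python
from openai import OpenAI

# The client takes the *base* URL and appends /chat/completions internally.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1",  # any model already pulled into Ollama
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)
```

AutoSurvey, on the other hand, posts to the URL you give it directly, so it needs the full `/v1/chat/completions` path.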
At the very start, a connection test is run that checks the response to `hello`.
The reply of course depends on the model, but something like `Hello! How can I assist you today?` means the connection is working.
If the response comes back as `None` or similar, the connection has most likely failed.
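To reproduce that connection test by hand, a small script along these lines (a sketch; the payload simply follows the OpenAI chat-completions format) is handy for confirming that the full endpoint answers before starting a long run:

```python
import requests

API_URL = "http://localhost:11434/v1/chat/completions"  # full endpoint, not just /v1

payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "hello"}],
}
r = requests.post(API_URL, json=payload, timeout=60)
r.raise_for_status()

reply = r.json()["choices"][0]["message"]["content"]
print(reply)  # a greeting like "Hello! How can I assist you today?" means the server is reachable
```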
While debugging I also noticed that `--gpu` is accepted but never actually used; with the current code there is no way to choose which GPU is used. (If you need to pin the run to a specific GPU, setting the `CUDA_VISIBLE_DEVICES` environment variable before launching should work, though that is a workaround outside AutoSurvey itself.)
I tried llama3.1, llama3.2, and mistral, and with all three the HTTP requests to Ollama started returning 500 errors midway through generating the subsection outlines, so generation stopped partway.
A 500 error means the Ollama API server received the request but could not process it.
My guess is that, unlike LLMs hosted on cloud servers, a local LLM is somewhat weak at handling concurrent connections.
The log at the time of the error is shown below.
```
[GIN] 2024/11/29 - 18:32:26 | 200 | 8.83792717s | 127.0.0.1 | POST "/v1/chat/completions"
time=2024-11-29T18:32:26.673+09:00 level=WARN source=runner.go:125 msg="truncating input prompt" limit=2048 prompt=14353 numKeep=5
panic: failed to decode batch: could not find a kv cache slot
goroutine 7 [running]:
main.(*Server).run(0xc0000fc120, {0x598508e48cc0, 0xc0000d40a0})
github.com/ollama/ollama/llama/runner/runner.go:335 +0x23e
created by main.main in goroutine 1
github.com/ollama/ollama/llama/runner/runner.go:934 +0xc52
[GIN] 2024/11/29 - 18:32:27 | 500 | 873.670435ms | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2024/11/29 - 18:32:27 | 500 | 2.52417571s | 127.0.0.1 | POST "/v1/chat/completions"
```
To get rid of this error I had to change how often generation prompts are sent, which is controlled by variables hard-coded inside the library.
Specifically, in `def batch_chat` in model.py, I changed

- `max_threads=15` to `max_threads=1`
- `time.sleep(0.3)` to `time.sleep(1)`

and with those changes the outline generation went through.
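For illustration, the effect of those two changes is simply to serialize the requests and space them out. A rough sketch of that throttling pattern is below (this is not AutoSurvey's actual `batch_chat`; `chat_once` is a hypothetical helper that sends a single prompt to the API server):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def chat_once(prompt: str) -> str:
    """Hypothetical helper: send one prompt to the local LLM and return the reply."""
    raise NotImplementedError

def batch_chat(prompts, max_threads=1, delay=1.0):
    # max_threads=1 serializes the requests and delay spaces them out,
    # which keeps a single-host Ollama instance from being flooded.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = []
        for p in prompts:
            futures.append(pool.submit(chat_once, p))
            time.sleep(delay)  # pause between submissions
        return [f.result() for f in futures]
```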
Errors also occurred in the post-processing of the outline; the output that triggered them looked like this:
'<format>\n\n**I. Model Compression**\n\n* Subsection A: Knowledge Distillation\n\t+ Description A: Methods such as distillation and pruning to reduce model size and increase inference speed while retaining accuracy.\n* Subsection B: Quantization\n\t+ Description B: Techniques that aim to reduce the precision of model parameters while maintaining their functionality, reducing memory usage and computational costs.\n\n**II. Model Pruning**\n\n* Subsection C: Unstructured Weight Pruning\n\t+ Description C: Methods for removing redundant or weak connections between model weights to achieve a smaller, yet still accurate model.\n* Subsection D: Structured Pruning\n\t+ Description D: Techniques that aim to remove groups of connections in a structured way to promote model sparsity and reduce computational costs.\n\n**III. Optimization Techniques**\n\n* Subsection E: Differentiable Neural Architecture Search (DNAS)\n\t+ Description E: Methods for assigning optimal scales and precisions to model parameters while pruning redundant connections.\n* Subsection F: Second-Order Pruning\n\t+ Description F: Techniques that leverage full-second-order information for accurate pruning decisions, typically used in conjunction with unstructured weight pruning.\n\n**IV. Application Domains**\n\n* Subsection G: Natural Language Processing (NLP)\n\t+ Description G: Models developed using Transformer-based architectures for language tasks, including text classification.\n* Subsection H: Real-time and High-Volume Use Cases\n\t+ Description H: Scenarios where reducing inference time is essential, often involving the deployment of compact models.\n\n<format>'
Where numbers should appear, as in `Subsection 1` and `Description 1`, this output uses letters instead, which is presumably what triggers the error.
My guess is that whether this error appears depends on how well a given model can stick to the requested output format.
So far the error above occurs frequently with llama3.1 and llama3.2.
With mistral, I instead got an error where a `Subsection` was present but its `Description` was missing.
That error rate is fairly low, though, so generation will sometimes complete after a few retries.
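As a rough illustration of why the lettered headings break things, the post-processing presumably looks for numbered entries with a pattern along these lines (hypothetical; AutoSurvey's actual parsing code may differ):

```python
import re

# Expected style: "Subsection 1: ..." (and similarly "Description 1: ...")
pattern = re.compile(r"Subsection\s+(\d+)")

print(bool(pattern.search("* Subsection 1: Knowledge Distillation")))  # True
print(bool(pattern.search("* Subsection A: Knowledge Distillation")))  # False -> parsing fails
```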
However, the HTTP errors then came back at the stage of writing the main body.
Drastically cutting the text length and the number of references lowered the error rate.
The command was as follows:
```
python main.py --topic "alternatives for transformer" \
    --gpu 0 \
    --saving_path ./output/ \
    --model mistral \
    --section_num 7 \
    --subsection_len 70 \
    --rag_num 60 \
    --outline_reference_num 150 \
    --db_path ./database \
    --embedding_model nomic-ai/nomic-embed-text-v1 \
    --api_url http://localhost:11434/v1/chat/completions \
    --api_key ollama
```
As a trial I ran this 30 times; generation completed in 3 of the 30 runs.
That is quite an error rate, but when generation does finish, the output looks like the following.
The hyperlinks are broken and the references give only paper titles, but both appear to be by design in AutoSurvey.
For what it's worth, several of the references were titles of papers that actually exist on arXiv.
#
## HPCTransformer
### Overview
HPCTransformer, a method optimized for massive parallel hardware, employs techniques such as mixed precision computing and adaptive batch size selection to minimize computational complexity [1]. The optimization strategies within HPCTransformer are not limited to its design; it also incorporates best practices from other libraries.
One such example is High-Performance Tensor Transposition (HPTT), an open-source C++ library focusing on tensor transpositions for efficiency [2]. By integrating these optimization suggestions, HPCTransformer demonstrates superior performance, particularly in supercomputing environments.
As detailed in the '[3]' paper, HPCTransformer's design and implementation on supercomputing clusters involve MPI for distributed memory parallelization using recursive coordinate bisection for domain decomposition and MPI remote memory access. The particle interactions are organized into target batches/source clusters, which efficiently map onto the GPU, significantly improving the performance of particle simulation [3].
### Optimizations
Compared to traditional Transformers, HPCTransformer leverages a range of optimizations tailored for massive parallel hardware [1]. These comprise mixed precision computing, which allows for computations using less memory-intensive precision types for enhanced efficiency [4], and adaptive batch size selection, a dynamic process that adjusts the number of data points processed concurrently to cater to varying hardware capabilities and boost performance throughput [1]. Such strategies not only streamline computations within HPCTransformer but also reflect best practices learned from other libraries. For instance, High-Performance Tensor Transposition (HPTT), an open-source C++ library specializing in tensor transpositions for efficiency [2], was integrated to further boost the performance of HPCTransformer, particularly in supercomputing environments [3]. Additionally, evaluations against established deep learning libraries and HPC-focused tools like XGBoost have demonstrated that HPCTransformer delivers superior efficiency, with performance gains of up to 45% improvement compared to competing methods [5]. By addressing the challenges faced by transformer-based architectures in terms of computational efficiency, resource utilization, and performance optimization, HPCTransformer stands out as an effective solution for leveraging the benefits of deep learning models within high-performance computing environments. [6]
### Performance Evaluation
Comparing HPCTransformer with Existing Deep Learning Libraries on Supercomputing Systems
Expanding upon the optimizations employed by HPCTransformer, this section presents a performance evaluation of our methodology against existing deep learning libraries across various supercomputing systems [6]. Key findings reveal that HPCTransformer outperforms baseline Transformers in terms of computational efficiency, achieving up to 2.37x speedup over vendor-optimized sparse kernels [6].
Furthermore, when compared with HPC-focused libraries like XGBoost, HPCTransformer exhibits superior performance, delivering up to 45% improvement in efficiency [5]. These results underscore the practical benefits of implementing HPCTransformer as more than just a deep learning model, but also as an effective tool within high-performance computing environments.
As alluded to in previous discussions, open challenges remain in transformer-based architectures [7], which HPCTransformer addresses by offering improved computational performance for faster adaptation, efficient resource utilization, real-time edge inference optimization, and addressing the balance between model size, accuracy, and resource requirements at scale.
### Open Challenges
[7]
While significant advancements have been made in optimizing transformer-based architectures, several persistent challenges call for attention. Prioritizing faster adaptation to new datasets and tasks requires further reductions in transformer training times [8].
Addressing the need for improved resource utilization is also crucial, with a particular emphasis on developing more efficient computational strategies suitable for systems with hardware limitations [1]. Moreover, research initiatives are necessary to enhance transformer architectures for real-time inference at the edge, specifically in safety-critical applications like autonomous vehicles where performance and budget constraints are stringent [1]. Lastly, striking a balance between model size, accuracy, and resource requirements for large-scale transformers is an essential area of focus. This research could potentially minimize computational footprints while maintaining high performance levels.
## Transformers in Natural Language Processing
### BERT and RoBERTa
Transitioning from advanced transformer-based models, we delve into BERT (Bidirectional Encoder Representations from Transformers) and its successor RoBERTa. These models have made unprecedented strides in the realm of Natural Language Processing (NLP), exhibiting a remarkable ability for few-shot learning as demonstrated by [9]. BERT has distinguished itself in various NLP applications such as sentiment analysis, question answering systems, and text summarization, showcasing significant improvements over previous state-of-the-art techniques in areas like information extraction and relation classification according to [10]. The advent of these large language models has facilitated a deeper understanding of the complexities inherent in human language, enabling us to tackle intricate NLP tasks with improved accuracy and efficiency.
### XLNet and T5
Transformers Beyond BERT: Pushing the Limits with XLNet and T5 in Long Sequence Handling and Text Generation Tasks
Propelled by transformer-based models such as BERT [11], XLNet and T5 have made substantial progress in dealing with long sequences and a variety of text generation tasks.
XLNet innovates through a permuted input approach, allowing model training on extended sequences without the necessity for segmentation or intricate sequence-length dependent mechanisms (as detailed in XLNet: Generalized Autoregressive Pretraining). This strategy empowers the model to attend to any position in the original (non-permuted) order, enhancing its capacity to capture long-range dependencies accurately.
Distinctively, T5 streamlines encoder and decoder architectures within a unified model for diverse text generation tasks by employing bidirectional self-attention during encoding and causal self-attention during decoding (as expounded in T5: Text-to-Text Transfer Transformer). This integration facilitates information preservation while maintaining the original order of input sequences.
Comparative evaluations between XLNet and T5 indicate that both achieve state-of-the-art performance on various benchmarks such as WMT translation tasks [12] and GLUE [13], as stated in the original studies cited. By implementing efficiency measures like knowledge distillation, pruning, and quantization (as discussed in previous sections), users can leverage these advanced transformers for practical, accessible, and resource-optimized solutions tailored to a broad range of Natural Language Processing tasks. Further research in this terrain aims to propel the boundaries of transformers' capabilities while continuing to enhance their resource utilization and scalability.
### Efficient Transformers for NLP
Efficient Transformers for NLP: Improving Resource Utilization and Scalability
While transformer-based models, such as XLNet and T5, have significantly advanced natural language processing capabilities, their scalability comes at a steep computation and memory cost [6]. To mitigate these challenges, various strategies focusing on efficiency improvements have been proposed.
1. Knowledge Distillation: By leveraging smaller pre-trained models to fine-tune larger ones, knowledge distillation balances the tradeoff between large-scale models and computational resources [14].
2. Pruning: Models can be optimized by discarding less critical parameters without significantly compromising performance [15; 16].
3. Quantization: This approach reduces computation time by representing parameters using fewer bits while minimizing the impact on final outputs, thus potentially offering faster training and inference times for efficient transformer-based NLP applications [6] [6].
Implementing these efficiency measures will enable more practical, accessible NLP solutions tailored to a broad range of tasks. Further research in this area aims to push the boundaries of transformers' capabilities while continuing to optimize their resource utilization.
## Convolutional Neural Networks vs. Transformer-based Architectures
### Strengths of CNNs
------------------
Building upon the historical dominance of Convolutional Neural Networks (CNNs) in various domains, it is crucial to recognize their unique abilities that have led to such prominence. Firstly, CNNs demonstrate remarkable prowess in learning spatial hierarchies [17; 18]]. This progressive learning of features allows the network to capture and analyze details ranging from simple features like edges, textures, and shapes, ultimately progressing towards more complex structures.
Secondly, CNNs utilize receptive fields [19; 20], enabling them to scan an image with varying kernel sizes. In this manner, they simultaneously assess the spatial context of specific regions while efficiently identifying patterns that occur at multiple scales. This capability significantly amplifies their performance in image recognition tasks, solidifying their status as essential tools for computer vision research.
The self-attention mechanism in transformers allows models to focus upon significant semantic information regardless of input length [21]. This adaptability is crucial for natural language processing tasks where recognizing intricate dependencies between words is vital [21]. Transformers have also demonstrated impressive performance in computer vision applications such as image captioning and visual question answering, thanks to their ability to efficiently focus on essential regions of an image [22]. Unlike CNNs, transformers adapt their weighting strategy dynamically at each timestep, enabling them to swiftly integrate new information and respond effectively to dynamic contexts or perspectives [23; 14]. This feature differentiates transformers from CNNs, which require substantial preprocessing to handle variable-length sequences, often limiting their ability to adapt quickly in changing circumstances.
### Strengths of Transformers
Transformer-based Architectures and Their Capabilities Compared to Convolutional Neural Networks (CNNs)
Transformers, renowned for their ability to process long sequences and capture complex relationships among elements, are distinguished from CNNs in several aspects [22]. While both models excel in their respective domains, transformers demonstrate remarkable prowess in learning hierarchies of features due to their progressive self-attention mechanism [24; 18].
This adaptability allows the model to focus on semantically important elements, enabling it to effectively handle natural language processing tasks where recognizing intricate dependencies between words is vital [21] and computer vision applications, such as image captioning and visual question answering [22]. In contrast, CNNs capture spatial hierarchies, scanning images with varying kernel sizes to assess the spatial context of specific regions while efficiently identifying patterns that occur at multiple scales [19; 20].
One significant difference between transformers and CNNs is their approach to handling sequences. Unlike CNNs, which require extensive preprocessing for variable-length sequences, transformers dynamically adjust their weighting strategy at each timestep, enabling them to swiftly integrate new information and respond effectively to dynamic contexts or perspectives [23; 14].
These contrasting characteristics lead to unique applications for both architectures. Transformers, for instance, excel in tasks requiring understanding of long sequences and complex relationships, while CNNs dominate image recognition tasks due to their capability in handling spatial hierarchies and analyzing patterns at various scales [8]. In response to the challenges faced by transformer-based architectures in real-time inference requirements, such as autonomous driving systems, emerging trends focus on hybrid architectures integrating CNNs with Transformer components. By doing so, these models can capitalize on the benefits of both architectures and potentially improve performance in various deep learning pipelines [25].
### Hybrid Models
Transformer-based Architectures and Their Challenges
Transformers, renowned for their ability to process long sequences and capture complex relationships among elements, are a cornerstone of modern deep learning [21]. They leverage the self-attention mechanism that allows the model to focus on semantically important words without being hindered by input length [21]. This flexibility is particularly beneficial for natural language processing tasks where understanding intricate dependencies between words is crucial, and in computer vision applications, transformers have shown remarkable performance in image captioning and visual question answering tasks [22].
However, when applied to real-time inference requirements, such as autonomous driving systems, transformer-based architectures face challenges due to their computational expense [8]. Future research could explore acceleration techniques and optimization strategies tailored specifically for transformers to address this issue [8].
In light of these challenges, an emerging trend in deep learning focuses on hybrid architectures that integrate Convolutional Neural Networks (CNNs) with Transformer-based components. These models leverage the strengths of both approaches to better address these challenges. For instance, SETR [26] combines a 3D CNN for feature extraction with a Transformer to handle the temporal dimension. Similarly, Swin-Transformer [27] employs a hierarchical windows strategy where local windows are first processed by CNNs before being processed by Multihead Self-Attention. Mesh-CNN [28], on the other hand, uses a Transformer attention module to model mesh connectivity while applying CNNs for spectral features. The hybridization of these architectures demonstrates potential synergies in various applications, including image and video processing, as well as point cloud analysis ([26], [27], [28]). By overcoming the limitations of each architecture, these hybrid models could potentially improve performance in various deep learning pipelines.
### Open Challenges
Transformer-based Architectures and Their Challenges, Solutions, and Emerging Trends
Transformers have significantly impacted the realm of deep learning by facilitating effective processing of long sequences and recognizing intricate relationships among elements [21]. They achieve this by utilizing the self-attention mechanism, which allows models to focus on semantically important components without being impeded by input length [21]. This versatility makes them particularly valuable for natural language processing tasks where understanding complex dependencies between words is vital, and in computer vision applications, transformers have shown exceptional performance in image captioning and visual question answering tasks [22].
However, when applied to real-time inference scenarios like autonomous driving systems, transformer-based architectures encounter difficulties due to their computational rigor [8]. To alleviate this concern, future research could concentrate on developing acceleration techniques and optimization strategies explicitly designed for transformers [8] as a potential solution.
In response to these challenges, an evolving paradigm in deep learning focuses on hybrid architectures that amalgamate Convolutional Neural Networks (CNNs) with Transformer-based components [23]. These models capitalize on the merits of both approaches to tackle these challenges more effectively. For instance, SETR [26] blends a 3D CNN for feature extraction with a Transformer to handle the temporal dimension. Similarly, Swin-Transformer [27] employs a hierarchical windows strategy where local windows are first processed by CNNs before being processed by Multihead Self-Attention. Mesh-CNN [28], on the other hand, incorporates a Transformer attention module to model mesh connectivity while applying CNNs for spectral features. By embracing such hybridizations, these architectures showcase promising synergies in various applications, including image and video processing, as well as point cloud analysis ([26], [27], [28]). By tackling the limitations of each architecture, these hybrid models could potentially boost performance in various deep learning pipelines.
## References
[1] On Reversible Transducers
[2] High-Performance Tensor Contraction without Transposition
[3] A GPU-Accelerated Barycentric Lagrange Treecode
[4] Avoiding Synchronization in First-Order Methods for Sparse Convex Optimization
[5] Larger-Scale Transformers for Multilingual Masked Language Modeling
[6] Efficient Quantized Sparse Matrix Operations on Tensor Cores
[7] From open learners to open games
[8] Training Strategies for Vision Transformers for Object Detection
[9] RoBERTurk Adjusting RoBERTa for Turkish
[10] Ambigram Generation by A Diffusion Model
[11] BERT Pre-training of Deep Bidirectional Transformers for Language Understanding
[12] Microsoft's Submission to the WMT2018 News Translation Task How I Learned to Stop Worrying and Love the Data
[13] GLUE A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
[14] PaLM Scaling Language Modeling with Pathways
[15] Transformer Quality in Linear Time
[16] A Study on Encodings for Neural Architecture Search
[17] An Information-theoretic Visual Analysis Framework for Convolutional Neural Networks
[18] DATA SEARCH'18 -- Searching Data on the Web
[19] Convolutional Neural Networks In Convolution
[20] When was that made
[21] Attention Is All You Need
[22] A Transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics
[23] Language Models are Few-Shot Learners
[24] A Tensor-based Convolutional Neural Network for Small Dataset Classification
[25] Towards a better understanding of testing if conditionals
[26] Convolutional State Space Models for Long-Range Spatiotemporal Modeling
[27] Swin Transformer Hierarchical Vision Transformer using Shifted Windows
[28] MeshCNN A Network with an Edge