NPO法人AI開発推進協会

[Memo] A Summary of Research on LLM Tuning

Posted at 2024-05-17

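Introduction

When using LLMs (mainly open-source ones), tuning a pre-trained model to fit the purpose and domain is important.

To study how tuning data is built and what effect it has, I read several papers and am posting my notes here as a memo.
I mainly looked at research targeting Chinese.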

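Research on how tuning data affects the model
  Mostly insights on what to consider when creating tuning data
Research on tuning to adapt LLMs to specific tasks or domains
  Mostly concrete methods for creating tuning data

Research on how tuning data affects the model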

Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral

Abstract

2024/3/4
Mixtral, a representative sparse mixture of experts (SMoE) language model, has received significant attention due to its unique model design and superior performance. Based on Mixtral-8x7B-v0.1, in this paper, we propose Chinese-Mixtral and Chinese-Mixtral-Instruct with improved Chinese language abilities by adopting further pre-training and instruction fine-tuning. Experimental results show that our Chinese-Mixtral and Chinese-Mixtral-Instruct successfully improve Chinese understanding and generation performance while retaining the original English abilities. Then, we discuss several key questions when performing language adaptation on large language models, including the necessity of extending the language-specific vocabulary and the choice of the initialization model (foundation model v.s. instruction model), by providing empirical results and analysis. We also present the visualizations of each expert to examine their importance on downstream tasks. Our resources are publicly available through https://github.com/ymcui/Chinese-Mixtral.

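Objective

Adapt Mixtral to Chinese and use the process to discuss several tuning questions, such as the choice of the initial model and whether vocabulary extension is needed
Verify the effect of tuning through a range of experiments, including automated benchmarks and human evaluation

Tuning data design and creation

20 GB of general Chinese corpus for further pre-training
5M Chinese instruction examples for instruction tuning

Tuning strategy

Instruction tuning / QLoRA
Further pre-train on the large corpus, then perform instruction tuning
Use the Mixtral tokenizer as is, without vocabulary extension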

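Tuning results

High performance on a variety of benchmarks; details omitted here
On the need for vocabulary extension
- They also experimented with a tokenizer extended by adding Chinese tokens to the Mixtral tokenizer (the share of Chinese tokens in the vocabulary rose from 4.6% to 44.6%)
- Encoding efficiency improved substantially (a reduction of roughly 32-44%)
- However, performance dropped on both Chinese and English tasks → suggests that vocabulary extension is not necessarily required when adapting an English-based LLM to another language
On the choice of the initial model
- Should the pre-trained (foundation) model or the instruction-tuned model be used as the starting point?
- Both Mixtral models were tuned under the same conditions
- Starting from the pre-trained model gave higher performance
On context size
- Using PPL as the metric, they examined how large a context Mixtral (and the Chinese-Mixtral built here) can support
  - According to the original paper, Mixtral supports a 32K context
- The best PPL was at 48K rather than 32K, and the models still showed fairly good PPL beyond 48K (even at 128K)
- Suggests that the Mixtral series generalizes well to long contexts and may not need additional long-context tuning to support longer contexts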
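
To make the two quantities above more concrete, here is a minimal sketch in Python. It assumes that the 32-44% figure refers to the number of tokens needed to encode the same text, and that PPL is the usual exponential of the mean negative log-likelihood per token. The tokenizer is a toy longest-match tokenizer with byte fallback, not the Mixtral tokenizer, and the vocabulary and text are made up for illustration.

```python
import math


def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer with UTF-8 byte fallback (not the Mixtral tokenizer)."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Unknown character: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def token_reduction(n_before, n_after):
    """Relative reduction in token count (0.4 means 40% fewer tokens)."""
    return 1.0 - n_after / n_before


def perplexity(token_logprobs):
    """PPL = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: no Chinese tokens vs. a few added Chinese tokens.
text = "大模型微调"
base_tokens = greedy_tokenize(text, vocab=set())
extended_tokens = greedy_tokenize(text, vocab={"大", "模型", "微调"})
reduction = token_reduction(len(base_tokens), len(extended_tokens))

# Hypothetical log-probabilities: every token predicted with probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
```

The toy example exaggerates the gain, since a real tokenizer already covers some Chinese characters. The point is only that shorter sequences do not by themselves imply better downstream performance, which is what the paper reports.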

An Empirical Study of Instruction-tuning Large Language Models in Chinese

Abstract

2023/10/20
The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community’s interest in instruction-tuning, which is deemed to accelerate ChatGPT’s replication process. However, research on instruction-tuning LLMs in Chinese, the world’s most spoken language, is still in its early stages. Therefore, this paper makes an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook that provides valuable findings for effectively customizing LLMs that can better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, instruction data types, which are the three most important elements for instruction-tuning. Besides, we also conduct experiment to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to the open Chinese version of ChatGPT. This paper will release a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.

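Objective

Examine LLMs instruction-tuned in Chinese and provide findings for customizing them effectively
Answer questions such as: 1. "Which open-source LLM is better suited as the base for Chinese instruction tuning?" 2. "How do parameter-efficient methods other than LoRA affect LLMs?" 3. "How do different types of instruction datasets affect the result?"

Tuning data design and creation

Uses 12 major Chinese instruction datasets
Four were created by LLMs, five were created or collected by humans, and three were created jointly

Tuning strategy

Instruction tuning / LoRA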

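Tuning results (focusing on the effect of tuning data)

Belle-eval and MMCU are used as benchmarks
On Belle-eval, datasets of "diverse instructions" built by LLMs (especially OpenAI models) gave good results in most cases
- Belle did particularly well. The reasons given are **the large amount of instruction data** and **the diversity of the instructions**
- Alpaca-GPT4 was the second best. The reason given is **the high quality of the instruction data**
- However, performance drops markedly on tasks that require strong reading comprehension, such as Close QA and Extract
On MMCU, instruction tuning improved performance with every dataset. The gain in accuracy was largest for COIG-exam4, whose task format resembles MMCU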

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

Abstract

2023/5/4
This project focuses on enhancing open-source large language models through instruction-tuning and providing comprehensive evaluations of their performance. We explore how various training data factors, such as quantity, quality, and linguistic distribution, influence the performance of instruction-tuned models trained on publicly accessible high-quality instruction datasets for both English and Chinese languages. Our goal is to supplement evaluation with quantitative analyses, providing valuable insights for the continued advancement of open-source chat models. Our model, data, and code are publicly available for others to use and build upon.

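Objective

Instruction-tune open-source LLMs and provide a comprehensive evaluation of their performance
Covers English and Chinese

Tuning data design and creation

Uses six open-source Chinese datasets
A fixed prompt template is used across all datasets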
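
As an illustration of what a single fixed prompt template across datasets can look like, here is a minimal sketch. The template text, the field names (`instruction`, `input`, `output`), and the sample records are all hypothetical; the notes above do not record the template Panda LLM actually uses.

```python
# Hypothetical template; the actual Panda LLM template is not given in these notes.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def format_example(example):
    """Render one record into a prompt/completion pair using the shared template."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=example["instruction"].strip(),
        input=example.get("input", "").strip(),  # some datasets have no input field
    )
    return {"prompt": prompt, "completion": example["output"].strip()}


# Records from different source datasets go through the same function.
records = [
    {"instruction": "Translate to Chinese.", "input": "Good morning", "output": "早上好"},
    {"instruction": "Name one benchmark used in the paper.", "output": "LogiQA-v2"},
]
formatted = [format_example(r) for r in records]
```

Routing every dataset through one function keeps the prompt format identical at training and inference time, which is the practical benefit of fixing the template.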

Tuning strategy

Instruction tuning / full-parameter tuning
Training in two stages
1. Tune the model on five of the datasets
2. Tune on the one dataset that is rich in instruction data

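Tuning results

Three benchmarks are used: LogiQA-v2, C3-Mixed, and C3-Dialogue
- Performance improved substantially on all of them
Performance improved substantially when the final instruction-tuning stage used the dataset rich in instruction data
- That dataset accounts for **only 4.2% of the training samples**
- The main factor in reaching high performance on reasoning tasks is tuning across a variety of domains
- However, mixing data indiscriminately does not improve performance. The results suggest that instruction-tuning selectively on part of the data may lead to a better model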

Research on tuning to adapt LLMs to specific tasks or domains

CBT-LLM: A Chinese Large Language Model for Cognitive Behavioral Therapy-based Mental Health Question Answering

Abstract

2024/3/24
The recent advancements in artificial intelligence highlight the potential of language models in psychological health support. While models trained on data from mental health service platform have achieved preliminary success, challenges persist in areas such as data scarcity, quality, and ensuring a solid foundation in psychological techniques. To address these challenges, this study introduces a novel approach to enhance the precision and efficacy of psychological support through large language models. Specifically, we design a specific prompt derived from principles of Cognitive Behavioral Therapy (CBT) and have generated the CBT QA dataset, specifically for Chinese psychological health Q&A based on CBT structured intervention strategies. Unlike previous methods, our dataset emphasizes professional and structured response. Utilizing this dataset, we fine-tuned the large language model, giving birth to CBT-LLM, the large-scale language model specifically designed for Cognitive Behavioral Therapy techniques. Empirical evaluations demonstrate that CBT-LLM excels in generating structured, professional, and highly relevant responses in psychological health support tasks, showcasing its practicality and quality. The model is available on Hugging Face: https://huggingface.co/Hongbin37/CBT-LLM.

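Objective

Apply large language models to mental health support
Produce professional, structured answers by following the principles of cognitive behavioral therapy (CBT)
Targets Chinese-language mental health QA

Tuning data design and creation

Built a QA dataset by converting PsyQA, a Chinese mental health question-answering dataset, into the CBT prompt format with gpt-3.5-turbo-16k
To keep the gpt-3.5 outputs consistent, high-quality examples generated with gpt-4 were embedded in the prompt as few-shot examples
Assumes a single-turn dialogue format. The response mechanism consists of five elements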
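
The few-shot setup described above can be sketched as follows. This is only an assumed prompt layout: the instruction text, the `Example`/`Question`/`Answer` labels, and the placeholder examples are hypothetical, and the five elements of the CBT response structure are not reproduced here because these notes do not list them.

```python
def build_fewshot_prompt(instruction, examples, question):
    """Assemble instruction + few-shot (question, answer) pairs + the new question."""
    parts = [instruction.strip(), ""]
    for i, (q, a) in enumerate(examples, start=1):
        parts.append(f"Example {i}")
        parts.append(f"Question: {q}")
        parts.append(f"Answer: {a}")
        parts.append("")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


# Placeholder content; in the paper the examples are high-quality gpt-4 outputs.
cbt_instruction = "Rewrite the answer so that it follows the CBT-based structure."
fewshot_examples = [
    ("<question 1 from PsyQA>", "<gpt-4 answer 1 in CBT format>"),
    ("<question 2 from PsyQA>", "<gpt-4 answer 2 in CBT format>"),
]
prompt = build_fewshot_prompt(cbt_instruction, fewshot_examples, "<new PsyQA question>")
```

The resulting string would then be sent to the generation model (gpt-3.5-turbo-16k in the paper) once per PsyQA question.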

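Tuning strategy

Instruction tuning / LoRA

Tuning results

Automatic evaluation with five metrics: BLEU, METEOR, CHRF, BLEURT, and BERTSCORE
Compared with three models (LLama-Chinese-7B, Alpaca-Chinese-7B, Qwen-7B), scores improved on most metrics
Human evaluation from three perspectives: relevance, CBT structure, and helpfulness
Compared with Alpaca-Chinese-7B, scores improved on all three perspectives. In particular, the responses were confirmed to fit the CBT framework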

TAIWAN-LLM: Bridging the Linguistic Divide with a Culturally Aligned Language Model

Abstract

2023/11/29
In the realm of language models, the nuanced linguistic and cultural intricacies of Traditional Chinese, as spoken in Taiwan, have been largely overlooked. This paper introduces TAIWAN-LLM, a pioneering Large Language Model that specifically caters to the Traditional Chinese language, with a focus on the variant used in Taiwan. Leveraging a comprehensive pretraining corpus and instruction-finetuning datasets, we have developed a model that not only understands the complexities of Traditional Chinese but also embodies the cultural context of Taiwan. TAIWAN-LLM represents the first of its kind, a model that is not only linguistically accurate but also culturally resonant with its user base. Our evaluations demonstrate that TAIWAN-LLM achieves superior performance in understanding and generating Traditional Chinese text, outperforming existing models that are predominantly trained on Simplified Chinese or English. The open-source release of TAIWAN-LLM invites collaboration and further innovation, ensuring that the linguistic diversity of Chinese speakers is embraced and well-served. The model, datasets, and further resources are made publicly available to foster ongoing research and development in this field.

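Objective

Build an LLM specialized in Traditional Chinese as used in Taiwan

Tuning data design and creation

Corpus for cPT
- A comprehensive collection of documents in Traditional Chinese as used in Taiwan, drawn from a variety of sources (social media, news, books, and so on)
- Filtered efficiently by checking toxicity at the website level rather than the document level
Datasets for SFT
- Various instruction-tuning datasets (including Alpaca and Dolly) translated with gpt-3.5-turbo, with responses generated by the same model
- An original conversational dataset written by the authors to capture the nuances of Taiwanese culture and language
Dataset for Feedback SFT
- 20,000 pieces of user feedback collected from a dedicated platform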
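
The website-level filtering mentioned for the cPT corpus can be sketched like this. It is a simplified illustration, not the authors' pipeline: the document schema (`url`, `text`), the `is_site_toxic` callback, and the blocklist are assumptions made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def filter_by_site(docs, is_site_toxic):
    """Keep or drop documents per website, calling the toxicity check once per site."""
    by_site = defaultdict(list)
    for doc in docs:
        by_site[urlparse(doc["url"]).netloc.lower()].append(doc)

    kept = []
    for site, site_docs in by_site.items():
        # One decision per website instead of one per document.
        if not is_site_toxic(site, site_docs):
            kept.extend(site_docs)
    return kept


# Hypothetical data and a hypothetical blocklist-based check.
blocked_sites = {"spam.example.com"}
docs = [
    {"url": "https://news.example.tw/a", "text": "..."},
    {"url": "https://spam.example.com/1", "text": "..."},
    {"url": "https://news.example.tw/b", "text": "..."},
    {"url": "https://spam.example.com/2", "text": "..."},
]
clean_docs = filter_by_site(docs, lambda site, _docs: site in blocked_sites)
```

The efficiency gain comes from the number of checks scaling with the number of websites rather than the number of documents; the trade-off is that good documents on a flagged site are dropped with the rest.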

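Tuning strategy

Continue-Pretraining (cPT) / Supervised Fine-Tuning (SFT) / Feedback Supervised Fine-Tuning (Feedback SFT)

Tuning results

Uses a benchmark called TC-Eval
- Average score is comparable to gpt-3.5-turbo
Effect of cPT
- In the ablation study, removing cPT led to a marked drop in performance on most tasks
- Suggests that cPT is important for adapting the model to the linguistic characteristics of Traditional Chinese
Effect of Feedback SFT
- Contributes somewhat to performance, but its impact is smaller than that of cPT
- Suggests that Feedback SFT helps fine-tune the model to user preferences, while the model's core capabilities are established mainly in the cPT and SFT stages
Effect of data quality
- Performance dropped unexpectedly when about 9 billion tokens of CommonCrawl data were added
- Suggests that introducing web-crawl data, which can mix high- and low-quality content, may impair the model's ability to process and generate Traditional Chinese text accurately
- Training data must be curated with great care so that it matches the linguistic and cultural nuances of the target language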

DISC-FinLLM: A Chinese Financial Large Language Model based on Multiple Experts Fine-tuning

Abstract

2023/10/25
We propose Multiple Experts Fine-tuning Framework to build a financial large language model (LLM), DISC-FinLLM. Our methodology improves general LLMs by endowing them with multi-turn question answering abilities, domain text processing capabilities, mathematical computation skills, and retrieval-enhanced generation capabilities. We build a financial instruction-tuning dataset named DISC-FINSFT, including instruction samples of four categories (consulting, NLP tasks, computing and retrieval-augmented generation). Evaluations conducted on multiple benchmarks demonstrate that our model performs better than baseline models in various financial scenarios. Further resources can be found at https://github.com/FudanDISC/DISC-FinLLM.

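Objective

Develop a Chinese financial LLM
Covers four categories: financial consulting (multi-turn dialogue), financial NLP tasks, financial computing, and retrieval-based question answering

Tuning data design and creation

Financial consulting
- Uses FiQA, a financial QA dataset
  - It is written in English, so it was translated into Chinese; the answers were of low quality, so they were regenerated with ChatGPT
- Collected financial terminology and generated a QA dataset with ChatGPT
- Collected posts from financial investment forums and generated a QA dataset with ChatGPT, based on the Self-chat prompting method (Xu et al., 2023)
Financial NLP tasks
- Collected more than 10 public Chinese financial NLP datasets
  - Tasks include sentiment analysis, information extraction, text classification, and text generation
  - Wrote and applied more than 20 prompt templates, covering each dataset in both zero-shot and few-shot versions
- Collected 87,000 passages from financial news and industry research reports
  - Used ChatGPT to obtain (passage, question, answer) triples and built the dataset with templates for various reading-comprehension tasks
Financial computing
- Built a task pool of manually written financial calculation problems based on financial exams, arithmetic problems with context from financial research reports, and general math problems
  - Assumes four calculation tools, and inserts commands that call them at the appropriate places in the answers
  - Used ChatGPT to generate more than 50,000 new pairs of calculation problems and answers (via self-instruction and few-shot CoT prompting based on seed tasks)
Retrieval-based question answering
- Built a financial knowledge base of 18,000 research report summaries and 69,000 financial news articles
  - With ChatGPT, 1. generated questions, 2. retrieved reference documents, and 3. generated answers to build the dataset
- Irrelevant documents were introduced at random to strengthen the model's ability to identify and exclude inappropriate text
  - If the documents used to generate a question were not retrieved by the system, those documents were incorporated at random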
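
The last two points of the retrieval category (adding irrelevant documents, and re-inserting the source document when retrieval misses it) can be sketched as below. This is my reading of the description, not the authors' code: the function name, the parameters, and the choice of inserting at a random position are all assumptions.

```python
import random


def build_references(retrieved, source_doc, distractor_pool, n_distractors=1, rng=None):
    """Build the reference list shown to the model for one generated question."""
    rng = rng or random.Random()
    refs = list(retrieved)

    # If retrieval missed the document the question was generated from,
    # put it back at a random position.
    if source_doc not in refs:
        refs.insert(rng.randint(0, len(refs)), source_doc)

    # Mix in irrelevant documents so the model has to learn to ignore them.
    candidates = [d for d in distractor_pool if d not in refs]
    for doc in rng.sample(candidates, min(n_distractors, len(candidates))):
        refs.insert(rng.randint(0, len(refs)), doc)
    return refs


# Hypothetical document ids.
refs = build_references(
    retrieved=["doc_b", "doc_c"],
    source_doc="doc_a",
    distractor_pool=["doc_x", "doc_y", "doc_b"],
    n_distractors=1,
    rng=random.Random(0),
)
```

Passing a seeded `random.Random` keeps dataset construction reproducible, which matters when the same references are later paired with generated answers.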

Tuning strategy

Instruction tuning / LoRA
A separate LoRA is trained for each of the four categories above, so the model can switch to whichever one is needed
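
To show what switching between per-category LoRA modules on one base model means, here is a tiny pure-Python sketch of the standard LoRA forward pass, h = W0·x + (alpha / r)·B·A·x, for a single linear layer. The matrices, the `alpha` value, and the category keys are toy values chosen for illustration, not anything from DISC-FinLLM.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRAAdapter:
    """Low-rank update for one linear layer: delta(x) = (alpha / r) * B @ (A @ x)."""

    def __init__(self, A, B, alpha):
        self.A = A  # shape r x d_in
        self.B = B  # shape d_out x r
        self.scale = alpha / len(A)  # r is the number of rows of A

    def delta(self, x):
        return [self.scale * y for y in matvec(self.B, matvec(self.A, x))]


def forward(W0, x, adapter=None):
    """Frozen base weights W0 plus the optional adapter for the current category."""
    base = matvec(W0, x)
    if adapter is None:
        return base
    return [b + d for b, d in zip(base, adapter.delta(x))]


# Toy values: one shared base layer, one adapter per category (only two shown).
W0 = [[1, 0], [0, 1]]
adapters = {
    "computing": LoRAAdapter(A=[[1, 0]], B=[[0], [1]], alpha=2),
    "retrieval": LoRAAdapter(A=[[0, 1]], B=[[1], [0]], alpha=2),
}
x = [1, 2]
out_base = forward(W0, x)
out_computing = forward(W0, x, adapters["computing"])
out_retrieval = forward(W0, x, adapters["retrieval"])
```

Because `W0` is never modified, swapping the adapter changes the behavior per category while the base weights stay shared, which is what makes the per-category design cheap to store and serve.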

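Tuning results

Financial NLP tasks
- Uses the FinCUGE benchmark
- For every model, average performance improved over the base model (by 2 to 9 points)
- Also suggests improved generalization in the financial domain
Human tests
- Uses the FinEval benchmark
- Average performance exceeds all evaluated models except ChatGPT and GPT-4
- In the ablation study, accuracy dropped with full-parameter training → underlines the need for LoRA fine-tuning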

ShennongGPT: A Tuning Chinese LLM for Medication Guidance

Abstract

2023/9
The burgeoning field of large language models (LLMs) holds tremendous promise for healthcare, particularly in the realms of medication guidance and adverse drug reaction prediction. Despite this, extant LLMs grapple with managing intricate polypharmacy scenarios. To address these limitations, we introduce ShennongGPT, a cutting-edge LLM, expressly tailored for robust medication guidance and adverse drug reaction forecasting. Our model employs a novel two-stage training strategy: initial learning from distilled drug databases for foundational knowledge on drug interactions, followed by simulation of human-like decision-making processes through the use of real-world patient data, enhancing the relevance and applicability of its guidance. This two-fold approach empowers ShennongGPT to excel in predicting potential adverse drug reactions and offering personalised medication advice, thereby significantly enhancing medication safety and the overall quality of healthcare services. Rigorous evaluations by healthcare professionals and AI experts highlight the superiority of ShennongGPT, which outperforms existing general and specialty LLMs.

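Objective

Achieve accurate medication guidance and prediction of adverse drug reactions with an LLM

Tuning data design and creation

Two public medical datasets (both generated with ChatGPT)
Real-world data such as various medical cases and medical literature

Tuning strategy

First pre-train on the medical datasets, then tune on the real-world data so that the model generates responses closer to human communication
The specific tuning method, datasets, and prompt format could not be determined from the paper

Tuning results

Evaluated from five perspectives, both by GPT-4 and by humans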

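Conclusion

This time I surveyed research on the development and effects of tuning data, focusing on work in Chinese.
There were many new insights, such as cases where vocabulary extension backfires, and I found it interesting.

I will add more papers here as I read them.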
