
Mars Flag Advent Calendar 2023

@s-nagasein

MARS FLAG Corporation, SP Department

An Overview of the Benchmarks Used to Evaluate Gemini's Performance

Posted at 2023-12-22

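Introduction

I've put together an overview of the benchmarks used to evaluate the performance of Gemini, the model released by Google. I hope this summary helps with the following:

・Understanding what today's AI can do, and how well.
・Understanding what today's AI is still weak at.
・Being able to compare a new large-scale AI model against existing ones when it appears.
・Choosing among several AI models according to the use case, since each model has its own strengths and weaknesses.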

Text

General

MMLU

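MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating language models, proposed in 2020 by Dan Hendrycks (Center for AI Safety, CAIS) and colleagues. It covers 57 subjects, including elementary mathematics, US history, computer science, and law, and every problem is posed as a multiple-choice task.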
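
Because every MMLU problem is multiple-choice, scoring reduces to counting matches per subject. The sketch below is only an illustration: the records are made up, the function names are hypothetical, and averaging per-subject accuracies with equal weight is an assumption rather than the official evaluation procedure. MMLU questions have four answer options, so random guessing gives about 25%.

```python
from collections import defaultdict


def per_subject_accuracy(records):
    """records: iterable of (subject, predicted_choice, correct_choice)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, predicted, correct in records:
        totals[subject] += 1
        hits[subject] += int(predicted == correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}


def macro_average(per_subject):
    """Unweighted mean over subjects (an assumption, see the lead-in)."""
    return sum(per_subject.values()) / len(per_subject)


# Made-up toy records, not real MMLU data.
records = [
    ("elementary_mathematics", "B", "B"),
    ("elementary_mathematics", "C", "A"),
    ("us_history", "D", "D"),
    ("us_history", "A", "A"),
]

RANDOM_CHANCE = 1 / 4  # four answer options per question

acc = per_subject_accuracy(records)
print(acc)
print("macro average:", macro_average(acc), "random chance:", RANDOM_CHANCE)
```

Reporting per-subject numbers alongside the average makes the "lopsided performance" mentioned in the abstract visible instead of hiding it in a single score.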

Measuring Massive Multitask Language Understanding

Abstract:
We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

Reasoning

Big-Bench Hard(BBH)

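BIG-bench (Beyond the Imitation Game benchmark) is a benchmark for evaluating language models, proposed in 2022 by Aarohi Srivastava and colleagues. It consists of 204 tasks contributed by 450 authors across 132 institutions. The problems are drawn from linguistics, childhood development, mathematics, common-sense reasoning, biology, physics, social bias, software development, and other areas.

BIG-Bench Hard (BBH) is a subset of 23 difficult tasks from BIG-bench. These are tasks on which earlier language models did not outperform the average human rater, and they require multi-step reasoning.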

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.

Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

DROP

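DROP (Discrete Reasoning Over the content of Paragraphs) is a reading comprehension benchmark that requires reasoning, proposed in 2019 by Dua et al. It contains 55,000 questions and evaluates the ability to reason from a more comprehensive understanding of a paragraph's content, for example by adding, counting, or sorting values mentioned in it. At the time of the paper, state-of-the-art (SOTA) methods reached only 38.4% F1, far below the 96% F1 of expert humans, which shows how weak language models were at this kind of reasoning. (For comparison, Gemini Ultra's F1 score on DROP is 82.4.)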
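
To give a feel for what an F1 score over answers means, here is a simplified token-overlap F1. It is a sketch under assumptions: the official DROP metric also normalizes answers and handles numbers and multi-span answers, none of which is modeled here, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("10 yards", "10 yards"))  # exact match
print(token_f1("10 yards", "10"))        # partially correct
print(token_f1("7", "10"))               # wrong number
```

Unlike exact match, this kind of score gives partial credit when the prediction contains the right answer plus extra tokens.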

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs.

Abstract:
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 55k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs, as they remove the paraphrase-and-entity-typing shortcuts available in prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literatures on this dataset and show that the best systems only achieve 38.4% F1 on our generalized accuracy metric, while expert human performance is 96%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

HellaSwag

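HellaSwag is a benchmark for commonsense natural language inference, proposed in 2019 by Rowan Zellers (a researcher at OpenAI) and colleagues. Its questions are trivial for humans (over 95% accuracy), yet the state-of-the-art models of the time struggled with them (under 48% accuracy).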

Hellaswag: Can a machine really finish your sentence?

Abstract:
Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference?
In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models.
Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Math

GSM8K

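GSM8K is a benchmark for multi-step mathematical reasoning, proposed in 2021 by Karl Cobbe (a research scientist at OpenAI) and colleagues. It uses a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. According to the authors, robust multi-step mathematical reasoning was a weakness of the models of the time.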

Training verifiers to solve math word problems

Abstract:
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
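
The selection step described in the abstract (generate many candidate solutions, keep the one the verifier ranks highest) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_best`, `toy_verifier`, and the scores are hypothetical, and a real verifier would be a trained model rather than a lookup table.

```python
from typing import Callable, Sequence


def select_best(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate solution that the verifier scores highest."""
    return max(candidates, key=verifier)


# Hypothetical stand-in for a trained verifier: made-up scores for three
# candidate solutions to "3 bags of 4 apples, 2 eaten, how many are left?".
TOY_SCORES = {
    "3 * 4 = 12, 12 - 2 = 10, answer: 10": 0.91,
    "3 + 4 = 7, 7 - 2 = 5, answer: 5": 0.12,
    "3 * 4 = 12, 12 + 2 = 14, answer: 14": 0.27,
}


def toy_verifier(solution: str) -> float:
    """Look up the made-up score; unknown solutions score 0."""
    return TOY_SCORES.get(solution, 0.0)


best = select_best(list(TOY_SCORES), toy_verifier)
print(best)
```

The generator and the verifier are kept separate on purpose: the quality of the final answer then depends on how well the verifier ranks candidates, not only on the generator's single best guess.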

MATH

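MATH is a benchmark proposed in 2021 by Dan Hendrycks (director of the Center for AI Safety, CAIS) and colleagues. It uses a dataset of 12,500 challenging competition mathematics problems, each with a full step-by-step solution.
According to the authors, accuracy remains relatively low even with enormous Transformer models, and if scaling trends continue, simply increasing the parameter count will be impractical as a route to strong mathematical reasoning.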

Measuring mathematical problem solving with the MATH dataset.

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

Code

HumanEval

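HumanEval is a benchmark proposed in 2021 by Mark Chen (a research scientist at OpenAI) and colleagues. It measures the functional correctness of programs synthesized from docstrings.
The authors' model, Codex (a GPT language model fine-tuned on code from GitHub), solved 28.8% of the problems, while GPT-3 solved 0% and GPT-J solved 11.4%. With repeated sampling from the model (100 samples per problem), the authors' model solved 70.2% of the problems.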
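
Results with repeated sampling are usually reported as pass@k: the probability that at least one of k sampled programs passes the unit tests. The sketch below follows, to the best of my knowledge, the unbiased estimator given in the Codex paper (generate n samples, count the c that pass, then estimate pass@k); the function name and the example numbers are illustrative, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which pass the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random subset of k samples contains no passing sample.
    """
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is then the mean of this per-problem estimate over all problems. Larger k gives higher scores by construction, so pass@1 and pass@100 figures should not be compared directly.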

Evaluating large language models trained on code

Abstract:
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

MULTIMODAL

Image Understanding

MMMU

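MMMU is a benchmark for college-level knowledge and reasoning, devised in 2023 by Xiang Yue (Ohio State University) and colleagues. It consists of 11.5K multimodal questions meticulously collected from college exams, quizzes, and textbooks. It appears to be useful as a benchmark for tracking progress toward "Expert AGI", which is defined as Level 3 of artificial general intelligence (AGI).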

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

Abstract:
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types, such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU focuses on advanced perception and reasoning with domain-specific knowledge, challenging models to perform tasks akin to those faced by experts. Our evaluation of 14 open-source LMMs and the proprietary GPT-4V(ision) highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V only achieves a 56% accuracy, indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence.

VQAv2

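VQAv2 is an image understanding benchmark, devised in 2017 by Yash Goyal (Virginia Tech) and colleagues. Each question in the dataset is associated with a pair of similar images that lead to two different answers, so a model cannot answer correctly from the question text alone. In that sense the benchmark makes the "V" in Visual Question Answering (VQA) matter.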
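
Why pairing similar images with different answers penalizes models that ignore the image can be shown with a toy example. Everything here is made up for illustration: the image identifiers, the question, and the "blind" model that answers from a language prior only. Scoring by exact match is also a simplification and not the official VQA metric.

```python
def accuracy(model, examples):
    """Fraction of (image, question, answer) triples answered correctly."""
    correct = sum(model(image, question) == answer
                  for image, question, answer in examples)
    return correct / len(examples)


def blind_model(image, question):
    """Ignores the image and returns the most likely answer per question."""
    language_prior = {"Is the umbrella open?": "yes"}
    return language_prior.get(question, "unknown")


# A complementary pair: same question, similar images, different answers.
balanced_pair = [
    ("image_a", "Is the umbrella open?", "yes"),
    ("image_b", "Is the umbrella open?", "no"),
]

print(accuracy(blind_model, balanced_pair))
```

A model that never looks at the image returns the same answer for both images, so it can get at most one of the two right. Only a model that actually uses the image can answer both.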

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Abstract:
Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.
We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at this http URL as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0).
We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners.
Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.

TextVQA

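TextVQA is a benchmark, devised in 2019 by Amanpreet Singh (Facebook AI Research) and colleagues, that measures the ability to read text in an image and reason about it. VQA models of the time had difficulty reading text in an image and reasoning about it to answer, and the authors suggest TextVQA as a benchmark complementary to VQAv2.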

Towards VQA models that can read

Abstract:
Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

DocVQA

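DocVQA is a benchmark for understanding document images, devised in 2021 by Minesh Mathew (IIIT Hyderabad) and colleagues. It requires visually understanding the information in document images that contain figures, diagrams, infographics, and similar elements.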

DocVQA: A Dataset for VQA on Document Images

Abstract:
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding structure of the document is crucial. The dataset, code and leaderboard are available at this http URL

Infographic VQA

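Infographic VQA is a benchmark for understanding infographic images, devised in 2022 by Minesh Mathew (IIIT Hyderabad) and colleagues. Infographics are documents designed to communicate information effectively using a combination of textual, graphical, and visual elements.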

InfographicVQA

Abstract:
Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements. In this work, we explore the automatic understanding of infographic images by using Visual Question Answering technique. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language questions and answers annotations. The collected questions require methods to jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with emphasis on questions that require elementary reasoning and basic arithmetic skills. Finally, we evaluate two strong baselines based on state of the art multi-modal VQA models, and establish baseline performance for the new task. The dataset, code and leaderboard will be made available at this http URL

MathVista

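MathVista is a benchmark for mathematical reasoning in visual contexts, devised in 2023 by Pan Lu (UCLA Computer Science Department) and colleagues. Completing MathVista tasks requires fine-grained, deep visual understanding and compositional reasoning, all of which the authors say state-of-the-art foundation models find challenging.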

MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models

Abstract:
Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains, but their ability in mathematical reasoning in visual contexts has not been systematically studied. To bridge this gap, we present MathVista, a benchmark designed to combine challenges from diverse mathematical and visual tasks. It consists of 6,141 examples, derived from 28 existing multimodal datasets involving mathematics and 3 newly created datasets (i.e., IQTest, FunctionQA, and PaperQA). Completing these tasks requires fine-grained, deep visual understanding and compositional reasoning, which all state-of-the-art foundation models find challenging. With MathVista, we have conducted a comprehensive, quantitative evaluation of 12 prominent foundation models. The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%. Our in-depth analysis reveals that the superiority of GPT-4V is mainly attributed to its enhanced visual perception and mathematical reasoning. However, GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning. This significant gap underscores the critical role that MathVista will play in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. We further explore the new ability of self-verification, the application of self-consistency, and the interactive chatbot capabilities of GPT-4V, highlighting its promising potential for future research. The project is available at this https URL.

Video Understanding

VATEX

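VATEX is a multilingual video description benchmark, devised in 2019 by Xin Wang (University of California, Santa Barbara) and colleagues. It contains over 41,250 videos and 825,000 captions in English and Chinese, and the authors describe it as multilingual, larger, and linguistically more complex than the widely used MSR-VTT dataset.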

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research

Abstract:
We present a new large-scale multilingual video description dataset, VATEX, which contains over 41,250 videos and 825,000 captions in both English and Chinese. Among the captions, there are over 206,000 English-Chinese parallel translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is multilingual, larger, linguistically complex, and more diverse in terms of both video and natural language descriptions. We also introduce two tasks for video-and-language research based on VATEX: (1) Multilingual Video Captioning, aimed at describing a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, to translate a source language description into the target language using the video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that, first, the unified multilingual model can not only produce both English and Chinese descriptions for a video more efficiently, but also offer improved performance over the monolingual models. Furthermore, we demonstrate that the spatiotemporal video context can be effectively utilized to align source and target languages and thus assist machine translation. In the end, we discuss the potentials of using VATEX for other video-and-language research.

Perception Test MCQA

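Perception Test MCQA is a benchmark for evaluating the perception and reasoning skills of multimodal video models, devised in 2023 by Viorica Pătrăucean (a research scientist at Google DeepMind) and colleagues. It uses 11.6k real-world videos, 23 seconds long on average, filmed by around 100 participants worldwide. There is a large performance gap between humans and state-of-the-art video QA models (91.4% vs 46.2%), which suggests significant room for improvement in multimodal video understanding.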

Perception Test: A Diagnostic Benchmark for Multimodal Video Models

Abstract:
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting that there is significant room for improvement in multimodal video understanding.
Dataset, baseline code, and challenge server are available at this https URL

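Abstract (machine translation):
We propose the Perception Test, a novel multimodal video benchmark for evaluating the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, SeViLA, or GPT-4). Compared with existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, in order to provide a comprehensive and efficient evaluation tool. The benchmark probes the transfer capabilities of pre-trained models in a zero-shot / few-shot or limited fine-tuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos with an average length of 23 seconds, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), which enables both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared with state-of-the-art video QA models show a substantial gap in performance (91.4% vs 46.2%), suggesting that there is significant room for improvement in multimodal video understanding.
Dataset, baseline code, and challenge server are available at this https URL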
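
As a rough illustration of how a multiple-choice benchmark such as Perception Test MCQA can be scored, the sketch below computes plain accuracy over option indices and the human-model gap from the figures quoted in the abstract. It assumes that scoring is simple top-1 accuracy. The function name `mcqa_accuracy` and the toy answer data are hypothetical and are not taken from the paper or its released code.

```python
from typing import Sequence


def mcqa_accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Return the fraction of questions whose predicted option matches the answer key."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not correct:
        raise ValueError("at least one question is required")
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


# Hypothetical toy data: five questions, each answered with an option index.
answer_key = [0, 2, 1, 1, 0]
model_choices = [0, 2, 2, 1, 1]
toy_accuracy = mcqa_accuracy(model_choices, answer_key)  # 3 of 5 correct

# Figures quoted in the abstract (percent accuracy).
human_accuracy = 91.4
model_accuracy = 46.2
gap_points = round(human_accuracy - model_accuracy, 1)  # gap in percentage points

print(f"toy accuracy: {toy_accuracy:.2f}")
print(f"human-model gap: {gap_points} points")
```

The gap is reported in percentage points rather than as a ratio, which matches how the abstract presents the two numbers side by side.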

Audio

CoVoST2(21 languages)

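CoVoST 2 is a speech translation (speech-to-text translation) benchmark devised in 2020 by Changhan Wang, a research engineer at Facebook AI Research, and colleagues. Compared with earlier datasets, its distinguishing feature is broad language coverage: translation from 21 languages into English and from English into 15 languages.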

CoVoST 2 and Massively Multilingual Speech-to-Text Translation

Abstract:
Speech translation has recently become an increasingly popular topic of research, partly due to the development of benchmark datasets. Nevertheless, current datasets cover a limited number of languages. With the aim to foster research in massive multilingual speech translation and speech translation for low resource language pairs, we release CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. This represents the largest open dataset available to date from total volume and language coverage perspective. Data sanity checks provide evidence about the quality of the data, which is released under CC0 license. We also provide extensive speech recognition, bilingual and multilingual machine translation and speech translation baselines with open-source implementation.

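Abstract (machine translation):
Speech translation has recently become an increasingly popular research topic, partly owing to the development of benchmark datasets. Nevertheless, current datasets cover only a limited number of languages. To foster research in massively multilingual speech translation and in speech translation for low-resource language pairs, we release CoVoST 2, a large-scale multilingual speech translation corpus covering translation from 21 languages into English and from English into 15 languages. In terms of total volume and language coverage, this is the largest open dataset available to date. Data sanity checks provide evidence of the quality of the data, which is released under the CC0 license. We also provide extensive baselines for speech recognition, bilingual and multilingual machine translation, and speech translation, with an open-source implementation.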

FLEURS(62 lang)

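FLEURS is a speech benchmark devised in 2022 by Alexis Conneau and colleagues at Google. It supports a variety of speech tasks, including automatic speech recognition (ASR), speech language identification (Speech LangID), translation, and retrieval.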

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Abstract:
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.

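Abstract (machine translation):
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages, built on top of the machine translation benchmark FLoRes-101, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation, and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models such as mSLAM. The goal of FLEURS is to enable speech technology in more languages and to catalyze research in low-resource speech understanding.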
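
ASR output of the kind FLEURS evaluates is commonly scored with word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The abstract above does not spell out a metric, so the following is only a minimal sketch under that assumption. The function name `word_error_rate` and the example sentences are hypothetical, and whitespace tokenization is a simplification that does not suit languages written without spaces.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

Lower is better for WER, and the value can exceed 1.0 when the hypothesis contains many inserted words.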

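Takeaways from summarizing the benchmarks

Summarizing the benchmarks used to evaluate Gemini showed me that there are still many tasks on which existing large language models and large multimodal models fall well short of humans.
It also made clear that researching and developing benchmarks like these is necessary work for finding and pointing out the weaknesses of existing models and moving closer to AGI.

Many more large-scale AI models will be released in the future, and when they are, I want to look at their benchmark scores and compare them with a level head.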
