83 | Tulu 3: Pushing Frontiers in Open Language Model Post-Training | not yet |
72 | A Survey on LLM-as-a-Judge | not yet |
58 | LLaVA-CoT: Let Vision Language Models Reason Step-by-Step | |
48 | From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge | not yet |
45 | Generative Agent Simulations of 1,000 People | |
42 | Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions | |
40 | FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI | |
30 | O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? | not yet |
30 | Measuring short-form factuality in large language models | not yet |
29 | Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | not yet |
27 | OminiControl: Minimal and Universal Control for Diffusion Transformer | not yet |
27 | Logical computation demonstrated with a neutral atom quantum processor | not yet |
26 | Randomized Autoregressive Visual Generation | not yet |
25 | OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models | not yet |
25 | How Far is Video Generation from World Model: A Physical Law Perspective | |
24 | RedPajama: an Open Dataset for Training Large Language Models | not yet |
24 | Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM | not yet |
22 | Does Prompt Formatting Have Any Impact on LLM Performance? | not yet |
22 | Scaling Laws for Precision | |
21 | Enhancing LLM Reasoning with Reward-guided Tree Search | not yet |
21 | Circuit Complexity Bounds for RoPE-based Transformer Architecture | not yet |
20 | VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | not yet |
20 | How to Build a Quantum Supercomputer: Scaling from Hundreds to Millions of Qubits | not yet |
19 | Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks | |
18 | Multimodal Whole Slide Foundation Model for Pathology | not yet |
18 | VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models | not yet |
18 | SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models | not yet |
18 | Taming Rectified Flow for Inversion and Editing | not yet |
17 | CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models | not yet |
17 | Identity-Preserving Text-to-Video Generation by Frequency Decomposition | not yet |
17 | Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision | not yet |
17 | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models | not yet |
17 | WavChat: A Survey of Spoken Dialogue Models | not yet |
17 | Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models | not yet |
17 | HourVideo: 1-Hour Video-Language Understanding | not yet |
17 | Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent | not yet |
16 | Large Language Model-Brained GUI Agents: A Survey | not yet |
16 | On Statistical Rates of Conditional Diffusion Transformers: Approximation, Estimation and Minimax Optimality | not yet |
16 | DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion | not yet |
15 | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | not yet |
15 | Emotion-Aware Interaction Design in Intelligent User Interface Using Multi-Modal Deep Learning | not yet |
15 | Towards evaluations-based safety cases for AI scheming | not yet |
15 | Personalization of Large Language Models: A Survey | not yet |
14 | Self-Generated Critiques Boost Reward Modeling for Language Models | not yet |
14 | Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency | not yet |
14 | Hymba: A Hybrid-head Architecture for Small Language Models | |
14 | The Surprising Effectiveness of Test-Time Training for Few-Shot Learning | not yet |
13 | RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts | not yet |
13 | Learning Humanoid Locomotion with Perceptive Internal Model | not yet |
13 | OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs | not yet |
13 | SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | not yet |
13 | Metric Learning for Tag Recommendation: Tackling Data Sparsity and Cold Start Issues | not yet |
13 | A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness | not yet |
13 | Improving Steering Vectors by Targeting Sparse Autoencoder Features | not yet |
12 | Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers | not yet |
12 | OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining | not yet |
12 | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding | |
12 | Self-Supervised Learning in Deep Networks: A Pathway to Robust Few-Shot Classification | not yet |
12 | A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL | not yet |
12 | Safety case template for frontier AI: A cyber inability argument | not yet |
12 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | not yet |
12 | DiT4Edit: Diffusion Transformer for Image Editing | not yet |
12 | Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation | not yet |
12 | Addressing Representation Collapse in Vector Quantized Models with One Linear Layer | not yet |
12 | Vision-Language Models Can Self-Improve Reasoning via Reflection | not yet |
12 | DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models | not yet |
12 | AutoGLM: Autonomous Foundation Agents for GUIs | not yet |
12 | PatternBoost: Constructions in Mathematics with a Little Help from AI | not yet |
11 | Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability | not yet |
11 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data | not yet |
11 | Enhancing Few-Shot Learning with Integrated Data and GAN Model Approaches | not yet |
11 | All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages | not yet |
11 | Optimizing Gesture Recognition for Seamless UI Interaction Using Convolutional Neural Networks | not yet |
11 | Graph Neural Network-Based Entity Extraction and Relationship Reasoning in Complex Knowledge Graphs | not yet |
11 | OASIS: Open Agent Social Interaction Simulations with One Million Agents | not yet |
11 | Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model | not yet |
11 | ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning | not yet |
11 | MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs | not yet |
11 | MdEval: Massively Multilingual Code Debugging | not yet |
11 | Rule Based Rewards for Language Model Safety | not yet |
11 | Survey of Cultural Awareness in Language Models: Text and Beyond | not yet |
11 | Adding Error Bars to Evals: A Statistical Approach to Language Model Evaluations | not yet |
10 | VLSBench: Unveiling Visual Leakage in Multimodal Safety | not yet |
10 | INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge | not yet |
10 | CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation | not yet |
10 | Auto-RAG: Autonomous Retrieval-Augmented Generation for Large Language Models | not yet |
10 | Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data | not yet |
10 | ShowUI: One Vision-Language-Action Model for GUI Visual Agent | not yet |
10 | Adaptive Cache Management for Complex Storage Systems Using CNN-LSTM-Based Spatiotemporal Prediction | not yet |
10 | SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory | |
10 | High-fidelity universal gates in the $^{171}$Yb ground state nuclear spin qubit | not yet |
10 | Zero-Shot Automatic Annotation and Instance Segmentation using LLM-Generated Datasets: Eliminating Field Imaging and Manual Annotation for Deep Learning Model Development | not yet |
10 | The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use | not yet |
10 | LoRA-LiteE: A Computationally Efficient Framework for Chatbot Preference-Tuning | not yet |
10 | Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | not yet |
10 | Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives | not yet |
10 | From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond | not yet |
10 | Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level | not yet |
10 | TableGPT2: A Large Multimodal Model with Tabular Data Integration | not yet |
10 | Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis | not yet |
9 | GRAPE: Generalizing Robot Policy via Preference Alignment | not yet |
9 | AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers | not yet |
9 | TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | not yet |
9 | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | not yet |
9 | WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model | not yet |
9 | MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs | not yet |
9 | XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models | not yet |
9 | Towards Next-Generation Medical Agent: How o1 is Reshaping Decision-Making in Medical Scenarios | not yet |
9 | Multimodal Autoregressive Pre-training of Large Vision Encoders | not yet |
9 | BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games | not yet |
9 | Disentangling Memory and Reasoning Ability in Large Language Models | not yet |
9 | A Combined Encoder and Transformer Approach for Coherent and High-Quality Text Generation | not yet |
9 | BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices | not yet |
9 | Large Wireless Model (LWM): A Foundation Model for Wireless Channels | not yet |
9 | Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents | not yet |
9 | A Survey on Kolmogorov-Arnold Network | not yet |
9 | Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks | not yet |
9 | LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation | |
9 | GUI Agents with Foundation Models: A Comprehensive Survey | not yet |
9 | Advanced RAG Models with Graph Structures: Optimizing Complex Knowledge Reasoning and Text Generation | not yet |
9 | Distributionally Robust Optimization | not yet |
9 | Attacking Vision-Language Computer Agents via Pop-ups | not yet |
9 | WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning | not yet |
9 | On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback | not yet |
9 | DexHub and DART: Towards Internet Scale Robot Data Collection | not yet |
9 | GameGen-X: Interactive Open-world Game Video Generation | |
9 | A Public Dataset Tracking Social Media Discourse about the 2024 U.S. Presidential Election on Twitter/X | not yet |
9 | RSL-SQL: Robust Schema Linking in Text-to-SQL Generation | not yet |
8 | Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS | not yet |
8 | Filter, Correlate, Compress: Training-Free Token Reduction for MLLM Acceleration | not yet |
8 | Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training | not yet |
8 | LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training | not yet |
8 | FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression | not yet |
8 | Evaluating the Robustness of Analogical Reasoning in Large Language Models | not yet |
8 | When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training | not yet |
8 | Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension | not yet |
8 | Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models | |
8 | Understanding Chain-of-Thought in LLMs through Information Theory | not yet |
8 | ACE2: Accurately learning subseasonal to decadal atmospheric variability and forced responses | not yet |
8 | AnimateAnything: Consistent and Controllable Animation for Video Generation | not yet |
8 | Seeing Clearly by Layer Two: Enhancing Attention Heads to Alleviate Hallucination in LVLMs | not yet |
8 | Golden Noise for Diffusion Models: A Learning Framework | |
8 | Game-theoretic LLM: Agent Workflow for Negotiation Games | not yet |
8 | Autoregressive Models in Vision: A Survey | not yet |
8 | LLMs as Research Tools: A Large Scale Survey of Researchers' Usage and Perceptions | not yet |
8 | Quantum speedups in solving near-symmetric optimization problems by low-depth QAOA | not yet |
8 | Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding | not yet |
8 | Science and Project Planning for the Forward Physics Facility in Preparation for the 2024-2026 European Particle Physics Strategy Update | not yet |
8 | Evaluation data contamination in LLMs: how do we measure it and (when) does it matter? | not yet |
8 | Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback | not yet |
8 | What do sin$(x)$ and arcsinh$(x)$ have in Common? | not yet |
8 | Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement | not yet |
8 | A Lorentz-Equivariant Transformer for All of the LHC | not yet |
8 | Project Sid: Many-agent simulations toward AI civilization | not yet |
7 | Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation | not yet |
7 | $H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs | not yet |
7 | I2VControl: Disentangled and Unified Video Motion Synthesis Control | not yet |