604
GPT-4o System Card
not yet
155 |
Movie Gen: A Cast of Media Foundation Models |
 |
134 |
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models |
 |
93 |
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control |
not yet |
91 |
Video Instruction Tuning With Synthetic Data |
not yet |
82 |
Pixtral 12B |
not yet |
56 |
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation |
not yet |
53 |
O1 Replication Journey: A Strategic Progress Report -- Part 1 |
not yet |
53 |
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation |
not yet |
52 |
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second |
 |
52 |
Moshi: a speech-text foundation model for real-time dialogue |
not yet |
46 |
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs |
not yet |
46 |
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models |
not yet |
46 |
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion |
not yet |
45 |
YOLOv11: An Overview of the Key Architectural Enhancements |
not yet |
45 |
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning |
not yet |
43 |
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation |
not yet |
43 |
Pyramidal Flow Matching for Efficient Video Generative Modeling |
 |
41 |
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning |
not yet |
40 |
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding |
 |
39 |
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think |
not yet |
38 |
Aria: An Open Multimodal Native Mixture-of-Experts Model |
not yet |
37 |
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models |
not yet |
36 |
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference |
not yet |
36 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
not yet |
35 |
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge |
not yet |
35 |
How to Train Long-Context Language Models (Effectively) |
not yet |
34 |
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens |
 |
33 |
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering |
 |
32 |
Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models |
not yet |
30 |
LLaVA-Critic: Learning to Evaluate Multimodal Models |
not yet |
29 |
ALOHA Unleashed: A Simple Recipe for Robot Dexterity |
not yet |
29 |
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads |
 |
29 |
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers |
not yet |
28 |
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer |
not yet |
28 |
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models |
not yet |
27 |
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos |
not yet |
27 |
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities |
not yet |
27 |
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents |
not yet |
27 |
Differential Transformer |
 |
26 |
Orb: A Fast, Scalable Neural Network Potential |
not yet |
26 |
Data Scaling Laws in Imitation Learning for Robotic Manipulation |
not yet |
26 |
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction |
not yet |
26 |
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching |
 |
25 |
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning |
 |
25 |
HelpSteer2-Preference: Complementing Ratings with Preferences |
not yet |
24 |
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models |
not yet |
23 |
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models |
 |
23 |
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations |
not yet |
23 |
Loong: Generating Minute-level Long Videos with Autoregressive Language Models |
not yet |
23 |
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly |
not yet |
22 |
Generalizable Humanoid Manipulation with 3D Diffusion Policies |
not yet |
22 |
HSR-Enhanced Sparse Attention Acceleration |
not yet |
22 |
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents |
not yet |
22 |
A Survey on Diffusion Models for Inverse Problems |
not yet |
21 |
Liger Kernel: Efficient Triton Kernels for LLM Training |
not yet |
21 |
Agent-as-a-Judge: Evaluate Agents with Agents |
 |
21 |
AFlow: Automating Agentic Workflow Generation |
not yet |
21 |
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers |
not yet |
21 |
Baichuan-Omni Technical Report |
not yet |
20 |
DepthSplat: Connecting Gaussian Splatting and Depth |
not yet |
20 |
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model |
not yet |
20 |
LightRAG: Simple and Fast Retrieval-Augmented Generation |
not yet |
20 |
ImageFolder: Autoregressive Image Generation with Folded Tokens |
not yet |
19 |
Improve Vision Language Model Chain-of-thought Reasoning |
not yet |
19 |
Allegro: Open the Black Box of Commercial-Level Video Generation Model |
 |
19 |
Generative Reward Models |
not yet |
19 |
JudgeBench: A Benchmark for Evaluating LLM-based Judges |
not yet |
19 |
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents |
not yet |
19 |
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training |
not yet |
19 |
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks |
not yet |
19 |
Strong Model Collapse |
 |
19 |
CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL |
not yet |
18 |
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images |
not yet |
18 |
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference |
not yet |
18 |
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark |
not yet |
18 |
Performance of the CMS high-level trigger during LHC Run 2 |
not yet |
18 |
A Survey on Data Synthesis and Augmentation for Large Language Models |
not yet |
18 |
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models |
not yet |
18 |
Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes |
not yet |
18 |
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation |
not yet |
18 |
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models |
 |
17 |
EMMA: End-to-End Multimodal Model for Autonomous Driving |
not yet |
17 |
Automatically Interpreting Millions of Features in Large Language Models |
not yet |
17 |
Latent Action Pretraining from Videos |
 |
17 |
Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix |
not yet |
17 |
When Attention Sink Emerges in Language Models: An Empirical View |
not yet |
17 |
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis |
not yet |
17 |
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery |
not yet |
17 |
SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? |
not yet |
17 |
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark |
not yet |
17 |
ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI |
not yet |
16 |
VoiceBench: Benchmarking LLM-Based Voice Assistants |
not yet |
16 |
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling |
not yet |
16 |
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation |
not yet |
16 |
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models |
not yet |
16 |
How to Construct Random Unitaries |
not yet |
16 |
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation |
not yet |
16 |
Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making |
not yet |
16 |
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG |
not yet |
16 |
Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise |
not yet |
15 |
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics |
 |
15 |
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models |
not yet |
15 |
Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent |
not yet |
15 |
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues |
not yet |
15 |
Impurities and polarons in bosonic quantum gases: a review on recent progress |
not yet |
15 |
Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow |
not yet |
15 |
T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design |
not yet |
15 |
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification |
not yet |
15 |
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs |
not yet |
15 |
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection |
not yet |
15 |
Inference Scaling for Long-Context Retrieval Augmented Generation |
 |
15 |
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations |
 |
15 |
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models |
not yet |
15 |
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment |
not yet |
15 |
Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown |
not yet |
14 |
In-Context LoRA for Diffusion Transformers |
not yet |
14 |
Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations? |
not yet |
14 |
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents |
not yet |
14 |
HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots |
not yet |
14 |
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages |
not yet |
14 |
Jailbreaking LLM-Controlled Robots |
not yet |
14 |
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs |
not yet |
14 |
The Ingredients for Robotic Diffusion Transformers |
not yet |
14 |
ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback |
not yet |
14 |
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation |
not yet |
14 |
Falcon Mamba: The First Competitive Attention-free 7B Language Model |
not yet |
14 |
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens |
not yet |
14 |
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs |
not yet |
14 |
Interpretable Contrastive Monte Carlo Tree Search Reasoning |
not yet |
13 |
SelfCodeAlign: Self-Alignment for Code Generation |
 |
13 |
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision |
not yet |
13 |
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms |
not yet |
13 |
WorldSimBench: Towards Video Generation Models as World Simulators |
not yet |
13 |
The XLZD Design Book: Towards the Next-Generation Liquid Xenon Observatory for Dark Matter and Neutrino Physics |
not yet |
13 |
Thinking LLMs: General Instruction Following with Thought Generation |
 |
13 |
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System |
not yet |
13 |
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents |
not yet |
13 |
IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers |
not yet |
12 |
On Memorization of Large Language Models in Logical Reasoning |
not yet |
12 |
CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation |
not yet |
12 |
Safety cases for frontier AI |
not yet |
12 |
MarDini: Masked Autoregressive Diffusion for Video Generation at Scale |
not yet |
12 |
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models |
not yet |
12 |
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data |
not yet |
12 |
Scaling Diffusion Language Models via Adaptation from Autoregressive Models |
not yet |
12 |
Self-Supervised Graph Neural Networks for Enhanced Feature Extraction in Heterogeneous Information Networks |
not yet |
12 |
Efficient and Aesthetic UI Design with a Deep Learning-Based Interface Generation Tree Algorithm |
not yet |
12 |
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style |
not yet |
12 |
REEF: Representation Encoding Fingerprints for Large Language Models |
not yet |
12 |
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control |
not yet |
12 |
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats |
not yet |
12 |
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation |
not yet |
12 |
G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks |
not yet |
12 |
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory |
not yet |
12 |
IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation |
not yet |
12 |
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models |
not yet |
12 |
Towards Self-Improvement of LLMs via MCTS: Leveraging Stepwise Knowledge with Curriculum Preference Learning |
not yet |
12 |
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation |
not yet |
12 |
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models |
not yet |
12 |
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models |
not yet |
12 |
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems |
not yet |
12 |
Were RNNs All We Needed? |
 |
11 |
One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation |
not yet |
11 |
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions |
 |
11 |
Fast Best-of-N Decoding via Speculative Rejection |
not yet |
11 |
Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing |
not yet |
11 |
Why Does the Effective Context Length of LLMs Fall Short? |
not yet |
11 |
One-Step Diffusion Distillation through Score Implicit Matching |
not yet |
11 |
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance |
not yet |
11 |
CamI2V: Camera-Controlled Image-to-Video Diffusion Model |
not yet |
11 |
A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining |
not yet |
11 |
From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning |
not yet |
11 |
Chain of Ideas: Revolutionizing Research Via Novel Idea Development with LLM Agents |
not yet |
11 |
Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws |
not yet |
11 |
Mechanistic? |
not yet |
11 |
Losing dimensions: Geometric memorization in generative diffusion |
not yet |