| 720 | Qwen2.5-VL Technical Report | not yet |
| 159 | LIMO: Less is More for Reasoning | not yet |
| 119 | From System 1 to System 2: A Survey of Reasoning Large Language Models | not yet |
| 113 | Demystifying Long Chain-of-Thought Reasoning in LLMs | not yet |
| 107 | Process Reinforcement through Implicit Rewards | not yet |
| 97 | Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning | not yet |
| 80 | Chain of Draft: Thinking Faster by Writing Less | |
| 69 | SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features | not yet |
| 67 | TokenSkip: Controllable Chain-of-Thought Compression in LLMs | not yet |
| 67 | SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model | not yet |
| 64 | Training Language Models to Reason Efficiently | not yet |
| 59 | Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention | |
| 57 | Towards an AI co-scientist | not yet |
| 52 | LLM Post-Training: A Deep Dive into Reasoning Large Language Models | |
| 52 | LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! | |
| 52 | When More is Less: Understanding Chain-of-Thought Length in LLMs | not yet |
| 50 | Large Language Diffusion Models | not yet |
| 47 | SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution | |
| 47 | Competitive Programming with Large Reasoning Models | |
| 46 | Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach | not yet |
| 45 | Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning | not yet |
| 45 | YOLOv12: Attention-Centric Real-Time Object Detectors | not yet |
| 44 | Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | not yet |
| 43 | MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning | not yet |
| 43 | CoT-Valve: Length-Compressible Chain-of-Thought Tuning | not yet |
| 42 | The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks | not yet |
| 40 | Self-Training Elicits Concise Reasoning in Large Language Models | not yet |
| 39 | LIMR: Less is More for RL Scaling | not yet |
| 35 | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model | not yet |
| 34 | MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency | not yet |
| 33 | MoBA: Mixture of Block Attention for Long-Context LLMs | not yet |
| 33 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | not yet |
| 31 | Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success | not yet |
| 31 | LightThinker: Thinking Step-by-Step Compression | not yet |
| 31 | SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines | not yet |
| 31 | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | not yet |
| 30 | Small Models Struggle to Learn from Strong Reasoners | not yet |
| 28 | On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective | not yet |
| 28 | Multi-Agent Risks from Advanced AI | not yet |
| 28 | Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving | not yet |
| 27 | CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation | not yet |
| 27 | Preference Leakage: A Contamination Problem in LLM-as-a-judge | not yet |
| 26 | Muon is Scalable for LLM Training | not yet |
| 26 | Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity | not yet |
| 25 | Scaling Test-Time Compute Without Verification or RL is Suboptimal | not yet |
| 25 | Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research | |
| 24 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking | not yet |
| 24 | DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control | not yet |
| 23 | Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | not yet |
| 23 | Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? | not yet |
| 23 | SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs | not yet |
| 23 | A-MEM: Agentic Memory for LLM Agents | not yet |
| 23 | EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | not yet |
| 23 | AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society | not yet |
| 23 | SCALM: Detecting Bad Practices in Smart Contracts Through LLMs | not yet |
| 23 | ASAP: Aligning Simulation and Real-World Physics for Learning Agile Humanoid Whole-Body Skills | not yet |
| 22 | OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models | not yet |
| 21 | Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models | not yet |
| 21 | Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning | not yet |
| 21 | OverThink: Slowdown Attacks on Reasoning LLMs | not yet |
| 21 | ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning | not yet |
| 20 | MMTEB: Massive Multilingual Text Embedding Benchmark | not yet |
| 20 | ACECODER: Acing Coder RL via Automated Test-Case Synthesis | not yet |
| 19 | Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | not yet |
| 19 | Harnessing Multiple Large Language Models: A Survey on LLM Ensemble | not yet |
| 19 | Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? | not yet |
| 19 | History-Guided Video Diffusion | not yet |
| 19 | Modeling and Beamforming Optimization for Pinching-Antenna Systems | not yet |
| 19 | Advancing Reasoning in Large Language Models: Promising Methods and Approaches | not yet |
| 19 | Layer by Layer: Uncovering Hidden Representations in Language Models | not yet |
| 18 | Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models | not yet |
| 18 | MLGym: A New Framework and Benchmark for Advancing AI Research Agents | not yet |
| 18 | S*: Test Time Scaling for Code Generation | not yet |
| 18 | Magma: A Foundation Model for Multimodal AI Agents | |
| 18 | SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? | |
| 18 | Atom of Thoughts for Markov LLM Test-Time Scaling | not yet |
| 18 | MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | not yet |
| 18 | Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model | not yet |
| 18 | ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | not yet |
| 18 | TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models | not yet |
| 18 | Multi-agent Architecture Search via Agentic Supernet | not yet |
| 18 | Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2 | not yet |
| 18 | Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning | not yet |
| 17 | BIG-Bench Extra Hard | not yet |
| 17 | Reasoning with Latent Thoughts: On the Power of Looped Transformers | not yet |
| 17 | NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions | not yet |
| 17 | Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction | not yet |
| 17 | Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search | not yet |
| 16 | Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction | not yet |
| 16 | MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations | not yet |
| 16 | Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound | not yet |
| 16 | Goku: Flow Based Video Generative Foundation Models | |
| 16 | DeepRAG: Thinking to Retrieve Step by Step for Large Language Models | not yet |
| 16 | STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving | not yet |
| 15 | Self-rewarding correction for mathematical reasoning | not yet |
| 15 | The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer | |
| 15 | ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model | not yet |
| 15 | S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning | not yet |
| 15 | PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection | not yet |
| 15 | Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey | not yet |
| 15 | Force Matching with Relativistic Constraints: A Physics-Inspired Approach to Stable and Efficient Generative Modeling | not yet |
| 15 | Universal Approximation of Visual Autoregressive Transformers | not yet |
| 15 | Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance | not yet |
| 15 | Fast Video Generation with Sliding Tile Attention | not yet |
| 15 | WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs | |
| 14 | RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete | not yet |
| 14 | From RAG to Memory: Non-Parametric Continual Learning for Large Language Models | not yet |
| 14 | Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | not yet |
| 14 | Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks | not yet |
| 14 | AnyEdit: Edit Any Knowledge Encoded in Language Models | not yet |
| 14 | GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? | not yet |
| 14 | Safety at Scale: A Comprehensive Survey of Large Model Safety | not yet |
| 14 | Ola: Pushing the Frontiers of Omni-Modal Language Model | not yet |
| 14 | Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis | not yet |
| 14 | VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models | not yet |
| 14 | High-Order Matching for One-Step Shortcut Diffusion Models | not yet |
| 13 | UniTok: A Unified Tokenizer for Visual Generation and Understanding | not yet |
| 13 | Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models | not yet |
| 13 | Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning | not yet |
| 13 | SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference | not yet |
| 13 | Are Sparse Autoencoders Useful? A Case Study in Sparse Probing | not yet |
| 13 | Do Multilingual LLMs Think In English? | not yet |
| 13 | RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation | not yet |
| 13 | HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit | not yet |
| 13 | Towards Reasoning Ability of Small Language Models | not yet |
| 13 | OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning | not yet |
| 13 | ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models | not yet |
| 13 | CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction | not yet |
| 13 | Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | not yet |
| 13 | Teaching Language Models to Critique via Reinforcement Learning | not yet |
| 13 | Fully Autonomous AI Agents Should Not be Developed | not yet |
| 13 | Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification | not yet |
| 12 | Reward Shaping to Mitigate Reward Hacking in RLHF | not yet |
| 12 | Dynamic Parallel Tree Search for Efficient LLM Reasoning | not yet |
| 12 | Red-Teaming LLM Multi-Agent Systems via Communication Attacks | not yet |
| 12 | Which Attention Heads Matter for In-Context Learning? | not yet |
| 12 | AIDE: AI-Driven Exploration in the Space of Code | not yet |
| 12 | Baichuan-M1: Pushing the Medical Capability of Large Language Models | not yet |
| 12 | Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights | not yet |
| 12 | Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis | not yet |
| 12 | Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More | not yet |
| 12 | PlanGenLLMs: A Modern Survey of LLM Planning Capabilities | not yet |
| 12 | HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation | not yet |
| 12 | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs | not yet |
| 12 | Salamandra Technical Report | not yet |
| 12 | Recent Advances in Discrete Speech Tokens: A Review | not yet |
| 12 | TabICL: A Tabular Foundation Model for In-Context Learning on Large Data | not yet |
| 12 | BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving | not yet |
| 12 | Boosting Multimodal Reasoning with Automated Structured Thinking | not yet |
| 12 | Latent Thought Models with Variational Bayes Inference-Time Computation | not yet |
| 12 | MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models | not yet |
| 11 | MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts | not yet |
| 11 | Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs | not yet |
| 11 | FlexTok: Resampling Images into 1D Token Sequences of Flexible Length | not yet |
| 11 | PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning | not yet |
| 11 | A Survey of Personalized Large Language Models: Progress and Future Directions | not yet |
| 11 | A Survey of LLM-based Agents in Medicine: How far are we from Baymax? | not yet |
| 11 | On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning | not yet |
| 11 | Organize the Web: Constructing Domains Enhances Pre-Training Data Curation | not yet |
| 11 | Process Reward Models for LLM Agents: Practical Framework and Directions | not yet |
| 11 | DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products | not yet |
| 11 | Logical Reasoning in Large Language Models: A Survey | not yet |
| 11 | EnigmaEval: A Benchmark of Long Multimodal Reasoning Challenges | not yet |
| 11 | If Multi-Agent Debate is the Answer, What is the Question? | not yet |
| 11 | Training Deep Learning Models with Norm-Constrained LMOs | not yet |
| 11 | Self-Supervised Prompt Optimization | not yet |
| 11 | Performance Analysis of Pinching-Antenna Systems | not yet |
| 11 | Confidence Improves Self-Consistency in LLMs | not yet |
| 11 | Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | not yet |
| 11 | Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization | not yet |
| 11 | Do Large Language Model Benchmarks Test Reliability? | not yet |
| 11 | Masked Autoencoders Are Effective Tokenizers for Diffusion Models | not yet |
| 11 | Reinforcement Learning for Long-Horizon Interactive LLM Agents | not yet |
| 10 | On Benchmarking Human-Like Intelligence in Machines | not yet |
| 10 | Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids | not yet |
| 10 | Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? | not yet |
| 10 | OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment | not yet |
| 10 | MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | not yet |
| 10 | VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model | not yet |