821 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | not yet |
85 | s1: Simple test-time scaling | not yet |
78 | Kimi k1.5: Scaling Reinforcement Learning with LLMs | not yet |
67 | rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | |
44 | Cosmos World Foundation Model Platform for Physical AI | not yet |
39 | 2 OLMo 2 Furious | not yet |
36 | The Lessons of Developing Process Reward Models in Mathematical Reasoning | |
34 | Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models | not yet |
32 | Humanity's Last Exam | |
31 | Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling | not yet |
27 | LTX-Video: Realtime Video Latent Diffusion | not yet |
26 | MiniMax-01: Scaling Foundation Models with Lightning Attention | not yet |
25 | Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought | not yet |
24 | SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | |
22 | VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | not yet |
22 | Titans: Learning to Memorize at Test Time | |
19 | Search-o1: Agentic Search-Enhanced Large Reasoning Models | |
17 | Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps | |
17 | LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | not yet |
17 | REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | not yet |
16 | On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis | not yet |
16 | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | not yet |
14 | Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs | |
14 | A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models | not yet |
14 | VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | not yet |
14 | Evolving Deeper LLM Thinking | |
14 | Open Problems in Machine Unlearning for AI Safety | |
14 | Agent Laboratory: Using LLM Agents as Research Assistants | |
13 | On the Computational Capability of Graph Neural Networks: A Circuit Complexity Bound Perspective | not yet |
12 | Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling | not yet |
12 | Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers | not yet |
12 | FAST: Efficient Action Tokenization for Vision-Language-Action Models | not yet |
12 | Do generative video models understand physical principles? | |
12 | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | not yet |
12 | Virgo: A Preliminary Exploration on Reproducing o1-like MLLM | not yet |
12 | Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models | not yet |
11 | O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning | not yet |
11 | PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models | not yet |
11 | Training Medical Large Vision-Language Models with Abnormal-Aware Feedback | not yet |
10 | Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | not yet |
10 | Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation | not yet |
10 | Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | |
10 | A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges | not yet |
10 | Humanoid Locomotion and Manipulation: Current Progress and Challenges in Control, Planning, and Learning | not yet |
10 | VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling | not yet |
10 | Retrieval-Augmented Generation with Graphs (GraphRAG) | not yet |
9 | International AI Safety Report | not yet |
9 | Open Problems in Mechanistic Interpretability | not yet |
9 | Qwen2.5-1M Technical Report | not yet |
9 | UI-TARS: Pioneering Automated GUI Interaction with Native Agents | |
9 | RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems? | not yet |
9 | Detection of AI Deepfake and Fraud in Online Payments Using GAN-Based Models | not yet |
9 | Tensor Product Attention Is All You Need | not yet |
9 | Multi-Agent Collaboration Mechanisms: A Survey of LLMs | not yet |
9 | Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains | not yet |
9 | Circuit Complexity Bounds for Visual Autoregressive Model | not yet |
8 | o3-mini vs DeepSeek-R1: Which One is Safer? | not yet |
8 | Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge | not yet |
8 | InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling | not yet |
8 | InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | not yet |
8 | Pinching Antennas: Principles, Applications and Challenges | not yet |
8 | A General Framework for Inference-time Scaling and Steering of Diffusion Models | not yet |
8 | MinMo: A Multimodal Large Language Model for Seamless Voice Interaction | not yet |
8 | URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics | not yet |
8 | Rotatable Antenna Enabled Wireless Communication: Modeling and Optimization | not yet |
8 | Test-Time Compute: from System-1 Thinking to System-2 Thinking | not yet |
7 | SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer | not yet |
7 | Molecular-driven Foundation Model for Oncologic Pathology | not yet |
7 | Baichuan-Omni-1.5 Technical Report | not yet |
7 | Enhancing Intent Understanding for Ambiguous prompt: A Human-Machine Co-Adaption Strategy | not yet |
7 | Reasoning Language Models: A Blueprint | not yet |
7 | VideoWorld: Exploring Knowledge Learning from Unlabeled Videos | not yet |
7 | O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning | not yet |
7 | Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion | not yet |
7 | OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis | not yet |
7 | CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings | not yet |
6 | SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling | not yet |
6 | Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey | not yet |
6 | Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation | |
6 | Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate | not yet |
6 | SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model | not yet |
6 | Improving Video Generation with Human Feedback | not yet |
6 | Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks | not yet |
6 | Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training | not yet |
6 | RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation | not yet |
6 | Inference-Time Alignment in Diffusion Models with Reward-Guided Generation: Tutorial and Review | not yet |
6 | Diffusion Adversarial Post-Training for One-Step Video Generation | not yet |
6 | Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning | not yet |
6 | InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection | not yet |
6 | ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling | not yet |
5 | Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | not yet |
5 | Probing topological matter and fermion dynamics on a neutral-atom quantum computer | not yet |
5 | AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders | not yet |
5 | How Linguistics Learned to Stop Worrying and Love the Language Models | not yet |
5 | Challenges in Ensuring AI Safety in DeepSeek-R1 Models: The Shortcomings of Reinforcement Learning Strategies | not yet |
5 | Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models | not yet |
5 | Fanar: An Arabic-Centric Multimodal Generative AI Platform | not yet |
5 | Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos | not yet |
5 | Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer | not yet |
5 | GAMED-Snake: Gradient-aware Adaptive Momentum Evolution Deep Snake Model for Multi-organ Segmentation | not yet |
5 | Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models | |
5 | A Survey on Multi-Turn Interaction Capabilities of Large Language Models | not yet |
5 | Quantum-Centric Algorithm for Sample-Based Krylov Diagonalization | not yet |
5 | Vision-Language Models Do Not Understand Negation | not yet |
5 | Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG | not yet |
5 | Enhancing Automated Interpretability with Output-Centric Feature Descriptions | not yet |
5 | WebWalker: Benchmarking LLMs in Web Traversal | not yet |
5 | Understanding and Benchmarking Artificial Intelligence: OpenAI's o3 Is Not AGI | not yet |
5 | Multi-subject Open-set Personalization in Video Generation | not yet |
5 | Enabling Scalable Oversight via Self-Evolving Critic | not yet |
5 | Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark | not yet |
5 | ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning | not yet |
5 | LLM4SR: A Survey on Large Language Models for Scientific Research | not yet |
5 | Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | not yet |
5 | Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control | not yet |
5 | The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input | not yet |
5 | EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation | not yet |
5 | SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation | not yet |
5 | Object-level Visual Prompts for Compositional Image Generation | not yet |
5 | Nested Attention: Semantic-aware Attention Values for Concept Personalization | not yet |
5 | LEO-Split: A Semi-Supervised Split Learning Framework over LEO Satellite Networks | not yet |
5 | CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries | not yet |
5 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | not yet |
5 | Dual Diffusion for Unified Image Generation and Understanding | not yet |
4 | Reward-Guided Speculative Decoding for Efficient LLM Reasoning | not yet |
4 | Efficient Reasoning with Hidden Thinking | not yet |
4 | Diffusion Autoencoders are Scalable Image Tokenizers | not yet |
4 | GuardReasoner: Towards Reasoning-based LLM Safeguards | not yet |
4 | MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding | not yet |
4 | Sparse Autoencoders Can Interpret Randomly Initialized Transformers | not yet |
4 | Large Language Models for Code Generation: The Practitioners Perspective | not yet |
4 | Parameter-Efficient Fine-Tuning for Foundation Models | not yet |
4 | UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models | not yet |
4 | Low-dimensional adaptation of diffusion models: Convergence in total variation | not yet |
4 | Continuous 3D Perception Model with Persistent State | not yet |
4 | MMVU: Measuring Expert-Level Multi-Discipline Video Understanding | not yet |
4 | Poison-RAG: Adversarial Data Poisoning Attacks on Retrieval-Augmented Generation in Recommender Systems | not yet |
4 | Tell me about yourself: LLMs are aware of their learned behaviors | not yet |
4 | Generative Physical AI in Vision: A Survey | not yet |
4 | Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments | not yet |
4 | Infrastructure for AI Agents | not yet |
4 | A Simple Aerial Detection Baseline of Multimodal Language Models | not yet |
4 | Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians | not yet |
4 | What Limits LLM-based Human Simulation: LLMs or Our Design? | not yet |
4 | Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models | not yet |
4 | GameFactory: Creating New Games with Generative Interactive Videos | not yet |
4 | CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation | not yet |
4 | Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards | not yet |
4 | MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation | not yet |