| 595 | Qwen2.5 Technical Report | not yet |
| 456 | DeepSeek-V3 Technical Report | not yet |
| 262 | OpenAI o1 System Card | not yet |
| 147 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | not yet |
| 94 | HunyuanVideo: A Systematic Framework For Large Video Generative Models | |
| 82 | Phi-4 Technical Report | |
| 50 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | not yet |
| 48 | Do NOT Think That Much for 2+3 | |
| 43 | Training Large Language Models to Reason in a Continuous Latent Space | |
| 39 | Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference | not yet |
| 38 | Open-Sora Plan: Open-Source Large Video Generation Model | |
| 37 | ProcessBench: Identifying Process Errors in Mathematical Reasoning | not yet |
| 36 | Open-Sora: Democratizing Efficient Video Production for All | not yet |
| 36 | Structured 3D Latents for Scalable and Versatile 3D Generation | not yet |
| 34 | Deliberative Alignment: Reasoning Enables Safer Language Models | not yet |
| 33 | Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems | not yet |
| 30 | LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods | not yet |
| 26 | Free Process Rewards without Process Labels | not yet |
| 25 | Alignment faking in large language models | not yet |
| 25 | Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis | |
| 23 | LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | not yet |
| 23 | Flow Matching Guide and Code | not yet |
| 23 | Flexible-Antenna Systems: A Pinching-Antenna Perspective | not yet |
| 22 | Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search | not yet |
| 22 | NVILA: Efficient Frontier Visual Language Models | not yet |
| 22 | Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation | not yet |
| 20 | HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs | |
| 20 | Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | not yet |
| 20 | Byte Latent Transformer: Patches Scale Better Than Tokens | |
| 20 | Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier | not yet |
| 20 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | not yet |
| 19 | TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | not yet |
| 18 | Large Concept Models: Language Modeling in a Sentence Representation Space | |
| 18 | VisionZip: Longer is Better but Not Necessary in Vision Language Models | |
| 18 | PaliGemma 2: A Family of Versatile VLMs for Transfer | |
| 17 | Fast Gradient Computation for RoPE Attention in Almost Linear Time | not yet |
| 17 | Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective | |
| 16 | Experimental Demonstration of Logical Magic State Distillation | not yet |
| 16 | MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | not yet |
| 16 | ExBody2: Advanced Expressive Humanoid Whole-Body Control | not yet |
| 16 | Frontier Models are Capable of In-context Scheming | not yet |
| 16 | o1-Coder: an o1 Replication for Coding | not yet |
| 15 | Token-Budget-Aware LLM Reasoning | not yet |
| 15 | ARC Prize 2024: Technical Report | |
| 15 | Best-of-N Jailbreaking | not yet |
| 14 | Theoretical Constraints on the Expressive Power of $\mathsf{RoPE}$-based Tensor Attention Transformers | not yet |
| 14 | Motion Prompting: Controlling Video Generation with Motion Trajectories | not yet |
| 13 | CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models | not yet |
| 13 | Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning | not yet |
| 13 | Mobile-TeleVision: Predictive Motion Priors for Humanoid Whole-Body Control | not yet |
| 13 | From Slow Bidirectional to Fast Autoregressive Video Diffusion Models | |
| 13 | The Computational Limits of State-Space Models and Mamba via the Lens of Circuit Complexity | not yet |
| 12 | Formal Mathematical Reasoning: A New Frontier in AI | not yet |
| 12 | FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching | not yet |
| 12 | TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks | |
| 12 | RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation | not yet |
| 12 | Compressed Chain of Thought: Efficient Reasoning Through Dense Representations | not yet |
| 12 | Apollo: An Exploration of Video Understanding in Large Multimodal Models | |
| 12 | Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice | |
| 11 | An analytic theory of creativity in convolutional diffusion models | not yet |
| 11 | LMFusion: Adapting Pretrained Language Models for Multimodal Generation | not yet |
| 11 | Flex Attention: A Programming Model for Generating Optimized Attention Kernels | not yet |
| 11 | InstantSwap: Fast Customized Concept Swapping across Sharp Shape Differences | not yet |
| 10 | A Survey on Large Language Model Acceleration based on KV Cache Management | not yet |
| 10 | MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes | |
| 10 | Jasper and Stella: distillation of SOTA embedding models | not yet |
| 10 | DRT: Deep Reasoning Translation via Long Chain-of-Thought | not yet |
| 10 | Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis | not yet |
| 10 | Parallelized Autoregressive Visual Generation | not yet |
| 10 | AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling | not yet |
| 10 | Autoregressive Video Generation without Vector Quantization | not yet |
| 10 | Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models | not yet |
| 10 | LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers | not yet |
| 10 | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | not yet |
| 10 | The BrowserGym Ecosystem for Web Agent Research | not yet |
| 10 | Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction | not yet |
| 10 | Liquid: Language Models are Scalable and Unified Multi-modal Generators | not yet |
| 10 | Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey | not yet |
| 10 | [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster | not yet |
| 9 | OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis | not yet |
| 9 | Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders | not yet |
| 9 | LearnLM: Improving Gemini for Learning | |
| 9 | Offline Reinforcement Learning for LLM Multi-Step Reasoning | not yet |
| 9 | Score-based Generative Diffusion Models for Social Recommendations | not yet |
| 9 | Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models | not yet |
| 9 | Entropy-Regularized Process Reward Model | not yet |
| 9 | TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies | not yet |
| 9 | FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models | not yet |
| 9 | UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics | not yet |
| 9 | A Consolidated Volatility Prediction with Back Propagation Neural Network and Genetic Algorithm | not yet |
| 9 | On Evaluating the Durability of Safeguards for Open-Weight LLMs | not yet |
| 9 | Gated Delta Networks: Improving Mamba2 with Delta Rule | not yet |
| 9 | BatchTopK Sparse Autoencoders | not yet |
| 9 | Comprehensive Evaluation of Multimodal AI Models in Medical Imaging Diagnosis: From Data Augmentation to Preference-Based Comparison | not yet |
| 9 | MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale | not yet |
| 9 | Evaluating and Aligning CodeLLMs on Human Preference | not yet |
| 9 | RandAR: Decoder-only Autoregressive Visual Generation in Random Orders | not yet |
| 9 | Scaling New Frontiers: Insights into Large Recommendation Models | not yet |
| 8 | Training Software Engineering Agents and Verifiers with SWE-Gym | not yet |
| 8 | Aria-UI: Visual Grounding for GUI Instructions | not yet |
| 8 | Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback | not yet |
| 8 | Categorical Symmetries in Spin Models with Atom Arrays | not yet |
| 8 | GUI Agents: A Survey | not yet |
| 8 | RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement | not yet |
| 8 | Fault-Tolerant Operation and Materials Science with Neutral Atom Logical Qubits | not yet |
| 8 | Hierarchical Split Federated Learning: Convergence Analysis and System Optimization | not yet |
| 8 | On the Expressive Power of Modern Hopfield Networks | not yet |
| 8 | International Scientific Report on the Safety of Advanced AI (Interim Report) | not yet |
| 8 | Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression | not yet |
| 8 | U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs | not yet |
| 8 | Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models | not yet |
| 8 | An Automated Data Mining Framework Using Autoencoders for Feature Extraction and Dimensionality Reduction | not yet |
| 7 | TradingAgents: Multi-Agents LLM Financial Trading Framework | not yet |
| 7 | SegKAN: High-Resolution Medical Image Segmentation with Long-Distance Dependencies | not yet |
| 7 | DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers | not yet |
| 7 | KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis | not yet |
| 7 | Progressive Multimodal Reasoning via Active Retrieval | not yet |
| 7 | MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval | not yet |
| 7 | Agent-SafetyBench: Evaluating the Safety of LLM Agents | not yet |
| 7 | Minimum Data Rate Maximization for Uplink Pinching-Antenna Systems | not yet |
| 7 | Large Language Model Enhanced Recommender Systems: A Survey | not yet |
| 7 | SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents | not yet |
| 7 | Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance | not yet |
| 7 | C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness | not yet |
| 7 | Reinforcement Learning Enhanced LLMs: A Survey | not yet |
| 7 | SCBench: A KV Cache-Centric Analysis of Long-Context Methods | not yet |
| 7 | SPT: Sequence Prompt Transformer for Interactive Image Segmentation | not yet |
| 7 | Simple Guidance Mechanisms for Discrete Diffusion Models | not yet |
| 7 | A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions | not yet |
| 7 | APOLLO: SGD-like Memory, AdamW-level Performance | not yet |
| 7 | EXAONE 3.5: Series of Large Language Models for Real-world Use Cases | |
| 7 | NaVILA: Legged Robot Vision-Language-Action Model for Navigation | not yet |
| 7 | Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models | not yet |
| 7 | Navigation World Models | not yet |
| 7 | ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning | not yet |
| 7 | Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing | not yet |
| 7 | Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications | not yet |
| 7 | HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving | not yet |
| 7 | Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review | not yet |
| 7 | FullStack Bench: Evaluating LLMs as Full Stack Coders | not yet |
| 7 | Task Singular Vectors: Reducing Task Interference in Model Merging | not yet |
| 6 | VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation | not yet |
| 6 | GME: Improving Universal Multimodal Retrieval by Multimodal LLMs | not yet |
| 6 | Universal Machine Learning Interatomic Potentials are Ready for Phonons | not yet |
| 6 | Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning | not yet |
| 6 | Multi-LLM Text Summarization | not yet |
| 6 | AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | not yet |
| 6 | MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark | not yet |
| 6 | How to Synthesize Text Data without Model Collapse? | not yet |
| 6 | Numerical Pruning for Efficient Autoregressive Models | not yet |
| 6 | Wonderland: Navigating 3D Scenes from a Single Image | not yet |
| 6 | ExecRepoBench: Multi-level Executable Code Completion Evaluation | not yet |
| 6 | A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges | not yet |
| 6 | SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models | not yet |
| 6 | ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data | not yet |