| 117 | GPT-4o System Card | not yet |
| 82 | Movie Gen: A Cast of Media Foundation Models | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F20543e34-b24a-d48a-b987-677c156fffda.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=d1e774a56e447849df1e69ff2ac2c319) |
| 52 | GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F3e6717e3-3154-90c8-5e4d-11e3bba031ef.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=196e6e6ed71ce2bad6ba601cc4c16185) |
| 25 | Pixtral 12B | not yet |
| 25 | Video Instruction Tuning With Synthetic Data | not yet |
| 23 | Depth Pro: Sharp Monocular Metric Depth in Less Than a Second | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fbe57bc5a-15dc-1a0a-3068-b5ea1571cfe0.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=9ade3b1a6379c461e85e9d6914b21650) |
| 22 | MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion | not yet |
| 20 | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | not yet |
| 20 | Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think | not yet |
| 19 | Moshi: a speech-text foundation model for real-time dialogue | not yet |
| 17 | Loong: Generating Minute-level Long Videos with Autoregressive Language Models | not yet |
| 16 | Aria: An Open Multimodal Native Mixture-of-Experts Model | not yet |
| 16 | Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge | not yet |
| 15 | Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F0fa7793c-f10a-b755-f304-ec4653a12e5b.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=7c000273cd0ccf5aa907412e7140d696) |
| 15 | HART: Efficient Visual Generation with Hybrid Autoregressive Transformer | not yet |
| 15 | HSR-Enhanced Sparse Attention Acceleration | not yet |
| 15 | Looped ReLU MLPs May Be All You Need as Practical Programmable Computers | not yet |
| 15 | SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference | not yet |
| 15 | LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning | not yet |
| 14 | Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models | not yet |
| 14 | O1 Replication Journey: A Strategic Progress Report -- Part 1 | not yet |
| 14 | GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | not yet |
| 14 | Pyramidal Flow Matching for Efficient Video Generative Modeling | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F8a54b951-2b86-755c-a6ee-d5da7fa4c4ba.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=89ec1eacb29a8a8a6874dc24f240adb6) |
| 14 | How to Train Long-Context Language Models (Effectively) | not yet |
| 13 | No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images | not yet |
| 13 | DepthSplat: Connecting Gaussian Splatting and Depth | not yet |
| 13 | Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix | not yet |
| 13 | Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities | not yet |
| 13 | Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models | not yet |
| 13 | Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents | not yet |
| 12 | Self-Supervised Graph Neural Networks for Enhanced Feature Extraction in Heterogeneous Information Networks | not yet |
| 12 | Efficient and Aesthetic UI Design with a Deep Learning-Based Interface Generation Tree Algorithm | not yet |
| 12 | Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent | not yet |
| 12 | Differential Transformer | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fd5f1eb56-076d-dee6-0232-52bc83812f47.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=967cb55bc145310c5647eacbde5a108f) |
| 12 | A Survey on Diffusion Models for Inverse Problems | not yet |
| 11 | LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fdd539010-8843-4eab-d021-3584e1bfb12d.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=4dc7b21630ff8132df5f1c1ff5aa4817) |
| 11 | Allegro: Open the Black Box of Commercial-Level Video Generation Model | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F0a3f8f72-3646-aafc-43a8-4147f768a746.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=ff1a1c87b0ca2c52da9293a9fdbb4aef) |
| 11 | A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining | not yet |
| 11 | OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models | not yet |
| 11 | Impurities and polarons in bosonic quantum gases: a review on recent progress | not yet |
| 11 | Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes | not yet |
| 11 | RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation | not yet |
| 11 | IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers | not yet |
| 11 | ImageFolder: Autoregressive Image Generation with Folded Tokens | not yet |
| 10 | $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control | not yet |
| 10 | YOLOv11: An Overview of the Key Architectural Enhancements | not yet |
| 10 | Adversarial Neural Networks in Medical Imaging Advancements and Challenges in Semantic Segmentation | not yet |
| 10 | MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models | not yet |
| 10 | Optimizing YOLOv5s Object Detection through Knowledge Distillation algorithm | not yet |
| 10 | Agent-as-a-Judge: Evaluate Agents with Agents | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F4768c3e3-e576-eab2-04b4-6b9f3bcb6fce.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c82be484d67fe7040a2a02a0827b5ed1) |
| 10 | How to Construct Random Unitaries | not yet |
| 10 | Balancing Innovation and Privacy: Data Security Strategies in Natural Language Processing Applications | not yet |
| 10 | Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis | not yet |
| 10 | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training | not yet |
| 10 | Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning | not yet |
| 10 | Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | not yet |
| 10 | F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F87fd0917-da74-2d88-9754-ed765fed5242.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=a41fb64365d82cacce4e2b0179d0cb53) |
| 10 | LLaVA-Critic: Learning to Evaluate Multimodal Models | not yet |
| 10 | HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly | not yet |
| 10 | OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data | not yet |
| 10 | HelpSteer2-Preference: Complementing Ratings with Preferences | not yet |
| 9 | Data Scaling Laws in Imitation Learning for Robotic Manipulation | not yet |
| 9 | Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs | not yet |
| 9 | PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction | not yet |
| 9 | Optimizing Retrieval-Augmented Generation with Elasticsearch for Enhanced Question-Answering Systems | not yet |
| 9 | SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers | not yet |
| 9 | DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation | not yet |
| 9 | MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F4fe533c5-75b9-d43b-c6ca-f23dc9a6c756.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=1f35e81d64a59541ee55cdab2d1a9885) |
| 9 | CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL | not yet |
| 8 | Orb: A Fast, Scalable Neural Network Potential | not yet |
| 8 | CaloChallenge 2022: A Community Challenge for Fast Calorimeter Simulation | not yet |
| 8 | Deep Learning for Medical Text Processing: BERT Model Fine-Tuning and Comparative Study | not yet |
| 8 | Predicting Liquidity Coverage Ratio with Gated Recurrent Units: A Deep Learning Model for Risk Management | not yet |
| 8 | Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms | not yet |
| 8 | The XLZD Design Book: Towards the Next-Generation Liquid Xenon Observatory for Dark Matter and Neutrino Physics | not yet |
| 8 | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | not yet |
| 8 | Blockchain-Based Trust and Transparency in Airline Reservation Systems using Microservices Architecture | not yet |
| 8 | Automated Genre-Aware Article Scoring and Feedback Using Large Language Models | not yet |
| 8 | Jailbreaking LLM-Controlled Robots | not yet |
| 8 | DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation | not yet |
| 8 | Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies | not yet |
| 8 | Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations | not yet |
| 8 | Generative AI and Its Impact on Personalized Intelligent Tutoring Systems | not yet |
| 8 | SpeGCL: Self-supervised Graph Spectrum Contrastive Learning without Positive Samples | not yet |
| 8 | Baichuan-Omni Technical Report | not yet |
| 8 | TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens | not yet |
| 8 | Applying Hybrid Graph Neural Networks to Strengthen Credit Risk Analysis | not yet |
| 8 | Were RNNs All We Needed? | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F86afd443-c4d7-e742-0adc-fdfc005970da.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=c3438498d9766e09957100efea5c6004) |
| 7 | MarDini: Masked Autoregressive Diffusion for Video Generation at Scale | not yet |
| 7 | Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data | not yet |
| 7 | Optimizing Travel Itineraries with AI Algorithms in a Microservices Architecture: Balancing Cost, Time, Preferences, and Sustainability | not yet |
| 7 | One-Step Diffusion Distillation through Score Implicit Matching | not yet |
| 7 | Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance | not yet |
| 7 | Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages | not yet |
| 7 | Beamforming Optimization for Continuous Aperture Array (CAPA)-based Communications | not yet |
| 7 | Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models | not yet |
| 7 | CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos | not yet |
| 7 | Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F75de0734-af1b-1365-7477-cfab30679db6.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=79699f895fe78fb107c5a2bfbd90cefe) |
| 7 | Liger Kernel: Efficient Triton Kernels for LLM Training | not yet |
| 7 | TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models | not yet |
| 7 | VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents | not yet |
| 7 | The Future of Learning in the Age of Generative AI: Automated Question Generation and Assessment with Large Language Models | not yet |
| 7 | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents | not yet |
| 7 | Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow | not yet |
| 7 | Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F1ccb6cbe-e5bf-8c2f-9cd8-efaf24ba2422.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=8e9f2681b49fafe4722d402432c2698e) |
| 7 | Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation | not yet |
| 7 | ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection | not yet |
| 7 | AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | not yet |
| 7 | RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fc95910a5-db0a-ce1d-550c-76a4a06db65f.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=b3ac7e631c280eb4504427e9627a0099) |
| 7 | Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding | not yet |
| 7 | Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown | not yet |
| 6 | Deep Learning with HM-VGG: AI Strategies for Multi-modal Image Analysis | not yet |
| 6 | In-Context LoRA for Diffusion Transformers | not yet |
| 6 | EMMA: End-to-End Multimodal Model for Autonomous Driving | not yet |
| 6 | Human-Centric eXplainable AI in Education | not yet |
| 6 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark | not yet |
| 6 | Graph Contrastive Learning via Cluster-refined Negative Sampling for Semi-supervised Text Classification | not yet |
| 6 | Improve Vision Language Model Chain-of-thought Reasoning | not yet |
| 6 | MCSFF: Multi-modal Consistency and Specificity Fusion Framework for Entity Alignment | not yet |
| 6 | A Comparative Study on Reasoning Patterns of OpenAI's o1 Model | not yet |
| 6 | ALOHA Unleashed: A Simple Recipe for Robot Dexterity | not yet |
| 6 | MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation | not yet |
| 6 | Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws | not yet |
| 6 | Latent Action Pretraining from Videos | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fc8c8f874-b937-97db-56fe-e2d2c558be7d.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=2feaba1a358ec8cc43628e8afd6b2531) |
| 6 | When Attention Sink Emerges in Language Models: An Empirical View | not yet |
| 6 | Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models | not yet |
| 6 | Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models | not yet |
| 6 | Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling | not yet |
| 6 | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | not yet |
| 6 | Progressive Autoregressive Video Diffusion Models | not yet |
| 6 | Efficient Quantum Pseudorandomness from Hamiltonian Phase States | not yet |
| 6 | Towards Interpreting Visual Information Processing in Vision-Language Models | not yet |
| 6 | ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery | not yet |
| 6 | Dynamic Diffusion Transformer | not yet |
| 6 | Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise | not yet |
| 6 | Iterated Radical Expansions and Convergence | not yet |
| 6 | ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI | not yet |
| 5 | Exposing Cross-Platform Coordinated Inauthentic Activity in the Run-Up to the 2024 U.S. Election | not yet |
| 5 | A Systematic Assessment of OpenAI o1-Preview for Higher Order Thinking in Education | not yet |
| 5 | Enhancing Resilience and Scalability in Travel Booking Systems: A Microservices Approach to Fault Tolerance, Load Balancing, and Service Discovery | not yet |
| 5 | MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision | not yet |
| 5 | FreeVS: Generative View Synthesis on Free Driving Trajectory | not yet |
| 5 | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | not yet |
| 5 | LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias | not yet |
| 5 | Performance of the CMS high-level trigger during LHC Run 2 | not yet |
| 5 | xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs | not yet |
| 5 | 3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors | not yet |
| 5 | A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications | not yet |
| 5 | A Survey of Conversational Search | not yet |
| 5 | Group Diffusion Transformers are Unsupervised Multitask Learners | not yet |
| 5 | Iterative Methods via Locally Evolving Set Process | not yet |
| 5 | NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples | not yet |
| 5 | DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control | not yet |
| 5 | Generative Reward Models | not yet |
| 5 | JudgeBench: A Benchmark for Evaluating LLM-based Judges | not yet |
| 5 | Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats | not yet |
| 5 | Preference Optimization with Multi-Sample Comparisons | not yet |
| 5 | MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark | not yet |
| 5 | Boosting Camera Motion Control for Video Diffusion Transformers | not yet |
| 5 | FunnelRAG: A Coarse-to-Fine Progressive Retrieval Paradigm for RAG | not yet |
| 5 | The Ingredients for Robotic Diffusion Transformers | not yet |
| 5 | Improved List Size for Folded Reed-Solomon Codes | not yet |
| 5 | ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification | not yet |
| 5 | ARCap: Collecting High-quality Human Demonstrations for Robot Learning with Augmented Reality Feedback | not yet |
| 5 | Automated Creation of Digital Cousins for Robust Policy Learning | not yet |
| 5 | Dynamic metastability in the self-attention model | not yet |
| 5 | Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG | not yet |
| 5 | Manifolds, Random Matrices and Spectral Gaps: The geometric phases of generative diffusion | not yet |
| 5 | T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design | not yet |
| 5 | Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification | not yet |
| 5 | VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | not yet |
| 5 | Strong Model Collapse | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fec5bb4d8-15b6-6bc4-32c2-387e17a5e244.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=058249f5066ab2ee3c928e586e566cd0) |
| 5 | Learning How Hard to Think: Input-Adaptive Allocation of LM Computation | not yet |
| 5 | CAR: Controllable Autoregressive Modeling for Visual Generation | not yet |
| 5 | Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution | not yet |
| 5 | Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | not yet |
| 5 | Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models | not yet |
| 5 | Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents | not yet |
| 5 | MetaMetrics: Calibrating Metrics For Generation Tasks Using Human Preferences | not yet |
| 5 | AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models | not yet |
| 5 | ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning | not yet |
| 5 | Deep Learning Alternatives of the Kolmogorov Superposition Theorem | not yet |
| 5 | On the expressiveness and spectral bias of KANs | not yet |
| 5 | softmax is not enough (for sharp out-of-distribution) | not yet |
| 5 | MERIT: Multimodal Wearable Vital Sign Waveform Monitoring | not yet |
| 4 | Thought Space Explorer: Navigating and Expanding Thought Space for Large Language Model Reasoning | not yet |
| 4 | Unearthing a Billion Telegram Posts about the 2024 U.S. Presidential Election: Development of a Public Dataset | not yet |
| 4 | Social Science Meets LLMs: How Reliable Are Large Language Models in Social Simulations? | not yet |
| 4 | OS-ATLAS: A Foundation Action Model for Generalist GUI Agents | not yet |
| 4 | Safety cases for frontier AI | not yet |
| 4 | Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Fc5add986-e860-32c4-1961-62a81e241bcb.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=96038e57b12486d97ebe78d898410cc7) |
| 4 | One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation | not yet |
| 4 | LoRA vs Full Fine-tuning: An Illusion of Equivalence | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F943308ff-721c-75af-0f57-66a004ea90ce.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=79be9fb949cb4910543dbd254ebf2ab3) |
| 4 | ElectionSim: Massive Population Election Simulation Powered by Large Language Model Driven Agents | not yet |
| 4 | Kernel Approximation of Fisher-Rao Gradient Flows | not yet |
| 4 | AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions | not yet |
| 4 | Fast Best-of-N Decoding via Speculative Rejection | not yet |
| 4 | A Survey of Small Language Models | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F07c51268-4b08-2ec1-2a31-0f2c1131c6b2.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=30d151a5ac126aaff287568e713fd280) |
| 4 | A distributional simplicity bias in the learning dynamics of transformers | not yet |
| 4 | Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback | not yet |
| 4 | MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms | not yet |
| 4 | AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant | not yet |
| 4 | LLM-Slice: Dedicated Wireless Network Slicing for Large Language Models | not yet |
| 4 | Large Language Models Reflect the Ideology of their Creators | not yet |
| 4 | Quantum linear system algorithm with optimal queries to initial state preparation | not yet |
| 4 | WorldSimBench: Towards Video Generation Models as World Simulators | not yet |
| 4 | VoiceBench: Benchmarking LLM-Based Voice Assistants | not yet |
| 4 | Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World | not yet |
| 4 | Beyond Browsing: API-Based Web Agents | not yet |
| 4 | RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style | not yet |
| 4 | Reducing Hallucinations in Vision-Language Models via Latent Space Steering | not yet |
| 4 | LTPNet Integration of Deep Learning and Environmental Decision Support Systems for Renewable Energy Demand Forecasting | not yet |
| 4 | Deep Learning for Weather Forecasting: A CNN-LSTM Hybrid Model for Predicting Historical Temperature Data | not yet |
| 4 | Transversal non-Clifford gates for quantum LDPC codes on sheaves | not yet |
| 4 | REEF: Representation Encoding Fingerprints for Large Language Models | not yet |
| 4 | Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas | not yet |
| 4 | From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning | not yet |
| 4 | SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation | not yet |
| 4 | FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression | not yet |
| 4 | WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines | not yet |
| 4 | DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception | not yet |
| 4 | Expanding Chatbot Knowledge in Customer Service: Context-Aware Similar Question Generation Using Large Language Models | not yet |
| 4 | Learning Smooth Humanoid Locomotion through Lipschitz-Constrained Policies | not yet |
| 4 | OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation | not yet |
| 4 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F7d4f3bf3-0be2-7716-94ca-25f1acf2e9a5.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=5dabac3a8650089a1c94af71184fc0bd) |
| 4 | Locking Down the Finetuned LLMs Safety | not yet |
| 4 | Animate-X: Universal Character Image Animation with Enhanced Motion Representation | not yet |
| 4 | Safety-Aware Fine-Tuning of Large Language Models | not yet |
| 4 | Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation | not yet |
| 4 | Taming Overconfidence in LLMs: Reward Calibration in RLHF | not yet |
| 4 | Toward General Instruction-Following Alignment for Retrieval-Augmented Generation | not yet |
| 4 | Two Heads Are Better Than One: A Multi-Agent System Has the Potential to Improve Scientific Idea Generation | not yet |
| 4 | AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | not yet |
| 4 | VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | not yet |
| 4 | Losing dimensions: Geometric memorization in generative diffusion | not yet |
| 4 | Language model developers should report train-test overlap | not yet |
| 4 | Scaling Laws For Diffusion Transformers | not yet |
| 4 | Human and LLM Biases in Hate Speech Annotations: A Socio-Demographic Analysis of Annotators and Targets | not yet |
| 4 | MotionGS: Exploring Explicit Motion Guidance for Deformable 3D Gaussian Splatting | not yet |
| 4 | SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection | not yet |
| 4 | Toward hybrid quantum simulations with qubits and qumodes on trapped-ion platforms | not yet |
| 4 | IterComp: Iterative Composition-Aware Feedback Learning from Model Gallery for Text-to-Image Generation | not yet |
| 4 | Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | not yet |
| 4 | Personalized Visual Instruction Tuning | not yet |
| 4 | Degree Distribution based Spiking Graph Networks for Domain Adaptation | not yet |
| 4 | TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training | not yet |
| 4 | Restructuring Vector Quantization with the Rotation Trick | not yet |
| 4 | HiSplat: Hierarchical 3D Gaussian Splatting for Generalizable Sparse-View Reconstruction | not yet |
| 4 | EVOLvE: Evaluating and Optimizing LLMs For Exploration | not yet |
| 4 | Round and Round We Go! What makes Rotary Positional Encodings useful? | not yet |
| 4 | MDAP: A Multi-view Disentangled and Adaptive Preference Learning Framework for Cross-Domain Recommendation | not yet |
| 4 | TRACE: Temporal Grounding Video LLM via Causal Event Modeling | not yet |
| 4 | Falcon Mamba: The First Competitive Attention-free 7B Language Model | not yet |
| 4 | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | not yet |
| 4 | MIBench: A Comprehensive Benchmark for Model Inversion Attack and Defense | not yet |
| 4 | Large Language Model Based Multi-Objective Optimization for Integrated Sensing and Communications in UAV Networks | not yet |
| 4 | Gibbs state preparation for commuting Hamiltonian: Mapping to classical Gibbs sampling | not yet |
| 4 | Stochastic Runge-Kutta Methods: Provable Acceleration of Diffusion Models | not yet |
| 4 | LRQ-Fact: LLM-Generated Relevant Questions for Multimodal Fact-Checking | not yet |
| 4 | Hammer: Robust Function-Calling for On-Device Language Models via Function Masking | not yet |
| 4 | Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning | not yet |
| 4 | Recent Advances in Speech Language Models: A Survey | not yet |
| 4 | What Matters for Model Merging at Scale? | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F7fa7696a-cd75-e23b-aaec-6408d468f443.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=7adc380c14317348802d15f670328689) |
| 4 | Autoregressive Large Language Models are Computationally Universal | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F535675d5-5fa2-9f37-6ba8-2814c2ce8392.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=5e230e2960458ce58f1ec3157af1b91e) |
| 4 | AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML | not yet |
| 4 | Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | not yet |
| 4 | Contrastive Localized Language-Image Pre-Training | not yet |
| 4 | SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | not yet |
| 4 | Selective Attention Improves Transformer | not yet |
| 4 | CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs | not yet |
| 4 | LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion | not yet |
| 4 | The Patterns of Life Human Mobility Simulation | not yet |
| 3 | EgoMimic: Scaling Imitation Learning via Egocentric Video | not yet |
| 3 | DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion | not yet |
| 3 | Natural gradient and parameter estimation for quantum Boltzmann machines | not yet |
| 3 | Breaking Determinism: Fuzzy Modeling of Sequential Recommendation Using Discrete State Space Diffusion Model | not yet |
| 3 | Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2Ff31b160a-54cd-9cc7-8df7-83bb32c5f6da.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=055a8a28ae320876f15a7758f519c61d) |
| 3 | Mitigating Challenges in Ethereum's Proof-of-Stake Consensus: Evaluating the Impact of EigenLayer and Lido | not yet |
| 3 | Emergence of meta-stable clustering in mean-field transformer models | not yet |
| 3 | YOLOv11 for Vehicle Detection: Advancements, Performance, and Applications in Intelligent Transportation Systems | not yet |
| 3 | Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector | not yet |
| 3 | Optimizing Posterior Samples for Bayesian Optimization via Rootfinding | not yet |
| 3 | A note on polynomial-time tolerant testing stabilizer states | not yet |
| 3 | PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting | not yet |
| 3 | From Explicit Rules to Implicit Reasoning in an Interpretable Violence Monitoring System | not yet |
| 3 | MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding | not yet |
| 3 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | not yet |
| 3 | OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents | not yet |
| 3 | AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? | not yet |
| 3 | HOVER: Versatile Neural Whole-Body Controller for Humanoid Robots | not yet |
| 3 | SoS Certifiability of Subgaussian Distributions and its Algorithmic Applications | not yet |
| 3 | Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction | not yet |
| 3 | SEG:Seeds-Enhanced Iterative Refinement Graph Neural Network for Entity Alignment | not yet |
| 3 | What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration | not yet |
| 3 | Centaur: a foundation model of human cognition | ![](https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F3843924%2F7460768a-41ac-92e5-b3e0-4e4fb16fdead.png?ixlib=rb-4.0.0&auto=format&gif-q=60&q=75&s=39fc507bc921ccece22d2a2d8b3d8cf1) |
| 3 | SCube: Instant Large-Scale Scene Reconstruction using VoxSplats | not yet |
| 3 | YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning | not yet |
| 3 | OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization | not yet |
| 3 | FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality | not yet |
| 3 | VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks | not yet |
| 3 | From a Tiny Slip to a Giant Leap: An LLM-Based Simulation for Fake News Evolution | not yet |
| 3 | Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences | not yet |
| 3 | Conceptual Design of the Muonium-to-Antimuonium Conversion Experiment (MACE) | not yet |
| 3 | Denoising diffusion probabilistic models are optimally adaptive to unknown low dimensionality | not yet |
| 3 | Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances | not yet |
| 3 | Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch | not yet |
| 3 | Optimal Equivariant Architectures from the Symmetries of Matrix-Element Likelihoods | not yet |
| 3 | Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks | not yet |
| 3 | Improving Model Factuality with Fine-grained Critique-based Evaluator | not yet |
| 3 | CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation | not yet |
| 3 | Stochastic gradient descent in high dimensions for multi-spiked tensor PCA | not yet |
| 3 | Using Platt's scaling for calibration after undersampling -- limitations and how to address them | not yet |
| 3 | Analyzing Nobel Prize Literature with Large Language Models | not yet |
| 3 | Advanced simulations with PLUMED: OPES and Machine Learning Collective Variables | not yet |
| 3 | Scalable Ranked Preference Optimization for Text-to-Image Generation | not yet |
| 3 | MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models | not yet |
| 3 | GDDA: Semantic OOD Detection on Graphs under Covariate Shift via Score-Based Diffusion Models | not yet |