RAE-AR: Taming Autoregressive Models
with Representation Autoencoders

1University of Science and Technology of China, 2JD Explore Academy, 3The University of Hong Kong

🔥 Highlights

1. We systematically investigate the integration of Representation Autoencoders (e.g., DINOv2, SigLIP2, MAE) into continuous autoregressive (AR) models, a setting in which such encoders were previously considered unsuitable for generative modeling.

2. We identify two primary hurdles restricting RAEs in AR generation: complex token-wise distribution modeling, and a training-inference gap (exposure bias) amplified by the high dimensionality of the latents.

3. We propose Token Distribution Normalization to ease modeling difficulty and Gaussian noise injection during training to mitigate exposure bias. These strategies successfully bridge the gap, enabling RAEs to achieve performance comparable to that of traditional VAEs.
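The first strategy, token distribution normalization, can be sketched as a per-channel standardization of the frozen encoder's latents before AR modeling, inverted again before decoding. This is an illustrative sketch only; the function names and the assumption that statistics are per-channel moments estimated offline on training data are ours, not taken from the paper.

```python
import numpy as np

def normalize_tokens(z, mu, sigma, eps=1e-6):
    """Standardize latent tokens channel-wise before AR modeling.

    z:         (N, D) tokens from a frozen representation encoder (e.g. DINOv2).
    mu, sigma: (D,) per-channel mean / std estimated offline on training data.
    """
    return (z - mu) / (sigma + eps)

def denormalize_tokens(z_norm, mu, sigma, eps=1e-6):
    """Invert the normalization before passing tokens to the decoder."""
    return z_norm * (sigma + eps) + mu
```

The intent is that the AR model only ever sees tokens with roughly zero mean and unit variance per channel, which eases fitting the token-wise distribution and speeds convergence.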


Abstract

The latent space of generative models has long been dominated by VAE encoders. Latents from pretrained representation encoders (e.g., DINO, SigLIP, MAE) were previously considered unsuitable for generative modeling. Recently, the RAE method showed that representation autoencoders can achieve performance competitive with VAE encoders. However, the integration of representation autoencoders into continuous autoregressive (AR) models remains largely unexplored.

In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, a setting we denote RAE-AR. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling, and a training-inference gap (exposure bias) amplified by the high dimensionality of the latents. To address these, we introduce token simplification via distribution normalization to ease modeling difficulty and improve convergence. Furthermore, we enhance prediction robustness by injecting Gaussian noise during training to mitigate exposure bias. Our empirical results demonstrate that these modifications substantially bridge the performance gap, enabling representation autoencoders to achieve results comparable to traditional VAEs on AR models. This work paves the way for a more unified architecture across visual understanding and generative modeling.


Challenges and Solutions

Framework overview showing challenges and solutions

Overview of RAE-AR: We identify two core challenges (complex token-wise distributions and exposure bias) when applying representation autoencoders directly to autoregressive models. To resolve these, we propose token distribution normalization to accelerate convergence and Gaussian noise perturbation to enhance robustness against exposure bias.
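The second strategy targets exposure bias: at inference an AR model conditions on its own (imperfect) predictions, while during teacher forcing it only ever sees clean ground-truth tokens, and high-dimensional latents widen this mismatch. A minimal sketch of noise injection, assuming the perturbation is additive Gaussian noise on the conditioning tokens with an illustrative scale `sigma` (the paper's exact noise scale and schedule are not given here):

```python
import numpy as np

def noisy_teacher_forcing(gt_tokens, sigma=0.1, rng=None):
    """Perturb ground-truth conditioning tokens during training.

    Adding Gaussian noise to the tokens the model conditions on simulates
    the imperfect, self-generated tokens it will see at inference time,
    making next-token prediction robust to accumulated errors.

    gt_tokens: (N, D) ground-truth latent tokens.
    sigma:     noise scale (illustrative default, a tunable hyperparameter).
    """
    rng = np.random.default_rng() if rng is None else rng
    return gt_tokens + sigma * rng.standard_normal(gt_tokens.shape)
```

At inference no noise is added; the perturbation exists only to close the train-test gap.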


Performance of Native Representation Autoencoders

1. Reconstruction Performance of RAE

Quantitative Results

Quantitative reconstruction results

Table Analysis: Quantitative comparison of reconstruction metrics between standard VAEs and representation autoencoders (DINOv2, SigLIP2, MAE).

Qualitative Results

Qualitative reconstruction results

Visual Observations: Visual comparison of reconstructed images. While RAEs maintain reasonable semantic structure, they can sometimes exhibit blurriness in fine details compared to VAEs.

2. Generation Performance of RAE on AR Models

Generation performance comparison

Degradation Phenomenon: Directly applying standard representation autoencoders to AR models leads to poor generation performance. This severe degradation confirms the impact of the two identified issues: complex token-wise distributions and exposure bias.


Main Experiment Results — RAE-AR

Quantitative Results

Quantitative results of RAE-AR

Ablation Metrics: Quantitative ablation studies demonstrating that our proposed techniques (Token Normalization and Noise Injection) significantly bridge the performance gap, making RAEs competitive with VAEs.

Qualitative Results

Qualitative results of RAE-AR

Visual Improvements: Visual generation results. Compared to the baseline models which struggle with severe artifacts, our RAE-AR framework successfully produces coherent, high-fidelity images across different semantic spaces.


BibTeX

@article{yu2026raear,
  title={RAE-AR: Taming Autoregressive Models with Representation Autoencoders},
  author={Yu, Hu and Xu, Hang and Huang, Jie and Xue, Zeyue and Huang, Haoyang and Duan, Nan and Zhao, Feng},
  journal={arXiv preprint arXiv:2604.01545},
  year={2026}
}