ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
Summary
arXiv:2605.05331v1 (announce type: cross)
Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside tr…