最新视觉大模型 DINOv3论文精读(逐段解析)

论文地址:https://arxiv.org/abs/2508.10104

工程地址:https://github.com/facebookresearch/dinov3

===========================================

【论文总结】:DINOv3是一个突破性的自监督视觉基础模型,其核心技术创新围绕三个关键方面:大规模数据与模型协同扩展、Gram锚定技术解决密集特征退化、多阶段训练策略。首先,通过精心设计的三重数据策略(聚类策划+检索策划+标准数据集混合)和70亿参数ViT架构,实现了数据与模型的协同扩展,解决了传统自监督学习在规模化时的稳定性问题。其次,最具创新性的Gram锚定技术通过约束patch特征间的Gram矩阵相似性结构,有效解决了长期训练中密集特征质量退化的问题,使模型在保持全局语义理解能力的同时维持精确的空间定位能力。最后,采用多阶段训练流程:基础自监督训练→Gram锚定细化→高分辨率适应→知识蒸馏,每个阶段都针对特定目标进行优化,最终产生了一个真正通用的视觉编码器,在无需微调的情况下就能在目标检测、语义分割、深度估计等多种任务上达到最优性能,为计算机视觉领域树立了新的技术标杆。

===========================================

Abstract

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images— using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

【翻译】自监督学习有望消除手动数据标注的需求,使模型能够毫不费力地扩展到大规模数据集和更大的架构。通过不针对特定任务或领域进行定制,这种训练范式有潜力从多样化的来源学习视觉表示,从自然图像到航拍图像——使用单一算法。这份技术报告介绍了DINOv3,这是通过利用简单而有效的策略来实现这一愿景的重要里程碑。首先,我们通过仔细的数据准备、设计和优化来利用扩展数据集和模型大小的好处。其次,我们引入了一种称为Gram锚定的新方法,它有效地解决了密集特征图在长期训练计划中退化的已知但未解决的问题。最后,我们应用后处理策略,进一步增强我们模型在分辨率、模型大小和与文本对齐方面的灵活性。因此,我们提出了一个多功能的视觉基础模型,在不进行微调的情况下,在广泛的设置范围内超越了专业的最先进技术。DINOv3产生高质量的密集特征,在各种视觉任务上取得出色性能,显著超越了之前的自监督和弱监督基础模型。我们还分享了DINOv3视觉模型套件,旨在通过为不同资源约束和部署场景提供可扩展解决方案,在广泛的任务和数据范围内推进最先进技术。

【解析】自监督学习的本质是让模型从未标注的数据中自主学习有用的表示,这种方法的优势在于它不依赖人工标注,因此可以充分利用互联网上海量的图像数据。传统的监督学习需要为每张图像提供标签,这不仅成本高昂,而且限制了数据规模的扩展。而自监督学习通过设计巧妙的学习任务,让模型从数据本身的结构和模式中学习,从而绕过了标注瓶颈。DINOv3的第一个重要贡献是数据和模型的协同扩展。深度学习中,数据量和模型规模通常需要同步增长才能达到最佳效果。数据准备不仅包括收集更多图像,还涉及数据质量控制、多样性保证和分布平衡。模型设计则需要考虑如何有效利用增加的参数量,避免过拟合和训练不稳定等问题。第二个核心创新是Gram锚定技术。在长期训练过程中,密集特征图会逐渐失去细节信息,这是一个困扰自监督学习的关键问题。密集特征图包含图像中每个位置的特征信息,对于语义分割、目标检测等需要精确空间定位的任务至关重要。Gram锚定通过特定的正则化策略来保持这些密集特征的质量,确保模型在获得全局理解能力的同时,不会牺牲局部细节的表达能力。后处理策略的引入进一步提升了模型的实用性。最终的成果是一个真正通用的视觉基础模型,它在无需微调的情况下就能在多种任务上达到最优性能。"冻结骨干网络"的使用方式大大降低了模型部署的复杂性和计算成本,同时保证了在不同任务间的一致性表现。

1 Introduction

Foundation models have become a central building block in modern computer vision, enabling broad generalization across tasks and domains through a single, reusable model. Self-supervised learning (SSL) is a powerful approach for training such models, by learning directly from raw pixel data and leveraging the natural co-occurrences of patterns in images. Unlike weakly and fully supervised pretraining methods ( Radford et al. , 2021 ; Dehghani et al. , 2023 ; Bolya et al. , 2025 ) which require images paired with high-quality metadata, SSL unlocks training on massive, raw image collections. This is particularly effective for training large-scale visual encoders thanks to the availability of virtually unlimited training data. DINOv2 ( Oquab et al. , 2024 ) exemplifies these strengths, achieving impressive results in image understanding tasks ( Wang et al. , 2025 ) and enabling pre-training for complex domains such as histopathology ( Chen et al. , 2024 ). Models trained with SSL exhibit additional desirable properties: they are robust to input distribution shifts, provide strong global and local features, and generate rich embeddings that facilitate physical scene understanding. Since SSL models are not trained for any specific downstream task, they produce versatile and robust generalist features. For instance, DINOv2 models deliver strong performance across diverse tasks and domains without requiring task-specific finetuning, allowing a single frozen backbone to serve multiple purposes. Importantly, self-supervised learning is especially suitable to train on the vast amount of available observational data in domains like histopathology ( Vorontsov et al. , 2024 ), biology ( Kim et al. , 2025 ), medical imaging ( Pérez-García et al. , 2025 ), remote sensing ( Cong et al. , 2022 ; Tolan et al. , 2024 ), astronomy ( Parker et al. , 2024 ), or high-energy particle physics ( Dillon et al. , 2022 ). These domains often lack metadata and have already been shown to benefit from foundation models like DINOv2. Finally, SSL, requiring no human intervention, is well-suited for lifelong learning amid the growing volume of web data.

【翻译】基础模型已成为现代计算机视觉的核心构建块,通过单个可重用模型实现跨任务和领域的广泛泛化。自监督学习(SSL)是训练此类模型的强大方法,通过直接从原始像素数据学习并利用图像中模式的自然共现性。与需要图像配对高质量元数据的弱监督和全监督预训练方法(Radford et al., 2021; Dehghani et al., 2023; Bolya et al., 2025)不同,SSL解锁了在大规模原始图像集合上的训练。由于几乎无限的训练数据的可用性,这对于训练大规模视觉编码器特别有效。DINOv2(Oquab et al., 2024)体现了这些优势,在图像理解任务中取得了令人印象深刻的结果(Wang et al., 2025),并为病理学等复杂领域的预训练提供了支持(Chen et al., 2024)。用SSL训练的模型表现出额外的理想特性:它们对输入分布偏移具有鲁棒性,提供强大的全局和局部特征,并生成有助于物理场景理解的丰富嵌入。由于SSL模型不是为任何特定的下游任务训练的,它们产生多功能且鲁棒的通用特征。例如,DINOv2模型在不需要任务特定微调的情况下,在各种任务和领域中提供强大的性能,允许单个冻结的骨干网络服务于多种目的。重要的是,自监督学习特别适合在病理学(Vorontsov et al., 2024)、生物学(Kim et al., 2025)、医学成像(PérezGarcía et al., 2025)、遥感(Cong et al., 2022; Tolan et al., 2024)、天文学(Parker et al., 2024)或高能粒子物理学(Dillon et al., 2022)等领域的大量可用观测数据上进行训练。这些领域通常缺乏元数据,并且已经被证明受益于像DINOv2这样的基础模型。最后,不需要人工干预的SSL非常适合在不断增长的网络数据量中进行终身学习。

【解析】基础模型的核心价值在于其"一次训练,处处使用"的特性。随着互联网数据的持续增长,模型需要能够不断适应新的数据分布和视觉概念,而自监督学习的无监督特性使这种持续学习成为可能。


Figure 1: (a) Evolution of linear probing results on ImageNet1k (IN1k) over the years, comparing fully- (SL), weakly- (WSL) and self-supervised learning (SSL) methods. Despite coming into the picture later, SSL has quickly progressed and now reached the ImageNet accuracy plateau of recent years. On the other hand, we demonstrate that SSL offers the unique promise of high-quality dense features. With DINOv3, we markedly improve over weakly-supervised models on dense tasks, as shown by the relative performance of the best-in-class WSL models to DINOv3 (b). We also produce PCA maps of features obtained from high resolution images with DINOv3 trained on natural (c) and aerial images (d).

【翻译】图1:(a) 多年来ImageNet1k (IN1k)上线性探测结果的演变,比较了全监督(SL)、弱监督(WSL)和自监督学习(SSL)方法。尽管SSL出现较晚,但它快速发展,现在已经达到了近年来ImageNet准确率的平台期。另一方面,我们证明SSL提供了高质量密集特征的独特优势。通过DINOv3,我们在密集任务上显著超越了弱监督模型,如最佳WSL模型与DINOv3的相对性能所示(b)。我们还生成了从DINOv3在自然图像(c)和航拍图像(d)上训练得到的高分辨率图像特征的PCA图。

【解析】这个图表展示了自监督学习发展的技术突破。线性探测是评估预训练模型质量的标准方法,它将预训练的特征提取器冻结,只训练一个简单的线性分类器来完成下游任务。能够直接反映预训练模型学到的表示质量,而不受下游任务特定优化的影响。图中显示的演变过程反映了深度学习发展的三个主要阶段:全监督学习依赖大量人工标注数据,在早期取得了显著成功;弱监督学习利用图像与文本的配对关系,通过CLIP等模型实现了新的突破;自监督学习虽然起步较晚,但通过巧妙的任务设计避免了对标注数据的依赖,最终在性能上追平甚至超越了其他方法。密集特征质量的提升是DINOv3的核心优势。传统的视觉模型往往专注于全局图像理解,在提取局部细节特征时表现不佳。密集特征指的是模型对图像中每个位置都能产生有意义的特征表示,这对于语义分割、目标检测、深度估计等需要精确空间定位的任务至关重要。PCA可视化展示了模型学到的特征在语义上的连贯性和空间上的精确性,不同颜色区域对应不同的语义概念,颜色边界与实际物体边界的吻合程度反映了特征质量的高低。
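
【代码示意】线性探测的评估流程可以用下面的 PyTorch 伪实现来说明:冻结预训练骨干网络,仅在其全局特征之上训练一个线性分类器。其中 backbone、train_loader、feat_dim 等名称均为示意性假设,并非 DINOv3 官方接口,仅用于说明"冻结骨干 + 线性头"的评估思路。

```python
import torch
import torch.nn as nn

# 线性探测:冻结预训练骨干网络,只训练一个线性分类头。
# backbone / train_loader 等均为示意对象,并非 DINOv3 官方 API。
def linear_probe(backbone, train_loader, feat_dim, num_classes,
                 epochs=10, lr=1e-3, device="cuda"):
    backbone.eval()                              # 冻结骨干:不更新参数
    for p in backbone.parameters():
        p.requires_grad = False

    head = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)         # 假设输出全局特征 [B, feat_dim](如 CLS token)
            loss = ce(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```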

In practice, the promise of SSL, namely producing arbitrarily large and powerful models by leveraging large amounts of unconstrained data, remains challenging at scale. While model instabilities and collapse are mitigated by the heuristics proposed by Oquab et al. ( 2024 ), more problems emerge from scaling further. First, it is unclear how to collect useful data from unlabeled collections. Second, in usual training practice, employing cosine schedules implies knowing the optimization horizon a priori, which is difficult when training on large image corpora. Third, the performance of the features gradually decreases after early training, confirmed by visual inspection of the patch similarity maps. This phenomenon appears in longer training runs with models above ViT-Large size (300M parameters), reducing the usefulness of scaling DINOv2.

【翻译】在实践中,SSL的潜力,即通过利用大量无约束数据来产生任意大型和强大的模型,在规模化时仍然具有挑战性。虽然Oquab等人(2024)提出的启发式方法缓解了模型不稳定性和崩溃问题,但进一步扩展时会出现更多问题。首先,如何从未标记的集合中收集有用数据尚不清楚。其次,在通常的训练实践中,采用余弦调度需要先验地知道优化视界,这在大型图像语料库上训练时很困难。第三,特征的性能在早期训练后逐渐下降,这通过补丁相似性图的视觉检查得到了证实。这种现象出现在ViT-Large规模以上(300M参数)模型的长期训练运行中,降低了扩展DINOv2的有用性。

【解析】这段话表明了自监督学习规模化过程中面临的三个技术挑战。数据收集的困难在于无标注数据的质量参差不齐,需要设计有效的数据筛选和预处理策略来确保训练数据的多样性和代表性,同时避免噪声数据对模型性能的负面影响。优化调度的问题源于深度学习训练的复杂性。余弦学习率调度是一种常用的学习率衰减策略,它要求预先设定总的训练步数,然后按照余弦函数的形状逐渐降低学习率。但在大规模数据集上,很难准确估计达到收敛所需的训练时间,这使得制定有效的学习率调度变得困难。最关键的问题是特征质量的退化现象。在长期训练过程中,模型虽然在全局任务上表现持续改善,但局部特征的质量却开始下降。这种现象可以通过补丁相似性图观察到:高质量的特征应该让语义相似的图像区域在特征空间中距离较近,而语义不同的区域距离较远。当这种相似性模式变得模糊或混乱时,说明模型的密集特征表示能力在退化。这个问题在大模型中更加严重,因为更多的参数和更长的训练时间放大了这种退化效应,这也是为什么简单地扩大DINOv2模型规模无法带来预期收益的根本原因。
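
【代码示意】关于"余弦调度需要预先知道优化视界"这一点,可以用下面的小例子直观对比:余弦调度的学习率曲线完全由预设的总步数决定,而"常数 + 末端冷却"式的调度可以在训练过程中再决定何时收尾。具体数值与函数形式均为示意,并非论文采用的超参数。

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, final_lr=1e-5):
    # 余弦调度:必须预先指定 total_steps,训练时长一旦改变曲线就需要重新设计
    t = min(step / total_steps, 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

def constant_then_cooldown_lr(step, cooldown_start, cooldown_steps,
                              base_lr=1e-3, final_lr=1e-5):
    # 常数调度:主阶段学习率恒定,可随时决定何时进入末端冷却,不依赖预设总步数
    if step < cooldown_start:
        return base_lr
    t = min((step - cooldown_start) / cooldown_steps, 1.0)
    return base_lr + (final_lr - base_lr) * t

for s in (0, 250_000, 500_000, 900_000, 1_000_000):
    print(s, round(cosine_lr(s, 1_000_000), 6),
          round(constant_then_cooldown_lr(s, 900_000, 100_000), 6))
```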

Addressing the problems above leads to this work, DINOv3, which advances SSL training at scale. We demonstrate that a single frozen SSL backbone can serve as a universal visual encoder that achieves state-of-the-art performance on challenging downstream tasks, outperforming supervised and metadata-reliant pre-training strategies. Our research is guided by the following objectives: (1) training a foundational model versatile across tasks and domains, (2) improving the shortcomings of existing SSL models on dense features, (3) disseminating a family of models that can be used off-the-shelf. We discuss the three aims in the following.

【翻译】解决上述问题导致了这项工作,DINOv3,它推进了SSL的大规模训练。我们证明了单个冻结的SSL骨干网络可以作为通用视觉编码器,在具有挑战性的下游任务上达到最先进的性能,超越了监督学习和依赖元数据的预训练策略。我们的研究遵循以下目标:(1) 训练一个跨任务和领域的通用基础模型,(2) 改善现有SSL模型在密集特征方面的不足,(3) 传播一系列可以开箱即用的模型。我们在下文中讨论这三个目标。

【解析】通用视觉编码器的实现需要模型具备强大的特征抽象能力,能够从原始像素数据中提取出既包含全局语义信息又保留局部细节的多层次表示。这种表示需要具备足够的普适性,使得同一套特征能够同时支持图像分类、目标检测、语义分割、深度估计等多种视觉任务。第一个目标强调了基础模型的跨域泛化能力,这要求模型不仅能在自然图像上表现优异,还能适应医学影像、卫星图像、显微镜图像等专业领域的数据分布。第二个目标针对密集特征质量的提升,密集特征指模型对图像中每个空间位置都能产生有意义的特征表示,这对于需要精确空间定位的任务至关重要。现有SSL模型往往在长期训练过程中出现密集特征退化现象,导致局部细节信息丢失。第三个目标通过提供不同规模的预训练模型来满足不同计算资源约束下的应用需求,降低先进视觉技术的使用门槛。


Figure 2: Performance of the DINOv3 family of models, compared to other families of self- or weakly-supervised models, on different benchmarks. DINOv3 significantly surpasses others on dense benchmarks, including models that leverage mask annotation priors such as AM-RADIO ( Heinrich et al. , 2025 ).

【翻译】图2:DINOv3模型家族与其他自监督或弱监督模型家族在不同基准测试上的性能比较。DINOv3在密集基准测试上显著超越其他模型,包括利用掩码标注先验的模型,如AM-RADIO (Heinrich et al., 2025)。

【解析】性能对比图展示了DINOv3在密集预测任务上的显著优势。密集基准测试任务对模型的局部特征表示能力要求极高,需要模型能够准确捕捉物体边界、纹理细节和空间关系。AM-RADIO等模型利用掩码标注先验,即通过人工标注的分割掩码来指导训练过程,虽然能够提供精确的空间定位信息,但需要大量的人工标注成本。DINOv3能够在完全无监督的情况下超越这些利用额外监督信号的方法,说明其自监督学习算法在特征学习上的有效性。

Strong & Versatile Foundational Models DINOv3 aims to offer a high level of versatility along two axes, which is enabled by the scaling of the model size and training data. First, a key desirable property for SSL models is to achieve excellent performance while being kept frozen, ideally reaching similar state-of-the-art results as specialized models. In that case, a single forward pass can deliver cutting-edge results across multiple tasks, leading to substantial computational savings—an essential advantage for practical applications, particularly on edge devices. We show the wide breadth of tasks that DINOv3 can successfully be applied to in Sec. 6 . Second, a scalable SSL training pipeline that does not depend on metadata unlocks numerous scientific applications. By pre-training on a diverse set of images, whether web images or observational data, SSL models generalize across a large set of domains and tasks. As illustrated in Fig. 1 (d), the PCA of DINOv3 features extracted from a high-resolution aerial image clearly allows to separate roads, houses, and greenery, highlighting the model’s feature quality.

【翻译】强大且多功能的基础模型 DINOv3旨在沿着两个轴提供高水平的多功能性,这通过模型大小和训练数据的扩展得以实现。首先,SSL模型的一个关键理想特性是在保持冻结状态下实现出色性能,理想情况下达到与专业模型相似的最先进结果。在这种情况下,单次前向传播可以在多个任务中提供前沿结果,从而带来大量的计算节省——这对于实际应用来说是一个重要优势,特别是在边缘设备上。我们在第6节中展示了DINOv3可以成功应用的任务的广泛范围。其次,不依赖元数据的可扩展SSL训练管道解锁了众多科学应用。通过在多样化的图像集合上进行预训练,无论是网络图像还是观测数据,SSL模型都能在大量领域和任务中泛化。如图1(d)所示,从高分辨率航拍图像中提取的DINOv3特征的PCA清楚地允许分离道路、房屋和绿地,突出了模型的特征质量。

【解析】DINOv3的多功能性设计体现在两个维度上。第一个维度是"冻结骨干网络"的通用性。第二个维度是跨领域的泛化能力。

Superior Feature Maps Through Gram Anchoring Another key feature of DINOv3 is a significant improvement of its dense feature maps. The DINOv3 SSL training strategy aims at producing models excelling at high-level semantic tasks while producing excellent feature maps amenable to solving geometric tasks such as depth estimation, or 3D matching. In particular, the models should produce dense features that can be used off-the-shelf or with little post-processing. The compromise between dense and global representation is especially difficult to optimize when training with vast amounts of images, since the objective of high-level understanding can conflict with the quality of the dense feature maps. These contradictory objectives lead to a collapse of dense features with large models and long training schedules. Our new Gram anchoring strategy effectively mitigates this collapse (see Sec. 4 ). As a result, DINOv3 obtains significantly better dense feature maps than DINOv2, staying clean even at high resolutions (see Fig. 3 ).

【翻译】通过Gram锚定实现优质特征图 DINOv3的另一个关键特性是其密集特征图的显著改进。DINOv3 SSL训练策略旨在产生在高级语义任务中表现出色的模型,同时产生适合解决诸如深度估计或3D匹配等几何任务的优秀特征图。特别是,模型应该产生可以开箱即用或只需很少后处理的密集特征。当使用大量图像进行训练时,密集表示和全局表示之间的妥协特别难以优化,因为高级理解的目标可能与密集特征图的质量冲突。这些矛盾的目标导致大型模型和长期训练计划中密集特征的崩溃。我们新的Gram锚定策略有效地缓解了这种崩溃(见第4节)。因此,DINOv3获得了比DINOv2显著更好的密集特征图,即使在高分辨率下也保持清晰(见图3)。

【解析】密集特征图质量的提升是DINOv3的核心技术突破之一。在计算机视觉中,模型需要同时具备两种不同层次的理解能力:全局语义理解和局部几何理解。全局语义理解关注的是"这是什么"的问题,比如识别图像中包含的物体类别、场景类型等高层语义信息;而局部几何理解则关注"在哪里"和"什么形状"的问题,需要对图像中每个像素位置都能产生有意义的特征表示,这对于深度估计、三维重建、精确分割等任务至关重要。传统上,这两个目标存在天然的冲突:为了获得更好的全局语义理解,模型倾向于学习更加抽象和概括的特征表示,这个过程往往会丢失局部的细节信息;而保持局部细节信息则可能影响模型对全局语义的抽象能力。在大规模长期训练过程中,这种冲突会导致"密集特征崩溃"现象——模型的密集特征图逐渐失去空间精确性和语义一致性,变得模糊和不可用。Gram锚定技术通过引入特定的正则化机制来解决这个问题。Gram矩阵原本是用来描述特征之间相关性的数学工具,在这里被用作锚定点,确保训练过程中密集特征的质量不会随着训练的进行而退化。这种方法使得DINOv3能够在保持强大全局理解能力的同时,产生高质量的密集特征图,这些特征图即使在高分辨率输入下也能保持清晰和准确。
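
【代码示意】下面给出 Gram 锚定思想的一个最小 PyTorch 示意:对 patch 特征做 L2 归一化后计算 patch 两两之间的相似度(Gram)矩阵,并约束学生模型的 Gram 矩阵靠近来自早期迭代的"Gram 教师"的 Gram 矩阵,即只约束特征之间的相对结构而非特征本身。损失的具体形式、教师快照的选取与权重系数均为假设,细节以论文第 4 节为准。

```python
import torch
import torch.nn.functional as F

def gram_matrix(patch_feats):
    # patch_feats: [B, N, D];先做 L2 归一化,Gram 矩阵即 patch 间的余弦相似度 [B, N, N]
    x = F.normalize(patch_feats, dim=-1)
    return x @ x.transpose(1, 2)

def gram_anchoring_loss(student_patches, gram_teacher_patches):
    # 只约束 patch 之间的相似度结构,不直接约束特征取值本身,
    # 从而在保留全局语义学习空间的同时稳定密集特征的空间结构
    g_s = gram_matrix(student_patches)
    with torch.no_grad():
        g_t = gram_matrix(gram_teacher_patches)   # 教师取自较早迭代的模型快照,此处仅为示意
    return F.mse_loss(g_s, g_t)

# 示例:8 张图、196 个 patch、1024 维特征
s = torch.randn(8, 196, 1024, requires_grad=True)
t = torch.randn(8, 196, 1024)
loss = gram_anchoring_loss(s, t)
loss.backward()
```

这种只作用于二阶相似度结构的约束,直观上允许特征整体继续为全局目标演化,同时防止 patch 间相似性图在长期训练中被破坏。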

The DINOv3 Family of Models Solving the degradation of dense feature map with Gram anchoring unlocks the power of scaling. As a consequence, training a much larger model with SSL leads to significant performance improvements. In this work, we successfully train a DINO model with 7B parameters. Since such a large model requires significant resources to run, we apply distillation to compress its knowledge into smaller variants. As a result, we present the DINOv3 family of vision models , a comprehensive suite designed to address a wide spectrum of computer vision challenges. This model family aims to advance the state of the art by offering scalable solutions adaptable to diverse resource constraints and deployment scenarios. The distillation process produces model variants at multiple scales, including Vision Transformer (ViT) Small, Base, and Large, as well as ConvNeXt-based architectures. Notably, the efficient and widely adopted ViT-L model achieves performance close to that of the original 7B teacher across a variety of tasks. Overall, the DINOv3 family demonstrates strong performance on a broad range of benchmarks, matching or exceeding the accuracy of competing models on global tasks, while significantly outperforming them on dense prediction tasks, as visible in Fig. 2 .

【翻译】DINOv3模型系列:通过Gram锚定解决密集特征图退化问题释放了缩放的力量。因此,使用SSL训练更大的模型带来了显著的性能改进。在这项工作中,我们成功训练了一个具有70亿参数的DINO模型。由于如此大的模型需要大量资源来运行,我们应用蒸馏将其知识压缩到较小的变体中。因此,我们提出了DINOv3视觉模型系列,这是一个旨在解决广泛的计算机视觉挑战的综合套件。该模型系列旨在通过提供适应不同资源约束和部署场景的可扩展解决方案来推进最先进技术。蒸馏过程产生了多个规模的模型变体,包括Vision Transformer (ViT) Small、Base和Large,以及基于ConvNeXt的架构。值得注意的是,高效且广泛采用的ViT-L模型在各种任务上实现了接近原始70亿参数教师模型的性能。总体而言,DINOv3系列在广泛的基准测试中表现出强劲的性能,在全局任务上匹配或超越竞争模型的准确性,同时在密集预测任务上显著超越它们,如图2所示。

【解析】自监督学习模型中,当模型规模超过一定阈值并进行长期训练时,密集特征图会出现质量退化现象,限制模型扩展的收益。Gram锚定通过特定的正则化机制保持特征图的内在结构稳定性,使得更大规模的模型训练成为可能。70亿参数模型在实际部署中面临巨大的计算资源挑战,知识蒸馏成为解决这一矛盾的关键手段,通过让小模型学习大模型的预测行为和内部表示,在保持性能的同时大幅降低计算成本。蒸馏过程不仅仅是简单的参数压缩,而是一种知识传递过程,大模型作为教师网络指导小模型的训练,使小模型能够获得超越其自身容量的表达能力。DINOv3模型通过提供从Small到Large的多个规模变体,用户可以根据具体的资源约束和性能需求选择最适合的模型。


Figure 3: High-resolution dense features. We visualize the cosine similarity maps obtained with DINOv3 output features between the patches marked with a red cross and all other patches. Input image at 4096×4096. Please zoom in, do you agree with DINOv3?

【翻译】图3:高分辨率密集特征。我们可视化了使用DINOv3输出特征在标有红十字的补丁与所有其他补丁之间获得的余弦相似性图。输入图像为4096×4096。请放大查看,你同意DINOv3的结果吗?

【解析】这个可视化展示了DINOv3在超高分辨率图像上的密集特征提取能力。4096×4096属于极高分辨率,余弦相似性度量用于评估特征质量,计算两个特征向量之间的夹角余弦值,值越接近1说明特征越相似,越接近0说明差异越大。通过选择图像中的特定位置(红十字标记)作为查询点,然后计算该位置的特征与图像中所有其他位置特征的相似性,可以直观地观察模型对语义概念的理解程度。高质量的视觉特征应该能够识别出语义相似的区域,比如同一个物体的不同部分、相似的纹理或材质等。可视化结果中相似性的空间分布模式能够反映模型是否真正理解了图像的语义结构,而不是仅仅基于低级的颜色或纹理特征进行匹配。高分辨率下的特征一致性对于需要精确空间定位的任务极其重要,它能够支持精细的图像分析和编辑应用。
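
【代码示意】图3这类余弦相似度图的计算方式可以概括为:取查询位置(红十字)的 patch 特征,与所有 patch 特征逐一计算余弦相似度,再把结果还原为二维热力图。下面是一个示意实现,patch 特征的来源接口为假设。

```python
import torch
import torch.nn.functional as F

def similarity_map(patch_feats, h, w, query_index):
    # patch_feats: [N, D],N = h * w;query_index 对应图中红十字标记的 patch
    feats = F.normalize(patch_feats, dim=-1)
    sims = feats @ feats[query_index]          # 余弦相似度,取值范围 [-1, 1]
    return sims.reshape(h, w)

# 示意:假设 4096x4096 输入、patch 大小 16,得到 256x256 个 patch
h = w = 256
patch_feats = torch.randn(h * w, 1024)         # 实际应来自 DINOv3 的 patch 输出
heatmap = similarity_map(patch_feats, h, w, query_index=h * w // 2)
print(heatmap.shape)                           # torch.Size([256, 256])
```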

Overview of Contributions In this work, we introduce multiple contributions to address the challenge of scaling SSL towards a large frontier model. We build upon recent advances in automatic data curation ( Vo et al. , 2024 ) to obtain a large “background” training dataset that we carefully mix with a bit of specialized data (ImageNet-1k). This allows leveraging large amounts of unconstrained data to improve the model performance. This contribution (i) around data scaling will be described in Sec. 3.1 .

We increase our main model size to 7B parameters by defining a custom variant of the ViT architecture. We include modern position embeddings (axial RoPE) and develop a regularization technique to avoid positional artifacts. Departing from the multiple cosine schedules in DINOv2, we train with constant hyperparameter schedules for 1M iterations. This allows producing models with stronger performance. This contribution (ii) on model architecture and training will be described in Sec. 3.2 .

With the above techniques, we are able to train a model following the DINOv2 algorithm at scale. However, as mentioned previously, scale leads to a degradation of dense features. To address this, we propose a core improvement of the pipeline with a Gram anchoring training phase. This cleans the noise in the feature maps, leading to impressive similarity maps, and drastically improving the performance on both parametric and non-parametric dense tasks. This contribution (iii) on Gram training will be described in Sec. 4 .

【翻译】贡献概述 在这项工作中,我们引入了多项贡献来解决将SSL扩展到大型前沿模型的挑战。我们基于自动数据策划的最新进展(Vo等人,2024)来获得一个大型"背景"训练数据集,我们将其与少量专业数据(ImageNet-1k)仔细混合。这允许利用大量无约束数据来改善模型性能。关于数据扩展的贡献(i)将在第3.1节中描述。

我们通过定义ViT架构的自定义变体将主模型大小增加到70亿参数。我们包括现代位置嵌入(轴向RoPE)并开发了一种正则化技术来避免位置伪影。与DINOv2中的多个余弦调度不同,我们使用恒定超参数调度训练100万次迭代。这允许产生性能更强的模型。关于模型架构和训练的贡献(ii)将在第3.2节中描述。

通过上述技术,我们能够大规模地训练遵循DINOv2算法的模型。然而,如前所述,规模导致密集特征的退化。为了解决这个问题,我们提出了使用Gram锚定训练阶段对管道的核心改进。这清理了特征图中的噪声,产生了令人印象深刻的相似性图,并大幅改善了参数化和非参数化密集任务的性能。关于Gram训练的贡献(iii)将在第4节中描述。

【解析】总结了DINOv3的三个技术贡献,每个贡献都针对自监督学习规模化过程中的难题。第一个贡献解决的是数据质量与数量的平衡问题。在大规模训练中,简单地增加数据量并不能保证模型性能的提升,因为网络数据的质量参差不齐,包含大量噪声和冗余信息。自动数据策划技术通过算法自动筛选和组织训练数据,确保数据的多样性和代表性。将大规模"背景"数据与精心策划的ImageNet-1k数据混合,既保证了数据的广度又确保了质量,混合策略能够让模型在保持泛化能力的同时获得更好的性能。第二个贡献涉及模型架构的创新设计。将模型规模扩展到70亿参数是一个巨大的技术挑战,需要解决计算效率、训练稳定性和位置编码等多个问题。轴向RoPE(Rotary Position Embedding)是一种先进的位置编码方法,能够更好地处理不同长度和分辨率的输入,同时保持计算效率。位置伪影是指由于位置编码不当导致的模型对图像中特定位置产生偏见的现象,正则化技术的引入有效缓解了这个问题。恒定超参数调度相比于复杂的余弦调度更加稳定,避免了学习率变化过于复杂导致的训练不稳定性。第三个贡献是最为关键的Gram锚定技术。在大规模训练过程中,虽然全局特征质量持续改善,但密集特征往往会出现退化现象,表现为特征图中出现噪声、空间一致性下降等问题。Gram矩阵能够捕捉特征之间的二阶统计信息,通过锚定机制确保特征图的空间一致性和语义连贯性,从而解决了规模化训练中密集特征退化的核心问题。
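
【代码示意】贡献(ii)中提到的轴向 RoPE,其基本做法是把每个注意力头的通道一分为二,分别用 patch 的行坐标和列坐标做一维旋转位置编码。下面给出一个简化示意(频率基数、坐标取值方式等均为假设,DINOv3 的实际实现及其用于避免位置伪影的正则化请以官方代码为准)。

```python
import torch

def rope_1d(x, pos, base=100.0):
    # x: [..., N, d](d 为偶数),pos: [N] 一维坐标;对每对通道按角度 pos*freq 旋转
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # [d/2]
    angles = pos[:, None].float() * freqs[None, :]                      # [N, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope_2d(x, rows, cols):
    # 轴向 RoPE:每个头的维度一分为二,分别用行坐标和列坐标做一维旋转位置编码
    d = x.shape[-1]
    x_row, x_col = x[..., : d // 2], x[..., d // 2 :]
    return torch.cat([rope_1d(x_row, rows), rope_1d(x_col, cols)], dim=-1)

# 示意:14x14 的 patch 网格,8 个头、每头 64 维
h = w = 14
rows = torch.arange(h).repeat_interleave(w)   # 每个 patch 的行坐标
cols = torch.arange(w).repeat(h)              # 每个 patch 的列坐标
q = torch.randn(1, 8, h * w, 64)              # [batch, heads, patches, head_dim]
print(axial_rope_2d(q, rows, cols).shape)
```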

Following previous practice, the last steps of our pipeline consist of a high-resolution post-training phase and distillation into a series of high-performance models of various sizes. For the latter, we develop a novel and efficient single-teacher multiple-students distillation procedure. This contribution (iv) transfers the power of our 7B frontier model to a family of smaller practical models for common usage, that we describe in Sec. 5.2 .

【翻译】按照以往的做法,我们管道的最后步骤包括高分辨率后训练阶段和蒸馏成一系列不同大小的高性能模型。对于后者,我们开发了一种新颖且高效的单教师多学生蒸馏程序。这一贡献(iv)将我们70亿参数前沿模型的能力转移到一系列较小的实用模型中,供常见使用,我们在第5.2节中描述。

【解析】描述了DINOv3训练流程的最终阶段。高分辨率后训练是在主要训练完成后进行的额外优化步骤,专门针对高分辨率输入进行模型调优,确保模型在处理高分辨率图像时仍能保持优秀的特征提取能力。通过单教师多学生的蒸馏策略,可以同时训练多个不同规模的学生模型,这些学生模型在保持相对较小计算量的同时,能够学习到教师模型的核心能力。
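
【代码示意】单教师多学生蒸馏的核心节省在于:教师(70亿参数模型)对每个 batch 只前向一次,其输出被多个学生共享。下面是一个高度简化的示意,假设学生与教师使用相同的 patch 划分、且学生输出已投影到与教师相同的维度;损失形式(同时对齐全局与 patch 特征的余弦相似度)也仅为示意,并非论文的确切目标函数。

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, students, optimizers, images):
    # 教师只前向一次,输出同时供所有学生使用,避免重复承担大模型的前向开销
    with torch.no_grad():
        t_cls, t_patch = teacher(images)       # 假设返回 (全局特征 [B,D], patch 特征 [B,N,D])

    losses = []
    for student, opt in zip(students, optimizers):
        s_cls, s_patch = student(images)
        # 以余弦距离同时对齐全局与密集特征(损失形式仅为示意)
        loss = (1 - F.cosine_similarity(s_cls, t_cls, dim=-1).mean()) \
             + (1 - F.cosine_similarity(s_patch, t_patch, dim=-1).mean())
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses
```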

As measured in our thorough benchmarking, results in Sec. 6 show that our approach defines a new standard in dense tasks and performs comparably to CLIP derivatives on global tasks. In particular, with a frozen vision backbone , we achieve state-of-the-art performance on longstanding computer vision problems such as object detection (COCO detection, mAP 66.1) and image segmentation (ADE20k, mIoU 63.0), outperforming specialized fine-tuned pipelines. Moreover, we provide evidence of the generality of our approach across domains by applying the DINOv3 algorithm to satellite imagery, in Sec. 8 , surpassing all prior approaches.

【翻译】通过我们全面的基准测试,第6节的结果表明我们的方法在密集任务上定义了新标准,并在全局任务上与CLIP衍生模型表现相当。特别是,使用冻结的视觉骨干网络,我们在长期存在的计算机视觉问题上实现了最先进的性能,如目标检测(COCO检测,mAP 66.1)和图像分割(ADE20k,mIoU 63.0),超越了专门的微调管道。此外,我们通过在第8节将DINOv3算法应用于卫星图像,提供了我们方法跨领域通用性的证据,超越了所有先前的方法。

2 Related Work

Self-Supervised Learning Learning without annotations requires an artificial learning task that provides supervision in lieu for training. The art and challenge of SSL lies in carefully designing these so-called pre-text tasks in order to learn powerful representations for downstream tasks. The language domain, by its discrete nature, offers straightforward ways to set up such tasks, which led to many successful unsupervised pre-training approaches for text data. Examples include word embeddings ( Mikolov et al. , 2013 ; Bojanowski et al. , 2017 ), sentence representations ( Devlin et al. , 2018 ; Liu et al. , 2019 ), and plain language models ( Mikolov et al. , 2010 ; Zaremba et al. , 2014 ). In contrast, computer vision presents greater challenges due to the continuous nature of the signal. Early attempts mimicking language approaches extracted supervisory signals from parts of an image to predict other parts, e.g . by predicting relative patch position ( Doersch et al. , 2015 ), patch re-ordering ( Noroozi and Favaro , 2016 ; Misra and Maaten , 2020 ), or inpainting ( Pathak et al. , 2016 ). Other tasks involve re-colorizing images ( Zhang et al. , 2016 ) or predicting image transformations ( Gidaris et al. , 2018 ).

【翻译】自监督学习:无标注学习需要一个人工学习任务来代替训练中的监督信号。SSL的艺术和挑战在于精心设计这些所谓的代理任务,以便为下游任务学习强大的表示。语言领域由于其离散性质,提供了设置此类任务的直接方式,这导致了许多成功的文本数据无监督预训练方法。示例包括词嵌入(Mikolov等人,2013;Bojanowski等人,2017)、句子表示(Devlin等人,2018;Liu等人,2019)和普通语言模型(Mikolov等人,2010;Zaremba等人,2014)。相比之下,计算机视觉由于信号的连续性质而面临更大的挑战。早期模仿语言方法的尝试从图像的一部分提取监督信号来预测其他部分,例如通过预测相对补丁位置(Doersch等人,2015)、补丁重新排序(Noroozi和Favaro,2016;Misra和Maaten,2020)或修复(Pathak等人,2016)。其他任务涉及重新着色图像(Zhang等人,2016)或预测图像变换(Gidaris等人,2018)。

Among these tasks, inpainting-based approaches have gathered significant interest thanks to the flexibility of the patch-based ViT architecture ( He et al. , 2021 ; Bao et al. , 2021 ; El-Nouby et al. , 2021 ). The objective is to reconstruct corrupted regions of an image, which can be viewed as a form of denoising auto-encoding and is conceptually related to the masked token prediction task in BERT pretraining ( Devlin et al. , 2018 ). Notably, He et al. ( 2021 ) demonstrated that pixel-based masked auto-encoders (MAE) can be used as strong initializations for finetuning on downstream tasks. In the following, Baevski et al. ( 2022 ; 2023 ); Assran et al. ( 2023 ) showed that predicting a learned latent space instead of the pixel space leads to more powerful, higher-level features—a learning paradigm called JEPA: “Joint-Embedding Predictive Architecture” ( LeCun , 2022 ). Recently, JEPAs have also been extended to video training ( Bardes et al. , 2024 ; Assran et al. , 2025 ).

【翻译】在这些任务中,基于修复的方法由于基于patch的ViT架构的灵活性而获得了显著关注(He等人,2021;Bao等人,2021;El-Nouby等人,2021)。目标是重建图像的损坏区域,这可以被视为一种去噪自编码形式,在概念上与BERT预训练中的掩码token预测任务相关(Devlin等人,2018)。值得注意的是,He等人(2021)证明了基于像素的掩码自编码器(MAE)可以用作下游任务微调的强初始化。接下来,Baevski等人(2022;2023);Assran等人(2023)表明,预测学习的潜在空间而不是像素空间会产生更强大、更高级的特征——这种学习范式称为JEPA:“联合嵌入预测架构”(LeCun,2022)。最近,JEPA也已扩展到视频训练(Bardes等人,2024;Assran等人,2025)。

A second line of work, closer to ours, leverages discriminative signals between images to learn visual representations. This family of methods traces its origins to early deep learning research ( Hadsell et al. , 2006 ), but gained popularity with the introduction of instance classification techniques ( Dosovitskiy et al. , 2016 ; Bojanowski and Joulin , 2017 ; Wu et al. , 2018 ). Subsequent advancements introduced contrastive objectives and information-theoretic criteria ( Hénaff et al. , 2019 ; He et al. , 2020 ; Chen and He , 2020 ; Chen et al. , 2020a ; Grill et al. , 2020 ; Bardes et al. , 2021 ), as well as self clustering-based strategies ( Caron et al. , 2018 ; Asano et al. , 2020 ; Caron et al. , 2020 ; 2021 ). More recent approaches, such as iBOT ( Zhou et al. , 2021 ), combine these discriminative losses with masked reconstruction objectives. All of these methods show the ability to learn strong features and achieve high performance on standard benchmarks like ImageNet ( Russakovsky et al. , 2015 ). However, most face challenges scaling to larger model sizes ( Chen et al. , 2021 ).

【翻译】第二类与我们工作更接近的研究利用图像间的判别信号来学习视觉表示。这一方法系列源于早期深度学习研究(Hadsell等人,2006),但随着实例分类技术的引入(Dosovitskiy等人,2016;Bojanowski和Joulin,2017;Wu等人,2018)而获得了普及。后续的进展引入了对比目标和信息论准则(Hénaff等人,2019;He等人,2020;Chen和He,2020;Chen等人,2020a;Grill等人,2020;Bardes等人,2021),以及基于自聚类的策略(Caron等人,2018;Asano等人,2020;Caron等人,2020;2021)。更近期的方法,如iBOT(Zhou等人,2021),将这些判别损失与掩码重建目标相结合。所有这些方法都显示出学习强特征并在ImageNet(Russakovsky等人,2015)等标准基准上实现高性能的能力。然而,大多数方法在扩展到更大模型规模时面临挑战(Chen等人,2021)。

Vision Foundation Models The deep learning revolution began with the AlexNet breakthrough ( Krizhevsky et al. , 2012 ), a deep convolutional neural network that outperformed all previous methods on the ImageNet challenge ( Deng et al. , 2009 ; Russakovsky et al. , 2015 ). Already early on, features learned end-to-end on the large manually-labeled ImageNet dataset were found to be highly effective for a wide range of transfer learning tasks ( Oquab et al. , 2014 ). Early work on vision foundation models then focused on architecture development, including VGG ( Simonyan and Zisserman , 2015 ), GoogleNet ( Szegedy et al. , 2015 ), and ResNets ( He et al. , 2016 ).

【翻译】视觉基础模型 深度学习革命始于AlexNet的突破(Krizhevsky等人,2012),这是一个深度卷积神经网络,在ImageNet挑战赛上超越了所有先前的方法(Deng等人,2009;Russakovsky等人,2015)。早期就发现,在大型手动标注的ImageNet数据集上端到端学习的特征对于广泛的迁移学习任务非常有效(Oquab等人,2014)。早期的视觉基础模型工作随后专注于架构开发,包括VGG(Simonyan和Zisserman,2015)、GoogleNet(Szegedy等人,2015)和ResNets(He等人,2016)。

Given the effectiveness of scaling , subsequent works explored training larger models on big datasets. Sun et al. ( 2017 ) expanded supervised training data with the proprietary JFT dataset containing 300 million labeled images, showing impressive results. JFT also enabled significant performance gains for Kolesnikov et al. ( 2020 ). In parallel, scaling was explored using a combination of supervised and unsupervised data. For instance, an ImageNet-supervised model can be used to produce pseudo-labels for unsupervised data, which then serve to train larger networks ( Yalniz et al. , 2019 ). Subsequently, the availability of large supervised datasets such as JFT also facilitated the adaptation of the transformer architecture to computer vision ( Dosovitskiy et al. , 2020 ). In particular, achieving performance comparable to that of the original vision transformer (ViT) without access to JFT requires substantial effort ( Touvron et al. , 2020 ; 2022 ). Due to the learning capacity of ViTs, scaling efforts were further extended by Zhai et al. ( 2022a ), culminating in the very large ViT-22B encoder ( Dehghani et al. , 2023 ).

【翻译】鉴于扩展的有效性,后续工作探索了在大数据集上训练更大的模型。Sun等人(2017)使用包含3亿标注图像的专有JFT数据集扩展了监督训练数据,显示出令人印象深刻的结果。JFT也为Kolesnikov等人(2020)带来了显著的性能提升。与此同时,使用监督和无监督数据组合的扩展方法也被探索。例如,ImageNet监督模型可以用于为无监督数据生成伪标签,然后用于训练更大的网络(Yalniz等人,2019)。随后,大型监督数据集如JFT的可用性也促进了transformer架构在计算机视觉中的适应(Dosovitskiy等人,2020)。特别是,在无法访问JFT的情况下实现与原始视觉transformer(ViT)相当的性能需要大量努力(Touvron等人,2020;2022)。由于ViT的学习能力,扩展努力进一步扩展(Zhai等人,2022a),最终产生了非常大的ViT-22B编码器(Dehghani等人,2023)。

Given the complexity of manually labeling large datasets, weakly-supervised training —where annotations are derived from metadata associated with images—provides an effective alternative to supervised training. Early on, Joulin et al. ( 2016 ) demonstrated that a network can be pre-trained by simply predicting all words in the image caption as targets. This initial approach was further refined by leveraging sentence structures ( Li et al. , 2017 ), incorporating other types of metadata and involve curation ( Mahajan et al. , 2018 ), and scaling ( Singh et al. , 2022 ). However, weakly-supervised algorithms only reached their full potential with the introduction of contrastive losses and the joint-training of caption representations, as exemplified by Align ( Jia et al. , 2021 ) and CLIP ( Radford et al. , 2021 ).

【翻译】鉴于手动标注大数据集的复杂性,弱监督训练——其中标注源自与图像相关的元数据——为监督训练提供了有效的替代方案。早期,Joulin等人(2016)证明了网络可以通过简单地预测图像标题中的所有单词作为目标来进行预训练。这种初始方法通过利用句子结构(Li等人,2017)、纳入其他类型的元数据并涉及策划(Mahajan等人,2018)以及扩展(Singh等人,2022)得到进一步完善。然而,弱监督算法只有在引入对比损失和标题表示的联合训练后才达到其全部潜力,如Align(Jia等人,2021)和CLIP(Radford等人,2021)所示。

This highly successful approach inspired numerous open-source reproductions and scaling efforts. OpenCLIP ( Cherti et al. , 2023 ) was the first open-source effort to replicate CLIP by training on the LAION dataset ( Schuhmann et al. , 2021 ); following works leverage pre-trained backbones by fine-tuning them in a CLIP-style manner ( Sun et al. , 2023 ; 2024 ). Recognizing that data collection is a critical factor in the success of CLIP training, MetaCLIP ( Xu et al. , 2024 ) precisely follows the original CLIP procedure to reproduce its results, whereas Fang et al. ( 2024a ) use supervised datasets to curate pretraining data. Other works focus on improving the training loss, e.g. using a sigmoid loss in SigLIP ( Zhai et al. , 2023 ), or leveraging a pre-trained image encoder ( Zhai et al. , 2022b ). Ultimately though, the most critical components for obtaining cutting-edge foundation models are abundant high-quality data and substantial compute resources. In this vein, SigLIP 2 ( Tschannen et al. , 2025 ) and Perception Encoder (PE) ( Bolya et al. , 2025 ) achieve impressive results after training on more than 40B image-text pairs. The largest PE model is trained on 86B samples with a global batch size of 131K. Finally, a range of more complex and natively multimodal approaches have been proposed; these include contrastive captioning ( Yu et al. , 2022 ), masked modeling in the latent space ( Bao et al. , 2021 ; Wang et al. , 2022b ; Fang et al. , 2023 ; Wang et al. , 2023a ), and auto-regressive training ( Fini et al. , 2024 ).

【翻译】这种高度成功的方法启发了众多开源复现和扩展工作。OpenCLIP(Cherti等人,2023)是第一个通过在LAION数据集(Schuhmann等人,2021)上训练来复制CLIP的开源努力;后续工作通过以CLIP风格的方式微调预训练骨干网络来利用它们(Sun等人,2023;2024)。认识到数据收集是CLIP训练成功的关键因素,MetaCLIP(Xu等人,2024)精确遵循原始CLIP程序来复现其结果,而Fang等人(2024a)使用监督数据集来策划预训练数据。其他工作专注于改进训练损失,例如在SigLIP中使用sigmoid损失(Zhai等人,2023),或利用预训练图像编码器(Zhai等人,2022b)。然而,获得尖端基础模型的最关键组件最终是丰富的高质量数据和大量计算资源。在这方面,SigLIP 2(Tschannen等人,2025)和感知编码器(PE)(Bolya等人,2025)在超过400亿图像-文本对上训练后取得了令人印象深刻的结果。最大的PE模型在860亿样本上训练,全局批量大小为131K。最后,已经提出了一系列更复杂和本质上多模态的方法;这些包括对比标题(Yu等人,2022)、潜在空间中的掩码建模(Bao等人,2021;Wang等人,2022b;Fang等人,2023;Wang等人,2023a)和自回归训练(Fini等人,2024)。

In contrast, relatively little work has focused on scaling unsupervised image pretraining . Early efforts include Caron et al. ( 2019 ) and Goyal et al. ( 2019 ) utilizing the YFCC dataset ( Thomee et al. , 2016 ). Further progress has been achieved by focusing on larger datasets and models ( Goyal et al. , 2021 ; 2022a ), as well as initial attempts at data curation for SSL ( Tian et al. , 2021 ). Careful tuning of the training algorithms, larger architectures, and more extensive training data lead to the impressive results of DINOv2 ( Oquab et al. , 2024 ); for the first time, an SSL model matched or surpassed open-source CLIP variants on a range of tasks. This direction has recently been further pushed by Fan et al. ( 2025 ) by scaling to larger models without data curation, or by Venkataramanan et al. ( 2025 ) using open datasets and improved training recipes.

【翻译】相比之下,专注于扩展无监督图像预训练的工作相对较少。早期的努力包括Caron等人(2019)和Goyal等人(2019)利用YFCC数据集(Thomee等人,2016)。通过关注更大的数据集和模型(Goyal等人,2021;2022a),以及SSL数据策划的初步尝试(Tian等人,2021),取得了进一步的进展。训练算法的仔细调优、更大的架构和更广泛的训练数据导致了DINOv2的令人印象深刻的结果(Oquab等人,2024);这是SSL模型首次在一系列任务上匹配或超越开源CLIP变体。这一方向最近进一步被Fan等人(2025)通过扩展到更大的模型而不进行数据策划,或被Venkataramanan等人(2025)使用开放数据集和改进的训练配方所推动。

Dense Transformer Features A broad range of modern vision applications consume dense features of pre-trained transformers, including multi-modal models ( Liu et al. , 2023 ; Beyer et al. , 2024 ), generative models ( Yu et al. , 2025 ; Yao et al. , 2025 ), 3D understanding ( Wang et al. , 2025 ), video understanding ( Lin et al. , 2023a ; Wang et al. , 2024b ), and robotics ( Driess et al. , 2023 ; Kim et al. , 2024 ). On top of that, traditional vision tasks such as detection, segmentation, or depth estimation require accurate local descriptors. To enhance the quality of SSL-trained local descriptors, a substantial body of work focuses on developing local SSL losses . Examples include leveraging spatio-temporal consistency in videos, e.g . using point track loops as training signal ( Jabri et al. , 2020 ), exploiting the spatial alignment between different crops of the same image ( Pinheiro et al. , 2020 ; Bardes et al. , 2022 ), or enforcing consistency between neighboring patches ( Yun et al. , 2022 ). Darcet et al. ( 2025 ) show that predicting clustered local patches leads to improved dense representations. DetCon ( Hénaff et al. , 2021 ) and ORL ( Xie et al. , 2021 ) perform contrastive learning on region proposals but assume that such proposals exist a priori ; this assumption is relaxed by approaches such as ODIN ( Hénaff et al. , 2022 ) and SlotCon ( Wen et al. , 2022 ). Without changing the training objective, Darcet et al. ( 2024 ) show that adding register tokens to the input sequence greatly improves dense feature maps, and recent works find this can be done without model training ( Jiang et al. , 2025 ; Chen et al. , 2025 ).

【翻译】密集Transformer特征:广泛的现代视觉应用消费预训练transformer的密集特征,包括多模态模型(Liu等人,2023;Beyer等人,2024)、生成模型(Yu等人,2025;Yao等人,2025)、3D理解(Wang等人,2025)、视频理解(Lin等人,2023a;Wang等人,2024b)和机器人技术(Driess等人,2023;Kim等人,2024)。除此之外,传统的视觉任务如检测、分割或深度估计都需要准确的局部描述符。为了提高SSL训练的局部描述符质量,大量工作专注于开发局部SSL损失。例子包括利用视频中的时空一致性,例如使用点轨迹循环作为训练信号(Jabri等人,2020),利用同一图像不同裁剪之间的空间对齐(Pinheiro等人,2020;Bardes等人,2022),或强制相邻patch之间的一致性(Yun等人,2022)。Darcet等人(2025)表明预测聚类的局部patch可以改善密集表示。DetCon(Hénaff等人,2021)和ORL(Xie等人,2021)对区域提议执行对比学习,但假设这些提议先验存在;这一假设被ODIN(Hénaff等人,2022)和SlotCon(Wen等人,2022)等方法放松了。在不改变训练目标的情况下,Darcet等人(2024)表明向输入序列添加寄存器token可以大大改善密集特征图,最近的工作发现这可以在不进行模型训练的情况下完成(Jiang等人,2025;Chen等人,2025)。


Figure 4: DINOv3 at very high resolution. We visualize dense features of DINOv3 by mapping the first three components of a PCA computed over the feature space to RGB. To focus the PCA on the subject, we mask the feature maps via background subtraction. With increasing resolution, DINOv3 produces crisp features that stay semantically meaningful. We visualize more PCAs in Sec. 6.1.1 .

【翻译】图4:超高分辨率下的DINOv3。我们通过将在特征空间上计算的PCA的前三个主成分映射到RGB来可视化DINOv3的密集特征。为了将PCA聚焦在主体上,我们通过背景减法来掩蔽特征图。随着分辨率的增加,DINOv3产生清晰的特征,保持语义的意义。我们在第6.1.1节中可视化更多的PCA。

【解析】这里展示了DINOv3在处理高分辨率图像时的能力。PCA(主成分分析)是一种降维技术,用于提取数据中最重要的信息。在这个可视化中,研究者将DINOv3提取的高维特征向量通过PCA降维到3维,然后将这3个维度分别映射到RGB颜色的红、绿、蓝通道。这样做的目的是用颜色来表示特征的不同方面,让我们能够直观地看到模型在不同空间位置提取到的特征差异。背景减法通过移除背景像素的干扰,让PCA专注于分析前景物体的特征。图4展示了一个重要现象:当输入图像分辨率越来越高时,DINOv3能够提取出更加精细和清晰的特征表示,而且这些特征在语义上仍然是有意义的。这说明模型具有良好的多尺度特征提取能力,能够在保持语义理解的同时捕获细节信息。
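
【代码示意】图4所描述的 PCA 可视化流程大致为:(可选)用前景掩码做"背景减除",在前景 patch 上拟合 PCA,取前三个主成分并归一化后映射到 RGB。下面的示意实现中,前景掩码的获取方式与归一化细节均为假设。

```python
import torch

def pca_rgb(patch_feats, h, w, fg_mask=None):
    # patch_feats: [N, D],N = h*w;fg_mask: [N] 布尔前景掩码(可选,用于"背景减除")
    x = patch_feats.float()
    fit = x[fg_mask] if fg_mask is not None else x
    mean = fit.mean(0, keepdim=True)
    # 在前景 patch 上用 SVD 求前 3 个主成分,再投影全部 patch
    _, _, vh = torch.linalg.svd(fit - mean, full_matrices=False)
    proj = (x - mean) @ vh[:3].T                              # [N, 3]
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-6)
    rgb = proj.reshape(h, w, 3)
    if fg_mask is not None:
        rgb = rgb * fg_mask.reshape(h, w, 1)                  # 背景置零
    return rgb                                                # 可直接作为 (h, w, 3) 图像显示

feats = torch.randn(60 * 80, 1024)                            # 示意:80x60 的特征图
print(pca_rgb(feats, 60, 80).shape)
```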

A recent trend are distillation-based, " agglomerative " methods that combine information from multiple image encoders with varying in global and local feature quality, trained using different levels of supervision ( Ranzinger et al. , 2024 ; Bolya et al. , 2025 ): AM-RADIO ( Ranzinger et al. , 2024 ) combines the strengths of the fully-supervised SAM ( Kirillov et al. , 2023 ), the weakly-supervised CLIP, and the self-supervised DINOv2 into a unified backbone. The Perception Encoder ( Bolya et al. , 2025 ) similarly distills SAM(v2) into a specialized dense variant called PEspatial. They use an objective enforcing cosine similarity between student and teacher patches to be high, where their teacher is trained with mask annotations. Similar losses were shown to be effective in the context of style transfer, by reducing the inconsistency between the Gram matrices of feature dimensions ( Gatys et al. , 2016 ; Johnson et al. , 2016 ; Yoo et al. , 2024 ). In this work, we adopt a Gram objective to regularize cosine similarity between student and teacher patches, favoring them being close. In our case, we use earlier iterations of the SSL model itself as the teacher, demonstrating that early-stage SSL models effectively guides SSL training for both global and dense tasks.

【翻译】最近的趋势是基于蒸馏的"聚合"方法,这些方法结合来自多个图像编码器的信息,这些编码器在全局和局部特征质量上有所不同,使用不同级别的监督进行训练(Ranzinger等人,2024;Bolya等人,2025):AM-RADIO(Ranzinger等人,2024)将完全监督的SAM(Kirillov等人,2023)、弱监督的CLIP和自监督的DINOv2的优势结合到一个统一的骨干网络中。感知编码器(Bolya等人,2025)类似地将SAM(v2)蒸馏为一个专门的密集变体,称为PEspatial。它们使用一个目标来强制学生和教师补丁之间的余弦相似度很高,其中教师使用掩码标注进行训练。类似的损失在风格迁移的背景下被证明是有效的,通过减少特征维度的Gram矩阵之间的不一致性(Gatys等人,2016;Johnson等人,2016;Yoo等人,2024)。在这项工作中,我们采用Gram目标来正则化学生和教师补丁之间的余弦相似度,偏向于使它们接近。在我们的情况下,我们使用SSL模型本身的早期迭代作为教师,证明早期阶段的SSL模型有效地指导SSL训练,用于全局和密集任务。

【解析】这段话介绍了一种新兴的模型训练策略——聚合式蒸馏方法。知识蒸馏是一种模型压缩和知识传递技术,通常让小模型(学生)学习大模型(教师)的知识。但这里提到的"聚合"方法有所不同,它不是简单的模型压缩,而是将多个具有不同特长的预训练模型的知识整合到一个新的统一模型中。AM-RADIO方法很有代表性:它整合了三个不同训练方式的模型——SAM专长于分割任务(完全监督),CLIP擅长理解图文关系(弱监督),DINOv2在自监督学习方面表现出色。通过整合这些模型的优势,可以得到一个在多种任务上都表现良好的统一骨干网络。余弦相似度衡量两个向量夹角的指标,值越接近1说明两个向量越相似。在蒸馏过程中,通过最大化学生和教师网络对应补丁特征的余弦相似度,可以让学生网络学到教师网络的特征表示能力。Gram矩阵捕获特征之间的相关性模式,特别是在风格迁移任务中用于保持纹理和风格信息。本工作创新性地将早期训练阶段的同一个SSL模型作为教师,指导后续训练阶段,这种自我指导的策略既保持了全局任务的性能,又改善了密集预测任务的效果。

Other works focus on post-hoc improvements to the local features of SSL-trained models. For example, Ziegler and Asano ( 2022 ) fine-tune a pre-trained model with a dense clustering objective; similarly, Salehi et al. ( 2023 ) fine-tune by aligning patch features temporally, in both cases enhance the quality of local features. Closer to us, Pariza et al. ( 2025 ) propose a patch-sorting based objective to encourage the student and teacher to produce features with consistent neighbor ordering. Without finetuning, STEGO ( Hamilton et al. , 2022 ) learns a non-linear projection on top of frozen SSL features to form compact clusters and amplify correlation patterns. Alternatively, Simoncini et al. ( 2024 ) augment self-supervised features by concatenating gradients from different self-supervised objectives to frozen SSL features. Recently, Wysoczańska et al. ( 2024 ) show that noisy feature maps are significantly improved through a weighted average of patches.

【翻译】其他工作专注于对SSL训练模型的局部特征进行事后改进。例如,Ziegler和Asano(2022)使用密集聚类目标对预训练模型进行微调;类似地,Salehi等人(2023)通过在时间上对齐patch特征来进行微调,这两种情况都增强了局部特征的质量。与我们更接近的是,Pariza等人(2025)提出了基于patch排序的目标,以鼓励学生和教师产生具有一致邻居排序的特征。在不进行微调的情况下,STEGO(Hamilton等人,2022)在冻结的SSL特征之上学习非线性投影,以形成紧密的聚类并放大相关模式。另外,Simoncini等人(2024)通过将来自不同自监督目标的梯度连接到冻结的SSL特征来增强自监督特征。最近,Wysoczańska等人(2024)表明通过patch的加权平均可以显著改善噪声特征图。

【解析】这段话讨论的是改善自监督学习(SSL)模型局部特征质量的后处理方法。传统的SSL方法在训练完成后,其提取的局部特征(即patch级别的特征)可能存在质量不够理想的问题,因此研究者们开发了各种后处理技术来解决这一问题。密集聚类目标是指在每个空间位置都进行聚类操作,而不仅仅是全局特征聚类,这样可以让相似的局部区域在特征空间中更加紧密。时间对齐是利用视频序列中相同物体在不同帧中的对应关系来约束特征学习,确保同一物体的特征在时间维度上保持一致性。patch排序方法通过确保教师网络和学生网络对相同图像区域的相邻关系判断保持一致,来提高特征的空间连贯性。STEGO方法采用了一种无需重新训练的策略,它在已经训练好的SSL特征基础上添加一个可学习的非线性映射层,这个映射层专门用于增强特征的局部聚类性质和空间相关性。梯度连接技术通过融合不同自监督学习目标产生的梯度信息来丰富特征表示,这种方法能够综合多种自监督信号的优势。加权平均方法则是通过对邻近patch的特征进行加权融合来减少特征噪声,提高特征图的平滑性和一致性。

Related, but not specific to SSL, some recent works generate high-resolution feature maps from ViT feature maps ( Fu et al. , 2024 ), which are often low-resolution due to patchification of images. In contrast with this body of work, our models natively deliver high-quality dense feature maps that remain stable and consistent across resolutions, as shown in Fig. 4 .

【翻译】相关但不特定于SSL的一些最近工作从ViT特征图生成高分辨率特征图(Fu等人,2024),由于图像的patch化,这些特征图通常是低分辨率的。与这些工作相比,我们的模型原生地提供高质量的密集特征图,这些特征图在不同分辨率下保持稳定和一致,如图4所示。

【解析】这段话强调了DINOv3相对于其他方法的优势。Vision Transformer (ViT)由于其patch化的处理方式,即将输入图像分割成固定大小的patch块,导致输出的特征图分辨率相对较低。例如,如果输入图像是224×224像素,patch大小是16×16,那么得到的特征图只有14×14的空间分辨率。为了解决这个问题,一些研究工作开发了上采样或插值技术来从低分辨率的ViT特征图重建高分辨率特征图。然而,这类方法存在几个问题:首先,它们需要额外的计算开销来进行特征图重建;其次,重建过程可能引入伪影或失真;最重要的是,这些方法在处理不同分辨率输入时可能表现不一致。相比之下,DINOv3通过改进的架构设计和训练策略,能够直接产生高质量的密集特征图,无需后处理步骤。这些特征图不仅在空间上具有丰富的细节,而且在面对不同输入分辨率时能够保持特征质量的稳定性和一致性。

定性分析

We start by analyzing DINOv3’s dense feature maps qualitatively. To this end, we project the dense feature space into 3 dimensions using principal component analysis (PCA), and map the resulting 3D space into RGB. Because of the sign ambiguity in PCA (eight variants) and the arbitrary mapping between principal components and colors (six variants), we explore all combinations and report the visually most compelling one. The resulting visualization is shown in Fig. 13 . Compared to other vision backbones, it can be seen that the features of DINOv3 are sharper, containing much less noise, and showing superior semantical coherence.

【翻译】我们首先定性分析DINOv3的密集特征图。为此,我们使用主成分分析(PCA)将密集特征空间投影到3维,并将得到的3D空间映射到RGB。由于PCA中的符号模糊性(八种变体)和主成分与颜色之间的任意映射(六种变体),我们探索了所有组合并报告了视觉上最引人注目的一种。结果可视化如图13所示。与其他视觉骨干网络相比,可以看出DINOv3的特征更加清晰,包含的噪声更少,并显示出卓越的语义一致性。
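
【代码示意】PCA 的符号不确定性(每个主成分可整体取反,共 2³=8 种)与主成分到 RGB 通道的任意排列(3!=6 种)共产生 48 种可视化组合,原文即从中挑选视觉效果最好的一种。下面的小片段仅用于说明这一计数方式。

```python
from itertools import permutations, product

# 3 个主成分到 RGB 的映射:3! = 6 种通道排列,2^3 = 8 种符号翻转,共 48 种组合
variants = [(perm, signs)
            for perm in permutations(range(3))
            for signs in product((1, -1), repeat=3)]
print(len(variants))   # 48
```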


Figure 13: Comparison of dense features. We compare several vision backbones by projecting their dense outputs using PCA and mapping them to RGB. From left to right: SigLIP 2 ViT-g/16, PEspatial ViT-G/14, DINOv2 ViT-g/14 with registers, DINOv3 ViT-7B/16. Images are forwarded at resolution 1280×960 for models using patch 16 and 1120×840 for models using patch 14, i.e. all feature maps have size 80×60.

【翻译】图13:密集特征比较。我们通过使用PCA投影其密集输出并将其映射到RGB来比较几个视觉骨干网络。从左到右:SigLIP 2 ViT-g/16,PEspatial ViT-G/14,带寄存器的DINOv2 ViT-g/14,DINOv3 ViT-7B/16。对于使用patch 16的模型,图像以1280×960分辨率前向传播,对于patch 14的模型为1120×840,即所有特征图的大小均为80×60。

Conclusion

DINOv3 represents a significant advancement in the field of self-supervised learning, demonstrating the potential to revolutionize the way visual representations are learned across various domains. By scaling dataset and model size through meticulous data preparation, design, and optimization, DINOv3 showcases the power of self-supervised learning to eliminate the dependency on manual annotations. The introduction of the Gram anchoring method effectively mitigates the degradation of dense feature maps over extended training periods, ensuring robust and reliable performance.

【翻译】DINOv3代表了自监督学习领域的重大进展,展示了在各个领域彻底改变视觉表示学习方式的潜力。通过精心的数据准备、设计和优化来扩展数据集和模型规模,DINOv3展示了自监督学习消除对手动标注依赖的强大能力。Gram锚定方法的引入有效缓解了密集特征图在长期训练过程中的退化,确保了稳健可靠的性能。

Together with the implementation of post-hoc polishing strategies, such as high-resolution post-training and distillation, we achieve state-of-the-art performance across a wide range of visual tasks with no fine-tuning of the image encoder. The DINOv3 suite of vision models not only sets new benchmarks but also offers a versatile solution across various resource constraints, deployment scenarios, and application use cases. The progress made with DINOv3 is a testament to the promise of self-supervised learning in advancing the state of the art in computer vision and beyond.

【翻译】结合事后打磨策略的实施,如高分辨率后训练和蒸馏,我们在广泛的视觉任务中取得了最先进的性能,而无需对图像编码器进行微调。DINOv3视觉模型套件不仅设立了新的基准,还为各种资源约束、部署场景和应用用例提供了多功能解决方案。DINOv3取得的进展证明了自监督学习在推进计算机视觉及其他领域最先进技术方面的前景。

原文地址

https://blog.csdn.net/weixin_46248968/article/details/150449810
