Qwen2-VL: A Full-Pipeline Walkthrough

Author: 张一极
2026-02-06 22:37:43

Qwen2-VL keeps the classic Vision Encoder + Adapter + LLM architecture, but refines several details.

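The overall flow: the vision encoder (VE) extracts features; the Adapter maps them into the space the LLM works in and adjusts their scale and distribution; the LLM then produces the result through its usual output head.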

1. Vision encoder + Adapter + autoregressive output

Base structure: a ViT (Vision Transformer) with roughly 675 million parameters, initialized from the ViT of DFN (Data Filtering Network).

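Because the DFN ViT was trained and tuned on large-scale internet data, initializing from it gives the model a more robust visual representation from the start.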

Qwen2-VL is exposed to a corpus of around 600 billion tokens. The LLM component of Qwen2-VL is initialized using the parameters from Qwen2 (Yang et al., 2024), while the vision encoder of Qwen2-VL is initialized with the ViT derived from DFN. However, the fixed position embedding in the original DFN’s ViT (Fang et al., 2023) is replaced by RoPE-2D.
This pre-training phase primarily focuses on learning image-text relationships, textual content recognition within images through OCR, and image classification tasks. Such foundational training is instrumental in enabling the model to develop a robust understanding of core visual-textual correlations and alignments.

Beyond the DFN initialization, Qwen also changed the structure of the DFN ViT: the original fixed position embedding is replaced with RoPE-2D.

After the replacement, part of the parameters can still be loaded as-is, for example:

patch_embed.* blocks..attn.qkv.weight.* blocks..attn.proj.weight.* blocks..mlp.* blocks..norm.*

Position-related parameters, on the other hand, have to be re-initialized.
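The split between reusable and position-related parameters can be sketched as a simple key filter. This is only an illustration: the glob patterns, the `split_keys` helper, and the example key names (including `pos_embed`) are assumptions, and real checkpoint keys may be named differently.

```python
from fnmatch import fnmatchcase

# Hypothetical glob patterns for weights that carry over unchanged.
REUSABLE_PATTERNS = [
    "patch_embed.*",
    "blocks.*.attn.qkv.*",
    "blocks.*.attn.proj.*",
    "blocks.*.mlp.*",
    "blocks.*.norm*",
]

def split_keys(keys):
    """Partition checkpoint keys into (reusable, needs_reinit)."""
    reusable, needs_reinit = [], []
    for key in keys:
        if any(fnmatchcase(key, pattern) for pattern in REUSABLE_PATTERNS):
            reusable.append(key)
        else:
            needs_reinit.append(key)
    return reusable, needs_reinit

# Made-up key names, for illustration only.
example_keys = [
    "patch_embed.proj.weight",
    "pos_embed",
    "blocks.0.attn.qkv.weight",
    "blocks.0.norm1.weight",
]
reusable, needs_reinit = split_keys(example_keys)
```

In this toy example only `pos_embed` ends up in `needs_reinit`; everything matched by a pattern is treated as loadable.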


The attention flow in the original ViT is:

\begin{aligned} \text{(1)}\quad & q, k, v = \mathrm{qkv}(x) \\ \text{(2)}\quad & \mathrm{attn} = \mathrm{softmax}\left(\frac{q k^{\top}}{\sqrt{d}}\right) \end{aligned}

After the replacement it becomes:

\begin{aligned} \text{(3)}\quad & q, k = \mathrm{ApplyRoPE2D}(q, k, \mathrm{coords}) \\ \text{(4)}\quad & \mathrm{attn} = \mathrm{softmax}\left(\frac{q k^{\top}}{\sqrt{d}}\right) \end{aligned}
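To make step (3) concrete, here is a minimal pure-Python sketch of one common way to build a 2D RoPE: the first half of the channels is rotated by the row index and the second half by the column index. The channel layout, the base of 10000, and the helper names are assumptions for illustration, not necessarily the exact scheme Qwen2-VL uses.

```python
import math

def rotate_pairs(vec, pos, base=10000.0):
    """Apply 1D RoPE to `vec` (even length) at integer position `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation angle for this channel pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def apply_rope_2d(vec, row, col):
    """Rotate the first half of `vec` by the row, the second half by the column."""
    half = len(vec) // 2
    return rotate_pairs(vec[:half], row) + rotate_pairs(vec[half:], col)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Same relative offset (+1 row, +2 columns) at two different absolute positions.
score_a = dot(apply_rope_2d(q, 0, 0), apply_rope_2d(k, 1, 2))
score_b = dot(apply_rope_2d(q, 5, 7), apply_rope_2d(k, 6, 9))
```

The two scores agree up to floating-point error, which shows that the attention logit depends only on the relative row and column offset between patches, not on where they sit in a fixed grid.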

2. Training details

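The first training stage consumes about 600 billion tokens, largely to let the DFN-initialized weights adapt to RoPE. With RoPE-2D, the model no longer depends on a predefined position grid; the rotary angles are computed from each patch's row and column in the actual image, whatever its height and width. Because stage one trains on a large number of images of different sizes, the weights inherited from DFN learn to keep semantics consistent across variable-length visual token sequences.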
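As a small illustration of positions following the actual image size rather than a fixed grid, the sketch below enumerates patch coordinates for any resolution. The helper `patch_coords` is hypothetical, and it assumes the image sides are already multiples of the patch size (14, as quoted from the paper elsewhere in this post).

```python
def patch_coords(height, width, patch=14):
    """Return (row, col) for every patch of an image, in row-major order."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return [(r, c) for r in range(rows) for c in range(cols)]

small = patch_coords(28, 42)    # 2 x 3 patches
large = patch_coords(224, 448)  # 16 x 32 patches
```

Both images get a coordinate list of their own length, so the sequence length varies with resolution while the position scheme stays the same.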

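A 3D convolution (depth=2) is also introduced for video input: two consecutive frames share one set of patches, so the model can take in more video frames without increasing the sequence length.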

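Each $2 \times 14 \times 14$ (frames $\times$ height $\times$ width) region is merged into a single visual token. This captures spatial features and also naturally folds in the changes along the time dimension.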
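The effect of depth=2 on sequence length can be checked with a little arithmetic. The sketch below only counts the tubes a clip is cut into; the function names are made up, and it assumes frame count and image sides divide evenly.

```python
def video_patch_grid(frames, height, width, patch=14, depth=2):
    """Grid size (t, h, w) after cutting a clip into depth x patch x patch tubes."""
    if frames % depth or height % patch or width % patch:
        raise ValueError("clip dimensions must divide evenly into tubes")
    return frames // depth, height // patch, width // patch

def tube_count(frames, height, width, patch=14, depth=2):
    """Number of visual tokens before any further spatial merging."""
    t, h, w = video_patch_grid(frames, height, width, patch, depth)
    return t * h * w

per_frame = tube_count(16, 224, 224, depth=1)  # every frame patched on its own
paired = tube_count(16, 224, 224, depth=2)     # two frames per tube
```

For the same 16-frame clip, pairing frames halves the token count, which is another way of saying that twice as many frames fit into the same sequence length.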

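$2 \times 2$ spatial merging then compresses each block of 2 × 2 adjacent tokens into one token. Video input is compressed substantially as a result, which lets the model handle videos of 20 minutes or longer.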

Furthermore, to reduce the visual tokens of each image, a simple MLP layer is employed after the ViT to compress adjacent 2 × 2 tokens into a single token, with the special <|vision_start|> and <|vision_end|> tokens placed at the beginning and end of the compressed visual tokens. As a result, an image with a resolution of 224 × 224, encoded with a ViT using patch_size=14, will be compressed to 66 tokens before entering LLM.
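The 66-token figure in the quote can be reproduced with a short sketch. `merge_2x2` only concatenates the four neighbouring features, which would be the input to the MLP; the MLP itself is left out, and both helper names are hypothetical.

```python
def merge_2x2(grid):
    """Concatenate each 2x2 block of features into one cell.

    grid[r][c] is a feature list; the result has half the rows and columns.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid sides must be even")
    return [
        [grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]

def image_token_count(height, width, patch=14):
    """Visual tokens seen by the LLM, including the two special tokens."""
    rows, cols = height // patch, width // patch
    grid = [[[0.0] for _ in range(cols)] for _ in range(rows)]  # dummy 1-dim features
    merged = merge_2x2(grid)
    return len(merged) * len(merged[0]) + 2  # <|vision_start|> and <|vision_end|>

tokens_224 = image_token_count(224, 224)
```

A 224 × 224 image gives 16 × 16 = 256 patches, 8 × 8 = 64 merged tokens, and 66 once the two special tokens are added, matching the quoted number.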

Finally, the LLM generates the output tokens autoregressively.

3. Training pipeline

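Qwen2-VL is trained in three stages. The two pre-training stages together process about 1.4 trillion tokens (roughly 600 billion plus 800 billion).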

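Stage 1: ViT pre-training

Goal: improve the ViT's semantic understanding and align it with the LLM.
Operations:
- Freeze the LLM (Qwen2).
- Train the ViT (initialized from DFN, with RoPE-2D swapped in).
Data: about 600 billion tokens, mainly large-scale image-text pairs, with a focus on basic image-text associations and OCR text recognition.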

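Stage 2: Joint pre-training

Goal: full-parameter training to improve overall multimodal capability.
Operations: unfreeze all parameters and train the ViT and the LLM together.
Data: about 800 billion tokens.
- A broader data mix: interleaved image-text articles, OCR data, visual question answering (VQA), video dialogue, and so on.
- Additional multi-task data to improve general capability.
Result: in this stage the model gains finer-grained image-text understanding and the ability to extrapolate to long sequences.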

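Stage 3: Instruction fine-tuning

Goal: improve instruction following and the conversational experience.
Operations:
- Freeze the ViT.
- Fine-tune the LLM.
Data: multimodal dialogue, document parsing, multi-image comparison, video-stream dialogue, and so on.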
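The three stages can be summarized as a small freeze/unfreeze schedule. This is a sketch of the description above, not training code: the field names are made up, only the ViT and LLM are modelled because the text does not say how the Adapter is treated, and the stage 3 token count is left as `None` because it is not given.

```python
MODULES = {"vit", "llm"}

STAGES = [
    {"name": "vit_pretraining", "trainable": {"vit"}, "tokens_billion": 600},
    {"name": "joint_pretraining", "trainable": {"vit", "llm"}, "tokens_billion": 800},
    {"name": "instruction_finetuning", "trainable": {"llm"}, "tokens_billion": None},
]

def frozen_modules(stage):
    """Modules whose parameters stay fixed during the given stage."""
    return MODULES - stage["trainable"]

# Total over the stages that state a token budget (the two pre-training stages).
pretraining_tokens_billion = sum(
    s["tokens_billion"] for s in STAGES if s["tokens_billion"] is not None
)

for stage in STAGES:
    print(stage["name"], "frozen:", sorted(frozen_modules(stage)))
```

Summing the two stated budgets gives 1400 billion tokens, consistent with the roughly 1.4 trillion quoted for pre-training.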