Performance

A. MLLMs Evaluation Results

To evaluate MLCD within multimodal large language models (MLLMs), we replaced the CLIP vision tower in LLaVA-NeXT with MLCD and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and LLaVA-NeXT-Data for structured fine-tuning. At the same ViT-L-14-336px scale, MLCD outperforms CLIP on 15 of the 17 benchmarks reported below (all except POPE and TextVQA-Val), which supports its effectiveness as an MLLM vision encoder.

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| HF:MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| HF:MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| HF:MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :--- | :---: | :---: |
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| GQA | 64.17 | 63.31 |
| ScienceQA-Img | 78.09 | 76.35 |
| InfoVQA-Val | 43.48 | 38.88 |
| MMBenchCN-Dev | 74.83 | 72.51 |
| MMBenchEN-Dev | 76.37 | 74.57 |
| SeedBench | 68.20 | 66.80 |
| SeedBench-Img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| POPE | 88.69 | 88.83 |
| ChartQA | 67.84 | 66.52 |
| DocVQA-Val | 76.46 | 75.21 |
| TextVQA-Val | 61.69 | 62.47 |
| OCRBench | 531 | 525 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
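
As a quick sanity check on the head-to-head comparison above, the short sketch below tallies which vision tower leads on each benchmark. The scores are copied from the MLCD vs. CLIP table; the `scores` dictionary and the `compare` helper are illustrative names and not part of this repository.

```python
# Scores copied from the MLCD vs. CLIP comparison table (ViT_L_14_336px, Qwen2.5-7B).
# Each entry is (MLCD score, CLIP score); higher is better on every benchmark.
scores = {
    "AI2D": (76.98, 73.15),
    "GQA": (64.17, 63.31),
    "ScienceQA-Img": (78.09, 76.35),
    "InfoVQA-Val": (43.48, 38.88),
    "MMBenchCN-Dev": (74.83, 72.51),
    "MMBenchEN-Dev": (76.37, 74.57),
    "SeedBench": (68.20, 66.80),
    "SeedBench-Img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "POPE": (88.69, 88.83),
    "ChartQA": (67.84, 66.52),
    "DocVQA-Val": (76.46, 75.21),
    "TextVQA-Val": (61.69, 62.47),
    "OCRBench": (531, 525),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
}


def compare(scores):
    """Return (benchmarks where MLCD leads, benchmarks where CLIP leads)."""
    mlcd_leads = [name for name, (mlcd, clip) in scores.items() if mlcd > clip]
    clip_leads = [name for name, (mlcd, clip) in scores.items() if clip > mlcd]
    return mlcd_leads, clip_leads


mlcd_leads, clip_leads = compare(scores)
print(f"MLCD leads on {len(mlcd_leads)} of {len(scores)} benchmarks")
print("CLIP leads on:", ", ".join(clip_leads))
```

The two benchmarks where CLIP stays ahead differ by less than one point, so they are best read as near-ties rather than clear losses.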

B. Linear Probe Evaluation Results

We compare CLIP and MLCD on the ViT_L_14_336px architecture using linear probe evaluations across a range of datasets. A linear probe freezes the pre-trained model's weights and trains only a linear classifier on top of its features, which measures how well the learned representations transfer to different tasks.
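
The linear probe protocol described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the evaluation code used for the reported numbers: `frozen_encoder` is a hypothetical stand-in for a frozen backbone such as MLCD, the data is a synthetic two-class toy set, and the probe is a plain logistic regression trained with full-batch gradient descent.

```python
import math
import random


def frozen_encoder(point):
    """Hypothetical stand-in for a frozen backbone: a fixed map with no trainable weights."""
    a, b = point
    return [a + b, a - b]


def make_split(n, rng):
    """Toy two-class data: class 0 clusters near (-1, -1), class 1 near (1, 1)."""
    data = []
    for i in range(n):
        label = i % 2
        center = 1.0 if label == 1 else -1.0
        point = (rng.gauss(center, 0.3), rng.gauss(center, 0.3))
        data.append((point, label))
    return data


def predict_proba(weights, bias, features):
    """Probability of class 1 under the linear classifier."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(train, epochs=100, lr=0.5):
    """Fit logistic regression on frozen features; only weights and bias are updated."""
    feats = [(frozen_encoder(p), y) for p, y in train]  # the encoder is never updated
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for f, y in feats:
            err = predict_proba(weights, bias, f) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        weights = [w - lr * g / len(feats) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(feats)
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of held-out points the probe classifies correctly."""
    hits = sum(
        (predict_proba(weights, bias, frozen_encoder(p)) >= 0.5) == bool(y)
        for p, y in data
    )
    return hits / len(data)


rng = random.Random(0)
train, test = make_split(200, rng), make_split(100, rng)
weights, bias = train_linear_probe(train)
probe_accuracy = accuracy(weights, bias, test)
print(f"linear probe accuracy: {probe_accuracy:.2%}")
```

Because the encoder is fixed, the probe's accuracy depends only on how linearly separable the frozen features already are, which is why this protocol is used to compare representation quality across backbones.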

The results of the ImageNet linear probe are as follows:

| Model Name | ImageNet Linear Probe | Hugging Face |
| :--- | :---: | :--- |
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :--- | :---: | :---: |
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| ImageNet | 86.10 | 85.40 |

Convert PyTorch Checkpoint to Hugging Face Format

python convert_vit_bigG_14_rope2d_to_hf.py \
--pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
--checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
--image_size 336