mlcd

Performance

A. MLLMs Evaluation Results

To evaluate MLCD's performance within multimodal large language models (MLLMs), we replaced the CLIP vision tower in LLaVA-NeXT with the MLCD model and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and LLaVA-NeXT-Data for structured fine-tuning. With the same ViT-L-14-336px architecture, MLCD scores higher than CLIP on 15 of the 17 benchmarks in the side-by-side comparison below (all except POPE and TextVQA-Val), which supports its effectiveness as an MLLM vision tower.

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|:---|:---:|---:|---:|---:|---:|---:|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| HF:MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| HF:MLCD (ViT-bigG-14-336px) | √ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| HF:MLCD (ViT-bigG-14-448px) | √ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

| Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---|---:|---:|
| LLM | Qwen2.5-7B | Qwen2.5-7B |
| AI2D | 76.98 | 73.15 |
| GQA | 64.17 | 63.31 |
| ScienceQA-Img | 78.09 | 76.35 |
| InfoVQA-Val | 43.48 | 38.88 |
| MMBenchCN-Dev | 74.83 | 72.51 |
| MMBenchEN-Dev | 76.37 | 74.57 |
| SeedBench | 68.20 | 66.80 |
| SeedBench-Img | 73.75 | 72.72 |
| MMStar | 50.98 | 48.98 |
| MMMU | 44.30 | 44.20 |
| POPE | 88.69 | 88.83 |
| ChartQA | 67.84 | 66.52 |
| DocVQA-Val | 76.46 | 75.21 |
| TextVQA-Val | 61.69 | 62.47 |
| OCRBench | 531 | 525 |
| MME(cognition) | 432 | 384 |
| MME(perception) | 1598 | 1512 |
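
The side-by-side comparison above can be summarized with a little arithmetic. The snippet below is a minimal sketch that copies the scores from that table and reports, per benchmark, the difference between MLCD and CLIP. The variable names are illustrative and not part of this repository, and the deltas mix benchmarks with different score scales, so they should only be compared within a single benchmark.

```python
# Scores copied from the comparison table above: benchmark -> (MLCD, CLIP).
scores = {
    "AI2D": (76.98, 73.15),
    "GQA": (64.17, 63.31),
    "ScienceQA-Img": (78.09, 76.35),
    "InfoVQA-Val": (43.48, 38.88),
    "MMBenchCN-Dev": (74.83, 72.51),
    "MMBenchEN-Dev": (76.37, 74.57),
    "SeedBench": (68.20, 66.80),
    "SeedBench-Img": (73.75, 72.72),
    "MMStar": (50.98, 48.98),
    "MMMU": (44.30, 44.20),
    "POPE": (88.69, 88.83),
    "ChartQA": (67.84, 66.52),
    "DocVQA-Val": (76.46, 75.21),
    "TextVQA-Val": (61.69, 62.47),
    "OCRBench": (531, 525),
    "MME(cognition)": (432, 384),
    "MME(perception)": (1598, 1512),
}

# Difference MLCD - CLIP for each benchmark, rounded to two decimals.
deltas = {name: round(mlcd - clip, 2) for name, (mlcd, clip) in scores.items()}

# Benchmarks where CLIP scores higher than MLCD.
clip_leads = [name for name, delta in deltas.items() if delta < 0]

for name, delta in deltas.items():
    print(f"{name:<16} {delta:+.2f}")
print(f"MLCD leads on {len(scores) - len(clip_leads)} of {len(scores)} benchmarks")
print("CLIP leads on:", ", ".join(clip_leads))
```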

B. Linear Probe Evaluation Results

A linear probe freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks. The tables in this section report ImageNet linear probe results for several MLCD models, followed by a per-dataset comparison of CLIP and MLCD on the ViT_L_14_336px architecture.
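
The linear probe idea can be sketched in a few lines. The example below is not the evaluation code used for these results: it substitutes a hypothetical frozen feature extractor (a fixed random projection followed by tanh) for the vision encoder, uses synthetic two-class data, and fits a logistic-regression classifier with plain gradient descent. The point is only to show the structure: the backbone is never updated, and only the linear classifier's weights are trained.

```python
import math
import random

random.seed(0)

IN_DIM, FEAT_DIM = 4, 8

# Hypothetical stand-in for a pre-trained vision encoder: a fixed random
# projection followed by tanh. These weights are never updated below.
BACKBONE = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(FEAT_DIM)]


def extract_features(x):
    """Frozen forward pass: map an input vector to a feature vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE]


def make_dataset(n):
    """Synthetic two-class data: class 1 inputs are shifted by +1 in every dim."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = [random.gauss(float(label), 0.5) for _ in range(IN_DIM)]
        data.append((extract_features(x), label))
    return data


def predict_proba(weights, bias, feats):
    """Probability of class 1 under the linear (logistic) classifier."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(data, epochs=200, lr=0.5):
    """Fit logistic regression on frozen features with batch gradient descent."""
    weights, bias = [0.0] * FEAT_DIM, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * FEAT_DIM, 0.0
        for feats, label in data:
            err = predict_proba(weights, bias, feats) - label
            grad_b += err
            for j, f in enumerate(feats):
                grad_w[j] += err * f
        bias -= lr * grad_b / len(data)
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((predict_proba(weights, bias, f) >= 0.5) == bool(y) for f, y in data)
    return hits / len(data)


train_set, test_set = make_dataset(200), make_dataset(100)
probe_w, probe_b = train_linear_probe(train_set)
test_acc = accuracy(probe_w, probe_b, test_set)
print(f"linear probe accuracy on held-out data: {test_acc:.2f}")
```

In a real evaluation the features would come from the frozen vision encoder applied to each dataset's images, and the classifier would have one output per class; the training loop keeps the same shape.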

The results of the ImageNet linear probe are as follows:

| Model Name | ImageNet Linear Probe | Hugging Face |
|:---|---:|:---|
| MLCD-ViT-B-32-224px | 79.1 | HF:MLCD-ViT-B-32-224px |
| MLCD-ViT-L-14-336px | 86.3 | HF:MLCD-ViT-L-14-336px |
| MLCD-ViT-bigG-14-224px | 87.1 | HF:MLCD-ViT-bigG-14-224px |

| Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
|:---|---:|---:|
| Food101 | 96.21 | 95.90 |
| CIFAR-10 | 99.36 | 97.90 |
| CIFAR-100 | 93.69 | 87.40 |
| Birdsnap | 88.18 | 79.90 |
| SUN397 | 87.96 | 82.20 |
| Stanford Cars | 95.16 | 91.50 |
| FGVC Aircraft | 86.38 | 71.60 |
| Describable Textures Dataset | 86.70 | 83.00 |
| Oxford-IIIT Pets | 96.27 | 95.10 |
| Caltech-101 | 97.92 | 96.00 |
| Flowers102 | 99.58 | 99.20 |
| ImageNet | 86.10 | 85.40 |

Convert PyTorch to Hugging Face

python convert_vit_bigG_14_rope2d_to_hf.py \
--pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
--checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
--image_size 336
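
If the conversion needs to be scripted, for example to keep the arguments in one place, the command above can be assembled in Python. This is a minimal sketch: the helper `build_convert_command` is hypothetical and not part of this repository, and it uses only the script name and flags shown in the command above.

```python
import shlex


def build_convert_command(checkpoint_path, dump_folder, image_size,
                          script="convert_vit_bigG_14_rope2d_to_hf.py"):
    """Return the conversion command as an argument list for subprocess.run.

    Hypothetical helper for illustration; the flags mirror the command above.
    """
    if image_size <= 0:
        raise ValueError("image_size must be a positive integer")
    return [
        "python", script,
        "--pytorch_dump_folder_path", dump_folder,
        "--checkpoint_path", checkpoint_path,
        "--image_size", str(image_size),
    ]


cmd = build_convert_command(
    checkpoint_path="MLCD_ViT_bigG_14_336px_pytorch.pt",
    dump_folder="mlcd-vit-bigG-patch14-336",
    image_size=336,
)
print(shlex.join(cmd))

# To actually run the conversion (requires the script and checkpoint on disk):
# import subprocess
# subprocess.run(cmd, check=True)
```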
