This guide walks through the quickstart steps for using the VLMEvalKit to evaluate pre-trained vision-language models. Follow these instructions to set up the environment and begin using the evaluation kit.
- Create a virtual environment:

  It's recommended to create and activate a virtual environment for installing dependencies:

  ```shell
  python3 -m venv venv
  source venv/bin/activate  # For Linux/macOS
  ```

- Install dependencies:

  Once the virtual environment is activated, install the required dependencies using:

  ```shell
  pip install -e .
  ```

- Install `flash-attn` (required for Qwen2-VL evaluation):

  For optimized attention mechanisms, install `flash-attn`:

  ```shell
  pip install wheel
  pip install flash-attn --no-cache-dir
  ```

- Ensure the `.env` file contains the `OPENAI_API_KEY` in the following format:

  ```
  OPENAI_API_KEY=your_openai_api_key
  ```
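As a quick sanity check before launching an evaluation, you can confirm the key is actually present in the `.env` file. The tiny parser below is only an illustration, not VLMEvalKit's own loader:

```python
import os

def read_env_key(path=".env", key="OPENAI_API_KEY"):
    """Parse a simple KEY=value .env file; return the value for `key`, or None."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks and comments; split only on the first '='.
            if line and not line.startswith("#") and "=" in line:
                k, v = line.split("=", 1)
                if k.strip() == key:
                    return v.strip()
    return None
```

If this returns `None`, the evaluation backends that rely on the OpenAI API will fail, so fix the `.env` file first.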
You will need to download and set the LoRA checkpoint before evaluation. You can download the checkpoints using the `download_checkpoint.py` script.
- MiniCPM-Llama3-V-2_5: Update the checkpoint variable in `vlmeval/vlm/minicpm_v.py`
- Qwen2-VL: Update the checkpoint variable in `vlmeval/vlm/qwen2_vl/model.py`
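The edit itself is a one-line change in the model wrapper. A hedged illustration of what it might look like (the exact variable name in your copy of the file may differ, and the path below is a placeholder for your downloaded checkpoint):

```python
# Illustrative edit inside e.g. vlmeval/vlm/minicpm_v.py
# (check the actual variable name in your checkout).
checkpoint = "/path/to/your/lora_checkpoint"  # placeholder: replace with the downloaded LoRA checkpoint path
```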
You can run the script with `python` or `torchrun` for any of the three models: MiniCPM-Llama3-V-2_5, Qwen2-VL-7B-Instruct, or Qwen2-VL-2B-Instruct:
```shell
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
python run.py --data MME --model MiniCPM-Llama3-V-2_5

# When running with `torchrun`, one VLM instance is instantiated per GPU, which can speed up inference.
# However, this is only suitable for VLMs that consume small amounts of GPU memory.
# On a node with 2 GPUs
torchrun --nproc-per-node=2 run.py --data MME --model MiniCPM-Llama3-V-2_5
```
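Under `torchrun`, each launched process reads its rank from the environment and evaluates a slice of the dataset. The pattern can be sketched as follows; this is a simplified illustration of per-rank data sharding, not VLMEvalKit's actual code:

```python
import os

def shard(items, rank, world_size):
    """Give each rank every world_size-th item: a common interleaved split."""
    return items[rank::world_size]

# torchrun sets RANK and WORLD_SIZE for every process it launches;
# fall back to a single process when run with plain `python`.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

samples = list(range(10))  # stand-in for the evaluation samples
my_samples = shard(samples, rank, world_size)
```

With `--nproc-per-node=2`, rank 0 would process items 0, 2, 4, ... and rank 1 items 1, 3, 5, ..., which is why this mode helps only when a full model copy fits on each GPU.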