This example demonstrates how to finetune a vision-language model (VLM) using TensorZero.
To keep things simple, we'll tackle a naive task: classifying arXiv papers based on an image of the paper's first page, without providing any instructions about the taxonomy of the categories.
Before fine-tuning, the model unsurprisingly fails at this task for lack of context (e.g. producing outputs like `{"category": "Academic Paper"}`).
After fine-tuning, the model achieves high accuracy on a held-out test set (e.g. producing outputs like `{"category": "cs.CV"}`).
We provide a simple configuration in `config/tensorzero.toml`.
The configuration specifies a JSON function `classify_document` with a single variant that uses GPT-4o Mini, plus a boolean metric `correct_classification`.
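For reference, a baseline configuration along these lines might look like the sketch below. The variant name, schema path, and metric settings here are illustrative assumptions, not copied from the example; check `config/tensorzero.toml` for the actual values.

```toml
# Hypothetical sketch of the baseline configuration; names and paths are
# illustrative assumptions, not the example's actual config file.

[functions.classify_document]
type = "json"
output_schema = "functions/classify_document/output_schema.json"  # assumed path

[functions.classify_document.variants.baseline]
type = "chat_completion"
model = "openai::gpt-4o-mini"
json_mode = "strict"

[metrics.correct_classification]
type = "boolean"
optimize = "max"
level = "inference"
```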
- Install Docker.
- Install Python 3.10+.
- Generate an OpenAI API key.
We generate a dataset of 20 papers for each of arXiv's computer science (cs.*) categories.
The dataset includes an image of the paper's first page and the correct category.
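Each record pairs a page image with its true category. A minimal sketch of how such a record might be assembled in Python, assuming a base64-encoded image content block (the exact record schema lives in the dataset notebook; the field names here are assumptions):

```python
import base64


def encode_page_image(image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a hypothetical image content block with base64-encoded data."""
    data = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "mime_type": mime_type, "data": data}


# Placeholder bytes stand in for a real first-page render of a paper.
record = {
    "image": encode_page_image(b"\x89PNG..."),
    "category": "cs.CV",  # the paper's true arXiv category
}
```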
To download a pre-generated dataset, run the following commands:

```bash
wget https://assets.tensorzero.com/examples/multimodal-vision-finetuning.tar.gz
tar -xvf multimodal-vision-finetuning.tar.gz -C .
```

To generate the dataset from scratch, see `data/generate_dataset.ipynb`.
- Set the `OPENAI_API_KEY` environment variable to your OpenAI API key.
- Run `docker compose up` to start TensorZero.
- Install the Python dependencies. We recommend using `uv`: `uv sync`
- Run the notebook: `main.ipynb`
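With the gateway running, the notebook sends inference requests to TensorZero. The payload below is a rough sketch of what such a request might look like; the content-block fields and variant name are assumptions, so defer to `main.ipynb` for the actual request format.

```python
def build_inference_payload(image_url: str, variant_name: str) -> dict:
    """Sketch of a hypothetical inference request for classify_document.

    The input/message structure here is an assumption, not the example's
    verbatim API call.
    """
    return {
        "function_name": "classify_document",
        "variant_name": variant_name,
        "input": {
            "messages": [
                {
                    "role": "user",
                    "content": [{"type": "image", "url": image_url}],
                }
            ]
        },
    }


# "baseline" is a placeholder variant name for the pre-fine-tuning run.
payload = build_inference_payload("https://example.com/page1.png", "baseline")
```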
After running the notebook with the baseline variant, you can fine-tune a model using the TensorZero UI.
- Visit `http://localhost:4000`.
- Go to `Supervised Finetuning`.
- Start a fine-tuning job using `demonstration` as the metric.
> [!TIP]
>
> Most models don't support multimodal fine-tuning.
> We recommend using `gpt-4o-2024-08-06` for this example.
After completion, create a new variant in your `config/tensorzero.toml` file.
It should look like this:

```toml
[functions.classify_document.variants.finetuned]
type = "chat_completion"
model = "openai::ft:gpt-4o-2024-08-06:tensorzero::xxxxxxx"
json_mode = "strict"
retries = { num_retries = 2, max_delay_s = 10 }
```

Restart the gateway to apply the configuration changes, then re-run the notebook with the `finetuned` variant.
You'll notice that the model now achieves high accuracy on the held-out test set.
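The accuracy figure comes from comparing each JSON output's `category` field against the held-out label. A minimal sketch, using made-up sample predictions (the notebook's actual evaluation code may differ):

```python
def accuracy(predictions: list[dict], labels: list[str]) -> float:
    """Fraction of predictions whose 'category' field matches the true label."""
    correct = sum(
        pred.get("category") == label for pred, label in zip(predictions, labels)
    )
    return correct / len(labels)


# Illustrative outputs: two correct categories and one baseline-style failure.
preds = [{"category": "cs.CV"}, {"category": "cs.LG"}, {"category": "Academic Paper"}]
labels = ["cs.CV", "cs.LG", "cs.CL"]
print(accuracy(preds, labels))  # 2 of 3 correct
```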