mtmd : add ultravox audio input #13623

ngxson · 2025-05-18T21:15:01Z

Supersedes #12745

Important

Support for llama-server will be added in a separate PR.

Ultravox does not work very well with audio longer than 1 minute. The cause is not yet known.

How it works

This PR specifically targets the ultravox model, which is essentially a fine-tuned Whisper encoder plus a custom projector.

Most of the preprocessing code is copied from whisper.cpp. The preprocessor converts input PCM to a mel spectrogram with dimensions n_frames * n_mel, so it can be treated as a grayscale (1 channel) image with W=n_frames and H=n_mel.

The preprocessing code lives in mtmd-audio.cpp; the mel filter values are hard-coded for convenience.
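To make the "spectrogram as image" idea concrete, here is a minimal sketch of how the image dimensions could be derived from the PCM length. It is not the code in mtmd-audio.cpp. It assumes the standard Whisper front-end values (16 kHz sample rate, hop length of 160 samples), takes n_mel as a parameter, and ignores any padding the real preprocessor applies. The names `mel_image_shape` and `mel_spectrogram_shape` are hypothetical.

```cpp
#include <cstddef>

// Assumed Whisper front-end constants (not taken from this PR).
const std::size_t kSampleRate = 16000; // samples per second
const std::size_t kHopLength  = 160;   // samples between consecutive frames

// Hypothetical helper type: the spectrogram viewed as a 1-channel image.
struct mel_image_shape {
    std::size_t width;  // W = n_frames
    std::size_t height; // H = n_mel
};

// Idealized frame count: one frame per hop, no padding taken into account.
inline mel_image_shape mel_spectrogram_shape(std::size_t n_samples, std::size_t n_mel) {
    mel_image_shape shape;
    shape.width  = n_samples / kHopLength;
    shape.height = n_mel;
    return shape;
}
```

Under these assumptions, a 30-second clip (30 * kSampleRate samples) with 128 mel bins maps to a 3000 x 128 "image".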

Demo CLI

Supported formats: mp3, wav, flac

# use pre-quantized model
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF

# use local Llama 3.2 1B model (original model from Meta, not fine-tuned) with ultravox projector
llama-mtmd-cli -m llama3_2-1b.gguf --mmproj mmproj-ultravox-v0_5-llama-3_2-1b-f16.gguf

# run one-shot, no chat
llama-mtmd-cli -hf ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF --audio ./my_audio.mp3 -p "Transcribe this audio"

Example output:

 Running in chat mode, available commands:
   /audio <path>    load an audio
   /clear           clear the chat history
   /quit or /exit   exit the program

> /audio ../models/i-have-a-dream-30s.mp3
../models/i-have-a-dream-30s.mp3 audio loaded

> what is this
encoding audio slice...
audio slice encoded in 894 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 57 ms
encoding audio slice...
audio slice encoded in 885 ms
decoding audio batch 1/1, n_tokens_batch = 187
audio decoded (batch 1/1) in 58 ms

I have a dream that one day every valley shall be exalted and every hill and
mountain shall be made low the rough places will be made straight and the crooked
places will be made straight and the Lord shall be revealed and all shall see it
together this is our hope this is the path that I go back to the sun with this
faith we will be able to Hew out of the mountain of despair of stone of stone.

New API

The API now accepts PCM F32 as input via mtmd_bitmap_init_from_audio(). Optionally, you can check whether a given bitmap is audio by using mtmd_bitmap_is_audio().

The helper mtmd_helper_bitmap_init_from_buf/file is extended to load input file data into the correct mtmd_bitmap type (decided by the magic bytes of the file), so it will just work out-of-the-box without any changes in application code.

mtmd_input_chunk now has a new type called MTMD_INPUT_CHUNK_TYPE_AUDIO.

You can get the number of audio/image tokens that a chunk takes via the newly added mtmd_input_chunk_get_n_tokens API.

The rest of the process (encode/decode) is the same as before, so downstream applications need very few changes.

For complete changes, see tools/mtmd/mtmd-cli.cpp : https://github.com/ggml-org/llama.cpp/pull/13623/files#diff-4bfe825a05fa2d2598cc93f39aaa081605d2fd82823bd5d15e7dab72acd85e7c
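Since the new entry point takes PCM F32, an application that holds 16-bit integer samples has to convert them first. The sketch below shows one common way to do that. It assumes samples are expected in the range [-1, 1], which this PR does not state explicitly, and the helper name `pcm_s16_to_f32` is hypothetical. The resulting buffer is what would then be handed to mtmd_bitmap_init_from_audio(); check tools/mtmd/mtmd.h for the exact signature.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: convert signed 16-bit PCM to 32-bit float PCM.
// Assumption: the consumer expects samples normalized to [-1, 1].
inline std::vector<float> pcm_s16_to_f32(const std::vector<int16_t> & samples) {
    std::vector<float> out;
    out.reserve(samples.size());
    for (int16_t s : samples) {
        // Dividing by 32768 maps INT16_MIN to exactly -1.0f.
        out.push_back(static_cast<float>(s) / 32768.0f);
    }
    return out;
}
```

Files loaded through mtmd_helper_bitmap_init_from_buf/file do not need this step, because the helper handles decoding itself.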

Deprecated API

The image marker <__image__> will continue to work, but it is deprecated in favor of the new marker <__media__>, which is defined as MTMD_DEFAULT_MEDIA_MARKER.

The following 3 APIs are deprecated (they will continue to function, NO breaking change):

mtmd_image_tokens_get_n_tokens
mtmd_image_tokens_get_id
mtmd_image_tokens_get_n_pos

Their replacements simply change the prefix to mtmd_input_chunk_ :

mtmd_input_chunk_get_n_tokens
mtmd_input_chunk_get_id
mtmd_input_chunk_get_n_pos
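For the marker change described in this section, an application that builds prompts with the old marker could migrate with a simple string replacement. The following is only a sketch: the function name `migrate_media_marker` is hypothetical, and the two marker strings are written out literally here rather than taken from mtmd.h.

```cpp
#include <string>

// Hypothetical helper: rewrite the deprecated image marker to the new media marker.
inline std::string migrate_media_marker(std::string prompt) {
    const std::string old_marker = "<__image__>"; // deprecated, still accepted
    const std::string new_marker = "<__media__>"; // value of MTMD_DEFAULT_MEDIA_MARKER
    std::string::size_type pos = 0;
    while ((pos = prompt.find(old_marker, pos)) != std::string::npos) {
        prompt.replace(pos, old_marker.size(), new_marker);
        pos += new_marker.size(); // continue after the inserted marker
    }
    return prompt;
}
```

Because the old marker keeps working, this migration is optional. It mainly keeps prompts consistent once audio and image inputs are mixed.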

TODO in next PRs:

support audio input on server
move miniaudio.h and stb_image.h to mtmd_helper
add deprecation macro for mtmd_image_tokens_get_n_tokens / n_pos / id

ngxson · 2025-05-18T23:10:21Z

Ok, somehow it works magically, but the code is still nowhere near finished

Tested using first 6 seconds from https://www.youtube.com/watch?v=vP4iY1TtS3s

tools/mtmd/mtmd-audio.cpp

ngxson · 2025-05-20T22:05:52Z

With gelu_erf from #13667, this can now transcribe the full 30s of audio:

I can transcribe the audio for you. Here is the transcription:

"I have a dream that one day every valley shall be exalted and every hill and mountain shall be made low the rough places will be made plain and the crooked places will be made straight and the Lord shall be revealed and all shall see it together this is our hope this is the peace that I go back to the sun with this faith we will be able to Hew out of the mountain of despair of stone of the darkness"

Note: The original audio may have slight variations in tone and pitch, but the above transcription should be accurate.

The next step is to support input longer than 30s

ngxson · 2025-05-21T16:33:05Z

tools/mtmd/mtmd.cpp

+        if (has_audio) {
+            LOG_WRN("%s: audio input is in experimental stage and may have reduced quality:\n"
+                    "    https://github.com/ggml-org/llama.cpp/pull/13623\n", __func__);
+        }


The model hallucinates on audio longer than 1 minute, and I'm still not sure why (I haven't yet had time to try the same audio on transformers)

But I think putting a small notice here is enough for now. This is kinda experimental support; hopefully we will get gemma 3n supported soon

ggerganov · 2025-05-22T06:21:10Z

convert_hf_to_gguf.py

+        self.hparams["image_size"] = self.hparams["num_mel_bins"]
+        self.hparams["patch_size"] = self.hparams["num_mel_bins"]


Are the image_size and patch_size used in the audio encoder?

They are unused; I left them in from my first draft so that the warmup works. But yeah, I should remove them

tools/mtmd/clip.cpp

ggerganov · 2025-05-22T06:32:27Z

tools/mtmd/mtmd.h

+#define MTMD_DEFAULT_MEDIA_MARKER "<__media__>"
+
+// deprecated marker, use MTMD_DEFAULT_MEDIA_MARKER instead


We have such constants in llama.h and ggml.h, but we eventually have to start moving those behind API calls. It's more future-proof.

Good idea! I added it in 107790a

ggerganov

The preprocessor will convert input PCM to mel spectrogram with dimension of n_frames * n_mel, so it can be considered as a gray scale (1 channel) image with W=n_frames and H=n_mel

This is a neat idea. Do you think it would be compatible with other audio models or is this a lucky coincidence for this architecture? I guess the question is if all audio encoders work with 2D spectrograms.

ngxson · 2025-05-22T14:14:22Z

I have seen so far just 2 types of model:

whisper-based models (used by ultravox, qwen2-audio, phi-4-mm), which use a 2D mel spectrogram as input. Since many models work this way, the current impl is quite biased toward whisper 😂
quantized residual vector based models (mimi encoder, gemma 3n), which accept raw PCM F32 as input, so technically the input is a 1D image (W=n_samples and H=1)

So overall, I think this system should work well for most audio models

I'll resolve the 2 comments a bit later today, and will merge it after that. Thanks for reviewing this!

ngxson · 2025-05-22T16:10:50Z

Ok so I ended up adding a prefix clip.audio which should allow both audio + vision encoders to coexist in the same mmproj

The ultravox GGUFs on ggml-org were also updated to reflect this change.

Tested the conversion script with gemma 3 to make sure that it doesn't produce a broken mmproj file

I also ran a test to make sure this doesn't accidentally break any existing vision models. Merging this PR once the CI is green 🤞

OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0

zhouwg · 2025-05-23T11:12:09Z

truly AI expert, ......genius programmer, another gg!

ngxson added 9 commits May 4, 2025 17:06

convert ok, load ok

4fa0c27

warmup ok

8b73116

test

4ac7940

still does not work?

4282465

fix padding

45cdb7f

temporary give up

f3605b9

Merge branch 'master' into xsn/mtmd_ultravox

1804fa2

fix merge conflict

bc708b4

build_ultravox()

de20afd

github-actions bot added the examples and python (python script changes) labels May 18, 2025

ngxson added 8 commits May 19, 2025 10:46

rm test

bbe4940

Merge branch 'master' into xsn/mtmd_ultravox

4d44460

fix merge conflict

8d7d75a

add necessary mtmd APIs

dce799d

first working version (only 4s of audio)

f151854

will this monster compile?

9a0dcb6

fix compile

1a90395

please compile

4a8c092

ngxson commented May 19, 2025

tools/mtmd/mtmd-audio.cpp (outdated, resolved)

ngxson added 6 commits May 19, 2025 22:29

fPIC

6f23ad1

fix windows

cf38b47

various fixes

cf4f5d2

clean up audio_helpers

3bbb26b

fix conversion

3ce96d7

add some debug stuff

cf9613f

long audio input ok

23d0d7f

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label May 21, 2025

github-actions bot added the Apple Metal https://en.wikipedia.org/wiki/Metal_(API) label May 21, 2025

ngxson added 2 commits May 21, 2025 15:06

adapt the api

7033aa1

Merge branch 'master' into xsn/mtmd_ultravox

e7c8a2e

ngxson force-pushed the xsn/mtmd_ultravox branch from 167dc89 to e7c8a2e Compare May 21, 2025 15:15

github-actions bot added the server label May 21, 2025

ngxson added 2 commits May 21, 2025 17:35

add --audio arg

111c820

final touch UX

e6416b0

github-actions bot added the documentation Improvements or additions to documentation label May 21, 2025

ngxson changed the title ~~mtmd : (WIP) add ultravox audio input~~ mtmd : add ultravox audio input May 21, 2025

ngxson marked this pull request as ready for review May 21, 2025 16:30

ngxson requested a review from ggerganov May 21, 2025 16:30

ngxson commented May 21, 2025

add miniaudio to readme

36a1abb

ngxson removed the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal labels May 21, 2025

fix typo

544f4f1

ggerganov reviewed May 22, 2025

ggerganov approved these changes May 22, 2025

ngxson added 3 commits May 22, 2025 17:14

Merge branch 'master' into xsn/mtmd_ultravox

7602ee4

refactor kv metadata

9afb3af

mtmd_default_marker()

107790a

ngxson merged commit 797990c into ggml-org:master May 22, 2025
49 checks passed

ngxson mentioned this pull request May 22, 2025

server : support audio input #13714

Merged

jhen0409 mentioned this pull request May 23, 2025

feat: sync llama.cpp mybigday/llama.rn#146

Merged

tattn mentioned this pull request May 23, 2025

[auto] Update llama.cpp to latest version tattn/LocalLLMClient#9

Merged
