@quantcpp/wasm

Single-header C LLM inference engine compiled to WebAssembly. 192 KB binary. Runs GGUF models in your browser with KV cache compression.

Install

npm install @quantcpp/wasm

Quick start

<script type="module">
  import { Quant } from '@quantcpp/wasm';

  const q = await Quant.create({
    scriptUrl: 'node_modules/@quantcpp/wasm/quant.js',
    modelUrl: 'https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q8_0.gguf',
    onStatus: (msg) => console.log('[quant]', msg),
  });

  await q.generate('The capital of France is', {
    maxTokens: 32,
    temperature: 0.0,
    onToken: (text) => document.body.append(text),
    onDone: ({ nTokens, elapsedMs }) => {
      console.log(`Generated ${nTokens} tokens in ${elapsedMs.toFixed(0)} ms`);
    },
  });

  q.free();
</script>
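
The `onDone` callback above receives `nTokens` and `elapsedMs`, so a throughput figure can be derived from it. A minimal sketch, assuming `elapsedMs` covers the whole generation; the helper name `tokensPerSecond` is hypothetical and not part of the package:

```javascript
// Hypothetical helper (not part of @quantcpp/wasm): converts the stats
// object passed to onDone into a tokens-per-second figure.
function tokensPerSecond({ nTokens, elapsedMs }) {
  // Guard against a zero or missing duration to avoid dividing by zero.
  if (!(elapsedMs > 0)) return 0;
  return (nTokens * 1000) / elapsedMs;
}

// Example: 32 tokens generated in 1600 ms.
console.log(tokensPerSecond({ nTokens: 32, elapsedMs: 1600 }));
```

It could be called from inside `onDone` to log throughput alongside the token count.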

Why?

192 KB binary. The entire inference engine (tokenizer, transformer forward pass, KV cache compression) is smaller than most JPEGs.
Zero server. Models load and run entirely client-side. Nothing is uploaded.
Real models. Llama 3, Qwen 3.5, Gemma 3, SmolLM2, and any other GGUF model that fits your memory budget.
KV compression built in. Run 4–7× longer context than with an FP16 KV cache.
One file at the source. Powered by quant.h, a 628 KB single-header C library you can drop into any project.

API

See index.d.ts for the full TypeScript surface.

import { Quant } from '@quantcpp/wasm';

const q = await Quant.create({
  scriptUrl: './quant.js',           // URL of the quant.js glue script that loads quant.wasm
  modelUrl: '/models/llama.gguf',    // optional eager model load
  kvType: 'uniform_4b',              // KV cache quantization
  vQuant: 'q4',                      // value cache quantization
});

await q.generate('Hello', {
  maxTokens: 64,
  temperature: 0.7,
  onToken: (text) => process.stdout.write(text),  // Node.js only; in a browser, append to the DOM instead
});

q.free();
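
When only the final string is needed, the streaming callback can be wrapped so that tokens are buffered and `free()` runs even if generation throws. This is a sketch, not part of the package: `generateText` is a hypothetical helper, and it assumes the promise returned by `generate()` settles only after the last `onToken` call, as the `await` usage above suggests.

```javascript
// Hypothetical helper (not part of @quantcpp/wasm). `q` is any object with
// the generate() and free() methods shown above.
async function generateText(q, prompt, options = {}) {
  const chunks = [];
  try {
    await q.generate(prompt, {
      ...options,
      // Buffer each streamed piece, then forward it to the caller's handler.
      onToken: (text) => {
        chunks.push(text);
        if (options.onToken) options.onToken(text);
      },
    });
    return chunks.join('');
  } finally {
    // Release the instance whether generation succeeded or threw.
    q.free();
  }
}
```

Because the helper frees the instance, it suits one-shot use; to run several prompts on the same instance, call `generate()` directly and call `free()` once at the end.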

Supported KV quantization types

Type	Bits per element	Notes
`fp32`	32	baseline
`uniform_4b` ⭐	4	recommended; +6.3% PPL on Llama 3.2 3B
`uniform_2b`	2	maximum compression, lower quality
`polar_3b` / `polar_4b`	3 / 4	PolarQuant-style
`qjl_1b`	1	sign-hash baseline
`turbo_kv_3b` / `turbo_kv_4b`	3 / 4	TurboQuant-structure (research; see issue #14)
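
To get a feel for what the bit widths in the table mean in memory terms, the nominal cache size can be estimated from the model shape. The sketch below makes several assumptions: keys and values use the same bit width, per-block scales and other metadata are ignored, and the model shape is hypothetical, so real figures will differ from these nominal ones.

```javascript
// Nominal bits per element, copied from the table above.
const KV_BITS = {
  fp32: 32,
  uniform_4b: 4,
  uniform_2b: 2,
  polar_3b: 3,
  polar_4b: 4,
  qjl_1b: 1,
  turbo_kv_3b: 3,
  turbo_kv_4b: 4,
};

// Estimated KV cache size in bytes: keys and values (the factor 2) for
// every layer, KV head, head dimension and context position.
function kvCacheBytes({ nLayers, nKvHeads, headDim, contextLen }, bitsPerElem) {
  const elems = 2 * nLayers * nKvHeads * headDim * contextLen;
  return (elems * bitsPerElem) / 8;
}

// Hypothetical model shape, chosen only for illustration.
const shape = { nLayers: 28, nKvHeads: 8, headDim: 128, contextLen: 4096 };
const fp16Bytes = kvCacheBytes(shape, 16);

for (const [type, bits] of Object.entries(KV_BITS)) {
  const bytes = kvCacheBytes(shape, bits);
  const mib = bytes / 2 ** 20;
  const ratio = fp16Bytes / bytes;
  console.log(`${type}: ${mib.toFixed(1)} MiB (${ratio.toFixed(1)}x vs FP16)`);
}
```

For a fixed memory budget, the nominal ratio against FP16 is also the factor by which the context length could grow under these assumptions.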

Build from source

git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp/wasm
bash build.sh   # requires Emscripten (on macOS: brew install emscripten)

Output: quant.wasm (192 KB) and quant.js (~30 KB glue).

License

Apache 2.0. See LICENSE.

Citation

If you use quant.cpp's KV compression building blocks in research, please cite the underlying papers:
