This repo quantizes Meta-Llama-3-8B-Instruct to W8A8 (INT8 weights + INT8 activations), then serves the quantized model with vLLM (KV cache + continuous batching).
- CUDA-capable GPU
- `uv` installed
- HF token in `.env` (see `.env` format below)

`.env` format:

```
HF_TOKEN=hf_...
```
```bash
uv sync
uv run quantize.py
```
Outputs a local model directory (default):

```
Meta-Llama-3-8B-Instruct-W8A8/
```
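W8A8 here means symmetric INT8 for both weights and activations. As a purely illustrative sketch of the underlying math (not the code path `quantize.py` actually takes, which goes through a quantization library):

```python
# Illustrative W8A8 math only: symmetric per-tensor INT8 quantization.
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """q = round(x / scale), with scale chosen so max|x| maps to 127."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element rounding error is bounded by scale / 2.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Calibration samples (below) are used to pick good activation scales, since activations are only known at runtime.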
You can override settings via env vars:
- `MODEL_ID` (default: `meta-llama/Meta-Llama-3-8B-Instruct`)
- `SAVE_DIR` (default: `{model_name}-W8A8`)
- `NUM_CALIBRATION_SAMPLES` (default: `512`)
- `MAX_SEQUENCE_LENGTH` (default: `2048`)
Example:

```bash
MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
SAVE_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
NUM_CALIBRATION_SAMPLES=256 \
MAX_SEQUENCE_LENGTH=2048 \
uv run quantize.py
```
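The overrides above are presumably consumed as plain environment variables with fallbacks; a minimal sketch of that pattern (an assumption about `quantize.py`'s internals, check the script for the actual code):

```python
# Assumed env-var handling pattern; names mirror the README's table.
import os

MODEL_ID = os.environ.get("MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
model_name = MODEL_ID.split("/")[-1]  # e.g. "Meta-Llama-3-8B-Instruct"
SAVE_DIR = os.environ.get("SAVE_DIR", f"{model_name}-W8A8")
NUM_CALIBRATION_SAMPLES = int(os.environ.get("NUM_CALIBRATION_SAMPLES", "512"))
MAX_SEQUENCE_LENGTH = int(os.environ.get("MAX_SEQUENCE_LENGTH", "2048"))

print(SAVE_DIR)  # with no overrides: Meta-Llama-3-8B-Instruct-W8A8
```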
The server uses vLLM's OpenAI-compatible API. The KV cache and continuous batching are enabled by default in vLLM; the main batching knobs are `max_num_batched_tokens` and `max_num_seqs`.
Start the server:

```bash
./scripts/serve_vllm.sh
```
Defaults are baked into `scripts/serve_vllm.sh`, but you can override via env vars:

```bash
MODEL_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
SERVED_MODEL_NAME=llama3-w8a8 \
GPU_MEM_UTIL=0.90 \
MAX_MODEL_LEN=8192 \
MAX_BATCHED_TOKENS=16384 \
MAX_NUM_SEQS=256 \
./scripts/serve_vllm.sh
```
If your vLLM build needs an explicit quantization flag:

```bash
EXTRA_ARGS="--quantization compressed-tensors" ./scripts/serve_vllm.sh
```
Run the smoke test:

```bash
./scripts/smoke_test.sh
```

Run the concurrency test:

```bash
uv run python scripts/concurrency_test.py
```

You can tune concurrency in `scripts/concurrency_test.py`.
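`scripts/concurrency_test.py` is the source of truth; as a rough sketch, such a test fires requests in parallel and summarizes latency percentiles. The URL, model name, and concurrency level below are assumptions taken from this README's serve defaults:

```python
# Hedged sketch of a concurrency test against the vLLM OpenAI endpoint.
# URL, MODEL, and CONCURRENCY are assumptions based on the serve defaults.
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama3-w8a8"
CONCURRENCY = 16

def one_request(prompt: str) -> float:
    """Send one chat completion and return wall-clock latency in seconds."""
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start

def summarize(latencies: list[float]) -> dict[str, float]:
    """p50/p95 latency summary over a list of durations."""
    qs = statistics.quantiles(latencies, n=100)
    return {"p50": statistics.median(latencies), "p95": qs[94]}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        lats = list(pool.map(one_request, [f"Say {i}" for i in range(64)]))
    print(summarize(lats))
```

With continuous batching, p95 should degrade gracefully as concurrency rises until `MAX_NUM_SEQS` or `MAX_BATCHED_TOKENS` saturates.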
List models:

```bash
curl http://localhost:8000/v1/models
```
Chat completion:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-w8a8",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KV cache in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
```
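The same call from Python, using only the standard library (the endpoint and model name match this README's serve defaults; a running server is required for the send step):

```python
# Stdlib-only equivalent of the curl chat-completion example above.
import json
import urllib.request

def build_chat_request(model: str, user_msg: str) -> dict:
    """Assemble the same JSON body the curl example sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0,
        "max_tokens": 128,
    }

if __name__ == "__main__":
    body = json.dumps(build_chat_request(
        "llama3-w8a8", "Explain KV cache in one paragraph.")).encode()
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```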
- The quantized model uses `compressed-tensors` metadata in `Meta-Llama-3-8B-Instruct-W8A8/config.json`; vLLM loads this as `compressed-tensors` quantization.
- `sitecustomize.py` provides a small compatibility shim for `lm-format-enforcer` when running vLLM.
- Large artifacts (`Meta-Llama-3-8B-Instruct-W8A8/` weights, logs, tensorboard) are intentionally not tracked.