This repo quantizes Meta-Llama-3-8B-Instruct to W8A8 (INT8 weights + INT8 activations), then serves the quantized model with vLLM (KV cache + continuous batching).
- CUDA-capable GPU
- `uv` installed
- HF token in `.env` (see `.env` format below)

`.env` format:

```
HF_TOKEN=hf_...
```
```bash
uv sync
uv run quantize.py
```
Outputs a local model directory (default):

```
Meta-Llama-3-8B-Instruct-W8A8/
```
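W8A8 here means symmetric INT8 for both weights and activations. As a purely illustrative sketch of the underlying math (not the code path `quantize.py` actually takes, which goes through a quantization library):

```python
# Illustrative W8A8 math only: symmetric per-tensor INT8 quantization.
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """q = round(x / scale), with scale chosen so max|x| maps to 127."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.01, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element rounding error is bounded by scale / 2.
assert np.abs(w - w_hat).max() <= scale / 2 + 1e-6
```

Calibration samples (below) are used to pick good activation scales, since activations are only known at runtime.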
You can override settings via env vars:
- `MODEL_ID` (default: `meta-llama/Meta-Llama-3-8B-Instruct`)
- `SAVE_DIR` (default: `{model_name}-W8A8`)
- `NUM_CALIBRATION_SAMPLES` (default: `512`)
- `MAX_SEQUENCE_LENGTH` (default: `2048`)
Example:

```bash
MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
SAVE_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
NUM_CALIBRATION_SAMPLES=256 \
MAX_SEQUENCE_LENGTH=2048 \
uv run quantize.py
```
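The overrides above are presumably consumed as plain environment variables with fallbacks; a minimal sketch of that pattern (an assumption about `quantize.py`'s internals, check the script for the actual code):

```python
# Assumed env-var handling pattern; names mirror the README's table.
import os

MODEL_ID = os.environ.get("MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
model_name = MODEL_ID.split("/")[-1]  # e.g. "Meta-Llama-3-8B-Instruct"
SAVE_DIR = os.environ.get("SAVE_DIR", f"{model_name}-W8A8")
NUM_CALIBRATION_SAMPLES = int(os.environ.get("NUM_CALIBRATION_SAMPLES", "512"))
MAX_SEQUENCE_LENGTH = int(os.environ.get("MAX_SEQUENCE_LENGTH", "2048"))

print(SAVE_DIR)  # with no overrides: Meta-Llama-3-8B-Instruct-W8A8
```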
The server uses vLLM's OpenAI-compatible API. The KV cache and continuous batching are enabled by default in vLLM; the main batching knobs are `max_num_batched_tokens` and `max_num_seqs`.
Start the server:

```bash
./scripts/serve_vllm.sh
```
Defaults are baked into `scripts/serve_vllm.sh`, but you can override via env vars:

```bash
MODEL_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
SERVED_MODEL_NAME=llama3-w8a8 \
GPU_MEM_UTIL=0.90 \
MAX_MODEL_LEN=8192 \
MAX_BATCHED_TOKENS=16384 \
MAX_NUM_SEQS=256 \
./scripts/serve_vllm.sh
```
If your vLLM build needs an explicit quantization flag:

```bash
EXTRA_ARGS="--quantization compressed-tensors" ./scripts/serve_vllm.sh
```
Run the smoke test:

```bash
./scripts/smoke_test.sh
```

Run the concurrency test:

```bash
uv run python scripts/concurrency_test.py
```

You can tune concurrency in `scripts/concurrency_test.py`.
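`scripts/concurrency_test.py` is the source of truth; as a rough sketch, such a test fires requests in parallel and summarizes latency percentiles. The URL, model name, and concurrency level below are assumptions taken from this README's serve defaults:

```python
# Hedged sketch of a concurrency test against the vLLM OpenAI endpoint.
# URL, MODEL, and CONCURRENCY are assumptions based on the serve defaults.
import json
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama3-w8a8"
CONCURRENCY = 16

def one_request(prompt: str) -> float:
    """Send one chat completion and return wall-clock latency in seconds."""
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        URL, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    urllib.request.urlopen(req).read()
    return time.perf_counter() - start

def summarize(latencies: list[float]) -> dict[str, float]:
    """p50/p95 latency summary over a list of durations."""
    qs = statistics.quantiles(latencies, n=100)
    return {"p50": statistics.median(latencies), "p95": qs[94]}

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        lats = list(pool.map(one_request, [f"Say {i}" for i in range(64)]))
    print(summarize(lats))
```

With continuous batching, p95 should degrade gracefully as concurrency rises until `MAX_NUM_SEQS` or `MAX_BATCHED_TOKENS` saturates.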
List models:

```bash
curl http://localhost:8000/v1/models
```
Chat completion:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-w8a8",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain KV cache in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
```
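The same call from Python, using only the standard library (the endpoint and model name match this README's serve defaults; a running server is required for the send step):

```python
# Stdlib-only equivalent of the curl chat-completion example above.
import json
import urllib.request

def build_chat_request(model: str, user_msg: str) -> dict:
    """Assemble the same JSON body the curl example sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0,
        "max_tokens": 128,
    }

if __name__ == "__main__":
    body = json.dumps(build_chat_request(
        "llama3-w8a8", "Explain KV cache in one paragraph.")).encode()
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```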
- The quantized model uses `compressed-tensors` metadata in `Meta-Llama-3-8B-Instruct-W8A8/config.json`; vLLM loads this as `compressed-tensors` quantization.
- `sitecustomize.py` provides a small compatibility shim for `lm-format-enforcer` when running vLLM.
- Large artifacts (`Meta-Llama-3-8B-Instruct-W8A8/` weights, logs, tensorboard) are intentionally not tracked.