
Meta-Llama-3-8B-Instruct quantization + vLLM serving

This repo quantizes Meta-Llama-3-8B-Instruct to W8A8 (INT8 weights + INT8 activations), then serves the quantized model with vLLM (KV cache + continuous batching).

Prereqs

  • A CUDA-capable GPU
  • uv installed
  • A Hugging Face token in a .env file at the repo root:
HF_TOKEN=hf_...
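To illustrate the expected .env format, here is a hypothetical stdlib-only loader that parses KEY=VALUE lines into the process environment (the repo's scripts may load the file differently, e.g. via python-dotenv):

```python
import os

def load_dotenv(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file.

    Hypothetical helper for illustration: blank lines and '#' comments
    are skipped, and parsed values are exported to os.environ.
    """
    env = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    os.environ.update(env)
    return env
```

With the .env above, load_dotenv() places HF_TOKEN into os.environ, where Hugging Face tooling can pick it up for gated-model downloads.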

Install deps

uv sync

Quantize

uv run quantize.py

By default, this writes the quantized model to a local directory:

Meta-Llama-3-8B-Instruct-W8A8/

You can override settings via env vars:

  • MODEL_ID (default: meta-llama/Meta-Llama-3-8B-Instruct)
  • SAVE_DIR (default: {model_name}-W8A8)
  • NUM_CALIBRATION_SAMPLES (default: 512)
  • MAX_SEQUENCE_LENGTH (default: 2048)
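The override behavior can be pictured with a small sketch (hypothetical function name; quantize.py may implement this differently):

```python
import os

def resolve_settings(environ=os.environ) -> dict:
    """Resolve quantization settings from env vars, using the README's defaults."""
    model_id = environ.get("MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")
    model_name = model_id.split("/")[-1]  # e.g. "Meta-Llama-3-8B-Instruct"
    return {
        "model_id": model_id,
        # SAVE_DIR default is derived from the model name: {model_name}-W8A8
        "save_dir": environ.get("SAVE_DIR", f"{model_name}-W8A8"),
        "num_calibration_samples": int(environ.get("NUM_CALIBRATION_SAMPLES", "512")),
        "max_sequence_length": int(environ.get("MAX_SEQUENCE_LENGTH", "2048")),
    }
```

Any variable not set in the environment falls back to the default listed above.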

Example:

MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct \
SAVE_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
NUM_CALIBRATION_SAMPLES=256 \
MAX_SEQUENCE_LENGTH=2048 \
uv run quantize.py

Serve with vLLM (KV cache + batching)

The server uses vLLM's OpenAI-compatible API. KV caching and continuous batching are enabled by default in vLLM; the main batching knobs are max_num_batched_tokens and max_num_seqs.
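For orientation, the script presumably wraps a vLLM OpenAI server invocation along these lines (flag names are vLLM's standard CLI options; the authoritative command lives in scripts/serve_vllm.sh):

```shell
# Hypothetical equivalent of what scripts/serve_vllm.sh runs.
vllm serve ./Meta-Llama-3-8B-Instruct-W8A8 \
  --served-model-name llama3-w8a8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 256
```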

Start the server:

./scripts/serve_vllm.sh

Defaults are baked into scripts/serve_vllm.sh, but you can override via env vars:

MODEL_DIR=./Meta-Llama-3-8B-Instruct-W8A8 \
SERVED_MODEL_NAME=llama3-w8a8 \
GPU_MEM_UTIL=0.90 \
MAX_MODEL_LEN=8192 \
MAX_BATCHED_TOKENS=16384 \
MAX_NUM_SEQS=256 \
./scripts/serve_vllm.sh

If your vLLM build needs an explicit quantization flag:

EXTRA_ARGS="--quantization compressed-tensors" ./scripts/serve_vllm.sh

Smoke test

./scripts/smoke_test.sh

Concurrency test (batching)

uv run python scripts/concurrency_test.py

You can tune concurrency in scripts/concurrency_test.py.
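The script's internals are not shown here, but the general shape of such a test is a fan-out of simultaneous requests. A hypothetical stdlib-only sketch, with the request function injected so it can be swapped for a real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(request_fn, num_requests: int, max_workers: int = 16) -> list:
    """Fire `num_requests` calls to `request_fn` concurrently and collect results.

    In a real test, `request_fn(i)` would POST a chat completion to the
    vLLM server; it is injected here so the harness stays transport-agnostic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in the returned results.
        return list(pool.map(request_fn, range(num_requests)))
```

With vLLM's continuous batching, concurrent requests like these are batched together on the GPU rather than served strictly one at a time, which is what the test is meant to exercise.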

Using the served model

List models:

curl http://localhost:8000/v1/models

Chat completion:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-w8a8",
    "messages": [
      {"role":"system","content":"You are a helpful assistant."},
      {"role":"user","content":"Explain KV cache in one paragraph."}
    ],
    "temperature": 0,
    "max_tokens": 128
  }'
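The same request can be issued from Python using only the standard library (hypothetical client; the endpoint and model name match the curl example above):

```python
import json
import urllib.request

def build_chat_payload(model: str, user_msg: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0,
        "max_tokens": max_tokens,
    }

def chat(url: str, payload: dict) -> dict:
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the server from the previous section to be running):
# reply = chat("http://localhost:8000/v1/chat/completions",
#              build_chat_payload("llama3-w8a8", "Explain KV cache in one paragraph."))
```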

Notes

  • The quantized model uses compressed-tensors metadata in Meta-Llama-3-8B-Instruct-W8A8/config.json. vLLM loads this as compressed-tensors quantization.
  • sitecustomize.py provides a small compatibility shim for lm-format-enforcer when running vLLM.
  • Large artifacts (the Meta-Llama-3-8B-Instruct-W8A8/ weights, logs, TensorBoard output) are intentionally not tracked in git.
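One way to confirm the quantization metadata is present before serving is to inspect config.json (hypothetical check; the exact contents of the quantization_config section depend on the compressed-tensors version used):

```python
import json
from pathlib import Path

def read_quant_config(model_dir: str) -> dict:
    """Return the `quantization_config` section of a model's config.json, or {}."""
    config = json.loads(Path(model_dir, "config.json").read_text())
    return config.get("quantization_config", {})
```

An empty result suggests quantization did not write its metadata, in which case vLLM will not detect the model as compressed-tensors quantized.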
