This project implements an end-to-end solution for Human Action Recognition (HAR) using CLIP (Contrastive Language-Image Pre-training) models. The system supports multiple training modes including single-GPU, DistributedDataParallel (DDP), and FullyShardedDataParallel (FSDP).
- Advanced Model Architecture: Leverages CLIP with custom text prompts for zero-shot and fine-tuned action classification (a zero-shot sketch follows this list)
- Distributed Training: Supports single-GPU, DDP, and FSDP training modes
- Comprehensive Evaluation: Detailed metrics, confusion matrices, and per-class accuracy visualizations
- Dual Experiment Tracking: Support for both MLflow (self-hosted) and Weights & Biases (cloud) simultaneously
- Automated Training & Deployment: End-to-end automated training with DVC dataset versioning and HuggingFace Hub integration
- Production-Ready Inference: REST API for model serving with multiple model format support (PyTorch, ONNX, TorchScript)
- Data Version Control: DVC integration for dataset and model versioning
- Model Export: ONNX and TensorRT export with benchmarking
- Interactive UI: Streamlit-based web interface for model testing
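To make the zero-shot mode above concrete, here is a minimal sketch using the HuggingFace transformers CLIP API; the prompt template, class names, and image path are illustrative and the project's own prompt construction may differ:

```python
# Minimal zero-shot sketch with HuggingFace transformers.
# Prompt wording, class list, and image path are placeholders, not the project's defaults.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

actions = ["running", "sitting", "dancing"]
prompts = [f"a photo of a person {action}" for action in actions]

image = Image.open("example.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(actions, probs[0].tolist())))
```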
The CLIP HAR project implements a comprehensive MLOps architecture with automation at its core, featuring interconnected components focused on human action recognition.
The system integrates:
- Data management with version control
- CLIP-based model architecture
- Distributed training capabilities
- Comprehensive evaluation metrics
- Production-ready deployment options
- Interactive user interfaces
For detailed architecture diagrams and component descriptions, see docs/architecture.md.
CLIP_HAR_PROJECT/
├── app/ # Streamlit application
├── configs/ # Configuration files
├── data/ # Data handling modules
│ ├── dataset.py # Dataset loading and preparation
│ ├── preprocessing.py # Data preprocessing utilities
│ └── augmentation.py # Augmentation strategies
├── deployment/ # Deployment utilities
│ └── export.py # Model export (ONNX, TensorRT)
├── evaluation/ # Evaluation modules
│ ├── evaluator.py # Evaluation orchestration
│ ├── metrics.py # Metric computation utilities
│ └── visualization.py # Result visualization
├── mlops/ # MLOps integration
│ ├── tracking.py # Unified tracking system (MLflow & wandb)
│ ├── dvc_utils.py # DVC integration utilities
│ ├── huggingface_hub_utils.py # HuggingFace Hub integration
│ ├── automated_training.py # Automated training module
│ └── inference_serving.py # Inference serving module
├── models/ # Model definitions
│ ├── clip_model.py # CLIP-based model
│ └── model_factory.py # Model creation utilities
├── pipeline/ # End-to-end pipelines
│ ├── training_pipeline.py # Training pipeline
│ └── inference_pipeline.py # Inference pipeline
├── training/ # Training modules
│ ├── distributed.py # Distributed training utilities
│ └── trainer.py # Trainer implementations
├── utils/ # Utility functions
├── docs/ # Documentation
│ ├── architecture.md # Detailed architecture overview
│ ├── docker_guide.md # Docker containerization guide
│ ├── api_reference.md # API reference documentation
│ └── experiment_tracking.md # Experiment tracking guide
├── custom_evaluate.py # Evaluation script
├── launch_distributed.py # Distributed training launcher
├── train.py # Training script
├── docker/ # Docker configuration files
│ ├── docker-compose.yml # Docker Compose configuration
│ ├── Dockerfile.train # Training container Dockerfile
│ ├── Dockerfile.app # App/Inference container Dockerfile
│ └── Dockerfile # Base Dockerfile
├── dvc.yaml # DVC pipeline definition
└── requirements.txt # Project dependencies
1. Clone the repository:

   git clone https://github.com/tuandung222/Open-vocabulary-Action-Recognition-with-CLIP.git
   cd Open-vocabulary-Action-Recognition-with-CLIP

2. Install the required packages:

   pip install -r requirements.txt
   pip install -e .

3. Set up DVC:

   # Initialize DVC if not already initialized
   dvc init
   # Add raw data to version control
   dvc add data/raw
For experiments and smaller datasets, run training on a single GPU:
python train.py --distributed_mode none --batch_size 128 --max_epochs 15 --lr 3e-6
For faster training with multiple GPUs using DistributedDataParallel (DDP):
# Launch DDP training using torchrun (automatically handles process creation)
python launch_distributed.py \
--distributed_mode ddp \
--batch_size 64 \
--max_epochs 15 \
--lr 3e-6 \
--output_dir outputs/ddp_training
For very large models or datasets, use Fully Sharded Data Parallel (FSDP) to shard model parameters across GPUs:
python launch_distributed.py \
--distributed_mode fsdp \
--batch_size 32 \
--max_epochs 15 \
--lr 2e-6 \
--output_dir outputs/fsdp_training
Under the hood, the distributed training in this project is implemented using PyTorch's distributed training capabilities, specifically DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP). Here's how it works:
1. Launcher Abstraction: When you run `launch_distributed.py`, it abstracts away the complexity of setting up distributed training:
   - It detects the number of available GPUs
   - It automatically configures the `torchrun` command with appropriate arguments
   - It launches the main training script (`train.py`) with the proper environment variables set

2. Behind the Scenes: The launcher is actually using `torchrun` (PyTorch's distributed launcher) to spawn multiple processes:

   # From launch_distributed.py
   cmd = [
       "torchrun",
       "--nproc_per_node", str(num_gpus),
       "--master_addr", args.master_addr,
       "--master_port", args.master_port,
       "train.py",
       # ... additional arguments
   ]

3. Process Management: Each GPU gets its own Python process with:
   - A unique `LOCAL_RANK` (GPU index)
   - A unique `RANK` (process index in the distributed group)
   - A shared `WORLD_SIZE` (total number of processes)

4. Trainer Integration: Inside the `DistributedTrainer` class, the distributed environment is automatically set up based on these environment variables:
   - The model is wrapped in DDP or FSDP depending on your choice
   - Distributed samplers are created for the datasets
   - Gradients are synchronized across processes during training
   - Only the main process (rank 0) performs logging and checkpoint saving
This approach makes distributed training much simpler to use, as you don't have to manually set up process groups, wrap models, or handle synchronization - the launcher and trainer handle all these details for you.
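As a rough illustration of this setup, a DDP wrapper driven by torchrun's environment variables might look like the sketch below; the function name is hypothetical and the project's `DistributedTrainer` may differ in detail:

```python
# Hedged sketch of environment-variable-driven DDP setup; not the project's actual code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap the model for DDP using the variables torchrun sets for each process."""
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if world_size > 1:
        dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE are read from the environment
        torch.cuda.set_device(local_rank)
        model = DDP(model.to(local_rank), device_ids=[local_rank])
    return model
```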
| Parameter | Description | Default |
|---|---|---|
| `--distributed_mode` | Training mode (`none`, `ddp`, `fsdp`) | `none` |
| `--model_name` | CLIP model name/path | `openai/clip-vit-base-patch16` |
| `--batch_size` | Training batch size (per GPU) | 256 |
| `--eval_batch_size` | Evaluation batch size (per GPU) | 128 |
| `--max_epochs` | Maximum number of training epochs | 15 |
| `--lr` | Learning rate | 3e-6 |
| `--unfreeze_visual` | Unfreeze visual encoder | False |
| `--unfreeze_text` | Unfreeze text encoder | False |
| `--no_mixed_precision` | Disable mixed precision training | False |
Run an end-to-end training pipeline with all components:
python -m CLIP_HAR_PROJECT.pipeline.training_pipeline \
--config_path configs/training_config.yaml \
--output_dir outputs/training_run \
--distributed_mode ddp \
--augmentation_strength medium
Automate the complete training pipeline with specific dataset versions and model checkpoints:
python -m CLIP_HAR_PROJECT.mlops.automated_training \
--config configs/training_config.yaml \
--output_dir outputs/auto_training \
--dataset_version v1.2 \
--checkpoint previous_models/checkpoint.pt \
--push_to_hub \
--experiment_name "clip_har_v2" \
--distributed_mode ddp
The automated training pipeline:
- Loads a specific dataset version using DVC
- Starts from a checkpoint if provided
- Runs the complete training pipeline
- Pushes the trained model to HuggingFace Hub
- Saves all results and metrics
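For illustration, pulling a file from a specific DVC dataset version programmatically could look like this sketch using the dvc.api module; the tracked file path and version tag are placeholders:

```python
# Hedged sketch: reading a DVC-tracked file at a specific revision (tag or commit).
# The data path and version tag below are placeholders.
import dvc.api

with dvc.api.open(
    "data/raw/annotations.csv",  # hypothetical DVC-tracked file
    repo="https://github.com/tuandung222/Open-vocabulary-Action-Recognition-with-CLIP",
    rev="v1.2",                  # dataset version tag
) as f:
    print(f.readline())
```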
For a comprehensive guide covering all training scenarios, distributed training options, and troubleshooting tips, see Training Guide.
Evaluate a trained model:
python custom_evaluate.py --model_path /path/to/checkpoint.pt --output_dir results
The evaluation produces:
- Accuracy, precision, recall, and F1 score
- Confusion matrix visualization
- Per-class accuracy analysis
- Detailed classification report
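The underlying computations are standard; a minimal sketch with scikit-learn, using dummy label arrays in place of the evaluator's outputs, would be:

```python
# Hedged sketch of the reported metrics, computed with scikit-learn on dummy labels.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

y_true = np.array([0, 1, 2, 2, 1])  # placeholder ground-truth class indices
y_pred = np.array([0, 1, 2, 1, 1])  # placeholder predicted class indices

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
cm = confusion_matrix(y_true, y_pred)
print("per-class accuracy:", cm.diagonal() / cm.sum(axis=1))
print(classification_report(y_true, y_pred))
```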
The project supports MLflow and Weights & Biases for experiment tracking, and both can be used simultaneously:
# Use both MLflow and wandb
python train.py --experiment_name "clip_har_experiment"
# Use only MLflow
python train.py --use_mlflow --no_wandb --experiment_name "clip_har_experiment"
# Use only wandb
python train.py --no_mlflow --use_wandb --experiment_name "clip_har_experiment"
# Disable all tracking
python train.py --no_tracking
# Specify custom MLflow port
python train.py --experiment_name "clip_har_experiment" --mlflow_port 5001
# Start MLflow tracking server
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 0.0.0.0
The MLflow UI will be accessible at http://localhost:5000 and provides:
- Experiment comparison
- Metric visualization
- Model versioning
- Artifact management
If you encounter issues with the MLflow server:
1. Address already in use error: If port 5000 is already taken, use a different port:

   mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 0.0.0.0 --port 5001

   When using a custom port for the MLflow server, make sure to specify the same port in your training script:

   python train.py --use_mlflow --experiment_name "my_experiment" --mlflow_port 5001

2. Server running but dashboard empty: Make sure your training run explicitly enables MLflow:

   python train.py --use_mlflow --experiment_name "my_experiment"

3. Check if an MLflow server is already running:

   ps aux | grep mlflow

4. Restart a stuck server:

   pkill -f "mlflow server"

5. Remote access: To access the MLflow UI from a remote machine, use SSH tunneling:

   ssh -L 5000:localhost:5000 username@remote_server

   Then open http://localhost:5000 in your local browser. For a different port, adjust the command accordingly:

   ssh -L 5001:localhost:5001 username@remote_server
# Login to wandb
wandb login
# Run training with wandb project/group
python train.py --use_wandb --project_name "clip_har" --group_name "experiments"
The wandb dashboard provides:
- Real-time training monitoring
- Advanced visualizations
- Team collaboration
- Run comparisons
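Conceptually, dual tracking boils down to forwarding each metric to both backends; a minimal sketch follows (the project's mlops/tracking.py API may differ):

```python
# Hedged sketch of dual experiment tracking; not the project's actual tracking API.
import mlflow
import wandb

def log_metrics(metrics: dict, step: int, use_mlflow: bool = True, use_wandb: bool = True):
    """Forward the same metrics to MLflow and/or wandb."""
    if use_mlflow:
        mlflow.log_metrics(metrics, step=step)
    if use_wandb:
        wandb.log(metrics, step=step)

mlflow.set_experiment("clip_har_experiment")
wandb.init(project="clip_har", group="experiments")
with mlflow.start_run():
    log_metrics({"train/loss": 0.42, "train/accuracy": 0.88}, step=100)
```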
# Initialize DVC
dvc init
# Add dataset to DVC tracking
dvc add data/raw
# Run the pipeline
dvc repro
# Push data to remote storage (if configured)
dvc push
The DVC pipeline in `dvc.yaml` includes stages for:
- Data preparation
- Model training
- Evaluation
- Model export
This project supports multiple export formats to optimize models for different deployment scenarios:
python -m CLIP_HAR_PROJECT.deployment.export_clip_model \
--model_path outputs/trained_model.pt \
--export_format onnx torchscript tensorrt \
--benchmark
The Open Neural Network Exchange format provides cross-platform compatibility:
- Framework-independent model representation
- Optimized inference with ONNX Runtime
- Deployment on CPU, GPU, and specialized hardware
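As an illustration, running an exported model with ONNX Runtime might look like the sketch below; the file path and input tensor name are assumptions fixed at export time:

```python
# Hedged sketch of ONNX Runtime inference; the file name and tensor name are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("exports/model.onnx", providers=["CPUExecutionProvider"])
pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)  # dummy preprocessed image
(logits,) = session.run(None, {"pixel_values": pixel_values})
print("predicted class index:", int(logits.argmax(axis=-1)[0]))
```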
PyTorch's serialization format for production deployment:
- C++ runtime compatibility
- Graph optimizations for faster inference
- Better portability than native PyTorch models
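A minimal tracing example, using a stand-in module rather than the actual CLIP classifier, looks like this:

```python
# Hedged sketch of TorchScript export via tracing; the module here is a stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 15)).eval()  # placeholder classifier
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)
traced.save("exports/model.torchscript")

reloaded = torch.jit.load("exports/model.torchscript")  # loadable from Python or the C++ runtime
print(reloaded(example).shape)  # torch.Size([1, 15])
```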
NVIDIA's high-performance inference optimizer:
- Maximum GPU acceleration
- Mixed precision support (FP32, FP16, INT8)
- Kernel fusion and other advanced optimizations
| Format | Inference Time | FPS | Relative Speed | Use Case |
|---|---|---|---|---|
| PyTorch | ~25-30ms | ~35 | 1x | Development, flexibility |
| TorchScript | ~18-22ms | ~50 | ~1.4x | Production CPU/GPU |
| ONNX | ~15-20ms | ~60 | ~1.7x | Cross-platform deploy |
| TensorRT | ~5-8ms | ~150 | ~4-5x | Maximum GPU performance |
For applications requiring maximum inference speed, my TensorRT integration provides substantial performance benefits.
- GPU-Optimized Inference: Up to 5x faster inference compared to standard PyTorch models
- Multiple Precision Support: FP32, FP16, and INT8 quantization options
- Dynamic Batch Processing: Configurable batch sizes for both real-time and batch processing
- Seamless API Integration: Uses the same inference API as other model formats
# Precision options: fp32, fp16, int8
python -m CLIP_HAR_PROJECT.deployment.export_clip_model \
--model_path outputs/trained_model.pt \
--config_path configs/training_config.yaml \
--export_format tensorrt \
--precision fp16 \
--batch_size 16 \
--validate \
--benchmark
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path exports/model.trt \
--model_type tensorrt \
--class_names outputs/class_names.json \
--port 8000
The provided Docker container has all necessary TensorRT dependencies pre-installed:
docker-compose -f docker/docker-compose.yml up clip-har-app
The project provides Docker containers for training and inference:
# Build containers
docker-compose -f docker/docker-compose.yml build
# Run training container
docker-compose -f docker/docker-compose.yml run clip-har-train
# Run app/inference container
docker-compose -f docker/docker-compose.yml up clip-har-app
For detailed Docker setup, see docs/docker_guide.md.
Deploy a model as a REST API for inference:
# Serve a PyTorch model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/trained_model.pt \
--model_type pytorch \
--port 8000
# Serve an ONNX model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/model.onnx \
--model_type onnx \
--class_names outputs/class_names.json \
--port 8001
# Serve a TorchScript model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/model.torchscript \
--model_type torchscript \
--port 8002
The inference API provides endpoints for:
- GET / - Get service information
- GET /health - Health check endpoint
- POST /predict - Run inference on an image (JSON), accepting either:
  - Image data as a base64 string
  - An image URL
- POST /predict/image - Run inference on an uploaded image (multipart/form-data)
Example API usage:
# Using the Python client
from CLIP_HAR_PROJECT.mlops.inference_serving import InferenceClient
client = InferenceClient(url="http://localhost:8000")
# Predict from image file
result = client.predict_from_image_path("path/to/image.jpg")
print(f"Top prediction: {result['predictions'][0]['class_name']}")
print(f"Confidence: {result['predictions'][0]['score']:.4f}")
# Predict from image URL
result = client.predict_from_image_url("https://example.com/image.jpg")
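For clients without the Python package, the same endpoints can be called over plain HTTP; a sketch with requests follows, where the multipart field name and JSON keys are assumptions that may differ from the actual service schema:

```python
# Hedged sketch of raw HTTP calls to the inference API; field names are assumptions.
import base64
import requests

# Upload an image file (multipart/form-data)
with open("path/to/image.jpg", "rb") as f:
    resp = requests.post("http://localhost:8000/predict/image", files={"file": f})
print(resp.json())

# Send a base64-encoded image as JSON
with open("path/to/image.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}
resp = requests.post("http://localhost:8000/predict", json=payload)
print(resp.json())
```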
For complete API reference, see docs/api_reference.md.
Push trained models to HuggingFace Hub:
from CLIP_HAR_PROJECT.mlops.huggingface_hub_utils import push_model_to_hub
# Push a trained model to HuggingFace Hub
model_url = push_model_to_hub(
model=model,
model_name="clip-har-v1",
repo_id="tuandunghcmut/clip-har-v1",
commit_message="Upload CLIP HAR model",
metadata={"accuracy": 0.92, "f1_score": 0.91},
private=False
)
print(f"Model uploaded to: {model_url}")
The project supports advanced HuggingFace Hub integration including:
- Automated model publishing during training
- Custom model cards with rich metadata
- Complete pipeline publishing for easier inference
- CI/CD integration through GitHub Actions
- Model versioning with tags and branches
For detailed instructions on these advanced features, see HuggingFace Integration Guide.
Run the Streamlit app:
streamlit run app/app.py
Features:
- Image upload for action classification
- Real-time webcam action recognition
- Model performance visualization
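A stripped-down version of the upload-and-classify flow might look like the sketch below; classify_image is a hypothetical stand-in for the project's actual inference call:

```python
# Hedged sketch of the Streamlit upload flow; classify_image is a hypothetical placeholder.
import streamlit as st
from PIL import Image

def classify_image(image: Image.Image) -> dict:
    # Placeholder: call the trained CLIP HAR model here.
    return {"running": 0.91, "sitting": 0.05, "dancing": 0.04}

st.title("CLIP HAR demo")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Input image")
    st.write(classify_image(image))
```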
- Python 3.8+
- PyTorch 2.0+
- HuggingFace Transformers
- MLflow
- Weights & Biases
- DVC
- Streamlit
- ONNX Runtime
- FastAPI & Uvicorn (for inference serving)
The project uses the Human Action Recognition (HAR) dataset from HuggingFace, containing 15 action classes:
- calling
- clapping
- cycling
- dancing
- drinking
- eating
- fighting
- hugging
- laughing
- listening_to_music
- running
- sitting
- sleeping
- texting
- using_laptop
The CLIP-based model achieves:
- Zero-shot classification accuracy: ~81%
- Fine-tuned model accuracy: ~92%
For production deployments, I've prepared Kubernetes configurations to ensure scalable and reliable service operation. The deployment uses a microservices architecture with separate components for inference, model management, and monitoring.
The `kubernetes/` directory contains all necessary configuration files:
# Apply the entire configuration
kubectl apply -f kubernetes/
# Or apply individual components
kubectl apply -f kubernetes/clip-har-inference.yaml
kubectl apply -f kubernetes/clip-har-monitoring.yaml
- Inference Service: Scalable pods with auto-scaling based on CPU/GPU utilization
- Model Registry: Persistent storage for model versions
- API Gateway: Manages external access and load balancing
- HPA (Horizontal Pod Autoscaler): Automatically scales based on demand
The deployment is configured with appropriate resource requests and limits:
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
    nvidia.com/gpu: 1
  limits:
    memory: "4Gi"
    cpu: "2"
    nvidia.com/gpu: 1
I've implemented a comprehensive monitoring stack using industry-standard tools for observability and performance tracking.
The system uses Prometheus to collect and store time-series metrics:
- Custom Metrics: Model inference latency, throughput, GPU utilization
- System Metrics: Node resource utilization, network throughput
- Business Metrics: Requests per minute, success rates
# Access Prometheus dashboard
kubectl port-forward svc/prometheus-server 9090:9090
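Custom metrics such as inference latency can be exposed with the prometheus_client library; the metric names and port below are illustrative, not the project's actual configuration:

```python
# Hedged sketch of custom Prometheus metrics; names and port are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("clip_har_predictions_total", "Prediction requests", ["status"])
LATENCY = Histogram("clip_har_inference_seconds", "Inference latency in seconds")

def predict_with_metrics(run_inference, image):
    """Wrap an inference call so latency and outcome counts are recorded."""
    start = time.perf_counter()
    try:
        result = run_inference(image)
        PREDICTIONS.labels(status="success").inc()
        return result
    except Exception:
        PREDICTIONS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

start_http_server(9100)  # metrics are scraped from http://<pod>:9100/metrics
```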
For log aggregation and analysis:
- Centralized Logging: All service logs collected and indexed
- Structured Logging: JSON-formatted logs with standardized fields
- Log Retention: Configurable retention policies
# Access Kibana dashboard
kubectl port-forward svc/kibana 5601:5601
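Structured logging can be achieved with the standard library alone; a minimal sketch of a JSON formatter (the field names are illustrative) is:

```python
# Hedged sketch of structured JSON logging with the standard library; fields are illustrative.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("clip_har.inference")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("prediction served")
```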
Preconfigured Grafana dashboards provide visual monitoring:
- System Overview: Resource utilization across the cluster
- Model Performance: Inference times, accuracy metrics
- API Performance: Request rates, latencies, error rates
# Access Grafana dashboard
kubectl port-forward svc/grafana 3000:3000
The monitoring system includes alerting for critical conditions:
- Model Drift: Alert when accuracy metrics drop below thresholds
- Resource Constraints: Notify on memory/CPU/GPU pressure
- Error Rates: Alert on elevated API error rates
Alerts can be configured to notify through various channels (email, Slack, PagerDuty).
The CLIP HAR project implements a robust CI/CD pipeline to automate testing, building, and deployment processes while ensuring code quality and operational reliability.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Build │───▶│ Test │───▶│ Model Eval │───▶│ Artifacts │───▶│ Deploy │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
My CI/CD pipeline is implemented with GitHub Actions and consists of these key workflows:
1. Code quality and testing workflow
   - Trigger: On push to main/develop branches and pull requests
   - Jobs: Code linting (flake8, black, isort) and unit/integration tests
   - Benefits: Ensures code quality and prevents breaking changes
2. Model evaluation workflow
   - Trigger: When model code changes are pushed
   - Jobs: Pulls test data via DVC, evaluates model performance, uploads metrics
   - Hardware: Runs on GPU-enabled self-hosted runners
   - Benefits: Validates model performance before deployment
3. Container build workflow
   - Trigger: On pushes to main and version tags
   - Jobs: Builds Docker images with optimized caching, pushes to DockerHub
   - Benefits: Creates reproducible deployment artifacts
4. Deployment workflow
   - Trigger: After successful container builds or manual dispatch
   - Jobs: Applies Kubernetes configurations with rolling updates
   - Benefits: Zero-downtime deployments with health checking
The pipeline includes a weekly scheduled job for model retraining that:
- Pulls the latest dataset version from DVC
- Executes the automated training pipeline
- Pushes successful models to the model registry
- Can be manually triggered as needed
For production environments, I use ArgoCD for GitOps-based continuous delivery:
- Repository structure follows the GitOps pattern with environment-specific configurations
- ArgoCD syncs the Kubernetes cluster state with the declared configurations
- Promotion between environments (dev, staging, prod) via pull requests
- Immutable Artifacts: Container images are versioned and never modified
- Canary Deployments: New versions are deployed to a subset of users first
- Automated Rollbacks: Failed deployments trigger automatic rollbacks
- Metric Validation: Post-deployment checks verify system metrics
- Security Scanning: Container images are scanned for vulnerabilities
- Architecture Overview
- Docker Setup Guide
- API Reference
- Experiment Tracking Guide
- Training Guide
- HuggingFace Integration Guide
- Project Roadmap
This project is licensed under the MIT License.