This project implements an end-to-end solution for Human Action Recognition (HAR) using CLIP (Contrastive Language-Image Pre-training) models. The system supports multiple training modes including single-GPU, DistributedDataParallel (DDP), and FullyShardedDataParallel (FSDP).
- Advanced Model Architecture: Leverages CLIP with custom text prompts for zero-shot and fine-tuned action classification (a zero-shot sketch follows this list)
- Distributed Training: Supports single-GPU, DDP, and FSDP training modes
- Comprehensive Evaluation: Detailed metrics, confusion matrices, and per-class accuracy visualizations
- Dual Experiment Tracking: Support for both MLflow (self-hosted) and Weights & Biases (cloud) simultaneously
- Automated Training & Deployment: End-to-end automated training with DVC dataset versioning and HuggingFace Hub integration
- Production-Ready Inference: REST API for model serving with multiple model format support (PyTorch, ONNX, TorchScript)
- Data Version Control: DVC integration for dataset and model versioning
- Model Export: ONNX and TensorRT export with benchmarking
- Interactive UI: Streamlit-based web interface for model testing
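To make the zero-shot mode above concrete, here is a minimal sketch using the HuggingFace transformers CLIP API; the prompt template, class names, and image path are illustrative and the project's own prompt construction may differ:

```python
# Minimal zero-shot sketch with HuggingFace transformers.
# Prompt wording, class list, and image path are placeholders, not the project's defaults.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

actions = ["running", "sitting", "dancing"]
prompts = [f"a photo of a person {action}" for action in actions]

image = Image.open("example.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)  # image-to-text similarity
print(dict(zip(actions, probs[0].tolist())))
```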
The CLIP HAR project implements a comprehensive MLOps architecture with automation at its core, featuring interconnected components focused on human action recognition.
The system integrates:
- Data management with version control
- CLIP-based model architecture
- Distributed training capabilities
- Comprehensive evaluation metrics
- Production-ready deployment options
- Interactive user interfaces
For detailed architecture diagrams and component descriptions, see docs/architecture.md.
CLIP_HAR_PROJECT/
├── app/ # Streamlit application
├── configs/ # Configuration files
├── data/ # Data handling modules
│ ├── dataset.py # Dataset loading and preparation
│ ├── preprocessing.py # Data preprocessing utilities
│ └── augmentation.py # Augmentation strategies
├── deployment/ # Deployment utilities
│ └── export.py # Model export (ONNX, TensorRT)
├── evaluation/ # Evaluation modules
│ ├── evaluator.py # Evaluation orchestration
│ ├── metrics.py # Metric computation utilities
│ └── visualization.py # Result visualization
├── mlops/ # MLOps integration
│ ├── tracking.py # Unified tracking system (MLflow & wandb)
│ ├── dvc_utils.py # DVC integration utilities
│ ├── huggingface_hub_utils.py # HuggingFace Hub integration
│ ├── automated_training.py # Automated training module
│ └── inference_serving.py # Inference serving module
├── models/ # Model definitions
│ ├── clip_model.py # CLIP-based model
│ └── model_factory.py # Model creation utilities
├── pipeline/ # End-to-end pipelines
│ ├── training_pipeline.py # Training pipeline
│ └── inference_pipeline.py # Inference pipeline
├── training/ # Training modules
│ ├── distributed.py # Distributed training utilities
│ └── trainer.py # Trainer implementations
├── utils/ # Utility functions
├── docs/ # Documentation
│ ├── architecture.md # Detailed architecture overview
│ ├── docker_guide.md # Docker containerization guide
│ ├── api_reference.md # API reference documentation
│ └── experiment_tracking.md # Experiment tracking guide
├── custom_evaluate.py # Evaluation script
├── launch_distributed.py # Distributed training launcher
├── train.py # Training script
├── docker/ # Docker configuration files
│ ├── docker-compose.yml # Docker Compose configuration
│ ├── Dockerfile.train # Training container Dockerfile
│ ├── Dockerfile.app # App/Inference container Dockerfile
│ └── Dockerfile # Base Dockerfile
├── dvc.yaml # DVC pipeline definition
└── requirements.txt # Project dependencies
1. Clone the repository:

   git clone https://github.com/tuandung222/Open-vocabulary-Action-Recognition-with-CLIP.git
   cd Open-vocabulary-Action-Recognition-with-CLIP

2. Install the required packages:

   pip install -r requirements.txt
   pip install -e .

3. Set up DVC:

   # Initialize DVC if not already initialized
   dvc init
   # Add raw data to version control
   dvc add data/raw
For experiments and smaller datasets, run training on a single GPU:
python train.py --distributed_mode none --batch_size 128 --max_epochs 15 --lr 3e-6
For faster training with multiple GPUs using DistributedDataParallel (DDP):
# Launch DDP training using torchrun (automatically handles process creation)
python launch_distributed.py \
--distributed_mode ddp \
--batch_size 64 \
--max_epochs 15 \
--lr 3e-6 \
--output_dir outputs/ddp_training
For very large models or datasets, use Fully Sharded Data Parallel (FSDP) to shard model parameters across GPUs:
python launch_distributed.py \
--distributed_mode fsdp \
--batch_size 32 \
--max_epochs 15 \
--lr 2e-6 \
--output_dir outputs/fsdp_training
Under the hood, the distributed training in this project is implemented using PyTorch's distributed training capabilities, specifically DistributedDataParallel (DDP) and FullyShardedDataParallel (FSDP). Here's how it works:
1. Launcher Abstraction: When you run `launch_distributed.py`, it abstracts away the complexity of setting up distributed training:
   - It detects the number of available GPUs
   - It automatically configures the `torchrun` command with appropriate arguments
   - It launches the main training script (`train.py`) with the proper environment variables set

2. Behind the Scenes: The launcher is actually using `torchrun` (PyTorch's distributed launcher) to spawn multiple processes:

   # From launch_distributed.py
   cmd = [
       "torchrun",
       "--nproc_per_node", str(num_gpus),
       "--master_addr", args.master_addr,
       "--master_port", args.master_port,
       "train.py",
       # ... additional arguments
   ]

3. Process Management: Each GPU gets its own Python process with:
   - A unique `LOCAL_RANK` (GPU index)
   - A unique `RANK` (process index in the distributed group)
   - A shared `WORLD_SIZE` (total number of processes)

4. Trainer Integration: Inside the `DistributedTrainer` class, the distributed environment is automatically set up based on these environment variables:
   - The model is wrapped in DDP or FSDP depending on your choice
   - Distributed samplers are created for the datasets
   - Gradients are synchronized across processes during training
   - Only the main process (rank 0) performs logging and checkpoint saving
This approach makes distributed training much simpler to use, as you don't have to manually set up process groups, wrap models, or handle synchronization - the launcher and trainer handle all these details for you.
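As a rough illustration of this setup, a DDP wrapper driven by torchrun's environment variables might look like the sketch below; the function name is hypothetical and the project's `DistributedTrainer` may differ in detail:

```python
# Hedged sketch of environment-variable-driven DDP setup; not the project's actual code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap the model for DDP using the variables torchrun sets for each process."""
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if world_size > 1:
        dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE are read from the environment
        torch.cuda.set_device(local_rank)
        model = DDP(model.to(local_rank), device_ids=[local_rank])
    return model
```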
| Parameter | Description | Default |
|---|---|---|
| `--distributed_mode` | Training mode (`none`, `ddp`, `fsdp`) | `none` |
| `--model_name` | CLIP model name/path | `openai/clip-vit-base-patch16` |
| `--batch_size` | Training batch size (per GPU) | 256 |
| `--eval_batch_size` | Evaluation batch size (per GPU) | 128 |
| `--max_epochs` | Maximum number of training epochs | 15 |
| `--lr` | Learning rate | 3e-6 |
| `--unfreeze_visual` | Unfreeze visual encoder | False |
| `--unfreeze_text` | Unfreeze text encoder | False |
| `--no_mixed_precision` | Disable mixed precision training | False |
Run an end-to-end training pipeline with all components:
python -m CLIP_HAR_PROJECT.pipeline.training_pipeline \
--config_path configs/training_config.yaml \
--output_dir outputs/training_run \
--distributed_mode ddp \
--augmentation_strength medium
Automate the complete training pipeline with specific dataset versions and model checkpoints:
python -m CLIP_HAR_PROJECT.mlops.automated_training \
--config configs/training_config.yaml \
--output_dir outputs/auto_training \
--dataset_version v1.2 \
--checkpoint previous_models/checkpoint.pt \
--push_to_hub \
--experiment_name "clip_har_v2" \
--distributed_mode ddp
The automated training pipeline:
- Loads a specific dataset version using DVC
- Starts from a checkpoint if provided
- Runs the complete training pipeline
- Pushes the trained model to HuggingFace Hub
- Saves all results and metrics
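For illustration, pulling a file from a specific DVC dataset version programmatically could look like this sketch using the dvc.api module; the tracked file path and version tag are placeholders:

```python
# Hedged sketch: reading a DVC-tracked file at a specific revision (tag or commit).
# The data path and version tag below are placeholders.
import dvc.api

with dvc.api.open(
    "data/raw/annotations.csv",  # hypothetical DVC-tracked file
    repo="https://github.com/tuandung222/Open-vocabulary-Action-Recognition-with-CLIP",
    rev="v1.2",                  # dataset version tag
) as f:
    print(f.readline())
```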
For a comprehensive guide covering all training scenarios, distributed training options, and troubleshooting tips, see Training Guide.
Evaluate a trained model:
python custom_evaluate.py --model_path /path/to/checkpoint.pt --output_dir results
The evaluation produces:
- Accuracy, precision, recall, and F1 score
- Confusion matrix visualization
- Per-class accuracy analysis
- Detailed classification report
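The underlying computations are standard; a minimal sketch with scikit-learn, using dummy label arrays in place of the evaluator's outputs, would be:

```python
# Hedged sketch of the reported metrics, computed with scikit-learn on dummy labels.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

y_true = np.array([0, 1, 2, 2, 1])  # placeholder ground-truth class indices
y_pred = np.array([0, 1, 2, 1, 1])  # placeholder predicted class indices

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
cm = confusion_matrix(y_true, y_pred)
print("per-class accuracy:", cm.diagonal() / cm.sum(axis=1))
print(classification_report(y_true, y_pred))
```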
The project supports MLflow and Weights & Biases for experiment tracking, and both can be used simultaneously:
# Use both MLflow and wandb
python train.py --experiment_name "clip_har_experiment"
# Use only MLflow
python train.py --use_mlflow --no_wandb --experiment_name "clip_har_experiment"
# Use only wandb
python train.py --no_mlflow --use_wandb --experiment_name "clip_har_experiment"
# Disable all tracking
python train.py --no_tracking
# Specify custom MLflow port
python train.py --experiment_name "clip_har_experiment" --mlflow_port 5001
# Start MLflow tracking server
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 0.0.0.0
The MLflow UI will be accessible at http://localhost:5000 and provides:
- Experiment comparison
- Metric visualization
- Model versioning
- Artifact management
If you encounter issues with the MLflow server:
1. Address already in use error: If port 5000 is already taken, use a different port:

   mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlruns --host 0.0.0.0 --port 5001

   When using a custom port for the MLflow server, make sure to specify the same port in your training script:

   python train.py --use_mlflow --experiment_name "my_experiment" --mlflow_port 5001

2. Server running but dashboard empty: Make sure your training run explicitly enables MLflow:

   python train.py --use_mlflow --experiment_name "my_experiment"

3. Check if an MLflow server is already running:

   ps aux | grep mlflow

4. Restart a stuck server:

   pkill -f "mlflow server"

5. Remote access: To access the MLflow UI from a remote machine, use SSH tunneling:

   ssh -L 5000:localhost:5000 username@remote_server

   Then open http://localhost:5000 in your local browser. For a different port, adjust the command accordingly:

   ssh -L 5001:localhost:5001 username@remote_server
# Login to wandb
wandb login
# Run training with wandb project/group
python train.py --use_wandb --project_name "clip_har" --group_name "experiments"
The wandb dashboard provides:
- Real-time training monitoring
- Advanced visualizations
- Team collaboration
- Run comparisons
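Conceptually, dual tracking boils down to forwarding each metric to both backends; a minimal sketch follows (the project's mlops/tracking.py API may differ):

```python
# Hedged sketch of dual experiment tracking; not the project's actual tracking API.
import mlflow
import wandb

def log_metrics(metrics: dict, step: int, use_mlflow: bool = True, use_wandb: bool = True):
    """Forward the same metrics to MLflow and/or wandb."""
    if use_mlflow:
        mlflow.log_metrics(metrics, step=step)
    if use_wandb:
        wandb.log(metrics, step=step)

mlflow.set_experiment("clip_har_experiment")
wandb.init(project="clip_har", group="experiments")
with mlflow.start_run():
    log_metrics({"train/loss": 0.42, "train/accuracy": 0.88}, step=100)
```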
# Initialize DVC
dvc init
# Add dataset to DVC tracking
dvc add data/raw
# Run the pipeline
dvc repro
# Push data to remote storage (if configured)
dvc push
The DVC pipeline in `dvc.yaml` includes stages for:
- Data preparation
- Model training
- Evaluation
- Model export
This project supports multiple export formats to optimize models for different deployment scenarios:
python -m CLIP_HAR_PROJECT.deployment.export_clip_model \
--model_path outputs/trained_model.pt \
--export_format onnx torchscript tensorrt \
--benchmark
The Open Neural Network Exchange format provides cross-platform compatibility:
- Framework-independent model representation
- Optimized inference with ONNX Runtime
- Deployment on CPU, GPU, and specialized hardware
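As an illustration, running an exported model with ONNX Runtime might look like the sketch below; the file path and input tensor name are assumptions fixed at export time:

```python
# Hedged sketch of ONNX Runtime inference; the file name and tensor name are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("exports/model.onnx", providers=["CPUExecutionProvider"])
pixel_values = np.random.rand(1, 3, 224, 224).astype(np.float32)  # dummy preprocessed image
(logits,) = session.run(None, {"pixel_values": pixel_values})
print("predicted class index:", int(logits.argmax(axis=-1)[0]))
```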
PyTorch's serialization format for production deployment:
- C++ runtime compatibility
- Graph optimizations for faster inference
- Better portability than native PyTorch models
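A minimal tracing example, using a stand-in module rather than the actual CLIP classifier, looks like this:

```python
# Hedged sketch of TorchScript export via tracing; the module here is a stand-in.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 15)).eval()  # placeholder classifier
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)
traced.save("exports/model.torchscript")

reloaded = torch.jit.load("exports/model.torchscript")  # loadable from Python or the C++ runtime
print(reloaded(example).shape)  # torch.Size([1, 15])
```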
NVIDIA's high-performance inference optimizer:
- Maximum GPU acceleration
- Mixed precision support (FP32, FP16, INT8)
- Kernel fusion and other advanced optimizations
| Format | Inference Time | FPS | Relative Speed | Use Case |
|---|---|---|---|---|
| PyTorch | ~25-30ms | ~35 | 1x | Development, flexibility |
| TorchScript | ~18-22ms | ~50 | ~1.4x | Production CPU/GPU |
| ONNX | ~15-20ms | ~60 | ~1.7x | Cross-platform deploy |
| TensorRT | ~5-8ms | ~150 | ~4-5x | Maximum GPU performance |
For applications requiring maximum inference speed, my TensorRT integration provides substantial performance benefits.
- GPU-Optimized Inference: Up to 5x faster inference compared to standard PyTorch models
- Multiple Precision Support: FP32, FP16, and INT8 quantization options
- Dynamic Batch Processing: Configurable batch sizes for both real-time and batch processing
- Seamless API Integration: Uses the same inference API as other model formats
# Precision options: fp32, fp16, int8
python -m CLIP_HAR_PROJECT.deployment.export_clip_model \
--model_path outputs/trained_model.pt \
--config_path configs/training_config.yaml \
--export_format tensorrt \
--precision fp16 \
--batch_size 16 \
--validate \
--benchmark
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path exports/model.trt \
--model_type tensorrt \
--class_names outputs/class_names.json \
--port 8000
The provided Docker container has all necessary TensorRT dependencies pre-installed:
docker-compose -f docker/docker-compose.yml up clip-har-app
The project provides Docker containers for training and inference:
# Build containers
docker-compose -f docker/docker-compose.yml build
# Run training container
docker-compose -f docker/docker-compose.yml run clip-har-train
# Run app/inference container
docker-compose -f docker/docker-compose.yml up clip-har-app
For detailed Docker setup, see docs/docker_guide.md.
Deploy a model as a REST API for inference:
# Serve a PyTorch model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/trained_model.pt \
--model_type pytorch \
--port 8000
# Serve an ONNX model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/model.onnx \
--model_type onnx \
--class_names outputs/class_names.json \
--port 8001
# Serve a TorchScript model
python -m CLIP_HAR_PROJECT.mlops.inference_serving \
--model_path outputs/model.torchscript \
--model_type torchscript \
--port 8002
The inference API provides endpoints for:
- GET / - Get service information
- GET /health - Health check endpoint
- POST /predict - Run inference on an image (JSON), accepting either:
  - Image data as a base64 string
  - An image URL
- POST /predict/image - Run inference on an uploaded image (multipart/form-data)
Example API usage:
# Using the Python client
from CLIP_HAR_PROJECT.mlops.inference_serving import InferenceClient
client = InferenceClient(url="http://localhost:8000")
# Predict from image file
result = client.predict_from_image_path("path/to/image.jpg")
print(f"Top prediction: {result['predictions'][0]['class_name']}")
print(f"Confidence: {result['predictions'][0]['score']:.4f}")
# Predict from image URL
result = client.predict_from_image_url("https://example.com/image.jpg")
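For clients without the Python package, the same endpoints can be called over plain HTTP; a sketch with requests follows, where the multipart field name and JSON keys are assumptions that may differ from the actual service schema:

```python
# Hedged sketch of raw HTTP calls to the inference API; field names are assumptions.
import base64
import requests

# Upload an image file (multipart/form-data)
with open("path/to/image.jpg", "rb") as f:
    resp = requests.post("http://localhost:8000/predict/image", files={"file": f})
print(resp.json())

# Send a base64-encoded image as JSON
with open("path/to/image.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}
resp = requests.post("http://localhost:8000/predict", json=payload)
print(resp.json())
```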
For complete API reference, see docs/api_reference.md.
Push trained models to HuggingFace Hub:
from CLIP_HAR_PROJECT.mlops.huggingface_hub_utils import push_model_to_hub
# Push a trained model to HuggingFace Hub
model_url = push_model_to_hub(
model=model,
model_name="clip-har-v1",
repo_id="tuandunghcmut/clip-har-v1",
commit_message="Upload CLIP HAR model",
metadata={"accuracy": 0.92, "f1_score": 0.91},
private=False
)
print(f"Model uploaded to: {model_url}")
The project supports advanced HuggingFace Hub integration including:
- Automated model publishing during training
- Custom model cards with rich metadata
- Complete pipeline publishing for easier inference
- CI/CD integration through GitHub Actions
- Model versioning with tags and branches
For detailed instructions on these advanced features, see HuggingFace Integration Guide.
Run the Streamlit app:
streamlit run app/app.py
Features:
- Image upload for action classification
- Real-time webcam action recognition
- Model performance visualization
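A stripped-down version of the upload-and-classify flow might look like the sketch below; classify_image is a hypothetical stand-in for the project's actual inference call:

```python
# Hedged sketch of the Streamlit upload flow; classify_image is a hypothetical placeholder.
import streamlit as st
from PIL import Image

def classify_image(image: Image.Image) -> dict:
    # Placeholder: call the trained CLIP HAR model here.
    return {"running": 0.91, "sitting": 0.05, "dancing": 0.04}

st.title("CLIP HAR demo")
uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded)
    st.image(image, caption="Input image")
    st.write(classify_image(image))
```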
- Python 3.8+
- PyTorch 2.0+
- HuggingFace Transformers
- MLflow
- Weights & Biases
- DVC
- Streamlit
- ONNX Runtime
- FastAPI & Uvicorn (for inference serving)
The project uses the Human Action Recognition (HAR) dataset from HuggingFace, containing 15 action classes:
- calling
- clapping
- cycling
- dancing
- drinking
- eating
- fighting
- hugging
- laughing
- listening_to_music
- running
- sitting
- sleeping
- texting
- using_laptop
The CLIP-based model achieves:
- Zero-shot classification accuracy: ~81%
- Fine-tuned model accuracy: ~92%
For production deployments, I've prepared Kubernetes configurations to ensure scalable and reliable service operation. The deployment uses a microservices architecture with separate components for inference, model management, and monitoring.
The `kubernetes/` directory contains all necessary configuration files:
# Apply the entire configuration
kubectl apply -f kubernetes/
# Or apply individual components
kubectl apply -f kubernetes/clip-har-inference.yaml
kubectl apply -f kubernetes/clip-har-monitoring.yaml
- Inference Service: Scalable pods with auto-scaling based on CPU/GPU utilization
- Model Registry: Persistent storage for model versions
- API Gateway: Manages external access and load balancing
- HPA (Horizontal Pod Autoscaler): Automatically scales based on demand
The deployment is configured with appropriate resource requests and limits:
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
    nvidia.com/gpu: 1
  limits:
    memory: "4Gi"
    cpu: "2"
    nvidia.com/gpu: 1
I've implemented a comprehensive monitoring stack using industry-standard tools for observability and performance tracking.
The system uses Prometheus to collect and store time-series metrics:
- Custom Metrics: Model inference latency, throughput, GPU utilization
- System Metrics: Node resource utilization, network throughput
- Business Metrics: Requests per minute, success rates
# Access Prometheus dashboard
kubectl port-forward svc/prometheus-server 9090:9090
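Custom metrics such as inference latency can be exposed with the prometheus_client library; the metric names and port below are illustrative, not the project's actual configuration:

```python
# Hedged sketch of custom Prometheus metrics; names and port are illustrative.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("clip_har_predictions_total", "Prediction requests", ["status"])
LATENCY = Histogram("clip_har_inference_seconds", "Inference latency in seconds")

def predict_with_metrics(run_inference, image):
    """Wrap an inference call so latency and outcome counts are recorded."""
    start = time.perf_counter()
    try:
        result = run_inference(image)
        PREDICTIONS.labels(status="success").inc()
        return result
    except Exception:
        PREDICTIONS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.perf_counter() - start)

start_http_server(9100)  # metrics are scraped from http://<pod>:9100/metrics
```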
For log aggregation and analysis:
- Centralized Logging: All service logs collected and indexed
- Structured Logging: JSON-formatted logs with standardized fields
- Log Retention: Configurable retention policies
# Access Kibana dashboard
kubectl port-forward svc/kibana 5601:5601
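Structured logging can be achieved with the standard library alone; a minimal sketch of a JSON formatter (the field names are illustrative) is:

```python
# Hedged sketch of structured JSON logging with the standard library; fields are illustrative.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("clip_har.inference")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("prediction served")
```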
Preconfigured Grafana dashboards provide visual monitoring:
- System Overview: Resource utilization across the cluster
- Model Performance: Inference times, accuracy metrics
- API Performance: Request rates, latencies, error rates
# Access Grafana dashboard
kubectl port-forward svc/grafana 3000:3000
The monitoring system includes alerting for critical conditions:
- Model Drift: Alert when accuracy metrics drop below thresholds
- Resource Constraints: Notify on memory/CPU/GPU pressure
- Error Rates: Alert on elevated API error rates
Alerts can be configured to notify through various channels (email, Slack, PagerDuty).
The CLIP HAR project implements a robust CI/CD pipeline to automate testing, building, and deployment processes while ensuring code quality and operational reliability.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Build │───▶│ Test │───▶│ Model Eval │───▶│ Artifacts │───▶│ Deploy │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
My CI/CD pipeline is implemented with GitHub Actions and consists of these key workflows:
1. Code quality and testing workflow
   - Trigger: On push to main/develop branches and pull requests
   - Jobs: Code linting (flake8, black, isort) and unit/integration tests
   - Benefits: Ensures code quality and prevents breaking changes
2. Model evaluation workflow
   - Trigger: When model code changes are pushed
   - Jobs: Pulls test data via DVC, evaluates model performance, uploads metrics
   - Hardware: Runs on GPU-enabled self-hosted runners
   - Benefits: Validates model performance before deployment
3. Container build workflow
   - Trigger: On pushes to main and version tags
   - Jobs: Builds Docker images with optimized caching, pushes to DockerHub
   - Benefits: Creates reproducible deployment artifacts
4. Deployment workflow
   - Trigger: After successful container builds or manual dispatch
   - Jobs: Applies Kubernetes configurations with rolling updates
   - Benefits: Zero-downtime deployments with health checking
The pipeline includes a weekly scheduled job for model retraining that:
- Pulls the latest dataset version from DVC
- Executes the automated training pipeline
- Pushes successful models to the model registry
- Can be manually triggered as needed
For production environments, I use ArgoCD for GitOps-based continuous delivery:
- Repository structure follows the GitOps pattern with environment-specific configurations
- ArgoCD syncs the Kubernetes cluster state with the declared configurations
- Promotion between environments (dev, staging, prod) via pull requests
- Immutable Artifacts: Container images are versioned and never modified
- Canary Deployments: New versions are deployed to a subset of users first
- Automated Rollbacks: Failed deployments trigger automatic rollbacks
- Metric Validation: Post-deployment checks verify system metrics
- Security Scanning: Container images are scanned for vulnerabilities
- Architecture Overview
- Docker Setup Guide
- API Reference
- Experiment Tracking Guide
- Training Guide
- HuggingFace Integration Guide
- Project Roadmap
This project is licensed under the MIT License.