A Dagster-based orchestrator for processing digital assets following the OAIS reference model. Manages Submission Information Packages (SIPs) with support for multiple metadata standards including Dublin Core, METS, and PREMIS for comprehensive digital preservation.
Note
This project is in alpha and actively evolving. APIs and behaviors may change. Provided as-is, with no guarantees on stability.
- Features
- Quick Start
- Prerequisites
- Dev Toolchain
- Architecture
- Usage
- Commands Reference
- Configuration
- Project Structure
- Dependency Management
- Troubleshooting
- Domain Terms
- License
- Further Reading
- Parses METS XML files into validated Pydantic models following OAIS standards
- Extracts Dublin Core descriptive metadata and PREMIS preservation metadata
- Validates file fixity information (MD5, SHA-1, SHA-256, SHA-512 checksums)
- Automated file monitoring via Dagster sensors
- Hot-reload development with instant feedback
- Kubernetes-ready with Helm chart configuration
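
To illustrate the fixity feature, here is a minimal sketch of checksum verification using only the standard library. The function names `compute_checksum` and `verify_fixity` and the `SUPPORTED_ALGORITHMS` mapping are assumptions made for illustration; they are not taken from the project's code, which may validate fixity differently.

```python
import hashlib
from pathlib import Path

# Algorithms named in the feature list, mapped to hashlib constructor names.
# The mapping itself is hypothetical, not the project's API.
SUPPORTED_ALGORITHMS = {
    "MD5": "md5",
    "SHA-1": "sha1",
    "SHA-256": "sha256",
    "SHA-512": "sha512",
}


def compute_checksum(path: Path, algorithm: str, chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(SUPPORTED_ALGORITHMS[algorithm.upper()])
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, algorithm: str, expected: str) -> bool:
    """Compare a freshly computed digest with the recorded value, ignoring case."""
    return compute_checksum(path, algorithm) == expected.strip().lower()
```

Reading in chunks keeps memory use flat for large preservation files; an unsupported algorithm name raises `KeyError` in this sketch.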
Clone the repository and enter the development environment:

    git clone https://github.com/eth-library/data-assets-pipeline.git
    cd data-assets-pipeline
    direnv allow

If not using direnv, activate the environment manually:

    nix develop

Start the Dagster development server:

    dagster dev

Open http://localhost:3000 in your browser.
Ensure you have the following installed:
- Python 3.12+
- uv (Python package manager)
Set up the project:

    git clone https://github.com/eth-library/data-assets-pipeline.git
    cd data-assets-pipeline
    uv venv --python python3.12
    source .venv/bin/activate
    uv sync

The dap command becomes available automatically because the CLI is a uv workspace member.

Start the Dagster development server:

    dagster dev

Open http://localhost:3000 in your browser.
| Tool | Purpose | Installation |
|---|---|---|
| Nix | Development environment (optional but recommended) | Install guide |
| direnv | Automatic environment loading | Install guide |
| nix-direnv | Fast Nix + direnv integration | Install guide |
| Python 3.12+ | Runtime | python.org |
| uv | Fast Python package manager | curl -LsSf https://astral.sh/uv/install.sh \| sh |
For Kubernetes development, you'll also need:
- Docker Desktop with Kubernetes enabled
- kubectl
- helm
The project uses a layered toolchain where each layer builds on the one below. See cli/README.md for the full CLI reference.
nix flakes Reproducible packages — pins exact versions of Python, uv, kubectl, etc.
└─ direnv Auto-loading shell env — activates when you cd into the project
└─ nix-direnv Cached flake evaluation — avoids re-evaluating the flake on every shell
└─ uv Fast Python deps — installs packages from the lockfile in milliseconds
└─ dap Ergonomic commands — wraps pytest, ruff, mypy, helm, kubectl
Why each layer exists:
- Nix flakes (flake.nix): ensures every developer has identical tool versions. No "works on my machine".
- direnv (.envrc): automatically loads the Nix environment when you enter the project directory. No manual nix develop.
- nix-direnv: caches the Nix evaluation so shell startup stays fast. Without it, entering the directory would re-evaluate the flake every time.
- uv (pyproject.toml, uv.lock): manages Python dependencies. Fast, deterministic, lockfile-based.
- dap CLI (cli/): the commands documented below. Wraps quality checks, environment management, and Kubernetes deployment.
The pipeline processes XML files through a sequence of Dagster assets, each extracting a specific layer of the preservation package hierarchy:
| Asset | Input | Output | Description |
|---|---|---|---|
| sip_asset | XML file paths | SIPModel | Parses METS XML into a structured SIP model |
| intellectual_entities | SIPModel | list[IEModel] | Extracts intellectual entity metadata with Dublin Core |
| representations | list[IEModel] | list[RepresentationModel] | Collects file representations (preservation, access, original) |
| files | list[RepresentationModel] | list[FileModel] | Extracts file metadata including MIME types and paths |
| fixities | list[FileModel] | list[FixityModel] | Extracts and validates file checksums |

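
The asset chain can be pictured as successive flattening of a nested package. The sketch below uses plain dataclasses as stand-ins for the Pydantic models; every class name, field name, and function name here is hypothetical, chosen only to mirror the hierarchy in the table, and the real models are likely richer.

```python
from dataclasses import dataclass, field


@dataclass
class Fixity:
    algorithm: str
    value: str


@dataclass
class File:
    path: str
    mime_type: str
    fixities: list[Fixity] = field(default_factory=list)


@dataclass
class Representation:
    usage: str  # e.g. "preservation", "access", or "original"
    files: list[File] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    title: str
    representations: list[Representation] = field(default_factory=list)


@dataclass
class SIP:
    intellectual_entities: list[IntellectualEntity] = field(default_factory=list)


# Each step takes the previous step's output and flattens one level.
def extract_intellectual_entities(sip: SIP) -> list[IntellectualEntity]:
    return list(sip.intellectual_entities)


def extract_representations(ies: list[IntellectualEntity]) -> list[Representation]:
    return [rep for ie in ies for rep in ie.representations]


def extract_files(reps: list[Representation]) -> list[File]:
    return [f for rep in reps for f in rep.files]


def extract_fixities(files: list[File]) -> list[Fixity]:
    return [fx for f in files for fx in f.fixities]
```

In the actual pipeline each step is a Dagster asset, so the intermediate lists are materialized and visible in the UI rather than passed directly between function calls.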
The xml_file_sensor monitors a configured directory for new XML files and automatically triggers the pipeline. By default, it watches da_pipeline_tests/test_data/ every 30 seconds.
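
The core of such a sensor is a directory scan that reports only files it has not seen before. The following is a sketch of that idea in plain Python, not the project's sensor code: `find_new_xml_files` and the `seen` set are hypothetical, and matching the `.xml` suffix case-sensitively is an assumption.

```python
from pathlib import Path


def find_new_xml_files(directory: Path, seen: set[str]) -> list[Path]:
    """Return .xml files in `directory` not yet recorded in `seen`, sorted by name.

    `seen` is updated in place so that a later scan skips the same files.
    """
    new_files = [
        path
        for path in sorted(directory.iterdir())
        if path.is_file() and path.suffix == ".xml" and path.name not in seen
    ]
    seen.update(path.name for path in new_files)
    return new_files
```

A real Dagster sensor would typically persist this state through the sensor cursor or run keys instead of an in-memory set, and the 30-second interval would be configured on the sensor itself.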
Start the Dagster UI with hot-reload:

    dagster dev

Run the test suite:

    dap test

To run the pipeline manually via the Dagster UI launchpad, provide a run configuration:

    ops:
      sip_asset:
        config:
          file_paths:
            - /path/to/your/mets.xml

Requires Docker Desktop with Kubernetes enabled (Settings > Kubernetes > Enable).

Deploy to local Kubernetes:

    dap k8s up

The Dagster UI will be available at http://localhost:8080.

Rebuild and restart after code changes:

    dap k8s restart

Tear down the deployment:

    dap k8s down

Run dap --help to see all available commands.

| Command | Description |
|---|---|
| dap test [--scope core\|cli\|all] | Run tests with pytest |
| dap lint [--fix] [--scope ...] | Check code style and formatting with ruff |
| dap typecheck [--scope ...] | Run type checking with mypy |
| dap check [--scope ...] | Run all quality checks (ruff, mypy, pytest) |

The --scope flag controls which code is checked:
- core (default): da_pipeline, da_pipeline_tests
- cli: cli/dap_cli, cli/tests
- all: both core and CLI

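
The scope-to-path mapping above can be expressed as a small lookup. This is only an illustrative sketch: `SCOPE_PATHS` and `resolve_scope` are hypothetical names, and the dap CLI may resolve scopes differently internally.

```python
# Paths per scope, copied from the list above; "all" is derived from the other two.
SCOPE_PATHS: dict[str, list[str]] = {
    "core": ["da_pipeline", "da_pipeline_tests"],
    "cli": ["cli/dap_cli", "cli/tests"],
}


def resolve_scope(scope: str = "core") -> list[str]:
    """Return the directories a quality check should cover for the given scope."""
    if scope == "all":
        return SCOPE_PATHS["core"] + SCOPE_PATHS["cli"]
    if scope not in SCOPE_PATHS:
        raise ValueError(f"unknown scope: {scope!r} (expected core, cli, or all)")
    return list(SCOPE_PATHS[scope])
```

Deriving `all` from the two base scopes means a new path only has to be added in one place.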
| Command | Description |
|---|---|
| dap welcome | Show welcome banner and environment info |
| dap tools [--all] | Show installed tool versions and paths |
| dap env | Show environment paths and status |
| dap clean [--yes] | Remove .venv and caches |
| dap reset [--yes] | Clean and reinstall dependencies |

dap tools shows the Python toolchain by default. Pass --all to include nix, direnv, kubectl, and helm.
dap clean and dap reset prompt for confirmation. Pass --yes / -y to skip (for CI/scripts).
| Command | Description |
|---|---|
| dap k8s up | Build and deploy to local Kubernetes |
| dap k8s down [--yes] | Tear down deployment |
| dap k8s restart | Rebuild image and rollout restart |
| dap k8s status | Show pods and services |
| dap k8s logs | Stream user code pod logs |
| dap k8s shell | Interactive shell in user code pod |

These commands show how to use tools that are available directly in your shell:
| Command | Description |
|---|---|
| dap uv | Common uv commands (sync, lock, add, run) |
| dap dagster | Common dagster/dg commands |
| dap direnv | Common direnv commands (allow, reload, status) |

| Flag | Description |
|---|---|
| --version / -V | Show CLI version |
| --help | Show help |

For working on the dap CLI itself (Python, Typer + Rich), see cli/CONTRIBUTING.md.
| Variable | Description | Default |
|---|---|---|
| DAGSTER_HOME | Dagster instance directory | Project root (set by .envrc) |
| DAGSTER_TEST_DATA_PATH | Directory containing METS XML files for the sensor to monitor | da_pipeline_tests/test_data |
| DAP_THEME | Override terminal background detection for colours (light or dark) | unset |
| DAP_QUIET | Set to 1 to suppress Quick Start section in dap welcome | unset |
| NO_COLOR | Set to 1 to disable all colour output (also respected in CI) | unset |

Copy .env.example to .env and modify as needed.
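
As a rough illustration of how these variables and their defaults could be read, the sketch below collects them into one object. The `Settings` class and `load_settings` function are hypothetical and not part of the project; treating only the value `1` as "on" follows the table, and ignoring unrecognised `DAP_THEME` values is an assumption.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Settings:
    test_data_path: Path
    theme: Optional[str]  # "light", "dark", or None when unset
    quiet: bool
    no_color: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    theme = env.get("DAP_THEME")
    if theme not in ("light", "dark"):
        theme = None  # assumption: unrecognised values fall back to auto-detection
    return Settings(
        test_data_path=Path(env.get("DAGSTER_TEST_DATA_PATH", "da_pipeline_tests/test_data")),
        theme=theme,
        quiet=env.get("DAP_QUIET") == "1",
        no_color=env.get("NO_COLOR") == "1",
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in tests without touching the real environment.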
| File | Purpose |
|---|---|
| flake.nix | Nix development environment (see below) |
| cli/ | Python CLI source code (see cli/CONTRIBUTING.md) |
| pyproject.toml | Python project metadata and dependencies |
| dagster.yaml | Dagster instance configuration |
| config.yaml | Example run configuration for manual pipeline execution |
| helm/values.yaml | Base Helm values for Kubernetes deployment |
| helm/values-local.yaml | Local Kubernetes overrides |
| helm/pvc.yaml | Persistent volume claim for Dagster storage |

The Nix flake provides a single development shell activated via direnv allow or nix develop. It includes Python, uv, kubectl, and helm.
cli/ # dap CLI (Python) — see cli/CONTRIBUTING.md
├── dap_cli/
│ ├── app.py # Entry point — registers all commands
│ ├── theme.py # Rich console, ETH brand colors, symbols
│ ├── commands/ # Command modules (dev, env, hints, k8s)
│ └── utils/ # Subprocess helpers
└── tests/ # CLI tests
da_pipeline/ # Main package (Python)
├── definitions.py # Dagster entry point (Definitions)
├── assets.py # Pipeline assets
├── sensors.py # File monitoring sensor and job definition
├── mets_parser.py # METS XML parsing logic
├── pydantic_models.py # OAIS-compliant data models
└── utils.py # Helper functions for metadata collection
da_pipeline_tests/ # Tests
├── test_data/ # Sample METS XML files
│ └── synthetic_sip.xml # Synthetic test data
├── test_assets.py # Asset tests
└── conftest.py # Pytest fixtures
helm/ # Kubernetes configuration
├── values.yaml # Base Helm values
├── values-local.yaml # Local development overrides
└── pvc.yaml # Persistent volume claim
Dependencies are managed with uv:
Install or update dependencies:

    uv sync

Update the lock file:

    uv lock

Add a new dependency:

    uv add <package>

Ensure no other process is using port 3000 (local) or 8080 (Kubernetes):

    lsof -i :3000

Verify Kubernetes is running:

    kubectl cluster-info

Ensure Docker Desktop Kubernetes is enabled in Settings > Kubernetes > Enable Kubernetes.

Check the DAGSTER_TEST_DATA_PATH environment variable points to a directory containing .xml files. The sensor only processes files with the .xml extension.
Ensure dependencies are installed:

    uv sync

- METS (Metadata Encoding and Transmission Standard): XML schema for encoding metadata about digital objects
- OAIS (Open Archival Information System): ISO reference model for digital preservation systems
- SIP (Submission Information Package): A package submitted to an archive for ingest
- IE (Intellectual Entity): A distinct unit of information to be preserved (e.g., a document, dataset)
Apache License 2.0 - (c) 2026 ETH Zurich
- Dagster Documentation - Data orchestration platform
- Pydantic Documentation - Data validation library
- METS (Metadata Encoding and Transmission Standard)
- OAIS Reference Model (ISO 14721)
- Dublin Core Metadata
- PREMIS (Preservation Metadata)
- uv - Fast Python package manager
- Typer - Python CLI framework
- Rich - Terminal formatting library
- Nix - Reproducible development environments
- direnv - Automatic environment loading
- Ruff - Python linter and formatter
- Helm - Kubernetes package manager
- Dagster Helm Chart