Skip to content

eth-library/data-assets-pipeline

Repository files navigation

Data Archive Pipeline (DAP) Orchestrator

A Dagster-based orchestrator for processing digital assets following the OAIS reference model. Manages Submission Information Packages (SIPs) with support for multiple metadata standards including Dublin Core, METS, and PREMIS for comprehensive digital preservation.

Note

This project is in alpha and actively evolving. APIs and behaviors may change. Provided as-is, with no guarantees on stability.

Contents

Features

  • Parses METS XML files into validated Pydantic models following OAIS standards
  • Extracts Dublin Core descriptive metadata and PREMIS preservation metadata
  • Validates file fixity information (MD5, SHA-1, SHA-256, SHA-512 checksums)
  • Automated file monitoring via Dagster sensors
  • Hot-reload development with instant feedback
  • Kubernetes-ready with Helm chart configuration

Quick Start

With Nix (Recommended)

Clone the repository and enter the development environment:

git clone https://github.com/eth-library/data-assets-pipeline.git
cd data-assets-pipeline
direnv allow

If not using direnv, activate the environment manually:

nix develop

Start the Dagster development server:

dagster dev

Open http://localhost:3000 in your browser.

Without Nix

Ensure you have the following installed:

  • Python 3.12+
  • uv (Python package manager)

Set up the project:

git clone https://github.com/eth-library/data-assets-pipeline.git
cd data-assets-pipeline
uv venv --python python3.12
source .venv/bin/activate
uv sync

The dap command becomes available automatically because the CLI is a UV workspace member.

Start the Dagster development server:

dagster dev

Open http://localhost:3000 in your browser.

Prerequisites

Tool Purpose Installation
Nix Development environment (optional but recommended) Install guide
direnv Automatic environment loading Install guide
nix-direnv Fast Nix + direnv integration Install guide
Python 3.12+ Runtime python.org
uv Fast Python package manager curl -LsSf https://astral.sh/uv/install.sh | sh

For Kubernetes development, you'll also need:

  • Docker Desktop with Kubernetes enabled
  • kubectl
  • helm

Dev Toolchain

The project uses a layered toolchain where each layer builds on the one below. See cli/README.md for the full CLI reference.

nix flakes        Reproducible packages — pins exact versions of Python, uv, kubectl, etc.
  └─ direnv       Auto-loading shell env — activates when you cd into the project
      └─ nix-direnv   Cached flake evaluation — avoids re-evaluating the flake on every shell
          └─ uv       Fast Python deps — installs packages from the lockfile in milliseconds
              └─ dap   Ergonomic commands — wraps pytest, ruff, mypy, helm, kubectl

Why each layer exists:

  • Nix flakes (flake.nix): ensures every developer has identical tool versions. No "works on my machine".
  • direnv (.envrc): automatically loads the nix environment when you enter the project directory. No manual nix develop.
  • nix-direnv: caches the nix evaluation so shell startup stays fast. Without it, entering the directory would re-evaluate the flake every time.
  • uv (pyproject.toml, uv.lock): manages Python dependencies. Fast, deterministic, lockfile-based.
  • dap CLI (cli/): the commands documented below. Wraps quality checks, environment management, and Kubernetes deployment.

Architecture

The pipeline processes XML files through a sequence of Dagster assets, each extracting a specific layer of the preservation package hierarchy:

Assets

Asset Input Output Description
sip_asset XML file paths SIPModel Parses METS XML into a structured SIP model
intellectual_entities SIPModel list[IEModel] Extracts intellectual entity metadata with Dublin Core
representations list[IEModel] list[RepresentationModel] Collects file representations (preservation, access, original)
files list[RepresentationModel] list[FileModel] Extracts file metadata including MIME types and paths
fixities list[FileModel] list[FixityModel] Extracts and validates file checksums

Sensor

The xml_file_sensor monitors a configured directory for new XML files and automatically triggers the pipeline. By default, it watches da_pipeline_tests/test_data/ every 30 seconds.

Usage

Local Development

Start the Dagster UI with hot-reload:

dagster dev

Run the test suite:

dap test

Manual Pipeline Runs

To run the pipeline manually via the Dagster UI launchpad, provide a run configuration:

ops:
  sip_asset:
    config:
      file_paths:
        - /path/to/your/mets.xml

Kubernetes Deployment

Requires Docker Desktop with Kubernetes enabled (Settings > Kubernetes > Enable).

Deploy to local Kubernetes:

dap k8s up

The Dagster UI will be available at http://localhost:8080.

Rebuild and restart after code changes:

dap k8s restart

Tear down the deployment:

dap k8s down

Commands Reference

Run dap --help to see all available commands.

Development

Command Description
dap test [--scope core|cli|all] Run tests with pytest
dap lint [--fix] [--scope ...] Check code style and formatting with ruff
dap typecheck [--scope ...] Run type checking with mypy
dap check [--scope ...] Run all quality checks (ruff, mypy, pytest)

The --scope flag controls which code is checked:

  • core (default): da_pipeline, da_pipeline_tests
  • cli: cli/dap_cli, cli/tests
  • all: both core and CLI

Environment

Command Description
dap welcome Show welcome banner and environment info
dap tools [--all] Show installed tool versions and paths
dap env Show environment paths and status
dap clean [--yes] Remove .venv and caches
dap reset [--yes] Clean and reinstall dependencies

dap tools shows the Python toolchain by default. Pass --all to include nix, direnv, kubectl, and helm.

dap clean and dap reset prompt for confirmation. Pass --yes / -y to skip (for CI/scripts).

Kubernetes

Command Description
dap k8s up Build and deploy to local Kubernetes
dap k8s down [--yes] Tear down deployment
dap k8s restart Rebuild image and rollout restart
dap k8s status Show pods and services
dap k8s logs Stream user code pod logs
dap k8s shell Interactive shell in user code pod

Hint Commands

These commands show how to use tools that are available directly in your shell:

Command Description
dap uv Common uv commands (sync, lock, add, run)
dap dagster Common dagster/dg commands
dap direnv Common direnv commands (allow, reload, status)

Global Flags

Flag Description
--version / -V Show CLI version
--help Show help

CLI Development

For working on the dap CLI itself (Python, Typer + Rich), see cli/CONTRIBUTING.md.

Configuration

Environment Variables

Variable Description Default
DAGSTER_HOME Dagster instance directory Project root (set by .envrc)
DAGSTER_TEST_DATA_PATH Directory containing METS XML files for the sensor to monitor da_pipeline_tests/test_data
DAP_THEME Override terminal background detection for colours (light or dark) unset
DAP_QUIET Set to 1 to suppress Quick Start section in dap welcome unset
NO_COLOR Set to 1 to disable all colour output (also respected in CI) unset

Copy .env.example to .env and modify as needed.

Configuration Files

File Purpose
flake.nix Nix development environment with multiple shells (see below)
cli/ Python CLI source code (see cli/CONTRIBUTING.md)
pyproject.toml Python project metadata and dependencies
dagster.yaml Dagster instance configuration
config.yaml Example run configuration for manual pipeline execution
helm/values.yaml Base Helm values for Kubernetes deployment
helm/values-local.yaml Local Kubernetes overrides
helm/pvc.yaml Persistent volume claim for Dagster storage

Development Shells

The Nix flake provides a single development shell activated via direnv allow or nix develop. It includes Python, uv, kubectl, and helm.

Project Structure

cli/                         # dap CLI (Python) — see cli/CONTRIBUTING.md
├── dap_cli/
│   ├── app.py               # Entry point — registers all commands
│   ├── theme.py             # Rich console, ETH brand colors, symbols
│   ├── commands/            # Command modules (dev, env, hints, k8s)
│   └── utils/               # Subprocess helpers
└── tests/                   # CLI tests

da_pipeline/                 # Main package (Python)
├── definitions.py           # Dagster entry point (Definitions)
├── assets.py                # Pipeline assets
├── sensors.py               # File monitoring sensor and job definition
├── mets_parser.py           # METS XML parsing logic
├── pydantic_models.py       # OAIS-compliant data models
└── utils.py                 # Helper functions for metadata collection

da_pipeline_tests/           # Tests
├── test_data/               # Sample METS XML files
│   └── synthetic_sip.xml    # Synthetic test data
├── test_assets.py           # Asset tests
└── conftest.py              # Pytest fixtures

helm/                        # Kubernetes configuration
├── values.yaml              # Base Helm values
├── values-local.yaml        # Local development overrides
└── pvc.yaml                 # Persistent volume claim

Dependency Management

Dependencies are managed with uv:

Install or update dependencies:

uv sync

Update the lock file:

uv lock

Add a new dependency:

uv add <package>

Troubleshooting

Dagster UI not loading

Ensure no other process is using port 3000 (local) or 8080 (Kubernetes):

lsof -i :3000

Kubernetes deployment fails

Verify Kubernetes is running:

kubectl cluster-info

Ensure Docker Desktop Kubernetes is enabled in Settings > Kubernetes > Enable Kubernetes.

Sensor not detecting files

Check the DAGSTER_TEST_DATA_PATH environment variable points to a directory containing .xml files. The sensor only processes files with the .xml extension.

Test failures

Ensure dependencies are installed:

uv sync

Domain Terms

  • METS (Metadata Encoding and Transmission Standard): XML schema for encoding metadata about digital objects
  • OAIS (Open Archival Information System): ISO reference model for digital preservation systems
  • SIP (Submission Information Package): A package submitted to an archive for ingest
  • IE (Intellectual Entity): A distinct unit of information to be preserved (e.g., a document, dataset)

License

Apache License 2.0 - (c) 2026 ETH Zurich

Further Reading

Core Technologies

Digital Preservation Standards

Development Tools

  • uv - Fast Python package manager
  • Typer - Python CLI framework
  • Rich - Terminal formatting library
  • Nix - Reproducible development environments
  • direnv - Automatic environment loading
  • Ruff - Python linter and formatter

Kubernetes

About

A pipeline for ingesting and processing digital archive assets, extracting metadata from METS files, and orchestrating archive workflows.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors