DataChain

The Context Layer for unstructured data

The Model Floor Is the Same for Everyone. The Context Ceiling Is Yours.

A Python library that turns files in S3, GCS, and Azure into versioned, typed datasets, queryable at warehouse speed.

Bytes never leave your storage. Two core components: a Compute Engine for distributed Python over files and a Dataset DB for warehouse-speed queries over Pydantic-typed records. For agent workflows, two more: a Knowledge Base of markdown summaries and an Agent Harness (skill + MCP) that plugs all of it into Claude Code, Cursor, and Codex, so they understand your data.
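
To make the idea of versioned, typed records concrete, here is a minimal conceptual sketch in plain Python. It does not use DataChain's API: `FileRecord`, `DatasetStore`, and the `s3://example-bucket/...` URIs are hypothetical names made up for illustration, and a standard-library dataclass stands in for a Pydantic model. It only shows the pattern described above: records hold typed metadata that points at files, each save creates a new dataset version, and queries run over the records without reading any file bytes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FileRecord:
    """Typed metadata about one object; the bytes stay in storage."""
    uri: str
    size: int
    label: str


class DatasetStore:
    """Toy in-memory stand-in for a versioned dataset store (hypothetical)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[FileRecord, ...]]] = {}

    def save(self, name: str, records: List[FileRecord]) -> int:
        """Store an immutable snapshot and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(tuple(records))
        return len(versions)

    def read(self, name: str, version: Optional[int] = None) -> Tuple[FileRecord, ...]:
        """Return the requested version, or the latest when version is None."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]


store = DatasetStore()
v1 = store.save("images", [
    FileRecord("s3://example-bucket/a.jpg", 1200, "cat"),
    FileRecord("s3://example-bucket/b.jpg", 3400, "dog"),
])
v2 = store.save("images", [
    FileRecord("s3://example-bucket/a.jpg", 1200, "cat"),
    FileRecord("s3://example-bucket/b.jpg", 3400, "dog"),
    FileRecord("s3://example-bucket/c.jpg", 5600, "cat"),
])

# Query metadata only: no object bytes are read.
cats = [r.uri for r in store.read("images") if r.label == "cat"]
```

Earlier versions stay readable after a new save (`store.read("images", 1)` still returns the two original records), which is the property that makes a dataset reproducible.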

Get started

  • 🤖 Agents - knowledge base for Claude Code, Codex, and Cursor
  • 🐍 Python - full control over data processing
  • 💡 Concepts - the Dataset DB, the Compute Engine, and the Knowledge Base
  • 🧩 Use Cases - patterns where the harness changes the work

[Figure: DataChain architecture: AI agents and humans, the Dataset DB at the centre, and object storage]