Today we’re open-sourcing kafka-connect-ai — a single, Apache 2.0 licensed Kafka Connect connector that replaces hundreds of single-purpose connectors with 12 protocol adapters and an LLM-powered transformation pipeline.
The technical breakthroughs that make this possible:

Compiled transforms — the LLM generates reusable transformation code (declarative mappings at ~500ns/record, sandboxed GraalVM JavaScript at ~2μs/record) instead of being called per record.
4-tier model routing — simple records are automatically sent to the cheapest capable model, or bypass the LLM entirely for deterministic transforms.
Semantic caching — Redis vector similarity eliminates duplicate LLM calls.
PII masking — sensitive fields are stripped before they ever reach the model.

The result: a universal connector that handles HTTP, JDBC, MongoDB, gRPC, Kafka-to-Kafka, Redis, Cassandra, Kinesis, cloud data warehouses, WebSocket/SSE streaming, LDAP, and SAP — with LLM costs that converge toward zero as the system learns your schemas.
The Connector Sprawl Problem
Ask any Kafka engineer what keeps them up at night, and it is rarely the broker. It is Connect. More specifically, it is the ever-expanding sprawl of connector JARs, configuration models, dependency conflicts, and version mismatches that accumulates in every mature Kafka deployment until it becomes a full-time job.
200 Connectors, One Fundamental Operation
The OSO engineers regularly walk into organisations running fifteen to twenty different connector JARs in a single Kafka Connect cluster. Each has its own version, its own transitive dependencies, and its own classpath conflicts with the others. Maintaining that ecosystem — updating JARs, resolving dependency collisions, debugging serialisation mismatches — consumes two to three engineers who should be building pipelines.
Every connector also introduces a new mental model. The JDBC connector structures its configuration differently from the HTTP connector. The S3 connector has its own authentication surface. The Elasticsearch connector has its own concept of index routing. There is no unified configuration language. Knowledge does not transfer — expertise with the Salesforce connector gives an engineer nothing useful when diagnosing the MongoDB connector.
Licensing Has Turned Sprawl Into a Strategic Risk
Apache Kafka itself remains Apache 2.0. But many connectors have followed a different trajectory. The Confluent Community Licence, introduced in 2018, moved Schema Registry, KSQL, and a significant portion of the connector catalogue under a licence that prohibits offering those components as competing SaaS products. Premium Connectors extended this further — mission-critical integrations like Oracle CDC are gated behind enterprise subscriptions.
The four risks are concrete: licensing terms on currently-used connectors may change; cost escalation follows platform lock-in; migrating away from connector-specific investments is extremely high-friction; and innovation velocity may slow as contributing organisations’ priorities shift.
kafka-connect-ai is Apache 2.0 licensed. No Community Licence, no Premium tier, no enterprise restrictions.
The Core Insight: Protocols, Not Products
After more than thirty enterprise Kafka engagements, the OSO engineers arrived at a conclusion that runs against the grain of how the Connect ecosystem developed: the connector-per-system model is a categorical error. The right abstraction is the protocol.
What changes between Salesforce and HubSpot is not the protocol — both speak REST HTTP. What changes between PostgreSQL and MySQL is not the protocol — both speak JDBC. What changes is the configuration surface and the data shape. Build a clean implementation of each protocol once, and let AI handle the per-system transformation dynamically.
kafka-connect-ai ships with 12 protocol adapters today:
| Adapter | Covers | Key Capabilities |
| --- | --- | --- |
| HTTP REST | Every REST/SaaS API | 5 auth modes, 5 pagination strategies |
| JDBC | Every SQL database | 4 query modes, auto-DDL, upsert |
| Kafka | Cross-cluster replication | K2K bridge and migration |
| MongoDB | Document stores | Wire protocol, change streams |
| Warehouse | Snowflake, Redshift, BigQuery | Cloud analytics sinks |
| gRPC | Any gRPC service | Protobuf service definitions |
| Streaming | WebSocket, SSE, CometD | Real-time event sources |
| Redis | Streams, Pub/Sub, KV | Full Redis data model |
| Cassandra | Wide-column stores | CQL native protocol |
| Kinesis | AWS event streaming | AWS-native source/sink |
| LDAP | Directory services | Identity and org sync |
| SAP | Enterprise ERP | RFC calls (profile-gated) |
Each adapter covers an entire category of systems. The HTTP adapter covers every REST API. The JDBC adapter covers every SQL database. The gRPC adapter covers any gRPC service. Adding a new integration is a configuration change, not a development project.
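To make the "configuration change, not development project" claim concrete, here is a minimal sketch of what two SaaS integrations sharing one adapter could look like. The parameter names and class name below are invented for illustration — they are not the connector's actual configuration keys.

```python
# Hypothetical illustration: two SaaS systems, one HTTP adapter.
# Every key name here is an assumption made for the sketch, not
# the real kafka-connect-ai configuration surface.
salesforce = {
    "connector.class": "HttpSourceConnector",  # same adapter...
    "http.base.url": "https://example.my.salesforce.com/services/data",
    "http.auth.mode": "oauth2",
    "http.pagination.mode": "cursor",
}
hubspot = {
    "connector.class": "HttpSourceConnector",  # ...same adapter
    "http.base.url": "https://api.hubapi.com/crm/v3/objects/contacts",
    "http.auth.mode": "bearer",
    "http.pagination.mode": "offset",
}

# The shape of the integration is identical; only values differ.
assert salesforce.keys() == hubspot.keys()
assert salesforce["connector.class"] == hubspot["connector.class"]
```

The point of the sketch: swapping one REST system for another touches only configuration values, never code or a new JAR.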
Architecture: How It Actually Works
kafka-connect-ai is not an AI wrapper on top of existing connectors. It is a purpose-built transformation pipeline where the LLM generates code, not just data — and that code is cached and reused across all subsequent records with the same schema shape.
The Key Innovation: Compiled Transforms
This is the breakthrough that makes AI-powered data transformation economically viable at scale.
Traditional approaches call the LLM for every record. kafka-connect-ai takes a fundamentally different approach: the LLM generates transformation code, not transformed data. That code is compiled, validated, cached, and reused for every subsequent record with the same schema shape.
The two-tier execution strategy:
Tier 0 — Declarative mappings (~500ns/record): Pure Java, zero security risk. Handles field renames, JSONPath mappings, defaults, and type casts. No scripting engine involved.
Tier 1 — Sandboxed JavaScript (~2μs/record after JIT): GraalVM Polyglot with full sandboxing — no file system access, no network access, no host access. Configurable memory limit (10MB default) and execution timeout (100ms default). Handles complex transformations with conditional logic.
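A Tier 0 transform is worth seeing in miniature: renames, defaults, and casts applied as plain code, no scripting engine and no LLM. The mapping format below is invented for illustration and does not mirror the connector's actual declarative syntax.

```python
# Hypothetical declarative mapping: field renames, defaults, type casts.
MAPPING = {
    "renames":  {"usr_nm": "user_name"},
    "defaults": {"country": "GB"},
    "casts":    {"age": int},
}

def apply_mapping(record: dict, mapping: dict) -> dict:
    # Renames first, then fill defaults, then cast types.
    out = {mapping["renames"].get(k, k): v for k, v in record.items()}
    for field, default in mapping["defaults"].items():
        out.setdefault(field, default)
    for field, cast in mapping["casts"].items():
        if field in out:
            out[field] = cast(out[field])
    return out

assert apply_mapping({"usr_nm": "ada", "age": "36"}, MAPPING) == \
    {"user_name": "ada", "age": 36, "country": "GB"}
```

Because everything here is a dictionary lookup or a cast, this tier carries no sandboxing overhead — which is what makes the ~500ns/record figure plausible for simple shapes.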
4-Tier Model Routing
Not every record needs the same model. kafka-connect-ai automatically routes records to the cheapest capable tier:
| Tier | Model | When | Cost |
| --- | --- | --- | --- |
| T0 | Deterministic (no LLM) | Field renames, type casts, timestamp formatting | $0 |
| T1 | Claude Haiku | Simple flat records | ~$0.25/MTok |
| T2 | Claude Sonnet | Moderate complexity | ~$3/MTok |
| T3 | Claude Opus | Complex nested structures | ~$15/MTok |
Records cascade down tiers — the router picks the cheapest model that can handle the complexity.
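A toy router makes the cascade concrete: score a record's complexity and pick the cheapest tier that can handle it. The nesting-depth heuristic and thresholds below are invented for illustration — the connector's actual routing logic is not shown here.

```python
TIERS = ["T0 deterministic", "T1 haiku", "T2 sonnet", "T3 opus"]

def depth(value, d=0):
    # Maximum nesting depth of a record, as a crude complexity proxy.
    if isinstance(value, dict):
        return max((depth(v, d + 1) for v in value.values()), default=d)
    return d

def route(record: dict, needs_llm: bool) -> str:
    if not needs_llm:
        return TIERS[0]          # pure renames/casts: no LLM at all
    d = depth(record)
    if d <= 1:
        return TIERS[1]          # flat record: cheapest capable model
    if d <= 3:
        return TIERS[2]          # moderate nesting
    return TIERS[3]              # deeply nested: strongest model

assert route({"a": 1}, needs_llm=False) == "T0 deterministic"
assert route({"a": 1}, needs_llm=True) == "T1 haiku"
assert route({"a": {"b": {"c": 1}}}, needs_llm=True) == "T2 sonnet"
```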
Cost Optimisation Stack
Four layers work together to minimise LLM spend:
| Layer | Mechanism | Savings |
| --- | --- | --- |
| Compiled Transforms | LLM generates code once, cached by schema fingerprint | 100% after first record |
| Tier 0 Deterministic | Field renames, type casts — no LLM call at all | 100% |
| Tier 1 Fast Model | Simple records routed to Claude Haiku | ~90% vs default model |
| Semantic Cache | Redis vector similarity deduplicates LLM calls | 100% per cache hit |
| Prompt Caching | Anthropic caches repeated system prompts | ~50% input tokens |
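The semantic-cache layer can be sketched without Redis: if a new prompt embeds close enough to a previously seen one, reuse its answer instead of calling the LLM. The in-memory store, embedding vectors, and similarity threshold below are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.dist(a, [0] * len(a)) * math.dist(b, [0] * len(b)))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.entries = []            # list of (embedding, answer) pairs
        self.threshold = threshold

    def get(self, embedding):
        for cached, answer in self.entries:
            if cosine(cached, embedding) >= self.threshold:
                return answer        # near-duplicate: skip the LLM call
        return None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

cache = SemanticCache()
cache.put([1.0, 0.0, 0.1], "transform-A")
assert cache.get([0.99, 0.01, 0.1]) == "transform-A"   # near-duplicate hit
assert cache.get([0.0, 1.0, 0.0]) is None              # genuinely new prompt
```

A production implementation would replace the linear scan with a Redis vector index, but the economics are the same: every cache hit is an LLM call that never happens.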
Privacy: PII Masking
Sensitive fields are masked before data reaches the LLM and restored after transformation. Masking is configured by field name (case-insensitive) or by regex pattern. Because PII never crosses your infrastructure boundary to reach the LLM provider, this supports GDPR compliance.
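The mask-then-restore flow can be sketched as follows: sensitive values are swapped for opaque placeholders before the record reaches the LLM, then substituted back into the transformed output. The field list, pattern, and placeholder format are invented for illustration.

```python
import re

# Hypothetical masking rules: exact field names plus a regex pattern.
PII_FIELDS = {"email", "ssn"}
PII_PATTERN = re.compile(r"phone", re.IGNORECASE)

def mask(record: dict):
    masked, vault = {}, {}
    for key, value in record.items():
        if key.lower() in PII_FIELDS or PII_PATTERN.search(key):
            token = f"<PII:{len(vault)}>"
            vault[token] = value
            masked[key] = token      # the LLM only ever sees the token
        else:
            masked[key] = value
    return masked, vault

def restore(record: dict, vault: dict) -> dict:
    # Swap the original values back in after transformation.
    return {k: vault.get(v, v) if isinstance(v, str) else v
            for k, v in record.items()}

masked, vault = mask({"Email": "a@b.c", "name": "Ada", "Phone2": "555"})
assert masked["Email"] == "<PII:0>"
assert restore(masked, vault)["Email"] == "a@b.c"
```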
Production Observability: 16 JMX Metrics
Every stage of the pipeline is instrumented via Micrometer and exposed over JMX.

Migration Path

kafka-connect-ai is a Kafka Connect connector. It deploys inside your existing Connect cluster, alongside your existing connectors, and is installed identically to any other connector JAR. There is no forklift migration.
Phase 1: Start with HTTP-based SaaS Integrations
These are the highest-friction connectors to maintain — API versioning changes, authentication updates, and pagination modifications from vendors all require connector updates that depend on external maintainers. With kafka-connect-ai, the same change is a configuration update you can apply immediately.
Phase 2: Consolidate JDBC Connectors
Most organisations have accumulated multiple JDBC connectors across different database sources. Consolidating into one adapter eliminates classpath conflicts, creates a single configuration model, and surfaces configuration inconsistencies across independently-managed connectors.
Phase 3: Expand to Specialised Protocols
With 12 adapters covering MongoDB, gRPC, Redis, Cassandra, Kinesis, WebSocket/SSE streaming, LDAP, SAP, and cloud data warehouses (Snowflake, Redshift, BigQuery), you can progressively migrate specialised connectors without changing anything about your Kafka infrastructure.
Phase 4: Audit Licensing Exposure
Separate from the technical migration, identify which connectors in your cluster fall under the Confluent Community Licence, which are gated behind enterprise subscriptions, and which carry managed connector pricing. Quantify the cost. kafka-connect-ai is Apache 2.0 — run it on-premises, in the cloud, or offer it as a service, with zero licensing constraints.
What kafka-connect-ai Is Not For

Sub-millisecond latency — initial LLM calls add latency (100ms–5s). Compiled transforms bring this down to microseconds, but the first record for each new schema shape requires an LLM call.
Full CDC with WAL — For database replication with transaction ordering, Debezium is purpose-built. kafka-connect-ai uses polling queries.
Binary data — kafka-connect-ai works with JSON. For Avro, Protobuf, or binary payloads, use specialised connectors.
Conclusion
The connector-per-system model was a pragmatic solution to a 2015 problem. In 2025, it has become a source of operational overhead, licensing risk, and organisational fragility. Two hundred connectors, each encoding the same structural logic for a different product, each requiring specialist knowledge — this is not the right foundation for enterprise data integration at scale.
The abstraction was always wrong. Protocols matter; products do not. What was missing was a transformation engine sophisticated enough to handle per-system variability without a separate codebase for each system — and smart enough to learn from the data it sees so it only needs the LLM once per schema shape, not once per record.
That is what kafka-connect-ai provides. Twelve protocol adapters. LLM-compiled transforms that converge toward zero cost. Four-tier model routing. PII masking. Semantic caching. 16 production-grade JMX metrics. Apache 2.0 licensed.
The project is open source, the adapter interface is designed for community contribution, and the conversation is open.
Speak with one of our engineers to find out how a protocol-first approach can eliminate connector sprawl, reduce licensing risk, and future-proof your data portability strategy.