[Feature][Doris] Support VARIANT source and sink mapping #10855
DanielLeens
left a comment
Thanks for the contribution. I reviewed the latest head 561526d5e4bc565b2b7d41665e8c0685ddd8e251 against upstream/dev and traced both the schema-mapping path and the real sink runtime path.
What this PR fixes
- User pain: Doris `VARIANT` metadata currently loses its type identity when it goes through SeaTunnel, so source/sink schema round-trips cannot preserve `VARIANT`.
- Fix approach: the PR adds `VARIANT` handling in the Doris type converters, preserves it in the Doris 2.x reconvert path, updates docs, and adds mapping-oriented tests.
- One-line summary: the metadata / DDL part is in place, but the default JSON sink runtime path is still not fully wired for real VARIANT writes.
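To make the intended round trip concrete, here is a minimal sketch of the mapping idea. It is not the connector's code: `VariantMappingSketch`, `Col`, `fromDoris`, and `toDoris` are hypothetical names, and only the `VARIANT` and `STRING` cases are modeled. The point is that the logical type becomes `STRING` while the original source type is carried along, so the reconvert step can restore `VARIANT`.

```java
/** Hypothetical illustration of the VARIANT round trip; not SeaTunnel code. */
final class VariantMappingSketch {

    /** Stand-in for a column: a logical type plus the original source type. */
    static final class Col {
        final String name;
        final String dataType;   // logical type seen by the engine
        final String sourceType; // type name as reported by the source database

        Col(String name, String dataType, String sourceType) {
            this.name = name;
            this.dataType = dataType;
            this.sourceType = sourceType;
        }
    }

    /** Source side: VARIANT is exposed as a logical STRING, but the source type is kept. */
    static Col fromDoris(String name, String dorisType) {
        if ("VARIANT".equalsIgnoreCase(dorisType)) {
            return new Col(name, "STRING", "VARIANT");
        }
        if ("STRING".equalsIgnoreCase(dorisType)) {
            return new Col(name, "STRING", "STRING");
        }
        throw new UnsupportedOperationException("not modeled in this sketch: " + dorisType);
    }

    /** Sink side: a logical STRING whose source type was VARIANT goes back to VARIANT. */
    static String toDoris(Col col) {
        if (!"STRING".equals(col.dataType)) {
            throw new UnsupportedOperationException("not modeled in this sketch: " + col.dataType);
        }
        return "VARIANT".equalsIgnoreCase(col.sourceType) ? "VARIANT" : "STRING";
    }
}
```

With this sketch, `toDoris(fromDoris("payload", "VARIANT"))` yields `VARIANT`, while a plain `STRING` column stays `STRING`.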
Full runtime chain
Source metadata path
-> AbstractDorisTypeConverter.sampleTypeConverter()
-> Doris VARIANT is exposed as SeaTunnel STRING
Sink schema / DDL path
-> DorisCatalogUtil.columnToDorisType()
-> DorisTypeConverterV2.reconvertString()
-> sourceType=VARIANT is written back as Doris VARIANT
Actual sink write path
-> DorisSinkWriter.write()
-> SeaTunnelRowSerializerFactory.createSerializer()
-> if doris.config.format=json
-> SeaTunnelRowSerializer -> JsonSerializationSchema
-> RowToJsonConverters STRING branch
-> textNode((String) value)
That last part is the blocker. On the documented default JSON stream-load path, a SeaTunnel STRING field whose content is {"a":1} is still serialized as a JSON string token, not as a structured JSON object node. In other words, the PR currently fixes the type label and DDL round-trip, but it does not yet prove that the sink main path can actually write structured VARIANT values as advertised.
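The difference between the two payload shapes can be shown without any SeaTunnel or Jackson classes. The sketch below is a plain-Java illustration with hypothetical names (`TextNodeShapeDemo`, `fieldAsText`, `fieldAsStructured`); it only mimics how a text node is rendered and does not handle control characters.

```java
/** Hypothetical illustration of string-token vs. structured payloads; not SeaTunnel code. */
final class TextNodeShapeDemo {

    /** Renders a value as a JSON string token: quoted, with backslashes and quotes escaped. */
    static String asJsonStringToken(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** One-field document whose value is emitted as a string token (the STRING branch behavior). */
    static String fieldAsText(String name, String value) {
        return "{" + asJsonStringToken(name) + ":" + asJsonStringToken(value) + "}";
    }

    /** One-field document whose value is embedded as-is; the caller must pass valid JSON. */
    static String fieldAsStructured(String name, String rawJson) {
        return "{" + asJsonStringToken(name) + ":" + rawJson + "}";
    }
}
```

For the value `{"a":1}`, `fieldAsText("payload", ...)` produces `{"payload":"{\"a\":1}"}`, whereas `fieldAsStructured("payload", ...)` produces `{"payload":{"a":1}}`. The first form is what the chain above describes for the current JSON path.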
Review findings
Issue 1: the default format=json sink path still serializes VARIANT candidate values as JSON strings, not structured JSON nodes
- Location: `seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/serialize/SeaTunnelRowSerializer.java:89`
- Why this matters: the PR updates the type converter and DDL generation, but the runtime payload still goes through `JsonSerializationSchema`, and the shared JSON formatter turns every SeaTunnel `STRING` into `textNode((String) value)`. That means the normal JSON sink path still emits a quoted string payload instead of an object/array payload for VARIANT-style content.
- Risk: users may get a Doris `VARIANT` column on the target side, but the documented sink capability is still incomplete on the real write path.
- Recommendation: either add a dedicated VARIANT-aware JSON serialization branch for Doris sink fields, or narrow the PR scope and docs to schema/source support only for now.
- Severity: high
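As a rough sketch of what a VARIANT-aware branch could look like, the plain-Java example below embeds the raw value for columns flagged as VARIANT and falls back to a quoted string otherwise. All names here (`VariantAwareRowSketch`, `serialize`, `looksStructured`) are hypothetical, and the structure check is only a prefix/suffix heuristic; a real implementation would presumably parse the value with the JSON library already in use and fall back to a text node on parse failure.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a VARIANT-aware row serializer; not SeaTunnel code. */
final class VariantAwareRowSketch {

    /** Quotes a string as a JSON string token (control characters are not handled here). */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Heuristic only: treats values wrapped in {} or [] as structured JSON. */
    static boolean looksStructured(String value) {
        String t = value.trim();
        return (t.startsWith("{") && t.endsWith("}"))
                || (t.startsWith("[") && t.endsWith("]"));
    }

    /** Serializes a row; columns listed in variantColumns are embedded as raw JSON when structured. */
    static String serialize(Map<String, String> row, Set<String> variantColumns) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : row.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(quote(e.getKey())).append(':');
            String v = e.getValue();
            if (v == null) {
                sb.append("null");
            } else if (variantColumns.contains(e.getKey()) && looksStructured(v)) {
                sb.append(v.trim()); // VARIANT branch: emit an object/array, not a string token
            } else {
                sb.append(quote(v)); // default STRING branch: emit a quoted token
            }
        }
        return sb.append('}').toString();
    }
}
```

Keying the branch on the column rather than on the value means ordinary `STRING` columns that happen to contain JSON text keep their current quoted form.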
Issue 2: the new tests validate converter behavior and generated column definitions, but they do not validate the actual stream-load payload
- Location: `seatunnel-connectors-v2/connector-doris/src/test/java/org/apache/seatunnel/connectors/doris/util/DorisCatalogUtilTest.java:75`
- Why this matters: the new tests show that `VARIANT` survives the mapping and that generated DDL can contain `VARIANT`, but they never reach `DorisSinkWriter -> SeaTunnelRowSerializer -> JsonSerializationSchema`. So today’s test suite still cannot prove that sink support really works end to end.
- Risk: this kind of gap makes it easy to ship “schema support only” while the runtime path remains broken or partial.
- Recommendation: please add at least one serializer-level test for the JSON sink path, and ideally a Doris sink IT/e2e that verifies the loaded VARIANT value can be queried as structured content.
- Severity: medium
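The core of such a serializer-level test is an assertion on the payload shape. The helper below is a hypothetical, dependency-free sketch of that check (`PayloadShapeCheck` is not an existing class); it uses a naive key search, so a real test would more likely parse the serialized bytes and assert on the node type.

```java
/** Hypothetical helper for asserting payload shape in a serializer test; not SeaTunnel code. */
final class PayloadShapeCheck {

    /** Returns the first non-whitespace character of the field's value, or 0 if not found. */
    static char firstValueChar(String json, String field) {
        String key = "\"" + field + "\"";
        int k = json.indexOf(key); // naive: assumes the key text does not occur inside a value
        if (k < 0) {
            return 0;
        }
        int i = json.indexOf(':', k + key.length());
        if (i < 0) {
            return 0;
        }
        i++;
        while (i < json.length() && Character.isWhitespace(json.charAt(i))) {
            i++;
        }
        return i < json.length() ? json.charAt(i) : 0;
    }

    /** True when the field's value starts as a JSON object or array rather than a string token. */
    static boolean isStructured(String json, String field) {
        char c = firstValueChar(json, field);
        return c == '{' || c == '[';
    }
}
```

A test would feed a row containing `{"a":1}` through the sink serializer and assert `isStructured(payload, "payload")`; per Issue 1, that assertion is expected to fail on the current JSON path.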
Compatibility / side effects
- Compatibility: additive and non-breaking from an API/config perspective.
- Side effects: the main problem here is not performance; it is that the documented sink capability currently overstates what the runtime path actually supports.
- Docs: both EN and ZH docs were updated, but at the moment the sink claim is ahead of the implementation.
Merge conclusion
Conclusion: can merge after fixes
- Blocking items
  - Issue 1 must be addressed first. The default JSON sink main path is not fully wired for real VARIANT writes yet.
- Non-blocking suggestions
  - Issue 2 should also be addressed so the runtime payload shape is covered by tests and does not regress later.
Overall, I think the schema / DDL direction is good, but this is still only part of the full solution. I’d recommend either finishing the runtime path in this PR or explicitly shrinking the scope so the code, tests, and docs all describe the same capability boundary.
What changed
- Add `VARIANT` to source type conversion and map it to SeaTunnel `STRING`
- Preserve `VARIANT` during Doris sink type reconvert so auto-created sink tables keep the column type
- Document `VARIANT` support in Doris source/sink docs and add unit tests for source/sink type mapping
Why
SeaTunnel could not read Doris `VARIANT` columns, and sink auto-DDL could not keep `VARIANT` when writing back to Doris.
Validation
- `git push origin corgy/dev-doris-variant-source`
- `3.0.0-SNAPSHOT` dependency chain for `connector-doris` was unavailable