
[Feature][Doris] Support VARIANT source and sink mapping #10855

Draft
corgy-w wants to merge 1 commit into apache:dev from corgy-w:corgy/dev-doris-variant-source

corgy-w commented May 8, 2026

What changed

  • add Doris VARIANT to source type conversion and map it to SeaTunnel STRING
  • preserve VARIANT during Doris sink type reconvert so auto-created sink tables keep the column type
  • document VARIANT support in Doris source/sink docs and add unit tests for source/sink type mapping

Why

SeaTunnel could not read Doris VARIANT columns, and sink auto-DDL could not keep VARIANT when writing back to Doris.

Validation

  • git push origin corgy/dev-doris-variant-source
  • local module tests were not runnable in this environment because the 3.0.0-SNAPSHOT dependency chain for connector-doris was unavailable


DanielLeens left a comment

Thanks for the contribution. I reviewed the latest head 561526d5e4bc565b2b7d41665e8c0685ddd8e251 against upstream/dev and traced both the schema-mapping path and the real sink runtime path.

What this PR fixes

  • User pain: Doris VARIANT metadata currently loses its type identity when it goes through SeaTunnel, so source/sink schema round-trips cannot preserve VARIANT.
  • Fix approach: the PR adds VARIANT handling in the Doris type converters, preserves it in the Doris 2.x reconvert path, updates docs, and adds mapping-oriented tests.
  • One-line summary: the metadata / DDL part is in place, but the default JSON sink runtime path is still not fully wired for real VARIANT writes.

Full runtime chain

Source metadata path
  -> AbstractDorisTypeConverter.sampleTypeConverter()
      -> Doris VARIANT is exposed as SeaTunnel STRING

Sink schema / DDL path
  -> DorisCatalogUtil.columnToDorisType()
      -> DorisTypeConverterV2.reconvertString()
          -> sourceType=VARIANT is written back as Doris VARIANT

Actual sink write path
  -> DorisSinkWriter.write()
      -> SeaTunnelRowSerializerFactory.createSerializer()
      -> if doris.config.format=json
          -> SeaTunnelRowSerializer -> JsonSerializationSchema
              -> RowToJsonConverters STRING branch
                  -> textNode((String) value)

That last part is the blocker. On the documented default JSON stream-load path, a SeaTunnel STRING field whose content is {"a":1} is still serialized as a JSON string token, not as a structured JSON object node. In other words, the PR currently fixes the type label and DDL round-trip, but it does not yet prove that the sink main path can actually write structured VARIANT values as advertised.
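
To make the payload difference concrete, here is a minimal, dependency-free sketch of the two shapes. It does not use any SeaTunnel or Jackson classes; `PayloadShapeSketch`, `asStringToken`, `asRawJson`, and `row` are hypothetical names for illustration only.

```java
public class PayloadShapeSketch {

    // Renders a value as a JSON string token: wraps it in quotes and escapes
    // quotes, backslashes, and control characters.
    static String asStringToken(String value) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Embeds the value verbatim. Assumption: the caller already knows it is valid JSON text.
    static String asRawJson(String value) {
        return value;
    }

    // Builds a one-field JSON row from a field name and an already rendered value.
    static String row(String field, String renderedValue) {
        return "{" + asStringToken(field) + ":" + renderedValue + "}";
    }

    public static void main(String[] args) {
        String variant = "{\"a\":1}";
        // Shape produced by a plain STRING branch: prints {"v":"{\"a\":1}"}
        System.out.println(row("v", asStringToken(variant)));
        // Shape a structured VARIANT write would need: prints {"v":{"a":1}}
        System.out.println(row("v", asRawJson(variant)));
    }
}
```

The first line is what a quoted string payload looks like on the wire; the second is the object payload that the sink docs imply.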

Review findings

Issue 1: the default format=json sink path still serializes VARIANT candidate values as JSON strings, not structured JSON nodes

  • Location: seatunnel-connectors-v2/connector-doris/src/main/java/org/apache/seatunnel/connectors/doris/serialize/SeaTunnelRowSerializer.java:89
  • Why this matters:
    the PR updates the type converter and DDL generation, but the runtime payload still goes through JsonSerializationSchema, and the shared JSON formatter turns every SeaTunnel STRING into textNode((String) value).
    That means the normal JSON sink path still emits a quoted string payload instead of an object/array payload for VARIANT-style content.
  • Risk:
    users may get a Doris VARIANT column on the target side, but the documented sink capability is still incomplete on the real write path.
  • Recommendation:
    either add a dedicated VARIANT-aware JSON serialization branch for Doris sink fields, or narrow the PR scope and docs to schema/source support only for now.
  • Severity: high
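
One possible shape for the VARIANT-aware branch, as a rough sketch rather than a proposed patch. Everything here is hypothetical: the `SourceType` enum stands in for however the sink schema carries the original Doris type, and `looksLikeJsonContainer` is a crude placeholder for real parsing (a production version would more likely parse the text with the JSON library already used by the formatter).

```java
public class VariantAwareFieldSketch {

    // Hypothetical stand-in for the column's original Doris type.
    enum SourceType { STRING, VARIANT }

    // Minimal quoting for the fallback path; handles only backslashes and quotes.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Crude structural check on the first and last characters only.
    // It does not validate the JSON in between.
    static boolean looksLikeJsonContainer(String s) {
        String t = s.trim();
        return (t.startsWith("{") && t.endsWith("}"))
                || (t.startsWith("[") && t.endsWith("]"));
    }

    // VARIANT columns with object/array content are emitted as raw JSON;
    // everything else falls back to a quoted JSON string.
    static String render(SourceType type, String value) {
        if (value == null) {
            return "null";
        }
        if (type == SourceType.VARIANT && looksLikeJsonContainer(value)) {
            return value.trim();
        }
        return quote(value);
    }
}
```

The fallback to a quoted string keeps ordinary STRING columns unchanged, so only columns explicitly marked as VARIANT would see a different payload.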

Issue 2: the new tests validate converter behavior and generated column definitions, but they do not validate the actual stream-load payload

  • Location: seatunnel-connectors-v2/connector-doris/src/test/java/org/apache/seatunnel/connectors/doris/util/DorisCatalogUtilTest.java:75
  • Why this matters:
    the new tests show that VARIANT survives the mapping and that generated DDL can contain VARIANT, but they never reach DorisSinkWriter -> SeaTunnelRowSerializer -> JsonSerializationSchema.
    So today’s test suite still cannot prove that sink support really works end to end.
  • Risk:
    this kind of gap makes it easy to ship “schema support only” while the runtime path remains broken or partial.
  • Recommendation:
    please add at least one serializer-level test for the JSON sink path, and ideally a Doris sink IT/e2e that verifies the loaded VARIANT value can be queried as structured content.
  • Severity: medium
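
For the serializer-level test, the key assertion is on the shape of the field value in the serialized payload. A minimal sketch of that check, assuming a flat payload and using only hypothetical names (`PayloadAssertionSketch`, `Shape`, `shapeOf`); a real test would feed it the bytes produced by the Doris sink serializer and would probably use a JSON parser instead of string scanning.

```java
public class PayloadAssertionSketch {

    enum Shape { OBJECT, ARRAY, STRING, OTHER, MISSING }

    // Finds the first occurrence of "field": in a flat JSON payload and
    // classifies the first non-whitespace character of its value.
    static Shape shapeOf(String payload, String field) {
        String key = "\"" + field + "\":";
        int start = payload.indexOf(key);
        if (start < 0) {
            return Shape.MISSING;
        }
        int pos = start + key.length();
        while (pos < payload.length() && Character.isWhitespace(payload.charAt(pos))) {
            pos++;
        }
        if (pos >= payload.length()) {
            return Shape.MISSING;
        }
        switch (payload.charAt(pos)) {
            case '{': return Shape.OBJECT;
            case '[': return Shape.ARRAY;
            case '"': return Shape.STRING;
            default: return Shape.OTHER;
        }
    }

    public static void main(String[] args) {
        // Payload with the VARIANT candidate serialized as a quoted string.
        String quoted = "{\"id\":1,\"v\":\"{\\\"a\\\":1}\"}";
        // Payload with the VARIANT candidate serialized as a structured object.
        String structured = "{\"id\":1,\"v\":{\"a\":1}}";
        System.out.println(shapeOf(quoted, "v"));     // prints STRING
        System.out.println(shapeOf(structured, "v")); // prints OBJECT
    }
}
```

A test built on this idea would fail on the quoted payload and pass only once the field arrives as an object or array.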

Compatibility / side effects

  • Compatibility: additive and non-breaking from an API/config perspective.
  • Side effects: the main problem here is not performance; it is that the documented sink capability currently overstates what the runtime path actually supports.
  • Docs: both EN and ZH docs were updated, but at the moment the sink claim is ahead of the implementation.

Merge conclusion

Conclusion: can merge after fixes

  1. Blocking items
  • Issue 1 must be addressed first. The default JSON sink main path is not fully wired for real VARIANT writes yet.
  2. Non-blocking suggestions
  • Issue 2 should also be addressed so the runtime payload shape is covered by tests and does not regress later.

Overall, I think the schema / DDL direction is good, but this is still only part of the full solution. I’d recommend either finishing the runtime path in this PR or explicitly shrinking the scope so the code, tests, and docs all describe the same capability boundary.
