Spark: Read/Write UnknownType#13445
Conversation
nit: let's make sure this PR is applied to both Spark 3.5 and 4.0! context: https://lists.apache.org/thread/7x7xcm3y87y81c4grq4nn9gdjd4jm05f
Let's do this first in Spark 3.5, and I'll forward-port it to 4.0 in another PR. There will probably be some comments on the code, and then I have to keep everything in sync (something I'm not good at :).
stevenzwu
left a comment
only got time to look at a few classes. will spend more time next Monday.
This reverts commit c0172a1.
if (null == struct) {
  return IntStream.rangeClosed(0, numWriters).toArray();
}
rangeClosed is inclusive on both ends, so the resulting array will be of size numWriters + 1. Is that right?
i see this is used in a couple of other places as well
https://grep.app/search?f.repo=apache%2Ficeberg&q=IntStream.rangeClosed%280%2C+numWriters%29.toArray%28%29%3B
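Steven's observation about the inclusive upper bound can be checked with a quick standalone sketch (not Iceberg code, just plain java.util.stream):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class RangeCheck {
    public static void main(String[] args) {
        int numWriters = 3;
        // rangeClosed is inclusive on both ends: {0, 1, 2, 3}, size numWriters + 1
        int[] closed = IntStream.rangeClosed(0, numWriters).toArray();
        // range is exclusive on the upper bound: {0, 1, 2}, size numWriters
        int[] open = IntStream.range(0, numWriters).toArray();
        System.out.println(Arrays.toString(closed)); // [0, 1, 2, 3]
        System.out.println(Arrays.toString(open));   // [0, 1, 2]
    }
}
```

So for an identity writer-to-field mapping of size numWriters, `range(0, numWriters)` is the variant that matches the intended `[0, numWriters)` semantics.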
stevenzwu
left a comment
the new approach suggested by Ryan looks pretty good
/** Returns a mapping from writer index to field index, skipping Unknown columns. */
private static int[] writerToFieldIndex(List<DataType> types, int numWriters) {
  if (null == types) {
    return IntStream.rangeClosed(0, numWriters).toArray();
should it be range or rangeClosed? I thought the end should be exclusive, as [0, numWriters).
also if the types list/array is null, should we fail? is it a valid scenario?
I would imagine it can be empty. even that is a bit weird to me.
should it be range or rangeClosed? I thought the end should be exclusive as [0, numWriters).
I think it should be range, but I'd rather follow up on that in a separate PR since there are also similar occurrences in other places.
also if the types list/array is null, should we fail? is it a valid scenario?
I don't think it can be null, but I think we should fail on null == types. To illustrate: this PR addresses Spark, but the Flink writers don't pass in the struct, failing on UnknownTypes. I would also suggest handling this in a follow-up PR.
I would imagine it can be empty. even that is a bit weird to me.
I think that's the case of the empty struct, which is not allowed in Parquet
sounds good to follow up separately
created tracking issue for range vs rangeClosed
#13921
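To make the intended behavior concrete, here is a self-contained sketch of a writer-to-field index mapping that uses the exclusive-bound `range` and skips null-typed fields. This is illustrative only, not the Iceberg source: field types are modeled as plain strings here, whereas the real method takes a `List<DataType>` and checks for Spark's NullType.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

public class WriterIndexSketch {
    // Hypothetical stand-in for the real writerToFieldIndex: "null" marks a
    // field backed by NullType (i.e. an Iceberg UnknownType column) that has
    // no writer and must be skipped in the mapping.
    static int[] writerToFieldIndex(List<String> types, int numWriters) {
        if (types == null) {
            // identity mapping: writer i writes field i, using the
            // exclusive upper bound [0, numWriters)
            return IntStream.range(0, numWriters).toArray();
        }
        List<Integer> mapping = new ArrayList<>();
        for (int field = 0; field < types.size(); field++) {
            if (!"null".equals(types.get(field))) {
                mapping.add(field); // writer index -> field index
            }
        }
        return mapping.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // fields [int, null, string]: two writers, mapped to fields 0 and 2
        int[] m = writerToFieldIndex(List.of("int", "null", "string"), 2);
        System.out.println(Arrays.toString(m)); // [0, 2]
    }
}
```

With this shape, writer 1 correctly targets field index 2, stepping over the unknown column in the middle.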
  sField.name());
  results.add(visitField(sField, field, visitor));
for (StructField sField : struct.fields()) {
  String fieldName = AvroSchemaUtil.makeCompatibleName(sField.name());
@Fokko @rdblue will using field name be a problem if the field is renamed in Iceberg table? would the mapping be messed up during read or write?
During write, if both the Spark type and the Parquet type are converted from the Iceberg schema, this should be fine. But if we are reading a Parquet file with the old field name, will this break?
Got the answer from Fokko offline: "The ReadBuilder does not use the visitor, and follows a different path which uses projection by ID."
Yes, that's correct. The read path doesn't use this.
Map Iceberg's UnknownType to Spark's NullType in both directions (TypeToSparkType: UNKNOWN -> NullType; SparkTypeToType: NullType -> UnknownType). Filter NullType-backed fields from Parquet/ORC writers so UnknownType columns produce only nulls. Aligns Spark 3.x with the existing Spark 4.x behavior from apache#13445.

Tests added (mirrored from apache#13445 to v3.4 and v3.5):
- AvroDataTestBase: unk field in SUPPORTED_PRIMITIVES, plus testUnknownNestedLevel, testUnknownListType, testUnknownMapType
- TestSparkOrcReader: testUnknownListType, testUnknownMapType overrides
- TestSparkParquetReader: testUnknownListType, testUnknownMapType overrides
- TestSparkRecordOrcReaderWriter: testUnknownListType, testUnknownMapType overrides
- TestORCDataFrameWrite: testUnknownListType, testUnknownMapType overrides
- TestParquetDataFrameWrite: testUnknownListType, testUnknownMapType overrides
- TestParquetScan: testUnknownListType, testUnknownMapType overrides
- ScanTestBase.writeAndValidate: opt into format version 3 when the schema contains UnknownType
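The bidirectional mapping described above can be sketched as a minimal round-trip table. This is a toy model, not the Iceberg visitors: the real conversion lives in TypeToSparkType and SparkTypeToType, and the enum names here are placeholders for the actual type classes.

```java
import java.util.Map;

public class UnknownTypeMapping {
    // Placeholder enums standing in for Iceberg types and Spark DataTypes.
    enum IcebergType { UNKNOWN, INTEGER, STRING }
    enum SparkType { NULL, INTEGER, STRING }

    // Iceberg -> Spark direction (the TypeToSparkType side): UNKNOWN maps to NullType.
    static final Map<IcebergType, SparkType> TO_SPARK = Map.of(
        IcebergType.UNKNOWN, SparkType.NULL,
        IcebergType.INTEGER, SparkType.INTEGER,
        IcebergType.STRING, SparkType.STRING);

    // Spark -> Iceberg direction (the SparkTypeToType side): NullType maps back to UNKNOWN.
    static final Map<SparkType, IcebergType> TO_ICEBERG = Map.of(
        SparkType.NULL, IcebergType.UNKNOWN,
        SparkType.INTEGER, IcebergType.INTEGER,
        SparkType.STRING, IcebergType.STRING);

    public static void main(String[] args) {
        // round-trip: UNKNOWN -> NULL -> UNKNOWN
        System.out.println(TO_ICEBERG.get(TO_SPARK.get(IcebergType.UNKNOWN)));
    }
}
```

The round-trip property is what keeps schema conversion lossless: converting an Iceberg schema to Spark and back must not drop or change the unknown columns, even though their writers are filtered out.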
cc @ala
Related dev-thread: https://lists.apache.org/thread/gq9lyndb574ptq7vkz83zgkp1lx7vp5x