Spark: Read/Write UnknownType#13445

Merged
rdblue merged 75 commits into apache:main from Fokko:fd-unknown on Aug 29, 2025
Conversation

@Fokko (Contributor) commented Jul 2, 2025

@kevinjqliu (Contributor)

nit: let's make sure this PR is applied to both Spark 3.5 and 4.0!

context: https://lists.apache.org/thread/7x7xcm3y87y81c4grq4nn9gdjd4jm05f

@github-actions github-actions Bot added the API label Aug 8, 2025
@Fokko (Contributor, Author) commented Aug 8, 2025

nit: let's make sure this PR is applied to both Spark 3.5 and 4.0!

Let's do this first in Spark 3.5, and I'll forward-port it to 4.0 in another PR. There will probably be some comments on the code, and then I'd have to keep everything in sync (something I'm not good at :).

@Fokko Fokko marked this pull request as ready for review August 8, 2025 10:14
@stevenzwu (Contributor) left a comment

only got time to look at a few classes. will spend more time next Monday.

Comment thread api/src/main/java/org/apache/iceberg/types/Types.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/types/Types.java Outdated
Comment thread api/src/test/java/org/apache/iceberg/TestSchema.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/types/PruneUnknownTypes.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/types/Types.java Outdated
@github-actions github-actions Bot added parquet and removed API labels Aug 10, 2025
This reverts commit c0172a1.
@github-actions github-actions Bot added the API label Aug 11, 2025
@github-actions github-actions Bot removed the API label Aug 11, 2025
@github-actions github-actions Bot added the ORC label Aug 11, 2025
Comment on lines 635 to 637
if (null == struct) {
return IntStream.rangeClosed(0, numWriters).toArray();
}
Contributor

rangeClosed is inclusive on both ends, the resulting array will be of size numWriters + 1, is that right?
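The off-by-one concern raised here is easy to check in isolation. The following is a standalone sketch, not taken from the Iceberg code, that just demonstrates the two IntStream factories:

```java
import java.util.stream.IntStream;

public class RangeDemo {

    // rangeClosed(0, n) includes both endpoints, so it yields n + 1 elements.
    static int[] closed(int n) {
        return IntStream.rangeClosed(0, n).toArray();
    }

    // range(0, n) excludes the upper bound, i.e. [0, n), yielding n elements.
    static int[] open(int n) {
        return IntStream.range(0, n).toArray();
    }
}
```

With numWriters = 3, rangeClosed produces {0, 1, 2, 3} (length 4) while range produces {0, 1, 2} (length 3), which is exactly the size difference the reviewer is flagging.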

@stevenzwu (Contributor) left a comment

the new approach suggested by Ryan looks pretty good

/** Returns a mapping from writer index to field index, skipping Unknown columns. */
private static int[] writerToFieldIndex(List<DataType> types, int numWriters) {
if (null == types) {
return IntStream.rangeClosed(0, numWriters).toArray();
Contributor

should it be range or rangeClosed? I thought the end should be exclusive, as [0, numWriters).

Contributor

Also, if the types list/array is null, should we fail? Is it a valid scenario?

I would imagine it can be empty. Even that is a bit weird to me.

@Fokko (Contributor, Author) commented Aug 25, 2025

should it be range or rangeClosed? I thought the end should be exclusive, as [0, numWriters).

I think it should be range, but I'd rather follow up on that in a separate PR since there are also similar occurrences in other places.

Also, if the types list/array is null, should we fail? Is it a valid scenario?

I don't think it can be null, but I think we should fail on null == types. To illustrate: this PR addresses Spark, but the Flink writers don't pass in the struct, so they fail on UnknownTypes. I would also suggest handling this in a follow-up PR.

I would imagine it can be empty. Even that is a bit weird to me.

I think that's the case of the empty struct, which is not allowed in Parquet.
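For readers following along, the mapping being discussed can be sketched without any Spark dependencies. This is an illustrative stand-in, not the PR's actual code: the hypothetical isNullType list replaces Spark's List<DataType>, with true marking a NullType-backed (Iceberg UnknownType) field that gets no physical writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class WriterIndexSketch {

    // Returns a mapping from writer index to field index, skipping fields
    // flagged as NullType-backed (Iceberg UnknownType), which get no writer.
    static int[] writerToFieldIndex(List<Boolean> isNullType, int numWriters) {
        if (isNullType == null) {
            // Identity mapping; note range(0, n) is exclusive, i.e. [0, n).
            return IntStream.range(0, numWriters).toArray();
        }

        List<Integer> mapping = new ArrayList<>();
        for (int fieldIdx = 0; fieldIdx < isNullType.size(); fieldIdx++) {
            if (!isNullType.get(fieldIdx)) {
                mapping.add(fieldIdx);
            }
        }
        return mapping.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For a struct of three fields where the middle one is unknown, writer 0 maps to field 0 and writer 1 maps to field 2, so writers stay aligned with the fields that actually carry data.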

Contributor

sounds good to follow up separately

Contributor

Created a tracking issue for range vs rangeClosed: #13921

sField.name());
results.add(visitField(sField, field, visitor));
for (StructField sField : struct.fields()) {
String fieldName = AvroSchemaUtil.makeCompatibleName(sField.name());
@stevenzwu (Contributor) commented Aug 26, 2025

@Fokko @rdblue will using field name be a problem if the field is renamed in Iceberg table? would the mapping be messed up during read or write?

During write, if both Spark type and Parquet type are converted from Iceberg schema, this should be fine. But if we are reading a Parquet file with old field name, will this break?

Contributor

Got the answer from Fokko offline: "The ReadBuilder does not use the visitor, and follows a different path which uses projection by ID."

Contributor

Yes, that's correct. The read path doesn't use this.

@rdblue rdblue merged commit 0798210 into apache:main Aug 29, 2025
42 checks passed
Kurtiscwright pushed a commit to Kurtiscwright/iceberg that referenced this pull request Apr 23, 2026
Map Iceberg's UnknownType to Spark's NullType in both directions
(TypeToSparkType: UNKNOWN -> NullType; SparkTypeToType: NullType ->
UnknownType). Filter NullType-backed fields from Parquet/ORC writers
so UnknownType columns produce only nulls. Aligns Spark 3.x with the
existing Spark 4.x behavior from apache#13445.

Tests added (mirrored from apache#13445 to v3.4 and v3.5):
  - AvroDataTestBase: unk field in SUPPORTED_PRIMITIVES, plus
    testUnknownNestedLevel, testUnknownListType, testUnknownMapType
  - TestSparkOrcReader: testUnknownListType, testUnknownMapType overrides
  - TestSparkParquetReader: testUnknownListType, testUnknownMapType overrides
  - TestSparkRecordOrcReaderWriter: testUnknownListType, testUnknownMapType overrides
  - TestORCDataFrameWrite: testUnknownListType, testUnknownMapType overrides
  - TestParquetDataFrameWrite: testUnknownListType, testUnknownMapType overrides
  - TestParquetScan: testUnknownListType, testUnknownMapType overrides
  - ScanTestBase.writeAndValidate: opt into format version 3 when the
    schema contains UnknownType
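The two-way mapping this commit describes (UNKNOWN to NullType in TypeToSparkType, NullType back to UnknownType in SparkTypeToType) can be sketched with stand-in enums. The real logic lives in Iceberg's type-conversion visitors; the enums and method names below are illustrative only.

```java
public class UnknownMappingSketch {

    // Stand-ins for a few Iceberg and Spark SQL types; illustrative only.
    enum IcebergType { UNKNOWN, STRING, LONG }
    enum SparkType { NULL_TYPE, STRING_TYPE, LONG_TYPE }

    // Direction one: Iceberg schema to Spark schema (as in TypeToSparkType).
    static SparkType toSpark(IcebergType type) {
        switch (type) {
            case UNKNOWN: return SparkType.NULL_TYPE; // UNKNOWN -> NullType
            case STRING:  return SparkType.STRING_TYPE;
            default:      return SparkType.LONG_TYPE;
        }
    }

    // Direction two: Spark schema back to Iceberg (as in SparkTypeToType).
    static IcebergType toIceberg(SparkType type) {
        switch (type) {
            case NULL_TYPE:   return IcebergType.UNKNOWN; // NullType -> UnknownType
            case STRING_TYPE: return IcebergType.STRING;
            default:          return IcebergType.LONG;
        }
    }
}
```

Because UnknownType carries no values, round-tripping it through NullType is lossless: both sides agree that such a column produces only nulls, which is why the writers can simply skip it.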

6 participants