core: Adding read vector to range readable interface and adding mapper to parquet stream #13997
Conversation
private static List<ParquetObjectRange> convertRanges(List<ParquetFileRange> ranges) {
  return ranges.stream()
      .map(
this just maps between the internal parquet hadoop range and the new iceberg one.
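A minimal sketch of that mapping, using stand-in record types for the two range classes (the real parquet-hadoop and Iceberg types carry more state, such as the data-read future; names here are illustrative only):

```java
import java.util.List;
import java.util.stream.Collectors;

public class RangeMapping {
  // Stand-in for org.apache.parquet.io.ParquetFileRange (offset/length only)
  public record ParquetFileRange(long offset, int length) {}

  // Stand-in for the new Iceberg-side range type
  public record ParquetObjectRange(long offset, int length) {}

  // Maps each Parquet range to the equivalent Iceberg range, one-to-one
  public static List<ParquetObjectRange> convertRanges(List<ParquetFileRange> ranges) {
    return ranges.stream()
        .map(r -> new ParquetObjectRange(r.offset(), r.length()))
        .collect(Collectors.toList());
  }
}
```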
 * this class is written by @mukundthakur, and taken from
 * /hadoop-common/src/main/java/org/apache/hadoop/fs/VectoredReadUtils.java (thank you!).
 */
public final class VectoredReadUtils {
I don't feel like we need this class. There are three things this does, but it should probably be just one. The validateRangeRequest should just be handled in the constructor of the FileRange (we currently don't have any validation there). The sortRangeList is a subset of validateAndSortRanges, which seems duplicative.
I'd suggest moving validateAndSort to the RangeReadable interface as a static utility that can be used by implementors, and avoiding this util class entirely.
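A rough sketch of what that static utility on the interface could look like, assuming a simple FileRange carrying only offset and length (names and validation rules are illustrative, not the actual Iceberg code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public interface RangeReadableSketch {
  // Stand-in range type; the real FileRange also holds the data future
  record FileRange(long offset, int length) {}

  // Sorts by offset, then rejects invalid and overlapping ranges in one pass,
  // folding the separate validate/sort helpers into a single static utility
  static List<FileRange> validateAndSortRanges(List<FileRange> ranges) {
    List<FileRange> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong(FileRange::offset));
    long prevEnd = -1;
    for (FileRange range : sorted) {
      if (range.offset() < 0 || range.length() < 0) {
        throw new IllegalArgumentException("Invalid range: " + range);
      }
      if (range.offset() < prevEnd) {
        throw new IllegalArgumentException("Overlapping range: " + range);
      }
      prevEnd = range.offset() + range.length();
    }
    return sorted;
  }
}
```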
private final long offset;
private final int length;

public FileRange(CompletableFuture<ByteBuffer> byteBuffer, long offset, int length) {
Looking at the parquet implementation, I don't think you can pass the byteBuffer future in like this. I believe this is intended to be set by the implementation so that it can be returned to the invoker.
We're not passing the ByteBuffer in here; we're passing a future that completes with a ByteBuffer. We need a way to map the futures in Iceberg to the futures we are setting in Parquet.
So when we call parquetFileRange.setDataReadFuture(future); we need a way of tracking that future in Iceberg, and that's what this gives us.
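A minimal sketch of that hand-off, with a stand-in for the Parquet range type: Iceberg creates the future, registers it with Parquet via setDataReadFuture, and keeps its own reference so it can complete the future once the range's bytes arrive.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

public class FutureHandOff {
  // Stand-in for the relevant part of org.apache.parquet.io.ParquetFileRange
  public static class ParquetFileRange {
    private CompletableFuture<ByteBuffer> dataReadFuture;

    public void setDataReadFuture(CompletableFuture<ByteBuffer> future) {
      this.dataReadFuture = future;
    }

    public CompletableFuture<ByteBuffer> getDataReadFuture() {
      return dataReadFuture;
    }
  }

  // Iceberg side: create the future, register it with Parquet, keep a reference
  public static CompletableFuture<ByteBuffer> track(ParquetFileRange range) {
    CompletableFuture<ByteBuffer> future = new CompletableFuture<>();
    range.setDataReadFuture(future);
    return future; // Iceberg completes this when the read finishes
  }
}
```

Because both sides hold the same CompletableFuture instance, completing it on the Iceberg side is immediately visible to the Parquet reader waiting on it.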
}

@Override
public void readVectored(List<ParquetFileRange> ranges, ByteBufferAllocator allocate)
Can we add some tests at the ParquetIO level to validate this? I know we're adding some in S3FileIO, but it would be good to have this interface tested (even if it's with a mock implementation).
I added testRangeReadableAdapterReadVectored, which does something similar to the tests in S3FileIO but focuses a bit more on checking that the buffers/ranges are being used correctly. I skipped the other operations but can add them in if we want. Let me know.
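For illustration, a mock along these lines can serve ranges from an in-memory byte array and let a test assert that each range's future completes with a buffer of the requested length, filled from the right offset (hypothetical names, not the actual test code in the PR):

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

public class ReadVectoredMock {
  // Stand-in range with the future slot the adapter is expected to complete
  public record Range(long offset, int length, CompletableFuture<ByteBuffer> future) {}

  // Mock "stream": serves every range straight out of an in-memory byte array,
  // allocating each buffer through the supplied allocator
  public static void readVectored(byte[] data, List<Range> ranges, IntFunction<ByteBuffer> allocate) {
    for (Range range : ranges) {
      ByteBuffer buffer = allocate.apply(range.length());
      buffer.put(data, (int) range.offset(), range.length());
      buffer.flip();
      range.future().complete(buffer);
    }
  }
}
```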
optionsBuilder.withDecryption(fileDecryptionProperties);
}

optionsBuilder.withUseHadoopVectoredIo(true);
There were some efforts to allow Iceberg to work without Hadoop on the classpath.
I'm not sure how far these efforts went, and I'm also not sure how this change will affect that effort.
Could you please help me understand the consequences of always using withUseHadoopVectoredIo?
Thanks,
Peter
For part 1, the effort to reduce the dependencies on Hadoop: I don't think that was ever completed, though I do see a TODO comment about wanting to do it. I'm probably making that effort more complicated since I'm adding 2 new imports from Hadoop, but I don't think that's a big risk.
For part 2: withUseHadoopVectoredIo is used in the file reader in conjunction with readVectoredAvailable(), so always enabling it doesn't change anything unless the stream also supports readVectored.
https://github.com/apache/parquet-java/blob/f50dd6cb4b526cf4b585993c1b69a838cd8151f3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1303
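In other words, the option is necessary but not sufficient for the vectored path. A paraphrase of that gating (illustrative only, not the actual ParquetFileReader code):

```java
public class VectoredIoGate {
  // The vectored path is taken only when the option is enabled AND the stream
  // reports support; otherwise the reader falls back to regular range reads,
  // which is why setting the option unconditionally is safe
  public static String chooseReadPath(boolean useHadoopVectoredIo, boolean readVectoredAvailable) {
    if (useHadoopVectoredIo && readVectoredAvailable) {
      return "vectored";
    }
    return "sequential";
  }
}
```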
I think the naming of this option is a little misleading. withUseHadoopVectoredIo doesn't necessarily depend on Hadoop, as @stubz151 mentions, but rather enables the vectored IO behavior in Parquet.
pvary left a comment
It's ok from my side, but I would like to ask someone else to take a look as well.
Core: Adding read vector to range readable interface and adding mapper to parquet stream (apache#13997)
What am I doing
Adding a read vector implementation to the range readable interface. To do this I'm adding methods to check whether it's enabled and providing an interface which one can implement.
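Roughly, the interface change has this shape: a capability check plus an opt-in vectored entry point (method names and the range type here are illustrative, not the exact signatures in the PR):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public interface RangeReadableVectored {
  // Stand-in for the range type carrying the future completed with the bytes
  record FileRange(long offset, int length, CompletableFuture<ByteBuffer> future) {}

  // Implementations that can serve several ranges in one call override both:
  // the check lets callers fall back to sequential reads when unsupported
  default boolean readVectoredAvailable() {
    return false;
  }

  default void readVectored(List<FileRange> ranges) throws IOException {
    throw new UnsupportedOperationException("Vectored reads not supported");
  }
}
```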
#13254
Changes
Testing
Tested with the AAL implementation; it passes correctly with the flag
--conf "spark.sql.iceberg.read.vector.enabled=true" \
Notes
Kept the AAL changes separate to not bloat this PR, but can include them if we want to see a functional implementation of this.