Parquet: Add vectorized position reader by chenjunjiedada · Pull Request #1356 · apache/iceberg

chenjunjiedada · 2020-08-18T14:52:42Z

This adds a Parquet position reader for the vectorized case.

chenjunjiedada · 2020-09-01T01:54:29Z

@rdblue, this adds the vectorized Parquet position reader and also includes a follow-up that avoids reading the footer redundantly. Could you please take a look at your convenience?

shardulm94 · 2020-09-11T08:26:04Z

+  public static VectorizedArrowReader positions() {
+    return PositionVectorReader.INSTANCE;
+  }


I am unsure whether returning a singleton instance for PositionVectorReader is safe, since it contains a member variable rowStart that seems to differ for every row group created. Is it possible for multiple tasks running on the same executor JVM to need different PositionVectorReaders at the same time?

chenjunjiedada · 2020-09-11

Thanks for your comment.

INSTANCE is defined to return a new PositionVectorReader. The class doesn't have a class-scoped instance field or the usual null-checking logic, so it is not a singleton.

IIUC, Spark assigns up to spark.executor.cores tasks per executor.

shardulm94 · 2020-09-13

The INSTANCE field is defined as static, so I am guessing new PositionVectorReader() will only be called once, when the class is loaded by the JVM. This seems to mimic a singleton to me, unless I am reading things wrongly.

chenjunjiedada · 2020-09-13

You are right! I overlooked the static keyword; this is an eagerly initialized singleton. I removed the singleton in 52f4468.
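The hazard raised in this thread can be sketched with a stand-in class. This is a minimal illustration, not the code from this PR: `PositionReaderSketch`, `shared()`, and `create()` are hypothetical names, and only the `rowStart` field mirrors the member mentioned in the review. A `static final` field is initialized once at class load, so every caller of `shared()` sees the same mutable `rowStart`, whereas `create()` gives each caller its own state.

```java
// Hypothetical stand-in for a reader that tracks per-row-group state.
final class PositionReaderSketch {
  // Initialized exactly once when the class is loaded: an eager singleton.
  private static final PositionReaderSketch INSTANCE = new PositionReaderSketch();

  // Mutable state that differs for every row group.
  private long rowStart;

  private PositionReaderSketch() {
  }

  // Every caller receives the same object, so rowStart is shared.
  static PositionReaderSketch shared() {
    return INSTANCE;
  }

  // Every caller receives a fresh object with its own rowStart.
  static PositionReaderSketch create() {
    return new PositionReaderSketch();
  }

  void setRowStart(long newRowStart) {
    this.rowStart = newRowStart;
  }

  long rowStart() {
    return rowStart;
  }
}

class SingletonHazardDemo {
  public static void main(String[] args) {
    PositionReaderSketch taskA = PositionReaderSketch.shared();
    PositionReaderSketch taskB = PositionReaderSketch.shared();
    taskA.setRowStart(0L);
    taskB.setRowStart(1000L);
    // taskA now observes taskB's row group offset: prints 1000.
    System.out.println(taskA.rowStart());

    PositionReaderSketch freshA = PositionReaderSketch.create();
    PositionReaderSketch freshB = PositionReaderSketch.create();
    freshA.setRowStart(0L);
    freshB.setRowStart(1000L);
    // Each reader keeps its own offset: prints 0.
    System.out.println(freshA.rowStart());
  }
}
```

The sketch is single-threaded on purpose: the shared instance is already wrong when two readers are used one after another, before any concurrency is involved.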

shardulm94

Minor comment, but otherwise LGTM!

shardulm94 · 2020-09-15T07:11:27Z

    }
  }

+  public static class PositionVectorHolder extends VectorHolder {


Seems like technically this class is redundant since the user can use VectorHolder directly, but is probably good for readability?

chenjunjiedada · 2020-09-16

Yes, I added a private VectorHolder constructor for PositionVectorHolder, which I don't want others to use directly.
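The access pattern described in this reply can be sketched as follows. The names `HolderSketch` and `PositionHolderSketch` and the `kind` field are hypothetical; the real VectorHolder constructor signatures are not shown in this thread. The point is only that a nested class may call a private constructor of its enclosing class, while code outside that class cannot.

```java
// Hypothetical, simplified holder; the real VectorHolder carries Arrow state.
class HolderSketch {
  private final String kind;

  // Public path for ordinary holders.
  public HolderSketch(String kind) {
    this.kind = kind;
  }

  // Private: only code inside HolderSketch's own body can call this.
  private HolderSketch() {
    this.kind = "position";
  }

  String kind() {
    return kind;
  }

  // A nested class shares the enclosing class's private scope,
  // so it may call the private constructor through super().
  static class PositionHolderSketch extends HolderSketch {
    PositionHolderSketch() {
      super();
    }
  }
}
```

Callers outside `HolderSketch` can still write `new HolderSketch.PositionHolderSketch()`, but `new HolderSketch()` does not compile for them, which keeps the no-argument path reserved for the position holder.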

shardulm94 · 2020-09-15T07:38:28Z

+      Field arrowField = ArrowSchemaUtil.convert(MetadataColumns.ROW_POSITION);
+      FieldVector vec = arrowField.createVector(ArrowAllocation.rootAllocator());
+      ((BigIntVector) vec).allocateNew(numValsToRead);


Can this follow an approach similar to VectorizedArrowReader and not create a new FieldVector and NullabilityHolder for every invocation?

iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java, lines 123 to 125 in 52f4468:

    } else {
      vec.setValueCount(0);
      nullabilityHolder.reset();

chenjunjiedada · 2020-09-16

Makes sense to me, updated in 73de369.
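The reuse pattern requested in this thread can be sketched without Arrow. This is an illustration under stated assumptions, not the code from 73de369: `ReusingPositionReader` is a hypothetical class, a plain `long[]` stands in for the FieldVector, and the `allocations` counter exists only to make the reuse observable. The `else` branch plays the role of the quoted `vec.setValueCount(0)` and `nullabilityHolder.reset()` calls.

```java
// Hypothetical reader that reuses one buffer across read() calls.
// A plain long[] stands in for the Arrow FieldVector.
final class ReusingPositionReader {
  private long[] positions;  // allocated lazily, then reused
  private int valueCount;    // number of valid entries in positions
  private int allocations;   // counts allocations, for illustration only

  long[] read(long rowStart, int numValsToRead) {
    if (positions == null || positions.length < numValsToRead) {
      // First call, or the batch outgrew the buffer: allocate.
      positions = new long[numValsToRead];
      allocations += 1;
    } else {
      // Later calls: reset the logical size and overwrite in place.
      valueCount = 0;
    }
    for (int i = 0; i < numValsToRead; i += 1) {
      positions[i] = rowStart + i;
    }
    valueCount = numValsToRead;
    return positions;
  }

  int valueCount() {
    return valueCount;
  }

  int allocations() {
    return allocations;
  }
}
```

The trade-off of this design is that the returned buffer is overwritten by the next `read` call, so a caller must finish with one batch before requesting the next.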

chenjunjiedada · 2020-09-18T01:26:38Z

@rdblue, could you please take a look at this?

rdblue · 2020-09-19T01:29:25Z

Thanks @chenjunjiedada, this is definitely on my list to review. I'll take a look as soon as I can.

holdenk

Still figuring my way around the code base, so my question might be silly and thanks for working on this :)

holdenk · 2020-09-29T18:17:35Z

+        ((BigIntVector) vec).allocateNew(numValsToRead);
+        for (int i = 0; i < numValsToRead; i += 1) {
+          vec.getDataBuffer().setLong(i * Long.BYTES, rowStart + i);
+          nulls = new NullabilityHolder(numValsToRead);


Why are we setting this inside of the for loop?

chenjunjiedada · 2020-09-30

Thanks @holdenk, this is a problem. Let me move it out of the loop.
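A sketch of the corrected loop shape, assuming a `java.nio.ByteBuffer` in place of the Arrow data buffer and a `boolean[]` in place of NullabilityHolder. `PositionFill` and `fill` are hypothetical names. The allocation does not depend on `i`, so it is hoisted above the loop; inside the loop, value `i` is written at byte offset `i * Long.BYTES`, as in the reviewed snippet.

```java
import java.nio.ByteBuffer;

// Hypothetical helper; ByteBuffer stands in for the Arrow data buffer.
final class PositionFill {
  private PositionFill() {
  }

  // Writes rowStart, rowStart + 1, ... into buf at 8-byte strides and
  // returns the null markers (all false, since every row has a position).
  static boolean[] fill(ByteBuffer buf, long rowStart, int numValsToRead) {
    // Loop-invariant allocation: done once, before the loop.
    boolean[] isNull = new boolean[numValsToRead];
    for (int i = 0; i < numValsToRead; i += 1) {
      // Absolute put: value i lives at byte offset i * Long.BYTES.
      buf.putLong(i * Long.BYTES, rowStart + i);
    }
    return isNull;
  }
}
```

With the allocation inside the loop, each batch would create `numValsToRead` holders and keep only the last one; the result is the same, but the extra allocations are wasted work.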

chenjunjiedada · 2020-10-12T15:41:02Z

@rdblue, do we want to include this in 0.10.0? It includes an optimization that avoids reading the extra footer when there is no position column.

chenjunjiedada · 2020-11-03T02:29:21Z

@shardulm94, is that ORC fix also valid on the Parquet side?

rdblue · 2020-11-03T18:19:57Z

> This includes an optimization that avoids reading extra footer in case of no position column.

Separate fixes should be added in different PRs. Thanks for fixing this, I'd like to get it in without blocking on vectorization.

> Do we want to include this to 0.10.0?

I don't think so. We want to get 0.10.0 out and even if the vectorization changes made it in, we wouldn't be able to read delete files in the vectorized path.

shardulm94 · 2020-11-04T19:16:45Z

> @shardulm94 , Does that ORC fix also valid in parquet side?

@chenjunjiedada Which fix are you referring to?

chenjunjiedada · 2020-11-05T03:05:32Z

@shardulm94 I meant #1706.

shardulm94 · 2020-11-05T09:54:45Z

@chenjunjiedada That bug was specific to ORC and does not exist in the Parquet reader.

rdblue · 2020-12-18T18:39:10Z

Thanks for fixing this, @chenjunjiedada! And thanks to @shardulm94 and @holdenk for reviewing!

probot-autolabeler Bot added spark parquet arrow labels Aug 18, 2020

shardulm94 reviewed Sep 11, 2020

shardulm94 reviewed Sep 15, 2020

chenjunjiedada mentioned this pull request Sep 19, 2020

Parquet: Add row position reader #1254

Merged

holdenk reviewed Sep 29, 2020

rdblue mentioned this pull request Oct 1, 2020

Add names to parameterized tests and simplify the parameters list #1539

Merged

chenjunjiedada force-pushed the add-position-for-parquet-vectorized-reader branch from 949c250 to a34a9df Compare October 12, 2020 15:36

chenjunjiedada added 5 commits November 10, 2020 09:48

Parquet: Add vectorized position reader

aaeed07

fix checkstyle

78257b0

Don't use singleton for position reader

de70c96

avoid unnecessary allocation

96a351f

fix allocation inside loop

ca79a67

chenjunjiedada force-pushed the add-position-for-parquet-vectorized-reader branch from a34a9df to ca79a67 Compare November 10, 2020 07:43

chenjunjiedada mentioned this pull request Dec 18, 2020

Spark: Sort retained rows in DELETE FROM by file and position #1955

Merged

rdblue approved these changes Dec 18, 2020

rdblue merged commit 6379050 into apache:master Dec 18, 2020

chenjunjiedada deleted the add-position-for-parquet-vectorized-reader branch January 4, 2021 12:56

parthchandra pushed a commit to parthchandra/iceberg that referenced this pull request Oct 22, 2025

Add Iceberg version to UserAgent in S3 requests (apache#9963) (apache…

ae7634b

…#1356)
