Parquet: Add vectorized position reader#1356
Conversation
|
@rdblue , This is for vectorized parquet position reader and also include a follow up for skipping reading footer redundantly. Could you please take a look at your convenience? |
| public static VectorizedArrowReader positions() { | ||
| return PositionVectorReader.INSTANCE; | ||
| } |
There was a problem hiding this comment.
I am unsure if returning a singleton instance for PostitionVectorReader is safe, since it contains a member variable rowStart which seems to differ for every row group created. Can there be a possibility of multiple tasks running on the same executor JVM and wanting to refer to different PostionVectorReaders at the same time?
There was a problem hiding this comment.
Thanks for your comment.
The INSTANCE is defined to return a new PosistionVectorReader. It doesn't have a class scope field such as instance and the null checking logic, so it is not a singleton.
IIUC, spark will assign the number of spark.executor.cores of tasks per executor.
There was a problem hiding this comment.
The INSTANCE field is defined is static, so I am guessing new PositionVectorReader() will only be called once when the class is being loaded by the JVM. This seems like mimicking a singleton to me, unless I am reading the things wrongly.
There was a problem hiding this comment.
You are right! I forgot the static keyword, it is an eager mode of the singleton. Just remove using singleton in 52f4468.
shardulm94
left a comment
There was a problem hiding this comment.
Minor comment, but otherwise LGTM!
| } | ||
| } | ||
|
|
||
| public static class PositionVectorHolder extends VectorHolder { |
There was a problem hiding this comment.
Seems like technically this class is redundant since the user can use VectorHolder directly, but is probably good for readability?
There was a problem hiding this comment.
Yes, I added a private VectorHolder constructor for PositionVectionHolder which I don't want others to use it directly.
| Field arrowField = ArrowSchemaUtil.convert(MetadataColumns.ROW_POSITION); | ||
| FieldVector vec = arrowField.createVector(ArrowAllocation.rootAllocator()); | ||
| ((BigIntVector) vec).allocateNew(numValsToRead); |
There was a problem hiding this comment.
Can this follow an approach similar to VectorizedArrowReader and not create a new FieldVector and NullabilityHolder for every invocation?
|
@rdblue , Could you please help to take a look on this? |
|
Thanks @chenjunjiedada, this is definitely on my list to review. I'll take a look as soon as I can. |
holdenk
left a comment
There was a problem hiding this comment.
Still figuring my way around the code base, so my question might be silly and thanks for working on this :)
| ((BigIntVector) vec).allocateNew(numValsToRead); | ||
| for (int i = 0; i < numValsToRead; i += 1) { | ||
| vec.getDataBuffer().setLong(i * Long.BYTES, rowStart + i); | ||
| nulls = new NullabilityHolder(numValsToRead); |
There was a problem hiding this comment.
Why are we setting this inside of the for loop?
There was a problem hiding this comment.
Thanks @holdenk, This is a problem. Let me move this out of the loop.
949c250 to
a34a9df
Compare
|
@rdblue , Do we want to include this to 0.10.0? This includes an optimization that avoids reading extra footer in case of no position column. |
|
@shardulm94 , Does that ORC fix also valid in parquet side? |
Separate fixes should be added in different PRs. Thanks for fixing this, I'd like to get it in without blocking on vectorization.
I don't think so. We want to get 0.10.0 out and even if the vectorization changes made it in, we wouldn't be able to read delete files in the vectorized path. |
@chenjunjiedada Which fix you are referring to? |
|
@shardulm94 I meant #1706. |
|
@chenjunjiedada That bug was specific to ORC and does not exist in the Parquet reader. |
a34a9df to
ca79a67
Compare
|
Thanks for fixing this, @chenjunjiedada! And thanks to @shardulm94 and @holdenk for reviewing! |
This adds parquet position reader for the vectorized case.