Spark 3.3: Automatically set Arrow properties for read performance by aokolnychyi · Pull Request #6550 · apache/iceberg

aokolnychyi · 2023-01-09T22:03:58Z

This PR adds logic to automatically set Arrow properties for read performance. Unless these properties are set, our read path can be up to 2x slower than the built-in read path in Spark.

I verified this patch by removing all explicit settings and running benchmarks with and without setting properties.

Benchmark results without setting Arrow properties:

Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  1.321 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.064 ± 0.162   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  2.187 ± 0.031   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.304 ± 0.287   s/op

Benchmark results with setting Arrow properties automatically:

Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  0.927 ± 0.028   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.035 ± 0.070   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  1.306 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.369 ± 0.114   s/op
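
For reference, the Iceberg-to-Spark ratios implied by the two tables can be computed directly from the reported scores. The sketch below is only arithmetic on those numbers; the class and method names are made up for illustration, and the error columns are ignored.

```java
public class BenchmarkRatios {
  // Ratio of Iceberg time to Spark time; values above 1.0 mean Iceberg was slower.
  static double ratio(double icebergSecondsPerOp, double sparkSecondsPerOp) {
    return icebergSecondsPerOp / sparkSecondsPerOp;
  }

  public static void main(String[] args) {
    // Scores copied from the benchmark tables (s/op, lower is better).
    System.out.println("longs,   properties unset: " + ratio(1.321, 1.064));
    System.out.println("strings, properties unset: " + ratio(2.187, 1.304));
    System.out.println("longs,   properties set:   " + ratio(0.927, 1.035));
    System.out.println("strings, properties set:   " + ratio(1.306, 1.369));
  }
}
```

Without the properties, the Iceberg reader comes out roughly 1.24x (longs) and 1.68x (strings) slower than Spark's. With them, the ratios drop to roughly 0.90 and 0.95, although the strings difference is smaller than the reported error of the Spark run.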

aokolnychyi · 2023-01-09T22:07:09Z

                deleteFilter));
  }

+  public static ColumnarBatchReader buildReader(


This method is added to make sure NullCheckingForGet.NULL_CHECKING_ENABLED is referenced only after we set the system properties in this class. Otherwise, Arrow would read and cache the property values earlier and our logic would have no effect. See BoundsChecking and NullCheckingForGet in Arrow for details.
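
The ordering concern can be illustrated with a small sketch. It is not the code from this PR: `FlagA`, `FlagB`, `FactoryA`, `FactoryB`, and the `sketch.*` property names are hypothetical stand-ins. It assumes only that a flag class reads a system property once, in its static initializer, and that a factory class sets the property in its own static initializer.

```java
public class InitOrderSketch {
  // Hypothetical flag classes: each reads its property once, when the class is initialized.
  static final class FlagA {
    static final boolean NULL_CHECKING_ENABLED =
        !"false".equals(System.getProperty("sketch.a.null_check"));
  }

  static final class FlagB {
    static final boolean NULL_CHECKING_ENABLED =
        !"false".equals(System.getProperty("sketch.b.null_check"));
  }

  // The caller must pass the flag, so the flag class is initialized
  // before this class has had a chance to set the property.
  static final class FactoryA {
    static {
      System.setProperty("sketch.a.null_check", "false");
    }

    static boolean build(boolean nullChecking) {
      return nullChecking;
    }
  }

  // The flag is referenced inside the factory, after its static initializer has run.
  static final class FactoryB {
    static {
      System.setProperty("sketch.b.null_check", "false");
    }

    static boolean build() {
      return FlagB.NULL_CHECKING_ENABLED;
    }
  }

  static boolean callerReadsFlag() {
    // The argument is evaluated first, which initializes FlagA before FactoryA.
    return FactoryA.build(FlagA.NULL_CHECKING_ENABLED);
  }

  static boolean factoryReadsFlag() {
    return FactoryB.build();
  }

  public static void main(String[] args) {
    System.out.println("caller reads flag, checks enabled:  " + callerReadsFlag());
    System.out.println("factory reads flag, checks enabled: " + factoryReadsFlag());
  }
}
```

In the first variant the property is set too late and the flag keeps its default; in the second, the flag picks up the value the factory set. This assumes neither `sketch.*` property is passed on the command line.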

We never had specific Iceberg configs for this behavior. We always relied on system properties for Arrow.

LGTM.

In our internal Spark fork, we have the Arrow config settings configured in the executor JVM args. I can confirm that things have been running fine with them in our prod env for the last 2+ years. So, +1.

That's nice to hear, @samarthjain! All of our tests also set these properties so it seems fairly safe.

aokolnychyi · 2023-01-09T22:08:48Z

cc @samarthjain @rdblue @flyrain @RussellSpitzer @szehon-ho

RussellSpitzer

My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.

That said, I think the fact that we leave escape valves is good and we just should be very clear in the documentation that we are changing behavior for all of Spark.
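
The escape valve amounts to applying a default only when the user has not configured the property. A minimal sketch follows, not the code in this PR: the helper `setIfAbsent` is hypothetical, and the property names `arrow.enable_unsafe_memory_access` and `arrow.enable_null_check_for_get` are assumed to be the ones that Arrow's BoundsChecking and NullCheckingForGet read, which should be verified against the Arrow version in use.

```java
public class ArrowDefaultsSketch {
  // Applies defaultValue only if the property is not already set.
  // Returns true if the default was applied, false if an existing value was kept.
  static boolean setIfAbsent(String key, String defaultValue) {
    if (System.getProperty(key) == null) {
      System.setProperty(key, defaultValue);
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Assumed Arrow property names; an explicit -D setting on the JVM wins over these defaults.
    boolean unsafeApplied = setIfAbsent("arrow.enable_unsafe_memory_access", "true");
    boolean nullCheckApplied = setIfAbsent("arrow.enable_null_check_for_get", "false");
    System.out.println("applied unsafe memory access default: " + unsafeApplied);
    System.out.println("applied null check default: " + nullCheckApplied);
  }
}
```

Under this scheme, a user who wants the checks back would set the properties explicitly before the JVM starts, for example through the executor JVM options, and the defaults would then be skipped.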

aokolnychyi · 2023-01-09T23:32:34Z

> My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.

I had the same question, but the problem is that Arrow reads the properties only once and then caches the values. Even if we change the values back, it will have no effect. Let me look into Spark a bit more to see if anything can go wrong.
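
A small sketch of why restoring the property afterwards does not help. `CachedFlag` and the `sketch.revert.null_check` property are hypothetical stand-ins for a class that reads a system property once at class initialization; this is not Arrow's actual code.

```java
public class RevertSketch {
  static final String KEY = "sketch.revert.null_check";

  // Hypothetical flag class: the property is read once and the result is kept in a static final.
  static final class CachedFlag {
    static final boolean ENABLED = !"false".equals(System.getProperty(KEY));
  }

  // Sets the property, reads the flag, restores the property, and reads the flag again.
  static boolean[] setReadRevertRead() {
    System.setProperty(KEY, "false");
    boolean first = CachedFlag.ENABLED; // first reference initializes CachedFlag
    System.clearProperty(KEY); // "change the property back"
    boolean second = CachedFlag.ENABLED; // still the cached value
    return new boolean[] {first, second};
  }

  public static void main(String[] args) {
    boolean[] reads = setReadRevertRead();
    System.out.println("before revert: " + reads[0] + ", after revert: " + reads[1]);
  }
}
```

Both reads return the same value because the second one never consults the system property again.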

singhpk234 · 2023-01-10T00:03:40Z

+1 for the changes.
More tickets related to the same issue:

flyrain · 2023-01-10T00:21:58Z

+            new ReaderBuilder(
+                expectedSchema,
+                fileSchema,
+                NullCheckingForGet.NULL_CHECKING_ENABLED,


Minor point: Since we are going to read/write this system property everywhere, I guess there is no point in passing this parameter around. It is always good to minimize the parameters. I'm OK to fix it or not though.

I did that initially too but I had to change more places as it is used in tests. So I gave up and reverted.

Yeah, it is going to change a lot of places. I'm OK with another PR or something.

Agreed. I'll double check.

I took another look. We should be able to get rid of this property once the deprecated methods are removed. We can then switch to setting the Arrow property in tests.

flyrain

+1 Thanks @aokolnychyi

aokolnychyi · 2023-01-10T20:48:04Z

I took a look at ArrowColumnVector in Spark and it seems to be using its own nullability mechanism.

@Override
public UTF8String getUTF8String(int rowId) {
  if (isNullAt(rowId)) return null;
  return accessor.getUTF8String(rowId);
}

Given the performance impact, I am inclined to go ahead with this change.
Users are unlikely to ever set these properties manually, and would simply conclude that our readers are slow.
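
To illustrate why a wrapper-level null check makes a second check inside the getter redundant, here is a minimal sketch. The `Accessor` interface and `ArrayAccessor` class are hypothetical and only mirror the shape of the getter quoted above; they are not Spark's or Arrow's classes.

```java
public class NullGuardSketch {
  // Hypothetical accessor; stands in for a per-type vector accessor.
  interface Accessor {
    boolean isNullAt(int rowId);

    String getString(int rowId);
  }

  static final class ArrayAccessor implements Accessor {
    private final String[] values;
    int getCalls = 0; // counts how often the getter is actually reached

    ArrayAccessor(String... values) {
      this.values = values;
    }

    @Override
    public boolean isNullAt(int rowId) {
      return values[rowId] == null;
    }

    @Override
    public String getString(int rowId) {
      getCalls++;
      return values[rowId];
    }
  }

  // Same shape as the quoted getter: check for null first, then delegate.
  static String getString(Accessor accessor, int rowId) {
    if (accessor.isNullAt(rowId)) {
      return null;
    }
    return accessor.getString(rowId);
  }

  public static void main(String[] args) {
    ArrayAccessor accessor = new ArrayAccessor("a", null, "c");
    for (int rowId = 0; rowId < 3; rowId++) {
      System.out.println("row " + rowId + ": " + getString(accessor, rowId));
    }
    System.out.println("getter calls: " + accessor.getCalls);
  }
}
```

The getter is never reached for the null slot, so a null check performed again inside it would only add cost for callers that follow this pattern.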

aokolnychyi · 2023-01-10T21:43:44Z

Thanks for reviewing, @singhpk234 @samarthjain @flyrain @RussellSpitzer!

Spark 3.3: Automatically set Arrow properties for read performance

0f6434f

github-actions Bot added build spark labels Jan 9, 2023

RussellSpitzer approved these changes Jan 9, 2023

flyrain reviewed Jan 10, 2023

flyrain approved these changes Jan 10, 2023

aokolnychyi mentioned this pull request Jan 10, 2023

Cache in GenericArrowVectorAccessorFactory causing lot of upfront allocations causing slowness in Vectorized parquet reading #6319

Closed

samarthjain approved these changes Jan 10, 2023

aokolnychyi merged commit ba63f25 into apache:master Jan 10, 2023

This was referenced Jan 10, 2023

Slow performance on TPC-DS tests #4217

Closed

ArrowBuf boundary checks causing CPU burn and slowness in vectorized parq reading #6320

Closed

wypoon mentioned this pull request Jan 26, 2023

Spark 3.2: Automatically set Arrow properties for read performance #6671

Merged

InvisibleProgrammer pushed a commit to InvisibleProgrammer/iceberg that referenced this pull request Mar 10, 2023

Spark 3.3: Automatically set Arrow properties for read performance (a…

539d13c

…pache#6550) (cherry picked from commit ba63f25) Cloudera ID: DEX-8798

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023

Spark 3.3: Automatically set Arrow properties for read performance (a…

d99f155

…pache#6550)

manuzhang mentioned this pull request Sep 27, 2023

Spark failed to read imported parquet file #8655

Closed
