Spark 3.3: Automatically set Arrow properties for read performance#6550
Conversation
| deleteFilter)); | ||
| } | ||
|
|
||
| public static ColumnarBatchReader buildReader( |
There was a problem hiding this comment.
This method is added to make sure NullCheckingForGet.NULL_CHECKING_ENABLED is referenced after we set system properties in this class. Otherwise, Arrow would memorize them earlier and our logic would have no effect. See BoundsChecking and NullCheckingForGet in Arrow for details.
There was a problem hiding this comment.
We never had specific Iceberg configs for this behavior. We always relied on system properties for Arrow.
There was a problem hiding this comment.
LGTM.
In our internal Spark fork, we have the arrow config settings configured in the executor JVM args. I can confirm that things have been running fine with them in our prod env for the last 2+ years. So, +1.
There was a problem hiding this comment.
That's nice to hear, @samarthjain! All of our tests also set these properties so it seems fairly safe.
RussellSpitzer
left a comment
There was a problem hiding this comment.
My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.
That said, I think the fact that we leave escape valves is good and we just should be very clear in the documentation that we are changing behavior for all of Spark.
I had the same question but the problem is that Arrow would read properties only once and then cache the value. Even if we change the values back, it will have no effect. Let me look into Spark a bit more to see if anything can go wrong. |
|
+1 for the changes |
| new ReaderBuilder( | ||
| expectedSchema, | ||
| fileSchema, | ||
| NullCheckingForGet.NULL_CHECKING_ENABLED, |
There was a problem hiding this comment.
Minor point: Since we are going to read/write this system property everywhere, I guess there is no point to pass around this parameter. Always good to minimize the parameters. I'm OK to fix it or not though.
There was a problem hiding this comment.
I did that initially too but I had to change more places as it is used in tests. So I gave up and reverted.
There was a problem hiding this comment.
yeah, it is going to change a lot of places. I'm OK with another PR or something.
There was a problem hiding this comment.
Agreed. I'll double check.
There was a problem hiding this comment.
I took another look. We should be able to get rid of this property once the deprecated methods are removed. We can then switch to setting the Arrow property in tests.
|
I took a look at Given the performance impact, I am inclined to go ahead with this change. |
|
Thanks for reviewing, @singhpk234 @samarthjain @flyrain @RussellSpitzer! |
…pache#6550) (cherry picked from commit ba63f25) Cloudera ID: DEX-8798
This PR adds logic to automatically set Arrow properties for read performance. Unless these properties are set, our read path can be up to 2x slower than built-in read path in Spark.
I verified this patch by removing all explicit settings and running benchmarks with and without setting properties.
Benchmark results without setting Arrow properties:
Benchmark results with setting Arrow properties automatically: