Skip to content

Spark 3.3: Automatically set Arrow properties for read performance#6550

Merged
aokolnychyi merged 1 commit into
apache:masterfrom
aokolnychyi:set-arrow-props
Jan 10, 2023
Merged

Spark 3.3: Automatically set Arrow properties for read performance#6550
aokolnychyi merged 1 commit into
apache:masterfrom
aokolnychyi:set-arrow-props

Conversation

@aokolnychyi
Copy link
Copy Markdown
Contributor

This PR adds logic to automatically set Arrow properties for read performance. Unless these properties are set, our read path can be up to 2x slower than built-in read path in Spark.

I verified this patch by removing all explicit settings and running benchmarks with and without setting properties.

Benchmark results without setting Arrow properties:

Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  1.321 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.064 ± 0.162   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  2.187 ± 0.031   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.304 ± 0.287   s/op

Benchmark results with setting Arrow properties automatically:

Benchmark                                                              Mode  Cnt  Score   Error  Units
VectorizedReadFlatParquetDataBenchmark.readLongsIcebergVectorized5k      ss    5  0.927 ± 0.028   s/op
VectorizedReadFlatParquetDataBenchmark.readLongsSparkVectorized5k        ss    5  1.035 ± 0.070   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsIcebergVectorized5k    ss    5  1.306 ± 0.029   s/op
VectorizedReadFlatParquetDataBenchmark.readStringsSparkVectorized5k      ss    5  1.369 ± 0.114   s/op

deleteFilter));
}

public static ColumnarBatchReader buildReader(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is added to make sure NullCheckingForGet.NULL_CHECKING_ENABLED is referenced after we set system properties in this class. Otherwise, Arrow would memorize them earlier and our logic would have no effect. See BoundsChecking and NullCheckingForGet in Arrow for details.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never had specific Iceberg configs for this behavior. We always relied on system properties for Arrow.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

In our internal Spark fork, we have the arrow config settings configured in the executor JVM args. I can confirm that things have been running fine with them in our prod env for the last 2+ years. So, +1.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's nice to hear, @samarthjain! All of our tests also set these properties so it seems fairly safe.

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

Copy link
Copy Markdown
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.

That said, I think the fact that we leave escape valves is good and we just should be very clear in the documentation that we are changing behavior for all of Spark.

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

My only worry is we basically are changing behavior for other Arrow uses within the Spark Environment and we do not change the property back to the default when we are done if it was not configured.

I had the same question but the problem is that Arrow would read properties only once and then cache the value. Even if we change the values back, it will have no effect. Let me look into Spark a bit more to see if anything can go wrong.

@singhpk234
Copy link
Copy Markdown
Contributor

singhpk234 commented Jan 10, 2023

+1 for the changes
more tickets related to same :

  1. Slow performance on TPC-DS tests #4217 (comment)
  2. ArrowBuf boundary checks causing CPU burn and slowness in vectorized parq reading #6320

new ReaderBuilder(
expectedSchema,
fileSchema,
NullCheckingForGet.NULL_CHECKING_ENABLED,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor point: Since we are going to read/write this system property everywhere, I guess there is no point to pass around this parameter. Always good to minimize the parameters. I'm OK to fix it or not though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that initially too but I had to change more places as it is used in tests. So I gave up and reverted.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it is going to change a lot of places. I'm OK with another PR or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I'll double check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look. We should be able to get rid of this property once the deprecated methods are removed. We can then switch to setting the Arrow property in tests.

Copy link
Copy Markdown
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks @aokolnychyi

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

aokolnychyi commented Jan 10, 2023

I took a look at ArrowColumnVector in Spark and it seems to be using its own nullability mechanism.

@Override
public UTF8String getUTF8String(int rowId) {
  if (isNullAt(rowId)) return null;
  return accessor.getUTF8String(rowId);
}

Given the performance impact, I am inclined to go ahead with this change.
It is unlikely users will ever set this manually and will think our readers are slow.

@aokolnychyi aokolnychyi merged commit ba63f25 into apache:master Jan 10, 2023
@aokolnychyi
Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @singhpk234 @samarthjain @flyrain @RussellSpitzer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants