Filtering dictionary encoded FIXED_LEN_BYTE_ARRAY #2551
Conversation
```java
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
  break;
```
I wanted to write the following but Checkstyle failed on it because of indentation. I guess it is/was a Checkstyle bug. I don't have experience with Gradle, so I am not sure how to fix this in the Iceberg configuration. Probably we need to upgrade Checkstyle somehow.
Anyway, both the current form and this other one are correct, but I would prefer the latter.
The current form:

```java
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
  break;
```

The preferred form, falling through to the `BINARY` case:

```java
case FIXED_LEN_BYTE_ARRAY: // Same as BINARY
```
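To make the fall-through idea concrete, here is a hypothetical, self-contained sketch (not Iceberg's actual code; the class and enum names are invented for illustration). `FIXED_LEN_BYTE_ARRAY` shares the `BINARY` branch instead of duplicating its body:

```java
// Hypothetical sketch of the fall-through style discussed above.
public class FallThroughSketch {
  // Invented stand-in for Parquet's primitive type names.
  enum PrimitiveTypeName { BINARY, FIXED_LEN_BYTE_ARRAY, INT32 }

  static String decode(PrimitiveTypeName type) {
    switch (type) {
      case FIXED_LEN_BYTE_ARRAY: // Same as BINARY: fall through
      case BINARY:
        return "binary";
      default:
        return "other";
    }
  }

  public static void main(String[] args) {
    // Both cases take the same branch.
    System.out.println(decode(PrimitiveTypeName.FIXED_LEN_BYTE_ARRAY));
    System.out.println(decode(PrimitiveTypeName.BINARY));
  }
}
```

The fall-through avoids repeating the `dictSet.add(...)` body, at the cost of needing a comment (and sometimes a Checkstyle suppression) to signal that the missing `break` is intentional.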
```java
OutputFile outFile = Files.localOutput(parquetFile);
try (FileAppender<Record> appender = Parquet.write(outFile)
    .schema(FILE_SCHEMA)
    .withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type
```
I don't know why we do not use dictionary encoding for V1, but that is the current behavior; see DefaultV1ValuesWriterFactory. Meanwhile, this behavior is not specified anywhere, so other implementations may write it, and Impala certainly does.
@rdblue: Does the Iceberg specification define which writer version we should use?
@gszadovszky: What happens when a V2 writer is used and an old Parquet client tries to read the file?
Thanks,
Peter
@pvary: V1 and V2 are quite tricky references because it is not always clear what they cover. From the parquet-mr point of view, the most basic differences are the page headers (v1 or v2) and the encodings used. Since parquet-mr has supported v2 for a while, the concern is mostly not old parquet-mr readers but the other implementations of Parquet readers. For example, Impala supports neither v2 page headers nor the encodings parquet-mr uses when v2 is set.
BUT, using dictionary encoding for fixed data types is not against the Parquet specification for v1, and Impala does it. (It is only parquet-mr that does not do dictionary encoding for fixed types in the v1 case, but it can read it, and the filtering also works just fine.)
I've been hesitating to switch the whole unit test to V2 (or implement a separate test for fixed+dictionary) but this seemed easier and simpler. Also, V1 or V2 seems irrelevant to this test (excluding the fact that we cannot test fixed+dictionary with V1).
@pvary, I'm assuming that people mostly don't use Parquet v2 because it was never finalized (to my knowledge). I think there's an ongoing effort in the Parquet community to clean up some of the docs, but I've not seen a vote to adopt any of the v2 encodings as official.
I think it is fine to extend the API with package-private methods for testing purposes. It makes the testing simpler and easier to implement. Meanwhile, I am open to investing in writing the test without this extension if you think it is required.
@gszadovszky, I think the package-private method is fine. But we need to make sure that this doesn't stop testing v1, since that's what people primarily use. Could you add v1/v2 as a test parameter so all the tests run on both?
Thanks @rdblue. I thought v1/v2 was irrelevant to this test, but I agree it is better to be on the safe side, and adding the writer version as a test parameter is a good idea. I'll make this change soon.
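To illustrate the suggestion, here is a hypothetical sketch of running the same checks under both writer versions (not the actual Iceberg test; the class, method, and enum names are invented). The real change would use JUnit 4's `@RunWith(Parameterized.class)` with the writer version as the parameter; a plain loop is shown here to keep the example self-contained:

```java
// Hypothetical sketch: sweep both Parquet writer versions through one test body.
import java.util.ArrayList;
import java.util.List;

public class WriterVersionSweep {
  // Invented stand-in for parquet-mr's ParquetProperties.WriterVersion.
  enum WriterVersion { PARQUET_1_0, PARQUET_2_0 }

  // Placeholder for the real write-then-filter assertions, which would write a
  // file with this writer version and check dictionary-filter behavior.
  static String runDictionaryFilterChecks(WriterVersion version) {
    return "checked " + version;
  }

  static List<String> runAll() {
    List<String> results = new ArrayList<>();
    for (WriterVersion v : WriterVersion.values()) {
      results.add(runDictionaryFilterChecks(v));
    }
    return results;
  }

  public static void main(String[] args) {
    System.out.println(runAll());
  }
}
```

With the JUnit `Parameterized` runner, each version shows up as a separate test case in reports, so a v2-only regression is visible without hiding v1 coverage.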
```java
    .shouldRead(parquetSchema, rowGroupMetadata, dictionaryStore);
Assert.assertFalse("Should not read: No decimal_fixed values less than -1234567890.0987654321", shouldRead);
switch (writerVersion) {
  case PARQUET_1_0:
```
Normally, we would use Assume to simply skip this test for Parquet v1. Would you mind doing that here? I don't think we need to validate that Parquet doesn't dictionary encode.
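A hypothetical sketch of that Assume-based skip (not the actual test; the class and enum names are invented). With JUnit 4 the test would call `Assume.assumeTrue(writerVersion == WriterVersion.PARQUET_2_0);`, which marks the test as skipped rather than passed for v1; only the underlying predicate is shown here so the example stays self-contained:

```java
// Hypothetical sketch of the suggested skip condition.
public class FixedDictionaryAssumption {
  // Invented stand-in for parquet-mr's ParquetProperties.WriterVersion.
  enum WriterVersion { PARQUET_1_0, PARQUET_2_0 }

  // parquet-mr only dictionary-encodes FIXED_LEN_BYTE_ARRAY under v2 (see the
  // DefaultV1ValuesWriterFactory discussion above), so the fixed+dictionary
  // filter checks are only meaningful there.
  static boolean shouldRunFixedDictionaryChecks(WriterVersion version) {
    return version == WriterVersion.PARQUET_2_0;
  }

  public static void main(String[] args) {
    System.out.println(shouldRunFixedDictionaryChecks(WriterVersion.PARQUET_1_0));
    System.out.println(shouldRunFixedDictionaryChecks(WriterVersion.PARQUET_2_0));
  }
}
```

Using Assume instead of a `switch` on the writer version keeps the test body free of version-specific branches and records the v1 run as skipped rather than silently asserting nothing.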
Thanks, @gszadovszky!
Fixes #2546