Filtering dictionary encoded FIX_LEN_BYTE_ARRAY #2551

Merged
rdblue merged 5 commits into apache:master from gszadovszky:2546 on May 20, 2021

Conversation

@gszadovszky
Contributor

Fixes #2546

Comment on lines +409 to +410
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
break;
Contributor Author

I wanted to write the following, but Checkstyle failed on it because of indentation. I guess it is/was a Checkstyle bug. I don't have experience with Gradle, so I am not sure how to fix this in the Iceberg configuration; probably we need to upgrade Checkstyle somehow.

Anyway, both the current version and this other one are correct, but I would prefer the latter.

Suggested change
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
break;
case FIXED_LEN_BYTE_ARRAY: // Same as BINARY
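
For readability, a sketch of how the preferred fall-through form would read in context, assuming the existing BINARY case directly follows in the surrounding switch (those lines are not shown in this diff):

case FIXED_LEN_BYTE_ARRAY: // Same as BINARY: fall through
case BINARY:
  dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
  break;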

OutputFile outFile = Files.localOutput(parquetFile);
try (FileAppender<Record> appender = Parquet.write(outFile)
.schema(FILE_SCHEMA)
.withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type
Contributor

Why is this?

Contributor Author

I don't know why we do not use dictionary encoding for V1, but that is the current behavior; see DefaultV1ValuesWriterFactory. Meanwhile, this behavior is not specified anywhere, so other implementations may write dictionary-encoded fixed-length columns, and Impala certainly does.
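
Not part of this PR, just a hedged illustration of the point above: with parquet-mr you can inspect the file footer to see whether a column chunk actually carries a dictionary encoding. The sketch assumes a local java.io.File named parquetFile (as in the test) and a surrounding method that throws IOException.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Open the footer of the file we just wrote and report which column chunks
// were dictionary-encoded.
try (ParquetFileReader reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(parquetFile.toURI()), new Configuration()))) {
  for (ColumnChunkMetaData column : reader.getFooter().getBlocks().get(0).getColumns()) {
    boolean dictEncoded = column.getEncodings().contains(Encoding.PLAIN_DICTIONARY)
        || column.getEncodings().contains(Encoding.RLE_DICTIONARY);
    System.out.println(column.getPath() + " dictionary-encoded: " + dictEncoded);
  }
}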

Contributor

@rdblue: Does the Iceberg specification define which writer version we should use?

@gszadovszky: What happens when a V2 writer is used and an old Parquet client tries to read the file?

Thanks,
Peter

Contributor Author

@pvary: V1 and V2 are quite tricky references because it is not always clear what they cover. From the parquet-mr point of view, the most basic differences are the page headers (v1 or v2) and the encodings used. Since parquet-mr has supported v2 for a while, the concern is mostly not old parquet-mr readers but other Parquet reader implementations. For example, Impala supports neither v2 page headers nor the encodings parquet-mr uses when v2 is set.
BUT using dictionary encoding for fixed data types is not against the Parquet specification for v1, and Impala does it. (It is only parquet-mr that does not dictionary-encode fixed types for v1, but it can read such files, and the filtering also works just fine.)

I hesitated between switching the whole unit test to V2 and implementing a separate test for fixed+dictionary, but this seemed easier and simpler. Also, V1 vs. V2 seems irrelevant to this test (apart from the fact that we cannot test fixed+dictionary with V1).

Contributor

@pvary, I'm assuming that people mostly don't use Parquet v2 because it was never finalized (to my knowledge). I think there's an ongoing effort in the Parquet community to clean up some of the docs, but I've not seen a vote to adopt any of the v2 encodings as official.

gszadovszky requested a review from rdblue on May 13, 2021 at 15:44
@gszadovszky
Contributor Author

I think it is fine to extend the API with package-private methods for testing purposes. It makes the testing simpler and easier to implement. That said, I am open to investing in writing the test without the need for this extension if you think it is required.
@rdblue, what do you think?
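
A rough sketch of the kind of package-private test hook being discussed; the class and method names here are illustrative, not the actual Iceberg API:

import org.apache.parquet.column.ParquetProperties.WriterVersion;

public class WriteBuilder {
  private WriterVersion writerVersion = WriterVersion.PARQUET_1_0;

  // Package-private: visible to tests in the same package, but not part of the
  // public API, so a test can force a specific Parquet writer version.
  WriteBuilder withWriterVersion(WriterVersion version) {
    this.writerVersion = version;
    return this;
  }
}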

@rdblue
Contributor

rdblue commented May 17, 2021

@gszadovszky, I think the package-private method is fine. But we need to make sure that this doesn't stop testing v1 since that's what people primarily use. Could you add v1/v2 as a test parameter so all the tests run on both?

@gszadovszky
Contributor Author

Thanks @rdblue. I thought v1/v2 was irrelevant for this test, but I agree it is better to be on the safe side, and adding the writer version as a test parameter is a good idea. I'll make this change soon.
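
A minimal sketch of what that parameterization could look like with JUnit 4; the class name and structure are illustrative, not necessarily the exact code in the PR:

import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestDictionaryRowGroupFilter {

  @Parameterized.Parameters(name = "writerVersion = {0}")
  public static Object[] parameters() {
    return new Object[] {WriterVersion.PARQUET_1_0, WriterVersion.PARQUET_2_0};
  }

  private final WriterVersion writerVersion;

  public TestDictionaryRowGroupFilter(WriterVersion writerVersion) {
    this.writerVersion = writerVersion;
  }

  // Each test then writes its data file with .withWriterVersion(writerVersion),
  // so every assertion runs against both v1 and v2 files.
}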

.shouldRead(parquetSchema, rowGroupMetadata, dictionaryStore);
Assert.assertFalse("Should not read: No decimal_fixed values less than -1234567890.0987654321", shouldRead);
switch (writerVersion) {
case PARQUET_1_0:
Contributor

Normally, we would use Assume to simply skip this test for Parquet v1. Would you mind doing that here? I don't think we need to validate that Parquet doesn't dictionary encode.
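
For reference, a hedged sketch of the Assume-based skip being suggested, assuming the writerVersion test parameter discussed above:

// Skip the dictionary-specific assertions for Parquet v1 files, where parquet-mr
// does not dictionary-encode FIXED_LEN_BYTE_ARRAY columns.
Assume.assumeTrue("FIXED_LEN_BYTE_ARRAY is only dictionary-encoded for v2 files",
    writerVersion == WriterVersion.PARQUET_2_0);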

rdblue merged commit 0e4f0f0 into apache:master on May 20, 2021
@rdblue
Contributor

rdblue commented May 20, 2021

Thanks, @gszadovszky!


Development

Successfully merging this pull request may close these issues.

Dictionary filter for Parquet does not handle FIXED_LEN_BYTE_ARRAY (#2546)
