Filtering dictionary encoded FIX_LEN_BYTE_ARRAY #2551

Merged
rdblue merged 5 commits into apache:master from gszadovszky:2546 on May 20, 2021

Conversation

@gszadovszky
Contributor

Fixes #2546

Comment on lines +409 to +410
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
break;
Contributor Author

I wanted to write the following, but Checkstyle failed on it because of indentation. I guess it is/was a Checkstyle bug. I don't have experience with Gradle, so I am not sure how to fix this in the Iceberg configuration; probably we need to upgrade Checkstyle somehow.

Anyway, both the current version and this other one are correct, but I would prefer the latter.

Suggested change
case FIXED_LEN_BYTE_ARRAY: dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
break;
case FIXED_LEN_BYTE_ARRAY: // Same as BINARY
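
For readability, a sketch of how the preferred fall-through form would read in context, assuming the existing BINARY case directly follows in the surrounding switch (those lines are not shown in this diff):

case FIXED_LEN_BYTE_ARRAY: // Same as BINARY: fall through
case BINARY:
  dictSet.add((T) conversion.apply(dict.decodeToBinary(i)));
  break;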

OutputFile outFile = Files.localOutput(parquetFile);
try (FileAppender<Record> appender = Parquet.write(outFile)
.schema(FILE_SCHEMA)
.withWriterVersion(WriterVersion.PARQUET_2_0) // V2 is needed to force dictionary encoding for FIXED type
Contributor

Why is this?

Contributor Author

I don't know why we do not use dictionary encoding for V1, but that is the current behavior; see DefaultV1ValuesWriterFactory. Meanwhile, this behavior is not specified anywhere, so other implementations may write dictionary-encoded fixed-length columns, and Impala certainly does.
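
Not part of this PR, just a hedged illustration of the point above: with parquet-mr you can inspect the file footer to see whether a column chunk actually carries a dictionary encoding. The sketch assumes a local java.io.File named parquetFile (as in the test) and a surrounding method that throws IOException.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Open the footer of the file we just wrote and report which column chunks
// were dictionary-encoded.
try (ParquetFileReader reader = ParquetFileReader.open(
    HadoopInputFile.fromPath(new Path(parquetFile.toURI()), new Configuration()))) {
  for (ColumnChunkMetaData column : reader.getFooter().getBlocks().get(0).getColumns()) {
    boolean dictEncoded = column.getEncodings().contains(Encoding.PLAIN_DICTIONARY)
        || column.getEncodings().contains(Encoding.RLE_DICTIONARY);
    System.out.println(column.getPath() + " dictionary-encoded: " + dictEncoded);
  }
}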

Contributor

@rdblue: Does the Iceberg specification define which writer version we should use?

@gszadovszky: What happens when a V2 writer is used and an old Parquet client tries to read the file?

Thanks,
Peter

Contributor Author

@pvary: V1 and V2 are quite tricky references because it is not always clear what they cover. From the parquet-mr point of view, the most basic differences are the page headers (v1 or v2) and the encodings used. Since parquet-mr has supported v2 for a while, the concern is mostly not old parquet-mr readers but other Parquet reader implementations. For example, Impala supports neither v2 page headers nor the encodings parquet-mr uses when v2 is set.
BUT using dictionary encoding for fixed data types is not against the Parquet specification for v1, and Impala does it. (It is only parquet-mr that does not dictionary-encode fixed types for v1, but it can read such files, and the filtering also works just fine.)

I hesitated between switching the whole unit test to V2 and implementing a separate test for fixed+dictionary, but this seemed easier and simpler. Also, V1 vs. V2 seems irrelevant to this test (apart from the fact that we cannot test fixed+dictionary with V1).

Contributor

@pvary, I'm assuming that people mostly don't use Parquet v2 because it was never finalized (to my knowledge). I think there's an ongoing effort in the Parquet community to clean up some of the docs, but I've not seen a vote to adopt any of the v2 encodings as official.

gszadovszky requested a review from rdblue on May 13, 2021 at 15:44
@gszadovszky
Contributor Author

I think it is fine to extend the API with package-private methods for testing purposes. It makes the testing simpler and easier to implement. That said, I am open to investing in writing the test without the need for this extension if you think it is required.
@rdblue, what do you think?
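
A rough sketch of the kind of package-private test hook being discussed; the class and method names here are illustrative, not the actual Iceberg API:

import org.apache.parquet.column.ParquetProperties.WriterVersion;

public class WriteBuilder {
  private WriterVersion writerVersion = WriterVersion.PARQUET_1_0;

  // Package-private: visible to tests in the same package, but not part of the
  // public API, so a test can force a specific Parquet writer version.
  WriteBuilder withWriterVersion(WriterVersion version) {
    this.writerVersion = version;
    return this;
  }
}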

@rdblue
Contributor

rdblue commented May 17, 2021

@gszadovszky, I think the package-private method is fine. But we need to make sure that this doesn't stop testing v1 since that's what people primarily use. Could you add v1/v2 as a test parameter so all the tests run on both?

@gszadovszky
Contributor Author

Thanks @rdblue. I thought v1/v2 was irrelevant for this test, but I agree it is better to be on the safe side, and adding the writer version as a test parameter is a good idea. I'll make this change soon.
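
A minimal sketch of what that parameterization could look like with JUnit 4; the class name and structure are illustrative, not necessarily the exact code in the PR:

import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class TestDictionaryRowGroupFilter {

  @Parameterized.Parameters(name = "writerVersion = {0}")
  public static Object[] parameters() {
    return new Object[] {WriterVersion.PARQUET_1_0, WriterVersion.PARQUET_2_0};
  }

  private final WriterVersion writerVersion;

  public TestDictionaryRowGroupFilter(WriterVersion writerVersion) {
    this.writerVersion = writerVersion;
  }

  // Each test then writes its data file with .withWriterVersion(writerVersion),
  // so every assertion runs against both v1 and v2 files.
}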

.shouldRead(parquetSchema, rowGroupMetadata, dictionaryStore);
Assert.assertFalse("Should not read: No decimal_fixed values less than -1234567890.0987654321", shouldRead);
switch (writerVersion) {
case PARQUET_1_0:
Contributor

Normally, we would use Assume to simply skip this test for Parquet v1. Would you mind doing that here? I don't think we need to validate that Parquet doesn't dictionary encode.
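
For reference, a hedged sketch of the Assume-based skip being suggested, assuming the writerVersion test parameter discussed above:

// Skip the dictionary-specific assertions for Parquet v1 files, where parquet-mr
// does not dictionary-encode FIXED_LEN_BYTE_ARRAY columns.
Assume.assumeTrue("FIXED_LEN_BYTE_ARRAY is only dictionary-encoded for v2 files",
    writerVersion == WriterVersion.PARQUET_2_0);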

rdblue merged commit 0e4f0f0 into apache:master on May 20, 2021
@rdblue
Contributor

rdblue commented May 20, 2021

Thanks, @gszadovszky!


Development

Successfully merging this pull request may close these issues.

Dictionary filter for Parquet does not handle FIXED_LEN_BYTE_ARRAY (#2546)
