Flink: bridge the gap btw FlinkSource and IcebergSource (FLIP-27) and… by stevenzwu · Pull Request #5318 · apache/iceberg

stevenzwu · 2022-07-21T00:06:18Z

… added an opt-in config to use FLIP-27 source in Flink SQL

stevenzwu · 2022-07-21T01:26:35Z

 */

-package org.apache.iceberg.flink;
+package org.apache.iceberg.flink.source;


Moved this class into the source package so that it can use some package-private methods from IcebergSource. It also seems like a good home for this class.

stevenzwu · 2022-07-21T01:27:48Z

+        source, WatermarkStrategy.noWatermarks(), source.name(), TypeInformation.of(RowData.class));
+
+    if (source.getBoundedness() == Boundedness.BOUNDED) {
+      int parallelism = SourceUtil.inferParallelism(readableConfig, limit, () -> source.planSplitsForBatch().size());


The expensive planSplitsForBatch lambda is executed only if parallelism inference is enabled in the config.
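
As a rough illustration of the point above, here is a minimal sketch of deferring the split count behind a supplier. It does not use the Flink or Iceberg APIs: `ParallelismSketch`, its parameters, and the capping rules are assumptions made for the example, not the actual `SourceUtil` code from this PR.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical stand-in for the PR's SourceUtil; names and rules are assumptions.
class ParallelismSketch {

  // Picks a source parallelism. The split-count supplier is invoked only
  // when inference is enabled, so split planning is skipped otherwise.
  static int inferParallelism(
      boolean inferEnabled, int defaultParallelism, int maxInferred, long limit, IntSupplier splitCount) {
    int parallelism = defaultParallelism;
    if (inferEnabled) {
      // This is the only place where the expensive supplier runs.
      parallelism = Math.min(splitCount.getAsInt(), maxInferred);
    }
    if (limit > 0) {
      // A row limit caps the useful parallelism; the result fits in an int
      // because it is never larger than the current parallelism.
      parallelism = (int) Math.min(parallelism, limit);
    }
    return Math.max(1, parallelism);
  }

  public static void main(String[] args) {
    AtomicInteger planningCalls = new AtomicInteger();
    // Stands in for () -> source.planSplitsForBatch().size()
    IntSupplier expensiveSplitCount = () -> {
      planningCalls.incrementAndGet();
      return 8;
    };

    System.out.println(inferParallelism(false, 4, 100, -1, expensiveSplitCount)); // prints 4
    System.out.println(planningCalls.get()); // prints 0, planning was skipped
    System.out.println(inferParallelism(true, 4, 100, -1, expensiveSplitCount)); // prints 8
    System.out.println(planningCalls.get()); // prints 1
  }
}
```

Passing a supplier rather than a precomputed count keeps the call site simple while leaving the decision to plan splits to the helper.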

stevenzwu · 2022-07-21T01:31:06Z

      }
    }
-
-    int inferParallelism(FlinkInputFormat format, ScanContext context) {


The following two methods were moved to a new SourceUtil class so that the FLIP-27 IcebergSource can reuse them.

stevenzwu · 2022-07-21T01:32:43Z

  @Override
  public void start() {
    super.start();
-    if (shouldEnumerate) {


Split discovery for the static/batch enumerator is now performed in IcebergSource (line 183). This consolidates batch planning into a single class.

stevenzwu · 2022-07-21T01:33:34Z

+import org.apache.flink.util.CloseableIterator;
+import org.apache.iceberg.relocated.com.google.common.collect.Lists;
+
+public class SqlHelpers {


Utility methods extracted from TestFlinkScanSql.

stevenzwu · 2022-07-21T01:35:23Z

  }

-  @Before
-  public void before() throws IOException {


This was refactored into HadoopCatalogResource so that it is reusable (e.g. by the new TestSqlBase).

stevenzwu · 2022-07-21T01:36:17Z

-  }
-
-  @Test
-  public void testResiduals() throws Exception {


The remaining test methods were moved into TestSqlBase so that the tests are shared between the current source and the new FLIP-27 source.

stevenzwu · 2022-07-21T01:37:08Z

 */

-package org.apache.iceberg.flink;
+package org.apache.iceberg.flink.source;


Moved to the source package to be consistent with the move of the IcebergTableSource class.

rdblue

Looks good! Let me know whether we need the infer parallelism changes and I'll merge.

stevenzwu · 2022-07-22T20:53:48Z

@zhangjun0x01 @openinx @yittg I'd like to get your input on the parallelism-inference feature. The current implementation in FlinkSource requires planning splits twice: (1) once to get the splits and derive the split count, and (2) again in the source itself. This is obviously not ideal, so @rdblue and I are wondering how useful this feature is. Do we need to carry it over to the FLIP-27 source? The main question concerns the additional split planning. The other parts of parallelism inference (like the limit count) should be fine.

For now, I am going to exclude it from this PR. If we decide to carry it over, I can follow up with a separate PR.

rdblue · 2022-07-26T16:53:28Z

    this.scanContext = scanContext;
    this.readerFunction = readerFunction;
    this.assignerFactory = assignerFactory;
+    this.table = table;


I thought we were going to avoid these changes for now, since we don't know whether they will be needed to infer parallelism?

stevenzwu · 2022-07-26

We are leaving out the parallelism-inference feature, but I think this refactoring is still good.

It avoids loading the table twice. A Table is loaded in the builder to get fields like schema, io, encryption, etc. Without this change, it would be loaded again in the IcebergSource#createEnumerator method, which also runs in the jobmanager/driver.

table/lazyTable() is also used by the name() getter.
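
To make the double-loading argument concrete, here is a minimal sketch of passing an already-loaded table into the source and falling back to a loader only when none was provided. Everything in it (`LazyTableSketch`, the toy `Table` type, the `"source-"` name prefix, `tableForEnumerator`) is a hypothetical stand-in and not the actual IcebergSource code.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch; none of these types are the real Iceberg classes.
class LazyTableSketch {

  // Toy table type standing in for an Iceberg Table.
  static final class Table {
    private final String name;

    Table(String name) {
      this.name = name;
    }

    String name() {
      return name;
    }
  }

  static final class Source {
    private final Supplier<Table> tableLoader;
    private Table table; // set by the builder when it already loaded the table

    Source(Supplier<Table> tableLoader, Table preloaded) {
      this.tableLoader = tableLoader;
      this.table = preloaded;
    }

    // Loads the table at most once, and only if the builder did not pass it in.
    Table lazyTable() {
      if (table == null) {
        table = tableLoader.get();
      }
      return table;
    }

    // Both the name getter and enumerator creation reuse the same instance.
    String name() {
      return "source-" + lazyTable().name();
    }

    Table tableForEnumerator() {
      return lazyTable();
    }
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    Supplier<Table> loader = () -> {
      loads.incrementAndGet();
      return new Table("db.events");
    };

    Table loadedByBuilder = loader.get(); // the builder needs schema, io, etc.
    Source source = new Source(loader, loadedByBuilder);
    source.name();
    source.tableForEnumerator();
    System.out.println(loads.get()); // prints 1: no second load
  }
}
```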

rdblue · 2022-07-27T20:15:54Z

@stevenzwu, can you rebase?

rdblue · 2022-07-27T21:29:13Z

Looks good to me! I'll merge when tests are passing.

stevenzwu · 2022-07-28T04:02:37Z

@rdblue this is ready to be merged. I reviewed the diff again and it looks good after the rebase.

The previous two CI runs failed due to flakiness caused by congested CI machines (probably overwhelmed by the many PRs rebased after the big-bang spotlessApply commit). One run failed in TestS3OutputStream. The other failed in the FLIP-27 TestIcebergSourceFailover because it wasn't able to make enough progress after 2 minutes of waiting.

github-actions Bot added the flink label Jul 21, 2022

stevenzwu force-pushed the flip27SQL branch from 6d478cd to afa9d9d Compare July 21, 2022 00:35

stevenzwu commented Jul 21, 2022

rdblue reviewed Jul 22, 2022

Comment thread flink/v1.15/flink/src/main/java/org/apache/iceberg/flink/source/IcebergTableSource.java

rdblue approved these changes Jul 22, 2022

stevenzwu force-pushed the flip27SQL branch from afa9d9d to 2412334 Compare July 22, 2022 20:44

stevenzwu force-pushed the flip27SQL branch from 47407bd to 8b9b8c5 Compare July 22, 2022 21:38

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 22, 2022

Flink: port PR apache#5318 to 1.14

b1b2071

rdblue reviewed Jul 26, 2022

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 27, 2022

Flink: port PR apache#5318 to 1.14

8d09464

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 27, 2022

Flink: port PR apache#5318 to 1.14

a36b6a9

Flink: bridge the gap btw FlinkSource and IcebergSource (FLIP-27) and added an opt-in config to use FLIP-27 source in Flink SQL

62201b7

stevenzwu force-pushed the flip27SQL branch from 8b9b8c5 to 62201b7 Compare July 27, 2022 20:24

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 27, 2022

Flink: port PR apache#5318 to 1.14

f836e4e

rdblue approved these changes Jul 27, 2022

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 27, 2022

Flink: port PR apache#5318 to 1.14

8e05f78

fix a few minor issues: rebase regression, comments formatting, assertion order on expected vs actual

e37afb8

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2022

Flink: port PR apache#5318 to 1.14

cf6313c

stevenzwu closed this Jul 28, 2022

stevenzwu reopened this Jul 28, 2022

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jul 28, 2022

Flink: port PR apache#5318 to 1.14

70df698

rdblue merged commit 23c9345 into apache:master Jul 28, 2022
