Skip to content

Spark: Support UPDATE statements with subqueries#2206

Merged
aokolnychyi merged 4 commits into
apache:masterfrom
aokolnychyi:update-with-subquery
Feb 4, 2021
Merged

Spark: Support UPDATE statements with subqueries#2206
aokolnychyi merged 4 commits into
apache:masterfrom
aokolnychyi:update-with-subquery

Conversation

@aokolnychyi
Copy link
Copy Markdown
Contributor

@aokolnychyi aokolnychyi commented Feb 3, 2021

This PR adds support for UPDATE statements with subqueries.

Let's consider an example.

sql("UPDATE %s SET dep = 'x' WHERE id IN (SELECT id + 1 FROM %s)", tableName, tableName)
== Optimized Logical Plan ==
ReplaceData RelationV2[id#28, dep#29] testhive.default.table, IcebergBatchWrite(table=testhive.default.table, format=ORC)
+- Union
   :- Project [id#28, x AS dep#40]
   :  +- Join LeftSemi, (id#28 = (id + 1)#32)
   :     :- Project [id#28]
   :     :  +- DynamicFileFilter[id#28, _file#38]
   :     :     :- Project [id#28, _file#38]
   :     :     :  +- RelationV2[id#28, dep#29, _file#38, _pos#39L] testhive.default.table
   :     :     +- Aggregate [_file#38], [_file#38]
   :     :        +- Project [_file#38]
   :     :           +- Join LeftSemi, (id#28 = (id + 1)#32)
   :     :              :- Project [id#28, _file#38]
   :     :              :  +- RelationV2[id#28, dep#29, _file#38, _pos#39L] testhive.default.table
   :     :              +- Project [(id#30 + 1) AS (id + 1)#32]
   :     :                 +- RelationV2[id#30] testhive.default.table
   :     +- Project [(id#30 + 1) AS (id + 1)#32]
   :        +- RelationV2[id#30] testhive.default.table
   +- Project [id#28, dep#29]
      +- Filter NOT (exists#41 <=> true)
         +- Join ExistenceJoin(exists#41), (id#28 = (id + 1)#32)
            :- Project [id#28, dep#29]
            :  +- RelationV2[id#28, dep#29, _file#38, _pos#39L] testhive.default.table
            +- Project [(id#30 + 1) AS (id + 1)#32]
               +- RelationV2[id#30] testhive.default.table

import org.apache.spark.sql.catalyst.utils.PlanUtils.isIcebergRelation

object MergeIntoTablePredicateCheck extends (LogicalPlan => Unit) {
object RowLevelOperationsPredicateCheck extends (LogicalPlan => Unit) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I squashed multiple rules into one.

scanPlan: LogicalPlan,
assignments: Seq[Assignment],
cond: Expression): LogicalPlan = {
cond: Expression = Literal.TrueLiteral): LogicalPlan = {
Copy link
Copy Markdown
Contributor Author

@aokolnychyi aokolnychyi Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we process matched rows, we know all rows must be updated. No need for If expressions.

}

@Test
public void testUpdateWithInSubquery() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I validated results match Postgres for some queries. I'll check others before the PR is merged.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I think you may want to check here is subqueries that uses aliases as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me add this together with self joins.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added testUpdateWithSelfSubquery with this.

Comment thread spark3/src/main/java/org/apache/iceberg/spark/source/SparkMergeScan.java Outdated
@aokolnychyi
Copy link
Copy Markdown
Contributor Author

@github-actions github-actions Bot added the spark label Feb 3, 2021
Copy link
Copy Markdown
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in the rewrite looks correct to me.

@dilipbiswal
Copy link
Copy Markdown
Contributor

@aokolnychyi A small request.. Is it possible for you to put the rewritten optimized plan in the pr description ? Just for us to refer back to quickly get an idea of the rewrite.

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

@dilipbiswal, good idea. I'll add a few sample plans.

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

@dilipbiswal, added an example to the PR description.

@aokolnychyi aokolnychyi merged commit 4b6fa61 into apache:master Feb 4, 2021
@aokolnychyi
Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @RussellSpitzer @dilipbiswal @rdblue!

coolderli pushed a commit to coolderli/iceberg that referenced this pull request Apr 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants