Skip to content

Spark: Read DVs when reading from .position_deletes table#11657

Merged
nastra merged 4 commits into
apache:mainfrom
nastra:dv-iterable-for-position-deletes
Dec 16, 2024
Merged

Spark: Read DVs when reading from .position_deletes table#11657
nastra merged 4 commits into
apache:mainfrom
nastra:dv-iterable-for-position-deletes

Conversation

@nastra
Copy link
Copy Markdown
Contributor

@nastra nastra commented Nov 26, 2024

this is part of #11122 and has been extracted from #11545

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java Outdated
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java Outdated
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 2 times, most recently from b347427 to cd35ea5 Compare November 27, 2024 13:29
@nastra nastra requested a review from aokolnychyi November 27, 2024 13:32
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from b79a7da to 2512b5f Compare November 27, 2024 15:18
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java Outdated

@Override
public CloseableIterator<InternalRow> iterator() {
PuffinReader reader = builder.build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[optional] might be too much, but can we have one reader per DV file ? considering specifically for this use case we will have to read all the blobs in the DV file eventually.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a particular use case in mind as this isn't something that is being needed currently when reading the PositionDeletesTable?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue we rarely need to read the entire DV file as not all DVs may be still valid.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and makes sense to not go this route, was mostly coming from the case that we need to read more than 1 blob in a puffin DV file in that case it might be better to reuse the reader.

@nastra nastra force-pushed the dv-iterable-for-position-deletes branch from 9a47998 to 3e1bafe Compare November 29, 2024 15:20
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java Outdated
import org.junit.jupiter.api.io.TempDir;

@ExtendWith(ParameterizedTestExtension.class)
public class TestPositionDeletesReader extends TestBase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also test reading DVs end-to-end by quering the position_deletes metadata table in Spark? I think we can commit DVs from Java and then read it from Spark, as we don't have DVs in Spark right now?

Copy link
Copy Markdown
Contributor Author

@nastra nastra Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are being tested end-to-end in #11545. I really just wanted to extract these pieces here to make reviewing easier. Alternatively, I could close this PR and we'll just have it as part of #11545 where all of this is fully tested end-to-end through Spark


@Override
public CloseableIterator<InternalRow> iterator() {
PuffinReader reader = builder.build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue we rarely need to read the entire DV file as not all DVs may be still valid.

Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterable.java Outdated
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 710febe to de6e0da Compare December 6, 2024 17:29
@nastra nastra closed this Dec 6, 2024
@nastra nastra reopened this Dec 6, 2024
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java Outdated
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java Outdated
@nastra nastra force-pushed the dv-iterable-for-position-deletes branch 3 times, most recently from 9774d18 to d633591 Compare December 9, 2024 13:32
Comment thread spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/DVIterator.java Outdated
}

this.row = new GenericInternalRow(rowValues.toArray());
} else if (null != deletedPositionIndex) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Do we actually need null != deletedPositionIndex? I think it is the first invocation and we need to initialize the row or we need to update the position. Shouldn't this fail if the index is still null to indicate something went wrong?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a case where deletedPositionIndex is null is still a valid case IMO. This would be true if a user doesn't project the pos column

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants