Skip to content

Hive: turn off the stats gathering when iceberg.hive.keep.stats is false#10148

Merged
pvary merged 4 commits into
apache:mainfrom
stargrey102:hive-disable-stats
Apr 23, 2024
Merged

Hive: turn off the stats gathering when iceberg.hive.keep.stats is false#10148
pvary merged 4 commits into
apache:mainfrom
stargrey102:hive-disable-stats

Conversation

@stargrey102
Copy link
Copy Markdown
Contributor

Hive engine collects the stats by traversing the folder to count number of files and size when the table is created and hive.stats.autogather is turned on. The operation can be expensive for a large table while for iceberg table the stats is not needed to be stored on HMS. When the iceberg.hive.keep.stats is set to false, add table parameter "DO_NOT_UPDATE_STATS" so that Hive engine won't collect the stats.

Hive engine collects the stats by traversing the folder to count number of files and size when the table is created and hive.stats.autogather is turned on. The operation can be expensive for a large table. When the iceberg.hive.keep.stats is set to false, add table parameter "DO_NOT_UPDATE_STATS" so that Hive engine won't collect the stats.
@github-actions github-actions Bot added the hive label Apr 15, 2024
@pvary
Copy link
Copy Markdown
Contributor

pvary commented Apr 15, 2024

@deniskuzZ: what would be the effect of this change to the Hive integration?

@github-actions github-actions Bot added the MR label Apr 15, 2024
@stargrey102
Copy link
Copy Markdown
Contributor Author

stargrey102 commented Apr 15, 2024

Some hive tested failed due to a new table property is added "DO_NOT_UPDATE_STATS":"true"
: 27d3c8a
Would you mind helping me trigger the integration tests again please? @pvary Thank you.

@deniskuzZ
Copy link
Copy Markdown
Member

deniskuzZ commented Apr 16, 2024

AFAIK, autogater doesn't even work in Hive. After some operations like insert, we issue an extra stats update task that persists column stats changes either to the HMS or puffin file ("hive.iceberg.stats.source", "iceberg")
+ In the case of Iceberg, we always take basic stats from Iceberg snapshot summary

@github-actions github-actions Bot removed the MR label Apr 16, 2024
After this change is applied, the table property contains a new entry:
"DO_NOT_UPDATE_STATS":"true"
@github-actions github-actions Bot added the MR label Apr 16, 2024
@stargrey102
Copy link
Copy Markdown
Contributor Author

Hi @deniskuzZ: for more info: we identified the autogather executed (controlled by hive.stats.autogather) to collect stats from Hive when committing a new Iceberg table to Hive (Hive version 3.1.3: https://github.com/apache/hive/blob/rel/release-3.1.3/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L1868) and that is the motivation of this PR.
There are two configs:
hive.stats.autogather and hive.stats.column.autogather
and this pr only tries to disable the first one but not cols stats.

@deniskuzZ
Copy link
Copy Markdown
Member

deniskuzZ commented Apr 17, 2024

@stargrey102, have you checked the same in Hive-4.0?
see HIVE-27355, apache/hive#4348
Skipping stats calculation when creating an Iceberg table since Iceberg doesn't use HMS stats.

@stargrey102
Copy link
Copy Markdown
Contributor Author

@deniskuzZ thank you for the link. HiveOperationsBase uses HiveMetastore client when creating the Iceberg table: https://github.com/apache/iceberg/blob/main/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java#L75 while the Hive fix pointed to https://github.com/difin/hive/blob/f96b586c1f338d13b91049a54da09018b3b84723/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
Another thought would be if we are not able to upgrade to Hive 4, we would still need to patch it from Iceberg.

@pvary
Copy link
Copy Markdown
Contributor

pvary commented Apr 18, 2024

Thanks @deniskuzZ for the info! Good to know that it will not hurt the Hive integration, to have this PR in.

@stargrey102: Could we create a test which actually checks that the stats are not collected?

// The hive configuration HIVESTATSAUTOGATHER must be set to true from hive engine
shell =
HiveIcebergStorageHandlerTestUtils.shell(
ImmutableMap.of(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, "true"));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Hive shell sets the hive stats autogather to false as default:

hiveConf.setBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER, false);

and here to set to true to that we can test this change

@stargrey102
Copy link
Copy Markdown
Contributor Author

stargrey102 commented Apr 19, 2024

Hi @pvary, sure. I added a test with 2 cases: keep or not keep hive stats.
Since the hive engine sets hivestatsautogather to false by default:

hiveConf.setBoolVar(HiveConf.ConfVars.HIVESTATSAUTOGATHER, false);

This change needed it set to true and I created a new test class instead of adding the test to existing classes, to avoid to impact any existing tests.
Let me know if there is better way to do it. Thanks a lot!

Copy link
Copy Markdown
Contributor

@pvary pvary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small formatting comments, but looks good to me

@stargrey102
Copy link
Copy Markdown
Contributor Author

Hi @pvary thank you for the review, typo and format have been fixed. Do you think if it can be merged?

@pvary pvary merged commit 866021d into apache:main Apr 23, 2024
@pvary
Copy link
Copy Markdown
Contributor

pvary commented Apr 23, 2024

Thanks for @stargrey102 for the PR and @deniskuzZ for the help during the review!

sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants