Hive: turn off the stats gathering when iceberg.hive.keep.stats is false #10148
When a table is created and hive.stats.autogather is turned on, the Hive engine collects stats by traversing the table folder to count the number of files and their total size. This operation can be expensive for a large table. When iceberg.hive.keep.stats is set to false, this change adds the table parameter "DO_NOT_UPDATE_STATS" so that the Hive engine does not collect the stats.
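To make the description concrete, here is a minimal sketch of the idea in plain Java. It is not the code from this PR: the class `KeepStatsSketch`, the helper `withStatsFlag`, and the sample `table_type` entry are hypothetical, and only the parameter name "DO_NOT_UPDATE_STATS" and the keep/not-keep switch come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; the real change lives in the Iceberg Hive module.
final class KeepStatsSketch {
  // Parameter name as quoted in the PR description.
  static final String DO_NOT_UPDATE_STATS = "DO_NOT_UPDATE_STATS";

  // Returns a copy of the table parameters. When stats are not to be kept
  // (iceberg.hive.keep.stats=false), the copy carries DO_NOT_UPDATE_STATS=true.
  static Map<String, String> withStatsFlag(Map<String, String> params, boolean keepHiveStats) {
    Map<String, String> result = new HashMap<>(params);
    if (!keepHiveStats) {
      result.put(DO_NOT_UPDATE_STATS, "true");
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> base = new HashMap<>();
    base.put("table_type", "ICEBERG"); // sample entry, assumed for illustration

    // keepHiveStats=false: the flag is added to the parameters.
    System.out.println(withStatsFlag(base, false));
    // keepHiveStats=true: the parameters are left unchanged.
    System.out.println(withStatsFlag(base, true));
  }
}
```

The helper copies the map rather than mutating it so that the caller's parameters are unaffected when the flag is not needed.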
@deniskuzZ: what would be the effect of this change on the Hive integration?
AFAIK, autogather doesn't even work in Hive. After some operations, such as insert, we issue an extra stats update task that persists column stats changes either to the HMS or to a puffin file ("hive.iceberg.stats.source", "iceberg").
27d3c8a to d9080b7
After this change is applied, the table properties contain a new entry: "DO_NOT_UPDATE_STATS":"true"
Hi @deniskuzZ, for more info: we found that stats autogathering (controlled by hive.stats.autogather) is executed in Hive when a new Iceberg table is committed to Hive (Hive version 3.1.3: https://github.com/apache/hive/blob/rel/release-3.1.3/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L1868), and that is the motivation for this PR.
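The interaction between the two switches discussed in this thread can be modeled roughly as below. This is a simplified sketch, not the metastore's actual code: the class `AutogatherDecisionSketch` and the method `shouldGatherStats` are hypothetical, and the rule is only what the comments here state, namely that gathering is controlled by hive.stats.autogather and is skipped when the table carries "DO_NOT_UPDATE_STATS" set to "true".

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical model of the decision described in the thread, not Hive code.
final class AutogatherDecisionSketch {
  // Stats are gathered only if autogather is enabled and the table
  // does not carry DO_NOT_UPDATE_STATS=true in its parameters.
  static boolean shouldGatherStats(boolean autogatherEnabled, Map<String, String> tableParams) {
    if (!autogatherEnabled) {
      return false;
    }
    // Boolean.parseBoolean(null) is false, so a missing entry means "gather".
    return !Boolean.parseBoolean(tableParams.get("DO_NOT_UPDATE_STATS"));
  }

  public static void main(String[] args) {
    Map<String, String> flagged = Collections.singletonMap("DO_NOT_UPDATE_STATS", "true");
    Map<String, String> plain = Collections.emptyMap();

    System.out.println(shouldGatherStats(true, flagged));
    System.out.println(shouldGatherStats(true, plain));
    System.out.println(shouldGatherStats(false, plain));
  }
}
```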
@stargrey102, have you checked the same in Hive-4.0?
@deniskuzZ thank you for the link. HiveOperationsBase uses the HiveMetastore client when creating the Iceberg table: https://github.com/apache/iceberg/blob/main/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveOperationsBase.java#L75 while the Hive fix points to https://github.com/difin/hive/blob/f96b586c1f338d13b91049a54da09018b3b84723/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
Thanks @deniskuzZ for the info! Good to know that having this PR in will not hurt the Hive integration. @stargrey102: Could we create a test which actually checks that the stats are not collected?
    // The Hive configuration HIVESTATSAUTOGATHER must be set to true for the Hive engine
    shell =
        HiveIcebergStorageHandlerTestUtils.shell(
            ImmutableMap.of(HiveConf.ConfVars.HIVESTATSAUTOGATHER.varname, "true"));
The Hive shell sets the Hive stats autogather to false by default, so it is set to true here so that we can test this change.
Hi @pvary, sure. I added a test with 2 cases: keeping and not keeping Hive stats. This change needs autogather set to true, so I created a new test class instead of adding the test to existing classes, to avoid impacting any existing tests. Let me know if there is a better way to do it. Thanks a lot!
pvary
left a comment
Small formatting comments, but looks good to me
6cf4a3d to ebec5cc
Hi @pvary, thank you for the review. The typo and formatting have been fixed. Do you think it can be merged?
Thanks @stargrey102 for the PR and @deniskuzZ for the help during the review!
When a table is created and hive.stats.autogather is turned on, the Hive engine collects stats by traversing the table folder to count the number of files and their total size. This operation can be expensive for a large table, and for an Iceberg table the stats do not need to be stored in HMS. When iceberg.hive.keep.stats is set to false, add the table parameter "DO_NOT_UPDATE_STATS" so that the Hive engine does not collect the stats.