Support for role-based access in HadoopCatalog table listing (#1941, #1979)
Conversation
```java
try {
  return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
} catch (IOException e) {
  return false;
```
Same here. Returning false is not correct.
We use the Iceberg HadoopCatalog over Azure ABFS, and users use Spark to list the tables in the catalog. However, not every user has access to all tables on disk (we use file system access control lists). When a user cannot access a folder, the listStatus function throws a subclass of IOException.
In my ticket I suggested using a flag to configure the HadoopCatalog to either throw errors when unable to list these directories, or to return false if it is expected that some users will not be able to list all folders/tables.
listStatus throws a FileNotFoundException when the path provided does not exist. I will handle that as a "normal" condition; any other IOException I will either catch or not, based on a configuration flag. This should handle scenarios where not all users can access all tables/namespaces.
What do you think about that proposal?
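To make the proposed control flow concrete, here is a minimal JDK-only sketch. The listing call is abstracted behind a functional interface, and the flag and all identifiers are illustrative — this is not the actual PR code:

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class ListingSketch {
  @FunctionalInterface
  interface Lister {
    int entryCount() throws IOException;
  }

  static boolean hasEntries(Lister lister, boolean suppressIOException) throws IOException {
    try {
      return lister.entryCount() >= 1;
    } catch (FileNotFoundException e) {
      // A missing path is a "normal" condition: simply not a table directory.
      return false;
    } catch (IOException e) {
      if (suppressIOException) {
        // e.g. an ACL denial that some users are expected to hit
        return false;
      }
      throw e;
    }
  }
}
```

With the flag off, any non-FileNotFound IOException still propagates, so genuine failures remain visible.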
I think it makes sense to catch FileNotFoundException and return false. What is the exception class that you get when a user can't list a path? I would rather use that than suppress any IOException when a flag is set.
It is also unclear how that suppression is set.
I've added support for a flag. However, I have not yet retrieved the value from the configuration properties.
The exception thrown is this one (class hierarchy):
- IOException
  - AzureBlobFileSystemException
    - AbfsRestOperationException

Unfortunately, the exception thrown is an ABFS-specific class.
Should the boolean be configurable? How are you setting it?
Hey Ryan, sorry for the late reply. Yes, I'm thinking it could support a flag passed in as a configuration via org.apache.hadoop.conf.Configuration.
I'm just not sure how to pass values into that configuration object. For example, there is already a FILE_IO_IMPL property supported. How would I add an additional configuration like FILE_IO_IMPL?
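As a rough sketch of how a catalog-property flag could be read in the same style as FILE_IO_IMPL: properties arrive as a Map<String, String> and the catalog parses what it needs during initialization. The key name and class here are illustrative:

```java
import java.util.Map;

public class CatalogConfigSketch {
  static final String SUPPRESS_IO_EXCEPTION = "suppress-io-exception"; // hypothetical key

  private boolean suppressIOException;

  void initialize(Map<String, String> properties) {
    // Boolean.parseBoolean(null) is false, so an absent key means "do not suppress".
    this.suppressIOException = Boolean.parseBoolean(properties.get(SUPPRESS_IO_EXCEPTION));
  }

  boolean suppressIOException() {
    return suppressIOException;
  }
}
```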
Getting the config from catalog properties looks good, but I'd move the property constant as I noted above.
This is a really brittle design. It will only work for one specific store, and it assumes the store always returns the string "AuthorizationPermissionMismatch" on an error. If the ABFS store ever changed that string, everything would break.
- That ABFS exception also includes a status code, 403, which is what should really be used as the signal.
- Other stores return AccessDeniedException. Has anyone considered filing a Hadoop bug asking for translation of REST exceptions with 403s into those? I am reasonably confident nobody is going to reject PRs there.
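The store-agnostic check hinted at above could look like this: java.nio.file.AccessDeniedException is an IOException subclass that some stores already throw for 403-style errors, so matching on the type would avoid string comparisons entirely. A minimal sketch (class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;

public class AccessDeniedSketch {
  // True for the portable permission-error type, regardless of which store threw it.
  static boolean isPermissionError(IOException e) {
    return e instanceof AccessDeniedException;
  }
}
```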
Thanks for working on this, @cccs-jc. I like that it avoids extra calls to the NameNode.
```java
private final String warehouseLocation;
private final FileSystem fs;
private final FileIO fileIO;
private boolean suppressACLExcpetion = false;
```
Suggested change:
```diff
- private boolean suppressACLExcpetion = false;
+ private boolean suppressACLException = false;
```
```java
} catch (FileNotFoundException e) {
  return false;
} catch (IOException e) {
  if (suppressACLExcpetion) {
```
These exceptions could be thrown by errors that don't have anything to do with ACLs, so can we try to come up with a more generic name for this flag? Maybe suppressIOException? I guess it depends on where and how they will be configured.
Fair enough.
I'm going to add this configuration the way it is done for FILE_IO_IMPL, for example.
|
I'm still concerned about adding this without a better check. Adding a flag to turn all …
|
Sure, I could check if it's an Azure Blob Storage exception, but that would make it specific to Azure. This is the outline of the listTables method: in the end, if a location can't be read it will return NoSuchNamespaceException, and if for any reason you cannot list the dir/metadata folder it won't return the table. So the fact that IOExceptions are not rethrown will not break anything; worst case, users don't see the namespace/tables and hunt down why that is. Maybe logging the error would be helpful while still returning false. My goal here is to not interrupt the execution of listTables if certain users can't access certain tables, so having a log entry might be useful for admins/operators.
|
Logging the error is definitely a good idea. It sounds like the main problem is that there isn't a specific error to catch. I think it would be fine if any directory that the user doesn't have access to doesn't show up in the list results, but we do need to ensure that ignoring directories isn't applied to other situations where it isn't reasonable. So if we can inspect the exception from Azure and make this work, I think that's the way to go. We can support other errors later as we find them.
|
Sounds good. I will narrow it to the Azure package, and also narrow it to when it is listing a table or namespace only.
|
I've checked everywhere the following methods are used: isTableDir. In all the cases I can see, it makes sense not to throw the exception, since these are tests. Nonetheless, I will add checks to determine whether the exception is an Azure blob exception.
Conflicts:
- core/src/main/java/org/apache/iceberg/CatalogProperties.java
- core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java
```java
public static final String HIVE_CLIENT_POOL_SIZE = "clients";
public static final int HIVE_CLIENT_POOL_SIZE_DEFAULT = 2;

public static final String HADOOP_SUPPRESS_IOEXCEPTION = "suppress-io-exception";
```
Since this is specific to Hadoop, let's move it into HadoopCatalog. I don't think the other catalogs would need this because they do not list using FS operations. A private constant in HadoopCatalog should be fine and we can always expose it later if we need to.
```java
String fileIOImpl = properties.get(CatalogProperties.FILE_IO_IMPL);
this.suppressIOException = Boolean.parseBoolean(properties.get(CatalogProperties.HADOOP_SUPPRESS_IOEXCEPTION));
```
Can you move this to after fileIO is set? The line above and that line are related so we shouldn't separate them.
```java
private boolean shouldSuppressIOException(IOException ioException) {
  String name = ioException.getClass().getName();
  return suppressIOException && name.startsWith("org.apache.hadoop.fs.azurebfs");
```
I would really prefer if this were more specific and looked at the exception message as well to ensure the error is a permissions or ACL problem.
I have fixed the code review suggestions.
I'm detecting the Azure blob permission exception using the message (that's all I have available). In that message there is an AzureServiceErrorCode with the value "AuthorizationPermissionMismatch", which I use to detect a permission issue.
I agree this is much better error handling, since it will let other errors be reported.
I'm thinking the flag should thus be "suppress permission errors". What do you think?
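A hypothetical sketch of that narrowed check: only the "AuthorizationPermissionMismatch" error-code string and the azurebfs package prefix come from the discussion above; the class and method names are illustrative.

```java
import java.io.IOException;

public class SuppressCheckSketch {
  private final boolean suppressPermissionErrors;

  SuppressCheckSketch(boolean suppressPermissionErrors) {
    this.suppressPermissionErrors = suppressPermissionErrors;
  }

  // True only for ABFS exceptions whose message carries the permission error code.
  boolean shouldSuppress(IOException e) {
    return suppressPermissionErrors
        && e.getClass().getName().startsWith("org.apache.hadoop.fs.azurebfs")
        && hasPermissionErrorCode(e.getMessage());
  }

  static boolean hasPermissionErrorCode(String message) {
    return message != null && message.contains("AuthorizationPermissionMismatch");
  }
}
```

Because the package prefix is also required, a plain IOException is never suppressed even if its message happens to contain the error code.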
```java
  return false;
} catch (IOException e) {
  if (shouldSuppressIOException(e)) {
    LOG.warn("Unable to list metadata directory {}: {}", metadataPath, e);
```
```java
  return false;
} catch (IOException e) {
  if (shouldSuppressIOException(e)) {
    LOG.warn("Unable to list directory {}: {}", path, e);
```
```java
}
```

```java
public static final String ENGINE_HIVE_ENABLED = "iceberg.engine.hive.enabled";
```
Nit: this file doesn't need to be touched. Could you remove this whitespace change? It may cause git conflicts.
|
@cccs-jc, thanks for updating this! Other than the file that doesn't need to be modified and renaming the config property, I think this is ready to go. Thanks for getting this done!
|
Hey Ryan, made the change. Should be good to go now. Please have a final spot check.
|
Thanks, @cccs-jc! The tests are passing after a restart, so I'll merge this.
|
FWIW, HADOOP-15710 first reported this issue in 2018. Someone needs to fix this; will escalate.
Modified the implementation of HadoopCatalog to avoid using the deprecated isDirectory method.
Instead, used the getFileStatus() method to determine both that a path exists and that it is a directory in a single call.
Created two methods, isTableDir() and isDirectory(), which catch the I/O exceptions thrown when a user/client does not have access to certain folders.
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html#isDirectory-org.apache.hadoop.fs.Path-
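A JDK analogue of that change, to illustrate the single-call pattern: one metadata call answers both "does the path exist" and "is it a directory". The Hadoop version uses FileSystem.getFileStatus(path) the same way; java.nio is used here only so the sketch runs without Hadoop on the classpath, and the class name is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class IsDirectorySketch {
  static boolean isDirectory(Path path) throws IOException {
    try {
      BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
      return attrs.isDirectory(); // existence and type from a single call
    } catch (NoSuchFileException e) {
      return false; // a missing path is a normal "no" answer, not an error
    }
  }
}
```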