-
Notifications
You must be signed in to change notification settings - Fork 234
feat(v2): load da from csv and save to csv #1144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Will nested schemas work somehow? |
Yes, for nested fields you can give the access path as column names, e.g. 'image.url' to set the url of a field of type Image |
7f03c5e to
ed134c5
Compare
b21d518 to
5126056
Compare
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
27303af to
5babdae
Compare
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
tests/toydata/docs_nested.csv
Outdated
| @@ -0,0 +1,4 @@ | |||
| count,text,image,image2.url | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| count,text,image,image2.url | |
| count,text,image,image2__url |
Not really sure here but I think we cannot/should not use . inside column name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, do you mean like by csv convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah kind of because, though it is more gut feeling that actual knowledge about CSV convention. But generally I think we should keep with a__b sintax rather than a.b one. Same for vector database I believe @JohannesMessner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will adjust this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
@JohannesMessner true, at other places we used the dot separation for the access paths (such as in traverse_flat(), e.g.da.traverse_flat(access_path='author.name'). @samsja mentioned that we should not use the . inside the column name. I am not sure how dangerous/against the rules this is
|
|
||
| def test_from_csv_without_schema_raise_exception(): | ||
| with pytest.raises(TypeError, match='no document schema defined'): | ||
| DocumentArray.from_csv(file_path=str(TOYDATA_DIR / 'docs_nested.csv')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense that this raise an Expection for now. But what about implementing at this level the auto schema detection ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for now the exception, I think the auto detection or handing over a schema some other way then setting .document_type I would handle in a separate PR, if that's good with you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes makes sense to do it in another PR
Signed-off-by: anna-charlotte <[email protected]>
JohannesMessner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, but I think this will be a truly cool feature! We should promote this properly once it is out, I bet the data validation folks that already use pydantic would love it
docarray/array/array/io.py
Outdated
| """ | ||
| Check if a given access path is a valid path for a given Document class. | ||
| """ | ||
| field, _, remaining = access_path.partition('__') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: we will use __ instead of . separator for access paths
docarray/array/array/io.py
Outdated
| def _access_path_to_dict(access_path: str, value) -> Dict[str, Any]: | ||
| """ | ||
| Convert an access path and its value to a (potentially) nested dict. | ||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}} | ||
| """ | ||
| fields = access_path.split('__') | ||
| for field in reversed(fields): | ||
| result = {field: value} | ||
| value = result | ||
| return result | ||
|
|
||
|
|
||
| def _dict_to_access_paths(d: dict) -> Dict[str, Any]: | ||
| """ | ||
| Convert a (nested) dict to a Dict[access_path, value]. | ||
| Access paths are defines as a path of field(s) separated by "__". | ||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url', 'img.png'} | ||
| """ | ||
| result = {} | ||
| for k, v in d.items(): | ||
| if isinstance(v, dict): | ||
| v = _dict_to_access_paths(v) | ||
| for nested_k, nested_v in v.items(): | ||
| new_key = '__'.join([k, nested_k]) | ||
| result[new_key] = nested_v | ||
| else: | ||
| result[k] = v | ||
| return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider making these methods global helpers, they will come in handy in other places of the code base as well, e.g. i already implemented something similar for document stores
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure, sounds good. What exactly do you mean by making these methods global helpers, put them somewhere in a helper file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes exactly, in a helper.py or utils.py at the top level or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok done
Signed-off-by: anna-charlotte <[email protected]>
Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Charlotte Gerhaher <[email protected]>
Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Charlotte Gerhaher <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
bc46938 to
00a9ea7
Compare
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
|
📝 Docs are deployed on https://ft-feat-from-to-csv--jina-docs.netlify.app 🎉 |
JohannesMessner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice feature!
* feat: load from and to csv Signed-off-by: anna-charlotte <[email protected]> * fix: from to csv Signed-off-by: anna-charlotte <[email protected]> * feat: add access path to dict Signed-off-by: anna-charlotte <[email protected]> * fix: from to csv Signed-off-by: anna-charlotte <[email protected]> * fix: clean up Signed-off-by: anna-charlotte <[email protected]> * docs: add docstring and update tmpdir in test Signed-off-by: anna-charlotte <[email protected]> * fix: merge nested dicts Signed-off-by: anna-charlotte <[email protected]> * fix: clean up Signed-off-by: anna-charlotte <[email protected]> * fix: clean up Signed-off-by: anna-charlotte <[email protected]> * test: update test Signed-off-by: anna-charlotte <[email protected]> * fix: apply samis suggestion from code review Signed-off-by: anna-charlotte <[email protected]> * fix: apply suggestions from code review wrt access paths Signed-off-by: anna-charlotte <[email protected]> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Charlotte Gerhaher <[email protected]> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Charlotte Gerhaher <[email protected]> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <[email protected]> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <[email protected]> * fix: typos Signed-off-by: anna-charlotte <[email protected]> * refactor: move helper functions to helper file Signed-off-by: anna-charlotte <[email protected]> * test: fix fixture Signed-off-by: anna-charlotte <[email protected]> --------- Signed-off-by: anna-charlotte <[email protected]> Signed-off-by: Charlotte Gerhaher <[email protected]> Co-authored-by: Johannes Messner <[email protected]> Signed-off-by: Arnav Zutshi <[email protected]>
Goals:
document_typeneeds to be set to determine schema.'image__url'None