Skip to content

Conversation

@anna-charlotte
Copy link
Contributor

@anna-charlotte anna-charlotte commented Feb 17, 2023

Goals:

  • load DocumentArray from csv, with a given Document class to provide a schema
    • document_type needs to be set to determine schema.
    • column names have to match the Document types fields
    • for nested fields, use access paths as column name, such as 'image__url'
    • if value is None, leave csvfield empty, or enter None
  • save a DocumentArray to csv
    • following the same guidelines as above.
  • check and update documentation, if required. See guide

@JohannesMessner
Copy link
Member

JohannesMessner commented Feb 20, 2023

Will nested schemas work somehow?

@anna-charlotte
Copy link
Contributor Author

Will nested schemas work somehow?

Yes, for nested fields you can give the access path as column names, e.g. 'image.url' to set the url of a field of type Image

anna-charlotte added 7 commits February 20, 2023 15:06
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
anna-charlotte added 3 commits February 20, 2023 15:28
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
@anna-charlotte anna-charlotte marked this pull request as ready for review February 20, 2023 14:49
@@ -0,0 +1,4 @@
count,text,image,image2.url
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
count,text,image,image2.url
count,text,image,image2__url

Not really sure here but I think we cannot/should not use . inside column name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, do you mean like by csv convention?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah kind of because, though it is more gut feeling that actual knowledge about CSV convention. But generally I think we should keep with a__b sintax rather than a.b one. Same for vector database I believe @JohannesMessner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will adjust this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.

@JohannesMessner true, at other places we used the dot separation for the access paths (such as in traverse_flat(), e.g.da.traverse_flat(access_path='author.name'). @samsja mentioned that we should not use the . inside the column name. I am not sure how dangerous/against the rules this is


def test_from_csv_without_schema_raise_exception():
with pytest.raises(TypeError, match='no document schema defined'):
DocumentArray.from_csv(file_path=str(TOYDATA_DIR / 'docs_nested.csv'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense that this raise an Expection for now. But what about implementing at this level the auto schema detection ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for now the exception, I think the auto detection or handing over a schema some other way then setting .document_type I would handle in a separate PR, if that's good with you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes makes sense to do it in another PR

Copy link
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, but I think this will be a truly cool feature! We should promote this properly once it is out, I bet the data validation folks that already use pydantic would love it

"""
Check if a given access path is a valid path for a given Document class.
"""
field, _, remaining = access_path.partition('__')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: we will use __ instead of . separator for access paths

Comment on lines 622 to 682
def _access_path_to_dict(access_path: str, value) -> Dict[str, Any]:
"""
Convert an access path and its value to a (potentially) nested dict.
EXAMPLE USAGE
.. code-block:: python
assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}}
"""
fields = access_path.split('__')
for field in reversed(fields):
result = {field: value}
value = result
return result


def _dict_to_access_paths(d: dict) -> Dict[str, Any]:
"""
Convert a (nested) dict to a Dict[access_path, value].
Access paths are defines as a path of field(s) separated by "__".
EXAMPLE USAGE
.. code-block:: python
assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url', 'img.png'}
"""
result = {}
for k, v in d.items():
if isinstance(v, dict):
v = _dict_to_access_paths(v)
for nested_k, nested_v in v.items():
new_key = '__'.join([k, nested_k])
result[new_key] = nested_v
else:
result[k] = v
return result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider making these methods global helpers, they will come in handy in other places of the code base as well, e.g. i already implemented something similar for document stores

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sure, sounds good. What exactly do you mean by making these methods global helpers, put them somewhere in a helper file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly, in a helper.py or utils.py at the top level or something like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok done

Charlotte Gerhaher and others added 7 commits February 21, 2023 13:29
Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Charlotte Gerhaher <[email protected]>
Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Charlotte Gerhaher <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: anna-charlotte <[email protected]>
@github-actions
Copy link

📝 Docs are deployed on https://ft-feat-from-to-csv--jina-docs.netlify.app 🎉

Copy link
Member

@JohannesMessner JohannesMessner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice feature!

@anna-charlotte anna-charlotte merged commit 251eea7 into feat-rewrite-v2 Feb 22, 2023
@anna-charlotte anna-charlotte deleted the feat-from-to-csv branch February 22, 2023 13:15
@anna-charlotte anna-charlotte linked an issue Feb 23, 2023 that may be closed by this pull request
AsRaNi1 pushed a commit to AsRaNi1/docarray that referenced this pull request Feb 28, 2023
* feat: load from and to csv

Signed-off-by: anna-charlotte <[email protected]>

* fix: from to csv

Signed-off-by: anna-charlotte <[email protected]>

* feat: add access path to dict

Signed-off-by: anna-charlotte <[email protected]>

* fix: from to csv

Signed-off-by: anna-charlotte <[email protected]>

* fix: clean up

Signed-off-by: anna-charlotte <[email protected]>

* docs: add docstring and update tmpdir in test

Signed-off-by: anna-charlotte <[email protected]>

* fix: merge nested dicts

Signed-off-by: anna-charlotte <[email protected]>

* fix: clean up

Signed-off-by: anna-charlotte <[email protected]>

* fix: clean up

Signed-off-by: anna-charlotte <[email protected]>

* test: update test

Signed-off-by: anna-charlotte <[email protected]>

* fix: apply samis suggestion from code review

Signed-off-by: anna-charlotte <[email protected]>

* fix: apply suggestions from code review wrt access paths

Signed-off-by: anna-charlotte <[email protected]>

* fix: apply johannes suggestion

Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Charlotte Gerhaher <[email protected]>

* fix: apply johannes suggestion

Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Charlotte Gerhaher <[email protected]>

* fix: apply suggestions from code review

Signed-off-by: anna-charlotte <[email protected]>

* fix: apply suggestions from code review

Signed-off-by: anna-charlotte <[email protected]>

* fix: typos

Signed-off-by: anna-charlotte <[email protected]>

* refactor: move helper functions to helper file

Signed-off-by: anna-charlotte <[email protected]>

* test: fix fixture

Signed-off-by: anna-charlotte <[email protected]>

---------

Signed-off-by: anna-charlotte <[email protected]>
Signed-off-by: Charlotte Gerhaher <[email protected]>
Co-authored-by: Johannes Messner <[email protected]>
Signed-off-by: Arnav Zutshi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement from_to/csv,pandas v2: handle load from external data sources

4 participants