@JohannesMessner JohannesMessner commented Feb 10, 2023

Goals:

First implementation of a Document Store.

Design doc: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f

Usage Example:

# create a Document Store
import numpy as np
from pydantic import Field

from docarray import BaseDocument, DocumentArray
from docarray.doc_index.backends.hnswlib_doc_index import HnswDocumentIndex
from docarray.typing import NdArray

class MyDocument(BaseDocument):
    tens: NdArray[10]

store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')

# index some data    
data = DocumentArray[MyDocument]([MyDocument(tens=np.random.randn(10)) for i in range(10)])
store.index(data)

# find by np array
result = store.find(np.random.random((10,)), search_field='tens', limit=5)
result = store.find_batched(np.random.random((2, 10)), search_field='tens', limit=5)

# find by Document
result = store.find(MyDocument(tens=np.random.randn(10)), search_field='tens', limit=5)
result = store.find_batched(DocumentArray[MyDocument]([MyDocument(tens=np.random.randn(10)) for i in range(10)]), search_field='tens', limit=5)

# delete data
del store[data.id]  # delete all

# advanced configs
class MyDocument(BaseDocument):
    tens_one: NdArray = Field(dim=10, space='cosine')
    tens_two: NdArray = Field(dim=10, space='l2')
store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')

# select search field
data = DocumentArray[MyDocument]([MyDocument(tens_one=np.random.randn(10), tens_two=np.random.randn(10)) for i in range(10)])
store.index(data)
result = store.find(np.random.random((10,)), search_field='tens_one', limit=5)
result = store.find(np.random.random((10,)), search_field='tens_two', limit=5)

# use query builder
q = store.build_query()
q.find(query=data[0], search_field='tens_one', limit=5)
q.find(query=data[1], search_field='tens_two', limit=5)
q.filter(filter_query={'tens_two': {'$exists': True}})
result = store.execute_query(q.build())

# nested search
class InnerDocument(BaseDocument):
    tens_one: NdArray = Field(dim=10)

class MyDocument(BaseDocument):
    d: InnerDocument
    tens: NdArray = Field(dim=10)

store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')
data = DocumentArray[MyDocument]([MyDocument(d=InnerDocument(tens_one=np.random.randn(10)), tens=np.random.randn(10)) for i in range(10)])
store.index(data)

result = store.find(np.random.random((10,)), search_field='tens', limit=5)
result = store.find(np.random.random((10,)), search_field='d__tens_one', limit=5)
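
The query builder calls in the usage example (`build_query()`, `find()`, `filter()`, `build()`) can be read as a record-then-execute pattern: each call is stored, and nothing runs until the built query is handed to `execute_query()`. The following is only a minimal sketch of that pattern, not the PR's implementation. The class name `QueryBuilderSketch` and its internal `_ops` list are hypothetical, and the sketch uses the standard library only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class QueryBuilderSketch:
    """Hypothetical builder: records each call instead of running it."""

    _ops: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)

    def find(self, query: Any, search_field: str, limit: int = 10) -> "QueryBuilderSketch":
        # Store the vector-search step with its arguments.
        self._ops.append(
            ("find", {"query": query, "search_field": search_field, "limit": limit})
        )
        return self

    def filter(self, filter_query: Dict[str, Any]) -> "QueryBuilderSketch":
        # Store the filter step with its arguments.
        self._ops.append(("filter", {"filter_query": filter_query}))
        return self

    def build(self) -> List[Tuple[str, Dict[str, Any]]]:
        # Return a copy so later builder calls cannot mutate a built query.
        return list(self._ops)


q = QueryBuilderSketch()
q.find(query=[0.1] * 10, search_field="tens_one", limit=5)
q.filter(filter_query={"tens_two": {"$exists": True}})
built = q.build()
print([name for name, _ in built])  # ['find', 'filter']
```

Under this assumption a backend would translate the recorded steps into its own query format at execution time, which keeps the builder itself backend-agnostic.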

ToDo:

  • basic search
  • automatic conversions in the base class
  • query builder interface
    • autocomplete for method kwargs
  • think about other interface methods such as num_docs() etc
  • think about abstraction layers
  • infer tensor dimensionality from type hint
  • handle config passing properly
  • persistence using SQLite
    • proper error messages when things go wrong
  • handle HNSWLib nuances around saving/loading and changing configurations
  • deleting documents
  • [ ] nested access syntax: change __ to .
  • test torch and tf
  • enable arbitrary data types in fields
  • subindices
  • tests

Note: unchecked boxes above have been moved to a separate PR.
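
The nested search in the usage example addresses an inner field through a `__`-separated path (`d__tens_one`), and the to-do list proposes switching the separator to `.`. As a rough illustration of what resolving such a path involves, here is a standard-library sketch. The helper `get_nested` and the `Inner`/`Outer` dataclasses are hypothetical stand-ins, not docarray code.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Inner:
    tens_one: List[float]


@dataclass
class Outer:
    d: Inner
    tens: List[float]


def get_nested(doc: Any, path: str, sep: str = "__") -> Any:
    """Walk attribute by attribute along a separator-delimited path."""
    value = doc
    for part in path.split(sep):
        value = getattr(value, part)
    return value


doc = Outer(d=Inner(tens_one=[1.0, 2.0]), tens=[3.0])
print(get_nested(doc, "d__tens_one"))          # [1.0, 2.0]
print(get_nested(doc, "d.tens_one", sep="."))  # [1.0, 2.0]
```

With `sep="__"`, a field whose own name contains a double underscore would be split into two path segments; a `.` separator cannot collide with a Python attribute name in that way.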

@JohannesMessner JohannesMessner self-assigned this Feb 10, 2023
@JohannesMessner JohannesMessner linked an issue Feb 10, 2023 that may be closed by this pull request
@JohannesMessner JohannesMessner mentioned this pull request Feb 13, 2023
47 tasks
Signed-off-by: Johannes Messner <[email protected]>
Co-authored-by: Anne Yang <[email protected]>
@github-actions

github-actions bot commented Mar 1, 2023

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

default_column_config: Dict[Type, Dict[str, Any]] = field(
    default_factory=lambda: {
        np.ndarray: {
            'dim': 128,
Contributor

do we need to keep this dim here when class _Column has n_dim?

Member Author

good point, let me double check this

Member Author

@JohannesMessner JohannesMessner Mar 1, 2023

OK, to clarify: the n_dim in the _Column is taken from the type parameter, e.g. for NdArray[512], n_dim will be 512. The dim here is just a parameter that people can pass to Field(). So in a _Column, n_dim could be empty while dim isn't, or vice versa, or both could be empty.

We cannot combine these automatically, because what is called dim here could have other names in other backends.

I will clarify the guidance on this in the doc, thanks for pointing this out!
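
To make the distinction concrete, here is a small standard-library sketch of a column that carries both values and of a backend-specific function that decides which one to use. It is not the PR's code: `ColumnSketch` and `hnsw_dim` are hypothetical names, and the precedence shown (type hint first, then `dim`) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ColumnSketch:
    """Hypothetical stand-in for the `_Column` discussed above."""

    # Taken from the type parameter, e.g. NdArray[512] gives 512; None if absent.
    n_dim: Optional[int] = None
    # Keyword arguments the user passed to Field(), e.g. {'dim': 10}.
    config: Dict[str, Any] = field(default_factory=dict)


def hnsw_dim(col: ColumnSketch) -> int:
    """Backend-specific resolution (assumed order: type hint, then `dim`)."""
    if col.n_dim is not None:
        return col.n_dim
    if "dim" in col.config:
        return col.config["dim"]
    raise ValueError("dimensionality given neither by type hint nor by Field(dim=...)")


print(hnsw_dim(ColumnSketch(n_dim=512)))            # 512
print(hnsw_dim(ColumnSketch(config={"dim": 10})))   # 10
```

Because the resolution lives in a per-backend function, another backend could look up a differently named key without the shared column type having to know about it.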

Member

@samsja samsja left a comment

Can we mark all of the tests as "slow" and "docstore" related?

@JohannesMessner
Member Author

> can we mark all of the tests as "slow" and "docstore" related?

yep makes sense

@JohannesMessner JohannesMessner requested a review from samsja March 1, 2023 11:11
@github-actions

github-actions bot commented Mar 1, 2023

📝 Docs are deployed on https://ft-feat-doc-store--jina-docs.netlify.app 🎉

from docarray.doc_index.backends.hnswlib_doc_index import HnswDocumentIndex
from docarray.typing import NdArray

pytestmark = [pytest.mark.slow, pytest.mark.doc_index]
Member

Oh, I did not know you could do this. Interesting!
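
For readers unfamiliar with the mechanism: pytest treats a module-level variable named `pytestmark` as a mark (or list of marks) applied to every test in that file. The snippet below is a minimal sketch of the idea; the test function is a placeholder and not part of this PR.

```python
import pytest

# Module-level `pytestmark` applies these marks to every test in the file.
pytestmark = [pytest.mark.slow, pytest.mark.doc_index]


def test_example():
    # Placeholder test; it inherits both marks from `pytestmark`.
    assert 1 + 1 == 2
```

Tests can then be selected by mark expression, for example `pytest -m "doc_index and not slow"`. Custom marks such as `slow` and `doc_index` are usually registered under the `markers` option in the pytest configuration so that they do not trigger unknown-mark warnings.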

@samsja samsja merged commit 13cc669 into feat-rewrite-v2 Mar 1, 2023
@samsja samsja deleted the feat-doc-store branch March 1, 2023 11:56
Development

Successfully merging this pull request may close these issues.

DocumentStore first implementation

6 participants