feat: hnswlib document index #1124

JohannesMessner · 2023-02-10T17:07:26Z

Goals:

First implementation of a Document Store.

Design doc: https://lightning-scent-57a.notion.site/Document-Stores-v2-design-doc-f11d6fe6ecee43f49ef88e0f1bf80b7f

Usage Example:

# create a Document Store
import numpy as np
from pydantic import Field

from docarray import BaseDocument, DocumentArray
from docarray.doc_index.backends.hnswlib_doc_index import HnswDocumentIndex
from docarray.typing import NdArray

class MyDocument(BaseDocument):
    tens: NdArray[10]

store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')

# index some data    
data = DocumentArray[MyDocument]([MyDocument(tens=np.random.randn(10)) for i in range(10)])
store.index(data)

# find by np array
result = store.find(np.random.random((10,)), search_field='tens', limit=5)
result = store.find_batched(np.random.random((2, 10)), search_field='tens', limit=5)

# find by Document
result = store.find(MyDocument(tens=np.random.randn(10)), search_field='tens', limit=5)
result = store.find_batched(DocumentArray[MyDocument]([MyDocument(tens=np.random.randn(10)) for i in range(10)]), search_field='tens', limit=5)

# delete data
del store[data.id]  # delete all

# advanced configs
class MyDocument(BaseDocument):
    tens_one: NdArray = Field(dim=10, space='cosine')
    tens_two: NdArray = Field(dim=10, space='l2')
store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')

# select search field
data = DocumentArray[MyDocument]([MyDocument(tens_one=np.random.randn(10), tens_two=np.random.randn(10)) for i in range(10)])
store.index(data)
result = store.find(np.random.random((10,)), search_field='tens_one', limit=5)
result = store.find(np.random.random((10,)), search_field='tens_two', limit=5)

# use query builder
q = store.build_query()
q.find(query=data[0], search_field='tens_one', limit=5)
q.find(query=data[1], search_field='tens_two', limit=5)
q.filter(filter_query={'tens_two': {'$exists': True}})
result = store.execute_query(q.build())

# nested search
class InnerDocument(BaseDocument):
    tens_one: NdArray = Field(dim=10)

class MyDocument(BaseDocument):
    d: InnerDocument
    tens: NdArray = Field(dim=10)

store = HnswDocumentIndex[MyDocument](work_dir='path/to/work/dir')
data = DocumentArray[MyDocument]([MyDocument(d=InnerDocument(tens_one=np.random.randn(10)), tens=np.random.randn(10)) for i in range(10)])
store.index(data)

result = store.find(np.random.random((10,)), search_field='tens', limit=5)
result = store.find(np.random.random((10,)), search_field='d__tens_one', limit=5)
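
The nested search above addresses a field of an inner document with a double-underscore path (`d__tens_one`). As a rough illustration of how such a path could be resolved, here is a minimal sketch using plain dataclasses. The helper `get_nested` and the classes `Inner` and `Outer` are hypothetical stand-ins, not part of docarray, and the actual implementation in this PR may differ.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Inner:
    tens_one: List[float]


@dataclass
class Outer:
    d: Inner
    tens: List[float]


def get_nested(doc: Any, field_path: str, sep: str = '__') -> Any:
    """Follow a separator-delimited path through nested attributes (hypothetical helper)."""
    value = doc
    for name in field_path.split(sep):
        value = getattr(value, name)
    return value


doc = Outer(d=Inner(tens_one=[1.0, 2.0]), tens=[3.0, 4.0])
print(get_nested(doc, 'tens'))         # top-level field
print(get_nested(doc, 'd__tens_one'))  # field of the nested document
```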

ToDo:

Note: unchecked boxes above have been moved to a separate PR.

github-actions · 2023-03-01T07:25:32Z

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

AnneYang720 · 2023-03-01T07:29:59Z

docarray/doc_index/backends/hnswlib_doc_index.py

+        default_column_config: Dict[Type, Dict[str, Any]] = field(
+            default_factory=lambda: {
+                np.ndarray: {
+                    'dim': 128,


do we need to keep this dim here when class _Column has n_dim?

good point, let me double check this

OK, to clarify: the n_dim in _Column is taken from the type parameter; for example, with NdArray[512], n_dim will be 512. The dim here is just a parameter that people can pass to Field(). So in a _Column, n_dim could be empty while dim isn't, or vice versa, or both could be empty.

We cannot combine these automatically, because what is called dim here could have other names in other backends.

I will clarify the guidance on this in the doc. Thanks for pointing it out!
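
To make the distinction concrete, here is a minimal sketch of a column that carries both values. `Column`, `make_column`, and `effective_dim` are hypothetical names, not the real `_Column` API, and the precedence rule (type parameter first, then the `dim` config) is only one choice a backend could make; the thread does not state which rule this PR uses.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Hypothetical backend defaults; 'dim': 128 mirrors the value in the diff above.
DEFAULT_NDARRAY_CONFIG: Dict[str, Any] = {'dim': 128}


@dataclass
class Column:
    """Simplified stand-in for _Column (not the real class)."""

    n_dim: Optional[int] = None  # from the type parameter, e.g. NdArray[512]
    config: Dict[str, Any] = field(default_factory=dict)  # kwargs given to Field()


def make_column(n_dim: Optional[int] = None, **field_kwargs: Any) -> Column:
    # User-supplied Field() kwargs override the backend defaults.
    return Column(n_dim=n_dim, config={**DEFAULT_NDARRAY_CONFIG, **field_kwargs})


def effective_dim(col: Column) -> int:
    # Assumed backend-specific rule: prefer the type parameter, else the config.
    if col.n_dim is not None:
        return col.n_dim
    return col.config['dim']


print(effective_dim(make_column(n_dim=512)))  # dimension from NdArray[512]
print(effective_dim(make_column(dim=10)))     # dimension from Field(dim=10)
print(effective_dim(make_column()))           # falls back to the backend default
```

Keeping the resolution inside the backend matches the point made above: another backend might name the option differently, so a generic merge in the base class would not work.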

github-actions · 2023-03-01T10:47:15Z

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

samsja

Can we mark all of the tests as "slow" and "docstore" related?

JohannesMessner · 2023-03-01T10:59:49Z

Can we mark all of the tests as "slow" and "docstore" related?

Yep, makes sense.

github-actions · 2023-03-01T11:11:39Z

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

github-actions · 2023-03-01T11:15:24Z

📝 Docs are deployed on https://ft-feat-doc-store--jina-docs.netlify.app 🎉

samsja · 2023-03-01T11:52:24Z

tests/doc_index/hnswlib/test_index_get_del.py

+from docarray.doc_index.backends.hnswlib_doc_index import HnswDocumentIndex
+from docarray.typing import NdArray
+
+pytestmark = [pytest.mark.slow, pytest.mark.doc_index]


Oh, I did not know you could do this. Interesting!
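
For readers unfamiliar with the pattern in the diff above: a module-level `pytestmark` list applies each marker to every test in the file. The sketch below is a minimal illustration; the test function is a hypothetical placeholder, and the marker names `slow` and `doc_index` are taken from the diff. Custom markers usually need to be registered in the project's pytest configuration to avoid unknown-marker warnings.

```python
import pytest

# Applies both markers to every test function and class in this module.
pytestmark = [pytest.mark.slow, pytest.mark.doc_index]


def test_placeholder():
    # Hypothetical test; it carries the 'slow' and 'doc_index' markers implicitly.
    assert 1 + 1 == 2
```

With this in place, a run such as `pytest -m "not slow"` deselects the whole module, and `pytest -m doc_index` selects it.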

JohannesMessner added 5 commits February 7, 2023 11:31

docs: add contributing md

2a15b1c

Signed-off-by: Johannes Messner <[email protected]>

Merge remote-tracking branch 'origin/feat-rewrite-v2' into feat-rewri…

dc6fdb1

…te-v2

feat: draft document store

a643c2e

Signed-off-by: Johannes Messner <[email protected]>

wip: initial implementation, incomplete

c21c94a

Signed-off-by: Johannes Messner <[email protected]>

feat: return documents on find

9e51f70

Signed-off-by: Johannes Messner <[email protected]>

JohannesMessner self-assigned this Feb 10, 2023

JohannesMessner linked an issue Feb 10, 2023 that may be closed by this pull request

DocumentStore first implementation #1123

Closed

JohannesMessner mentioned this pull request Feb 13, 2023

Meta: DocArray v2 Roadmap #780

Closed

47 tasks

JohannesMessner added 22 commits February 13, 2023 14:44

feat: delitem

5910578

Signed-off-by: Johannes Messner <[email protected]>

feat: get tensor dim from tensor parametrization

ddd379e

Signed-off-by: Johannes Messner <[email protected]>

docs: add sqlite integration

831d2ce

Signed-off-by: Johannes Messner <[email protected]>

feat: store hnsw indices

bfd2bb5

Signed-off-by: Johannes Messner <[email protected]>

reafctor: use universal id for sqlite and hnswlib

d59eae8

Signed-off-by: Johannes Messner <[email protected]>

feat: implement num docs method

df2e766

Signed-off-by: Johannes Messner <[email protected]>

feat: query builder

fd051cd

Signed-off-by: Johannes Messner <[email protected]>

chore: add lockfile

7a9a7aa

Signed-off-by: Johannes Messner <[email protected]>

test: add tests for base doc store

cc81088

Signed-off-by: Johannes Messner <[email protected]>

test: add test for base query builder

8cbabc2

Signed-off-by: Johannes Messner <[email protected]>

refactor: settings management

02b8670

Signed-off-by: Johannes Messner <[email protected]>

refactor: query builder with decorator

7aee239

Signed-off-by: Johannes Messner <[email protected]>

test: add one of those

b5cd6e2

Signed-off-by: Johannes Messner <[email protected]>

refactor: add structure and comments

4fd9bc6

Signed-off-by: Johannes Messner <[email protected]>

refactor: remove mem doc store

8cf8169

Signed-off-by: Johannes Messner <[email protected]>

refactor: remove useless code

a913054

Signed-off-by: Johannes Messner <[email protected]>

feat: expose all hnswlib options

f1b05a1

Signed-off-by: Johannes Messner <[email protected]>

test: adjust tests

f18026a

Signed-off-by: Johannes Messner <[email protected]>

docs: add comments

aab20f8

Signed-off-by: Johannes Messner <[email protected]>

fix: sort docs when retrieving

ebbeed6

Signed-off-by: Johannes Messner <[email protected]>

refactor: rename some vars

404ba2e

Signed-off-by: Johannes Messner <[email protected]>

test: delitem

f920c5f

Signed-off-by: Johannes Messner <[email protected]>

AnneYang720 reviewed Mar 1, 2023

docarray/doc_index/backends/hnswlib_doc_index.py

docs: update docs/tutorials/add_doc_index.md

89ac7d2

Co-authored-by: Anne Yang <[email protected]> Signed-off-by: Johannes Messner <[email protected]>

AnneYang720 reviewed Mar 1, 2023

samsja reviewed Mar 1, 2023

docs/tutorials/add_doc_index.md

AnneYang720 reviewed Mar 1, 2023

docarray/doc_index/backends/hnswlib_doc_index.py

JoanFM reviewed Mar 1, 2023

docs/tutorials/add_doc_index.md (outdated)

JohannesMessner added 8 commits March 1, 2023 10:50

docs: add doc store guidance to docs

9b9ef00

Signed-off-by: Johannes Messner <[email protected]>

docs: more precise wording

701d55e

Signed-off-by: Johannes Messner <[email protected]>

docs: more guidance

37f0e6f

Signed-off-by: Johannes Messner <[email protected]>

refactor: remove kwargs stuff

421b388

Signed-off-by: Johannes Messner <[email protected]>

docs: typo

f31a050

Signed-off-by: Johannes Messner <[email protected]>

refactor: add abstractions for get and del

6855e51

Signed-off-by: Johannes Messner <[email protected]>

refactor: renaming

eb12fde

Signed-off-by: Johannes Messner <[email protected]>

Merge remote-tracking branch 'origin/feat-doc-store' into feat-doc-store

8a02af1

samsja approved these changes Mar 1, 2023

samsja requested changes Mar 1, 2023

test: add markers for slow and doc index

c5bc8cf

Signed-off-by: Johannes Messner <[email protected]>

JohannesMessner requested a review from samsja March 1, 2023 11:11

samsja reviewed Mar 1, 2023

samsja approved these changes Mar 1, 2023

samsja merged commit 13cc669 into feat-rewrite-v2 Mar 1, 2023

samsja deleted the feat-doc-store branch March 1, 2023 11:56

This was referenced Mar 6, 2023

DocumentIndex: support for Elastic Search #1209

Closed

DocumentIndex: support for Weaviate #1210

Closed

DocumentIndex: support for Qdrant #1211

Closed
