Open
Labels
- area/document-index: Concerning Document Index or a Document Index backend
- good-first-issue: Suitable as your first contribution to DocArray!
Description
Data storage in HnswDocumentIndex works in the following way:
- Vectors are stored on disk using hnswlib.
- All other types of data are saved in an SQLite database.
One of the operations we frequently perform is determining the total number of documents (num_docs()). However, the only way to get the number of documents from SQLite is by scanning the entire table. Even though we've made efforts to reduce how often we call this functionality (#1729), it's still a time-consuming process.
For better performance, let's do the following: instead of scanning the SQLITE table, we can use hnswlib's get_current_count function to quickly get the number of documents in the index.
But there's a potential issue with this approach. What if documents don't have associated vectors? get_current_count only counts the elements stored in the hnswlib index, so it would return 0.
We have two potential solutions:
- Notify/Warn users about this behavior and return 0.
- Fall back to the older method of counting using the SQL table if vector-less documents are detected.
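The second option could look roughly like the sketch below. It is only an illustration under several assumptions, not DocArray's actual implementation: the `docs` table layout, the `has_vectorless_docs` flag, and the `FakeVectorIndex` class are all hypothetical. `FakeVectorIndex` stands in for an hnswlib index and mimics only its `get_current_count()` method so that the example runs with the standard library alone.

```python
import sqlite3


class FakeVectorIndex:
    """Hypothetical stand-in for an hnswlib index; mimics only get_current_count()."""

    def __init__(self):
        self._count = 0

    def add_items(self, vectors):
        # A real index would store the vectors; here we only track how many were added.
        self._count += len(vectors)

    def get_current_count(self):
        return self._count


def sql_count(conn):
    # The older method: count the rows of the SQLite table.
    return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]


def num_docs(conn, index, has_vectorless_docs):
    # Fast path: every document has a vector, so the index count equals the document count.
    if not has_vectorless_docs:
        return index.get_current_count()
    # Fallback: some documents have no vector, so the index would undercount.
    return sql_count(conn)


# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, data BLOB)")
index = FakeVectorIndex()

# Three documents that each have a vector.
conn.executemany(
    "INSERT INTO docs (doc_id, data) VALUES (?, ?)",
    [(i, b"payload") for i in range(3)],
)
index.add_items([[0.0, 1.0]] * 3)
count_all_vectors = num_docs(conn, index, has_vectorless_docs=False)

# Two more documents without vectors: nothing is added to the vector index.
conn.executemany(
    "INSERT INTO docs (doc_id, data) VALUES (?, ?)",
    [(3, b"payload"), (4, b"payload")],
)
count_mixed = num_docs(conn, index, has_vectorless_docs=True)
```

How vector-less documents get detected is left open here. One possibility is to set the flag at indexing time whenever a document arrives without a vector, which would keep num_docs() itself free of any table scan in the common case.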
Status
In progress by community