refactor: count number of documents using hnswlib #1759

@jupyterjazz

Description

Data storage in HnswDocumentIndex works in the following way:

  1. Vectors are stored on disk using hnswlib.
  2. All other types of data are saved in an SQLite database.

One of the operations we perform frequently is determining the total number of documents (num_docs()). However, the only way to get the number of documents from SQLite is to scan the entire table. Even though we have reduced how often this is called (#1729), it is still a time-consuming operation.

For better performance, instead of scanning the SQLite table, we can use hnswlib's get_current_count function to quickly get the number of documents in the index.

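But there's a potential issue with this approach. What if documents don't have associated vectors? In that case nothing is added to the hnswlib index, so get_current_count would return 0 even though the SQLite table contains documents.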

We have two potential solutions:

  1. Notify/Warn users about this behavior and return 0.
  2. Fall back to the older method of counting via the SQL table if vector-less documents are detected.
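
A rough sketch of the second option, to make the idea concrete. Everything here is an assumption for illustration: the `docs` table and its columns, the `num_docs` helper signature, and `StubVectorIndex`, which only stands in for an `hnswlib.Index` so the snippet runs without hnswlib installed. The only thing taken from hnswlib is the `get_current_count()` method name. The sketch also assumes that "vector-less documents" can be detected by the index reporting zero elements, which may not match how HnswDocumentIndex would actually detect it.

```python
import sqlite3
from typing import Protocol


class VectorIndex(Protocol):
    """Anything that exposes get_current_count(), as hnswlib.Index does."""

    def get_current_count(self) -> int: ...


class StubVectorIndex:
    """Hypothetical stand-in for hnswlib.Index, used only in this sketch."""

    def __init__(self, count: int) -> None:
        self._count = count

    def get_current_count(self) -> int:
        return self._count


def num_docs(vector_index: VectorIndex, conn: sqlite3.Connection) -> int:
    """Count documents, preferring the vector index over a table scan."""
    count = vector_index.get_current_count()
    if count > 0:
        # Fast path: the index already knows how many elements it holds.
        return count
    # Zero vectors: either there are no documents at all, or the documents
    # carry no vectors. Fall back to counting rows in the (assumed) table.
    (sql_count,) = conn.execute("SELECT COUNT(*) FROM docs").fetchone()
    return sql_count


# Usage: an in-memory table with three documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, data BLOB)")
conn.executemany("INSERT INTO docs (data) VALUES (?)", [(b"a",), (b"b",), (b"c",)])

with_vectors = num_docs(StubVectorIndex(3), conn)     # served by the index
without_vectors = num_docs(StubVectorIndex(0), conn)  # served by SQL fallback
```

With this shape the slow path only runs when the index is empty, so the common case (documents with vectors) never touches the SQL table.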

Metadata

    Labels

    area/document-index: Concerning Document Index or a Document Index backend
    good-first-issue: Suitable as your first contribution to DocArray!

    Status

    In progress by community