DocArray Release Notes

Represent, send, store and search multimodal data

v0.40.0

4 months ago

Release Note (0.40.0)

Release time: 2023-12-22 12:12:15

This release contains 1 new feature, 3 bug fixes and 2 documentation improvements.

πŸ†• Features

Add Epsilla connector (#1835)

We have integrated Epsilla into DocArray.

Here's a simple example of how to use it:

import numpy as np
from docarray import BaseDoc
from docarray.index import EpsillaDocumentIndex
from docarray.typing import NdArray
from pydantic import Field


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = EpsillaDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)

In this example, we create a document class with both textual and numeric data. Then, we initialize an Epsilla-backed document index and use it to index our documents. Finally, we perform a search query.

🐞 Bug Fixes

Fixed type hints error in Python 3.12 (#1840)

DocArray type-hinting is now available for Python 3.12.

Fix issue serializing and deserializing complex schemas (#1836)

There was an issue when serializing and deserializing protobuf documents with nested documents in dictionaries and other complex structures. This has now been fixed.
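
For illustration, a schema like the following (the class and field names are hypothetical) now survives a protobuf round trip:

from typing import Dict
from docarray import BaseDoc


class NestedDoc(BaseDoc):
    value: str


class OuterDoc(BaseDoc):
    # a dictionary whose values are themselves documents
    docs_by_key: Dict[str, NestedDoc]


doc = OuterDoc(docs_by_key={'a': NestedDoc(value='hello')})
doc_roundtrip = OuterDoc.from_protobuf(doc.to_protobuf())
assert doc_roundtrip.docs_by_key['a'].value == 'hello'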

Fix storage issue in TorchTensor class (#1833)

There was a bug when deep-copying a TorchTensor object whose dtype was not float32. This has now been fixed.

πŸ“— Documentation Improvements

🀟 Contributors

We would like to thank all contributors to this release:

  • Tony Yang (@tonyyanga)
  • Naymul Islam (@ai-naymul)
  • Ben Shaver (@bpshaver)
  • Joan Fontanals (@JoanFM)
  • 954 (@954-Ivory)

v0.39.1

6 months ago

Release Note (0.39.1)

Release time: 2023-10-23 08:56:38

This release contains 2 bug fixes.

🐞 Bug Fixes

from_dataframe() with numpy==1.26.1 (#1823)

A recent update to numpy changed some of the versioning semantics, breaking DocArray's from_dataframe() method in some cases where the dataframe contains a numpy array. This has now been fixed.

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    embedding: NdArray
    text: str

da = DocVec[MyDoc](
    [
        MyDoc(
            embedding=[1, 2, 3, 4],
            text='hello',
        ),
        MyDoc(
            embedding=[5, 6, 7, 8],
            text='world',
        ),
    ],
    tensor_type=NdArray,
)
df_da = da.to_dataframe()
# This broke before and is now fixed
da2 = DocVec[MyDoc].from_dataframe(df_da, tensor_type=NdArray)

Type handling in Python 3.9 (#1823)

Starting with Python 3.9, Optional.__args__ is not always available, leading to some compatibility problems. This has been fixed by using the typing.get_args helper.
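
For illustration, this is the kind of stdlib helper now used internally (a minimal sketch, not DocArray's actual code):

from typing import Optional, get_args

field_type = Optional[str]  # equivalent to Union[str, None]

# typing.get_args works consistently across Python versions,
# whereas accessing __args__ directly can fail for some typing constructs
assert get_args(field_type) == (str, type(None))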

🀟 Contributors

We would like to thank all contributors to this release:

  • Johannes Messner (@JohannesMessner )

v0.39.0

6 months ago

Release Note (0.39.0)

Release time: 2023-10-02 13:06:02

This release contains 4 new features, 8 bug fixes, and 7 documentation improvements.

πŸ†• Features

Support for Pydantic v2 πŸš€ (#1652)

The biggest feature of this release is full support for Pydantic v2! We are continuing to support Pydantic v1 at the same time.

If you use Pydantic v2, you will need to adapt your DocArray code to the new Pydantic API; check out Pydantic's migration guide.

Pydantic v2 has its core written in Rust and provides significant performance improvements to DocArray: JSON serialization is 240% faster and validation of BaseDoc and DocList with non-native types like TorchTensor is 20% faster.

Add BaseDocWithoutId (#1803)

A BaseDoc by default includes an id field. This can be problematic if you want to build an API that requires a model without this ID field. Therefore, we now provide BaseDocWithoutId which is, as its name suggests, a BaseDoc without the ID field.

Please use this class with caution: BaseDoc is still the base class to use unless you specifically need to remove the ID.

⚠️ BaseDocWithoutId is not compatible with DocIndex or any feature requiring a vector database. This is because DocIndex needs the id field to store and retrieve documents.
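
A minimal sketch of how this might look (the exact import path is an assumption, and ChatMessage is a hypothetical schema):

from docarray import BaseDocWithoutId  # assumed top-level import


class ChatMessage(BaseDocWithoutId):
    text: str


msg = ChatMessage(text='hello')
assert 'id' not in msg.__fields__  # no auto-generated id field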

πŸ’£ Breaking change

Remove Jina AI cloud push/pull (#1791)

Jina AI Cloud is being discontinued. Therefore, we are removing the push/pull feature related to Jina AI Cloud.

🐞 Bug Fixes

Fix DocList subscription error

DocList can be typed from BaseDoc using the following syntax DocList[MyDoc]().

In this release, we have fixed a bug that allowed users to specify the type of a DocList multiple times: DocList[MyDoc1][MyDoc2] no longer works (#1800).

We also fixed a bug that caused a silent failure when users passed the wrong type to DocList, for example DocList[doc()] (#1794).

Milvus connection parameter missing (#1802)

We fixed a small bug that incorrectly set the port of the Milvus client.

πŸ“— Documentation Improvements

🀟 Contributors

We would like to thank all contributors to this release:

  • lvzi (@lvzii )
  • Puneeth K (@punndcoder28 )
  • Joan Fontanals (@JoanFM )
  • samsja (@samsja )

v0.38.0

7 months ago

Release Note (0.38.0)

Release time: 2023-09-07 13:40:16

This release contains 3 bug fixes and 4 documentation improvements, including 1 breaking change.

πŸ’₯ Breaking Changes

Changes to the return type of DocList.to_json() and DocVec.to_json()

In order to make the to_json method consistent across different classes, we changed its return type in DocList and DocVec to str. This means that, if you use this method in your application, make sure to update your codebase to expect str instead of bytes.

🐞 Bug Fixes

Make DocList.to_json() and DocVec.to_json() return str instead of bytes (#1769)

This release changes the return type of the methods DocList.to_json() and DocVec.to_json() in order to be consistent with BaseDoc.to_json() and other Pydantic models. After this release, these methods return str instead of bytes. πŸ’₯ Since the return type has changed, this is considered a breaking change.
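
A minimal sketch of the new behavior:

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    text: str


docs = DocList[MyDoc]([MyDoc(text='hello')])

json_str = docs.to_json()
assert isinstance(json_str, str)  # previously this was bytes

# round trip still works as before
docs_roundtrip = DocList[MyDoc].from_json(json_str)
assert docs_roundtrip[0].text == 'hello'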

Casting in reduce before appending (#1758)

This release introduces type casting internally in the reduce helper function, casting its inputs before appending them to the final result. This will make it possible to reduce documents whose schemas are compatible but not exactly the same.

Skip doc attributes in __annotations__ but not in __fields__ (#1777)

This release fixes an issue in the create_pure_python_type_model helper function. Starting with this release, only attributes in the class __fields__ will be considered during type creation. The previous behavior broke applications when users introduced a ClassVar in an input class:

class MyDoc(BaseDoc):
    endpoint: ClassVar[str] = "my_endpoint"
    input_test: str = ""

Previously, creating the dynamic type from this class failed with:

    field_info = model.__fields__[field_name].field_info
KeyError: 'endpoint'

Kudos to @NarekA for raising the issue and contributing a fix in the Jina project, which was ported in DocArray.

πŸ“— Documentation Improvements

  • Explain how to set Document config (#1773)
  • Add workaround for torch compile (#1754)
  • Add note about pickling dynamically created Doc class (#1763)
  • Improve the docstring of filter_docs (#1762)

🀟 Contributors

We would like to thank all contributors to this release:

  • Sami Jaghouar (@samsja )
  • Johannes Messner (@JohannesMessner )
  • AlaeddineAbdessalem (@alaeddine-13 )
  • Joan Fontanals (@JoanFM)

v0.37.1

8 months ago

Release Note v0.37.1

This release contains 4 bug fixes and 1 documentation improvement.

🐞 Bug Fixes

Relax the schema check in update mixin (#1755)

The previous schema check in the UpdateMixin was strict and did not allow updates when the schemas of the two documents were similar but did not share the same reference. For instance, if the schemas were dynamically generated but had the same fields and field types, the check would still evaluate to False and it was not possible to update the documents. This release relaxes the check so that it verifies whether the fields of the schemas are similar instead.
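
A minimal sketch of the relaxed behavior (the two classes stand in for dynamically generated schemas with identical fields):

from docarray import BaseDoc


class MyDocA(BaseDoc):
    text: str
    title: str = ''


class MyDocB(BaseDoc):
    text: str
    title: str = ''


doc_a = MyDocA(text='hello', title='old title')
doc_b = MyDocB(text='hello world', title='new title')

# previously this failed because the two schemas were not the same object;
# now the fields of the schemas are compared instead
doc_a.update(doc_b)
assert doc_a.title == 'new title'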

Fix non-class type fields (#1752)

We fixed an issue where non-class type fields used in schemas with QdrantDocumentIndex resulted in a TypeError. The issue has been resolved by replacing the usage of issubclass with safe_issubclass in the QdrantDocumentIndex implementation.

Fix dynamic class creation with doubly nested schemas (#1747)

The following case used to result in a KeyError:

from docarray import BaseDoc
from docarray.utils.create_dynamic_doc_class import create_base_doc_from_schema

class Nested2(BaseDoc):
    value: str

class Nested1(BaseDoc):
    nested: Nested2

class RootDoc(BaseDoc):
    nested: Nested1

new_my_doc_cls = create_base_doc_from_schema(RootDoc.schema(), 'RootDoc')

We fixed this issue by changing create_base_doc_from_schema so that global definitions of nested schemas are propagated during recursive calls.

Fix readme test (#1746)

πŸ“— Documentation Improvements

  • Update readme (#1744)

🀟 Contributors

We would like to thank all contributors to this release:

  • AlaeddineAbdessalem (@alaeddine-13)
  • Joan Fontanals (@JoanFM)
  • TERBOUCHE Hacene (@TerboucheHacene)
  • samsja (@samsja)

v0.37.0

8 months ago

Release Note (0.37.0)

Release time: 2023-08-03 03:11:16

This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

πŸ†• Features

Milvus Integration (#1681)

Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

  • Find: Vector search for efficient retrieval of similar documents.
  • Filter: Use Milvus' filtering syntax to filter based on textual and numeric data.
  • Get/Del: Fetch or delete specific documents from the index.
  • Hybrid Search: Combine find and filter functionalities for more refined search.
  • Subindex: Search through nested data.

Support filtering in HnswDocumentIndex (#1718)

With our latest update, you can easily utilize filtering in HnswDocumentIndex either as an independent function or in conjunction with the query builder to combine it with vector search.

The code below shows how the new feature works:

import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class SimpleSchema(BaseDoc):
    year: int
    price: int
    embedding: NdArray[128]


# Create dummy documents.
docs = DocList[SimpleSchema](
    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
    for i in range(10)
)

doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)

# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)

# Filtering combined with vector search
hybrid_query = (
    doc_index.build_query()  # get empty query object
    .filter(filter_query={"year": {"$gt": 1994}})  # pre-filtering (year > 1994)
    .find(
        query=np.random.rand(128), search_field="embedding"
    )  # add vector similarity search
    .filter(filter_query={"price": {"$lte": 3}})  # post-filtering (price <= 3)
    .build()
)
results = doc_index.execute_query(hybrid_query)

First, we create and index some dummy documents. Then, we use the filter function in two ways. One is by itself to find documents from a specific year. The other is mixed with a vector search, where we first filter by year, perform a vector search, and then filter by price.

Pre-filtering in InMemoryExactNNIndex (#1713)

You can now add a pre-filter to your queries in InMemoryExactNNIndex. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:

query = (
   doc_index.build_query()
   .filter(filter_query={'price': {'$lte': 3}})  # Pre-filter: price <= 3
   .find(query=np.ones(10), search_field='tensor')  # Vector search
   .filter(filter_query={'text': {'$eq': 'hello 1'}})  # Post-filter: text == 'hello 1'
   .build()
)

In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!
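
For completeness, here is one way the doc_index used above could be set up (the ItemDoc schema and its field names are assumptions chosen to match the query):

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray


class ItemDoc(BaseDoc):
    text: str
    price: int
    tensor: NdArray[10]


doc_index = InMemoryExactNNIndex[ItemDoc](
    DocList[ItemDoc](
        ItemDoc(text=f'hello {i}', price=i, tensor=np.ones(10)) for i in range(5)
    )
)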

Support document updates in InMemoryExactNNIndex (#1724)

You can now easily update your documents in InMemoryExactNNIndex. Previously, when you tried to update the same set of documents, it would just add duplicate copies instead of making changes to the existing ones. But not anymore! From now on, if you want to update documents, you just have to re-index them.
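
A minimal sketch of updating by re-indexing:

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    tensor: NdArray[10]


docs = DocList[MyDoc]([MyDoc(text='old', tensor=np.ones(10)) for _ in range(3)])
index = InMemoryExactNNIndex[MyDoc]()
index.index(docs)

# modify the documents and index them again:
# they are matched by id and updated in place instead of being duplicated
for doc in docs:
    doc.text = 'new'
index.index(docs)

assert index.num_docs() == 3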

Choose tensor format with DocVec deserialization (#1679)

Now you can specify the format of your tensors during DocVec deserialization. You can do this with any deserialization method you're using - protobuf, JSON, pandas, bytes, binary, or base64. This means you'll always get your tensors in the format you want, whether that's a Torch tensor, TensorFlow tensor, NdArray, and so on.

Add description and example to id field of BaseDoc (#1737)

We added a description and example to the id field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI based applications with it.

πŸš€ Performance

Improve HnswDocumentIndex performance (#1727, #1729)

We've implemented two key optimizations to enhance the performance of HnswDocumentIndex. Firstly, we've avoided serialization of embeddings to SQLite, which is a costly operation and unnecessary as the embeddings can be reconstructed from the hnswlib index itself. Additionally, we've minimized the frequency of computing num_docs(), which previously involved a time-consuming full table scan to determine the number of documents in SQLite. As a result, we've seen an approximate speed increase of 10%, enhancing both the indexing and searching processes.

🐞 Bug Fixes

Fix TorchTensor type comparison (#1739)

We have addressed an exception raised when trying to compare TorchTensor with the type keyword in the docarray.typing module. Previously, this would lead to a TypeError, but the error has now been resolved, ensuring proper type comparison.

Add more info from dynamic class (#1733)

When using the method create_base_doc_from_schema to dynamically create a BaseDoc class, some information was lost, so we made sure that the new class keeps FieldInfo information from the original class such as description and examples.

Fix call to unsafe issubclass (#1731)

We fixed a bug in a call to issubclass by switching to a safer implementation that handles problematic types.

Align collection and index name in QdrantDocumentIndex (#1723)

We've corrected an issue where the collection name was not being updated to match a newly-initialized subindex name in QdrantDocumentIndex. This ensures consistent naming between collections and their respective subindexes.

Fix deepcopy TorchTensor (#1720)

We fixed a bug so that documents with TorchTensors can now be deep-copied.

πŸ“— Documentation Improvements

  • Make Document Indices self-contained (#1678)

🀟 Contributors

We would like to thank all contributors to this release:

  • Joan Fontanals (@JoanFM )
  • Johannes Messner (@JohannesMessner )
  • Saba Sturua (@jupyterjazz )

v0.36.0

9 months ago

Release Note (0.36.0)

Release time: 2023-07-18 14:43:28

This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.

πŸ†• Features

JAX Integration (#1646)

You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents. JaxArray ensures that JAX can natively process any array-like data in your DocArray documents. Here's how to use it:

from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp


class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224] # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x'] # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions

As you can see, the JaxArray typing is extremely flexible and can support a wide range of tensor shapes.

Creating a Document with Tensors

Creating a document with tensors is straightforward. Here is an example:

doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)

Redis Integration (#1550)

Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:

import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]

docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)
results = db.find(query, search_field='embedding', limit=10)

In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.

Supported Functionalities

  • Find: Vector search for efficient retrieval of similar documents.
  • Filter: Use Redis syntax to filter based on textual and numeric data.
  • Text Search: Leverage text search methods, such as BM25, to find relevant documents.
  • Get/Del: Fetch or delete specific documents from the index.
  • Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
  • Subindex: Search through nested data.

πŸš€ Performance

Speedup HnswDocumentIndex by caching num docs (#1706)

We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time, but significantly accelerates search times.

from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')

index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start

query = docs[0]

find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start

In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.

βš™ Refactoring

Put Contains method in the base class (#1701)

We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.

More robust method to detect duplicate index (#1651)

We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.

🐞 Bug Fixes

WeaviateDocumentIndex handles lowercase index names (#1711)

We've addressed an issue in the WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors. This was due to the system automatically capitalizing the index name when creating an index.

QdrantDocumentIndex unable to see index_name (#1705)

We've resolved an issue where the QdrantDocumentIndex was not properly recognizing the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.

Fix search in InMemoryExactNNIndex with AnyEmbedding (#1696)

From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.
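
A minimal sketch, assuming a schema that uses AnyEmbedding:

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import AnyEmbedding


class MyDoc(BaseDoc):
    embedding: AnyEmbedding


index = InMemoryExactNNIndex[MyDoc](
    DocList[MyDoc]([MyDoc(embedding=np.random.rand(10)) for _ in range(5)])
)
matches, scores = index.find(np.random.rand(10), search_field='embedding', limit=3)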

Use safe_issubclass everywhere (#1691)

We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us to avoid unexpected errors.

Avoid converting DocLists in the base index (#1685)

We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.

πŸ“— Documentation Improvements

  • Add docs for dict() method (#1643)

🀟 Contributors

We would like to thank all contributors to this release:

  • Puneeth K (@punndcoder28)
  • Joan Fontanals (@JoanFM)
  • Saba Sturua (@jupyterjazz)
  • Aman Agarwal (@agaraman0)
  • samsja (@samsja)
  • Shukri (@hsm207)

v0.35.0

9 months ago

Release Note (0.35.0)

This release contains 3 new features, 2 bug fixes and 1 documentation improvement.

πŸ†• Features

More serialization options for DocVec (#1562)

DocVec now has the same serialization interface as DocList. This means that the following methods are available for it:

  • to_protobuf()/from_protobuf()
  • to_base64()/from_base64()
  • save_binary()/load_binary()
  • to_bytes()/from_bytes()
  • to_dataframe()/from_dataframe()

For example, you can now perform Base64 (de)serialization like this:

from docarray import BaseDoc, DocVec

class SimpleDoc(BaseDoc):
    text: str

dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])
base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')

dl_from_base64 = DocVec[SimpleDoc].from_base64(
    base64_repr_dv, compress=None, protocol='pickle'
)

For further guidance, check out the documentation section on serialization.

Validate file formats in URL (#1606) (#1669)

Validate the file formats given in URL types such as AudioUrl, TextUrl, and ImageUrl to check that they correspond to the expected MIME type.
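
A minimal sketch of the new validation (the URLs are placeholders):

from docarray import BaseDoc
from docarray.typing import ImageUrl


class MyImageDoc(BaseDoc):
    url: ImageUrl


# a URL with an image extension passes validation
doc = MyImageDoc(url='https://example.com/picture.png')

# a URL pointing to a non-image file is now rejected
try:
    MyImageDoc(url='https://example.com/notes.txt')
except ValueError as e:  # pydantic's ValidationError is a ValueError subclass
    print(e)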

Add methods to create BaseDoc from schema (#1667)

Sometimes it can be useful to dynamically create a BaseDoc from a given schema of an original BaseDoc. Using the methods create_pure_python_type_model and create_base_doc_from_schema, you can reconstruct the BaseDoc.

from docarray.utils.create_dynamic_doc_class import (
    create_base_doc_from_schema,
    create_pure_python_type_model,
)

from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from docarray.documents import TextDoc

class MyDoc(BaseDoc):
    tensor: Optional[AnyTensor]
    texts: DocList[TextDoc]

# Due to a limitation of DocList as a Pydantic List, the `DocList` fields of MyDoc
# need to be converted to `List` first.
MyDocPurePython = create_pure_python_type_model(MyDoc)
NewMyDoc = create_base_doc_from_schema(MyDocPurePython.schema(), 'MyDoc', {})

new_doc = NewMyDoc(tensor=None, texts=[TextDoc(text='text')])

🐞 Bug Fixes

Cap Pydantic version (#1682)

Due to the breaking change in Pydantic v2, we have capped the version to avoid problems when installing docarray.

Better error message when DocVec is unusable (#1675)

After calling doc_list = doc_vec.to_doc_list(), doc_vec ends up in an unusable state since its data has been transferred to doc_list. This fix gives users a more informative error message when they try to interact with doc_vec after it has been made unusable.
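
A minimal sketch of the behavior:

from docarray import BaseDoc, DocVec


class MyDoc(BaseDoc):
    text: str


doc_vec = DocVec[MyDoc]([MyDoc(text='hello')])
doc_list = doc_vec.to_doc_list()

# doc_vec's data has been moved into doc_list;
# touching doc_vec afterwards now raises an informative error instead of failing obscurely
try:
    doc_vec.text
except Exception as e:
    print(e)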

πŸ“— Documentation Improvements

  • Fix a reference in README (#1674)

🀟 Contributors

We would like to thank all contributors to this release:

  • Saba Sturua (@jupyterjazz )
  • Joan Fontanals (@JoanFM )
  • Han Xiao (@hanxiao )
  • Johannes Messner (@JohannesMessner )

v0.21.1

10 months ago

Release Note (0.21.1)

Release time: 2023-06-21 08:15:43

This release contains 1 bug fix.

🐞 Bug Fixes

Allow passing extra headers to WeaviateDocumentArray (#1673)

These extra headers allow you to pass authentication keys to connect to a secured Weaviate instance.

🀟 Contributors

We would like to thank all contributors to this release:

  • Girish Chandrashekar (@girishc13)

v0.34.0

10 months ago

Release Note (0.34.0)

Release time: 2023-06-21 08:15:43

This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.

πŸ’£ Breaking Changes

Terminate Python 3.7 support

⚠️ ⚠️ DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.

We decided to drop it for two reasons:

  • Several dependencies of DocArray require Python 3.8.
  • Python long-term support for 3.7 is ending this week. This means there will no longer be security updates for Python 3.7, making this a good time for us to change our requirements.

Changes to DocVec Protobuf definition (#1639)

In order to fix a bug in the DocVec protobuf serialization described in #1561, we have changed the DocVec .proto definition.

This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray v0.34.0 or later, and vice versa.

⚠️ ⚠️ We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or later.

πŸ†• Features

Allow users to check if a Document is already indexed in a DocIndex (#1633)

You can now check if a Document has already been indexed by using the in keyword:

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128)) for _ in range(2000)]
)

index = InMemoryExactNNIndex[MyDoc](docs)
assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index

Support subindexes in InMemoryExactNNIndex (#1617)

You can now use the find_subindex method with the InMemoryExactNNIndex.

import numpy as np
from pydantic import Field

from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import ImageUrl, VideoUrl, AnyTensor

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)


class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)


class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)

doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)

Flexible tensor types for protobuf deserialization (#1645)

You can deserialize any DocVec protobuf message to any tensor type, by passing the tensor_type parameter to from_protobuf.

This means that you can choose at deserialization time if you are working with numpy, PyTorch, or TensorFlow tensors.

from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor


class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # doesn't matter what tensor_type is here

proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)

assert isinstance(da_after.tensor, TensorFlowTensor)

βš™ Refactoring

Add DBConfig to InMemoryExactNNIndex

InMemoryExactNNIndex used to take a single constructor parameter, index_file_path, unlike the rest of the indexers, which accepted their own DBConfig. Now index_file_path is part of the DBConfig, which allows the index to be initialized from it. This also lets us extend the config if more parameters are needed.

The parameters of DBConfig can be passed at construction time as **kwargs, making this change compatible with old usage.

These two initializations are equivalent.

from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')

index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')

🐞 Bug Fixes

Allow protobuf deserialization of BaseDoc with Union type (#1655)

Serialization of BaseDoc types that have Union-typed fields of Python native types is now supported.

from typing import Union

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    union_field: Union[int, str]


docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2

When these Union types involve other BaseDoc types, an exception is thrown.

from typing import Union

from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc, TextDoc


class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an Exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())

Cast limit to integer when passed to HnswDocumentIndex (#1657, #1656)

If you call find or find_batched on an HnswDocumentIndex, the limit parameter is now automatically cast to an integer.

Moved default_column_config from RuntimeConfig to DBConfig (#1648)

default_column_config contains specific configuration information about the columns and tables inside the backend's database. This was previously put inside RuntimeConfig which caused an error because this information is required at initialization time. This information has been moved inside DBConfig so you can edit it there.

from docarray.index import HnswDocumentIndex
import numpy as np

db_config = HnswDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HnswDocumentIndex[MyDoc](db_config=db_config)

Fix issue with Protobuf (de)serialization for DocVec (#1639)

This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.
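
A minimal sketch of the round trip that now preserves the data:

import numpy as np
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: NdArray[10]


dv = DocVec[MyDoc]([MyDoc(tensor=np.random.rand(10)) for _ in range(3)])
dv_roundtrip = DocVec[MyDoc].from_protobuf(dv.to_protobuf())

# the tensor column is a real tensor again, not a raw protobuf object
assert (dv_roundtrip.tensor == dv.tensor).all()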

Fix order of returned matches when find and filter combination used in InMemoryExactNNIndex (#1642)

Hybrid search (find+filter) for InMemoryExactNNIndex was prioritizing low similarities (lower scores) for returned matches. This has been fixed by adding an option to sort matches in reverse order based on their scores.

# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results was sorted from worst to best matches
# Now: It's sorted in the correct order, showing better matches first

Working with external Qdrant collections (#1632)

Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray raised a KeyError. This has been fixed, and now you can use QdrantDocumentIndex to connect to externally initialized collections.

Other bug fixes

  • Update text search to match the Weaviate client's new signature (#1654)
  • Fix DocVec equality (#1641, #1663)
  • Fix exception when summary() is called for LegacyDocument (#1637)
  • Fix DocList and DocVec coercion (#1568)
  • Fix update() on BaseDoc with tensor fields (#1628)

πŸ“— Documentation Improvements

  • Enhance DocVec section (#1658)
  • Qdrant in memory usage (#1634)

🀟 Contributors

We would like to thank all contributors to this release:

  • Johannes Messner (@JohannesMessner)
  • Nikolas Pitsillos (@npitsillos)
  • Shukri (@hsm207)
  • Kacper Łukawski (@kacperlukawski)
  • Aman Agarwal (@agaraman0)
  • maxwelljin (@maxwelljin)
  • samsja (@samsja)
  • Saba Sturua (@jupyterjazz)
  • Joan Fontanals (@JoanFM)