DocArray Versions

Represent, send, store and search multimodal data

v0.33.0

11 months ago

Release Note (0.33.0)

Release time: 2023-06-06 14:05:56

This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.

🆕 Features

Allow coercion between different Tensor types (#1552) (#1588)

Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor and the other way around.

from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np


class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor


doc = MyTensorsDoc(tensor=np.zeros(512))
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭─────────────────────┬────────────────────────────────────────────────────────╮
│ Attribute           │ Value                                                  │
├─────────────────────┼────────────────────────────────────────────────────────┤
│ tensor: TorchTensor │ TorchTensor of shape (512,), dtype: torch.float64      │
╰─────────────────────┴────────────────────────────────────────────────────────╯

🚀 Performance

Avoid stack embedding for every search (#1586)

We have made a performance improvement for the find interface for InMemoryExactNNIndex that gives a ~2x speedup.

The script used to measure this is as follows:

from time import perf_counter

from torch import rand

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor


class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor


def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )


num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

# Time five batched searches and report the average duration of one run.
start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(f"Number of queries: {num_queries} \n"
      f"Number of indexed documents: {num_docs} \n"
      f"Average time per run: {(perf_counter() - start)/5} seconds")
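
As the title of this change suggests, the idea is to stack the embeddings once and reuse the result across searches rather than rebuilding it for every query. The following is a minimal sketch of that caching pattern in plain Python. `CachedExactIndex`, `stack_calls` and the dot-product scoring are hypothetical choices made for illustration only; this is not DocArray's actual implementation.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Tuple[Tuple[float, ...], ...]


class CachedExactIndex:
    """Toy exact-search index (hypothetical) that stacks embeddings lazily."""

    def __init__(self) -> None:
        self._embeddings: List[Sequence[float]] = []
        self._stacked: Optional[Matrix] = None  # cached "stacked" matrix
        self.stack_calls = 0  # counts how often the matrix was rebuilt

    def index(self, embeddings: Sequence[Sequence[float]]) -> None:
        self._embeddings.extend(embeddings)
        self._stacked = None  # new data invalidates the cache

    def _matrix(self) -> Matrix:
        # Rebuild the stacked matrix only when the cache is empty.
        if self._stacked is None:
            self._stacked = tuple(tuple(e) for e in self._embeddings)
            self.stack_calls += 1
        return self._stacked

    def find(self, query: Sequence[float], limit: int = 3) -> List[int]:
        # Score every row by dot product and return the best row positions.
        scores = [sum(q * x for q, x in zip(query, row)) for row in self._matrix()]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return order[:limit]


toy_index = CachedExactIndex()
toy_index.index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for q in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    toy_index.find(q)
print(toy_index.stack_calls)  # 1: the matrix is built once, not once per query
```

The cache is dropped whenever new data is indexed, so the next search pays the stacking cost once and later searches reuse it.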

🐞 Bug Fixes

Respect limit parameter in filter for index backends (#1618)

InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.
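
To make the expected behavior concrete, here is a small sketch of a filter that stops after `limit` matches. It uses plain Python lists instead of a Document Index, and `filter_with_limit` is a hypothetical helper, not part of the DocArray API.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def filter_with_limit(
    docs: Iterable[T], predicate: Callable[[T], bool], limit: int
) -> List[T]:
    """Return at most `limit` items that satisfy `predicate`, in input order."""
    return list(islice((d for d in docs if predicate(d)), limit))


prices = [5, 12, 7, 30, 18, 25]
print(filter_with_limit(prices, lambda p: p > 10, limit=2))  # [12, 30]
```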

HnswDocumentIndex can search with limit greater than number of documents (#1611)

HnswDocumentIndex now lets you call find with a limit larger than the number of indexed documents.
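
One common way to support such a request is to clamp the number of requested neighbors to the number of stored items before searching. The sketch below shows that idea with the standard library only; `top_k` is a hypothetical helper, and the release note does not say whether DocArray resolves the case exactly this way.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k(scores: Sequence[float], limit: int) -> List[Tuple[int, float]]:
    """Return up to `limit` (position, score) pairs, best score first."""
    k = min(limit, len(scores))  # never request more neighbors than exist
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


print(top_k([0.2, 0.9, 0.5], limit=10))  # [(1, 0.9), (2, 0.5), (0, 0.2)]
```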

Allow updating HnswDocumentIndex (#1604)

HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents.

Dynamically resize internal index to adapt to increasing number of documents (#1602)

HnswDocumentIndex now allows indexing more than max_elements, dynamically adapting the index as it grows.
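
The general pattern is to check the capacity before each insertion and enlarge it when the new items would not fit. The sketch below illustrates this with a toy container; `GrowableStore` and the doubling policy are assumptions made for illustration, and the growth factor DocArray actually uses is not stated in this release note.

```python
class GrowableStore:
    """Toy fixed-capacity store (hypothetical) that grows when it fills up."""

    def __init__(self, max_elements: int = 4) -> None:
        self.max_elements = max(1, max_elements)
        self.items: list = []

    def add(self, new_items: list) -> None:
        needed = len(self.items) + len(new_items)
        while needed > self.max_elements:
            self.max_elements *= 2  # assumed growth policy: double the capacity
        self.items.extend(new_items)


store = GrowableStore(max_elements=4)
store.add(list(range(10)))
print(store.max_elements)  # 16
```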

Fix simple usage of HnswDocumentIndex (#1596)

from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')

Previously, this basic usage threw an exception:

TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc

Now, it works as expected.

Fix InMemoryExactNNIndex index initialization with nested DocList (#1582)

Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw this error:

from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]

index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'

Now it can be successfully instantiated.

Fix summary of document with list (#1595)

Calling summary on a document with a List attribute previously showed the wrong type:

from docarray import BaseDoc, DocList
from typing import List
class TestDoc(BaseDoc):
    str_list: List[str]

dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()

Previous output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭─── Document Schema ───╮
│                       │
│   TestDoc             │
│   └── str_list: str   │
│                       │
╰───────────────────────╯

New output:

╭─────── DocList Summary ───────╮
│                               │
│   Type     DocList[TestDoc]   │
│   Length   2                  │
│                               │
╰───────────────────────────────╯
╭────── Document Schema ──────╮
│                             │
│   TestDoc                   │
│   └── str_list: List[str]   │
│                             │
╰─────────────────────────────╯

Solve issues caused by issubclass (#1594)

DocArray relies heavily on Python's built-in issubclass function, which caused multiple issues. We now use a safe version that accounts for edge cases and types.

Make example payload a string rather than bytes (#1587)

The example payload of a given document schema with Tensor attribute was previously of bytes type. This has now been changed to str.

from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')

📗 Documentation Improvements

  • Add forward declaration steps to example to avoid pickling error (#1615)
  • Fix n_dim to dim (#1610)
  • Add "in memory" to documentation as list of supported vector indexes (#1607)
  • Add a tensor section (#1576)

🤟 Contributors

We would like to thank all contributors to this release:

  • Mohammad Kalim Akram (@makram93)
  • samsja (@samsja)
  • Saba Sturua (@jupyterjazz)
  • Joan Fontanals (@JoanFM)
  • maxwelljin (@maxwelljin)

v0.32.1

11 months ago

Release Note (0.32.1)

Release time: 2023-05-26 14:50:34

This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.

⚙ Refactoring

Improve ElasticDocIndex logging (#1551)

More debugging logs have been added inside ElasticDocIndex.

🐞 Bug Fixes

Allow InMemoryExactNNIndex with Optional embedding tensors (#1575)

You can now index Documents where the tensor search_field is Optional. The index will not consider these None embeddings when running a search.

import torch
from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex


class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]


# Every other document has no embedding (None).
docs = DocList[EmbeddingDoc](
    [EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None)) for i in range(5)]
)
index = InMemoryExactNNIndex[EmbeddingDoc](docs)
index.find(torch.rand(768), search_field="embedding", limit=3)
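
Conceptually, the search ranks only the documents that carry an embedding. The following sketch shows that behavior with plain Python lists; `find_skipping_none` and the dot-product score are hypothetical and only illustrate the idea, not the index's internals.

```python
from typing import List, Optional, Sequence, Tuple


def find_skipping_none(
    embeddings: Sequence[Optional[Sequence[float]]],
    query: Sequence[float],
    limit: int,
) -> List[Tuple[int, float]]:
    """Rank only the entries that have an embedding; None entries are ignored."""
    scored = [
        (pos, sum(q * x for q, x in zip(query, emb)))
        for pos, emb in enumerate(embeddings)
        if emb is not None
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


stored = [None, [1.0, 0.0], None, [0.5, 0.5], [0.0, 1.0]]
print(find_skipping_none(stored, query=[1.0, 0.0], limit=3))
# [(1, 1.0), (3, 0.5), (4, 0.0)]
```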

Safe is_subclass check (#1569)

In DocArray, especially when dealing with indexers, field types are checked through calls to Python's built-in issubclass function. This call fails under some circumstances, for instance when the type being checked is a List or Tuple. Starting with this release, we use a safe version that does not fail in these cases.

This enables the following usage, which would otherwise fail:

from typing import List

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex

class MyDoc(BaseDoc):
    test: List[str]

index = HnswDocumentIndex[MyDoc]()
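
The underlying problem is that the built-in issubclass raises a TypeError when its first argument is a typing construct such as List[str] rather than a class. A minimal sketch of a tolerant wrapper is shown below; `tolerant_issubclass` is a hypothetical name, and DocArray's own helper may handle more cases or be written differently.

```python
from typing import Any, List, Tuple


def tolerant_issubclass(candidate: Any, base: type) -> bool:
    """Like issubclass(), but returns False instead of raising TypeError."""
    try:
        return isinstance(candidate, type) and issubclass(candidate, base)
    except TypeError:
        # Raised for non-class arguments such as List[str] or Tuple[int, ...]
        return False


print(tolerant_issubclass(bool, int))        # True
print(tolerant_issubclass(List[str], list))  # False, where issubclass() would raise
```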

Fix AnyDoc deserialization (#1571)

AnyDoc is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained Dictionaries or Lists, deserialization failed. This is now fixed and you can have this behavior:

from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict

class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]

doc = ConcreteDoc(text='text', tags={'type': 1})

any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}

dict method for Document view (#1559)

Prior to this fix, doc.dict() would return an empty Dictionary if doc.is_view() == True:

from docarray import BaseDoc, DocVec


class MyDoc(BaseDoc):
    foo: int

vec = DocVec[MyDoc]([MyDoc(foo=3)])
# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}

# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}

📗 Documentation Improvements

  • Update doc building guide (#1566)
  • Explain the state of DocList in FastAPI (#1546)

🤟 Contributors

We would like to thank all contributors to this release:

  • aman-exp-infy (@agaraman0)
  • Johannes Messner (@JohannesMessner)
  • Joan Fontanals (@JoanFM)
  • Saba Sturua (@jupyterjazz)
  • Ge Jin (@maxwelljin)

v0.32.0

11 months ago

Release Note (v0.32.0)

This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.

🆕 Features

Subindex for document index (#1428)

The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:

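import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray


# create nested document schema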
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str


class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]


# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]

# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # creates two indexes, named 'idx' and 'idx__docs'

# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)

Openapi and FastAPI tensor shapes (#1510)

We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.

This means that you can now build web APIs using FastAPI where the SwaggerUI properly communicates tensor shapes to your users:

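from fastapi import FastAPI

from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor


class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]


app = FastAPI()


@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    # Echo the received tensor back under the field the schema defines.
    return Doc(embedding_torch=doc.embedding_torch)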

Save and load inmemory index (#1534)

We added a persist method to the InMemoryExactNNIndex class to save the index to disk.

# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')

🐞 Bug Fixes

search_field should be optional in hybrid text search (#1516)

The search_field argument of text_search() is now Optional and has a sensible default.

Check if file path exists for in-memory index (#1537)

We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.

Add empty judgement to index search (#1533)

We have ensured that empty indices do not fail when find is called.

Detach torch tensors (#1526)

Serializing tensors with gradients no longer fails.

Docvec display (#1522)

DocVec display issues have been resolved.

📗 Documentation Improvements

  • Remove erroneous info (#1531)
  • Fix link to documentation in readme (#1525)
  • Flatten structure (#1520)
  • Fix links (#1518)

🤟 Contributors

We would like to thank all contributors to this release:

  • Mohammad Kalim Akram (@makram93)
  • Johannes Messner (@JohannesMessner)
  • Anne Yang (@AnneYang720)
  • Zhaofeng Miao (@mapleeit)
  • Joan Fontanals (@JoanFM)
  • Kacper Łukawski (@kacperlukawski)
  • IyadhKhalfallah (@IyadhKhalfallah)
  • Saba Sturua (@jupyterjazz)

v0.31.1

1 year ago

Release Note (0.31.1)

This patch release fixes a small bug that was introduced in the latest minor release (0.31.0).

🐞 Bug Fixes

  • Calling json or dict on an Optional nested DocList no longer throws an error if the value is set to None (#1512)

🤟 Contributors

We would like to thank all contributors to this release:

  • samsja (@samsja)

v0.31.0

1 year ago

Release Note (v0.31.0)

This release contains 4 new features, 11 bug fixes, and several documentation improvements.

💥 Breaking changes

Return type of DocVec Optional Tensor (#1472)

Optional tensor fields in a DocVec now return None instead of a list of NaN if the column does not hold any tensor.

This code snippet shows the breaking change:

from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]

docs = DocVec[MyDoc]([MyDoc() for j in range(2)])

print(docs.tensor)

| Version | Return type |
|---------|-------------|
| 0.30.0  | [nan nan]   |
| 0.31.0  | None        |

Default index collection names

Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.

In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.

Starting with DocArray v0.31.0, the default index_name/collection_name will be derived from the document schema name:

from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc

class MyDoc(BaseDoc):
    pass

# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()

If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.

You can fix this by manually specifying the index name to match the old default:

# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')

The below table summarizes the change for all DB backends:

| Document Index        | DBConfig argument | Default in v0.30.0     | Default in v0.31.0 |
|-----------------------|-------------------|------------------------|--------------------|
| WeaviateDocumentIndex | index_name        | 'Document'             | Schema class name  |
| QdrantDocumentIndex   | collection_name   | 'documents'            | Schema class name  |
| ElasticDocIndex       | index_name        | 'index__' + a random id | Schema class name  |
| ElasticV7DocIndex     | index_name        | 'index__' + a random id | Schema class name  |
| HnswDocumentIndex     | n/a               | n/a                    | n/a                |

🆕 Features

Add InMemoryExactNNIndex (#1441)

In this version we have introduced the InMemoryExactNNIndex Document Index which allows you to perform in-memory exact vector search (as opposed to approximate nearest neighbor search in vector databases).

The InMemoryExactNNIndex can be used for prototyping and is suitable for dealing with small-scale documents (1k-10k), as opposed to a vector database that is suitable for larger scales but comes with a performance overhead at smaller scales.

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray

import numpy as np

class MyDoc(BaseDoc):
    tensor: NdArray[512]

docs = DocList[MyDoc](MyDoc(tensor=i*np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

print(doc_index.find(3*np.ones(512), search_field='tensor', top_k=3))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

DocList inherits from Python list (#1457)

DocList is now a subclass of Python's list. This means that you can now use all the methods that are available to Python lists on DocList objects. For example, you can now use len on DocList objects and tools like Pydantic or FastAPI will be able to work with it more easily.

Add len to DocIndex (#1454)

You can now perform len(vector_index) which is equivalent to vector_index.num_docs().
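
The equivalence comes from the usual Python pattern of implementing `__len__` by delegating to an existing counting method. A minimal sketch follows; `ToyIndex` is a hypothetical stand-in and not a DocArray class.

```python
class ToyIndex:
    """Hypothetical stand-in for a document index."""

    def __init__(self) -> None:
        self._docs: dict = {}

    def index(self, doc_id: str, doc: object) -> None:
        self._docs[doc_id] = doc

    def num_docs(self) -> int:
        return len(self._docs)

    def __len__(self) -> int:
        # len(index) simply forwards to num_docs()
        return self.num_docs()


toy = ToyIndex()
toy.index("a", {"text": "hello"})
toy.index("b", {"text": "world"})
print(len(toy) == toy.num_docs())  # True
```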

Other minor features

  • Add a to_json alias to BaseDoc (#1494)

🐞 Bug Fixes

Point to older versions when importing Document or Documentarray (#1422)

Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.

Fix AnyDoc.from_protobuf (#1437)

AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.

Other bug fixes

  • Fix extend to DocList (#1493)
  • Fix bug when calling dict() on BaseDoc (#1481)
  • Fix bug when calling json() on BaseDoc (#1481)
  • Support Pandas 2.0 by using pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
  • Add logs to Elasticsearch index (#1427)
  • Fix a bug in Document Index where Torch tensors that required grad were not able to be converted to ndarray (#1429)
  • Fix a bug with HNSW (#1426)
  • Hubble Binary format version bump (#1414)
  • Save index during creation for hnswlib (#1424)

📗 Documentation Improvements

  • Fix FastAPI docs (#1453)
  • Index predefined Documents (#1434)
  • Clean up data types section (#1412)
  • Remove duplicate API reference section (#1408)
  • Docindex URLs (#1433)
  • Fix Install commands hint (#1421)
  • Add Google Analytics (#1432)
  • Add install instructions for hnswlib and elastic document indexes (#1431)
  • Various fixes (#1436, #1417, #1423, #1418, #1411, #1419)

🤟 Contributors

We would like to thank all contributors to this release:

  • Alex Cureton-Griffiths (@alexcg1)
  • samsja (@samsja)
  • Johannes Messner (@JohannesMessner)
  • Anne Yang (@AnneYang720)
  • Scott Martens (@scott-martens)
  • カレン (@RStar2022)
  • Aman Agarwal (@agaraman0)
  • Yanlong Wang (@nomagick)
  • Charlotte Gerhaher (@anna-charlotte)

v0.30.0

1 year ago

💫 Release v0.30.0 (a.k.a. DocArray v2)

Warning: This version of DocArray is a complete rewrite and therefore includes many changes, a number of them breaking. Be sure to check the documentation to prepare your migration.

Changelog

If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.

DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.

This gives the following advantages:

  • Flexibility: No need to conform to a fixed set of fields -- your data defines the schema.
  • Multimodality: Easily store multiple modalities and multiple embeddings in the same Document.
  • Language agnostic: At their core, Documents are just dictionaries. This makes it easy to create and send them from any language, not just Python.

You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:

  • Hybrid search: You can now combine vector search with text search, and even filter by arbitrary fields.
  • Production-ready: The new Document Indexes are a much thinner wrapper around the various vector DB libraries, making them more robust and easier to maintain.
  • Increased flexibility: We strive to support any configuration or setting that you could perform through the DB's first-party client.

For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.

Changes to Document

  • Document has been renamed to BaseDoc.
  • BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
  • Following from the previous point, extending BaseDoc allows for a flexible schema compared to the Document class in v1 which only allowed for a fixed schema, with one of tensor, text and blob, and additional chunks and matches.
  • Due to the added flexibility, it is not possible to know in advance which fields your document class will provide. Therefore, various methods from v1 (such as .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods at the typing level.
  • In v2 we have the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1 Document. Indeed, none of the methods associated with Document are present. Only the schema of the data is similar.

Changes to DocumentArray

DocList

  • The DocumentArray class from v1 has been renamed to DocList, to be more descriptive of its actual functionality, since it is a list of BaseDocs.

DocVec

  • Additionally, we introduced the class DocVec, which is a column-based representation of BaseDocs. Both DocVec and DocList extend AnyDocArray.
  • DocVec is a container of Documents suited to computations that require batches of data (e.g. matrix multiplication, distance calculation, deep learning forward pass).
  • A DocVec has a similar interface as DocList but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec (the .doc_type which is a BaseDoc) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec (Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor or a Union of tensor types, the .tensor_type will be used to determine the type of the doc_vec column.
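
The difference between the two layouts can be illustrated with plain Python data structures. In the sketch below, a list of per-document dicts plays the role of the row-based layout and a dict of per-field lists plays the role of the column-based one; `to_columns` is a hypothetical helper, and the real DocVec stores tensor columns as a single tensor rather than a Python list.

```python
from typing import Any, Dict, List

# Row-based layout: one record per document.
rows: List[Dict[str, Any]] = [
    {"title": "a", "embedding": [1.0, 0.0]},
    {"title": "b", "embedding": [0.0, 1.0]},
    {"title": "c", "embedding": [1.0, 1.0]},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-document dicts into one list per field."""
    if not records:
        return {}
    return {field: [rec[field] for rec in records] for field in records[0]}


# Column-based layout: one sequence per field, covering all documents.
columns = to_columns(rows)
print(columns["title"])      # ['a', 'b', 'c']
print(columns["embedding"])  # [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

With the column layout, a batch operation such as a matrix multiplication can consume the whole embedding column at once instead of visiting each document.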

Parameterized DocList

  • Because the document schema is now flexible, a DocList does not have to be homogeneous when it is initialized.
  • If you want a homogeneous DocList, you can parameterize it at initialization time:
from docarray import DocList
from docarray.documents import ImageDoc

docs = DocList[ImageDoc]()
  • Methods like .from_csv() or .pull() only work with parameterized DocLists.

Access attributes of your DocumentArray

  • In v1 you could access an attribute of all Documents in your DocumentArray by calling the plural of the attribute's name on your DocumentArray instance.
  • In v2 you don't have to use the plural; just use the document's attribute name, since AnyDocArray exposes the same attributes as the BaseDocs it contains. This returns a list of type(attribute). However, it works if and only if all the BaseDocs in the AnyDocArray have the same schema. Therefore only this works:
from docarray import BaseDoc, DocList


class Book(BaseDoc):
    title: str
    author: str = None


docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title  # returns a list[str]

# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title

Changes to Document Store

In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. As noted above, Document Indexes currently support Weaviate, Qdrant, ElasticSearch, and HNSWLib.

Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:

db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')

In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.

Thank you to all of the contributors to this release:

  • @samsja
  • @JohannesMessner
  • @anna-charlotte
  • @AnneYang720
  • @hsm207
  • @kacperlukawski
  • @JoanFM
  • @alexcg1
  • @Jackmin801
  • @nan-wang
  • @jupyterjazz
  • @azayz
  • @agaraman0
  • @hrik2001
  • @srini047

v0.21.0

1 year ago

Release Note (0.21.0)

Release time: 2023-01-17 09:10:50

This release contains 3 new features, 7 bug fixes and 5 documentation improvements.

🆕 Features

OpenSearch Document Store (#853)

This version of DocArray adds a new Document Store: OpenSearch!

You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:

from docarray import Document, DocumentArray
import numpy as np

# Connect to OpenSearch instance
n_dim = 3

da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)

Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.

Learn more about its usage in the official documentation.

Add color to point cloud display (#961)

You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():

coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')

doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')])
)
doc.display()

Add language attribute to Redis Document Store (#953)

The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language parameter in the Redis configuration:

da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)

🐞 Bug Fixes

Replace newline with whitespace to fix display in plot embeddings (#963)

Whenever the string "\n" was contained in any Document field, doc.plot() would result in a rendering error. This fix resolves those errors by rendering "\n" as whitespace.

Fix unwanted coercion in to_pydantic_model (#949)

This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.

Calculate relevant docs on index instead of queries (#950)

In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
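
The reason this matters is that recall-style metrics divide by the number of relevant items that could have been retrieved, and those items live in the index, not in the query set. The sketch below shows the counting with the standard library; `relevant_per_label` and `recall` are hypothetical helpers and do not reproduce the embed_and_evaluate() code.

```python
from collections import Counter
from typing import Dict, List


def relevant_per_label(index_labels: List[str]) -> Dict[str, int]:
    """Count how many indexed documents carry each label."""
    return dict(Counter(index_labels))


def recall(query_label: str, retrieved_labels: List[str], index_labels: List[str]) -> float:
    """Fraction of the relevant indexed documents that were retrieved."""
    relevant_total = relevant_per_label(index_labels).get(query_label, 0)
    if relevant_total == 0:
        return 0.0
    hits = sum(1 for label in retrieved_labels if label == query_label)
    return hits / relevant_total


index_labels = ["cat", "cat", "cat", "dog"]
# Two of the three indexed "cat" documents were retrieved.
print(recall("cat", ["cat", "dog", "cat"], index_labels))
```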

Remove offset index create on list like false (#936)

When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.

Add support for remote audio files (#933)

Loading audio files from a remote URL would cause FileNotFoundError, which is now fixed.

Query operator $exists does not work correctly with tags (#911) (#923)

Before this fix, $exists would treat false-y values such as 0 or [] as nonexistent. This is now fixed.
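
The distinction is between testing a value's truthiness and testing whether a key is present. The sketch below shows both on a plain dictionary; `exists_by_truthiness` and `exists_by_key` are hypothetical helpers that mirror the described behavior, not the query-operator code itself.

```python
tags = {"price": 0, "sizes": [], "brand": "acme"}


def exists_by_truthiness(tags: dict, key: str) -> bool:
    # Mirrors the old behavior: false-y values look like missing keys.
    return bool(tags.get(key))


def exists_by_key(tags: dict, key: str) -> bool:
    # Mirrors the fixed behavior: only key presence matters.
    return key in tags


print(exists_by_truthiness(tags, "price"))  # False
print(exists_by_key(tags, "price"))         # True
print(exists_by_key(tags, "color"))         # False
```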

Document from dataclass with singleton list (#1018)

When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is considered, and accessing such a field will yield a DocumentArray, even for singleton inputs.

📗 Documentation Improvements

  • Link to Discord (#1010)
  • Have less versions to avoid deployment timeout (#977)
  • Fix data management section not appearing in Documentation (#967)
  • Link to OpenSearch docs in sidebar (#960)
  • Multimodal to datatypes (#934)

🤟 Contributors

We would like to thank all contributors to this release:

  • Jay Bhambhani (@jay-bhambhani)
  • Alvin Prayuda (@alphinside)
  • Johannes Messner (@JohannesMessner)
  • samsja (@samsja)
  • Marco Luca Sbodio (@marcosbodio)
  • Anne Yang (@AnneYang720)
  • Michael Günther (@guenthermi)
  • AlaeddineAbdessalem (@alaeddine-13)
  • Han Xiao (@hanxiao)
  • Alex Cureton-Griffiths (@alexcg1)
  • Charlotte Gerhaher (@anna-charlotte)

v0.20.1

1 year ago

Release Note (0.20.1)

Release time: 2022-12-12 09:32:37

🐞 Bug Fixes

Make Milvus DocumentArray thread safe and suitable for pytest (#904)

This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.

This would produce an error like the following:

E1207 14:59:51.357528591    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469    2279 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884    3934 ev_epoll1_linux.cc:824]     assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted

This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.

Restore backwards compatibility for (de)serialization (#903)

DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'

This fix restores backwards compatibility by not relying on newly introduced private attributes:

# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
<DocumentArray (length=11) at 140683902276416>

Process finished with exit code 0
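
A common way to avoid depending on an attribute that older payloads do not contain is to read it with a default. The sketch below illustrates that pattern; `ToyArray` is a hypothetical class, the attribute name is taken from the error message above, and the actual fix in DocArray may be implemented differently.

```python
class ToyArray:
    """Hypothetical container whose newer releases add `_is_subindex`."""

    def __init__(self, items: list) -> None:
        self.items = list(items)
        self._is_subindex = False  # attribute introduced by the newer release

    @property
    def is_subindex(self) -> bool:
        # Fall back to a default when the attribute is absent in old payloads.
        return getattr(self, "_is_subindex", False)


# Simulate restoring state saved by an older release: __init__ is not run,
# and the saved state has no `_is_subindex` entry.
old_state = {"items": [1, 2, 3]}
restored = ToyArray.__new__(ToyArray)
restored.__dict__.update(old_state)

print(restored.is_subindex)  # False, and no AttributeError
```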

📗 Documentation Improvements

  • Polish docs throughout (#895)

🤟 Contributors

We would like to thank all contributors to this release:

  • Anne Yang (@AnneYang720)
  • Nan Wang (@nan-wang)
  • anna-charlotte (@anna-charlotte)
  • Alex Cureton-Griffiths (@alexcg1)

v0.20.0

1 year ago

Release Note (0.20.0)

Release time: 2022-12-07 12:15:30

This release contains 8 new features, 3 bug fixes and 7 documentation improvements.

🆕 Features

Milvus document store (#587)

This release supports the Milvus vector database as a document store.

da = DocumentArray(storage='milvus', config={'n_dim': 3})

Root_id for document stores (#808)

When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).

top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)

To allow this we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration.
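
How a stored root_id lets chunk-level matches be mapped back to their root documents can be sketched with plain dictionaries. This is an illustration only and not how DocArray implements return_root: the function name, the dictionary layout, and the choice to report each root once are all assumptions.

```python
def roots_for_matches(chunk_matches, docs_by_id):
    """Map chunk-level matches to root documents, preserving rank order."""
    roots = []
    seen = set()
    for chunk in chunk_matches:
        # The root id is read from the chunk's tags (hypothetical layout).
        root_id = chunk['tags']['root_id']
        if root_id in seen:
            continue  # assumption: each root is reported only once
        seen.add(root_id)
        roots.append(docs_by_id[root_id])
    return roots


# Hypothetical root documents and chunk-level search results.
docs_by_id = {
    'doc-a': {'id': 'doc-a', 'text': 'first root'},
    'doc-b': {'id': 'doc-b', 'text': 'second root'},
}
chunk_matches = [
    {'id': 'b-1', 'tags': {'root_id': 'doc-b'}},
    {'id': 'a-1', 'tags': {'root_id': 'doc-a'}},
    {'id': 'b-2', 'tags': {'root_id': 'doc-b'}},
]
root_ids = [doc['id'] for doc in roots_for_matches(chunk_matches, docs_by_id)]
print(root_ids)  # ['doc-b', 'doc-a']
```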

Filtering based on text keywords for Qdrant (#849)

You can now filter based on text keywords for the Qdrant document store.

filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}}
    ]
}

results = da.find(np.random.rand(n_dim), filter=filter)

RGB-D representation of 3D meshes (#753)

DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.

doc.load_uris_to_rgbd_tensor()

Load multi-page TIFF files into chunks (#845)

Multi-page TIFF images can now be loaded with load_uri_to_image_tensor(). Each page is loaded into its own chunk.

d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
  └─ chunks
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
     ├─ <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
     └─ <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>

Store key frame indices when loading video tensor from uri (#880)

Key frame indices are now stored in a Document's tags (under the keyframe_indices key) when loading a video to a tensor. This allows extracting the section of the video between key frames.

d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
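
Extracting the sections between key frames can be sketched as below. The helper split_at_keyframes is hypothetical (not part of DocArray), and a plain list stands in for the video tensor; the same slicing should apply to a tensor whose first axis is the frame index.

```python
def split_at_keyframes(frames, keyframe_indices):
    """Return the frame segments, each one starting at a key frame."""
    # Cast defensively, in case the indices were stored as floats.
    bounds = [int(i) for i in keyframe_indices] + [len(frames)]
    # Pair each key frame index with the next one; the last segment
    # runs to the end of the video.
    return [frames[start:end] for start, end in zip(bounds, bounds[1:])]


frames = list(range(10))      # stand-in for 10 video frames
keyframe_indices = [0, 4, 7]  # stand-in for d.tags['keyframe_indices']
segments = split_at_keyframes(frames, keyframe_indices)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```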

Better plotting of embeddings for nested and complex data (#891)

You can now choose which meta fields to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.

docs.plot_embeddings(exclude_fields_metas=['chunks'])

Better support for information retrieval evaluation (#826)

This release adds a max_rel_per_label parameter to better support metric calculations that require the number of relevant Documents.

metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
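
Why the number of relevant Documents matters can be seen from the definition of recall at k: the hits within the top k results are divided by the total number of relevant Documents, which cannot always be inferred from the matches alone. The sketch below is a simplified, standalone version of that formula; the function and its arguments are hypothetical and do not reflect DocArray's implementation.

```python
def recall_at_k(retrieved_labels, query_label, max_rel, k):
    """Fraction of all relevant documents that appear in the top k results."""
    if max_rel <= 0:
        return 0.0
    # Count the matches in the top k that share the query's label.
    hits = sum(1 for label in retrieved_labels[:k] if label == query_label)
    return hits / max_rel


# Labels of the ranked matches for one query with label 0.
matches = [0, 2, 0, 1, 0]
# Assume 4 documents with label 0 exist in total (the max_rel value).
score = recall_at_k(matches, query_label=0, max_rel=4, k=3)
print(score)  # 0.5
```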

๐Ÿž Bug Fixes

Support length calculation independently from list-like behavior (#840)

DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.

Remove cosine similarity field with false assignment (#835)

In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
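
The two quantities are easy to confuse because they move in opposite directions. A minimal sketch, assuming the common convention that cosine distance equals 1 minus cosine similarity (the helper names are illustrative and not taken from DocArray or Weaviate):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


same = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])
# Identical vectors: similarity is high, distance is low.
print(cosine_similarity(*same), cosine_distance(*same))
# Orthogonal vectors: similarity is low, distance is high.
print(cosine_similarity(*orthogonal), cosine_distance(*orthogonal))
```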

Rebuild index after clearing storage (#837)

The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.

📗 Documentation Improvements

  • Correct Document description (#842)
  • Minor correction in Document description (#834)
  • Add username to DocArray pull (#847)
  • Fix broken docs (#805)
  • Fix data management section (#801)
  • Change logic order according to blog (#797)
  • Move cloud support to integrations (#798)

🤟 Contributors

We would like to thank all contributors to this release:

  • Delgermurun (@delgermurun)
  • Anne Yang (@AnneYang720)
  • anna-charlotte (@anna-charlotte)
  • Johannes Messner (@JohannesMessner)
  • Alex Cureton-Griffiths (@alexcg1)
  • AlaeddineAbdessalem (@alaeddine-13)
  • dong xiang (@dongxiang123)
  • coolmian (@coolmian)
  • Joan Fontanals (@JoanFM)
  • Nan Wang (@nan-wang)
  • samsja (@samsja)
  • Michael Günther (@guenthermi)

v0.19.1

1 year ago

Release Note (0.19.1)

This release contains 1 hot fix.

๐Ÿž Hot Fix

Support for new Jina AI Cloud namespace format.

This release introduces namespaces when pushing/pulling DocumentArrays to/from Jina AI Cloud.

from docarray import DocumentArray

DocumentArray.pull('<username>/<da-name>')
DocumentArray.push('<username>/<da-name>')

You should now use a namespace when accessing an artifact. This release fixes a bug related to this namespace in DocArray.

🤟 Contributors

  • samsja (@samsja)
  • delgermurun (@delgermurun)