Represent, send, store and search multimodal data
0.33.0
Release time: 2023-06-06 14:05:56
This release contains 1 new feature, 1 performance improvement, 9 bug fixes and 4 documentation improvements.
Allow coercing to a TorchTensor from an NdArray or TensorFlowTensor, and the other way around.
from docarray import BaseDoc
from docarray.typing import TorchTensor
import numpy as np
class MyTensorsDoc(BaseDoc):
    tensor: TorchTensor


doc = MyTensorsDoc(tensor=np.zeros(512))  # the NumPy array is coerced to a TorchTensor
doc.summary()
📄 MyTensorsDoc : 0a10f88 ...
╭──────────────────────┬────────────────────────────────────────────────────╮
│ Attribute            │ Value                                              │
├──────────────────────┼────────────────────────────────────────────────────┤
│ tensor: TorchTensor  │ TorchTensor of shape (512,), dtype: torch.float64  │
╰──────────────────────┴────────────────────────────────────────────────────╯
We have made a performance improvement to the find interface of InMemoryExactNNIndex that gives a ~2x speedup.
The script used to measure this is as follows:
from torch import rand
from time import perf_counter

from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import TorchTensor


class MyDocument(BaseDoc):
    embedding: TorchTensor
    embedding2: TorchTensor
    embedding3: TorchTensor


def generate_doc_list(num_docs: int, dims: int) -> DocList[MyDocument]:
    return DocList[MyDocument](
        [
            MyDocument(
                embedding=rand(dims),
                embedding2=rand(dims),
                embedding3=rand(dims),
            )
            for _ in range(num_docs)
        ]
    )


num_docs, num_queries, dims = 500000, 1000, 128
data_list = generate_doc_list(num_docs, dims)
queries = generate_doc_list(num_queries, dims)

index = InMemoryExactNNIndex[MyDocument](data_list)

start = perf_counter()
for _ in range(5):
    matches, scores = index.find_batched(queries, search_field='embedding')

print(
    f"Number of queries: {num_queries} \n"
    f"Number of indexed documents: {num_docs} \n"
    f"Total time: {(perf_counter() - start) / 5} seconds"
)
limit parameter in filter for index backends (#1618)
InMemoryExactNNIndex and HnswDocumentIndex now respect the limit parameter in the filter API.
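As a quick illustration, here is a minimal sketch; the ProductDoc schema and its price field are made up for this example, and the MongoDB-style filter syntax follows the Document Index filtering docs:

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray


class ProductDoc(BaseDoc):
    price: int
    embedding: NdArray[128]


index = InMemoryExactNNIndex[ProductDoc](
    DocList[ProductDoc](
        [ProductDoc(price=i, embedding=np.random.rand(128)) for i in range(10)]
    )
)

# `limit` now caps the number of returned documents
cheap_docs = index.filter({'price': {'$lte': 5}}, limit=3)
assert len(cheap_docs) == 3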
HnswDocumentIndex can search with limit greater than number of documents (#1611)
HnswDocumentIndex now allows calling find with a limit parameter larger than the number of indexed documents.
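A minimal sketch of the new behavior; the MyDoc schema, work_dir and numbers are made up for this example:

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    embedding: NdArray[128]


index = HnswDocumentIndex[MyDoc](work_dir='./tmp_hnsw')
index.index(DocList[MyDoc]([MyDoc(embedding=np.random.rand(128)) for _ in range(10)]))

# limit=100 exceeds the 10 indexed documents and no longer raises;
# at most 10 matches are returned
docs, scores = index.find(np.random.rand(128), search_field='embedding', limit=100)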
HnswDocumentIndex (#1604)
HnswDocumentIndex now allows reindexing documents with the same id, updating the original documents, as sketched below.
HnswDocumentIndex now also allows indexing more than max_elements documents, dynamically adapting the index as it grows.
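Continuing the sketch from the previous fix, the upsert-style behavior could look like this (illustrative only):

doc = MyDoc(embedding=np.random.rand(128))
index.index(DocList[MyDoc]([doc]))

doc.embedding = np.random.rand(128)  # modify the payload
index.index(DocList[MyDoc]([doc]))  # same id: the stored document is updated, not duplicated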
HnswDocumentIndex (#1596)
from docarray.index import HnswDocumentIndex
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for i in range(200)]
index = HnswDocumentIndex[MyDoc](work_dir='./tmp', index_name='index')
index.index(docs=DocList[MyDoc](docs))
resp = index.find_batched(queries=DocList[MyDoc](docs[0:3]), search_field='embedding')
Previously, this basic usage threw an exception:
TypeError: ModelMetaclass object argument after ** must be a mapping, not MyDoc
Now, it works as expected.
InMemoryExactNNIndex index initialization with nested DocList (#1582)
Instantiating an InMemoryExactNNIndex with a Document schema that had a nested DocList previously threw this error:
from docarray import BaseDoc, DocList
from docarray.documents import TextDoc
from docarray.index import HnswDocumentIndex


class MyDoc(BaseDoc):
    text: str
    d_list: DocList[TextDoc]


index = HnswDocumentIndex[MyDoc]()
TypeError: docarray.index.abstract.BaseDocIndex.__init__() got multiple values for keyword argument 'db_config'
Now it can be successfully instantiated.
Calling summary on a document with a List attribute previously showed the wrong type:
from docarray import BaseDoc, DocList
from typing import List


class TestDoc(BaseDoc):
    str_list: List[str]


dl = DocList[TestDoc]([TestDoc(str_list=[]), TestDoc(str_list=["1"])])
dl.summary()
Previous output:
╭──────── DocList Summary ────────╮
│                                 │
│   Type     DocList[TestDoc]     │
│   Length   2                    │
│                                 │
╰─────────────────────────────────╯
╭─── Document Schema ────╮
│                        │
│   TestDoc              │
│   └── str_list: str    │
│                        │
╰────────────────────────╯
New output:
╭──────── DocList Summary ────────╮
│                                 │
│   Type     DocList[TestDoc]     │
│   Length   2                    │
│                                 │
╰─────────────────────────────────╯
╭────── Document Schema ───────╮
│                              │
│   TestDoc                    │
│   └── str_list: List[str]    │
│                              │
╰──────────────────────────────╯
issubclass (#1594)
DocArray relies heavily on calling Python's issubclass, which caused multiple issues. We now use a safe version that accounts for edge cases and special types.
The example payload of a given document schema with a Tensor attribute was previously of bytes type. This has now been changed to str.
from docarray import DocList, BaseDoc
from docarray.documents import TextDoc
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]


print(f'{type(MyDoc.schema()["properties"]["embedding"]["example"])}')
n_dim to dim (#1610)
We would like to thank all contributors to this release:
0.32.1
Release time: 2023-05-26 14:50:34
This release contains 4 bug fixes, 1 refactoring and 2 documentation improvements.
ElasticDocIndex logging (#1551)
More debugging logs have been added inside ElasticDocIndex.
InMemoryExactNNIndex with Optional embedding tensors (#1575)
You can now index Documents whose tensor search_field is Optional. The index will not consider these None embeddings when running a search.
import torch
from typing import Optional

from docarray import BaseDoc, DocList
from docarray.typing import TorchTensor
from docarray.index import InMemoryExactNNIndex


class EmbeddingDoc(BaseDoc):
    embedding: Optional[TorchTensor[768]]


# every other document has no embedding; those are skipped during search
index = InMemoryExactNNIndex[EmbeddingDoc](
    DocList[EmbeddingDoc](
        [
            EmbeddingDoc(embedding=(torch.rand(768) if i % 2 else None))
            for i in range(5)
        ]
    )
)
index.find(torch.rand((768,)), search_field="embedding", limit=3)
is_subclass check (#1569)
In DocArray, especially when dealing with indexes, field types are checked in ways that lead to calls to Python's issubclass. This call fails under some circumstances, for instance when checking against a List or Tuple. Starting with this release, we use a safe version that does not fail in these cases.
This enables the following usage, which would previously fail:
from typing import List

from docarray import BaseDoc
from docarray.index import HnswDocumentIndex


class MyDoc(BaseDoc):
    test: List[str]


index = HnswDocumentIndex[MyDoc]()
AnyDoc deserialization (#1571)
AnyDoc is a schema-less special Document that adapts to the schema of the data it tries to load. However, in cases where the data contained dictionaries or lists, deserialization failed. This is now fixed, enabling the following behavior:
from docarray.base_doc import AnyDoc, BaseDoc
from typing import Dict


class ConcreteDoc(BaseDoc):
    text: str
    tags: Dict[str, int]


doc = ConcreteDoc(text='text', tags={'type': 1})

any_doc = AnyDoc.from_protobuf(doc.to_protobuf())
assert any_doc.text == 'text'
assert any_doc.tags == {'type': 1}
dict method for Document view (#1559)
Prior to this fix, doc.dict() would return an empty dictionary if doc.is_view() == True:
from docarray import BaseDoc, DocVec


class MyDoc(BaseDoc):
    foo: int


vec = DocVec[MyDoc]([MyDoc(foo=3)])

# before
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {}

# after
doc = vec[0]
assert doc.is_view()
print(doc.dict())
# > {'id': 'f285db406a949a7e7ab084032800f7d8', 'foo': 3}
DocList in FastAPI (#1546)
We would like to thank all contributors to this release:
v0.32.0
This release contains 4 new features, 0 performance improvements, 5 bug fixes and 4 documentation improvements.
The subindex feature allows you to index documents that contain another DocList by automatically creating a separate collection/index for each such DocList:
import numpy as np

from docarray import BaseDoc, DocList
from docarray.index import ElasticDocIndex
from docarray.typing import NdArray


# create nested document schema
class SimpleDoc(BaseDoc):
    tensor: NdArray[10]
    text: str


class MyDoc(BaseDoc):
    docs: DocList[SimpleDoc]


# create some docs
my_docs = [
    MyDoc(
        docs=DocList[SimpleDoc](
            [
                SimpleDoc(
                    tensor=np.ones(10) * (j + 1),
                    text=f"hello {j}",
                )
                for j in range(10)
            ]
        ),
    )
]

# index them into Elasticsearch
index = ElasticDocIndex[MyDoc](index_name="idx")
index.index(my_docs)  # indexes with name 'idx'; 'idx__docs' will be generated

# search on the nested level (subindex)
query = np.random.rand(10)
matches_root, matches_nested, scores = index.find_subindex(
    query, search_field="docs__tensor", limit=5
)
We have enabled shaped tensors to be properly represented in OpenAPI/SwaggerUI, both in examples and the schema.
This means that you can now build web APIs using FastAPI where the Swagger UI properly communicates tensor shapes to your users:
from fastapi import FastAPI

from docarray import BaseDoc
from docarray.base_doc import DocArrayResponse
from docarray.typing import TorchTensor


class Doc(BaseDoc):
    embedding_torch: TorchTensor[3, 4]


app = FastAPI()


@app.post("/foo", response_model=Doc, response_class=DocArrayResponse)
async def foo(doc: Doc) -> Doc:
    return Doc(embedding_torch=doc.embedding_torch)
Generated Swagger UI:
We added a persist method to the InMemoryExactNNIndex class to save the index to disk.
# Save your existing index as a binary file
doc_index.persist('docs.bin')
# Initialize a new document index using the saved binary file
new_doc_index = InMemoryExactNNIndex[MyDoc](index_file_path='docs.bin')
search_field should be optional in hybrid text search (#1516)
We have added a sane default to text_search() for the search_field argument, which is now Optional.
We have added an internal check to see if index_file_path exists when passed to InMemoryExactNNIndex.
We have ensured that empty indices do not fail when find is called.
Serializing tensors with gradients no longer fails.
DocVec display (#1522)
DocVec display issues have been resolved.
We would like to thank all contributors to this release:
0.31.1
This patch release fixes a small bug that was introduced in the latest minor release (0.31.0).
Calling json or dict on an Optional nested DocList no longer throws an error if the value is set to None (#1512).
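A minimal sketch of the now-working behavior; the MyDoc schema is made up for this example:

from typing import Optional

from docarray import BaseDoc, DocList
from docarray.documents import TextDoc


class MyDoc(BaseDoc):
    docs: Optional[DocList[TextDoc]]


doc = MyDoc(docs=None)
doc.dict()  # no longer raises
doc.json()  # no longer raises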
We would like to thank all contributors to this release:
v0.31.0
This release contains 4 new features, 11 bug fixes, and several documentation improvements.
DocVec Optional Tensor (#1472)
Optional tensor fields in a DocVec will return None instead of a list of NaN if the column does not hold any tensor.
This code snippet shows the breaking change:
from typing import Optional

from docarray import BaseDoc, DocVec
from docarray.typing import NdArray


class MyDoc(BaseDoc):
    tensor: Optional[NdArray[10]]


docs = DocVec[MyDoc]([MyDoc() for j in range(2)])
print(docs.tensor)
| Version | Return type |
| --- | --- |
| 0.30.0 | [nan nan] |
| 0.31.0 | None |
Most vector databases have a concept similar to a 'table' in a relational database; this concept is usually called 'collection', 'index', 'class' or similar.
In DocArray v0.30.0, every Document Index backend defined its own default name for this, i.e. a default index_name or collection_name.
Starting with DocArray v0.31.0, the default index_name/collection_name is derived from the document schema name:
from docarray.index.backends.weaviate import WeaviateDocumentIndex
from docarray import BaseDoc


class MyDoc(BaseDoc):
    pass


# With v0.30.0, the line below defaults to `index_name='Document'`.
# This was the default regardless of the Document Index schema.
# With v0.31.0, the line below defaults to `index_name='MyDoc'`.
# The default now depends on the schema, i.e. the `MyDoc` class.
store = WeaviateDocumentIndex[MyDoc]()
If you create and persist a Document Index with v0.30.0, and try to access it using v0.31.0 without manually specifying an index name, an Exception will occur.
You can fix this by manually specifying the index name to match the old default:
# Create new Document Index using v0.30.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=...)
# Access it using v0.31.0
store = WeaviateDocumentIndex[MyDoc](host=..., port=..., index_name='Document')
The table below summarizes the change for all DB backends:

| | DBConfig argument | Default in v0.30.0 | Default in v0.31.0 |
| --- | --- | --- | --- |
| WeaviateDocumentIndex | index_name | 'Document' | Schema class name |
| QdrantDocumentIndex | collection_name | 'documents' | Schema class name |
| ElasticDocIndex | index_name | 'index__' + a random id | Schema class name |
| ElasticV7DocIndex | index_name | 'index__' + a random id | Schema class name |
| HnswDocumentIndex | n/a | n/a | n/a |
InMemoryExactNNIndex (#1441)
In this version we have introduced the InMemoryExactNNIndex Document Index, which allows you to perform exact vector search in memory (as opposed to approximate nearest neighbor search in vector databases).
InMemoryExactNNIndex can be used for prototyping and is suitable for small-scale data (roughly 1k-10k documents), as opposed to a vector database, which suits larger scales but comes with a performance overhead at smaller scales.
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray
import numpy as np


class MyDoc(BaseDoc):
    tensor: NdArray[512]


docs = DocList[MyDoc](MyDoc(tensor=i * np.ones(512)) for i in range(10))

doc_index = InMemoryExactNNIndex[MyDoc]()
doc_index.index(docs)

print(doc_index.find(3 * np.ones(512), search_field='tensor', limit=10))
FindResult(documents=<DocList[MyDoc] (length=10)>, scores=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))
DocList inherits from Python list (#1457)
DocList is now a subclass of Python's list. This means that you can now use all the methods that are available to Python lists on DocList objects. For example, you can now use len on DocList objects, and tools like Pydantic or FastAPI will be able to work with it more easily.
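For example, a small sketch with a made-up schema:

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    text: str


docs = DocList[MyDoc]([MyDoc(text='hello')])
docs.append(MyDoc(text='world'))  # plain list methods work
assert isinstance(docs, list)
assert len(docs) == 2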
len to DocIndex (#1454)
You can now perform len(vector_index), which is equivalent to vector_index.num_docs().
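A minimal sketch, reusing the InMemoryExactNNIndex introduced above (the EmbDoc schema is made up for this example):

import numpy as np
from docarray import BaseDoc, DocList
from docarray.index.backends.in_memory import InMemoryExactNNIndex
from docarray.typing import NdArray


class EmbDoc(BaseDoc):
    embedding: NdArray[16]


index = InMemoryExactNNIndex[EmbDoc](
    DocList[EmbDoc]([EmbDoc(embedding=np.random.rand(16)) for _ in range(3)])
)
assert len(index) == index.num_docs() == 3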
to_json alias to BaseDoc (#1494)
Document or DocumentArray (#1422)
Trying to load Document or DocumentArray from DocArray would previously raise an error, saying that you needed to downgrade your version of DocArray if you wanted to use these two objects. This behavior has been fixed.
AnyDoc.from_protobuf (#1437)
AnyDoc can now read any BaseDoc protobuf file. The same applies to DocList.
extend to DocList (#1493)
dict() on BaseDoc (#1481)
json() on BaseDoc (#1481)
Use pd.concat() instead of df.append() in to_dataframe() to avoid warning (#1478)
ndarray (#1429)
hnswlib (#1424)
Docindex URLs (#1433)
hnswlib and elastic document indexes (#1431)
We would like to thank all contributors to this release:
Warning: This version of DocArray is a complete rewrite, and therefore includes several (more than just breaking) changes. Be sure to check the documentation to prepare your migration.
If you are using DocArray v<0.30.0, you will be familiar with its dataclass API.
DocArray v2 is that idea, taken seriously. Every document is created through a dataclass-like interface, courtesy of Pydantic.
This gives the following advantages:
You may also be familiar with our old Document Stores for vector database integration. They are now called Document Indexes and offer the following improvements:
For now, Document Indexes support Weaviate, Qdrant, ElasticSearch, and HNSWLib, with more to come.
Document
- Document has been renamed to BaseDoc.
- BaseDoc cannot be used directly, but instead has to be extended. Therefore, each document class is created through a dataclass-like interface.
- BaseDoc allows for a flexible schema, compared to the Document class in v1, which only allowed a fixed schema with one of tensor, text and blob, and additional chunks and matches.
- Modality-loading methods (like .load_uri_to_image_tensor()) are not supported in v2. Instead, we provide some of those methods on the typing level.
- v2 provides the LegacyDocument class, which extends BaseDoc while following the same schema as v1's Document. The LegacyDocument can be useful to start migrating your codebase from v1 to v2. Nevertheless, the API is not fully compatible with DocArray v1's Document: none of the methods associated with Document are present, and only the schema of the data is similar. A minimal sketch follows this list.
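A minimal sketch of LegacyDocument usage, assuming the docarray.documents.legacy module path:

from docarray.documents.legacy import LegacyDocument

# a LegacyDocument follows the v1 schema: tensor, text, blob, plus chunks and matches
doc = LegacyDocument(text='hello')
print(doc.text)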
are present. Only the schema of the data is similar.DocumentArray
DocumentArray
class from v1 has been renamed to DocList
, to be more descriptive of its actual functionality, since it is a list of BaseDoc
s.DocVec
, which is a column-based representation of BaseDoc
s. Both DocVec
and DocList
extend AnyDocArray
.DocVec
is a container of Documents appropriates to perform computation that require batches of data (ex: matrix multiplication, distance calculation, deep learning forward pass).DocVec
has a similar interface as DocList
but with an underlying implementation that is column-based instead of row-based. Each field of the schema of the DocVec
(the .doc_type
which is a BaseDoc
) will be stored in a column. If the field is a tensor, the data from all Documents will be stored as a single doc_vec
(Torch/TensorFlow/NumPy) tensor. If the tensor field is AnyTensor
or a Union of tensor types, the .tensor_type
will be used to determine the type of the doc_vec
column.DocList
it does not necessarily have to be homogenous.DocList
you can parameterize it at initialization time:from docarray import DocList
from docarray.documents import ImageDoc
docs = DocList[ImageDoc]()
- Some methods (like .from_csv() or .pull()) only work with parameterized DocLists.
s.AnyDocArray
will expose the same attributes as the BaseDoc
s it contains. This will return a list of type(attribute)
. However, this only works if (and only if) all the BaseDoc
s in the AnyDocArray
have the same schema. Therefore only this works:from docarray import BaseDoc, DocList
class Book(BaseDoc):
title: str
author: str = None
docs = DocList[Book]([Book(title=f'title {i}') for i in range(5)])
book_titles = docs.title # returns a list[str]
# this would fail
# docs = DocList([Book(title=f'title {i}') for i in range(5)])
# book_titles = docs.title
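To illustrate the column-based storage of DocVec described in the list above, here is a minimal sketch; the ImageDoc schema and shapes are made up for this example:

import numpy as np
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray


class ImageDoc(BaseDoc):
    tensor: NdArray[3, 224, 224]


vec = DocVec[ImageDoc]([ImageDoc(tensor=np.zeros((3, 224, 224))) for _ in range(8)])

# all tensor fields are stacked into one column tensor
print(vec.tensor.shape)  # (8, 3, 224, 224)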
In v2 the Document Store has been renamed to DocIndex and can be used for fast retrieval using vector similarity. For now, DocArray v2 DocIndex supports Weaviate, Qdrant, ElasticSearch, and HNSWLib.
Instead of creating a DocumentArray instance and setting the storage parameter to a vector database of your choice, in v2 you can initialize a DocIndex object of your choice, such as:
db = HnswDocumentIndex[MyDoc](work_dir='/my/work/dir')
In contrast, DocStore in v2 can be used for simple long-term storage, such as with AWS S3 buckets or Jina AI Cloud.
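A minimal sketch of that DocStore-style usage, assuming an accessible S3 bucket named my-bucket and the URL-scheme-based push/pull API:

from docarray import BaseDoc, DocList


class MyDoc(BaseDoc):
    text: str


docs = DocList[MyDoc]([MyDoc(text='hello')])

# the URL scheme selects the store backend ('s3://...', 'file://...', 'jac://...')
docs.push('s3://my-bucket/my-docs')
retrieved = DocList[MyDoc].pull('s3://my-bucket/my-docs')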
Thank you to all of the contributors to this release:
0.21.0
Release time: 2023-01-17 09:10:50
This release contains 3 new features, 7 bug fixes and 5 documentation improvements.
This version of DocArray adds a new Document Store: OpenSearch!
You can use the OpenSearch Document Store to index your Documents and perform ANN search on them:
from docarray import Document, DocumentArray
import numpy as np

# Connect to OpenSearch instance
n_dim = 3
da = DocumentArray(
    storage='opensearch',
    config={'n_dim': n_dim},
)

# Index Documents
with da:
    da.extend(
        [
            Document(id=f'r{i}', embedding=i * np.ones(n_dim))
            for i in range(10)
        ]
    )

# Perform ANN search
np_query = np.ones(n_dim) * 8
results = da.find(np_query, limit=10)
Additionally, the OpenSearch Document Store can perform filter queries, search by text, and search by tags.
Learn more about its usage in the official documentation.
You can now include color information in your point cloud data, which can be visualized using display_point_cloud_tensor():
coords = np.load('a_red_motorbike/coords.npy')
colors = np.load('a_red_motorbike/coord_colors.npy')

doc = Document(
    tensor=coords,
    chunks=DocumentArray([Document(tensor=colors, name='point_cloud_colors')]),
)
doc.display()
The Redis Document Store now supports text search in various supported languages. To set a desired language, change the language parameter in the Redis configuration:
da = DocumentArray(
    storage='redis',
    config={
        'n_dim': 128,
        'index_text': True,
        'language': 'chinese',
    },
)
Whenever the string "\n" was contained in any Document field, doc.plot() would result in a rendering error. This fixes those errors by rendering "\n" as whitespace.
to_pydantic_model (#949)
This bug caused all strings of the form 'Infinity' to be coerced to the string 'inf' when calling to_pydantic_model() or to_dict(). This is fixed now, leaving such strings unchanged.
In the embed_and_evaluate() method, the number of relevant Documents per label used to be calculated based on the Documents in self. This is not generally correct, so after this fix the quantity is calculated based on the Documents in the index data.
When a Document Store has list-like behavior disabled, it no longer creates an offset to id mapping, which improves performance.
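A minimal sketch of disabling list-like behavior, assuming the list_like config flag introduced in DocArray 0.19:

da = DocumentArray(
    storage='redis',
    config={'n_dim': 128, 'list_like': False},  # no offset-to-id mapping is created
)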
Loading audio files from a remote URL would cause a FileNotFoundError, which is now fixed.
$exists does not work correctly with tags (#911) (#923)
Before this fix, $exists would treat falsy values such as 0 or [] as non-existent. This is now fixed.
When casting from a dataclass to Document, singleton lists were treated like an individual element, even if the corresponding field was annotated with List[...]. Now this case is handled, and accessing such a field yields a DocumentArray, even for singleton inputs.
We would like to thank all contributors to this release:
0.20.1
Release time: 2022-12-12 09:32:37
This bug was causing connectivity issues when using multiple DocumentArrays in different threads to connect to the same Milvus instance, e.g. in pytest.
This would produce an error like the following:
E1207 14:59:51.357528591 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.367985469 2279 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
E1207 14:59:51.457061884 3934 ev_epoll1_linux.cc:824] assertion failed: gpr_atm_no_barrier_load(&g_active_poller) != (gpr_atm)worker
Fatal Python error: Aborted
This fix creates a separate gRPC connection for each MilvusDocumentArray instance, circumventing the issue.
DocArray v0.20.0 broke (de)serialization backwards compatibility with earlier versions of the library, making it impossible to load DocumentArrays from v0.19.1 or earlier from disk:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.0
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
AttributeError: 'DocumentArrayInMemory' object has no attribute '_is_subindex'
This fix restores backwards compatibility by not relying on newly introduced private attributes:
# DocArray <= 0.19.1
da = DocumentArray([Document() for _ in range(10)])
da.save_binary('old-da.docarray')
# DocArray == 0.20.1
da = DocumentArray.load_binary('old-da.docarray')
da.extend([Document()])
print(da)
<DocumentArray (length=11) at 140683902276416>
Process finished with exit code 0
We would like to thank all contributors to this release:
0.20.0
Release time: 2022-12-07 12:15:30
This release contains 8 new features, 3 bug fixes and 7 documentation improvements.
This release supports the Milvus vector database as a document store.
da = DocumentArray(storage='milvus', config={'n_dim': 3})
When working with a vector database you can now retrieve the root document even if you search at a nested level with sub-indices (for example at chunk level).
top_level_matches = da.find(query=np.random.rand(512), on='@.[image]', return_root=True)
To allow this, we now store the root_id in the chunks' tags. You can enable this by passing root_id=True in your document store configuration, as sketched below.
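A minimal config sketch; the backend choice and dimensionality are illustrative:

da = DocumentArray(
    storage='milvus',
    config={'n_dim': 512, 'root_id': True},  # store root_id in the chunks' tags
)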
You can now filter based on text keywords for the Qdrant document store.
filter = {
    'must': [
        {"key": "info", "match": {"text": "shoes"}},
    ]
}
results = da.find(np.random.rand(n_dim), filter=filter)
DocArray already supports 3D mesh representation in different formats and this release adds support for RGB-D representation.
doc.load_uris_to_rgbd_tensor()
Multi-page TIFF images can now be loaded with load_uri_to_image_tensor().
d = Document(uri="foo.tiff")
d.load_uri_to_image_tensor()
print(d)
<Document ('id', 'uri', 'chunks') at 7f907d786d6c11ec840a1e008a366d49>
└── chunks
    ├── <Document ('id', 'parent_id', 'granularity', 'tensor') at 7aa4c0ba66cf6c300b7f07fdcbc2fdc8>
    ├── <Document ('id', 'parent_id', 'granularity', 'tensor') at bc94a3e3ca60352f2e4c9ab1b1bb9c22>
    └── <Document ('id', 'parent_id', 'granularity', 'tensor') at 36fe0d1daf4442ad6461c619f8bb25b7>
key_frame_indices are now stored in a Document's tags when loading a video to tensor. This allows extracting the sections of the video between key frames.
d = Document(uri="video.mp4").load_uri_to_video_tensor()
print(d.tags['keyframe_indices'])
[0, 25, 196, ...]
You can now choose which meta field parameters to exclude when calling DocumentArray's plot_embeddings() method. This makes it easier to plot embeddings for complex and nested data.
docs.plot_embeddings(exclude_fields_metas=['chunks'])
This release adds a max_rel_per_label
parameter to better support metric calculations that require the number of relevant Documents.
metrics = da.evaluate(['recall_at_k'], max_rel_per_label={i: 1 for i in range(3)})
DocArray 0.19 added the ability to instantiate a document store without list-like behavior for improved performance. However, calculating the length of certain document stores relied on such list-like behavior. This release fixes length calculation for the Redis document store, making it independent from list-like behavior.
In the Weaviate document store, cosine distance is no longer mistakenly assigned to the cosine_similarity field.
The index for Redis and Elasticsearch document stores is now rebuilt when _clear_storage is called.
We would like to thank all contributors to this release:
This release contains 1 hot fix.
This release introduces namespaces when pushing/pulling DocumentArrays to/from Jina AI Cloud.
from docarray import DocumentArray
DocumentArray.pull('<username>/<da-name>')
DocumentArray.push('<username>/<da-name>')
You should now use a namespace when accessing an artifact. This release fixes a bug related to this namespace in DocArray.