Represent, send, store and search multimodal data
0.40.0
Release time: 2023-12-22 12:12:15
This release contains 1 new feature, 3 bug fixes and 2 documentation improvements.
We have integrated Epsilla into DocArray.
Here's a simple example of how to use it:
import numpy as np
from docarray import BaseDoc
from docarray.index import EpsillaDocumentIndex
from docarray.typing import NdArray
from pydantic import Field
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = EpsillaDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)
In this example, we create a document class with both textual and numeric data. Then, we initialize an Epsilla-backed document index and use it to index our documents. Finally, we perform a search query.
DocArray type-hinting is now available for Python 3.12.
We fixed an issue with serializing and deserializing protobuf documents that contain nested documents in dictionaries and other complex structures.
There was also a bug where deep-copying a TorchTensor object whose dtype was not float32 did not work correctly. This has now been fixed.
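As a minimal sketch of the fixed behavior (the schema and dtype below are illustrative):
from copy import deepcopy

import torch
from docarray import BaseDoc
from docarray.typing import TorchTensor

class MyDoc(BaseDoc):
    tensor: TorchTensor

doc = MyDoc(tensor=torch.zeros(5, dtype=torch.float64))
copied = deepcopy(doc)
# the copy keeps the original dtype instead of assuming float32
assert copied.tensor.dtype == torch.float64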
We would like to thank all contributors to this release:
0.39.1
Release time: 2023-10-23 08:56:38
This release contains 2 bug fixes.
A recent update to numpy changed some of its versioning semantics, breaking DocArray's from_dataframe() method in some cases where the dataframe contains a numpy array. This has now been fixed.
from docarray import BaseDoc, DocVec
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    embedding: NdArray
    text: str

da = DocVec[MyDoc](
    [
        MyDoc(
            embedding=[1, 2, 3, 4],
            text='hello',
        ),
        MyDoc(
            embedding=[5, 6, 7, 8],
            text='world',
        ),
    ],
    tensor_type=NdArray,
)

df_da = da.to_dataframe()
# This broke before and is now fixed
da2 = DocVec[MyDoc].from_dataframe(df_da, tensor_type=NdArray)
Starting with Python 3.9, Optional.__args__ is not always available, leading to some compatibility problems. This has been fixed by using the typing.get_args helper.
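For illustration, a minimal sketch of the safer pattern:
from typing import Optional, get_args

# get_args works consistently across Python versions,
# whereas Optional[int].__args__ is not always available
assert get_args(Optional[int]) == (int, type(None))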
We would like to thank all contributors to this release:
0.39.0
Release time: 2023-10-02 13:06:02
This release contains 4 new features, 8 bug fixes, and 7 documentation improvements.
The biggest feature of this release is full support for Pydantic v2! We are continuing to support Pydantic v1 at the same time.
If you use Pydantic v2, you will need to adapt your DocArray code to the new Pydantic API. Check out their migration guide here.
Pydantic v2 has its core written in Rust and provides significant performance improvements to DocArray: JSON serialization is 240% faster, and validation of BaseDoc and DocList with non-native types like TorchTensor is 20% faster.
A BaseDoc by default includes an id field. This can be problematic if you want to build an API that requires a model without an ID field. Therefore, we now provide BaseDocWithoutId, which is, as its name suggests, BaseDoc without the ID field.
Please use this document with caution: BaseDoc is still the base class to use unless you specifically need to remove the ID.
⚠️ BaseDocWithoutId is not compatible with DocIndex or any feature requiring a vector database, because DocIndex needs the id field to store and retrieve documents.
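A minimal sketch of how this can be used (assuming BaseDocWithoutId is exported from the top-level docarray package):
from docarray import BaseDocWithoutId

class InputDoc(BaseDocWithoutId):
    text: str

doc = InputDoc(text='hello')
assert 'id' not in doc.dict()  # no auto-generated ID field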
Jina AI Cloud is being discontinued, so we are removing the push/pull feature related to Jina AI Cloud.
DocList can be parametrized with a BaseDoc subclass using the syntax DocList[MyDoc]().
In this release, we fixed a bug that allowed users to specify the type of a DocList multiple times: DocList[MyDoc1][MyDoc2] won't work anymore. (#1800)
We also fixed a bug that caused a silent failure when users parametrized DocList with the wrong type, for example DocList[doc()]. (#1794)
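A minimal sketch of the corrected behavior:
from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    text: str

docs = DocList[MyDoc]([MyDoc(text='hello')])  # parametrize once: works

# DocList[MyDoc][MyDoc]   # parametrizing twice now raises an error
# DocList[MyDoc()]        # passing an instance instead of a class now fails loudly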
We fixed a small bug that incorrectly set the port of the Milvus client.
We would like to thank all contributors to this release:
0.38.0
Release time: 2023-09-07 13:40:16
This release contains 3 bug fixes and 4 documentation improvements, including 1 breaking change.
DocList.to_json() and DocVec.to_json() now return str
In order to make the to_json method consistent across different classes, we changed its return type in DocList and DocVec to str. This means that, if you use this method in your application, make sure to update your codebase to expect str instead of bytes.
This change makes these methods consistent with BaseDoc.to_json() and other Pydantic models. After this release, these methods return data of type str instead of bytes.
💥 Since the return type has changed, this is considered a breaking change.
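A minimal sketch of the new behavior:
from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    text: str

docs = DocList[MyDoc]([MyDoc(text='hello')])
json_str = docs.to_json()
assert isinstance(json_str, str)  # previously this was bytes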
This release introduces type casting internally in the reduce helper function, casting its inputs before appending them to the final result. This makes it possible to reduce documents whose schemas are compatible but not exactly the same.
Attributes in __annotations__ but not in __fields__ (#1777)
This release fixes an issue in the create_pure_python_type_model helper function. Starting with this release, only attributes in the class's __fields__ will be considered during type creation.
The previous behavior broke applications when users introduced a ClassVar in an input class:

from typing import ClassVar
from docarray import BaseDoc

class MyDoc(BaseDoc):
    endpoint: ClassVar[str] = "my_endpoint"
    input_test: str = ""

which previously failed inside the helper with:
field_info = model.__fields__[field_name].field_info
KeyError: 'endpoint'
Kudos to @NarekA for raising the issue and contributing a fix in the Jina project, which was ported to DocArray.
filter_docs (#1762)
We would like to thank all contributors to this release:
0.37.1
This release contains 4 bug fixes and 1 documentation improvement.
The previous schema check in the UpdateMixin was strict and did not allow updates when the schemas of the two documents were similar but did not share the same reference. For instance, if the schemas were dynamically generated but had the same fields and field types, the check would still evaluate to False, and it was not possible to update the documents. This release relaxes the check so that it verifies whether the fields of the schemas are similar instead.
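As a rough sketch of what the relaxed check enables (the class names are illustrative):
from docarray import BaseDoc

class DocA(BaseDoc):
    text: str

class DocB(BaseDoc):
    text: str

doc_a = DocA(text='hello')
doc_b = DocB(text='world')

# DocA and DocB are distinct classes, but their fields match,
# so the relaxed schema check now lets the update proceed
doc_a.update(doc_b)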
We fixed an issue where non-class type fields used in schemas with QdrantDocumentIndex resulted in a TypeError. The issue was resolved by replacing the usage of issubclass with safe_issubclass in the QdrantDocumentIndex implementation.
The following case used to result in a KeyError:
from docarray import BaseDoc
from docarray.utils.create_dynamic_doc_class import create_base_doc_from_schema
class Nested2(BaseDoc):
    value: str

class Nested1(BaseDoc):
    nested: Nested2

class RootDoc(BaseDoc):
    nested: Nested1
new_my_doc_cls = create_base_doc_from_schema(RootDoc.schema(), 'RootDoc')
We fixed this issue by changing create_base_doc_from_schema so that global definitions of nested schemas are propagated during recursive calls.
We would like to thank all contributors to this release:
0.37.0
Release time: 2023-08-03 03:11:16
This release contains 6 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.
Leverage the power of Milvus in your DocArray project with this latest integration. Here's a simple usage example:
import numpy as np
from docarray import BaseDoc
from docarray.index import MilvusDocumentIndex
from docarray.typing import NdArray
from pydantic import Field
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10] = Field(is_embedding=True)
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = MilvusDocumentIndex[MyDoc]()
db.index(docs)
results = db.find(query, limit=10)
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Milvus-backed document index and use it to index our documents. Finally, we perform a search query.
Supported Functionalities
Filtering in HnswDocumentIndex (#1718)
With our latest update, you can easily use filtering in HnswDocumentIndex, either as a standalone function or in conjunction with the query builder to combine it with vector search.
The code below shows how the new feature works:
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
class SimpleSchema(BaseDoc):
    year: int
    price: int
    embedding: NdArray[128]

# Create dummy documents.
docs = DocList[SimpleSchema](
    SimpleSchema(year=2000 - i, price=i, embedding=np.random.rand(128))
    for i in range(10)
)

doc_index = HnswDocumentIndex[SimpleSchema](work_dir="./tmp_5")
doc_index.index(docs)

# Independent filtering operation (year == 1995)
filter_query = {"year": {"$eq": 1995}}
results = doc_index.filter(filter_query)

# Filtering combined with vector search
hybrid_query = (
    doc_index.build_query()  # get empty query object
    .filter(filter_query={"year": {"$gt": 1994}})  # pre-filtering (year > 1994)
    .find(
        query=np.random.rand(128), search_field="embedding"
    )  # add vector similarity search
    .filter(filter_query={"price": {"$lte": 3}})  # post-filtering (price <= 3)
    .build()
)
results = doc_index.execute_query(hybrid_query)
First, we create and index some dummy documents. Then, we use the filter function in two ways. One is by itself to find documents from a specific year. The other is mixed with a vector search, where we first filter by year, perform a vector search, and then filter by price.
Pre-filtering in InMemoryExactNNIndex (#1713)
You can now add a pre-filter to your queries in InMemoryExactNNIndex. This lets you create flexible queries where you can set up as many pre- and post-filters as you want. Here's a simple example:
query = (
    doc_index.build_query()
    .filter(filter_query={'price': {'$lte': 3}})  # Pre-filter: price <= 3
    .find(query=np.ones(10), search_field='tensor')  # Vector search
    .filter(filter_query={'text': {'$eq': 'hello 1'}})  # Post-filter: text == 'hello 1'
    .build()
)
In this example, we first set a pre-filter to only include items priced 3 or less. We then do a vector search. Lastly, we add a post-filter to find items with the text 'hello 1'. This way, you can easily filter before and after your search!
Updating documents in InMemoryExactNNIndex (#1724)
You can now easily update your documents in InMemoryExactNNIndex. Previously, when you tried to update the same set of documents, it would just add duplicate copies instead of making changes to the existing ones. But not anymore! From now on, if you want to update documents, you just have to re-index them.
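A minimal sketch of the update-by-re-indexing flow (the schema and field names are illustrative):
import numpy as np
from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]

docs = DocList[MyDoc](
    [MyDoc(text=f'doc {i}', embedding=np.random.rand(10)) for i in range(5)]
)
index = InMemoryExactNNIndex[MyDoc]()
index.index(docs)

docs[0].text = 'updated'
index.index(docs)  # re-indexing the same IDs now updates instead of duplicating
assert index.num_docs() == 5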
Tensor format in DocVec deserialization (#1679)
Now you can specify the format of your tensors during DocVec deserialization. You can do this with any method you're using to convert data - protobuf, json, pandas, bytes, binary, or base64. This means you'll always get your tensors in the format you want, whether it's a Torch tensor, TensorFlow tensor, NdArray, and so on.
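For example, a hedged sketch of round-tripping through bytes while choosing the tensor type at load time (the AnyTensor schema is illustrative, and a Torch installation is assumed):
import numpy as np
from docarray import BaseDoc, DocVec
from docarray.typing import AnyTensor, TorchTensor

class MyDoc(BaseDoc):
    tensor: AnyTensor

dv = DocVec[MyDoc]([MyDoc(tensor=np.zeros(10))])
data = dv.to_bytes()

# choose the tensor format at deserialization time
dv_torch = DocVec[MyDoc].from_bytes(data, tensor_type=TorchTensor)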
Description and example for the id field of BaseDoc (#1737)
We added a description and example to the id field of BaseDoc, so that you get a richer OpenAPI specification when building FastAPI-based applications with it.
HnswDocumentIndex performance (#1727, #1729)
We've implemented two key optimizations to enhance the performance of HnswDocumentIndex. First, we avoid serializing embeddings to SQLite, which is a costly operation and unnecessary, as the embeddings can be reconstructed from the hnswlib index itself. Additionally, we've minimized the frequency of computing num_docs(), which previously involved a time-consuming full table scan to determine the number of documents in SQLite. As a result, we've seen an approximate speed increase of 10%, enhancing both the indexing and searching processes.
TorchTensor type comparison (#1739)
We have addressed an exception raised when comparing TorchTensor with the type keyword in the docarray.typing module. Previously, this led to a TypeError; the error has now been resolved, ensuring proper type comparison.
When using the method create_base_doc_from_schema to dynamically create a BaseDoc class, some information was lost, so we made sure that the new class keeps FieldInfo information from the original class, such as description and examples.
issubclass (#1731)
We fixed a bug in calling issubclass by switching to an implementation that is safer for some types.
Subindex naming in QdrantDocumentIndex (#1723)
We've corrected an issue where the collection name was not being updated to match a newly initialized subindex name in QdrantDocumentIndex. This ensures consistent naming between collections and their respective subindexes.
We fixed a bug so that documents with TorchTensors can now be deep-copied.
We would like to thank all contributors to this release:
0.36.0
Release time: 2023-07-18 14:43:28
This release contains 2 new features, 5 bug fixes, 1 performance improvement and 1 documentation improvement.
You can now use JAX with DocArray. We have introduced JaxArray as a new type option for your documents. JaxArray ensures that JAX can natively process any array-like data in your DocArray documents. Here's how you use it:
from docarray import BaseDoc
from docarray.typing import JaxArray
import jax.numpy as jnp
class MyDoc(BaseDoc):
    arr: JaxArray
    image_arr: JaxArray[3, 224, 224]  # For images of shape (3, 224, 224)
    square_crop: JaxArray[3, 'x', 'x']  # For any square image, regardless of dimensions
    random_image: JaxArray[3, ...]  # For any image with 3 color channels, and arbitrary other dimensions
As you can see, the JaxArray typing is extremely flexible and can support a wide range of tensor shapes.
Creating a document with tensors is straightforward. Here is an example:
doc = MyDoc(
    arr=jnp.zeros((128,)),
    image_arr=jnp.zeros((3, 224, 224)),
    square_crop=jnp.zeros((3, 64, 64)),
    random_image=jnp.zeros((3, 128, 256)),
)
Leverage the power of Redis in your DocArray project with this latest integration. Here's a simple usage example:
import numpy as np
from docarray import BaseDoc
from docarray.index import RedisDocumentIndex
from docarray.typing import NdArray
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[10]
docs = [MyDoc(text=f'text {i}', embedding=np.random.rand(10)) for i in range(10)]
query = np.random.rand(10)
db = RedisDocumentIndex[MyDoc](host='localhost')
db.index(docs)
results = db.find(query, search_field='embedding', limit=10)
In this example, we're creating a document class with both textual and numeric data. Then, we initialize a Redis-backed document index and use it to index our documents. Finally, we perform a search query.
Find: Vector search for efficient retrieval of similar documents.
Filter: Use Redis syntax to filter based on textual and numeric data.
Text Search: Leverage text search methods, such as BM25, to find relevant documents.
Get/Del: Fetch or delete specific documents from the index.
Hybrid Search: Combine find and filter functionalities for more refined search. Currently, only these two can be combined.
Subindex: Search through nested data.
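Continuing the example above, hybrid search could be sketched roughly as below. This assumes RedisDocumentIndex exposes the same query-builder pattern as the other backends in these notes, and that the filter string follows Redis search syntax; both are assumptions, not confirmed API:
query = (
    db.build_query()
    .find(query=np.random.rand(10), search_field='embedding')  # vector part
    .filter(filter_query='@text:"text 1"')  # Redis-syntax filter (illustrative)
    .build()
)
results = db.execute_query(query)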
Optimize HnswDocumentIndex by caching num docs (#1706)
We've optimized the num_docs() operation by caching the document count, addressing previous slowdowns during searches. This change results in a minor increase in indexing time but significantly accelerates search times.
from docarray import BaseDoc, DocList
from docarray.index import HnswDocumentIndex
from docarray.typing import NdArray
import numpy as np
import time
class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]
docs = [MyDoc(text='hey', embedding=np.random.rand(128)) for _ in range(20000)]
index = HnswDocumentIndex[MyDoc](work_dir='tst', index_name='index')
index_start = time.time()
index.index(docs=DocList[MyDoc](docs))
index_time = time.time() - index_start
query = docs[0]
find_start = time.time()
matches, _ = index.find(query, search_field='embedding', limit=10)
find_time = time.time() - find_start
In the above experiment, we observed a 13x improvement in the speed of the search function, reducing its execution time from 0.0238 to 0.0018 seconds.
We've moved the contains method into the base class. With this refactoring, the responsibility for checking if a document exists is now delegated to individual backend implementations using the new _doc_exists method.
We have implemented a more robust method of detecting existing indices for WeaviateDocumentIndex.
WeaviateDocumentIndex handles lowercase index names (#1711)
We've addressed an issue in WeaviateDocumentIndex where passing a lowercase index name led to mismatches and subsequent errors, because the system automatically capitalizes the index name when creating an index.
QdrantDocumentIndex unable to see index_name (#1705)
We've resolved an issue where QdrantDocumentIndex did not properly recognize the index_name parameter. Previously, the specified index_name was ignored and the system defaulted to the schema name.
InMemoryExactNNIndex with AnyEmbedding (#1696)
From now on, you can perform search operations in InMemoryExactNNIndex using AnyEmbedding.
safe_issubclass everywhere (#1691)
We now use safe_issubclass instead of issubclass because it supports non-class inputs, helping us avoid unexpected errors.
DocLists in the base index (#1685)
We added an additional check to avoid passing DocLists to a function that converts a list of dictionaries to a DocList.
We would like to thank all contributors to this release:
0.35.0
This release contains 3 new features, 2 bug fixes and 1 documentation improvement.
Serialization for DocVec (#1562)
DocVec now has the same serialization interface as DocList. This means that the following methods are available for it:
to_protobuf() / from_protobuf()
to_base64() / from_base64()
save_binary() / load_binary()
to_bytes() / from_bytes()
to_dataframe() / from_dataframe()
For example, you can now perform Base64 (de)serialization like this:
from docarray import BaseDoc, DocVec

class SimpleDoc(BaseDoc):
    text: str

dv = DocVec[SimpleDoc]([SimpleDoc(text=f'doc {i}') for i in range(2)])

base64_repr_dv = dv.to_base64(compress=None, protocol='pickle')

dl_from_base64 = DocVec[SimpleDoc].from_base64(
    base64_repr_dv, compress=None, protocol='pickle'
)
For further guidance, check out the documentation section on serialization.
URL types such as AudioUrl, TextUrl and ImageUrl now validate the given file formats to check that they correspond to the expected MIME type.
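A minimal sketch of what this validation catches (the file names are illustrative):
from docarray import BaseDoc
from docarray.typing import ImageUrl

class MyDoc(BaseDoc):
    url: ImageUrl

doc = MyDoc(url='https://example.com/image.png')  # passes validation

# a non-image extension now fails validation for ImageUrl:
# MyDoc(url='https://example.com/sound.wav')  # raises a validation error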
Dynamically create a BaseDoc from schema (#1667)
Sometimes it can be useful to dynamically create a BaseDoc from the schema of an original BaseDoc. Using the methods create_pure_python_type_model and create_base_doc_from_schema, you can reconstruct the BaseDoc.
from docarray.utils.create_dynamic_doc_class import (
    create_base_doc_from_schema,
    create_pure_python_type_model,
)
from typing import Optional
from docarray import BaseDoc, DocList
from docarray.typing import AnyTensor
from docarray.documents import TextDoc

class MyDoc(BaseDoc):
    tensor: Optional[AnyTensor]
    texts: DocList[TextDoc]

# Due to the limitation of DocList as a Pydantic List, the `DocList` in MyDoc
# needs to be converted to `List` first.
MyDocPurePython = create_pure_python_type_model(MyDoc)
NewMyDoc = create_base_doc_from_schema(MyDocPurePython.schema(), 'MyDoc', {})

new_doc = NewMyDoc(tensor=None, texts=[TextDoc(text='text')])
Due to the breaking changes in Pydantic v2, we have capped the Pydantic version to avoid problems when installing DocArray.
After calling doc_list = doc_vec.to_doc_list(), doc_vec ends up in an unusable state, since its data has been transferred to doc_list. This fix gives users a more informative error message when they try to interact with doc_vec after it has been made unusable.
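A minimal sketch of the improved behavior:
from docarray import BaseDoc, DocVec

class MyDoc(BaseDoc):
    text: str

dv = DocVec[MyDoc]([MyDoc(text='hello')])
dl = dv.to_doc_list()

# dv's data now lives in dl; accessing dv afterwards raises an
# informative error instead of failing obscurely:
# dv.text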
We would like to thank all contributors to this release:
0.21.1
Release time: 2023-06-21 08:15:43
This release contains 1 bug fix.
These extra headers allow passing authentication keys to connect to a secured Weaviate instance, which WeaviateDocumentArray supports.
We would like to thank all contributors to this release:
0.34.0
Release time: 2023-06-21 08:15:43
This release contains 2 breaking changes, 3 new features, 11 bug fixes, and 2 documentation improvements.
:warning: :warning: DocArray will now require Python 3.8. We can no longer assure compatibility with Python 3.7.
We decided to drop it for two reasons, chief among them Python 3.7's end of life in June 2023.
DocVec Protobuf definition (#1639)
In order to fix a bug in the DocVec protobuf serialization described in #1561, we have changed the DocVec .proto definition.
This means that DocVec objects serialized with DocArray v0.33.0 or earlier cannot be deserialized with DocArray v0.34.0 or later, and vice versa.
:warning: :warning: We strongly recommend that everyone using Protobuf with DocVec upgrade to DocArray v0.34.0 or later.
You can now check if a document has already been indexed by using the in keyword:
from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import NdArray
import numpy as np

class MyDoc(BaseDoc):
    text: str
    embedding: NdArray[128]

docs = DocList[MyDoc](
    [MyDoc(text="Example text", embedding=np.random.rand(128)) for _ in range(2000)]
)

index = InMemoryExactNNIndex[MyDoc](docs)

assert docs[0] in index
assert MyDoc(text='New text', embedding=np.random.rand(128)) not in index
find_subindex in InMemoryExactNNIndex (#1617)
You can now use the find_subindex method with the ExactNN search DocIndex.
import numpy as np
from pydantic import Field

from docarray.index import InMemoryExactNNIndex
from docarray import BaseDoc, DocList
from docarray.typing import ImageUrl, VideoUrl, AnyTensor

class ImageDoc(BaseDoc):
    url: ImageUrl
    tensor_image: AnyTensor = Field(space='cosine', dim=64)

class VideoDoc(BaseDoc):
    url: VideoUrl
    images: DocList[ImageDoc]
    tensor_video: AnyTensor = Field(space='cosine', dim=128)

class MyDoc(BaseDoc):
    docs: DocList[VideoDoc]
    tensor: AnyTensor = Field(space='cosine', dim=256)

doc_index = InMemoryExactNNIndex[MyDoc]()
...

# find by the `ImageDoc` tensor when the index is populated
root_docs, sub_docs, scores = doc_index.find_subindex(
    np.ones(64), subindex='docs__images', search_field='tensor_image', limit=3
)
You can deserialize any DocVec protobuf message to any tensor type by passing the tensor_type parameter to from_protobuf. This means that you can choose at deserialization time whether you are working with NumPy, PyTorch, or TensorFlow tensors.
from docarray import BaseDoc, DocVec
from docarray.typing import TensorFlowTensor

class MyDoc(BaseDoc):
    tensor: TensorFlowTensor

da = DocVec[MyDoc](...)  # it doesn't matter what tensor_type is used here

proto = da.to_protobuf()
da_after = DocVec[MyDoc].from_protobuf(proto, tensor_type=TensorFlowTensor)
assert isinstance(da_after.tensor, TensorFlowTensor)
DBConfig to InMemoryExactNNSearch
InMemoryExactNNIndex used to take a single constructor parameter, index_file_path, unlike the rest of the indexers, which accepted their own DBConfig. Now index_file_path is part of the DBConfig, which allows initializing from it. This will let us extend this config if more parameters are needed.
The parameters of DBConfig can be passed at construction time as **kwargs, making this change compatible with old usage.
These two initializations are equivalent:
from docarray.index import InMemoryExactNNIndex
db_config = InMemoryExactNNIndex.DBConfig(index_file_path='index.bin')
index = InMemoryExactNNIndex[MyDoc](db_config=db_config)
index = InMemoryExactNNIndex[MyDoc](index_file_path='index.bin')
BaseDoc with Union type (#1655)
Serialization of BaseDoc types that have Union-type fields of Python native types is supported.
from typing import Union

from docarray import BaseDoc, DocList

class MyDoc(BaseDoc):
    union_field: Union[int, str]

docs1 = DocList[MyDoc]([MyDoc(union_field="hello")])
docs2 = DocList[MyDoc].from_dataframe(docs1.to_dataframe())
assert docs1 == docs2
When these Union types involve other BaseDoc types, an exception is thrown:

from docarray.documents import ImageDoc, TextDoc

class CustomDoc(BaseDoc):
    ud: Union[TextDoc, ImageDoc] = TextDoc(text='union type')

docs = DocList[CustomDoc]([CustomDoc(ud=TextDoc(text='union type'))])

# raises an exception
DocList[CustomDoc].from_dataframe(docs.to_dataframe())
HnswDocumentIndex (#1657, #1656)
If you call find or find_batched on an HnswDocumentIndex, the limit parameter will automatically be cast to int.
Moved default_column_config from RuntimeConfig to DBConfig (#1648)
default_column_config contains specific configuration information about the columns and tables inside the backend's database. This was previously part of RuntimeConfig, which caused an error because this information is required at initialization time. It has been moved to DBConfig, so you can edit it there:

from docarray.index import HnswDocumentIndex
import numpy as np

db_config = HnswDocumentIndex.DBConfig()
db_config.default_column_config.get(np.ndarray).update({'ef': 2500})
index = HnswDocumentIndex[MyDoc](db_config=db_config)
This bug caused raw Protobuf objects to be stored as DocVec columns after they were deserialized from Protobuf, making the data essentially inaccessible. This has now been fixed, and DocVec objects are identical before and after (de)serialization.
find and filter combination used in InMemoryExactNNIndex (#1642)
Hybrid search (find + filter) in InMemoryExactNNIndex was prioritizing low similarities (lower scores) in the returned matches. This was fixed by adding an option to sort matches in reverse order based on their scores.
# prepare a query
q_doc = MyDoc(embedding=np.random.rand(128), text='query')

query = (
    db.build_query()
    .find(query=q_doc, search_field='embedding')
    .filter(filter_query={'text': {'$exists': True}})
    .build()
)

results = db.execute_query(query)
# Before: results were sorted from worst to best matches
# Now: they are sorted in the correct order, showing better matches first
Using QdrantDocumentIndex to connect to a Qdrant DB initialized outside of DocArray raised a KeyError. This has been fixed, and now you can use QdrantDocumentIndex to connect to externally initialized collections.
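A hedged sketch of connecting to an externally created collection (the DBConfig parameters and collection name below are assumptions based on the DBConfig pattern shown earlier in these notes):
from docarray import BaseDoc
from docarray.index import QdrantDocumentIndex
from docarray.typing import NdArray

class MyDoc(BaseDoc):
    embedding: NdArray[128]

# point the index at a collection that was created outside of DocArray
db_config = QdrantDocumentIndex.DBConfig(
    host='localhost', collection_name='external_collection'
)
index = QdrantDocumentIndex[MyDoc](db_config=db_config)  # no longer raises KeyError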
Other fixes:
DocVec equality (#1641, #1663)
summary() called for LegacyDocument (#1637)
DocList and DocVec coercion (#1568)
update() on BaseDoc with tensor fields (#1628)
We would like to thank all contributors to this release: