:dart: Task-oriented embedding tuning for BERT, CLIP, etc.
This release covers Finetuner version 0.8.1, including dependency finetuner-core 0.13.10.
This release contains 1 new feature, 1 refactoring and 1 documentation improvement.
We have included jina-embedding-t-en-v1 in our list of supported models. This very small embedding model, comprising 14 million parameters, offers lightning-fast inference on CPUs.
In our experiments, it was able to encode 1730 sentences per second on a MacBook Pro Core-i5, making it well suited to edge devices. To use the Tiny model, follow these steps:
!pip install finetuner
import finetuner
model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
model=model,
data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
Remove typing-extensions from Finetuner dependencies
We have eliminated the dependency on typing-extensions due to compatibility issues when using Finetuner on Google Colab.
We have updated the documentation page to include information about jina-embedding-t-en-v1. We have also added technical reports and citation details to the README and documentation page.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.8.0, including dependency finetuner-core 0.13.9.
This release contains 1 new feature and 1 refactoring.
We have made contributions to the open-source community by releasing three pre-trained embedding models:
jina-embedding-s-en-v1: 35 million parameter compact embedding model.
jina-embedding-b-en-v1: 110 million parameter standard-sized embedding model.
jina-embedding-l-en-v1: 330 million parameter large embedding model.
We have trained all three models using Jina AI's Linnaeus-Clean dataset. This dataset consists of 380 million pairs of sentences in query-document pairs. These pairs were curated from a variety of domains in the Linnaeus-Full dataset through a thorough cleaning process. The Linnaeus-Full dataset contains 1.6 billion sentence pairs.
If you wish to use these embeddings with Finetuner, follow the instructions below:
!pip install finetuner
import finetuner
model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')
embeddings = finetuner.encode(
model=model,
data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
With the launch of Finetuner 0.8.0, installing it using pip install finetuner will automatically include the necessary torch-related dependencies. This lets Finetuner serve embedding models out of the box. If you intend to fine-tune an embedding model, make sure that you install Finetuner with all the additional dependencies by using the command pip install "finetuner[full]".
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.8, including dependencies finetuner-api 0.5.10 and finetuner-core 0.13.7.
This release contains 4 new features, 1 performance improvement, 1 refactoring, 2 bug fixes, and 1 documentation improvement.
We have added support for the multilingual embedding model distiluse-base-multi (a copy of distiluse-base-multilingual-cased-v1). It supports semantic search in Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
We now support data synthesis for datasets in languages other than English, specifically the ones supported by distiluse-base-multi (see above). To use them, pass the synthesis model synthesis_model_multi as the models parameter of the finetuner.synthesize function:
from finetuner.model import synthesis_model_multi
synthesis_run = finetuner.synthesize(
...
models=synthesis_model_multi,
)
We will soon publish select fine-tuned models to the Hugging Face Hub. With the new Finetuner version, you can now load those models directly:
import finetuner
model = finetuner.get_model('jinaai/ecommerce-sbert-model')
e1, e2 = finetuner.encode(model, ['XBox', 'Xbox One Console 500GB - Black (2015)'])
Previously, tracking callbacks like WandBLogger did not consider the evaluation results of the model before fine-tuning, because they only started tracking when the actual model tuning started. Now, we have added an option log_zero_shot to those callbacks (which is True by default). When enabled, this makes Finetuner send evaluation metrics calculated before training to the tracking service used during training.
We optimized data synthesis to reduce its memory consumption, which enables synthesis jobs to run on larger datasets and reduces the run-time of fine-tuning jobs using synthesized training data.
Default num_relations changed from 3 to 10 for data synthesis jobs. (https://github.com/jina-ai/finetuner/pull/750)
Data synthesis jobs are more effective if a large amount of training data is generated from small and medium-sized query datasets. Therefore, we have increased the default number of triplets generated for each query from 3 to 10. If you run data synthesis jobs with a large number of queries (>1M), you should consider resetting the num_relations parameter to a lower number.
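To get a feel for why a lower value may be worth considering for large query sets, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: the helper name estimated_triplets is hypothetical, and it assumes every query yields exactly num_relations triplets.

```python
def estimated_triplets(num_queries, num_relations=10):
    """Rough estimate, assuming each query yields exactly num_relations triplets."""
    if num_queries < 0 or num_relations < 0:
        raise ValueError('counts must be non-negative')
    return num_queries * num_relations


# Compare the old default (3) with the new default (10).
for num_queries in (10_000, 1_000_000):
    old = estimated_triplets(num_queries, num_relations=3)
    new = estimated_triplets(num_queries)
    print(f'{num_queries} queries: {old} triplets at 3, {new} triplets at 10')
```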
The cross-encoder model that we previously used for English data synthesis was actually a multilingual one. By using an English one instead, we produce higher-quality synthetic training data, and the resulting embedding models achieve better evaluation results.
Data synthesis jobs can accept either a named DocumentArray object stored on Jina AI Cloud or a list of text values. However, passing file paths to locally stored DocumentArray datasets failed. This release fixes that bug.
We have added documentation on how to apply data synthesis to datasets that include materials in languages other than English.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.7, including dependencies finetuner-api 0.5.9 and finetuner-core 0.13.5.
This release contains 2 new features, 2 refactorings, 3 bug fixes, and 1 documentation improvement.
In this release of Finetuner, we have introduced a training data synthesis feature. This feature is particularly useful for users in the e-commerce domain, who may have difficulty obtaining enough labeled training data.
This feature allows you to use historical queries collected from your search system, along with your articles, to generate training data:
import finetuner
from finetuner.model import synthesis_model_en
synthesis_run = finetuner.synthesize(
query_data='finetuner/xmarket_queries_da',
corpus_data='finetuner/xmarket_corpus_da',
models=synthesis_model_en,
)
Once the synthesis job is done, you can get the training data with:
train_data_name = synthesis_run.train_data
And then, you can continue fine-tuning your embedding model with the generated training data:
training_run = finetuner.fit(
model='bert-base-en',
train_data=synthesis_run.train_data,
loss='MarginMSELoss',
...,
)
EvaluationCallback
In order to facilitate the training and evaluation of large language models (LLMs) using Finetuner, we have made significant changes to the EvaluationCallback.
These changes now enable evaluation on multiple datasets. Users can now use the caption parameter of EvaluationCallback to get output that labels which dataset each evaluation corresponds to:
import finetuner
from finetuner.callback import EvaluationCallback
finetuner.fit(
...,
callbacks=[
EvaluationCallback(
query_data='query-1',
index_data='index-1',
caption='dataset-1',
),
EvaluationCallback(
query_data='query-2',
index_data='index-2',
caption='dataset-2',
),
]
)
To avoid displaying "0.000" for very small loss values, the display precision has been increased.
In order to enhance the readability of the logs, we have excluded debugging messages generated by the PIL package.
batch_size for text models.
This pull request resolves a bug where the batch size finder would overestimate the maximum usable batch size for text models like BERT. This was likely to happen when users fine-tuned the bert-base-en model without specifying batch_size.
None error in EvaluationCallback.
Runs set up with automatic batch-size configuration and automatic evaluation callback previously passed the value None to EvaluationCallback as batch_size. This resulted in a division by None error.
EvaluationCallback.
When the evaluation data contained queries without any matches, Finetuner was previously unable to calculate any metrics, which led to division by zero errors. This has been fixed in this release.
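One way to guard against this kind of failure is to skip queries that have no ground-truth matches when averaging a metric. The sketch below illustrates the idea with recall@k. The function name and data layout are assumptions made for illustration and do not reflect Finetuner's internal implementation.

```python
def mean_recall_at_k(results, relevant, k=3):
    """Average recall@k over queries that have at least one relevant item.

    results:  dict mapping a query to its ranked list of retrieved ids
    relevant: dict mapping a query to the set of relevant ids (may be empty)
    """
    scores = []
    for query, ranked in results.items():
        matches = relevant.get(query, set())
        if not matches:
            # Recall is undefined without matches, so skip the query
            # instead of dividing by zero.
            continue
        hits = len(set(ranked[:k]) & matches)
        scores.append(hits / len(matches))
    # Return None when no query could be scored at all.
    return sum(scores) / len(scores) if scores else None


results = {'q1': ['a', 'b', 'c'], 'q2': ['d', 'e', 'f']}
relevant = {'q1': {'a', 'x'}, 'q2': set()}
print(mean_recall_at_k(results, relevant))
```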
We have provided a tutorial for the new data synthesis module.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.6, including dependencies finetuner-api 0.5.6 and finetuner-core 0.13.4.
This release contains 2 refactorings and 1 bug fix.
Warning
Due to the release of DocArray v2, which is not yet compatible with Finetuner, all previous versions of Finetuner will break when they update DocArray automatically. It is strongly recommended that you upgrade to this release because previous versions will not work.
Previously, when fine-tuning with a vision backbone, the PIL package would generate numerous warning messages that cluttered Finetuner's logs. This issue has been resolved, and PIL warnings are now filtered out.
Previously, the progress bar would display a loss value of 0 when it was too small. To address this issue, we now use a higher precision when the loss value is too small.
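The idea of switching to a higher precision for tiny values can be sketched as below. The helper format_loss and its fallback to scientific notation are illustrative assumptions; the exact formatting Finetuner uses is not described here.

```python
def format_loss(loss, digits=3):
    """Use fixed-point notation unless it would round a non-zero loss to zero."""
    text = f'{loss:.{digits}f}'
    if loss != 0 and float(text) == 0:
        # Fall back to scientific notation so significant digits stay visible.
        return f'{loss:.{digits}e}'
    return text


print(format_loss(0.1234))
print(format_loss(0.0000042))
```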
Previously, Finetuner automatically installed the latest version of DocArray. However, DocArray v2 has now been released and is a breaking change that is incompatible with the current version of Finetuner. Finetuner only supports DocArray up to version 0.3.0. Please upgrade your Finetuner to the latest version to resolve this issue.
We will release a version of Finetuner compatible with DocArray v2 in the immediate future.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.5, including dependencies finetuner-api 0.5.6 and finetuner-core 0.13.3.
This release contains 2 refactorings and 2 bug fixes.
Previously, when a fine-tuning job was completed and the get_model function was called, we would construct the model, load the pre-trained weights, and then overwrite them with fine-tuned weights. We have now disabled the downloading of pre-trained weights, which speeds up the get_model function and eliminates unneeded network traffic.
Before creating a Run, users are required to call finetuner.login() and use third-party authentication to log in. Previously, if they had not already done so, they would receive an error message that did not tell them to log in. We now display a more informative error message in the event that a user forgets to log in or their login attempt was unsuccessful.
When users request a model by name, they use names with the format name-size-lang, for example: bert-base-en. However, these names were not included in our internal schema for validation and jobs would fail validation. This has now been rectified.
In the past, BatchSizeFinder was unable to properly select batch sizes for PointNet++ models. This has been fixed.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.4, including dependencies finetuner-api 0.5.5 and finetuner-core 0.13.0.
This release contains 2 new features, 3 refactorings, 1 bug fix, and 3 documentation improvements.
Layer-wise learning rate decay (LLRD) sets a large learning rate for the top (last) layer and uses a multiplicative decay rate to decrease the learning rate layer-by-layer from top (last) to bottom (first). With high learning rates, the features recognized by the top layers change more and adapt to new tasks more easily, while the bottom layers have low learning rates and more easily preserve the features learned during pre-training:
import finetuner
run = finetuner.fit(
...,
optimizer='Adam',
+ optimizer_options={'layer_wise_lr_decay': 0.98},
...,
)
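To make the effect of a decay rate such as 0.98 concrete, the following sketch computes per-layer learning rates. The base learning rate of 1e-4, the 12 layers, and the helper name are assumptions chosen for illustration. How Finetuner groups parameters into layers internally is not specified here.

```python
def layer_wise_learning_rates(base_lr, num_layers, decay):
    """Learning rates ordered from the bottom (first) to the top (last) layer.

    The top layer keeps base_lr; each layer below it is scaled by `decay` once more.
    """
    if num_layers < 1:
        raise ValueError('num_layers must be at least 1')
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


rates = layer_wise_learning_rates(base_lr=1e-4, num_layers=12, decay=0.98)
for index, lr in enumerate(rates):
    print(f'layer {index:2d}: lr = {lr:.3e}')
```

With a decay close to 1.0 the bottom layers still train at a similar pace to the top ones. Smaller decay values freeze the lower layers more aggressively.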
In the previous release, Finetuner added support for training with data that is not specifically labeled, but each pair of training items has a numerical similarity score between 0.0 (totally dissimilar) and 1.0 (totally the same). This extends the potential scenarios for which Finetuner is applicable.
Now, users can prepare training data with a CSV file like this:
The weather is nice, The weather is beautiful, 0.9
The weather is nice, The weather is bad, 0
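Before uploading such a file, it can help to check that every row has two texts and a score in the expected range. The sketch below does this with Python's csv module. The function read_scored_pairs is a hypothetical helper, not part of Finetuner, and it assumes the texts themselves contain no unquoted commas.

```python
import csv
import io

raw = (
    'The weather is nice, The weather is beautiful, 0.9\n'
    'The weather is nice, The weather is bad, 0\n'
)


def read_scored_pairs(stream):
    """Parse rows of (text, text, score) and validate each score."""
    pairs = []
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.reader(stream, skipinitialspace=True):
        if len(row) != 3:
            raise ValueError(f'expected 3 columns, got {len(row)}: {row}')
        first, second, score_text = row
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f'score out of range [0, 1]: {score}')
        pairs.append((first, second, score))
    return pairs


pairs = read_scored_pairs(io.StringIO(raw))
print(pairs)
```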
We unified backbone names into the name-size-[lang] format. Now model names are easier to understand and remember. For example: bert-base-cased becomes bert-base-en, openai/clip-vit-base-patch32 becomes clip-base-en, resnet152 becomes resnet-large. Users can see the supported backbones with:
finetuner.describe_models(task='text-to-image')
Note that you can keep using the old names, for example:
# build a zero-shot resnet embedding model
model = finetuner.build_model('resnet50')
# this is identical to
model = finetuner.build_model('resnet-base')
To simplify our offerings and create a more intuitive user experience, we have decreased the number of supported machine-learning models. By doing so, we hope to make it easier for our users to navigate our platform and find the models that best suit their needs.
Default batch_size for EvaluationCallback changed from 8 to 32.
We have increased the batch size used by EvaluationCallback. This significantly speeds up evaluation, which in turn reduces the overall cost of evaluation.
IndexError in EvaluationCallback when 'gather_examples' is enabled.
We resolved an issue in the EvaluationCallback where a 'list out of range' error occurred when 'gather_examples' was enabled. This fix ensures that the evaluation callback works correctly with 'gather_examples' and lets users gather evaluation examples without encountering errors.
CosineSimilarityLoss (#697)
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.3, including dependencies finetuner-api 0.5.4 and finetuner-core 0.12.9.
This release contains 4 new features, 1 refactoring, and 1 bug fix.
It can be complicated to find a good batch size for the finetuner.fit function. If you choose a value that's too small, fine-tuning may not be very effective; if you choose a value that's too large, the job may run out of memory and fail. To make this easier, Finetuner now sets the batch size automatically if you leave out the batch_size parameter to finetuner.fit or set it to None. This will choose the largest batch size supported by the current CUDA device.
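A common way to search for the largest workable batch size is to double the size until a trial fails and then bisect between the last success and the first failure. The sketch below shows that strategy with a simulated memory check. It is an illustration under stated assumptions, not Finetuner's actual algorithm: the fits predicate, the 60 MB per sample, and the 16000 MB budget are all made up, and a real predicate would run a training step and catch an out-of-memory error.

```python
def find_max_batch_size(fits, start=1, limit=4096):
    """Largest size in [start, limit] for which fits(size) is True, else None.

    Assumes start >= 1 and that fits is monotone: if a size fits,
    every smaller size fits as well.
    """
    if not fits(start):
        return None
    low = start   # largest size known to fit
    high = None   # smallest size known to fail
    # Phase 1: grow exponentially until a size fails or the limit is reached.
    while high is None and low < limit:
        candidate = min(low * 2, limit)
        if fits(candidate):
            low = candidate
        else:
            high = candidate
    if high is None:
        return low
    # Phase 2: bisect between the last success and the first failure.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


def simulated_fits(size):
    # Hypothetical memory model: 60 MB per sample, 16000 MB available.
    return size * 60 <= 16_000


print(find_max_batch_size(simulated_fits))
```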
You no longer need to retrieve the logs of Finetuner runs or manually unpack fine-tuned models to get the evaluation metrics. Now, you can get the metrics directly from the Run.metrics function:
run = finetuner.fit(...)
metrics = run.metrics()
To print nicely formatted evaluation metrics to the console, use the Run.display_metrics() function. This will print tables showing evaluation metrics before and after fine-tuning.
In addition to evaluation metrics, you may find it helpful to see actual query results. We have introduced a new parameter gather_examples to the evaluation callback to make this easy. If this parameter is set to True, the evaluation callback also tracks the Top-K results for some example queries sampled from the query dataset:
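import finetuner
from finetuner.callback import EvaluationCallback

run = finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-data-name',
            index_data='index-data-name',
            gather_examples=True,  # also collect Top-K results for sampled queries
        )
    ],
    ...
)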
Like the evaluation metrics, you can retrieve the query results, before and after fine-tuning, with the Run.example_results function or print them on the console using Run.display_examples.
Finetuner now supports training with data that is not explicitly labeled but in which each pair of training items has a numerical similarity score between 0.0 (totally dissimilar) and 1.0 (totally the same). This extends the potential scenarios for which Finetuner is applicable.
For example, you can now use DocArray to prepare data pairs with scores like this:
from docarray import Document, DocumentArray

# Each parent Document holds one pair of texts as chunks and the pair's score in its tags.
d1 = Document(
    chunks=[Document(text='I am driving to Los Angeles'), Document(text='I am driving to Hollywood')],
    tags={'finetuner_score': 0.9},
)
d2 = Document(
    chunks=[Document(text='I am driving to Los Angeles'), Document(text='I am flying to New York')],
    tags={'finetuner_score': 0.3},
)
...
train_data = DocumentArray([d1, d2, ...])
Then, use CosineSimilarityLoss as the loss function in the finetuner.fit function:
finetuner.fit(
model='sentence-transformers/msmarco-distilbert-base-v3',
train_data=train_data,
loss='CosineSimilarityLoss',
...
)
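For intuition about what a loss of this kind optimizes, one common formulation is the mean squared error between the cosine similarity of each embedding pair and its target score. The sketch below implements that formulation in plain Python. Whether Finetuner's CosineSimilarityLoss matches it exactly is an assumption, and the toy two-dimensional vectors stand in for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(pairs):
    """Mean squared error between cosine similarity and the target score.

    pairs: list of (embedding_a, embedding_b, target_score) tuples.
    """
    errors = [(cosine(a, b) - score) ** 2 for a, b, score in pairs]
    return sum(errors) / len(errors)


well_aligned = [([1.0, 0.0], [1.0, 0.0], 1.0), ([1.0, 0.0], [0.0, 1.0], 0.0)]
misaligned = [([1.0, 0.0], [0.0, 1.0], 1.0)]
print(cosine_similarity_loss(well_aligned))
print(cosine_similarity_loss(misaligned))
```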
In the future, we will also support data with scores in CSV format.
Previously, users could only run three jobs in parallel. This limit has been removed.
Previously, the logs of finished fine-tuning jobs were lost after some length of time. Now, logs will remain available indefinitely.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.2, including dependencies finetuner-api 0.5.2 and finetuner-core 0.12.7.
This release contains 2 new features, 4 refactorings, 1 bug fix, and 3 documentation improvements.
This PR adds support for learning rate schedulers. A scheduler adjusts the learning rate during training. We support 6 learning rate schedulers: linear, cosine, cosine_with_restarts, polynomial, constant and constant_with_warmup.
When a scheduler is configured, the learning rate is by default adjusted after each batch.
Alternatively, you can set scheduler_options = {'scheduler_step': 'epoch'} to adjust the learning rate after each epoch instead.
You can use them by specifying their name in the scheduler parameter of the fit function.
run = finetuner.fit(
...,
scheduler='linear',
scheduler_options={'scheduler_step': 'batch'},
...
)
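The difference between stepping the scheduler per batch and per epoch can be illustrated with a simple linear decay to zero. The sketch below is a simplified model without warmup, and the helper name learning_rates is hypothetical. Finetuner's linear scheduler may differ in details such as warmup handling.

```python
def learning_rates(epochs, batches_per_epoch, base_lr, scheduler_step='batch'):
    """Learning rate applied to every batch under a linear decay to zero."""
    if scheduler_step not in ('batch', 'epoch'):
        raise ValueError("scheduler_step must be 'batch' or 'epoch'")
    # Total number of scheduler updates over the whole run.
    total = epochs * batches_per_epoch if scheduler_step == 'batch' else epochs
    rates = []
    ticks = 0  # scheduler updates performed so far
    for _ in range(epochs):
        for _ in range(batches_per_epoch):
            rates.append(base_lr * (total - ticks) / total)
            if scheduler_step == 'batch':
                ticks += 1
        if scheduler_step == 'epoch':
            ticks += 1
    return rates


print(learning_rates(2, 2, 1.0, scheduler_step='batch'))
print(learning_rates(2, 2, 1.0, scheduler_step='epoch'))
```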
steps_per_interval in EvaluationCallback
When working with large datasets, you may want to perform evaluations multiple times during each epoch. This parameter lets you specify the number of batches after which an evaluation should be performed.
If set to None, an evaluation is performed only at the end of each epoch.
run = finetuner.fit(
...,
callbacks=[
EvaluationCallback(
query_data=...,
index_data=...,
steps_per_interval=3, # evaluate every 3 batches.
),
],
...
)
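The following sketch lists the points at which evaluations would run for a given setting. It assumes that the batch counter restarts at every epoch and that no extra end-of-epoch evaluation is added when steps_per_interval is set. Both are assumptions made for illustration rather than documented behavior, and evaluation_points is a hypothetical helper.

```python
def evaluation_points(num_epochs, batches_per_epoch, steps_per_interval=None):
    """Return (epoch, batch) positions after which an evaluation would run."""
    if steps_per_interval is not None and steps_per_interval < 1:
        raise ValueError('steps_per_interval must be a positive integer or None')
    points = []
    for epoch in range(num_epochs):
        for batch in range(1, batches_per_epoch + 1):
            if steps_per_interval is None:
                # Default: evaluate only once the epoch is complete.
                if batch == batches_per_epoch:
                    points.append((epoch, batch))
            elif batch % steps_per_interval == 0:
                points.append((epoch, batch))
    return points


print(evaluation_points(1, 7, steps_per_interval=3))
print(evaluation_points(2, 4))
```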
scheduler_step becomes part of scheduler_options (#679)
We removed the scheduler_step argument from the fit function; it is now part of scheduler_options.
run = finetuner.fit(
...,
scheduler='linear',
- scheduler_step='batch',
+ scheduler_options={'scheduler_step': 'batch'},
...
)
epochs and batch_size in cloud.jina.ai
For Web UI users, we have reduced the default epochs from 10 to 5, and reduced the default batch_size from 128 to 64 to avoid out-of-memory errors from 3D-mesh fine-tuning.
Add more textual guidance on creating Finetuner runs in the Jina AI Cloud UI.
Finetuner now groups query-document pairs by their queries, thereby eliminating duplicate queries, when parsing CSV files. This leads to more effective fine-tuning.
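The grouping step can be pictured as collecting all documents that share a query under a single key. The sketch below shows this with a dictionary. The function group_by_query and the sample rows are illustrative and do not reproduce Finetuner's CSV parser.

```python
from collections import defaultdict


def group_by_query(pairs):
    """Map each distinct query to the list of its documents, in first-seen order."""
    grouped = defaultdict(list)
    for query, document in pairs:
        # Skip exact duplicate pairs so each document appears once per query.
        if document not in grouped[query]:
            grouped[query].append(document)
    return dict(grouped)


rows = [
    ('red shoes', 'Red running shoes'),
    ('laptop bag', 'Leather laptop sleeve'),
    ('red shoes', 'Crimson sneakers'),
    ('red shoes', 'Red running shoes'),
]
print(group_by_query(rows))
```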
GeM pooler.
This PR removes the output_dim argument from the GeM pooler's forward function. You can use the GeM pooler together with ArcFaceLoss to deliver better visual embedding quality.
run = finetuner.fit(
...,
model_options = {
...
- 'output_dim': 512,
+ 'pooler': 'GeM',
+ 'pooler_options': {'p': 2.4, 'eps': 1e-5}
}
)
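For readers unfamiliar with the p and eps options, generalized mean (GeM) pooling is usually defined per feature channel as the p-th root of the mean of the p-th powers of the activations, with values clamped to at least eps. The sketch below applies that standard definition to a flat list of activations. It is not Finetuner's implementation, and gem_pool is a hypothetical name.

```python
def gem_pool(values, p=2.4, eps=1e-5):
    """Generalized mean of one channel: (mean(max(v, eps) ** p)) ** (1 / p)."""
    if not values:
        raise ValueError('values must not be empty')
    # Clamp to eps so that fractional powers stay well defined.
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


activations = [1.0, 2.0, 3.0]
print(gem_pool(activations, p=1))     # behaves like average pooling
print(gem_pool(activations, p=2.4))   # between average and max pooling
print(gem_pool(activations, p=50))    # approaches max pooling
```

Larger values of p weight strong activations more heavily, which is why the result moves from the average towards the maximum.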
GeM pooling (#684)
We have added a new section to our documentation which explains the pooling options in more detail.
ArcFaceLoss (#680)
We have added a new page to our documentation which demonstrates ArcFaceLoss on the Stanford Cars dataset.
We have added a new section to our documentation explaining how to create training data made of query-document pairs instead of explicitly annotated and labeled data.
We would like to thank all contributors to this release:
This release covers Finetuner version 0.7.1, including dependencies finetuner-api 0.5.0 and finetuner-core 0.12.6.
This release contains 2 new features, 3 refactorings, 3 bug fixes, and 4 documentation improvements.
SphereFace loss functions were first formulated for computer vision tasks, specifically face recognition. Finetuner supports two variations of this loss function, ArcFaceLoss and CosFaceLoss. Instead of attempting to minimize the distance between positive pairs and maximize the distance between negative pairs, the SphereFace loss functions compare each sample with an estimate of the center point of each class's embeddings.
Like all supported loss functions, you can use them by specifying their name in the loss parameter of the fit function.
run = finetuner.fit(
...,
loss='ArcFaceLoss',
...
)
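To illustrate the comparison against class centers, the sketch below computes ArcFace-style logits for a single sample: the cosine between the normalized embedding and each normalized class center, with an additive angular margin applied to the target class before scaling. The margin of 0.5 and scale of 64 are illustrative values, the function name is hypothetical, and the sketch omits the handling that real implementations add for angles that exceed pi after the margin is applied. It is not Finetuner's implementation.

```python
import math


def arcface_logits(embedding, class_centers, target, margin=0.5, scale=64.0):
    """Scaled cosine logits with an additive angular margin on the target class."""

    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    sample = normalize(embedding)
    logits = []
    for index, center in enumerate(class_centers):
        unit_center = normalize(center)
        cos_theta = sum(x * y for x, y in zip(sample, unit_center))
        # Guard against rounding slightly outside the domain of acos.
        cos_theta = max(-1.0, min(1.0, cos_theta))
        if index == target:
            # Penalize the target class by widening its angle to the sample.
            cos_theta = math.cos(math.acos(cos_theta) + margin)
        logits.append(scale * cos_theta)
    return logits


logits = arcface_logits([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], target=0)
print(logits)
```

These logits would then be fed into a softmax cross-entropy loss, so the margin forces embeddings to sit closer to their own class center than plain cosine similarity would require.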
To track and refine our estimate of the class center points across batches, these SphereFace loss functions require an additional optimizer during training. By default, the type of optimizer used will be the same as the one used for the model itself, but you can also choose a different optimizer for your loss function using the loss_optimizer parameter.
run = finetuner.fit(
...,
loss='ArcFaceLoss',
+ loss_optimizer='Adam',
+ loss_optimizer_options={'weight_decay': 0.01}
)
You can now start fine-tuning from a model produced by a previous Run, for example to continue training after collecting new training data. To use this feature, set the artifact id of the model you want to continue training from via the model_artifact parameter of the fit function:
train_data = 'path/to/another/data.csv'
new_run = finetuner.fit(
model='efficientnet_b0',
train_data=train_data,
model_artifact=previous_run.artifact_id,
)
Due to low usage, we removed CLIP models which are based on ResNet.
For image-to-image search, we now support EfficientNet B7 as a backbone model.
For Web UI users, we have increased the upload file size from 1MB to 32MB. Python client users have always been able to upload much larger datasets and are unaffected by this change.
A new SQLAlchemy release caused the MLFlow callback to behave incorrectly in some cases. This release fixes the problem.
num_items_per_class Parameter
Some loss functions do not use the num_items_per_class parameter. In some cases, it is possible for users to set this parameter in a way that is incompatible with the rest of the configuration and cause Finetuner to fail. Now the parameter is only validated if it is actually used, and for loss functions that do not use it, it is completely ignored.
Sometimes, when calling finetuner.login() in a Jupyter notebook, login would appear successful, but Finetuner might not always behave correctly. Previously, users had to call finetuner.login(force=True) to be sure they were correctly logged in. This problem has been resolved, and finetuner.login() works correctly without the force flag.
We have added a new page to our documentation which explains several loss functions and the pooling options in more detail.
We have added to our README a list of articles that make use of Finetuner and provide more insights into using Finetuner in practice.
If you need example training datasets that have already been prepared for use in Finetuner, you can look at the dataset folder in our repository.
We repaired broken links and fixed typos found in the Finetuner documentation.
We would like to thank all contributors to this release: