Finetuner Versions

🎯 Task-oriented embedding tuning for BERT, CLIP, etc.

v0.8.1

9 months ago

Release Note Finetuner 0.8.1

This release covers Finetuner version 0.8.1, including dependency finetuner-core 0.13.10.

This release contains 1 new feature, 1 refactoring and 1 documentation improvement.

🆕 Features

Add Jina Tiny Embedding model

We have included jina-embedding-t-en-v1 in our list of supported models. This very small embedding model, comprising 14 million parameters, offers lightning-fast inference on CPUs.

In our experiments, it encoded 1730 sentences per second on a MacBook Pro with a Core i5 CPU, making it well suited to edge devices. To use the Tiny model, follow these steps:

!pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-t-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
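
For readers unfamiliar with the score printed above, here is a minimal sketch of cosine similarity in plain Python. It assumes that finetuner.cos_sim computes the standard cosine of the angle between two vectors; the helper name cosine_similarity is hypothetical and not part of Finetuner.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings have many more dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0, so two paraphrased sentences should produce a score near the upper end of that range.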

⚙ Refactoring

Remove `typing-extensions` from Finetuner dependencies

We have removed the dependency on typing-extensions because it caused compatibility issues when using Finetuner on Google Colab.

📗 Documentation Improvements

Add Tiny model and technical report to Finetuner Readme and Docs. (https://github.com/jina-ai/finetuner/pull/763)

We have updated the documentation page to include information about jina-embedding-t-en-v1. We have also added technical reports and citation details to the README and documentation page.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)

v0.8.0

10 months ago

Release Note Finetuner 0.8.0

This release covers Finetuner version 0.8.0, including dependency finetuner-core 0.13.9.

This release contains 1 new feature and 1 refactoring.

🆕 Features

Add Jina embeddings suite (https://github.com/jina-ai/finetuner/pull/757)

We have made contributions to the open-source community by releasing three pre-trained embedding models:

jina-embedding-s-en-v1: 35 million parameter compact embedding model.
jina-embedding-b-en-v1: 110 million parameter standard-sized embedding model.
jina-embedding-l-en-v1: 330 million parameter large embedding model.

We trained all three models on Jina AI's Linnaeus-Clean dataset, which consists of 380 million query-document sentence pairs. These pairs were curated from a variety of domains in the Linnaeus-Full dataset, which contains 1.6 billion sentence pairs, through a thorough cleaning process.

If you wish to use these embeddings with Finetuner, follow the instructions below:

!pip install finetuner
import finetuner

model = finetuner.build_model('jinaai/jina-embedding-s-en-v1')
embeddings = finetuner.encode(
    model=model,
    data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))

⚙ Refactoring

Change installation behavior (https://github.com/jina-ai/finetuner/pull/757)

As of Finetuner 0.8.0, installing it with pip install finetuner automatically includes the necessary torch-related dependencies, so that Finetuner can be used directly as a provider of embedding models. If you intend to fine-tune an embedding model, install Finetuner with all the additional dependencies by running pip install "finetuner[full]".

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)
Jonathan Geuter (@j-geuter)

v0.7.8

11 months ago

Release Note Finetuner 0.7.8

This release covers Finetuner version 0.7.8, including dependencies finetuner-api 0.5.10 and finetuner-core 0.13.7.

This release contains 4 new features, 1 performance improvement, 1 refactoring, 2 bug fixes, and 1 documentation improvement.

🆕 Features

Add multilingual text encoder models

We have added support for the multilingual embedding model distiluse-base-multi (a copy of distiluse-base-multilingual-cased-v1). It supports semantic search in Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.

Add multilingual model for training data synthesis jobs (https://github.com/jina-ai/finetuner/pull/750)

We now support data synthesis for datasets in languages other than English, specifically the ones supported by distiluse-base-multi (see above). To use them, pass the synthesis model synthesis_model_multi as the models parameter of the finetuner.synthesize function:

import finetuner
from finetuner.model import synthesis_model_multi

synthesis_run = finetuner.synthesize(
    ...
    models=synthesis_model_multi,
)

Support loading models directly from Jina's huggingface site (https://github.com/jina-ai/finetuner/pull/751)

We will soon publish select fine-tuned models to the Hugging Face Hub. With the new Finetuner version, you can load those models directly:

import finetuner

model = finetuner.get_model('jinaai/ecommerce-sbert-model')
e1, e2 = finetuner.encode(model, ['XBox', 'Xbox One Console 500GB - Black (2015)'])

Add an option to the tracking callback to include zero-shot metrics in logging.

Previously, tracking callbacks like WandBLogger did not consider the evaluation results of the model before fine-tuning, because they only start the tracking when the actual model tuning starts. Now, we add an option log_zero_shot to those callbacks (which is True by default). When enabled, this makes Finetuner send evaluation metrics calculated before training to the tracking service used during training.

🚀 Performance

Reduce memory consumption during data synthesis and make the resulting dataset more compact

We optimized data synthesis to reduce its memory consumption, which enables synthesis jobs to run on larger datasets and reduces the run-time of fine-tuning jobs using synthesized training data.

⚙ Refactoring

Increase the default `num_relations` from 3 to 10 for data synthesis jobs. (https://github.com/jina-ai/finetuner/pull/750)

Data synthesis jobs are more effective if a large amount of training data is generated from small and medium-sized query datasets. Therefore, we have increased the default number of triplets generated for each query from 3 to 10. If you run data synthesis jobs with a large number of queries (>1M), you should consider resetting the num_relations parameter to a lower number.
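
To judge whether the new default is too large for your query set, a rough estimate of the output size can help. The sketch below only multiplies the number of queries by num_relations, as described above; the function name estimated_triplets is hypothetical and not part of Finetuner.

```python
def estimated_triplets(num_queries, num_relations=10):
    """Estimate how many triplets a synthesis job generates in total."""
    if num_queries < 0 or num_relations < 0:
        raise ValueError("counts must be non-negative")
    return num_queries * num_relations


# One million queries with the new and the old default.
with_new_default = estimated_triplets(1_000_000)
with_old_default = estimated_triplets(1_000_000, num_relations=3)
print(with_new_default, with_old_default)
```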

🐞 Bug Fixes

Change the English cross-encoder model from multi-lingual to an actual English model.

The English cross-encoder model which we used was actually a multi-lingual one. By using an English one instead, we produce higher-quality synthetic training data and the resulting embedding models achieve better evaluation results.

Fix create synthesis run not accepting DocumentArray as input type. (https://github.com/jina-ai/finetuner/pull/748)

We noticed that data synthesis jobs can accept either a named DocumentArray object stored on Jina AI Cloud or a list of text values. However, passing file paths to locally stored DocumentArray datasets failed. This bug is fixed by this release.

📗 Documentation Improvements

Update data synthesis tutorial including English and multilingual models. (https://github.com/jina-ai/finetuner/pull/750)

We have added documentation on how to apply data synthesis to datasets that include materials in languages other than English.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)
Jonathan Geuter (@j-geuter)

v0.7.7

11 months ago

Release Note Finetuner 0.7.7

This release covers Finetuner version 0.7.7, including dependencies finetuner-api 0.5.9 and finetuner-core 0.13.5.

This release contains 2 new features, 2 refactorings, 3 bug fixes, and 1 documentation improvement.

🆕 Features

Training data synthesis (#715)

In this release of Finetuner, we have introduced a training data synthesis feature. This feature is particularly useful for users in the e-commerce domain, who may have difficulty obtaining enough labeled training data.

This feature allows you to use historical queries collected from your search system, along with your articles, to generate training data:

import finetuner
from finetuner.model import synthesis_model_en

synthesis_run = finetuner.synthesize(
    query_data='finetuner/xmarket_queries_da',
    corpus_data='finetuner/xmarket_corpus_da',
    models=synthesis_model_en,
)

Once the synthesis job is done, you can get the training data with:

train_data_name = synthesis_run.train_data

And then, you can continue fine-tuning your embedding model with the generated training data:

training_run = finetuner.fit(
    model='bert-base-en',
    train_data=synthesis_run.train_data,
    loss='MarginMSELoss',
    ...,
)

Evaluation on multiple datasets in `EvaluationCallback`

In order to facilitate the training and evaluation of large language models (LLMs) using Finetuner, we have made significant changes to the EvaluationCallback.

These changes now enable evaluation on multiple datasets. Users can now use the caption parameter to EvaluationCallback to get output that labels which dataset each evaluation corresponds to:

import finetuner
from finetuner.callback import EvaluationCallback

finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-1',
            index_data='index-1',
            caption='dataset-1',
        ),
        EvaluationCallback(
            query_data='query-2',
            index_data='index-2',
            caption='dataset-2',
        ),
    ]
)

⚙ Refactoring

Display small loss values with higher precision.

To avoid displaying "0.000" for very small loss values, the display precision has been increased.

Filter PIL debugging messages from logging stack.

In order to enhance the readability of the logs, we have excluded debugging messages generated by the PIL package.

🐞 Bug Fixes

No longer overestimate the `batch_size` for text models.

This pull request resolves a bug where the batch size finder would incorrectly overestimate the maximum usable batch size for text models like BERT. This is likely to happen when users fine-tune the bert-base-en model without specifying batch_size.

Fix division by `None` error in `EvaluationCallback`.

Runs set up with automatic batch-size configuration and automatic evaluation callback previously passed the value None to EvaluationCallback as batch_size. This resulted in a division by None error.

Filter out queries that do not have any matches in `EvaluationCallback`.

When the evaluation data contained queries without any matches, Finetuner was previously unable to calculate any metrics, which led to division by zero errors. This has been fixed in this release.
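
The idea behind this fix can be sketched as follows. This is an illustration only, not Finetuner's evaluation code: the function name mean_recall_at_k and the dictionary-based data layout are assumptions made for the example.

```python
def mean_recall_at_k(results, relevant, k):
    """Average recall@k over all queries that have at least one match.

    results:  {query_id: ranked list of retrieved document ids}
    relevant: {query_id: set of document ids that count as matches}
    """
    recalls = []
    for query_id, ranked in results.items():
        matches = relevant.get(query_id, set())
        if not matches:
            # Skip queries without matches; dividing by len(matches)
            # would otherwise raise ZeroDivisionError.
            continue
        hits = len(matches.intersection(ranked[:k]))
        recalls.append(hits / len(matches))
    if not recalls:
        return None  # no query had any match, so the metric is undefined
    return sum(recalls) / len(recalls)


results = {'q1': ['a', 'b', 'c'], 'q2': ['d', 'e'], 'q3': ['x']}
relevant = {'q1': {'a', 'z'}, 'q2': set(), 'q3': {'x'}}
score = mean_recall_at_k(results, relevant, k=2)
print(score)
```

Here q2 has no matches and is left out of the average instead of causing an error.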

📗 Documentation Improvements

Add a tutorial for data synthesis (#745)

We have provided a tutorial for the new data synthesis module.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)

v0.7.6

1 year ago

Release Note Finetuner 0.7.6

This release covers Finetuner version 0.7.6, including dependencies finetuner-api 0.5.6 and finetuner-core 0.13.4.

This release contains 2 refactorings and 1 bug fix.

Warning

Due to the release of DocArray v2, which is not yet compatible with Finetuner, all previous versions of Finetuner will break when they update DocArray automatically. It is strongly recommended that you upgrade to this release because previous versions will not work.

⚙ Refactoring

Do not display PIL warning messages.

Previously, when fine-tuning with a vision backbone, the PIL package generated numerous warning messages that cluttered Finetuner's logs. These warnings are now filtered out.

Display small loss values with higher precision.

Previously, the progress bar would display a loss value of 0 when it was too small. To address this issue, we now use a higher precision when the loss value is too small.
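
One possible way to implement such a display rule is sketched below. It is an assumption about the general approach, not the code used by Finetuner's progress bar; format_loss is a hypothetical helper that falls back to scientific notation when fixed-point formatting would round a non-zero loss to zero.

```python
def format_loss(loss, decimals=3):
    """Format a loss value without hiding small non-zero values."""
    fixed = f"{loss:.{decimals}f}"
    if loss != 0 and float(fixed) == 0.0:
        # Fixed-point formatting would show zero, so switch notation.
        return f"{loss:.{decimals}e}"
    return fixed


print(format_loss(0.1234))
print(format_loss(0.00004))
print(format_loss(0.0))
```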

🐞 Bug Fixes

The DocArray version is set to a value lower than 0.3.0.

Previously, Finetuner automatically installed the latest version of DocArray. However, DocArray v2 has now been released and is a breaking change that is incompatible with the current version of Finetuner. Finetuner only supports DocArray up to version 0.3.0. Please upgrade your Finetuner to the latest version to resolve this issue.

We will release a version of Finetuner compatible with DocArray v2 in the immediate future.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Michael Günther (@guenthermi)
Scott Martens (@scott-martens)

v0.7.5

1 year ago

Release Note Finetuner 0.7.5

This release covers Finetuner version 0.7.5, including dependencies finetuner-api 0.5.6 and finetuner-core 0.13.3.

This release contains 2 refactorings and 2 bug fixes.

⚙ Refactoring

Downloading pre-trained weights is not necessary anymore

Previously, when a fine-tuning job was completed and the get_model function was called, we would construct the model, load the pre-trained weights, and then overwrite them with fine-tuned weights. We have now disabled the downloading of pre-trained weights, which speeds up the get_model function and eliminates unneeded network traffic.

Provide informative error messages when user did not login. (#708)

Before creating a Run, users are required to call finetuner.login() and use third-party authentication to log in. Previously, if they had not already done so, they would receive an error message that did not tell them to log in. We now display a more informative error message in the event that a user forgets to log in or their login attempt was unsuccessful.

🐞 Bug Fixes

Fix model name validation error using the model display name

When users request a model by name, they use names with the format name-size-lang, for example: bert-base-en. However, these names were not included in our internal schema for validation and jobs would fail validation. This has now been rectified.

Fix automatic batch size selection for PointNet models

In the past, BatchSizeFinder was unable to properly select batch sizes for PointNet++ models. This has been fixed.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)

v0.7.4

1 year ago

Release Note Finetuner 0.7.4

This release covers Finetuner version 0.7.4, including dependencies finetuner-api 0.5.5 and finetuner-core 0.13.0.

This release contains 2 new features, 3 refactorings, 1 bug fix, and 3 documentation improvements.

🆕 Features

Layer-wise learning rate decay (LLRD) (#697)

LLRD sets a large learning rate for the top (last) layer and uses a multiplicative decay rate to decrease the learning rate layer-by-layer from top (last) to bottom (first). With high learning rates, the features recognized by the top layers change more and adapt to new tasks more easily, while the bottom layers have low learning rates and more easily preserve the features learned during pre-training:

import finetuner
run = finetuner.fit(
    ...,
    optimizer='Adam',
+   optimizer_options={'layer_wise_lr_decay': 0.98},
    ...,
)
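
To make the effect of the decay rate concrete, the following sketch computes the per-layer learning rates that result from the description above. It illustrates the formula only and is not Finetuner's implementation; the function name layer_wise_learning_rates and the base learning rate of 1e-4 are assumptions chosen for the example.

```python
def layer_wise_learning_rates(base_lr, num_layers, decay):
    """Return one learning rate per layer.

    Index 0 is the bottom (first) layer and index num_layers - 1 is the
    top (last) layer, which keeps the full base learning rate.
    """
    return [
        base_lr * decay ** (num_layers - 1 - index)
        for index in range(num_layers)
    ]


rates = layer_wise_learning_rates(base_lr=1e-4, num_layers=12, decay=0.98)
for index, rate in enumerate(rates):
    print(f"layer {index:2d}: {rate:.3e}")
```

With a decay of 0.98 and 12 layers, the bottom layer trains at roughly 80% of the top layer's learning rate; smaller decay values make the bottom layers change even less.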

Support CSV for similarity-based training (#696)

In the previous release, Finetuner added support for training with data that is not explicitly labeled but in which each pair of training items has a numerical similarity score between 0.0 (totally dissimilar) and 1.0 (totally the same). This extends the range of scenarios to which Finetuner is applicable.

Now, users can prepare training data with a CSV file like this:

The weather is nice, The weather is beautiful, 0.9
The weather is nice, The weather is bad, 0
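
Before uploading such a file, it can be useful to check that every row has two texts and a score in the expected range. The sketch below does this with Python's standard csv module. It is a local sanity check under the assumption that rows have exactly three columns, not a description of how Finetuner parses the file; read_scored_pairs is a hypothetical helper.

```python
import csv
import io

CSV_TEXT = """The weather is nice, The weather is beautiful, 0.9
The weather is nice, The weather is bad, 0
"""


def read_scored_pairs(text):
    """Parse rows of (first text, second text, similarity score)."""
    pairs = []
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        first, second, raw_score = row
        score = float(raw_score)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {score}")
        pairs.append((first, second, score))
    return pairs


pairs = read_scored_pairs(CSV_TEXT)
print(pairs)
```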

⚙ Refactoring

Unify supported backbone names (#700)

We unified backbone names into the name-size-[lang] format. Now model names are easier to understand and remember. For example: bert-base-cased becomes bert-base-en, openai/clip-vit-base-patch32 becomes clip-base-en, resnet152 becomes resnet-large. Users can see the supported backbones with:

finetuner.describe_models(task='text-to-image')

Note that you can keep using the old names, for example:

# build a zero-shot resnet embedding model
model = finetuner.build_model('resnet50')

# this is identical to
model = finetuner.build_model('resnet-base')
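
Conceptually, keeping the old names working amounts to an alias lookup. The sketch below shows the idea using only the name pairs mentioned in this release note; the table LEGACY_NAMES and the function resolve_model_name are hypothetical and do not reflect Finetuner's internal code or its full model list.

```python
# Old backbone names mapped to the unified name-size-[lang] names.
LEGACY_NAMES = {
    'bert-base-cased': 'bert-base-en',
    'openai/clip-vit-base-patch32': 'clip-base-en',
    'resnet152': 'resnet-large',
    'resnet50': 'resnet-base',
}


def resolve_model_name(name):
    """Return the unified name for an old name, or the name unchanged."""
    return LEGACY_NAMES.get(name, name)


print(resolve_model_name('resnet50'))
print(resolve_model_name('resnet-base'))
```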

Simplify model offerings. (#700)

To simplify our offerings and create a more intuitive user experience, we have decreased the number of supported machine-learning models. By doing so, we hope to make it easier for our users to navigate our platform and find the models that best suit their needs.

Change the default `batch_size` for `EvaluationCallback` from 8 to 32.

We have increased the default batch size of EvaluationCallback from 8 to 32. This significantly speeds up evaluation, which in turn reduces its overall cost.

🐞 Bug Fixes

Fix `IndexError` in `EvaluationCallback` when 'gather_examples' is enabled.

We resolved an issue in the EvaluationCallback where the 'list out of range' error was occurring when 'gather_examples' was enabled. This fix ensures that the evaluation callback works correctly with 'gather_examples' and enables users to gather evaluation examples without encountering errors.

📗 Documentation Improvements

Add section for layer-wise learning rate decay (LLRD) (#697)
Add section for CosineSimilarityLoss (#697)
Add section on creating query-document pairs as CSV (#697)

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)

v0.7.3

1 year ago

Release Note Finetuner 0.7.3

This release covers Finetuner version 0.7.3, including dependencies finetuner-api 0.5.4 and finetuner-core 0.12.9.

This release contains 4 new features, 1 refactoring, and 1 bug fix.

🆕 Features

Automatic batch size configuration (#691)

It can be complicated to find a good batch size for the finetuner.fit function. If you choose a value that's too small, fine-tuning may not be very effective; if you choose a value that's too large, the job may run out of memory and fail. To make this easier, Finetuner now sets the batch size automatically if you leave out the batch_size parameter to finetuner.fit or set it to None. This will choose the largest batch size supported by the current CUDA device.
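
One way such a search could work is sketched below: double the batch size until it no longer fits, then binary-search the remaining range. This is not Finetuner's actual implementation. The predicate fits_in_memory is hypothetical and stands in for a trial training step on the CUDA device, and the sketch assumes that whenever a batch size fits, every smaller one fits too.

```python
def find_max_batch_size(fits_in_memory, limit=4096):
    """Return the largest batch size <= limit accepted by the predicate."""
    if not fits_in_memory(1):
        raise RuntimeError("even a batch size of 1 does not fit")
    good = 1
    # Phase 1: keep doubling while the doubled size still fits.
    while good * 2 <= limit and fits_in_memory(good * 2):
        good *= 2
    # The first size known (or defined) to be unusable.
    bad = min(good * 2, limit + 1)
    # Phase 2: binary search between the last good and first bad size.
    while bad - good > 1:
        middle = (good + bad) // 2
        if fits_in_memory(middle):
            good = middle
        else:
            bad = middle
    return good


# Simulated device on which batches of up to 100 items fit.
largest = find_max_batch_size(lambda size: size <= 100)
print(largest)
```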

Retrieving evaluation metrics (#687)

You no longer need to retrieve the logs of Finetuner runs or manually unpack fine-tuned models to get the evaluation metrics. Now, you can get the metrics directly from the Run.metrics function:

run = finetuner.fit(...)
metrics = run.metrics()

To print nicely formatted evaluation metrics to the console, use the Run.display_metrics() function. This will print tables showing evaluation metrics before and after fine-tuning:

[Figure: Metrics before and after fine-tuning]

Calculating example results (#687)

In addition to evaluation metrics, you may find it helpful to see actual query results. We have introduced a new parameter gather_examples to the evaluation callback to make this easy. If this parameter is set to True, the evaluation callback also tracks the Top-K results for some example queries sampled from the query dataset:

import finetuner
from finetuner.callback import EvaluationCallback

run = finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data='query-data-name',
            index_data='index-data-name',
            gather_examples=True,
        )
    ],
    ...
)

Like the evaluation metrics, you can retrieve the query results, before and after fine-tuning, with the Run.example_results function or print them on the console using Run.display_examples:

[Figure: Example results before and after fine-tuning]

Similarity-based training and Cosine Similarity Loss

Finetuner now supports training with data that is not explicitly labeled but that has, for each pair of training items, a numerical similarity score between 0.0 (totally dissimilar) and 1.0 (totally the same). This extends the range of scenarios to which Finetuner is applicable.

For example, you can now use DocArray to prepare data pairs with scores like this:

from docarray import Document, DocumentArray
d1 = Document(
    chunks=[Document(text='I am driving to Los Angeles'), Document(text='I am driving to Hollywood')],
    tags={'finetuner_score': 0.9},
)
d2 = Document(
    chunks=[Document(text='I am driving to Los Angeles'), Document(text='I am flying to New York')],
    tags={'finetuner_score': 0.3},
)
...
train_data = DocumentArray([d1, d2, ...])

Then, use CosineSimilarityLoss as the loss function in the finetuner.fit function:

finetuner.fit(
  model='sentence-transformers/msmarco-distilbert-base-v3',
  train_data=train_data,
  loss='CosineSimilarityLoss',
  ...
)

In the future, we will also support data with scores in CSV format.
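
To give an intuition for what a cosine similarity loss optimizes, the sketch below uses a common formulation: the mean squared error between the cosine similarity of each embedding pair and its target score. Whether Finetuner's CosineSimilarityLoss uses exactly this formulation is an assumption, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(batch):
    """Mean squared error between cosine similarity and target score.

    batch: list of (embedding_a, embedding_b, target_score) tuples.
    """
    errors = [(cosine(a, b) - target) ** 2 for a, b, target in batch]
    return sum(errors) / len(errors)


batch = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # identical vectors, target 1.0
    ([1.0, 0.0], [0.0, 1.0], 0.3),  # orthogonal vectors, target 0.3
]
loss = cosine_similarity_loss(batch)
print(loss)
```

The first pair already matches its target and contributes nothing; the second pair has a cosine similarity of 0 against a target of 0.3, so training would pull those two embeddings slightly closer together.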

⚙ Refactoring

Remove job limit

Previously, users could only run three jobs in parallel. This limit has been removed.

🐞 Bug Fixes

Logs become unavailable after some time

Previously, the logs of a fine-tuning job were lost some time after the job finished. Now, logs remain available indefinitely.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)

v0.7.2

1 year ago

Release Note Finetuner 0.7.2

This release covers Finetuner version 0.7.2, including dependencies finetuner-api 0.5.2 and finetuner-core 0.12.7.

This release contains 2 new features, 4 refactorings, 1 bug fix, and 3 documentation improvements.

🆕 Features

Support learning rate scheduler (#679)

This PR adds support for learning rate schedulers. A scheduler adjusts the learning rate during training. We support 6 learning rate schedulers: linear, cosine, cosine_with_restarts, polynomial, constant and constant_with_warmup.

When a scheduler is configured, the learning rate is by default adjusted after each batch. Alternatively, you can set scheduler_options = {'scheduler_step': 'epoch'} to adjust the learning rate after each epoch instead.

You can use them by specifying their name in the scheduler attribute of the fit function.

run = finetuner.fit(
    ...,
    scheduler='linear',
    scheduler_options={'scheduler_step': 'batch'},
    ...
)
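
The difference between the two scheduler_step settings can be illustrated with a small simulation. The sketch below assumes that the linear scheduler decays the learning rate linearly to zero after an optional warm-up phase, as the scheduler of the same name in Hugging Face Transformers does; whether Finetuner's scheduler behaves identically is an assumption, and linear_schedule is a hypothetical helper.

```python
def linear_schedule(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given step: linear warm-up, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batches_per_epoch, epochs = 4, 2

# 'batch': the scheduler advances after every batch.
per_batch = [
    linear_schedule(step, batches_per_epoch * epochs, base_lr=1.0)
    for step in range(batches_per_epoch * epochs)
]

# 'epoch': the scheduler advances once per epoch, so all batches of an
# epoch share the same learning rate.
per_epoch = [
    linear_schedule(epoch, epochs, base_lr=1.0)
    for epoch in range(epochs)
    for _ in range(batches_per_epoch)
]

print(per_batch)
print(per_epoch)
```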

Support `steps_per_interval` in `EvaluationCallback`

When working with large datasets, you may want to perform evaluations multiple times during each epoch. This parameter lets you specify the number of batches after which an evaluation should be performed. If set to None, an evaluation is performed only at the end of each epoch.

run = finetuner.fit(
    ...,
    callbacks=[
        EvaluationCallback(
            query_data=...,
            index_data=...,
            steps_per_interval=3, # evaluate every 3 batches.
        ),
    ],
    ...
)
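
The following sketch lists the points at which an evaluation would be triggered for a given setting. It assumes that the batch counter restarts at the beginning of each epoch and that no extra evaluation runs at the end of an epoch when steps_per_interval is set; both are assumptions for illustration, and evaluation_points is a hypothetical helper.

```python
def evaluation_points(num_epochs, batches_per_epoch, steps_per_interval=None):
    """Return (epoch, batch) positions after which an evaluation runs."""
    points = []
    for epoch in range(num_epochs):
        for batch in range(1, batches_per_epoch + 1):
            if steps_per_interval is None:
                # Default: evaluate only after the last batch of an epoch.
                if batch == batches_per_epoch:
                    points.append((epoch, batch))
            elif batch % steps_per_interval == 0:
                points.append((epoch, batch))
    return points


print(evaluation_points(num_epochs=1, batches_per_epoch=7, steps_per_interval=3))
print(evaluation_points(num_epochs=2, batches_per_epoch=4))
```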

⚙ Refactoring

`scheduler_step` becomes part of `scheduler_options` (#679)

We removed the scheduler_step argument from the fit function; it is now part of scheduler_options.

run = finetuner.fit(
    ...,
    scheduler='linear',
-   scheduler_step='batch',
+   scheduler_options={'scheduler_step': 'batch'},
    ...
)

Change the default `epochs` and `batch_size` in cloud.jina.ai

For Web UI users, we have reduced the default epochs from 10 to 5, and reduced the default batch_size from 128 to 64 to avoid out-of-memory errors from 3D-mesh fine-tuning.

Improve the user journey in cloud.jina.ai

Add more textual guidance on creating Finetuner runs in the Jina AI Cloud UI.

Remove duplicate query-document pairs in unlabeled CSVs. (#678)

Finetuner now groups query-document pairs by their queries, thereby eliminating duplicate queries, when parsing CSV files. This leads to more effective fine-tuning.
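
The grouping step can be pictured as follows. This is a simplified sketch of the idea rather than Finetuner's CSV parser; group_by_query is a hypothetical helper that collects all documents belonging to the same query and drops exact duplicate pairs.

```python
def group_by_query(pairs):
    """Group (query, document) pairs by query, dropping duplicate pairs."""
    grouped = {}
    for query, document in pairs:
        documents = grouped.setdefault(query, [])
        if document not in documents:
            documents.append(document)
    return grouped


pairs = [
    ('red shoes', 'Red running shoes'),
    ('red shoes', 'Crimson sneakers'),
    ('red shoes', 'Red running shoes'),  # duplicate pair
    ('blue hat', 'Navy baseball cap'),
]
grouped = group_by_query(pairs)
print(grouped)
```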

🐞 Bug Fixes

Remove invalid argument from `GeM` pooler.

This PR removes the output_dim argument from the GeM pooler's forward function. You can use the GeM pooler together with ArcFaceLoss to deliver better visual embedding quality.

run = finetuner.fit(
    ...,
    model_options = {
        ...
-       'output_dim': 512,
+       'pooler': 'GeM',
+       'pooler_options': {'p': 2.4, 'eps': 1e-5}
    }
)
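
For context, generalized mean (GeM) pooling reduces a set of activations to a single value by raising them to the power p, averaging, and taking the p-th root, with eps as a lower clamp. The sketch below applies this standard formula to a plain list of numbers; it is not Finetuner's pooler, and gem_pool is a hypothetical helper.

```python
def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized mean of the values, clamped from below at eps."""
    clamped = [max(value, eps) for value in values]
    mean_of_powers = sum(value ** p for value in clamped) / len(clamped)
    return mean_of_powers ** (1.0 / p)


activations = [0.0, 1.0, 2.0]
pooled = gem_pool(activations, p=2.4, eps=1e-5)
print(pooled)
```

With p = 1 the result is the ordinary average, and as p grows the result approaches the maximum, so p controls how strongly the pooling focuses on the largest activations.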

📗 Documentation Improvements

Add a documentation section for `GeM` pooling (#684)

We have added a new section to our documentation which explains the pooling options in more detail.

Add a documentation page and notebook for `ArcFaceLoss` (#680)

We have added a new page to our documentation which demonstrates ArcFaceLoss on the Stanford Cars dataset.

Add a documentation section on creating query-document pairs as data (#678)

We have added a new section to our documentation explaining how to create training data made of query-document pairs instead of explicitly annotated and labeled data.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
CatStark (@CatStark)
George Mastrapas (@gmastrapas)
Martin Matousek (@matousek-martin)
Scott Martens (@scott-martens)

v0.7.1

1 year ago

Release Note Finetuner 0.7.1

This release covers Finetuner version 0.7.1, including dependencies finetuner-api 0.5.0 and finetuner-core 0.12.6.

This release contains 2 new features, 3 refactorings, 3 bug fixes, and 4 documentation improvements.

🆕 Features

Support SphereFace Loss Functions (#664)

SphereFace loss functions were first formulated for computer vision tasks, specifically face recognition. Finetuner supports two variations of this loss function: ArcFaceLoss and CosFaceLoss. Instead of attempting to minimize the distance between positive pairs and maximize the distance between negative pairs, the SphereFace loss functions compare each sample with an estimate of the center point of each class's embeddings.

Like all supported loss functions, you can use them by specifying their name in the loss attribute of the fit function.

run = finetuner.fit(
    ...,
    loss='ArcFaceLoss',
    ...
)
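
The comparison against class centers described above can be sketched for a single sample as follows. The code follows the commonly published ArcFace formulation (an additive angular margin on the target class, followed by softmax cross-entropy) and omits refinements that real implementations add for very large angles. It is not Finetuner's implementation; the function names and the default scale and margin values are assumptions.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def arcface_loss(embedding, class_centers, target, scale=30.0, margin=0.5):
    """ArcFace-style loss for one sample against estimated class centers."""
    unit_embedding = normalize(embedding)
    logits = []
    for index, center in enumerate(class_centers):
        cosine = sum(x * y for x, y in zip(unit_embedding, normalize(center)))
        cosine = max(-1.0, min(1.0, cosine))  # guard against rounding
        if index == target:
            # Add the angular margin to the target class only.
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    # Numerically stable softmax cross-entropy.
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(l - largest) for l in logits))
    return log_sum - logits[target]


centers = [[1.0, 0.0], [0.0, 1.0]]
sample = [1.0, 0.5]
loss_right_class = arcface_loss(sample, centers, target=0)
loss_wrong_class = arcface_loss(sample, centers, target=1)
loss_no_margin = arcface_loss(sample, centers, target=0, margin=0.0)
print(loss_right_class, loss_wrong_class, loss_no_margin)
```

The margin makes the loss stricter: even a sample that is already closest to its own class center is pushed to move closer still.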

To track and refine our estimate of the class center points across batches, these SphereFace loss functions require an additional optimizer during training. By default, the type of optimizer used will be the same as the one used for the model itself, but you can also choose a different optimizer for your loss function using the loss_optimizer parameter.

run = finetuner.fit(
    ...,
    loss='ArcFaceLoss',
+   loss_optimizer='Adam',
+   loss_optimizer_options={'weight_decay': 0.01}
)

Support Continuing Training from an Artifact of a Previous Run (#668)

If you want to start fine-tuning from a model produced by a previous Run, or you collected new training data and want to use it to continue training, this is now possible. To use this feature, you need to set the artifact id of the model you want to continue training from via the model_artifact parameter of the fit function:

train_data = 'path/to/another/data.csv'
new_run = finetuner.fit(
    model='efficientnet_b0',
    train_data=train_data,
    model_artifact=previous_run.artifact_id,
)

⚙ Refactoring

Removing ResNet-based CLIP Models (#662)

Due to low usage, we removed CLIP models which are based on ResNet.

Add the EfficientNet B7 Model (#662)

For image-to-image search, we now support EfficientNet B7 as a backbone model.

Increase Upload Size of CSV Files for cloud.jina.ai

For Web UI users, we have increased the upload file size from 1MB to 32MB. Python client users have always been able to upload much larger datasets and are unaffected by this change.

🐞 Bug Fixes

Solve Dependency Problem in MLFlow Callback

A new SQLAlchemy release caused the MLFlow callback to behave incorrectly in some cases. This release fixes the problem.

Prevent Errors caused by wrong `num_items_per_class` Parameter

Some loss functions do not use the num_items_per_class parameter. In some cases, it is possible for users to set this parameter in a way that is incompatible with the rest of the configuration and cause Finetuner to fail. Now the parameter is only validated if it is actually used, and for loss functions that do not use it, it is completely ignored.

Sometimes, when calling finetuner.login() in a Jupyter notebook, login would appear successful, but Finetuner might not always behave correctly. Previously, users had to call finetuner.login(force=True) to be sure they were correctly logged in. This problem has been resolved, and finetuner.login() works correctly without the force flag.

📗 Documentation Improvements

Add a Documentation Page for Loss Functions and Pooling (#664)

We have added a new page to our documentation which explains several loss functions and the pooling options in more detail.

Add a Section about Finetuner Articles (#669)

We have added to our README a list of articles that make use of Finetuner and provide more insight into using Finetuner in practice.

Add a Folder for Example CSV files (#663)

If you need example training datasets that have already been prepared for use in Finetuner, you can look at the dataset folder in our repository.

Proofread the documentation as a whole to fix typos and broken links (#661, #666)

We repaired broken links and fixed typos found in the Finetuner documentation.

🤟 Contributors

We would like to thank all contributors to this release:

Wang Bo (@bwanglzu)
Louis Milliken (@LMMilliken)
Michael Günther (@guenthermi)
CatStark (@CatStark)
George Mastrapas (@gmastrapas)
Scott Martens (@scott-martens)
