End-to-end pipeline with TFX to train and deploy a BERT model for sentiment analysis.
The end-to-end TFX pipeline covers most of the main areas of a machine learning solution, from data ingestion and validation to model training and serving. Those steps are described below. This repository also aims to provide different options for managing the pipeline through orchestrators: the orchestrators covered are Airflow, Kubeflow, and an interactive option that can be used on Google Colab for demonstration purposes.
ExampleGen
is the initial input component of a pipeline; it ingests and optionally splits the input dataset. Here it ingests the IMDB dataset stored as a CSV file and splits the data into train (2/3) and validation (1/3) sets.
StatisticsGen
calculates statistics for the dataset.
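The ingestion step can be sketched with TFX's CsvExampleGen; the input path is an assumed location, and the 2:1 hash-bucket ratio reproduces the train/validation split described above:

```python
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

# Split the incoming examples 2/3 train, 1/3 eval via hash buckets.
output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1),
    ]))

# 'data/imdb' is an assumed directory containing the IMDB CSV file.
example_gen = CsvExampleGen(input_base='data/imdb',
                            output_config=output_config)
```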
SchemaGen
examines the statistics and creates a data schema.
ExampleValidator
looks for anomalies and missing values in the dataset, based on SchemaGen's schema.
Transform
performs feature engineering on the dataset.
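These components chain together by passing artifacts through their output channels. A wiring sketch, assuming an `example_gen` variable from the ingestion step and a `preprocessing.py` module file (both names are assumptions):

```python
from tfx.components import StatisticsGen, SchemaGen, ExampleValidator, Transform

# StatisticsGen consumes the ingested examples.
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

# SchemaGen infers a schema from those statistics.
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

# ExampleValidator checks the data against the inferred schema.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])

# Transform applies the preprocessing_fn defined in the module file.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file='preprocessing.py')
```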
Tuner
uses KerasTuner to perform hyperparameter tuning for the model.
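The Tuner wiring might look like the sketch below; the `transform` variable, the `model.py` module file (whose `tuner_fn` defines the KerasTuner search space), and the step counts are all assumptions:

```python
from tfx.components import Tuner
from tfx.proto import trainer_pb2

# Tuner runs trials over transformed examples; the search space itself
# lives in the module file's tuner_fn.
tuner = Tuner(
    module_file='model.py',
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    train_args=trainer_pb2.TrainArgs(num_steps=100),
    eval_args=trainer_pb2.EvalArgs(num_steps=50))
```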
Trainer
trains the model. Here it trains a BERT model, which also has a built-in text tokenizer.
Resolver
performs model validation, resolving the latest blessed model to serve as a baseline.
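A sketch of the Trainer and Resolver wiring, assuming the upstream component variables and a `model.py` module file (assumed names):

```python
from tfx import v1 as tfx
from tfx.components import Trainer
from tfx.proto import trainer_pb2

# Trainer runs the run_fn in the module file; it can pick up the best
# hyperparameters found by the Tuner.
trainer = Trainer(
    module_file='model.py',
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    hyperparameters=tuner.outputs['best_hyperparameters'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=150))

# The Resolver fetches the latest "blessed" model, which the Evaluator
# uses as the baseline to compare the newly trained model against.
model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(
        type=tfx.types.standard_artifacts.ModelBlessing),
).with_id('latest_blessed_model_resolver')
```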
Evaluator
performs deep analysis of the training results and helps you validate your exported models, ensuring that they are "good enough" to be pushed to production.
InfraValidator
is used as an early warning layer before pushing a model into production. The name "infra" validator comes from the fact that it validates the model in the actual model serving infrastructure.
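The "good enough" criterion is expressed as an EvalConfig. In the sketch below, the model is only blessed if its binary accuracy clears a lower bound; the 0.6 threshold and upstream variable names are illustrative assumptions:

```python
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[tfma.SlicingSpec()],  # evaluate on the whole dataset
    metrics_specs=[tfma.MetricsSpec(metrics=[tfma.MetricConfig(
        class_name='BinaryAccuracy',
        threshold=tfma.MetricThreshold(
            value_threshold=tfma.GenericValueThreshold(
                lower_bound={'value': 0.6})))])])

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config)
```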
Pusher
deploys the model on a serving infrastructure.
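A Pusher wiring sketch; the upstream variables and export directory are assumed names. The model is only pushed if the Evaluator blessed it:

```python
from tfx.components import Pusher
from tfx.proto import pusher_pb2

pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=evaluator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='serving_model/imdb_bert')))
```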
For the modeling part, we are going to use a BERT model. For better performance we will use transfer learning, meaning we start from a model that was pre-trained on another task (usually a more generic or similar one). From the pre-trained model we will use all layers up to the output of the last embedding, or to be more specific, only the output of the CLS token, shown in the image below. We then add a classifier layer on top, responsible for classifying the input text as positive or negative. This task is known as sentiment analysis and is very common in natural language processing.
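A minimal sketch of the classifier head placed on top of the pre-trained encoder. Here `cls_embedding` stands for the output of BERT's CLS token; the default dimension of 768 matches BERT-Base and should be adjusted to the variant actually used:

```python
import tensorflow as tf

def build_head(cls_dim=768):
    # Input: the CLS-token embedding produced by the pre-trained encoder.
    cls_embedding = tf.keras.layers.Input(shape=(cls_dim,),
                                          name='cls_embedding')
    # Single sigmoid unit: output near 1 -> positive, near 0 -> negative.
    score = tf.keras.layers.Dense(1, activation='sigmoid')(cls_embedding)
    model = tf.keras.Model(cls_embedding, score)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model
```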
The dataset used for training and evaluating the model is the well-known IMDB review dataset. It contains 25,000 movie reviews, each labeled negative (label 0) or positive (label 1). The dataset was slightly processed to be used here: labels were encoded as integers (0 or 1) and, for faster experimentation, the data was reduced to 5,000 samples.
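That preprocessing can be sketched in plain Python; the column names `text` and `sentiment` are assumptions about the raw CSV layout:

```python
import random

def prepare(rows, n_samples=5000, seed=42):
    """Encode string sentiments as integer labels and subsample the rows."""
    encoded = [
        {'text': r['text'],
         'label': 1 if r['sentiment'] == 'positive' else 0}
        for r in rows
    ]
    # Deterministic shuffle before subsampling for reproducibility.
    random.Random(seed).shuffle(encoded)
    return encoded[:n_samples]
```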