Kickstart your MLOps initiative with a flexible, robust, and productive Python package.
This repository contains a Python code base with best practices designed to support your MLOps initiatives.
The package leverages several tools and tips to make your MLOps experience as flexible, robust, and productive as possible.
You can use this package as part of your MLOps toolkit or platform (e.g., Model Registry, Experiment Tracking, Realtime Inference, ...).
I'm currently preparing a course and a mentoring offer to help you create and use MLOps packages for your projects. Stay tuned :)
This section details the requirements, actions, and next steps to kickstart your MLOps project.
```bash
# with ssh (recommended)
$ git clone git@github.com:fmind/mlops-python-package
# with https
$ git clone https://github.com/fmind/mlops-python-package
$ cd mlops-python-package/
$ poetry install
```
From there, there are dozens of ways to integrate this package into your MLOps platform.
For instance, you can use Databricks or AWS as your compute platform and model registry.
It's up to you to adapt the package code to the solution you target. Good luck champ!
This section explains how to configure the project code and execute it on your system.
You can add or edit config files in the `confs/` folder to change the program behavior.
```yaml
# confs/training.yaml
job:
  KIND: TrainingJob
  inputs:
    KIND: ParquetReader
    path: data/inputs.parquet
  targets:
    KIND: ParquetReader
    path: data/targets.parquet
```
This config file instructs the program to start a `TrainingJob` with 2 parameters:
- `inputs`: the dataset that contains the model inputs
- `targets`: the dataset that contains the model targets

You can find all the parameters of your program in the `src/[package]/jobs/*.py` files.
You can also print the full schema supported by this package using `poetry run bikes --schema`.
The project code can be executed with poetry during your development:
```bash
$ poetry run [package] confs/tuning.yaml
$ poetry run [package] confs/training.yaml
$ poetry run [package] confs/promotion.yaml
$ poetry run [package] confs/inference.yaml
```
In production, you can build, ship, and run the project as a Python package:
```bash
poetry build
poetry publish # optional
python -m pip install [package]
[package] confs/inference.yaml
```
You can also install and use this package as a library for another AI/ML project:
```python
from [package] import jobs

job = jobs.TrainingJob(...)
with job as runner:
    runner.run()
```
Additional tips:
- You can pass extra config options from the command line with the `--extras` flag.
This project includes several automation tasks to easily repeat common actions.
You can invoke the actions from the command line or the VS Code extension.
```bash
# execute the project DAG
$ inv dags
# create a code archive
$ inv packages
# list other actions
$ inv --list
```
This package supports two GitHub Workflows in `.github/workflows`:
- `check.yml`: validate the quality of the package on each Pull Request
- `publish.yml`: build and publish the docs and packages on code release

You can use and extend these workflows to automate repetitive package management tasks.
This section motivates the use of developer tools to improve your coding experience.
- Pre-defined actions to automate your project development.
- Execution of automated workflows on code push and releases.
- Integrations with the Command-Line Interface (CLI) of your system.
- Editing, validation, and versioning of your project source code.
- Manage the config files of your project to change executions.
- Define the datasets to provide data inputs and outputs.
- Generate and share the project documentation.
- Toolkit to handle machine learning models.
- Define and build modern Python packages.
- Select your programming environment.
This section gives some tips and tricks to enrich the development experience.
You should decouple the pointer to your data from how to access it.
In your code, you can refer to your dataset with a tag (e.g., `inputs`, `targets`).
This tag can then be associated with a reader/writer implementation in a configuration file:
```yaml
inputs:
  KIND: ParquetReader
  path: data/inputs.parquet
targets:
  KIND: ParquetReader
  path: data/targets.parquet
```
In this package, the implementations are described in `src/[package]/io/datasets.py` and selected by `KIND`.
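As an illustration, here is a minimal sketch of a `KIND`-selected reader, assuming Pydantic v2 and pandas; the class is simplified from the real implementation in `datasets.py`:

```python
import typing as T

import pandas as pd
import pydantic as pdt


class ParquetReader(pdt.BaseModel):
    """Simplified reader selected by its KIND tag (sketch, not the real class)."""

    KIND: T.Literal["ParquetReader"] = "ParquetReader"
    path: str

    def read(self) -> pd.DataFrame:
        # delegate to pandas; callers only depend on the read() interface
        return pd.read_parquet(self.path)


# the KIND tag in the config selects the implementation to instantiate
reader = ParquetReader.model_validate({"KIND": "ParquetReader", "path": "data/inputs.parquet"})
inputs = reader.read()  # requires the parquet file to exist
```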
You should select the best hyperparameters for your model using optimization search.
The simplest projects can use `sklearn.model_selection.GridSearchCV` to scan the whole search space.
This package provides a simple interface to this hyperparameter search facility in `src/[package]/utils/searchers.py`.
For more complex projects, we recommend using a more advanced strategy (e.g., Bayesian optimization) and a dedicated software package (e.g., Optuna).
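For instance, here is a minimal, self-contained `GridSearchCV` sketch; the model and parameter grid are illustrative assumptions, not the package defaults:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# toy data standing in for your inputs/targets datasets
X, y = make_regression(n_samples=200, n_features=5, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```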
You should properly split your dataset into training, validation, and testing sets.
The sets should be exclusive, and the testing set should never be used as training inputs!
This package provides a simple deterministic strategy implemented in `src/[package]/utils/splitters.py`.
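Below is a minimal sketch of such a deterministic, time-ordered split; the function name and signature are assumptions for illustration:

```python
import pandas as pd


def split_train_test(data: pd.DataFrame, test_size: int) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hold out the last `test_size` rows for testing (no shuffle: time sensitive)."""
    assert 0 < test_size < len(data), "test_size must be within the dataset bounds"
    return data.iloc[:-test_size], data.iloc[-test_size:]
```

Holding out the most recent rows avoids training on data from the future when the dataset is time sensitive.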
You should use a Directed Acyclic Graph (DAG) to connect the steps of your ML pipeline.
A DAG can express the dependencies between steps while keeping each step independent.
This package provides a simple DAG example in `tasks/dags.py`. This approach is based on PyInvoke.
In production, we recommend using a scalable system such as Airflow, Dagster, Prefect, Metaflow, or ZenML.
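As an indication of how PyInvoke can express such a DAG, here is a hypothetical sketch (the task names and commands are assumptions, not the package's real tasks):

```python
# in tasks/dags.py (hypothetical sketch)
from invoke import task


@task
def tuning(ctx):
    ctx.run("poetry run bikes confs/tuning.yaml")


@task(pre=[tuning])  # training depends on tuning
def training(ctx):
    ctx.run("poetry run bikes confs/training.yaml")


@task(pre=[training])  # promotion depends on training
def promotion(ctx):
    ctx.run("poetry run bikes confs/promotion.yaml")
```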
You should provide a global context for the execution of your program.
There are several approaches such as Singleton, Global Variable, or Component.
This package takes inspiration from Clojure mount. It provides an implementation in `src/[package]/io/services.py`.
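As a minimal sketch of such a service (the class name and behavior are assumptions, not the package's real services):

```python
import logging


class LoggerService:
    """Global logging service started once for the whole program (sketch)."""

    def __init__(self, level: str = "INFO") -> None:
        self.level = level

    def start(self) -> None:
        """Configure the root logger for every module of the program."""
        logging.basicConfig(level=self.level)

    def stop(self) -> None:
        """Flush and shut down the logging system."""
        logging.shutdown()
```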
You should separate the program implementation from the program configuration.
Exposing configurations to users allows them to influence the execution behavior without code changes.
This package seeks to expose as many parameters as possible to the users in configurations stored in the `confs/` folder.
You should implement the SOLID principles to make your code as flexible as possible.
In practice, this means you can define software contracts with interfaces and swap the implementations.
For instance, you can implement several jobs in `src/[package]/jobs/*.py` and swap them in your configuration.
To learn more about the mechanism selected for this package, you can check the documentation for Pydantic Tagged Unions.
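Here is a minimal sketch of that mechanism, assuming Pydantic v2 (the job classes are simplified stand-ins for the real ones):

```python
import typing as T

import pydantic as pdt


class TrainingJob(pdt.BaseModel):
    KIND: T.Literal["TrainingJob"] = "TrainingJob"


class InferenceJob(pdt.BaseModel):
    KIND: T.Literal["InferenceJob"] = "InferenceJob"


# the KIND field discriminates which class to instantiate
JobKind = T.Annotated[TrainingJob | InferenceJob, pdt.Field(discriminator="KIND")]

job = pdt.TypeAdapter(JobKind).validate_python({"KIND": "TrainingJob"})
assert isinstance(job, TrainingJob)
```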
You should separate the code interacting with the external world from the rest.
The external world is messy and full of risks: missing files, permission issues, running out of disk space, ...
To isolate these risks, you can put all the related code in an `io` package and access it through interfaces.
You should use Python context managers to control and enhance an execution.
Python provides contexts that can be used to extend a code block. For instance:
```python
# in src/[package]/scripts.py
with job as runner:  # context
    runner.run()  # run in context
```
This pattern has the same benefits as a Monad, a powerful programming pattern.
The package uses contexts in `src/[package]/jobs/*.py` to handle exceptions and services.
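For illustration, here is a minimal sketch of a job implemented as a context manager, under the assumption that entering the context starts services and exiting stops them:

```python
import types
import typing as T


class Job:
    """Sketch of a job whose context starts/stops program services."""

    def __enter__(self) -> "Job":
        print("starting services (e.g., logging) ...")
        return self

    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_value: BaseException | None,
        traceback: types.TracebackType | None,
    ) -> T.Literal[False]:
        print("stopping services ...")
        return False  # do not swallow exceptions

    def run(self) -> None:
        print("running the job ...")
```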
You should create a Python package to provide both a library and an application to others.
Using a Python package for your AI/ML project has several benefits, such as versioned releases, simple installation, and reuse as a library.
To build a Python package with Poetry, you simply have to type in a terminal:
```bash
# for all poetry projects
poetry build
# for this project only
inv packages
```
You should type your Python code to make it more robust and explicit for your users.
Python provides the typing module for adding type hints and mypy for checking them.
```python
# in src/[package]/core/models.py
@abc.abstractmethod
def fit(self, inputs: schemas.Inputs, targets: schemas.Targets) -> "Model":
    """Fit the model on the given inputs and targets."""

@abc.abstractmethod
def predict(self, inputs: schemas.Inputs) -> schemas.Outputs:
    """Generate an output with the model for the given inputs."""
```
This code snippet clearly states the inputs and outputs of each method, both for the developer and the type checker.
The package aims to type every function and class to facilitate the developer experience and catch mistakes before execution.
You should type your configuration to avoid exceptions during the program execution.
Pydantic allows you to define classes that validate your configs during program startup.
```python
# in src/[package]/utils/splitters.py
class TrainTestSplitter(Splitter):
    shuffle: bool = False  # required (time sensitive)
    test_size: int | float = 24 * 30 * 2  # 2 months
    random_state: int = 42
```
This code snippet communicates the expected values and prevents avoidable errors.
The package combines both OmegaConf and Pydantic to parse YAML files and validate them as soon as possible.
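A rough sketch of that combination, reusing the splitter fields above (the YAML content and class name are assumptions):

```python
import omegaconf
import pydantic as pdt


class SplitterConfig(pdt.BaseModel):
    shuffle: bool = False
    test_size: int | float = 24 * 30 * 2
    random_state: int = 42


# parse the YAML text with OmegaConf, then validate it with Pydantic
conf = omegaconf.OmegaConf.create("test_size: 48\nrandom_state: 7")
data = omegaconf.OmegaConf.to_container(conf, resolve=True)
config = SplitterConfig.model_validate(data)  # raises on invalid values
```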
You should type your dataframes to communicate and validate their fields.
Pandera supports dataframe typing for Pandas and other libraries like PySpark:
```python
# in src/package/schemas.py
class InputsSchema(Schema):
    instant: papd.Index[papd.UInt32] = pa.Field(ge=0, check_name=True)
    dteday: papd.Series[papd.DateTime] = pa.Field()
    season: papd.Series[papd.UInt8] = pa.Field(isin=[1, 2, 3, 4])
    yr: papd.Series[papd.UInt8] = pa.Field(ge=0, le=1)
    mnth: papd.Series[papd.UInt8] = pa.Field(ge=1, le=12)
    hr: papd.Series[papd.UInt8] = pa.Field(ge=0, le=23)
    holiday: papd.Series[papd.Bool] = pa.Field()
    weekday: papd.Series[papd.UInt8] = pa.Field(ge=0, le=6)
    workingday: papd.Series[papd.Bool] = pa.Field()
    weathersit: papd.Series[papd.UInt8] = pa.Field(ge=1, le=4)
    temp: papd.Series[papd.Float16] = pa.Field(ge=0, le=1)
    atemp: papd.Series[papd.Float16] = pa.Field(ge=0, le=1)
    hum: papd.Series[papd.Float16] = pa.Field(ge=0, le=1)
    windspeed: papd.Series[papd.Float16] = pa.Field(ge=0, le=1)
    casual: papd.Series[papd.UInt32] = pa.Field(ge=0)
    registered: papd.Series[papd.UInt32] = pa.Field(ge=0)
```
This code snippet defines the fields of the dataframe and some of its constraints.
The package encourages you to type every dataframe used in `src/[package]/core/schemas.py`.
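For completeness, a typical usage sketch, assuming `InputsSchema` follows Pandera's class-based API (e.g., a `pa.DataFrameModel` subclass) and the data path from the configs above:

```python
import pandas as pd

# validate the raw dataframe against the schema;
# raises a SchemaError if a field or constraint is violated
raw = pd.read_parquet("data/inputs.parquet")
inputs = InputsSchema.validate(raw)
```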
You should use Object-Oriented Programming to benefit from polymorphism.
Polymorphism combined with the SOLID principles allows you to easily swap your code components.
```python
class Reader(abc.ABC, pdt.BaseModel):
    @abc.abstractmethod
    def read(self) -> pd.DataFrame:
        """Read a dataframe from a dataset."""
```
This code snippet uses the abc module to define a code interface for a dataset with a read method.
The package defines class interfaces whenever possible to provide intuitive and replaceable parts for your AI/ML project.
You should use semantic versioning to communicate the level of compatibility of your releases.
Semantic Versioning (SemVer) provides a simple schema to communicate code changes. For a package version X.Y.Z:
- X (major): incremented on incompatible API changes
- Y (minor): incremented on backward-compatible new features
- Z (patch): incremented on backward-compatible bug fixes

Poetry and this package leverage Semantic Versioning to let developers control the speed of adoption of new releases (e.g., a `^X.Y.Z` dependency constraint in `pyproject.toml` accepts any compatible release below the next major version).
You can run your tests in parallel to speed up the validation of your code base.
Pytest can be extended with the pytest-xdist plugin for this purpose (e.g., `pytest -n auto` runs tests on all CPU cores).
This package enables pytest-xdist in its automation tasks by default.
You should define reusable objects and actions for your tests with fixtures.
Fixtures can prepare objects for your test cases, such as dataframes, models, or files.
This package defines fixtures in `tests/conftest.py` to improve your testing experience.
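A small illustrative sketch (the fixture name and dataframe content are assumptions, not the package's real fixtures):

```python
# in tests/conftest.py (hypothetical sketch)
import pandas as pd
import pytest


@pytest.fixture
def inputs() -> pd.DataFrame:
    """Provide a small dataframe of model inputs for test cases."""
    return pd.DataFrame({"hr": [0, 1, 2], "temp": [0.3, 0.4, 0.5]})


# in a test module: the fixture is injected by name
def test_inputs_are_not_empty(inputs: pd.DataFrame) -> None:
    assert not inputs.empty
```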
You can use VS Code workspace to define configurations for your project.
A Code Workspace can enable features (e.g., formatting) and set the default interpreter.
```json
{
    "settings": {
        "editor.formatOnSave": true,
        "python.defaultInterpreterPath": ".venv/bin/python",
        ...
    },
}
```
This package defines a workspace file that you can load from `[package].code-workspace`.
You can use GitHub Copilot to increase your coding productivity by 30%.
GitHub Copilot has been a huge productivity boost thanks to its smart completion.
You should become familiar with the solution in less than a single coding session.
You can use VIM keybindings to navigate and modify your code more efficiently.
Learning VIM is one of the best investments for a career in IT. It can make you 30% more productive.
Compared to GitHub Copilot, VIM takes much more time to master, but you can expect a return on investment in less than a month.
This section provides resources for building packages for Python and AI/ML/MLOps.