Say the magic word. 😸
Plz (pronounced "please") runs your jobs, storing the code, input, outputs and results so that they can be queried programmatically. This helps with traceability and reproducibility. If you want to run your jobs in the cloud, it makes the process as frictionless as running them locally. Jump here to see it in action.
At Prodo.AI, we use Plz to train our PyTorch-based machine learning models.
Plz is an experimental product and is not guaranteed to be stable across versions.
We offer more details below on how to set up Plz and run your jobs, but we can start by giving you an overview of what Plz does.
Plz offers a command-line interface. You start by adding a plz.config.json
file to the directory where you have your source code. This file contains, among
other things, the command you run to put your program to work (for instance,
python3 main.py
). Then you can use Plz to run your program with plz run
. The
following example (provided in this repository) demonstrates this:
sergio@spaceship:~/plz/examples/pytorch$ plz run
👌 Capturing the files in /home/sergio/plz/examples/pytorch
👌 Building the program snapshot
Step 1/4 : FROM prodoai/plz_ml-pytorch
# Executing 3 build triggers
---> Using cache
[...]
---> 9c39e889659d
Successfully built 9c39e889659d
Successfully tagged 024444204267.dkr.ecr.eu-west-1.amazonaws.com/plz/builds:some-person-trying-pytorch-mnist-example-1541436382135
👌 Capturing the input
👌 983663 input bytes to upload
👌 Sending request to start execution
Instance status: querying availability
Instance status: requesting new instance
Instance status: pending
[...]
Instance status: starting container
Instance status: running
👌 Execution ID is: 55b66652-e11a-11e8-a36a-233ad251f4c1
👌 Streaming logs...
Using device: cuda
Epoch: 1. Training loss: 2.146302
Evaluation accuracy: 47.90 (max 0.00)
Best model found at epoch 1, with accurary 47.90
Epoch: 2. Training loss: 0.660179
Evaluation accuracy: 83.30 (max 47.90)
Best model found at epoch 2, with accurary 83.30
Epoch: 3. Training loss: 0.251717
Evaluation accuracy: 87.80 (max 83.30)
Best model found at epoch 3, with accurary 87.80
[...]
Epoch: 30. Training loss: 0.010750
Evaluation accuracy: 97.50 (max 98.10)
👌 Harvesting the output...
👌 Retrieving summary of measures (if present)...
{
"max_accuracy": 98.1,
"training_loss_at_max": 0.008485347032546997,
"epoch_at_max": 25,
"training_time": 43.3006055355072
}
👌 Execution succeeded.
👌 Retrieving the output...
le_net.pth
👌 Done and dusted.
From the above output, you'll see Plz do the following: it captures the files
in your source directory, builds a snapshot of your program, uploads the input,
requests an instance and waits for it to start, runs your program in a
container while streaming the logs, and finally retrieves the measures and the
output, as configured in plz.config.json.
You can be patient and wait until it finishes, or you can hit Ctrl+C and detach
early, leaving the program running:
Epoch: 9 Training loss: 0.330538
^C
👌 Your program is still running. To stream the logs, type:
plz logs ad96b586-89e5-11e8-a7c5-8142e2563487
Plz runs your commands in a Docker container, either in your AWS infrastructure
or on your local machine, so your actions in the terminal don't affect the
execution. If this is the only execution you're running, you can just type plz logs
and logs will be streamed from the current moment (unless you specify
--since=start
, which will tell it to stream from the start of execution).
The big hexadecimal number you see in the output, next to plz logs
, is the
execution ID you can use to refer to this execution. Plz remembers the last
execution that was started, and if you want to refer to that one you don't
need to include it in your command (you can just type plz logs
). But if you
need to specify the execution ID, you can do plz logs <execution-id>
.
Once your program has finished (or you've stopped it with plz stop
) you can
run plz output
, and it will download the files that your program has written.
In order to use this functionality, you need to tell your program to write to a
specific directory, which is provided to your program as an environment
variable. The files are saved under output/<execution-id>
by default, but you
can specify the location with the -p
option.
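For instance, for the execution shown above, running
plz output
would save the files under output/55b66652-e11a-11e8-a36a-233ad251f4c1, while something like
plz output -p trained_models
(with a directory name of your choice; trained_models is just an illustration) would save them under trained_models instead.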
The instance will be kept alive for some time (specified in plz.config.json)
in case you're running things interactively, so that you don't need to wait
for the instance to go through the startup process again.
You can use plz describe
to print metadata about an execution in JSON format.
It's useful to tell one execution from another if you have several running at
the same time.
You can use plz run --parameters a_json_file.json to pass parameters to your
program. Passing parameters this way has two advantages: the parameters are
stored with the execution's metadata, so you can query them later (see
plz history below); and you can use plz rerun --override-parameters some_json_file.json
to rerun exactly the same execution but with different parameters, which helps
with running experiments in a systematic fashion.
There's also plz history
, returning a JSON mapping from execution IDs to
metadata. If you write JSON files to a specific directory (see
test/end-to-end/measures/simple
) they will be available in the metadata. You
can store things you've measured during your experiment (for instance, training
loss). Parameters will be in the metadata as well, so you can transform the
metadata using, for instance, jq
, and find
out how your training loss changed as you changed your parameters.
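For instance, a parameters file for the PyTorch example might contain nothing
more than this (hypothetical contents, matching the learning_rate parameter
queried below):
{
  "learning_rate": 0.01
}
With parameters and measures both in the metadata, a query like the following
becomes possible: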
sergio@spaceship:~/plz/examples/pytorch$ plz history | \
jq 'to_entries[] | { "execution_id": .key,
"learning_rate": .value.parameters.learning_rate,
"accuracy": .value.measures.summary.max_accuracy }'
{
"execution_id": "dafcb478-e11e-11e8-9f2c-87dc520968d5",
"learning_rate": 0.01,
"accuracy": 98
}
{
"execution_id": "9cfd3f1a-e1cf-11e8-9449-b1cc03bcdb5f",
"learning_rate": 0.1,
"accuracy": 98.5
}
{
"execution_id": "c0d65d66-e1cf-11e8-8ed8-0d6f99ec4bc3",
"learning_rate": 0.5,
"accuracy": 13
}
In this example, you can see that increasing the learning rate from 0.01
to
0.1
gives you an improvement in accuracy from 98% to 98.5%, but further
increasing the learning rate leads to a disastrous decrease to 13%.
You can run plz list
to list the running executions, as well as any running
instances on AWS. It also shows the instance IDs. You can kill instances with
plz kill -i <instance-id>
.
The command plz last is useful, particularly when writing shell commands, to
get the ID of the last execution that was started.
We also make it easy to manage dependencies for projects using Anaconda.
Projects using the image prodoai/plz_ml-pytorch
need to have an
environment.yml
file, such as the one produced by conda env export (see
the one in the PyTorch example). This file
will be applied on top of
the environment in the image.
Installation of dependencies is cached, so dependencies are reinstalled only
the first time you run after changing the environment file.
Plz consists of a controller service and a command-line interface (CLI) that
issues requests to the controller. The CLI is a Python executable, plz
, which
takes instructions (such as plz run ...
) as described above.
There are two configurations of the controller that are ready for you to use: in one of them your jobs are run locally, while in the other one an AWS instance is started for each job. (Note: the controller itself can be deployed to the cloud, and if you're in a production environment that's the recommended way to use it, but we suggest you try the examples with a controller that runs locally first.)
When you have a directory with source code, you can just add a plz.config.json
file including information such as:
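For example, a minimal configuration might look roughly like this (the field
names below are illustrative rather than authoritative; check the
plz.config.json files under examples/ in this repository for the exact keys):
{
  "image": "prodoai/plz_ml-pytorch",
  "command": "python3 main.py",
  "input": "file://../data/mnist",
  "instance_max_uptime_in_minutes": 60
}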
Then, just typing plz run
will run the job for you, either locally or on AWS,
depending on the controller you've started.
Chances are that you have most of the supporting tools already installed, as
they are broadly used.
Install git and Python. On Ubuntu:
sudo apt install -y git python3 python3-pip python-pip
On macOS:
brew install git python
Install Docker. On Ubuntu:
sudo apt install -y curl
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker "$USER"
then start a new shell with sudo su - "$USER" so that it picks up the
membership to the docker group. On macOS:
brew cask install docker
Install docker-compose (for instance with pip install docker-compose). You
might want to make sure that pip installs the docker-compose command somewhere
in your PATH. On Ubuntu with the default Python installation, this is typically
$HOME/.local/bin (so you need the command
export PATH="${HOME}/.local/bin:${PATH}"
).
Clone this repository: git clone https://github.com/prodo-ai/plz, then cd plz.
Run ./install_cli, which calls pip3. As with docker-compose, you might want to
check that the plz command is in your PATH.
The first time you run the controller, it will take some time, as it downloads a
"standard" environment which includes Anaconda and PyTorch. When it's ready the
logs will show Harvesting complete. You can run plz commands now
.
The controller runs in the foreground, and can be killed with Ctrl+C. If you'd
like to run it in the background, append -d
to the command to run it in
"detached" mode.
If you've run the controller in the background, or if you lose your terminal, it
will carry on running. You can stop it with ./stop
.
Once you've set up your system as above, run:
./start/local-prebuilt
The controller can be stopped at any time with:
./stop
If you want to run the examples using AWS instances, be aware that this has a cost. By default, Plz uses t2.micro on-demand instances. You can find out how much these cost on the AWS EC2 Pricing page.
To start a controller that talks to AWS, you'll need to set up the AWS CLI
first: install it with pip install awscli, configure your credentials with
aws configure, and verify the setup by running aws iam get-user and checking
that your username is correct.
If you usually use AWS in a particular region, please edit
aws_config/config.json and set your region there. The default file sets the
region to eu-west-1 (Ireland).
Then run:
./start/aws-prebuilt
Unless you add "instance_max_uptime_in_minutes": null,
to your
plz.config.json
, all AWS instances you start terminate after 60 minutes.
That's on purpose, in case you're just trying the tool and something doesn't go
well (for example, there's a power cut). You can always use plz list
and
plz kill
before leaving your computer, to make sure that no instances remain. For
maximum assurance, we recommend checking the state of your
instances in the AWS console.
By default, Plz uses on-demand instances. In order to use spot instances, specify the following in your plz.config.json file:
{
...
"instance_market_type": "spot",
"max_bid_price_in_dollars_per_hour": <price>
}
The values in the example configuration files range from $0.5/hour to $2/hour (for GPU-powered machines).
In the directory examples/python
, there is a minimal example showing how to
run a program with Plz that handles input and output. Once you
have a working controller, running plz run
inside the directory will start the job.
In the directory examples/pytorch
, there's a full-fledged example for the task
of digit recognition using the classic LeNet approach and a subset of the
well-known MNIST dataset.
Anything related to Plz is in main.py
. In fact, the most relevant lines are the following:
def get_from_plz_config(key: str, non_plz_value: T) -> T:
configuration_file = os.environ.get('CONFIGURATION_FILE', None)
if configuration_file is not None:
with open(configuration_file) as c:
config = json.load(c)
return config[key]
else:
return non_plz_value
[...]
input_directory = get_from_plz_config(
'input_directory', os.path.join('..', 'data'))
output_directory = get_from_plz_config('output_directory', 'models')
parameters = get_from_plz_config('parameters', DEFAULT_PARAMETERS)
measures_directory = get_from_plz_config('measures_directory', 'measures')
summary_measures_path = get_from_plz_config(
'summary_measures_path',
os.path.join('measures', 'summary'))
This shows how to get the input data and parameters that Plz uploads for you.
There's a configuration file whose path is given by the environment variable
CONFIGURATION_FILE
. If that variable is present, you're running with Plz, and
you can read and parse the file as a JSON object. The object has the following
keys:
input_directory
is a directory where you'll find your input data. If you
have "input": "file://../data/mnist",
in your plz.config.json
file, the
directory config['input_directory']
will have the same contents that
../data/mnist
has locally.
output_directory
is a directory where you can write files. These are retrieved
via plz output
, or downloaded if you keep the CLI running until the end of
the job.
parameters
is the JSON object that you passed with
plz run --parameters a_json_file.json
, if you did. Otherwise it's an
empty object.
measures_directory
is a directory in which you can write measures. You can
query these with plz measures
. Each file is interpreted as a property in a
JSON object, using the file name as the key, and the file contents as the
value, interpreted as JSON. By writing the code:
with open(os.path.join(measures_directory, f'epoch_{epoch}'), 'w') as f:
json.dump({'training_loss': training_loss, 'accuracy': accuracy}, f)
You can then run:
sergio@spaceship:~/plz/examples/pytorch$ plz measures
{
"epoch_1": {
"training_loss": 2.1326301097869873,
"accuracy": 45.4
},
"epoch_2": {
[...]
}
}
summary_measures_path
is a path to a file in which you can write a JSON
object with a summary of the results you obtained in your run (best accuracy,
total training time, etc.). The summary is available via plz measures -s
,
and also printed by the CLI if you wait until the job finishes.
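As a rough sketch of what that can look like (this is not lifted verbatim from
the example; the path below is just the non-Plz fallback used above, and the
values are the numbers shown earlier):
import json
import os

# Path as provided by Plz via the configuration file; when running outside
# Plz, the example falls back to measures/summary.
summary_measures_path = os.path.join('measures', 'summary')
os.makedirs(os.path.dirname(summary_measures_path), exist_ok=True)
with open(summary_measures_path, 'w') as f:
    json.dump({
        'max_accuracy': 98.1,
        'epoch_at_max': 25,
        'training_time': 43.3,
    }, f)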
If you want to use CUDA for this example, we have provided an example configuration file for this purpose:
plz -c plz.cuda.config.json run
This tells Docker to use the CUDA runtime.
We built Plz following a set of guiding principles.
Plz is routinely used at prodo.ai
to train ML models on AWS, some of them
taking days to run on the most powerful instances available. We trust it to
start and terminate these instances as needed, and to manage our spot instances,
allowing us to get a much better price than if we were using on-demand instances
all the time.
In the future, Plz is intended to:
Run pip install pipenv to install pipenv. Run make environment to create the
virtual environments and install the dependencies. Run make check to run the
tests. For more information, take a look at the pipenv documentation.
See the CLI's README.rst.
export SECRETS_DIR="${PWD}/secrets"
make deploy
Do just as above, but put your secrets directory somewhere else (for example, another repository, this one private and encrypted).