An official TensorFlow implementation of "Neural Program Synthesis from Diverse Demonstration Videos" (ICML 2018) by Shao-Hua Sun, Hyeonwoo Noh, Sriram Somasundaram, and Joseph J. Lim
This project is a TensorFlow implementation of Neural Program Synthesis from Diverse Demonstration Videos, published at ICML 2018. We provide code and checkpoints for our model and all baselines presented in the paper. We also provide scripts and code for generating datasets, as well as the datasets we used to train and test all models.
As interpreting decision making logic in demonstration videos is key to collaborating with and mimicking humans, our goal is to empower machines with this ability. To this end, we propose a neural program synthesizer that is able to explicitly synthesize underlying programs from behaviorally diverse and visually complicated demonstration videos, as illustrated in the following figure.
We introduce a summarizer module as part of our model to improve the network's ability to integrate multiple demonstrations varying in behavior. We also employ a multi-task objective to encourage the model to learn meaningful intermediate representations for end-to-end training. Our proposed model consists of three components: a demonstration encoder, a summarizer module, and a program decoder.
The overall architecture is illustrated below. For more details, please refer to the paper.
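To make the data flow through the three components concrete, here is a minimal NumPy sketch, not the actual implementation: all shapes, parameter names, and the pooling/greedy-decoding choices are illustrative stand-ins for the learned CNN/LSTM modules described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: k demos, T frames each, frame feature dim, hidden dim,
# program token vocabulary, and max program length.
K, T, D_FRAME, D_HID, VOCAB, MAX_LEN = 5, 10, 16, 32, 50, 8

# Randomly initialized parameters; a real model learns these end-to-end.
W_enc = rng.standard_normal((D_FRAME, D_HID)) * 0.1
W_sum = rng.standard_normal((D_HID, D_HID)) * 0.1
W_dec = rng.standard_normal((D_HID, VOCAB)) * 0.1

def encode_demo(frames):
    """Demonstration encoder: map one demo (T, D_FRAME) to a vector (D_HID,)."""
    hidden = np.tanh(frames @ W_enc)   # per-frame features
    return hidden.mean(axis=0)         # temporal pooling stands in for an LSTM

def summarize(demo_vectors):
    """Summarizer: aggregate k per-demo vectors into one summary vector."""
    return np.tanh(demo_vectors.mean(axis=0) @ W_sum)

def decode_program(summary):
    """Program decoder: emit MAX_LEN token ids from the summary vector."""
    logits = summary @ W_dec
    return np.full(MAX_LEN, int(np.argmax(logits)))  # greedy, non-autoregressive toy

demos = rng.standard_normal((K, T, D_FRAME))
demo_vecs = np.stack([encode_demo(d) for d in demos])
tokens = decode_program(summarize(demo_vecs))
print(tokens.shape)  # (8,)
```

The key structural point is that the summarizer collapses a variable number of per-demonstration vectors into a single fixed-size representation, so the decoder is independent of how many demonstrations were observed.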
Our method is evaluated on a fully observable, third-person environment (Karel environment) and a partially observable, egocentric game (ViZDoom environment). We show that our model is able to reliably synthesize underlying programs as well as capture diverse behaviors exhibited in demonstrations.
*This code is still being developed and subject to change.
The structure of the repository:
./karel_env/generate_dataset.sh
Default arguments are identical to the settings described in the paper.
./vizdoom_env/generate_dataset.sh
python trainer.py --model full --dataset_path /path/to/the/dataset/ --dataset_type [karel/vizdoom]
python trainer.py --model summarizer --dataset_path /path/to/the/dataset/ --dataset_type [karel/vizdoom]
python trainer.py --model synthesis_baseline --dataset_path /path/to/the/dataset/ --dataset_type [karel/vizdoom]
python trainer.py --model induction_baseline --dataset_path /path/to/the/dataset/ --dataset_type [karel/vizdoom]
Useful training options:
- Set `--debug True` to see debugging visualizations (LSTM masks, etc.)
- Choose `--dataset_type` from `karel` and `vizdoom`. You can also add your own datasets.
- Set `--lr_weight_decay True` to perform exponential weight decay on the learning rate.
- Set `--scheduled_sampling True` to train models with scheduled sampling.
python evaler.py --model [full/synthesis_baseline/summarizer/induction_baseline] --dataset_path /path/to/the/dataset/ --dataset_type [karel/vizdoom] [--train_dir /path/to/the/training/dir/ OR --checkpoint /path/to/the/trained/model]
Because the demonstrations in ViZDoom are usually very long, using a large batch size is difficult and training becomes very slow. To circumvent this issue, we used two-stage training: we first pretrain our model with short demonstrations and then finetune it on the whole dataset. We used a batch size of 32 for the first stage and a batch size of 8 for the second stage. Here are links to the datasets we used for the first and second stages of training:
To reproduce our results, you can train and evaluate models with the following commands:
python trainer.py --model full --dataset_path path_to_vizdoom_shorter --dataset_type vizdoom --num_k 25 --batch_size 32
python trainer.py --model full --dataset_path path_to_vizdoom_full --dataset_type vizdoom --num_k 25 --batch_size 8 --checkpoint path_to_1st_step_checkpoint
For evaluation, use the following command:
python evaler.py --model full --dataset_path path_to_vizdoom_full --dataset_type vizdoom --num_k 25 --checkpoint path_to_2nd_step_checkpoint
Methods | Execution | Program | Sequence |
---|---|---|---|
Induction baseline | 62.8% | - | - |
Synthesis baseline | 64.1% | 42.4% | 35.7% |
+ summarizer (ours) | 68.6% | 45.3% | 38.3% |
+ multi-task loss (ours-full) | 72.1% | 48.9% | 41.0% |
To verify the effectiveness of our proposed summarizer module, we conduct experiments where models are trained on varying numbers of demonstrations (k) and compare the execution accuracy.
Methods | k=3 | k=5 | k=10 |
---|---|---|---|
Synthesis baseline | 58.5% | 60.1% | 64.1% |
+ summarizer (ours) | 60.6% | 63.1% | 68.6% |
Methods | Execution | Program | Sequence |
---|---|---|---|
Induction baseline | 35.1% | - | - |
Synthesis baseline | 48.2% | 39.9% | 33.1% |
Ours-full | 78.4% | 62.5% | 53.2% |
To verify the importance of inferring underlying conditions, we perform evaluation only with programs containing a single if-else statement with two branching consequences. This setting is sufficiently simple to isolate other diverse factors that might affect the evaluation result.
Methods | Execution | Program | Sequence |
---|---|---|---|
Induction baseline | 26.5% | - | - |
Synthesis baseline | 59.9% | 44.4% | 36.1% |
Ours-full | 89.4% | 69.1% | 58.8% |
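To see why inferring the underlying condition requires behaviorally diverse demonstrations, consider this toy illustration in Python (the predicate and action names are made up for illustration; this is not the paper's Karel DSL or ViZDoom setup):

```python
# A toy "program with a single if-else": each demonstration reveals only the
# branch it actually takes, so a single demo is ambiguous.

def program(state):
    # if front_is_clear: move else: turnLeft
    if state["front_is_clear"]:
        return "move"
    return "turnLeft"

demo_a = {"front_is_clear": True}    # reveals only the true branch
demo_b = {"front_is_clear": False}   # reveals only the else branch

actions = [program(s) for s in (demo_a, demo_b)]
print(actions)  # ['move', 'turnLeft']
```

Seen alone, `demo_a` is indistinguishable from the unconditional program that always executes `move`; only observing both branches pins down the if-else structure and its condition, which is what this evaluation setting isolates.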
The baseline models and our model, trained with 25 seen demonstrations, are evaluated with fewer or more seen demonstrations.
To reproduce our results, you can download our datasets and checkpoints.
While we provide scripts and code for generating customized datasets, we also make available the datasets we used to train and test our models and baselines.
We provide checkpoints and evaluation report files of our models and baselines for all experiments.
If you find this useful, please cite
@inproceedings{sun2018neural,
title = {Neural Program Synthesis from Diverse Demonstration Videos},
author = {Sun, Shao-Hua and Noh, Hyeonwoo and Somasundaram, Sriram and Lim, Joseph},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},
year = {2018},
}
Shao-Hua Sun*, Hyeonwoo Noh*, Sriram Somasundaram, and Joseph J. Lim
(*Equal contribution)