Neural Scene Flow Fields using pytorch-lightning, with potential improvements
Neural Scene Flow Fields using pytorch-lightning. This repo reimplements the NSFF idea, but modifies several operations based on observations of NSFF results and discussions with the authors. For discussion details, please see the issues of the original repo. The code is based on my previous implementation.
The main modifications are the following:
Top: Reference image. Center: Warped images, artifacts appear at boundaries. Bottom: Estimated disocclusion.
As training progresses, the disocclusion tends to get close to 1 almost everywhere, i.e. as if occlusion did not exist even under warping. In my opinion, this means the empty space learns to "move a little" to avoid the space occupied by dynamic objects (although the network has never been trained to do so).
Implementation details are in models/rendering.py.
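For intuition, here is a rough sketch (with hypothetical names; the actual logic lives in models/rendering.py) of how a predicted disocclusion weight can down-weight the photometric loss on a warped frame, with an l1 term that pulls the weight towards 1 so that the trivial all-zero solution is penalized:

import torch

def warped_photometric_loss(rgb_warped, rgb_gt, disocc_weight, beta=0.1):
    # rgb_warped: (N_rays, 3) color rendered after warping into the neighboring frame
    # rgb_gt: (N_rays, 3) ground-truth color of that neighboring frame
    # disocc_weight: (N_rays, 1) predicted confidence in [0, 1]; low where the region is disoccluded
    photo = (disocc_weight * (rgb_warped - rgb_gt) ** 2).mean()
    reg = beta * (1.0 - disocc_weight).abs().mean()  # pull weights towards 1
    return photo + reg

Under such a formulation, weights close to 1 everywhere (as observed above) mean the model claims the warped colors are reliable almost everywhere.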
The implementation is verified on several sequences and produces visually plausible results. Qualitatively, these modifications produce better results on the kid-running scene compared to the original repo.
Left: GT. Center: this repo (PSNR=35.02). Right: pretrained model of the original repo (PSNR=30.45).
Left: this repo. Right: pretrained model of the original repo (by setting raw_blend_w to 0).
Left: this repo. Right: pretrained model of the original repo.
Left: this repo. Right: pretrained model of the original repo.
Left: this repo. Right: pretrained model of the original repo. The 2nd and 3rd rows are the 0th and 29th frames, shown to highlight the difference in the background.
The colors of our method are more vivid and closer to the GT images both qualitatively and quantitatively (not because of gif compression). Also, even without any kind of supervision (either direct or self supervision), the network learns to separate the foreground and the background more cleanly than the original implementation, which is unexpected! Bad fg/bg separation not only means that the background actually changes every frame, but also that color information is not leveraged across time, so the reconstruction quality degrades, as can be seen in the original NSFF result towards the end.
Our method also produces smoother depths, although this might not have a direct impact on image quality.
Top left: static depth from this repo. Top right: full depth from this repo.
Bottom left: static depth from the original repo. Bottom right: full depth from the original repo.
git clone --recursive https://github.com/kwea123/nsff_pl
Run conda create -n nsff_pl python=3.7 to create a conda environment, and activate it with conda activate nsff_pl.
Install the required libraries with pip install -r requirements.txt.
Install cupy via pip install cupy-cudaxxx, replacing xxx with your CUDA version.
Create a root directory (e.g. foobar), create a folder named frames and prepare your images (it is recommended to have at least 30 images) under it, so the structure looks like:
└── foobar
└── frames
├── 00000.png
...
└── 00029.png
The image names can be arbitrary, but their lexical order must be the same as the time order! E.g. you can name the images a.png, c.png, dd.png, but the time order must be a -> c -> dd.
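As a small illustration of the lexical-order requirement (hypothetical snippet, not part of the repo), the frame order is recovered simply by sorting the file names as strings:

import os

frames = sorted(os.listdir("foobar/frames"))  # lexical (string) sort
# e.g. ["00000.png", "00001.png", ..., "00029.png"]
# with names like ["a.png", "c.png", "dd.png"], the sorted order a -> c -> dd
# must coincide with the temporal order of the capture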
In order to correctly reconstruct the camera poses, we must first filter out the dynamic areas so that feature points in these areas are not matched during estimation.
I use Mask R-CNN from detectron2. Only semantic masks are used, as I find flow-based masks too noisy.
Install detectron2 with python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.8/index.html.
Modify the DYNAMIC_CATEGORIES variable in third_party/predict_mask.py to the dynamic classes in your data (only COCO classes are supported).
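As a hypothetical example (the actual default in the repo may differ), DYNAMIC_CATEGORIES could be set to the COCO class names that should be masked out before feature matching:

# third_party/predict_mask.py (illustrative value only)
DYNAMIC_CATEGORIES = ["person", "dog", "bicycle"]  # COCO classes treated as dynamic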
Next, NSFF requires depth and optical flows. We'll use some SOTA methods to perform the prediction.
The instructions and code are borrowed from DPT.
Download the model weights from here and put them in third_party/depth/weights/.
The instructions and code are borrowed from RAFT.
Download raft-things.pth
from google drive and put it in third_party/flow/models/
.
Thanks to owang, after preparing the images and the model weights, we can automate the whole process with a single command: python preprocess.py --root_dir <path/to/foobar>.
Finally, your root directory will have all of this:
└── foobar
├── frames (original images, not used, you can delete)
│ ├── 00000.png
│ ...
│ └── 00029.png
├── images_resized (resized images, not used, you can delete)
│ ├── 00000.png
│ ...
│ └── 00029.png
├── images (the images to use in training)
│ ├── 00000.png
│ ...
│ └── 00029.png
├── masks (not used but do not delete)
│ ├── 00000.png.png
│ ...
│ └── 00029.png.png
├── database.db
├── sparse
│ └── 0
│ ├── cameras.bin
│ ├── images.bin
│ ├── points3D.bin
│ └── project.ini
├── disps
│ ├── 00000.png
│ ...
│ └── 00029.png
├── flow_fw
│ ├── 00000.flo
│ ...
│ └── 00028.flo
└── flow_bw
├── 00001.flo
...
└── 00029.flo
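For reference, the .flo files are assumed here to follow the standard Middlebury optical flow format; a minimal numpy reader sketch (not part of the repo) looks like:

import numpy as np

def read_flo(path):
    with open(path, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)[0]
        assert np.isclose(magic, 202021.25), "not a valid .flo file"
        w = int(np.fromfile(f, np.int32, count=1)[0])
        h = int(np.fromfile(f, np.int32, count=1)[0])
        data = np.fromfile(f, np.float32, count=2 * w * h)
    return data.reshape(h, w, 2)  # [..., 0] = horizontal flow, [..., 1] = vertical flow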
Now you can start training!
Run the following command (modify the parameters according to opt.py):
python train.py \
--dataset_name monocular --root_dir $ROOT_DIR \
--img_wh 512 288 --start_end 0 30 \
--N_samples 128 --N_importance 0 --encode_t --use_viewdir \
--num_epochs 50 --batch_size 512 \
--optimizer adam --lr 5e-4 --lr_scheduler cosine \
--exp_name exp
I also implemented a hard sampling strategy to improve the quality of the hard regions. Add --hard_sampling to enable it.
Specifically, I compute the SSIM between the prediction and the GT at the end of each epoch, and use 1-SSIM as the sampling probability for the next epoch. This allows rays with larger errors to be sampled more frequently, which improves the result. I chose SSIM because it better reflects visual quality and is less sensitive to noise or small pixel displacements than PSNR.
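A minimal sketch of this idea (hypothetical names, not the repo's exact code): turn the per-ray 1-SSIM errors into a sampling distribution and draw the next epoch's ray batches from it.

import torch

def update_sampling_probs(ssim_per_ray):
    # ssim_per_ray: (N_rays,) SSIM of each rendered pixel w.r.t. the GT, in [0, 1]
    error = (1.0 - ssim_per_ray).clamp(min=1e-8)  # larger error -> higher probability
    return error / error.sum()

# in the next epoch, sample ray indices according to these probabilities, e.g.
# probs = update_sampling_probs(ssim_per_ray)
# ray_ids = torch.multinomial(probs, num_samples=512, replacement=True)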
See test.ipynb for scene reconstruction, scene decomposition, fix-time-change-view, ..., etc. You can get almost everything out of this notebook. I will add more instructions inside in the future.
Use eval.py to create the whole sequence of moving views. E.g.
python eval.py \
--dataset_name monocular --root_dir $ROOT_DIR \
--N_samples 128 --N_importance 0 --img_wh 512 288 --start_end 0 30 \
--encode_t --output_transient \
--split test --video_format gif --fps 5 \
--ckpt_path kid.ckpt --scene_name kid_reconstruction
More specifically, the split argument specifies which novel view to generate:
test: test on training poses and times
test_spiral: spiral path over the whole sequence, with time gradually advancing (integer times for now)
test_spiralX: fix the time to X and generate a spiral path around training view X
test_fixviewX_interpY: fix the view to training pose X and interpolate the time from start to end, adding Y frames between each pair of integer timestamps
Thanks to the authors of the NSFF paper, owang, zhengqili and sniklaus, for fruitful discussions and support!