Main repository for LipReading with Deep Neural Networks
The goal is to implement LipReading: similar to how end-to-end speech recognition systems map high-fidelity speech audio to sensible character- and word-level outputs, we will do the same for "speech visuals". In particular, we will take video frame input and extract the relevant mouth/chin signals to map to characters and words.
A high-level overview of some TODO items follows. For more project details, please see the GitHub project.
There are two primary interconnected pipelines: a "vision" pipeline that extracts face and lip features from video frames, and an "NLP-inspired" pipeline that temporally correlates the sequential lip features into the final output.
Here's a quick dive into tensor dimensionalities:

```
Video -> Frames       -> Face Bounding Box Detection       -> Face Landmarking Repr.
         (n, y, x, c) -> (n, (box=1, y_i, x_i, w_i, h_i))  -> (n, (idx=68, y, x))

      -> Letters      -> Words     -> Language Model
         (chars,)     -> (words,)  -> (sentences,)
```
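As a concrete illustration of the vision-pipeline shapes above, here is a minimal sketch using numpy with made-up sizes (the specific values of `n`, `y`, `x`, `c` are hypothetical):

```python
import numpy as np

# Hypothetical sizes: n video frames of height y, width x, with c color channels.
n, y, x, c = 75, 256, 256, 3

frames = np.zeros((n, y, x, c), dtype=np.uint8)      # raw frame tensor
boxes = np.zeros((n, 1, 4), dtype=np.float32)        # one (y, x, w, h) face box per frame
landmarks = np.zeros((n, 68, 2), dtype=np.float32)   # 68 (y, x) landmark points per frame

assert frames.shape == (75, 256, 256, 3)
assert boxes.shape == (75, 1, 4)
assert landmarks.shape[1:] == (68, 2)
```

Each stage only narrows the representation: raw pixels, then a single bounding box per frame, then 68 landmark coordinates per frame.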
- all: 926 videos (projected, not generated yet)
- large: 464 videos (failed at 35/464)
- medium: 104 videos (currently at 37/104)
- small: 23 videos
- micro: 6 videos
- nano: 1 video
First, clone the repository and (optionally) set up a virtual environment:

```
git clone [email protected]:joseph-zhong/LipReading.git
# (optional) set up a venv
cd LipReading
python3 -m venv .
```

Before running the Python scripts, set your PYTHONPATH and a workspace environment variable to take advantage of the standardized directory utilities in ./src/utils/utility.py. Copy the following into your shell profile (e.g. ~/.bashrc):

```
export PYTHONPATH="$PYTHONPATH:/path/to/LipReading/"
export LIP_READING_WS_PATH="/path/to/LipReading/"
```
Next, install the dependencies, including SpaCy and others.
On macOS, for CPU capabilities only:

```
pip3 install -r requirements.macos.txt
```

On Ubuntu, for GPU support:

```
pip3 install -r requirements.ubuntu.txt
```
We need to install a pre-built English model for some SpaCy capabilities:

```
python3 -m spacy download en
```
This allows us to have a simple standardized directory structure for all our datasets, raw data, model weights, logs, etc.
```
./data/
  /datasets   (numpy dataset files for dataloaders to load)
  /raw        (raw caption/video files extracted from online sources)
  /weights    (model weights, both for training/checkpointing/running)
  /tb         (Tensorboard logging)
  /...
```

See ./src/utils/utility.py for more.
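The utilities in ./src/utils/utility.py build on this directory convention; here is a minimal sketch of how such a helper might resolve paths under the LIP_READING_WS_PATH workspace root (the helper name `ws_path` is hypothetical, not the repo's actual API):

```python
import os

def ws_path(*parts):
    """Join path components under the workspace root given by LIP_READING_WS_PATH."""
    root = os.environ["LIP_READING_WS_PATH"]
    return os.path.join(root, *parts)

# Example usage (normally the variable is exported in your shell profile).
os.environ.setdefault("LIP_READING_WS_PATH", "/path/to/LipReading")
weights_dir = ws_path("data", "weights")
print(weights_dir)
```

With a helper like this, scripts never hard-code absolute paths, so the repo can live anywhere on disk.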
Now that the dependencies are all setup, we can finally do stuff!
Each of our "standard" scripts in ./src/scripts (i.e. not ./src/scripts/misc) takes the standard arguments. For each of the "standard" scripts, you can pass --help to see the expected arguments.
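A sketch of how one of these scripts might declare its arguments with argparse (the flag names below are illustrative only; run a real script with --help to see its actual interface):

```python
import argparse

def make_parser():
    """Build a parser for a hypothetical 'standard' script."""
    parser = argparse.ArgumentParser(description="Example standard-script arguments.")
    # Illustrative flags, not the repo's actual arguments.
    parser.add_argument("--dataset", default="nano", help="dataset split to use")
    parser.add_argument("--batch_size", type=int, default=4, help="training batch size")
    return parser

args = make_parser().parse_args(["--dataset", "micro"])
print(args.dataset)      # micro
print(args.batch_size)   # 4
```

argparse generates the --help text automatically from the declarations above.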
To maintain reproducibility, cmdline arguments can be written in a raw text file with one argument per line.
These config files represent the arguments to pass to ./src/scripts/generate_dataview.py, automatically passable via:

```
./src/scripts/generate_dataview.py $(cat ./config/gen_dataview/nano)
```
The arguments are consumed in left-to-right order, so if an argument is repeated, the later setting overrides the earlier one. This allows for modularity in configuring hyperparameters.
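This left-to-right override behavior matches how argparse treats repeated options, which is what lets layered config files compose; a small sketch (the --lr flag is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, help="learning rate")

# Simulate two config files expanded left to right: a base config, then an override.
base = ["--lr", "0.1"]
override = ["--lr", "0.01"]
args = parser.parse_args(base + override)
print(args.lr)  # 0.01 -- the later setting wins
```

So `$(cat base_config) $(cat override_config)` behaves like an ordered merge of the two files' settings.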
(For demonstration purposes, not a working example:)

```
./src/scripts/train.py \
  $(cat ./config/dataset/large) \
  $(cat ./config/train/model/small-model) \
  $(cat ./config/train/model/rnn/lstm) \
  ...
```
```
./src/scripts/train_model.py $(cat ./config/train/micro)
```
This is a collection of external links, papers, projects, and otherwise potentially helpful starting points for the project.