TVM stack: exploring the incredible explosion of deep-learning frameworks and how to bring them together
(Image Source: http://tvmlang.org/)
TVM: End-to-End Optimization Stack for Deep Learning
This repo hosts my notes and tutorial materials (source code) for the TVM stack as I explore the incredible explosion of deep-learning frameworks and how to bring them together.
The number and diversity of specialized deep learning (DL) accelerators pose an adoption challenge.
Providing support in various DL frameworks for diverse hardware back-ends in the present ad-hoc fashion is unsustainable.
Hardware targets diverge significantly in memory organization, compute primitives, etc.
The Goal: easily deploy DL workloads to all kinds of hardware targets, including embedded devices, GPUs, FPGAs, and ASICs (e.g., the TPU).
Current DL frameworks rely on a computational graph intermediate representation to implement optimizations such as auto differentiation and dynamic memory management.
Graph-level optimizations are often too high-level to handle hardware back-end-specific operator transformations.
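As a concrete illustration, here is a minimal sketch of such graph-level rewriting, assuming a recent TVM build with the Relay front end (the paper era used the earlier NNVM graph IR; the tiny graph below is illustrative):

```python
import tvm
from tvm import relay

# A tiny dataflow graph: y = relu(x + 1) * 2
x = relay.var("x", shape=(1, 16), dtype="float32")
y = relay.multiply(relay.nn.relu(relay.add(x, relay.const(1.0))), relay.const(2.0))
mod = tvm.IRModule.from_expr(relay.Function([x], y))

# Graph-level passes: fold constants, then fuse the chain of
# element-wise (injective) operators into a single kernel
mod = relay.transform.FoldConstant()(mod)
mod = relay.transform.FuseOps(fuse_opt_level=2)(mod)
print(mod)  # add/relu/multiply now appear grouped in one fused function
```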
Current operator-level libraries that DL frameworks rely on are often too specialized and opaque
---> to be easily ported across hardware devices
To address these weaknesses, we need a compiler framework that can expose optimization opportunities across both the graph level and the operator level
---> to deliver competitive performance across hardware back-ends.
High-level dataflow rewriting:
Different hardware devices may have vastly different memory hierarchies.
Strategies that fuse operators and optimize data layouts are crucial for optimizing memory access.
Memory reuse across threads: modern GPUs and specialized accelerators have memory that can be shared across compute cores, so threads must cooperate on loads rather than work in isolation.
Tensorized compute intrinsics: the latest hardware provides instructions that go beyond vector operations (e.g., the TPU's GEMM unit or NVIDIA Volta's Tensor Cores), so computation must be broken down into tensor arithmetic intrinsics rather than scalar or vector code.
Latency Hiding: specialized accelerators favor leaner control and offload the hiding of memory-access latency to the compiler stack, which must schedule instructions explicitly.
Four categories of graph operators: injective (one-to-one maps, e.g., add), reduction (e.g., sum), complex-out-fusable (element-wise maps that can be fused to the output, e.g., conv2d), and opaque (cannot be fused, e.g., sort).
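At the operator/schedule level, fusing an injective producer into its consumer can be expressed with `compute_inline`. A minimal sketch, assuming TVM's `te` API (the element-wise ops are illustrative):

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")

# Injective producer and injective consumer
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
C = te.compute((n,), lambda i: B[i] + 1.0, name="C")

s = te.create_schedule(C.op)
# Fuse B into C so no intermediate buffer for B is ever materialized
s[B].compute_inline()
print(tvm.lower(s, [A, C], simple_mode=True))  # a single loop computing C
```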
Parallel programming is key to improving the efficiency of compute-intensive kernels in deep learning workloads.
Modern GPUs offer massive parallelism
---> requiring TVM to bake parallel programming models into schedule transformations
Most existing solutions adopt a parallel programming model referred to as nested parallel programs, which is a form of fork-join parallelism.
TVM uses a parallel schedule primitive to parallelize a data-parallel task.
When one worker thread cannot read the data of its siblings within the same parallel computation stage, this model is called shared-nothing nested parallelism.
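A minimal sketch of that model with the `parallel` schedule primitive, assuming the `te` API; each outer chunk is an independent task that never reads its siblings' data:

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

s = te.create_schedule(C.op)
# Shared-nothing data parallelism: split the loop and run chunks in parallel
xo, xi = s[C].split(C.op.axis[0], factor=64)
s[C].parallel(xo)
print(tvm.lower(s, [A, B, C], simple_mode=True))
```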
A better alternative to the shared-nothing approach is to fetch data cooperatively across threads.
TVM introduces the concept of memory scopes to the schedule space so that a compute stage can be marked as shared.
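A minimal sketch of memory scopes and cooperative fetching, assuming the `te` API and a CUDA-style target: `cache_read` creates a stage in GPU shared memory, and binding its axis to `threadIdx.x` makes the threads of a block fill the tile together. (The element-wise computation is only illustrative; the pattern pays off when a tile is reused by many threads, as in matmul.)

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
# A read stage of A that lives in GPU shared memory (memory scope "shared")
AA = s.cache_read(A, "shared", [B])

bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))

# Attach the shared load at the block level; the 64 threads of a block
# cooperatively fetch the tile before each computes its own outputs
s[AA].compute_at(s[B], bx)
s[AA].bind(s[AA].op.axis[0], te.thread_axis("threadIdx.x"))

print(tvm.lower(s, [A, B], simple_mode=True))
```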
TVM includes infrastructure to make profiling and autotuning easier on embedded devices.
Traditionally, targeting an embedded device for tuning requires cross-compiling on the host, manually copying the binary to the device, and timing the execution by hand.
TVM provides remote function call support. Through the RPC interface, we can compile a program on the host, upload it to the remote device, run the function remotely, and access the results in the same host-side script.
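A minimal sketch of that flow, assuming the `tvm.rpc` API; the ARM target triple, IP address, and port are placeholders for a real device running `python -m tvm.exec.rpc_server`:

```python
import numpy as np
import tvm
from tvm import te, rpc
from tvm.contrib import utils

# Declare and cross-compile a trivial kernel for the remote device
# (the triple below targets e.g. a Raspberry Pi; requires LLVM with ARM support)
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
func = tvm.build(s, [A, B], target="llvm -mtriple=armv7l-linux-gnueabihf", name="add_one")

# Pack the module and upload it over RPC instead of copying files by hand
tmp = utils.tempdir()
func.export_library(tmp.relpath("lib.tar"))
remote = rpc.connect("192.168.1.99", 9090)  # placeholder device address
remote.upload(tmp.relpath("lib.tar"))
rfunc = remote.load_module("lib.tar")

# Run on the device and collect timings from the same host-side script
dev = remote.cpu(0)
a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), dev)
b = tvm.nd.array(np.zeros(n, dtype="float32"), dev)
evaluator = rfunc.time_evaluator(rfunc.entry_name, dev, number=10)
print("mean time: %g s" % evaluator(a, b).mean)
```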