Apache TVM Versions

Open deep learning compiler stack for CPU, GPU, and specialized accelerators

v0.10.0

1 year ago

Introduction

The TVM community has worked since the v0.9 release to deliver the following exciting new improvements!

  • Metaschedule
    • Software pipelining and padding for irregular shapes for auto tensorization
    • Stabilized and polished user interfaces (e.g. database changes, tune_relay); see the sketch below
    • A new MLP-based cost model
  • TIR
    • New schedule primitive for PadEinsum
    • A new TIR node: DeclBuffer
    • INT8 Intrinsics for TensorCores for CUDA!
  • microTVM
    • Improved schedule primitives for the Armv8-M ISA

And many other general improvements to code quality, TVMScript, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.
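
As a concrete illustration of the stabilized user interface mentioned above, the sketch below tunes a single TIR matmul with MetaSchedule and then queries the resulting database for the best schedule. This is a minimal sketch, not code from the release notes: the `ms.tir_integration.tune_tir` / `compile_tir` entry points, their keyword arguments, and the target string are assumptions about the post-refactor v0.10 API and may be spelled differently in other versions.

```python
import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T


@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("matmul"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


target = tvm.target.Target("llvm -num-cores 4")
# Tuning populates a database of measured candidate schedules (assumed entry point).
database = ms.tir_integration.tune_tir(
    mod=matmul,
    target=target,
    work_dir="./ms_work_dir",
    max_trials_global=64,
)
# The database is then queried to build the best schedule found for this workload.
sch = ms.tir_integration.compile_tir(database, matmul, target)
print(sch.mod.script())
```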

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

Note that this list does not cover all PRs and discussions since v0.9. A non-truncated summary can be found here: https://github.com/apache/tvm/issues/12979

TIR

  • #12720 - [TIR] Implement API for padded layout transformations
  • #12797 - [TIR] Construct the inverse in SuggestIndexMap
  • #12827 - [TIR] Support pattern matching argmax/argmin generated by TOPI
  • #12750 - [TIR, Schedule] Add schedule primitive PadEinsum
  • #11639 - [TIR][Meta-Schedule] Tuple-reduction scheduling support
  • #12515 - [TIR][Arith] Add more strict checking in imm construction and folding.
  • #12717 - [TIR, Schedule] Check consumer in-bound and covered in reverse_compute_inline
  • #12652 - [TIR] Handle axis_separators during FlattenBuffer
  • #12623 - [TIR] Expose MMA-related PTX builtins
  • #12607 - [TIR][Schedule] enhance compute_at and reverse_compute_at primitive to choose possible position ...

v0.9.0

1 year ago

Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

  • MetaSchedule's full implementation
  • Cascading scheduler for Arm(R) Ethos(TM)-U NPUs
  • Collage which brings tuning to BYOC
  • Several microTVM improvements
  • New tvm.relay.build parameters: runtime= and executor=
  • AOT - Support for the C++ runtime (with the llvm and c targets only) and for host-driven AOT in the C runtime
  • Hexagon RPC support
    • Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
    • AOT and USMP support
    • Threading
    • Initial op support
  • MLF - Support for multiple modules in a single MLF artifact
  • Several TIR schedule primitives and transforms including (abridged):
    • schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap (a usage sketch follows this list).
    • schedule.transform_block_layout - Applies a layout transformation to a block as specified by an IndexMap.
    • schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
    • transform.InjectSoftwarePipeline - Transforms an annotated loop nest into a pipeline prologue, body, and epilogue in which producers and consumers are overlapped.
    • transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
    • transform.InjectPTXAsyncCopy - Rewrites global-to-shared memory copies in CUDA as async copies when annotated with tir::attr::async_scope.
    • transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
  • And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
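
Below is a minimal sketch of the schedule.transform_layout primitive described in the list above (not code from the release notes). The tuple-style buffer selector ("write", 0) and the lambda-based IndexMap are assumptions about the v0.9-era schedule API and may differ slightly between versions.

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def copy(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (64, 64), "float32")
    B = T.match_buffer(b, (64, 64), "float32")
    for i, j in T.grid(64, 64):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]


sch = tvm.tir.Schedule(copy)
block = sch.get_block("copy")
# Rewrite the layout of the block's first written buffer into 16x16 tiles.
sch.transform_layout(block, ("write", 0), lambda i, j: (i // 16, j // 16, i % 16, j % 16))
print(sch.mod.script())
```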

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list does not cover all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.8.0...v0.9.0.rc0.

AOT

  • #11208 - Calculate used memory at the callsite of primitive functions
  • #11365 - Fix function number datatype from char to uint16_t
  • #11091 - Enable A-Normal Form in the AOT executor
  • #10753 - Support LLVM backend with C++ runtime
  • #10518 - Use python temporary directory for AOT tests
  • #10337 - BugFix of workspace calculation
  • #10282 - [runtime] Add Metadata classes for AOTExecutor
  • #9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
  • #9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
  • #9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

  • #11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
  • #11144 - Remove support for run-time linked-params from codegen
  • #10590 - Add order to functions in C Codegen
  • #11638 - [DNNL][CBLAS] Unifies all MKLDNN/DNNL to DNNL
  • #11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
  • DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
  • TensorRT - #11923, #11203, #10759, #10772, #10388
  • CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
  • OpenCLML - #10243
  • CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571
  • CUDNN - #10997, #9986, #9948
  • ACL - #10801
  • PTX - #10855, #10339, #9909
  • CUBLAS - #10826, #10820

CI

  • #11313 - Refactor of tvm.testing.requires_* annotations
  • #11666 - Enable pylint for tests/python/ci
  • #11657 - Apply linting rules to AOT tests
  • #11380 - Restructure Jenkinsfile
  • Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934
  • User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556
  • Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359
  • Code cleanups - #10968, #10740

Frontends

  • PaddlePaddle - #11537, #9724, #9564
  • TFLite - #10915, #10566
  • Oneflow - #11321, #11036, #8790
  • PyTorch - #11190, #10504, #10184, #10091
  • ONNX - #10949, #9438, #9186, #9493, #9475
  • Keras - #7006

Hexagon

  • #11549 - Initial clip operator for Hexagon
  • #11834 - Add op resize2d for hexagon
  • #11559 - Softmax slice op initial version
  • #11529 - Slice ops added - add, subtract, multiply
  • #11720 - [testing] add max_pool2d benchmark
  • #11417 - Implement avg_pool2d slice op
  • #11653 - Add HexagonThreadManager
  • #11547 - Run single RPC server on Android in each testing session
  • #11490 - [testing] add TVMScript elemwise-add
  • #11400 - [testing] refactor benchmark-table code
  • #11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures
  • #11319 - Add unit tests for Hexagon Device API
  • #11279 - Add USMP tests
  • #11283 - Update Readme
  • #11239 - capture gtest output and return over FFI
  • #11175 - Add schedule and test for conv2d_transpose_nchw
  • #11018 - [Runtime] Add QuRT thread pool backend
  • #11145 - Add support for on-device unit testing using gtest
  • #11138 - Add test for depthwise conv2d schedule
  • #11016 - Add test for registered schedules
  • #11104 - Add mobilenet test
  • #11090 - Delete offload runtime, move files to right places
  • #11065 - AoT with LLVM Codegen on Hexagon
  • #11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON
  • #10604 - HVX scheduling and bench-marking of TE element-wise add
  • #10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout
  • #10907 - Move aot/graph_executor interactions into launcher
  • #10919 - Register basic strategies and schedules for common operators
  • #10904 - Add unit tests executing 2-d VTCM usage
  • #10910 - Refactor to keep HexagonBuffer private to the device api
  • #10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU
  • #10878 - Generalized HexagonBuffer::CopyTo/CopyFrom
  • #10846 - Support both 1-d and 2-d VTCM allocations
  • #10581 - Improved ergonomics of HexagonLauncher in unit tests.
  • #10616 - Refactor tvm.contrib.hexagon, NFC
  • #10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake
  • #10586 - Codegen for 2d Load/Store
  • #10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon
  • #10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.
  • #10507 - Add doc on TVM - Hexagon RPC flow
  • #10520 - Resolve breakage in test_hexagon/test_cache_read_write
  • #10311 - [runtime]AOTExecutor implementation for C Codegen
  • #10454 - Allow execution on target or simulator from HexagonLauncher
  • #10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize
  • #10361 - RPC server/client for simulator
  • #10302 - [CI]Add Hexagon Tests to pipeline
  • #10263 - [Docker]Add docker file and scripts
  • #10227 - Refactor Hexagon.cmake
  • #10217 - Adding support for Hexagon User DMA Engine
  • #10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc
  • #9970 - Do not auto-build apps when building TVM
  • #9736 - Add unit tests for HexagonBuffer
  • #9525 - Add Hexagon VTCM and discontiguous allocation support
  • #9631 - Add RPC Mechanism for Hexagon
  • #9473 - cleanup Hexagon conv2d tests

MetaSchedule

  • #11884 - Postproc: Rewrite-Layout
  • #11848 - [OpStrategy] Support MetaSchedule Layout
  • #11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite
  • #11758 - [Runtime] Enhance Runner RandomFill
  • #11683 - Distributed Measurement
  • #11751 - [Minor] Organize Testing Scripts
  • #11735 - Modify Profiler Timers
  • #11727 - Developer Ergonomics Enhancement II
  • #11692 - Apply-History-Best Task Filtering
  • #11486 - Add Profiler Support For Tuning Efficiency Optimization
  • #11680 - JSONDatabase Utilities
  • #11641 - Generate MetaSchedule Dataset
  • #11622 - Developer Ergonomics Enhancement
  • #11604 - Resolve dependencies between header files
  • #11587 - Add Testing Script with ONNX Support
  • #11590 - Evo Independence from TaskScheduler
  • #11534 - No explicit unrolling for spatial PrimFunc
  • #11512 - Enable Task Filtering
  • #11177 - AutoBind rule and MutateThreadBinding
  • #11157 - Logging Interface Unification
  • #11088 - Auto tensorization for CPU / GPU dot product
  • #10986 - [Refactor] Introduce TuneConfig
  • #11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header
  • #10927 - [Refactor] Clarify Integration Logic
  • #10876 - Add utility API to ease using manual schedules
  • #10885 - [BugFix] Fix skipped tests
  • #10366 - Add Gradient Based Task Scheduler
  • #10823 - Fine-Grained Rewrite Unbound Block
  • #10793 - Add demonstration of selectively tuning relay ops with TIR schedules
  • #10811 - Support grouping in the cost model
  • #10810 - Extract task weights during task extraction
  • #10782 - [TIR]Estimate TIR FLOPs
  • #10776 - Misc updates for tuning end-to-end workloads
  • #10689 - Upstream the leftover changes
  • #10648 - [Meta Schedule] Refactor meta schedule testing utils
  • #10578 - New relay backend for meta schedule task extraction
  • #10534 - Bug Fix for Relay Integration
  • #10501 - Update scripts for subgraph tuning
  • #10497 - Refactor testing workloads
  • #10461 - Enable AutoTVM-style template-based search space
  • #10368 - Fix Cyclic Dependency in PyClass Family
  • #10403 - Arithmetic analysis
  • #10367 - Update Tuning Interfaces.
  • #10079 - [M4a] User-API: Tune-TE/TIR/Relay
  • #10081 - [M4a] Rewrite-Cooperative-Fetch
  • #10055 - [M4b] Testcases for TensorRT builder/runner
  • #10092 - [M4a] Mutator: Mutate-Tile-Size
  • #10096 - [M4a] Mutator: Mutate Parallel
  • #10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll
  • #10043 - [M4a] Schedule Rule: Multi-Level-Tiling
  • #10045 - Mutator: Mutate-Unroll
  • #10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll
  • #10027 - [M4a] PostProcessor: Rewrite-Unbound-Block
  • #10028 - Mutator: Mutate-Compute-Location
  • #9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop
  • #9994 - [M4a] Schedule Rule: Cross-Thread-Reduction
  • #10013 - [M4a] PostProcessor: Rewrite Reduction Block
  • #9975 - [M4a] Schedule Rule: Add-RFactor
  • #9945 - [M4a] PostProcessor: Verify-GPU-Code
  • #9940 - [M4a] Schedule Rule: Random-Compute-Location
  • #9943 - [M4a] Schedule Rule: Auto-Inline
  • #9860 - [M3c] Add Per-Store-Feature
  • #9859 - [M3c] XGB-based Cost Model
  • #9836 - [M4a] Add EvolutionarySearch Search Strategy
  • #9799 - [M4a] Add ReplayFunc Search Strategy
  • #9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design
  • #9780 - [M3c] Add More Measure Callbacks
  • #9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator
  • #9760 - [M3c] Random Feature Extractor

MicroTVM

  • #11741 - Refactor RVM scripts and fix DNS network issue
  • #11472 - [ARM]Add tests for arm schedules
  • #11634 - Update pyproject to python3.7
  • Zephyr support - #11650
  • RPC - #11227, #10967

Relay

  • #11825 - [Relay][Pass] Add split infer shape with convert op layout pass
  • #11674 - Finish implementations of WithFields
  • #11481 - IndexedGraph improvements in preparation for Collage
  • #11432 - Plumb external codegen target via Target.current()
  • #11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding
  • #11183 - Add unidirectional sequence lstm
  • #11442 - Add 'static_library' runtime::Module
  • #11413 - [Topi]Support for FP16 ERF on CPU.
  • #11382 - Finish support for list-of-targets
  • #11386 - [Tests] Replace the Relay interpreter with the VM in the op tests
  • #11224 - Support i16, f16 scalars in Relay text
  • #11337 - Fix eltwise alter op layout for broadcast axis
  • #11199 - Flexible shape dispatch transformation
  • #11173 - Support 'external codegen targets'.
  • #10996 - Add FlattenAtrousConv transformation
  • #10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)
  • #10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.
  • #10378 - [FQ2I] Add leaky relu to FQ2I
  • #10400 - RelayViz graphviz renderer
  • #10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtual_device_ field
  • #10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72
  • #10085 - RelayViz interface and terminal ast-dump
  • #10239 - Add a conversion of individual operations in FQ2I pass.
  • #10236 - [Refactor] Clean up type relations that are declared as template for no reason
  • #10156 - Fix broadcast InferCorrectLayout
  • #10026 - [VM] Relay VM memory liveness/lifetime analysis
  • #10089 - [Pass] Add a relay pass to extract fake quantized ops
  • #9690 - Change function constructors to WithFields
  • #10069 - [DefuseOps pass] bug fix: To support function body types other…
  • #9954 - Add conv2d_backward_weight op (without topi)
  • #9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis
  • #9816 - Add sliding_window operator
  • #9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9
  • #9735 - [AMP][Pass][Typing] Add faster type inference
  • #9723 - [Frontend] Add Span filling for frontends to Relay
  • #9749 - Fix invalid shape function for "copy" operator
  • #9759 - s/SEScope/VirtualDevice/g
  • #9734 - Support large constants saved/loaded outside of VM executable
  • #9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.
  • #9693 - PlanDevices supports 'free' on_device annotations
  • #9641 - [AST] Add virtual_device as a first class field in Relay
  • #9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.
  • #9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause
  • #9533 - WithFields for Tuples
  • #9550 - Prepare for switching VM to LowerTEPass.
  • #9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.
  • #9352 - [TVMC]Introduce executor and runtime parameters
  • #9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator
  • #9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.
  • QNN - #11228, #10718, #10086, #10053, #9637, #9982

Runtime

  • #11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.
  • #11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.
  • #11172 - Move WrapTimeEvaluator from RPC to profiling, NFC
  • #10990 - [PipelineExecutor]Add forwarding queue logic for set input.
  • #10953 - [Vulkan] Add RGP support to TVM for vulkan device
  • #10723 - [PipelineExecutor] Getting the asynchronous output
  • #10283 - AOTExecutor implementation and c target code-generator
  • #9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.
  • #10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.
  • #10326 - Improved log information with function signature
  • #10032 - [PackedFunc] Bring PackedFunc into TVM Object System
  • #10082 - [PipelineExecutor] Pipeline Executor Sequential execution
  • #10010 - [PipelineExecutor] Add Pipeline Executor Interface
  • #9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.
  • #9889 - [GraphExecutor] Add API get_input_info to graph_executor
  • #9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.

TE

  • #11589 - Support schedulable TIR compute definitions in TOPI
  • #11341 - Optimized version of concatenation layer
  • #10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder

TIR

  • #11592 - HoistExpression, generalization of HoistIfThenElse
  • #11870 - [Pass] Remove-Weight-Layout-Rewrite-Block
  • #11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization
  • #11585 - Add preserve-unit-iters
  • #11677 - Register CUDA WMMA tensor intrinsics
  • #11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async
  • #11624 - [Schedule] Allow named block and buffer arguments in Schedule
  • #11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock
  • #11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms
  • #11575 - Schedule Primitive: Add-Unit-Loop
  • #11515 - Add schedule primitive ReIndex
  • #11524 - [Arith] Additional Simplifications Inside Conditionals
  • #11485 - Add schedule primitive TransformBlockLayout
  • #11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
  • #11269 - [Schedule] Transform layout quality of life
  • #11355 - Support tensorization using ldmatrix + MMA
  • #11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker
  • #11317 - Support affine expressions as indices in reverse compute inline
  • #11235 - [Arith] Implemented padded inverses in IndexMap
  • #11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc
  • #11225 - Add schedule primitive SetAxisSeparator
  • #11110 - Get read/write access precisely for opaque access.
  • #11106 - Enhance software pipeline validation and fix predicate of epilogue
  • #10843 - StmtFunctor RenewDefs
  • #11075 - Add function to tile a block according to a given tensor intrinsic
  • #11050 - Utility function to decide loop mapping for auto tensorization
  • #11009 - [ROCM] DP4A intrinsic support for TE/TIR
  • #10925 - VNNI and ARM dot product intrinsic for tensorization
  • #10887 - [Schedule] Relax reorder primitive's affine binding check
  • #10732 - [Analysis] Add SuggestIndexMap for layout rewriting
  • #10538 - [Schedule] Transform layout
  • #10638 - Change the behavior of read/write region analysis for reduction blocks.
  • #10705 - Use local complete block and local reduction block to identify compact dataflow
  • #10671 - Tuple Reduction Support in CreatePrimFunc
  • #9727 - [TE]Implement layout transformations, non-flat memory buffers
  • #10405 - [TensorIR] Update VerifyGPU
  • #10401 - [TensorIR] Renormalize split pattern
  • #10112 - [TIR, Relay] improve bfloat16 support
  • #8509 - Tir constants integration into compilation pipeline
  • #9996 - add support for multi-blocking layout and their transformation
  • #10066 - Add software pipelining
  • #10207 - Support sub warp reduction for CUDA target.
  • #9482 - Implementation of Common Subexpression Elimination for TIR
  • #9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern
  • #10158 - [Schedule] Update compact_dataflow constraint
  • #9871 - [Schedule] Blockize and Tensorize
  • #10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate
  • #9880 - Encode conditional accesses info into block read/write regions
  • #9699 - Affine utility support iter lowerbound and diagnostics
  • #9742 - [Schedule] Add Annotate/Unannotate primitive
  • #9738 - [TensorIR] Primitive "SetScope"
  • #9743 - [Schedule] Analysis functions to check if compute_inline and com…
  • #9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs
  • #9559 - [TensorIR][UX] Type annotation-based runtime type checking
  • #9444 - Add a 'rolling_buffer' scheduling primitive
  • #9360 - [TensorIR] Cross-Thread Reduction

TOPI

  • #11531 - TE implementation of LSTM using scan
  • #11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures
  • #10332 - VNNI support for batch matmul
  • #9873 - Add support for grouped conv3d
  • #10230 - VNNI support for int8 dense
  • #10098 - [Op]5 ops can accept unsigned integers as indices
  • #9832 - Support grouped conv1d
  • #9694 - Add generic batch norm
  • #9233 - Cortex-M DSP support

TVMScript

  • #11308 - Represent ramp as index slice
  • #10099 - Support T.buffer_decl using data pointer from Let/Allocate
  • #9680 - Improve printer for TIR syntax sugar
  • #9492 - Add syntax sugar for T.handle and T.match_buffer (illustrated in the sketch after this list)
  • #9620 - Add for loop syntax sugar
  • #9543 - Misc error message improvements
  • #9505 - [Fix] Add type hints for more uncovered cases
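
As a hedged illustration of the signature sugar referenced above, the sketch below annotates buffer arguments directly in the function signature instead of taking a T.handle and calling T.match_buffer. The bracketed T.Buffer[...] spelling reflects the syntax of this era and is an assumption; later releases spell the annotation differently.

```python
from tvm.script import tir as T


@T.prim_func
def add_one(A: T.Buffer[(128,), "float32"], B: T.Buffer[(128,), "float32"]) -> None:
    # Equivalent to declaring a: T.handle and A = T.match_buffer(a, (128,), "float32").
    for i in T.serial(128):
        with T.block("B"):
            vi = T.axis.spatial(128, i)
            B[vi] = A[vi] + 1.0
```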

USMP

  • #11015 - U3 use case
  • #10189 - Adding support for U1 usecase for constant pools
  • #10785 - Adding support for U4 usecase
  • #10193 - adding support for U2 and U3 usecases
  • #10005 - Add performance characteristics to PoolInfo
  • #9565 - [TIR]Integrating USMP to AoT Executor
  • #9704 - Hill Climb allocator
  • #9418 - [TIR]adding the pass to convert to pool offsets
  • #9649 - [TIR]Augmenting the algo interface with memory pressure
  • #9214 - [TIR]Greedy memory planning algorithm
  • #8468 - [TIR]Added buffer info extraction pass

microNPU

  • #11468 - Optimize separate padding operation for conv2d
  • #11453 - Add transform matrices and part matcher to identity op
  • #11410 - Add E2E tests with cascader without striping
  • #11288 - Expose compute cycle annotations to TIR lowering
  • #10959 - Add a pass to reorder copy and compute nodes
  • #10509 - Add various options to the cascader
  • #11263 - Adding an option to enable striping
  • #10251 - Add support for conv2d running on two cores on U65
  • #10862 - Integrate the cascader
  • #10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U
  • #10824 - Some housekeeping in the test_ethosu folder
  • #10763 - Tweak a layout transform matrix
  • #10725 - Add a pass to move allocate nodes to the outer scope
  • #10695 - Determine block configs using the cascader
  • #10599 - Refactor Relay to TIR hook
  • #10508 - Improve cascader memory transfer estimates
  • #10345 - Add support for TFLite FULLY_CONNECTED
  • #10254 - Introduce a pass to remove redundant identity operations
  • #10062 - [5] Convert Proposals to te.Schedules
  • #9959 - [4] Add the cascader Proposal generator
  • #10022 - enable USMP
  • #10127 - Add support for LeakyReLU
  • #10004 - Add FreeRTOS variant of NPU demo
  • #10060 - Refactor type inference data type checks
  • #9960 - Add support for pack and unpack
  • #10143 - Fix layout assignment in layout optimizer pass
  • #9890 - [3] Plan generation for the cascader
  • #9855 - Add support for transpose convolution
  • #9841 - Add support for nearest neighbor and bilinear upsampling
  • #9951 - Removing constant args from PrimFunc
  • #9929 - Refactor base address determination to codegen
  • #9910 - Add support for requantize
  • #9831 - Move optimization passes to be a module pass and ensure they are running
  • #9785 - [2d] Add more Part matchers to cascader
  • #9778 - [2c] Add performance modelling to cascader
  • #9471 - [2b] Create CascaderGraphs from TE graphs
  • #9469 - [2a] Add CascaderGraph for cascading analysis
  • #9621 - Add support for SPLIT and SPLIT_V
  • #9508 - Update Conv2D Tests to Use TF API to Gen Test Cases
  • #9627 - Add support for SIGMOID
  • #9589 - Add support for TFLite concatenate
  • #9623 - Refactor codegen tests
  • #9561 - Add NHWC -> NHCWB16 layout transformation pass
  • #9576 - Mean legalization support
  • #9597 - Move the compilation to use Target Hooks.
  • #9458 - [1] Add affine analysis structures for the cascader
  • #9547 - Add the infrastructure for lookup table and TANH
  • #9521 - Support binary elementwise with non-4D inputs
  • #9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16
  • #9530 - Add unary elementwise operator infrastructure with ABS
  • #9514 - Adding rounding mode attribute to operators
  • #9515 - Allow constants to be given as input to an operator

microTVM

  • #11250 - [ARM] Add Relay tests for conv2d registered schedules
  • #11232 - [rpc] Implemented rpc logging
  • #11044 - Add support for host-driven AoT Executor
  • #11043 - Better version handling for Arduino
  • #10555 - Enable micro tvmc tutorial testing in CI
  • #10194 - [RVM] Add scripts for automated build and testing
  • #10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN
  • #10024 - [tvmc] Add TVMC Micro tutorial for Zephyr
  • #9684 - Fix zephyr/test_zephyr_armv7m test
  • #9584 - [TVMC] Add TVMC test for Arduino and Zephyr
  • #9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon
  • Zephyr support - #11362, #10138

Misc

  • #11465 - Add cooldown interval logic for the profiling functionality
  • #11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h
  • #11646 - [Arith] Simplification of ceil, log2, and left_shift
  • #11464 - [MLF] Add support for multiple modules in Model Library Format
  • #11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext
  • #11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno
  • #11287 - [Arith] Merge surjective/non-surjective iter mapping detections
  • #11393 - Add utility to replace direct call to pytest.main
  • #11252 - [ROOFLINE] Roofline analysis over RPC
  • #11000 - [Graph Debugger] Expose way to benchmark individual nodes.
  • #10794 - bump PyTorch version to 1.11
  • #10821 - [REFACTOR] Remove legacy nnvm folder
  • #10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap
  • #10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access
  • #10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc
  • #7401 - RFC: initial stab at TorchScript fallback
  • #10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.
  • #10293 - [VirtualMachine] new method allowing to set one input tensor by its index or name
  • #10191 - Generate correct output tensor names in C Interface API
  • #9276 - Parameterize test_link_params
  • #9808 - [Rust] Update Rust bindings
  • #9553 - [PROFILING] Add ability to profile a single function_profiling
  • #9611 - [CMAKE] Automatically detect newly added source files
  • #9544 - [Target] enable -arch=sm_xx for assigning cuda target arch and deprecate autotvm.measure.set_cuda_target_arch api
  • Profiler - #11530, #11066
  • Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580
  • Android - #11241
  • ETHOSN - #11261, #10486, #10018, #9596
  • TVMC - #11012, #10962, #10722, #9817, #9529, #9229

v0.8.0

2 years ago

Overview

Apache TVM v0.8 brings several major exciting experimental features, including:

  • PaddlePaddle frontend
  • TVMScript: round-trippable Python-based syntax for TIR (see the sketch after this list)
  • TorchScript integration
  • TensorIR scheduling language
  • TensorRT and CUTLASS integration via BYOC
  • Int4 TensorCore support in AutoTVM
  • MicroTVM Project API and Zephyr, Arduino support
  • AOT executor
  • Robust Windows support
  • Affine analysis infra: iter-affine-map
  • Improved Vulkan backend
  • CUDA graph support in TVM runtime
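
As a small illustration of the round-trippable TVMScript syntax highlighted above, here is a minimal sketch (not code from the release notes). The `from tvm.script import tir as T` namespace follows the RFC-0036 convention listed below and may not match the exact spelling in every 0.8 build.

```python
import tvm
from tvm.script import tir as T


@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(a: T.handle, b: T.handle) -> None:
        A = T.match_buffer(a, (16, 16), "float32")
        B = T.match_buffer(b, (16, 16), "float32")
        for i, j in T.grid(16, 16):
            with T.block("scale"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0


# Printing the module produces TVMScript text that parses back to the same IRModule.
print(Module.script())
```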

In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

  • Relay compilation engine
  • Relay pattern language
  • CI and build process
  • Refactoring documentation and tutorials
  • Stabilizing AutoScheduler
  • Stabilizing TVMC command line driver interface
  • Stabilizing target system
  • Frontend coverage, quantization, dynamic shape, training

Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

Accepted RFCs

The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted by the community since then:

  • [RFC-0005] Meta schedule (AutoTIR)
  • [RFC-0006] Automatic mixed-precision pass and support
  • [RFC-0007] Parametrized unit tests
  • [RFC-0008] MicroTVM Project API
  • [RFC-0009] Unified static memory planner
  • [RFC-0010] Target-registered compiler flow customisation
  • [RFC-0011] Arm® Ethos-U integration
  • [RFC-0014] Pipeline executor
  • [RFC-0015] Use CMSIS-NN with TVM
  • [RFC-0019] Add PaddlePaddle frontend
  • [RFC-0020] Extend metadata in project option
  • [RFC-0022] TIR non-scalar constants
  • [RFC-0023] Adding annotation field to tir.allocate nodes
  • [RFC-0025] PyTorchTVM
  • [RFC-0027] Formalize TVM documentation organization
  • [RFC-0028] Command line composition from internal registry
  • [RFC-0029] Migrating target attributes to IRModule
  • [RFC-0030] Command line configuration files
  • [RFC-0031] C Device API
  • [RFC-0036] TVMScript namespace
  • [RFC-0041] Update TVMScript block syntax

Features and Improvements

TE, TIR, TVMScript

AutoTVM, AutoScheduler, Meta Schedule

Operator Coverage

Training

Relay

MicroTVM, AOT, Graph Executor and VM

Arithmetic Analysis

  • Tighter bounds and more simplification on cast #6771 #7045
  • Introducing iterator (quasi-) affine map detection #6667 #7752 #7759
  • Inverse of iterator affine map #8384 #8427
  • Subspace division in iterator affine map #7760

Frontends

Codegen Backends and Runtime

BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI

TVMC

Rust Binding

Misc

  • Enhanced CPP-RPC implementation: allow a user-supplied work dir, support the CPP-RPC server on Apple platforms, support adb-shell-style CPP-RPC #7670 #8224 #8223 #7766 #7013
  • Use PopenWorker to handle RPC system: #7889 #7757 #7961
  • Fold target host into target #7462 #7791 #7534 #8835 (see the sketch after this list)
  • Target-based intrinsic lowering and legalization #7936 #7809
  • Add target tags for all existing CUDA GPU models #7410
  • Linear Congruential Random Engine #8642
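
A small sketch of the "fold target host into target" change referenced above: the host compiler is now carried on the device target itself rather than passed as a separate target_host argument. This is a minimal illustration, not code from the release notes.

```python
import tvm

# The host target is attached to, and retrieved from, the device target.
target = tvm.target.Target("cuda", host="llvm")
print(target.host)
```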

v0.7.0

3 years ago

v0.6.1

3 years ago

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Apache TVM (incubating) 0.6.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache TVM (incubating) 0.6.0 are advised to upgrade. Please review the following release notes for details on the bug fixes.

Bug Fixes

  • Fixed process termination routine in windows #4844
  • [Runtime] Fix NDArray SaveDLTensor declaration and implementation signature different #4586
  • [NODE][Serialization]fix serialization precision loss in float #4503
  • [Relay][Frontend][TF] fix _parse_param bug #4711
  • Fix bias_add gradient #4516
  • Make sure to visit the arguments of inlined functions #4783
  • Fix Python syntax error in start_rpc_server_to_tracker.py #4682
  • [Bugfix] Fixed crash caused by reversing bitwise operations #4852
  • [Fix][VM] Fix copy constructor #5237
  • fix small bug about dense_grad #5695
  • [Fix] Fix conv2d alter op for arm cpu #5532
  • [Fix] Fix dense x86 schedule #4728
  • [Relay][Fix] Fix alter op layout when calling a global var #4454
  • [Relay][Pass] Fix lambda lift pass for recursive call #4432
  • [BUGFIX] Fix search path for libtvm_topi.so #4467
  • [Bugfix] Fix Python debugger segfaults with TVM built with LLVM #5685
  • [RUNTIME] Fix compile errors of OpenCL FPGA backend #4492
  • [BUGFIX][BACKPORT-0.6][ARITH] Fix FloorMod Simplifier #5509
  • Some Windows and MSVC fixes #4569
  • [Chisel][VTA] Fix multiple transfer issue in LoadUop module #4442
  • [VTA] Fix an issue in updating uop_idx in the TensorGemm module #4694
  • [VTA] Fixed a crash issue in TSIM driver #4527
  • [VTA] Enable streamlined GEMM execution #4392
  • [VTA][Chisel] End-to-end Inference with Chisel VTA #4574
  • Added declare of aluBits for TensorAlu #4624
  • [Quantization] Fix annotation for multiply op #4458
  • LRN only supports 4D tensors, remove it from alter_op_layout #5520
  • fix topi.nn.global_pool layout="NHWC" #4656
  • [FFI][Windows] Fix hasattr by extracting Python error type from Windows error message #4780
  • [Runtime] Export GraphRuntime in tvm_runtime.dll #5002
  • Fix Base64OutStream portability issue #4668
  • [AUTOTVM] Fix a bug in generating the search space #4779
  • [Relay][VM] Fix compilation of If-Elses #5040
  • [RELAY][FRONTEND][TENSORFLOW] Fix FuseBatchNorm output cast error if need_cast is True #4894
  • [Bugfix] fskip of EliminateCommonSubexpr cannot always return false #4620
  • [Fix] Add ConstantNode to IsAtomic #5457
  • [Fix] Fix RemoveUnusedFunctions pass #4700
  • [Relay][Fix] Fix alpha_equal bug for attribute check #4897
  • [Arith] keep div_mode during floordiv simplify #5922
  • [ARITH][BACKPORT-0.6] fix a min/max simplify bug #5761
  • [0.6-BACKPORT] Improve robustness of the docs build #5583

v0.6.0

4 years ago

0.6.0.rc0

4 years ago

v0.5

5 years ago

NOTE: This is a release prior to Apache incubation

This release features several major improvements. Some of the highlights are: an arbitrary-bits quantization algorithm and a high-level, auto-differentiable programming IR, Relay (NNVMv2).

The community welcomes new reviewers @nishi-t @were @siju-samuel @jroesch @xqdan @zhiics @grwlf @ajtulloch @vinx13 @junrushao1994 @FrozenGene @liangfu , new committers @srkreddy1238 @eqy @masahi @nhynes @phisiart @merrymercy @Laurawly @adityaatluri @Huyuwei

Change List

  • Fully featured 8-bit network support
    • 8bit quantizer
    • Arbitrary bits quantization algorithm
    • Intel CPU support
  • NVIDIA GPU 8-bit kernel
    • int8 gemm recipe
    • int8 conv2d
    • Autotvm integration
  • Automated tuning and scheduling
    • AutoTVM optimizations for mobile GPUs
    • AutoTVM optimizations for CUDA
    • AutoTVM optimizations for x86
  • Initial release of the differentiable programming IR, Relay
    • Generic & informative Relay error reporting #2408
    • Relay IR text format support #1781
    • Support control flows
    • A-Normal Form Canonicalization #2251
    • Type system support
    • End to end compilation
      • Frontend support: Caffe2 #2507 , CoreML #2476 , Keras #2376 , MXNet #2163 , ONNX, TFLite #2365
      • Operator coverage #1799 #2051
    • FoldScaleAxis #2020
    • SimplifyInference #2033
    • CombineParallelConv2D #2089
    • InstrumentBoundCheckers pass #2079
    • Bind & FoldConstant #2100
    • Alter Op Layout #2150
    • General OpFusion #2090
  • CodeGen
    • Gcc / g++ compatible C code generator for TVM #2161
    • Device type annotation for heterogeneous compilation #2361
    • Cache packed func ptr, lift alloca #2070
    • Generalize compute to tensor region #1476
  • Runtime
    • Relay interpreter and compiler #1954
    • Heterogeneous runtime #1695
    • Language bindings: Golang runtime #1470 , Rust runtime #1597
    • Add min_repeat_ms to time_evaluator #2200
    • Bundled interpreter demonstration #2297
    • Enable PlanMemory in the graph runtime #2120
  • Language Binding
    • Rust frontend #2292
  • VTA
    • Improved RPC for VTA #2043
  • Hybrid python programming model
  • TOP
    • Initial support for sparse tensor computation
    • Improve ARM CPU depthwise convolution performance #2345
    • Port winograd ops to relay #2356
  • Tutorials and docs
    • Relay language docs #2232
    • Tutorials on how to use SGX backend
    • How to write a pass in python
    • General lowering flow of TVM
    • How to do tensorize
    • TFLite frontend tutorial #2508
    • Keras seq2seq model for translation tutorial #1815
    • Committer guide and tips #2468
    • Code review guideline on API designs #2459

Contributors

Code reviewers

  • @tqchen
  • @liangfu quantization, relay, topi, frontend
  • @zhiics relay, runtime, frontend
  • @nhynes quantization, rust
  • @Huyuwei frontend
  • @yzhliu relay, frontend, perf
  • @xqdan hybrid script, tvm/lang
  • @ZihengJiang relay
  • @vinx13 relay/pass, topi
  • @masahi relay/pass, frontend, doc, topi
  • @grwlf frontend, topi, relay, quantization
  • @tmoreau89 vta, relay, backend, runtime
  • @kazum frontend
  • @nishi-t frontend, topi
  • @PariksheetPinjari909 frontend
  • @jroesch relay, frontend, doc
  • @srkreddy1238 relay/op, frontend
  • @siju-samuel relay/op, frontend
  • @junrushao1994 relay
  • @icemelon9 relay, perf, tvm/lang, codegen
  • @ajtulloch relay, frontend
  • @alex-weaver relay
  • @kevinthesun hybrid script, topi, relay
  • @Laurawly topi
  • @were hybrid script, topi
  • @FrozenGene frontend, topi, relay/pass
  • @eqy relay, topi, runtime, rust
  • @zhreshold frontend, relay/op
  • @merrymercy relay/op, topi, runtime, frontend
  • @derisavi-huawei symbolic integers

Code contributions

  • @tqchen tvm
  • @vinx13 relay/pass, topi
  • @siju-samuel topi, relay/op
  • @merrymercy autotvm, topi, relay/pass
  • @srkreddy1238 relay/op, frontend/tf
  • @MarisaKirisame relay
  • @slyubomirsky relay, docs
  • @jroesch relay
  • @nhynes rust
  • @wweic docs, relay/pass
  • @yzhliu perf, frontend
  • @zhiics relay/pass, relay/op, runtime
  • @were hybrid script
  • @icemelon9 perf, relay/pass, relay/op
  • @joshpoll relay, docs
  • @sgrechanik-h codegen
  • @kazum frontend/keras, topi
  • @masahi relay/op, docs
  • @FrozenGene perf, frontend/tf
  • @liangdzou docs
  • @junrushao1994 relay/op
  • @eqy autotvm, runtime
  • @apivovarov docs
  • @ajtulloch runtime, nnpack
  • @kevinthesun relay/op, perf
  • @ZihengJiang relay/pass, quantization
  • @hlu1 nnpack, frontend/caffe2
  • @lixiaoquan nnvm
  • @imorinaga frontend/mxnet
  • @liangfu topi, docs
  • @xqdan codegen
  • @PariksheetPinjari909 frontend/darknet
  • @alexeyr frontend/tensorflow
  • @Rasterer topi
  • @yangchen-MS codegen
  • @anijain2305 relay/op
  • @grwlf topi
  • @Huyuwei topi, frontend/keras
  • @denis0x0D runtime/trace, relay/pass
  • @Mutinifni codegen
  • @derisavi relay/pass
  • @tmoreau89 vta
  • @Laurawly topi, perf
  • @zhreshold frontend, topi
  • @kun-zh codegen
  • @reminisce relay/op
  • @ehsanmok rust
  • @cnuernber perf
  • @cowanmeg topi, codegen
  • @yuruofeifei topi

v0.4

5 years ago

NOTE: This is a release prior to Apache incubation

This release features several major improvements. The high-level graph optimizer is now part of the TVM repo. Some of the highlights are: initial support of AutoTVM for automated optimization and the customized accelerator backend VTA. Please also check out tvm.ai for the latest blog posts.

The community welcomes new reviewers @kazum @alex-weaver @masahi @zhreshold @PariksheetPinjari909 @srkreddy1238 @eqy, new code owner @merrymercy, and new committer @yzhliu

Change List

Tensor Expression and Optimization

  • Tensor operator primitives
    • Introduce an attrs field to operator primitives (e.g. compute) to store additional metadata; the attrs can be used as hints for scheduling
  • Enable embedding of asm micro-kernels
  • Hybrid python programming model
    • python AST based IR builder interface
    • support GPU programs
  • AutoTVM, Automated tuning, and scheduling
    • basic autotvm infra
    • GPU IR verifier
    • basic autotuning tutorial
    • topi integration
  • ARM support
    • winograd support
    • initial support of ARM autotuning records
  • TOPI Vision
    • Generic GPU sort support(useful for vision)
    • SSD operator support
  • TOPI numpy consistency
    • Rename all binary operators for numpy consistency: broadcast_add -> add, broadcast_sub -> subtract, broadcast_mul -> multiply, broadcast_div -> divide
    • New operators: slice, LRN, equal, not_equal, less, greater
    • tutorials on topi
  • Initial low-bit operator support
    • Optimized popcount generation on ARM
    • general bit-serial convolution and GEMM
    • optimized low bit kernels
    • parallel optimization
  • New topi backend optimization for intel graphics
  • Adapt AVX schedules for SSE target

Backend

  • VTA: customized accelerator backend
    • custom hardware backend example
    • tutorials on how to use customized accelerator
  • Initial experimental support for HLS backend
  • Bugfix in SPIRV code generator for vulkan
  • libdevice support, enable NVPTX backend

Runtime

  • Introduce NDArrayContainer for managed NDarray
  • RPC and Device API
    • Support communication between big/small endian machines.
    • RPC and device API protocol upgrade to support big/little-endian communication. This is a non-backward-compatible change; the latest version of the TVM runtime must be used with the RPC.
    • Graduated RPC from contrib: tvm.contrib.rpc -> tvm.rpc
    • Support tracker in Android RPC and add fault tolerance for AutoTVM (see the sketch after this list)
  • BIG.LITTLE aware threadpool
  • tvm4j graph runtime that runs end to end workload in java
  • DLPack support
    • Support from_dlpack and to_dlpack
    • Enables bridges to PyTorch
  • Enable link of stackvm in runtime
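
A minimal, hedged sketch of the tracker-based RPC flow referenced above. It uses the modern tvm.rpc module path (the pre-graduation path was tvm.contrib.rpc), and the tracker address and device key are placeholders.

```python
from tvm import rpc

# Connect to a tracker and request a remote session from its pool of devices.
tracker = rpc.connect_tracker("0.0.0.0", 9190)
remote = tracker.request("android", priority=1, session_timeout=60)
dev = remote.cpu(0)  # device handle usable with modules uploaded to the remote
```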

NNVM

  • Tensorflow graphdef frontend
  • Keras frontend
    • Improved to support reused layers and additional activations
  • ONNX
    • gather, LRN
  • CoreML frontend
    • Support C-RNN and activation functions
  • Fix grads for sum and expand_like
  • Enhanced operator fusion for multiple elemwise branches
  • Separate nnvm fusion and compilation pass

Misc

  • Unified the build system to CMake, with customizable paths for Vulkan, ROCm, and CUDA

Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code reviewers

  • @yzhliu topi, tvm4j, nnvm
  • @kevinthesun nnvm
  • @Huyuwei topi operators
  • @tmoreau89 hardware backends
  • @comaniac fpga backends
  • @kazum nnvm, opencl backend, fpga
  • @nishi-t nnvm, opencl backend
  • @merrymercy topi, arm,
  • @vinx13 gpu backend
  • @masahi nnvm, topi
  • @eqy autotvm
  • @jroesch runtime
  • @PariksheetPinjari909 frontends, topi
  • @srkreddy1238 frontends, topi
  • @FrozenGene autotvm

Compiler

  • @alex-weaver vulkan
  • @were hybrid script mode
  • @nishi-t CUDA, fp16, int8 support
  • @ktabata intel FPGA support
  • @kazum xilinx fpga support
  • @cowanmeg arm optimized popcount
  • @tmoreau89 VTA customized accelerator

TOPI, graph optimization

  • @merrymercy AutoTVM
  • @yzhliu tvm4j graph runtime, x86
  • @Laurawly intel graphics
  • @abergeron conda build fix
  • @nhynes sgx random
  • @masahi topi, more robust op fusion
  • @kevinthesun vision ops
  • @grwlf argmax/min ops
  • @cowanmeg bit-serial operator
  • @ehsanmok topi tutorial
  • @zhiics refactor fusion and compilation into separate pass
  • @liangfu binary logical operators

Frontends

  • @srkreddy1238 tutorials for deployment, tensorflow frontend
  • @siju-samuel coreml, tf frontend
  • @PariksheetPinjari909 nnvm, slice
  • @kazum keras
  • @nishi-t mxnet, nnvm

Deploy

  • @eqy rpc, thread runtime
  • @dayanandasiet android tutorials

v0.3

5 years ago

NOTE: This is a release prior to Apache incubation

This release features numerous improvements in TOPI and the backends. We take the first step toward object detection support in TOPI, featuring operators necessary for YOLO and SSD. TOPI now supports a numpy-style API and operator overloading. RPC is significantly improved to support resource allocation and using a pool of devices. We are adding two new backends: WebGL for running GPUs in the browser, and Vulkan for running on the next-generation graphics API. Please also check out the TVM blog for the latest posts.

Change List

  • TOPI Vision operators
    • SSD support
    • YOLO support
    • NMS operator support in vision
  • TOPI general numpy-style operators
    • numpy style operator overload in topi
    • more operators: flip, take
    • dilation support on conv2d and depthwise
  • 8bit support
    • ARM 8bit gemm
    • ARM 8bit conv
  • Low bit operator support
    • popcount intrinsics
    • 1-bit fully connected
  • Contrib: MPSDNN fully-connected and conv2d support
  • Better RPC support
    • RPC Tracker support to allow centralized resource management
    • RPC protocol upgrade (this is a non-backward compatible change) to support timeout in the proxy
      • This is a breaking change; the latest version of the TVM runtime must be used with the RPC
    • Fault tolerance to early server termination, with the correct exception propagated
    • RPC support enabled for ROCm AMDGPUs
  • Tutorials and docs
    • How to deploy to android devices.
  • Optimizations for hardware backends
    • intel CPU (AVX and AVX512)
  • Schedule Primitives
    • rfactor now supports factor_axis to specify the factored dimension in the result (see the sketch after this list)
    • cache_write now supports multiple output operators
    • Enable warp memory, which generates shuffle instructions
  • Framework bridge
    • MXNet bridge supported
  • C++ compiler API support
    • build migration
    • topi migration to c++
    • Target system in c++
  • WebGL backend
    • runtime and codegen
    • topi integration
    • end to end pipeline on the browser
  • Vulkan backend
    • vulkan runtime
    • spirv code generator
  • Security
    • intel SGX runtime support
    • multi-threaded SGX runtime
  • LLVM 7.0 support
  • Robustness
    • VerifyMemory to catch incorrect GPU schedules that write into GPU memory from the CPU
    • Verify compute formulas
  • Better CPU parallel runtime
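
A minimal sketch of the rfactor factor_axis argument referenced in the schedule-primitive notes above (not code from the release notes); it uses the modern tvm.te module paths rather than the 0.3-era ones.

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n, 16), name="A")
k = te.reduce_axis((0, 16), name="k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(k, factor=4)
# factor_axis=1 places the factored reduction dimension last in the rfactor result.
BF = s.rfactor(B, ki, factor_axis=1)
print(tvm.lower(s, [A, B], simple_mode=True))
```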

Main Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code Reviewers

  • @zhreshold for reviewing many vision ops
  • @Huyuwei topi operators
  • @sxjscience for reviewing topi operators

TOPI:

  • @merrymercy Mali GPU support
  • @PariksheetPinjari909 topi vision ops, support for darknet operators
  • @yzhliu intel CPU optimization
  • @kevinthesun Vision operators, initial ssd, nms operator support
  • @dingobye Various great TOPI improvements for operator overloading
  • @Huyuwei dilation support to conv
  • @masahi Intel CPU topi
  • @nishi-t improvements in pooling

Compiler:

  • @nhynes SGX support
  • @phisiart WebGL backend
  • @alex-weaver C++ compiler support
  • @kun-zh bug fixes for bound checking in code
  • @xqdan improvements to low-level schedule rewriting
  • @yidawang parallel runtime improvement
  • @eqy AMD GPU backend improvements
  • @Laurawly Initial improvements for Intel GPU
  • @cnuernber Improved runtime device stream API