Apache TVM Versions

Open deep learning compiler stack for CPU, GPU, and specialized accelerators

v0.10.0

1 year ago

Introduction

The TVM community has worked since the v0.9 release to deliver the following exciting new improvements!

  • Metaschedule
    • Software pipelining and padding for irregular shapes for auto tensorization
    • Stabilized and polished user interfaces (e.g. database changes, tune_relay); see the sketch below
    • A new MLP-based cost model
  • TIR
    • New schedule primitive for PadEinsum
    • A new TIR node: DeclBuffer
    • INT8 Intrinsics for TensorCores for CUDA!
  • microTVM
    • Improved schedule primitives for the Armv8-M ISA

And many other general improvements to code quality, TVMScript, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.
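
As a concrete illustration of the stabilized user interface mentioned above, the sketch below tunes a single TIR matmul with MetaSchedule and then queries the resulting database for the best schedule. This is a minimal sketch, not code from the release notes: the `ms.tir_integration.tune_tir` / `compile_tir` entry points, their keyword arguments, and the target string are assumptions about the post-refactor v0.10 API and may be spelled differently in other versions.

```python
import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T


@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("matmul"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]


target = tvm.target.Target("llvm -num-cores 4")
# Tuning populates a database of measured candidate schedules (assumed entry point).
database = ms.tir_integration.tune_tir(
    mod=matmul,
    target=target,
    work_dir="./ms_work_dir",
    max_trials_global=64,
)
# The database is then queried to build the best schedule found for this workload.
sch = ms.tir_integration.compile_tir(database, matmul, target)
print(sch.mod.script())
```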

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.9.0...v0.10.0rc0.

Note that this list does not cover all PRs and discussions since v0.9. A non-truncated summary can be found here: https://github.com/apache/tvm/issues/12979

TIR

  • #12720 - [TIR] Implement API for padded layout transformations
  • #12797 - [TIR] Construct the inverse in SuggestIndexMap
  • #12827 - [TIR] Support pattern matching argmax/argmin generated by TOPI
  • #12750 - [TIR, Schedule] Add schedule primitive PadEinsum
  • #11639 - [TIR][Meta-Schedule] Tuple-reduction scheduling support
  • #12515 - [TIR][Arith] Add more strict checking in imm construction and folding.
  • #12717 - [TIR, Schedule] Check consumer in-bound and covered in reverse_compute_inline
  • #12652 - [TIR] Handle axis_separators during FlattenBuffer
  • #12623 - [TIR] Expose MMA-related PTX builtins
  • #12607 - [TIR][Schedule] enhance compute_at and reverse_compute_at primitive to choose possible position ...

v0.9.0

1 year ago

Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

  • MetaSchedule's full implementation
  • Cascading scheduler for Arm(R) Ethos(TM)-U NPUs
  • Collage which brings tuning to BYOC
  • Several microTVM improvements
  • New tvm.relay.build parameters: runtime= and executor=
  • AOT - Support for the C++ runtime (with the llvm and c targets only) and for host-driven AOT in the C runtime
  • Hexagon RPC support
    • Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
    • AOT and USMP support
    • Threading
    • Initial op support
  • MLF - Support for multiple modules in a single MLF artifact
  • Several TIR schedule primitives and transforms including (abridged):
    • schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap (a usage sketch follows this list).
    • schedule.transform_block_layout - Applies a layout transformation to a block as specified by an IndexMap.
    • schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
    • transform.InjectSoftwarePipeline - Transforms an annotated loop nest into a pipeline prologue, body, and epilogue in which producers and consumers are overlapped.
    • transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
    • transform.InjectPTXAsyncCopy - Rewrites global-to-shared memory copies in CUDA as async copies when annotated with tir::attr::async_scope.
    • transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
  • And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
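
Below is a minimal sketch of the schedule.transform_layout primitive described in the list above (not code from the release notes). The tuple-style buffer selector ("write", 0) and the lambda-based IndexMap are assumptions about the v0.9-era schedule API and may differ slightly between versions.

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def copy(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (64, 64), "float32")
    B = T.match_buffer(b, (64, 64), "float32")
    for i, j in T.grid(64, 64):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]


sch = tvm.tir.Schedule(copy)
block = sch.get_block("copy")
# Rewrite the layout of the block's first written buffer into 16x16 tiles.
sch.transform_layout(block, ("write", 0), lambda i, j: (i // 16, j // 16, i % 16, j % 16))
print(sch.mod.script())
```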

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list does not cover all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.8.0...v0.9.0.rc0.

AOT

  • #11208 - Calculate used memory at the callsite of primitive functions
  • #11365 - Fix function number datatype from char to uint16_t
  • #11091 - Enable A-Normal Form in the AOT executor
  • #10753 - Support LLVM backend with C++ runtime
  • #10518 - Use python temporary directory for AOT tests
  • #10337 - BugFix of workspace calculation
  • #10282 - [runtime] Add Metadata classes for AOTExecutor
  • #9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
  • #9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
  • #9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

  • #11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
  • #11144 - Remove support for run-time linked-params from codegen
  • #10590 - Add order to functions in C Codegen
  • #11638 - [DNNL][CBLAS] Unifies all MKLDNN/DNNL to DNNL
  • #11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
  • DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
  • TensorRT - #11923, #11203, #10759, #10772, #10388
  • CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
  • OpenCLML - #10243
  • CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #9595, #9571
  • CUDNN - #10997, #9986, #9948
  • ACL - #10801
  • PTX - #10855, #10339, #9909
  • CUBLAS - #10826, #10820

CI

  • #11313 - Refactor of tvm.testing.requires_* annotations
  • #11666 - Enable pylint for tests/python/ci
  • #11657 - Apply linting rules to AOT tests
  • #11380 - Restructure Jenkinsfile
  • Automation - #11813, #11775, #11480, #11437, #10833, #10056, #9973, #9934
  • User experience improvements - #11470, #11329, #11553, #11497, #11051, #10933, #10960, #10525, #10425, #10322, #10121, #9971, #9554, #9752, #9556
  • Reduce CI runtime - #11402, #11349, #11258, #11132, #10946, #10743, #10359
  • Code cleanups - #10968, #10740

Frontends

  • PaddlePaddle - #11537, #9724, #9564
  • TFLite - #10915, #10566
  • Oneflow - #11321, #11036, #8790
  • PyTorch - #11190, #10504, #10184, #10091
  • ONNX - #10949, #9438, #9186, #9493, #9475
  • Keras - #7006

Hexagon

  • #11549 - Initial clip operator for Hexagon
  • #11834 - Add op resize2d for hexagon
  • #11559 - Softmax slice op initial version
  • #11529 - Slice ops added - add, subtract, multiply
  • #11720 - [testing] add max_pool2d benchmark
  • #11417 - Implement avg_pool2d slice op
  • #11653 - Add HexagonThreadManager
  • #11547 - Run single RPC server on Android in each testing session
  • #11490 - [testing] add TVMScript elemwise-add
  • #11400 - [testing] refactor benchmark-table code
  • #11277 - moves conftest.py to tvm.contrib.hexagon so outside repos can access the testing fixtures
  • #11319 - Add unit tests for Hexagon Device API
  • #11279 - Add USMP tests
  • #11283 - Update Readme
  • #11239 - capture gtest output and return over FFI
  • #11175 - Add schedule and test for conv2d_transpose_nchw
  • #11018 - [Runtime] Add QuRT thread pool backend
  • #11145 - Add support for on-device unit testing using gtest
  • #11138 - Add test for depthwise conv2d schedule
  • #11016 - Add test for registered schedules
  • #11104 - Add mobilenet test
  • #11090 - Delete offload runtime, move files to right places
  • #11065 - AoT with LLVM Codegen on Hexagon
  • #11025 - Deprecate USE_HEXAGON_DEVICE, introduce USE_HEXAGON
  • #10604 - HVX scheduling and bench-marking of TE element-wise add
  • #10905 - [LLVM] Enable/test tensorized Hexagon DMA on 2d transformed layout
  • #10907 - Move aot/graph_executor interactions into launcher
  • #10919 - Register basic strategies and schedules for common operators
  • #10904 - Add unit tests executing 2-d VTCM usage
  • #10910 - Refactor to keep HexagonBuffer private to the device api
  • #10908 - [LLVM][CodeGen] Make CodeGenHexagon a subclass of CodeGenCPU
  • #10878 - Generalized HexagonBuffer::CopyTo/CopyFrom
  • #10846 - Support both 1-d and 2-d VTCM allocations
  • #10581 - Improved ergonomics of HexagonLauncher in unit tests.
  • #10616 - Refactor tvm.contrib.hexagon, NFC
  • #10612 - Deprecate SDK 3.x, rewrite HexagonSDK.cmake
  • #10586 - Codegen for 2d Load/Store
  • #10558 - Generalize builtin for Nd memory alloc with storage scope and add lowering for VTCM / Hexagon
  • #10543 - [Runtime][PipelineExecutor] Add the pipeline internal forwarding logic.
  • #10507 - Add doc on TVM - Hexagon RPC flow
  • #10520 - Resolve breakage in test_hexagon/test_cache_read_write
  • #10311 - [runtime]AOTExecutor implementation for C Codegen
  • #10454 - Allow execution on target or simulator from HexagonLauncher
  • #10365 - Lower cache_read and cache_write to Hexagon DMA via tensorize
  • #10361 - RPC server/client for simulator
  • #10302 - [CI]Add Hexagon Tests to pipeline
  • #10263 - [Docker]Add docker file and scripts
  • #10227 - Refactor Hexagon.cmake
  • #10217 - Adding support for Hexagon User DMA Engine
  • #10068 - Update hexagon API build instruction and cleanup hexagon_proxy_rpc
  • #9970 - Do not auto-build apps when building TVM
  • #9736 - Add unit tests for HexagonBuffer
  • #9525 - Add Hexagon VTCM and discontiguous allocation support
  • #9631 - Add RPC Mechanism for Hexagon
  • #9473 - cleanup Hexagon conv2d tests

MetaSchedule

  • #11884 - Postproc: Rewrite-Layout
  • #11848 - [OpStrategy] Support MetaSchedule Layout
  • #11845 - [Relay][Pass] Meta-Schedule-Layout-Rewrite
  • #11758 - [Runtime] Enhance Runner RandomFill
  • #11683 - Distributed Measurement
  • #11751 - [Minor] Organize Testing Scripts
  • #11735 - Modify Profiler Timers
  • #11727 - Developer Ergonomics Enhancement II
  • #11692 - Apply-History-Best Task Filtering
  • #11486 - Add Profiler Support For Tuning Efficiency Optimization
  • #11680 - JSONDatabase Utilities
  • #11641 - Generate MetaSchedule Dataset
  • #11622 - Developer Ergonomics Enhancement
  • #11604 - Resolve dependencies between header files
  • #11587 - Add Testing Script with ONNX Support
  • #11590 - Evo Independence from TaskScheduler
  • #11534 - No explicit unrolling for spatial PrimFunc
  • #11512 - Enable Task Filtering
  • #11177 - AutoBind rule and MutateThreadBinding
  • #11157 - Logging Interface Unification
  • #11088 - Auto tensorization for CPU / GPU dot product
  • #10986 - [Refactor] Introduce TuneConfig
  • #11020 - [Metaschedule, Refactor] Move MultiLevelTilingNode decl to a header
  • #10927 - [Refactor] Clarify Integration Logic
  • #10876 - Add utility API to ease using manual schedules
  • #10885 - [BugFix] Fix skipped tests
  • #10366 - Add Gradient Based Task Scheduler
  • #10823 - Fine-Grained Rewrite Unbound Block
  • #10793 - Add demonstration of selectively tuning relay ops with TIR schedules
  • #10811 - Support grouping in the cost model
  • #10810 - Extract task weights during task extraction
  • #10782 - [TIR]Estimate TIR FLOPs
  • #10776 - Misc updates for tuning end-to-end workloads
  • #10689 - Upstream the leftover changes
  • #10648 - [Meta Schedule] Refactor meta schedule testing utils
  • #10578 - New relay backend for meta schedule task extraction
  • #10534 - Bug Fix for Relay Integration
  • #10501 - Update scripts for subgraph tuning
  • #10497 - Refactor testing workloads
  • #10461 - Enable AutoTVM-style template-based search space
  • #10368 - Fix Cyclic Dependency in PyClass Family
  • #10403 - Arithmetic analysis
  • #10367 - Update Tuning Interfaces.
  • #10079 - [M4a] User-API: Tune-TE/TIR/Relay
  • #10081 - [M4a] Rewrite-Cooperative-Fetch
  • #10055 - [M4b] Testcases for TensorRT builder/runner
  • #10092 - [M4a] Mutator: Mutate-Tile-Size
  • #10096 - [M4a] Mutator: Mutate Parallel
  • #10071 - [M4a] PostProcessor: Rewrite-Parallel-Vectorize-Unroll
  • #10043 - [M4a] Schedule Rule: Multi-Level-Tiling
  • #10045 - Mutator: Mutate-Unroll
  • #10033 - [M4a] Schedule Rule: Parallelize-Vectorize-Unroll
  • #10027 - [M4a] PostProcessor: Rewrite-Unbound-Block
  • #10028 - Mutator: Mutate-Compute-Location
  • #9997 - [M4a] PostProcessor: Disallow-Dynamic-Loop
  • #9994 - [M4a] Schedule Rule: Cross-Thread-Reduction
  • #10013 - [M4a] PostProcessor: Rewrite Reduction Block
  • #9975 - [M4a] Schedule Rule: Add-RFactor
  • #9945 - [M4a] PostProcessor: Verify-GPU-Code
  • #9940 - [M4a] Schedule Rule: Random-Compute-Location
  • #9943 - [M4a] Schedule Rule: Auto-Inline
  • #9860 - [M3c] Add Per-Store-Feature
  • #9859 - [M3c] XGB-based Cost Model
  • #9836 - [M4a] Add EvolutionarySearch Search Strategy
  • #9799 - [M4a] Add ReplayFunc Search Strategy
  • #9789 - [M3c] Update TuneContext, TaskScheduler & Search Strategy Design
  • #9780 - [M3c] Add More Measure Callbacks
  • #9761 - [M4a] Add ScheduleRule class & PostOrderApply space generator
  • #9760 - [M3c] Random Feature Extractor

MicroTVM

  • #11741 - Refactor RVM scripts and fix DNS network issue
  • #11472 - [ARM]Add tests for arm schedules
  • #11634 - Update pyproject to python3.7
  • Zephyr support - #11650
  • RPC - #11227, #10967

Relay

  • #11825 - [Relay][Pass] Add split infer shape with convert op layout pass
  • #11674 - Finish implementations of WithFields
  • #11481 - IndexedGraph improvements in preparation for Collage
  • #11432 - Plumb external codegen target via Target.current()
  • #11494 - [Pass] Add MaxPool, AvgPool to FoldExplicitPadding
  • #11183 - Add unidirectional sequence lstm
  • #11442 - Add 'static_library' runtime::Module
  • #11413 - [Topi]Support for FP16 ERF on CPU.
  • #11382 - Finish support for list-of-targets
  • #11386 - [Tests] Replace the Relay interpreter with the VM in the op tests
  • #11224 - Support i16, f16 scalars in Relay text
  • #11337 - Fix eltwise alter op layout for broadcast axis
  • #11199 - Flexible shape dispatch transformation
  • #11173 - Support 'external codegen targets'.
  • #10996 - Add FlattenAtrousConv transformation
  • #10871 - [CUDNN] Add cuDNN as a Relay partitioning target (BYOC)
  • #10787 - [Pass][Bugfix] Disable re-use of non-flat buffers in StorageRewrite.
  • #10378 - [FQ2I] Add leaky relu to FQ2I
  • #10400 - RelayViz graphviz renderer
  • #10352 - [VIRTUALDEVICE] Change syntax for device planning and store parameter virtual devices in virtual_device_ field
  • #10310 - [ARM_CPU] Conv2d int8 intrinsic for cortex-A72
  • #10085 - RelayViz interface and terminal ast-dump
  • #10239 - Add a conversion of individual operations in FQ2I pass.
  • #10236 - [Refactor] Clean up type relations that are declared as template for no reason
  • #10156 - Fix broadcast InferCorrectLayout
  • #10026 - [VM] Relay VM memory liveness/lifetime analysis
  • #10089 - [Pass] Add a relay pass to extract fake quantized ops
  • #9690 - Change function constructors to WithFields
  • #10069 - [DefuseOps pass] bug fix: To support function body types other…
  • #9954 - Add conv2d_backward_weight op (without topi)
  • #9838 - [FoldScaleAxis] Support dense and bias_add op in fold scale axis
  • #9816 - Add sliding_window operator
  • #9874 - Add a JSON converter for 0.7 -> 0.8 and 0.8 -> 0.9
  • #9735 - [AMP][Pass][Typing] Add faster type inference
  • #9723 - [Frontend] Add Span filling for frontends to Relay
  • #9749 - Fix invalid shape function for "copy" operator
  • #9759 - s/SEScope/VirtualDevice/g
  • #9734 - Support large constants saved/loaded outside of VM executable
  • #9613 - Re-run PlanDevices after LowerTE to flow new memory scope constraints.
  • #9693 - PlanDevices supports 'free' on_device annotations
  • #9641 - [AST] Add virtual_device as a first class field in Relay
  • #9483 - Switch the VM to use the LowerTE pass instead of TECompiler::{Lower,LowerShapeFunc}.
  • #9569 - WithFields method for Call, Function, Var, TupleGetItem, If, Let, RefCreate, RefRead, RefWrite, Match, and Clause
  • #9533 - WithFields for Tuples
  • #9550 - Prepare for switching VM to LowerTEPass.
  • #9542 - Prepare DeadCodeElimination for running post LowerTEPass/ManifestAlloc.
  • #9352 - [TVMC]Introduce executor and runtime parameters
  • #9457 - Add the Arm(R) Ethos(TM)-U NPU identity operator
  • #9326 - Switch PlanDevices pass to be w.r.t. SEScopes instead of DLDeviceTypes.
  • QNN - #11228, #10718, #10086, #10053, #9637, #9982

Runtime

  • #11334 - [PipelineExecutor] Add graph manually splitting logic into the unit test.
  • #11133 - [PipelineExecutor] Refactor PipelineExecutor.py and Add cross compile support for pipeline executor.
  • #11172 - Move WrapTimeEvaluator from RPC to profiling, NFC
  • #10990 - [PipelineExecutor]Add forwarding queue logic for set input.
  • #10953 - [Vulkan] Add RGP support to TVM for vulkan device
  • #10723 - [PipelineExecutor] Getting the asynchronous output
  • #10283 - AOTExecutor implementation and c target code-generator
  • #9802 - [ThreadPool]Refactor affinity function and support CPU affinity list setting.
  • #10234 - [Pipeline Executor] multiple threads management and the data forwarding notification mechanism.
  • #10326 - Improved log information with function signature
  • #10032 - [PackedFunc] Bring PackedFunc into TVM Object System
  • #10082 - [PipelineExecutor] Pipeline Executor Sequential execution
  • #10010 - [PipelineExecutor] Add Pipeline Executor Interface
  • #9846 - [Pipeline executor] Global parameters group name and runtime modules parameters map.
  • #9889 - [GraphExecutor] Add API get_input_info to graph_executor
  • #9751 - [Pipeline Executor] Add the map logic of global input and subgraph input.

TE

  • #11589 - Support schedulable TIR compute definitions in TOPI
  • #11341 - Optimized version of concatenation layer
  • #10561 - [TECompiler] Decouple TE compute and schedule lowering in ScheduleBuilder

TIR

  • #11592 - HoistExpression, generalization of HoistIfThenElse
  • #11870 - [Pass] Remove-Weight-Layout-Rewrite-Block
  • #11740 - [TIR, analysis] Add GetAutoTensorizeMappingInfo to generate transforms for auto tensorization
  • #11585 - Add preserve-unit-iters
  • #11677 - Register CUDA WMMA tensor intrinsics
  • #11658 - [TIR, CUDA] Add pass to replace global to shared memory copy with cp.async
  • #11624 - [Schedule] Allow named block and buffer arguments in Schedule
  • #11628 - [PASS] Refactor a couple of TIR passes - BindTarget, AnnotateEntryFunc, Filter, LowerInitBlock
  • #11574 - CSE pass : Restrict the equivalence to be decided by a normal form - avoids comparison of terms
  • #11575 - Schedule Primitive: Add-Unit-Loop
  • #11515 - Add schedule primitive ReIndex
  • #11524 - [Arith] Additional Simplifications Inside Conditionals
  • #11485 - Add schedule primitive TransformBlockLayout
  • #11495 - [Software pipeline] Fix hardcoded index in access_ptr rewriting, add a GPU test with depth 4
  • #11269 - [Schedule] Transform layout quality of life
  • #11355 - Support tensorization using ldmatrix + MMA
  • #11289 - [Schedule] Allowed typing.Tuple in tir.schedule._type_checker
  • #11317 - Support affine expressions as indices in reverse compute inline
  • #11235 - [Arith] Implemented padded inverses in IndexMap
  • #11238 - [ROOFLINE] Calculate roofline from existing TIR PrimFunc
  • #11225 - Add schedule primitive SetAxisSeparator
  • #11110 - Get read/write access precisely for opaque access.
  • #11106 - Enhance software pipeline validation and fix predicate of epilogue
  • #10843 - StmtFunctor RenewDefs
  • #11075 - Add function to tile a block according to a given tensor intrinsic
  • #11050 - Utility function to decide loop mapping for auto tensorization
  • #11009 - [ROCM] DP4A intrinsic support for TE/TIR
  • #10925 - VNNI and ARM dot product intrinsic for tensorization
  • #10887 - [Schedule] Relax reorder primitive's affine binding check
  • #10732 - [Analysis] Add SuggestIndexMap for layout rewriting
  • #10538 - [Schedule] Transform layout
  • #10638 - Change the behavior of read/write region analysis for reduction blocks.
  • #10705 - Use local complete block and local reduction block to identify compact dataflow
  • #10671 - Tuple Reduction Support in CreatePrimFunc
  • #9727 - [TE]Implement layout transformations, non-flat memory buffers
  • #10405 - [TensorIR] Update VerifyGPU
  • #10401 - [TensorIR] Renormalize split pattern
  • #10112 - [TIR, Relay] improve bfloat16 support
  • #8509 - Tir constants integration into compilation pipeline
  • #9996 - add support for multi-blocking layout and their transformation
  • #10066 - Add software pipelining
  • #10207 - Support sub warp reduction for CUDA target.
  • #9482 - Implementation of Common Subexpression Elimination for TIR
  • #9527 - Allow compute_at create block predicate for non-trivial bounds and support floordiv pattern
  • #10158 - [Schedule] Update compact_dataflow constraint
  • #9871 - [Schedule] Blockize and Tensorize
  • #10016 - [BugFix]Fix cross-thread reduction when single reduction loop with predicate
  • #9880 - Encode conditional accesses info into block read/write regions
  • #9699 - Affine utility support iter lowerbound and diagnostics
  • #9742 - [Schedule] Add Annotate/Unannotate primitive
  • #9738 - [TensorIR] Primitive "SetScope"
  • #9743 - [Schedule] Analysis functions to check if compute_inline and com…
  • #9689 - Allow memory (aka storage) scopes to be retrieved/applied to PrimFuncs
  • #9559 - [TensorIR][UX] Type annotation-based runtime type checking
  • #9444 - Add a 'rolling_buffer' scheduling primitive
  • #9360 - [TensorIR] Cross-Thread Reduction

TOPI

  • #11531 - TE implementation of LSTM using scan
  • #11161 - Add Adreno GPU target and topi supporting textures with dynamically allocated textures
  • #10332 - VNNI support for batch matmul
  • #9873 - Add support for grouped conv3d
  • #10230 - VNNI support for int8 dense
  • #10098 - [Op]5 ops can accept unsigned integers as indices
  • #9832 - Support grouped conv1d
  • #9694 - Add generic batch norm
  • #9233 - Cortex-M DSP support

TVMScript

  • #11308 - Represent ramp as index slice
  • #10099 - Support T.buffer_decl using data pointer from Let/Allocate
  • #9680 - Improve printer for TIR syntax sugar
  • #9492 - Add syntax sugar for T.handle and T.match_buffer (illustrated in the sketch after this list)
  • #9620 - Add for loop syntax sugar
  • #9543 - Misc error message improvements
  • #9505 - [Fix] Add type hints for more uncovered cases
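
As a hedged illustration of the signature sugar referenced above, the sketch below annotates buffer arguments directly in the function signature instead of taking a T.handle and calling T.match_buffer. The bracketed T.Buffer[...] spelling reflects the syntax of this era and is an assumption; later releases spell the annotation differently.

```python
from tvm.script import tir as T


@T.prim_func
def add_one(A: T.Buffer[(128,), "float32"], B: T.Buffer[(128,), "float32"]) -> None:
    # Equivalent to declaring a: T.handle and A = T.match_buffer(a, (128,), "float32").
    for i in T.serial(128):
        with T.block("B"):
            vi = T.axis.spatial(128, i)
            B[vi] = A[vi] + 1.0
```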

USMP

  • #11015 - U3 use case
  • #10189 - Adding support for U1 usecase for constant pools
  • #10785 - Adding support for U4 usecase
  • #10193 - adding support for U2 and U3 usecases
  • #10005 - Add performance characteristics to PoolInfo
  • #9565 - [TIR]Integrating USMP to AoT Executor
  • #9704 - Hill Climb allocator
  • #9418 - [TIR]adding the pass to convert to pool offsets
  • #9649 - [TIR]Augmenting the algo interface with memory pressure
  • #9214 - [TIR]Greedy memory planning algorithm
  • #8468 - [TIR]Added buffer info extraction pass

microNPU

  • #11468 - Optimize separate padding operation for conv2d
  • #11453 - Add transform matrices and part matcher to identity op
  • #11410 - Add E2E tests with cascader without striping
  • #11288 - Expose compute cycle annotations to TIR lowering
  • #10959 - Add a pass to reorder copy and compute nodes
  • #10509 - Add various options to the cascader
  • #11263 - Adding an option to enable striping
  • #10251 - Add support for conv2d running on two cores on U65
  • #10862 - Integrate the cascader
  • #10344 - Integrate rolling buffers in Arm(R) Ethos(TM)-U
  • #10824 - Some housekeeping in the test_ethosu folder
  • #10763 - Tweak a layout transform matrix
  • #10725 - Add a pass to move allocate nodes to the outer scope
  • #10695 - Determine block configs using the cascader
  • #10599 - Refactor Relay to TIR hook
  • #10508 - Improve cascader memory transfer estimates
  • #10345 - Add support for TFLite FULLY_CONNECTED
  • #10254 - Introduce a pass to remove redundant identity operations
  • #10062 - [5] Convert Proposals to te.Schedules
  • #9959 - [4] Add the cascader Proposal generator
  • #10022 - enable USMP
  • #10127 - Add support for LeakyReLU
  • #10004 - Add FreeRTOS variant of NPU demo
  • #10060 - Refactor type inference data type checks
  • #9960 - Add support for pack and unpack
  • #10143 - Fix layout assignment in layout optimizer pass
  • #9890 - [3] Plan generation for the cascader
  • #9855 - Add support for transpose convolution
  • #9841 - Add support for nearest neighbor and bilinear upsampling
  • #9951 - Removing constant args from PrimFunc
  • #9929 - Refactor base address determination to codegen
  • #9910 - Add support for requantize
  • #9831 - Move optimization passes to be a module pass and ensure they are running
  • #9785 - [2d] Add more Part matchers to cascader
  • #9778 - [2c] Add performance modelling to cascader
  • #9471 - [2b] Create CascaderGraphs from TE graphs
  • #9469 - [2a] Add CascaderGraph for cascading analysis
  • #9621 - Add support for SPLIT and SPLIT_V
  • #9508 - Update Conv2D Tests to Use TF API to Gen Test Cases
  • #9627 - Add support for SIGMOID
  • #9589 - Add support for TFLite concatenate
  • #9623 - Refactor codegen tests
  • #9561 - Add NHWC -> NHCWB16 layout transformation pass
  • #9576 - Mean legalization support
  • #9597 - Move the compilation to use Target Hooks.
  • #9458 - [1] Add affine analysis structures for the cascader
  • #9547 - Add the infrastructure for lookup table and TANH
  • #9521 - Support binary elementwise with non-4D inputs
  • #9560 - Fix incorrectly calculated stride when converting NHWC to NHCWB16
  • #9530 - Add unary elementwise operator infrastructure with ABS
  • #9514 - Adding rounding mode attribute to operators
  • #9515 - Allow constants to be given as input to an operator

microTVM

  • #11250 - [ARM] Add Relay tests for conv2d registered schedules
  • #11232 - [rpc] Implemented rpc logging
  • #11044 - Add support for host-driven AoT Executor
  • #11043 - Better version handling for Arduino
  • #10555 - Enable micro tvmc tutorial testing in CI
  • #10194 - [RVM] Add scripts for automated build and testing
  • #10144 - TVMCon 2021 Zephyr Demo with CMSIS-NN
  • #10024 - [tvmc] Add TVMC Micro tutorial for Zephyr
  • #9684 - Fix zephyr/test_zephyr_armv7m test
  • #9584 - [TVMC] Add TVMC test for Arduino and Zephyr
  • #9526 - Add minimal forwarding RPC server for host driven python execution on Hexagon
  • Zephyr support - #11362, #10138

Misc

  • #11465 - Add cooldown interval logic for the profiling functionality
  • #11888 - [LLVM] Include LLVM headers in files that use them, not in llvm_common.h
  • #11646 - [Arith] Simplification of ceil, log2, and left_shift
  • #11464 - [MLF] Add support for multiple modules in Model Library Format
  • #11632 - [AutoTVM][Autoscheduler] Default build funcs inherit PassContext
  • #11543 - [OpenCL] Implement conv2d_winograd algorithm for Adreno
  • #11287 - [Arith] Merge surjective/non-surjective iter mapping detections
  • #11393 - Add utility to replace direct call to pytest.main
  • #11252 - [ROOFLINE] Roofline analysis over RPC
  • #11000 - [Graph Debugger] Expose way to benchmark individual nodes.
  • #10794 - bump PyTorch version to 1.11
  • #10821 - [REFACTOR] Remove legacy nnvm folder
  • #10798 - [Arith] Remove diagnostic ctx argument from DetectIterMap
  • #10567 - [Refactor] Reduced repetition in CodeGenLLVM's buffer access
  • #10455 - [AUTO_SCHEDULER] Add feature extraction directly from PrimFunc
  • #7401 - RFC: initial stab at TorchScript fallback
  • #10391 - [vulkan] Add integer dot product (4xint8, 4xuint8) tensorization for the vulkan SPIR-V target.
  • #10293 - [VirtualMachine] new method allowing to set one input tensor by its index or name
  • #10191 - Generate correct output tensor names in C Interface API
  • #9276 - Parameterize test_link_params
  • #9808 - [Rust] Update Rust bindings
  • #9553 - [PROFILING] Add ability to profile a single function_profiling
  • #9611 - [CMAKE] Automatically detect newly added source files
  • #9544 - [Target] enable -arch=sm_xx for assigning cuda target arch and deprecate autotvm.measure.set_cuda_target_arch api
  • Profiler - #11530, #11066
  • Docs - #10921, #11403, #10774, #10912, #9633, #9906, #9534, #9307, #9654, #9580
  • Android - #11241
  • ETHOSN - #11261, #10486, #10018, #9596
  • TVMC - #11012, #10962, #10722, #9817, #9529, #9229

v0.8.0

2 years ago

Overview

Apache TVM v0.8 brings several major exciting experimental features, including:

  • PaddlePaddle frontend
  • TVMScript: round-trippable Python-based syntax for TIR (see the sketch after this list)
  • TorchScript integration
  • TensorIR scheduling language
  • TensorRT and CUTLASS integration via BYOC
  • Int4 TensorCore support in AutoTVM
  • MicroTVM Project API and Zephyr, Arduino support
  • AOT executor
  • Robust Windows support
  • Affine analysis infra: iter-affine-map
  • Improved Vulkan backend
  • CUDA graph support in TVM runtime
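
As a small illustration of the round-trippable TVMScript syntax highlighted above, here is a minimal sketch (not code from the release notes). The `from tvm.script import tir as T` namespace follows the RFC-0036 convention listed below and may not match the exact spelling in every 0.8 build.

```python
import tvm
from tvm.script import tir as T


@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(a: T.handle, b: T.handle) -> None:
        A = T.match_buffer(a, (16, 16), "float32")
        B = T.match_buffer(b, (16, 16), "float32")
        for i, j in T.grid(16, 16):
            with T.block("scale"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0


# Printing the module produces TVMScript text that parses back to the same IRModule.
print(Module.script())
```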

In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

  • Relay compilation engine
  • Relay pattern language
  • CI and build process
  • Refactoring documentation and tutorials
  • Stabilizing AutoScheduler
  • Stabilizing TVMC command line driver interface
  • Stabilizing target system
  • Frontend coverage, quantization, dynamic shape, training

Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

Accepted RFCs

The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted by the community since then:

  • [RFC-0005] Meta schedule (AutoTIR)
  • [RFC-0006] Automatic mixed-precision pass and support
  • [RFC-0007] Parametrized unit tests
  • [RFC-0008] MicroTVM Project API
  • [RFC-0009] Unified static memory planner
  • [RFC-0010] Target-registered compiler flow customisation
  • [RFC-0011] Arm® Ethos-U integration
  • [RFC-0014] Pipeline executor
  • [RFC-0015] Use CMSIS-NN with TVM
  • [RFC-0019] Add PaddlePaddle frontend
  • [RFC-0020] Extend metadata in project option
  • [RFC-0022] TIR non-scalar constants
  • [RFC-0023] Adding annotation field to tir.allocate nodes
  • [RFC-0025] PyTorchTVM
  • [RFC-0027] Formalize TVM documentation organization
  • [RFC-0028] Command line composition from internal registry
  • [RFC-0029] Migrating target attributes to IRModule
  • [RFC-0030] Command line configuration files
  • [RFC-0031] C Device API
  • [RFC-0036] TVMScript namespace
  • [RFC-0041] Update TVMScript block syntax

Features and Improvements

TE, TIR, TVMScript

AutoTVM, AutoScheduler, Meta Schedule

Operator Coverage

Training

Relay

MicroTVM, AOT, Graph Executor and VM

Arithmetic Analysis

  • Tighter bounds and more simplification on cast #6771 #7045
  • Introducing iterator (quasi-) affine map detection #6667 #7752 #7759
  • Inverse of iterator affine map #8384 #8427
  • Subspace division in iterator affine map #7760

Frontends

Codegen Backends and Runtime

BYOC Integration with Vendor Libraries: TensorRT, ACL, VitisAI

TVMC

Rust Binding

Misc

  • Enhanced CPP-RPC implementation: allow a user-supplied work dir, support the CPP-RPC server on Apple platforms, support adb-shell-style CPP-RPC #7670 #8224 #8223 #7766 #7013
  • Use PopenWorker to handle RPC system: #7889 #7757 #7961
  • Fold target host into target #7462 #7791 #7534 #8835 (see the sketch after this list)
  • Target-based intrinsic lowering and legalization #7936 #7809
  • Add target tags for all existing CUDA GPU models #7410
  • Linear Congruential Random Engine #8642
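
A small sketch of the "fold target host into target" change referenced above: the host compiler is now carried on the device target itself rather than passed as a separate target_host argument. This is a minimal illustration, not code from the release notes.

```python
import tvm

# The host target is attached to, and retrieved from, the device target.
target = tvm.target.Target("cuda", host="llvm")
print(target.host)
```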

v0.7.0

3 years ago

v0.6.1

3 years ago

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Apache TVM (incubating) 0.6.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache TVM (incubating) 0.6.0 are advised to upgrade. Please review the following release notes for details on the bug fixes.

Bug Fixes

  • Fixed process termination routine in windows #4844
  • [Runtime] Fix NDArray SaveDLTensor declaration and implementation signature different #4586
  • [NODE][Serialization]fix serialization precision loss in float #4503
  • [Relay][Frontend][TF] fix _parse_param bug #4711
  • Fix bias_add gradient #4516
  • Make sure to visit the arguments of inlined functions #4783
  • Fix Python syntax error in start_rpc_server_to_tracker.py #4682
  • [Bugfix] Fixed crash caused by reversing bitwise operations #4852
  • [Fix][VM] Fix copy constructor #5237
  • fix small bug about dense_grad #5695
  • [Fix] Fix conv2d alter op for arm cpu #5532
  • [Fix] Fix dense x86 schedule #4728
  • [Relay][Fix] Fix alter op layout when calling a global var #4454
  • [Relay][Pass] Fix lambda lift pass for recursive call #4432
  • [BUGFIX] Fix search path for libtvm_topi.so #4467
  • [Bugfix] Fix Python debugger segfaults with TVM built with LLVM #5685
  • [RUNTIME] Fix compile errors of OpenCL FPGA backend #4492
  • [BUGFIX][BACKPORT-0.6][ARITH] Fix FloorMod Simplifier #5509
  • Some Windows and MSVC fixes #4569
  • [Chisel][VTA] Fix multiple transfer issue in LoadUop module #4442
  • [VTA] Fix an issue in updating uop_idx in the TensorGemm module #4694
  • [VTA] Fixed a crash issue in TSIM driver #4527
  • [VTA] Enable streamlined GEMM execution #4392
  • [VTA][Chisel] End-to-end Inference with Chisel VTA #4574
  • Added declare of aluBits for TensorAlu #4624
  • [Quantization] Fix annotation for multiply op #4458
  • LRN only supports 4D tensors, remove it from alter_op_layout #5520
  • fix topi.nn.global_pool layout="NHWC" #4656
  • [FFI][Windows] Fix hasattr by extracting Python error type from Windows error message #4780
  • [Runtime] Export GraphRuntime in tvm_runtime.dll #5002
  • Fix Base64OutStream portability issue #4668
  • [AUTOTVM] Fix a bug in generating the search space #4779
  • [Relay][VM] Fix compilation of If-Elses #5040
  • [RELAY][FRONTEND][TENSORFLOW] Fix FuseBatchNorm output cast error if need_cast is True #4894
  • [Bugfix] fskip of EliminateCommonSubexpr cannot always return false #4620
  • [Fix] Add ConstantNode to IsAtomic #5457
  • [Fix] Fix RemoveUnusedFunctions pass #4700
  • [Relay][Fix] Fix alpha_equal bug for attribute check #4897
  • [Arith] keep div_mode during floordiv simplify #5922
  • [ARITH][BACKPORT-0.6] fix a min/max simplify bug #5761
  • [0.6-BACKPORT] Improve robustness of the docs build #5583

v0.6.0

4 years ago

0.6.0.rc0

4 years ago

v0.5

5 years ago

NOTE: This is a release prior to Apache incubation

This release features several major improvements. Some of the highlights are: an arbitrary-bits quantization algorithm and a high-level, auto-differentiable programming IR, Relay (NNVMv2).

The community welcomes new reviewers @nishi-t @were @siju-samuel @jroesch @xqdan @zhiics @grwlf @ajtulloch @vinx13 @junrushao1994 @FrozenGene @liangfu , new committers @srkreddy1238 @eqy @masahi @nhynes @phisiart @merrymercy @Laurawly @adityaatluri @Huyuwei

Change List

  • Fully featured 8-bit network support
    • 8bit quantizer
    • Arbitrary bits quantization algorithm
    • Intel CPU support
  • NVIDIA GPU 8-bit kernel
    • int8 gemm recipe
    • int8 conv2d
    • Autotvm integration
  • Automated tuning and scheduling
    • AutoTVM optimizations for mobile GPUs
    • AutoTVM optimizations for CUDA
    • AutoTVM optimizations for x86
  • Initial release of the differentiable programming IR, Relay
    • Generic & informative Relay error reporting #2408
    • Relay IR text format support #1781
    • Support control flows
    • A-Normal Form Canonicalization #2251
    • Type system support
    • End to end compilation
      • Frontend support: Caffe2 #2507 , CoreML #2476 , Keras #2376 , MXNet #2163 , ONNX, TFLite #2365
      • Operator coverage #1799 #2051
    • FoldScaleAxis #2020
    • SimplifyInference #2033
    • CombineParallelConv2D #2089
    • InstrumentBoundCheckers pass #2079
    • Bind & FoldConstant #2100
    • Alter Op Layout #2150
    • General OpFusion #2090
  • CodeGen
    • Gcc / g++ compatible C code generator for TVM #2161
    • Device type annotation for heterogeneous compilation #2361
    • Cache packed func ptr, lift alloca #2070
    • Generalize compute to tensor region #1476
  • Runtime
    • Relay interpreter and compiler #1954
    • Heterogeneous runtime #1695
    • Language bindings: Golang runtime #1470 , Rust runtime #1597
    • Add min_repeat_ms to time_evaluator #2200
    • Bundled interpreter demonstration #2297
    • Enable PlanMemory in the graph runtime #2120
  • Language Binding
    • Rust frontend #2292
  • VTA
    • Improved RPC for VTA #2043
  • Hybrid python programming model
  • TOP
    • Initial support for sparse tensor computation
    • Improve ARM CPU depthwise convolution performance #2345
    • Port winograd ops to relay #2356
  • Tutorials and docs
    • Relay language docs #2232
    • Tutorials on how to use SGX backend
    • How to write a pass in python
    • General lowering flow of TVM
    • How to do tensorize
    • TFLite frontend tutorial #2508
    • Keras seq2seq model for translation tutorial #1815
    • Committer guide and tips #2468
    • Code review guideline on API designs #2459

Contributors

Code reviewers

  • @tqchen
  • @liangfu quantization, relay, topi, frontend
  • @zhiics relay, runtime, frontend
  • @nhynes quantization, rust
  • @Huyuwei frontend
  • @yzhliu relay, frontend, perf
  • @xqdan hybrid script, tvm/lang
  • @ZihengJiang relay
  • @vinx13 relay/pass, topi
  • @masahi relay/pass, frontend, doc, topi
  • @grwlf frontend, topi, relay, quantization
  • @tmoreau89 vta, relay, backend, runtime
  • @kazum frontend
  • @nishi-t frontend, topi
  • @PariksheetPinjari909 frontend
  • @jroesch relay, frontend, doc
  • @srkreddy1238 relay/op, frontend
  • @siju-samuel relay/op, frontend
  • @junrushao1994 relay
  • @icemelon9 relay, perf, tvm/lang, codegen
  • @ajtulloch relay, frontend
  • @alex-weaver relay
  • @kevinthesun hybrid script, topi, relay
  • @Laurawly topi
  • @were hybrid script, topi
  • @FrozenGene frontend, topi, relay/pass
  • @eqy relay, topi, runtime, rust
  • @zhreshold frontend, relay/op
  • @merrymercy relay/op, topi, runtime, frontend
  • @derisavi-huawei symbolic integers

Code contributions

  • @tqchen tvm
  • @vinx13 relay/pass, topi
  • @siju-samuel topi, relay/op
  • @merrymercy autotvm, topi, relay/pass
  • @srkreddy1238 relay/op, frontend/tf
  • @MarisaKirisame relay
  • @slyubomirsky relay, docs
  • @jroesch relay
  • @nhynes rust
  • @wweic docs, relay/pass
  • @yzhliu perf, frontend
  • @zhiics relay/pass, relay/op, runtime
  • @were hybrid script
  • @icemelon9 perf, relay/pass, relay/op
  • @joshpoll relay, docs
  • @sgrechanik-h codegen
  • @kazum frontend/keras, topi
  • @masahi relay/op, docs
  • @FrozenGene perf, frontend/tf
  • @liangdzou docs
  • @junrushao1994 relay/op
  • @eqy autotvm, runtime
  • @apivovarov docs
  • @ajtulloch runtime, nnpack
  • @kevinthesun relay/op, perf
  • @ZihengJiang relay/pass, quantization
  • @hlu1 nnpack, frontend/caffe2
  • @lixiaoquan nnvm
  • @imorinaga frontend/mxnet
  • @liangfu topi, docs
  • @xqdan codegen
  • @PariksheetPinjari909 frontend/darknet
  • @alexeyr frontend/tensorflow
  • @Rasterer topi
  • @yangchen-MS codegen
  • @anijain2305 relay/op
  • @grwlf topi
  • @Huyuwei topi, frontend/keras
  • @denis0x0D runtime/trace, relay/pass
  • @Mutinifni codegen
  • @derisavi relay/pass
  • @tmoreau89 vta
  • @Laurawly topi, perf
  • @zhreshold frontend, topi
  • @kun-zh codegen
  • @reminisce relay/op
  • @ehsanmok rust
  • @cnuernber perf
  • @cowanmeg topi, codegen
  • @yuruofeifei topi

v0.4

5 years ago

NOTE: This is a release prior to Apache incubation

This release features several major improvements. The high-level graph optimizer is now part of the TVM repo. Some of the highlights are: initial support of AutoTVM for automated optimization and the customized accelerator backend VTA. Please also check out tvm.ai for the latest blog posts.

The community welcomes new reviewers @kazum @alex-weaver @masahi @zhreshold @PariksheetPinjari909 @srkreddy1238 @eqy, new code owner @merrymercy, and new committer @yzhliu

Change List

Tensor Expression and Optimization

  • Tensor operator primitives
    • Introduce an attrs field to operator primitives (e.g. compute) to store additional metadata; the attrs can be used as hints for scheduling
  • Enable embedding of asm micro-kernels
  • Hybrid python programming model
    • python AST based IR builder interface
    • support GPU programs
  • AutoTVM, Automated tuning, and scheduling
    • basic autotvm infra
    • GPU IR verifier
    • basic autotuning tutorial
    • topi integration
  • ARM support
    • winograd support
    • initial support of ARM autotuning records
  • TOPI Vision
    • Generic GPU sort support(useful for vision)
    • SSD operator support
  • TOPI numpy consistency
    • Rename all binary operators for numpy consistency: broadcast_add -> add, broadcast_sub -> subtract, broadcast_mul -> multiply, broadcast_div -> divide
    • New operators: slice, LRN, equal, not_equal, less, greater
    • tutorials on topi
  • Initial low-bit operator support
    • Optimized popcount generation on ARM
    • general bit-serial convolution and GEMM
    • optimized low bit kernels
    • parallel optimization
  • New topi backend optimization for intel graphics
  • Adapt AVX schedules for SSE target

Backend

  • VTA: customized accelerator backend
    • custom hardware backend example
    • tutorials on how to use customized accelerator
  • Initial experimental support for HLS backend
  • Bugfix in SPIRV code generator for vulkan
  • libdevice support, enable NVPTX backend

Runtime

  • Introduce NDArrayContainer for managed NDarray
  • RPC and Device API
    • Support communication between big/small endian machines.
    • RPC and device API protocol upgrade to support big/little-endian communication. This is a non-backward-compatible change; the latest version of the TVM runtime must be used with the RPC.
    • Graduated RPC from contrib: tvm.contrib.rpc -> tvm.rpc
    • Support tracker in Android RPC and add fault tolerance for AutoTVM (see the sketch after this list)
  • BIG.LITTLE aware threadpool
  • tvm4j graph runtime that runs end to end workload in java
  • DLPack support
    • Support from_dlpack and to_dlpack
    • Enables bridges to PyTorch
  • Enable link of stackvm in runtime
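
A minimal, hedged sketch of the tracker-based RPC flow referenced above. It uses the modern tvm.rpc module path (the pre-graduation path was tvm.contrib.rpc), and the tracker address and device key are placeholders.

```python
from tvm import rpc

# Connect to a tracker and request a remote session from its pool of devices.
tracker = rpc.connect_tracker("0.0.0.0", 9190)
remote = tracker.request("android", priority=1, session_timeout=60)
dev = remote.cpu(0)  # device handle usable with modules uploaded to the remote
```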

NNVM

  • Tensorflow graphdef frontend
  • Keras frontend
    • Improved to support reused layers and additional activations
  • ONNX
    • gather, LRN
  • CoreML frontend
    • Support C-RNN and activation functions
  • Fix grads for sum and expand_like
  • Enhanced operator fusion for multiple elemwise branches
  • Separate nnvm fusion and compilation pass

Misc

  • Unified the build system to CMake, with customizable paths for Vulkan, ROCm, and CUDA

Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code reviewers

  • @yzhliu topi, tvm4j, nnvm
  • @kevinthesun nnvm
  • @Huyuwei topi operators
  • @tmoreau89 hardware backends
  • @comaniac fpga backends
  • @kazum nnvm, opencl backend, fpga
  • @nishi-t nnvm, opencl backend
  • @merrymercy topi, arm,
  • @vinx13 gpu backend
  • @masahi nnvm, topi
  • @eqy autotvm
  • @jroesch runtime
  • @PariksheetPinjari909 frontends, topi
  • @srkreddy1238 frontends, topi
  • @FrozenGene autotvm

Compiler

  • @alex-weaver vulkan
  • @were hybrid script mode
  • @nishi-t CUDA, fp16, int8 support
  • @ktabata intel FPGA support
  • @kazum xilinx fpga support
  • @cowanmeg arm optimized popcount
  • @tmoreau89 VTA customized accelerator

TOPI, graph optimization

  • @merrymercy AutoTVM
  • @yzhliu tvm4j graph runtime, x86
  • @Laurawly intel graphics
  • @abergeron conda build fix
  • @nhynes sgx random
  • @masahi topi, more robust op fusion
  • @kevinthesun vision ops
  • @grwlf argmax/min ops
  • @cowanmeg bit-serial operator
  • @ehsanmok topi tutorial
  • @zhiics refactor fusion and compilation into separate pass
  • @liangfu binary logical operators

Frontends

  • @srkreddy1238 tutorials for deployment, tensorflow frontend
  • @siju-samuel coreml, tf frontend
  • @PariksheetPinjari909 nnvm, slice
  • @kazum keras
  • @nishi-t mxnet, nnvm

Deploy

  • @eqy rpc, thread runtime
  • @dayanandasiet android tutorials

v0.3

5 years ago

NOTE: This is a release prior to Apache incubation

This release features numerous improvements in TOPI and the backends. We take the first step toward object detection support in TOPI, featuring operators necessary for YOLO and SSD. TOPI now supports a numpy-style API and operator overloading. RPC is significantly improved to support resource allocation and using a pool of devices. We are adding two new backends: WebGL for running GPUs in the browser, and Vulkan for running on the next-generation graphics API. Please also check out the TVM blog for the latest posts.

Change List

  • TOPI Vision operators
    • SSD support
    • YOLO support
    • NMS operator support in vision
  • TOPI general numpy-style operators
    • numpy style operator overload in topi
    • more operators: flip, take
    • dilation support on conv2d and depthwise
  • 8bit support
    • ARM 8bit gemm
    • ARM 8bit conv
  • Low bit operator support
    • popcount intrinsics
    • 1-bit fully connected
  • Contrib: MPSDNN fully-connected and conv2d support
  • Better RPC support
    • RPC Tracker support to allow centralized resource management
    • RPC protocol upgrade (this is a non-backward compatible change) to support timeout in the proxy
      • This is a breaking change; the latest version of the TVM runtime must be used with the RPC
    • Fault tolerance to early server termination, with the correct exception propagated
    • RPC support enabled for ROCm AMDGPUs
  • Tutorials and docs
    • How to deploy to android devices.
  • Optimizations for hardware backends
    • intel CPU (AVX and AVX512)
  • Schedule Primitives
    • rfactor now supports factor_axis to specify the factored dimension in the result (see the sketch after this list)
    • cache_write now supports multiple output operators
    • Enable warp memory, which generates shuffle instructions
  • Framework bridge
    • MXNet bridge supported
  • C++ compiler API support
    • build migration
    • topi migration to c++
    • Target system in c++
  • WebGL backend
    • runtime and codegen
    • topi integration
    • end to end pipeline on the browser
  • Vulkan backend
    • vulkan runtime
    • spirv code generator
  • Security
    • intel SGX runtime support
    • multi-threaded SGX runtime
  • LLVM 7.0 support
  • Robustness
    • VerifyMemory to catch incorrect GPU schedules that write into GPU memory from the CPU
    • Verify compute formulas
  • Better CPU parallel runtime
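
A minimal sketch of the rfactor factor_axis argument referenced in the schedule-primitive notes above (not code from the release notes); it uses the modern tvm.te module paths rather than the 0.3-era ones.

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n, 16), name="A")
k = te.reduce_axis((0, 16), name="k")
B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B")

s = te.create_schedule(B.op)
ko, ki = s[B].split(k, factor=4)
# factor_axis=1 places the factored reduction dimension last in the rfactor result.
BF = s.rfactor(B, ki, factor_axis=1)
print(tvm.lower(s, [A, B], simple_mode=True))
```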

Main Contributors

See the complete list here. Thanks to all the contributors who contributed to this release.

Code Reviewers

  • @zhreshold for reviewing many vision ops
  • @Huyuwei topi operators
  • @sxjscience for reviewing topi operators

TOPI:

  • @merrymercy Mali GPU support
  • @PariksheetPinjari909 topi vision ops, support for darknet operators
  • @yzhliu intel CPU optimization
  • @kevinthesun Vision operators, initial ssd, nms operator support
  • @dingobye Various great TOPI improvements for operator overloading
  • @Huyuwei dilation support to conv
  • @masahi Intel CPU topi
  • @nishi-t improvements in pooling

Compiler:

  • @nhynes SGX support
  • @phisiart WebGL backend
  • @alex-weaver C++ compiler support
  • @kun-zh bug fixes for bound checking in code
  • @xqdan improvements to low-level schedule rewriting
  • @yidawang parallel runtime improvement
  • @eqy AMD GPU backend improvements
  • @Laurawly Initial improvements for Intel GPU
  • @cnuernber Improved runtime device stream API