Apache TVM (incubator-tvm) Release Notes

Open deep learning compiler stack for CPU, GPU, and specialized accelerators

v0.16.0

1 week ago

Introduction

The TVM community has worked since the v0.15.0 release to deliver the following exciting improvements! Highlights of this release:

  • First support for Relax, with dynamic shape and pipeline support (see the sketch after this list)
  • The Dlight module for optimizing LLM TIR workloads on GPU
  • The Disco module for initial SPMD multi-GPU support
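
As context for the Relax highlight above, the following is a minimal sketch of a Relax function with a dynamic (symbolic) shape, compiled and run through the Relax pipeline. It assumes a v0.16 build of TVM with LLVM enabled; the module and variable names are illustrative rather than taken from any PR listed below.

    import numpy as np
    import tvm
    from tvm import relax
    from tvm.script import ir as I, relax as R

    @I.ir_module
    class Module:
        @R.function
        def main(x: R.Tensor(("n", 4), "float32")) -> R.Tensor(("n", 4), "float32"):
            # "n" is a symbolic dimension, resolved per call at runtime.
            with R.dataflow():
                y = R.add(x, x)
                R.output(y)
            return y

    mod = relax.transform.LegalizeOps()(Module)  # lower Relax ops to TIR
    ex = relax.build(mod, target="llvm")
    vm = relax.VirtualMachine(ex, tvm.cpu())
    out = vm["main"](tvm.nd.array(np.ones((7, 4), "float32")))  # here n == 7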

The main tags are below (bold text indicates areas with significant progress):

  • Community, RFCs
  • Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
  • Relax, Dlight, Disco
  • Arith, TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.16.dev0...v0.16.0.rc0.

Community

  • #16695 - Add new key for release signing
  • #16419 - Add new key for release signing

RFCs

This new RFC explores how TVM can be used to generate code for the Scalable Matrix Extension (SME) ISA, improving inference performance on supported Arm®-based hardware that implements SME.

  • #107 - [RFC] Scalable Matrix Extension enablement
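
For context, enabling an architecture extension in TVM is typically driven through the LLVM target string, so an SME-enabled target might be specified as sketched below. The exact -mattr feature flags and the has_sme attribute are assumptions based on LLVM's AArch64 feature naming and the target-parser PR #16794 listed under Misc, not on the RFC itself.

    import tvm

    # Hypothetical target string: "+sme" follows LLVM's AArch64 feature-flag
    # naming, but the flags accepted by your LLVM build may differ.
    target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mattr=+v9.2a,+sme")
    print(target.features.has_sme)  # feature parsing added by #16794 (assumed name)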

Arith

  • #16735 - [Fixup] Require feature flag for tighter inequality bounds
  • #16588 - Provide tighter ConstIntBounds for special cases
  • #16704 - [Fix] Fix canonical simplification of LE

BYOC

  • #16567 - Skip processed functions in FuseOpsByPattern and RunCodegen

BugFix

  • #16766 - [Target] Added null check to fix segfault at ->defined() in cpu.cc DetectSystemTriple()
  • #16739 - [Ansor] Fixing Ansor Gradient Bug
  • #16820 - [Fix] PAPI docs
  • #16793 - [Fix] fix for numpy 2.0 compatibility
  • #16790 - [Fix] Fix build errors with VS2022
  • #16780 - [Fix] Fix numpy dtype map
  • #16773 - [Fix] Fix the purity flag of "vm.call_tir_dyn" and "kill" ops
  • #16770 - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
  • #16771 - [Fix] Remove redundant "remove_all_unused" in IPC memory lowering
  • #16746 - [Fix][Builtin] Fix "GetQueryPosition" of PagedKVCache
  • #16728 - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn ABI changes in debug mode
  • #16714 - [Fix] PagedKVCache fetching compute stream when copy stream is needed
  • #16684 - [SLM] Produce well-formed Relax for nn.modules.KVCache
  • #16659 - add the default value for DFT in ONNX frontend
  • #16637 - [Transform] Preserve symbolic variables in FuseOps
  • #16649 - [FFI] Add a missing default for datatype lanes
  • #16492 - [Executor] fix debug_executor function debug_get_output
  • #16598 - [Transform] Handle non-composite lambda functions in FuseOps
  • #16565 - [Transform] Keep private non-primitive functions in FuseTIR
  • #16518 - Use x*x*x instead of pow(x,3)
  • #16436 - Ensure that bf16 arrays are created as expected
  • #16361 - Disable SingleEnvThreadVerifier
  • #16289 - [AUTOTVM][FIX] Typo fixes and add a warning in the Droplet Search

CI

  • #16837 - Disable flaky unit test
  • #16765 - [AOT][Testing] Improve output mismatch information on test failure
  • #16661 - add merge_with_main in unity
  • #16611 - [AOT][Testing] Print output values on test failure
  • #16546 - Disable testing that downloads from mxnet
  • #16521 - Fix CI Script and Broken Tests
  • #16502 - Support tvm-bot rerun for tvm-unity task
  • #16435 - Update image tag to 20240126-070121-8ade9c30e
  • #16420 - [WASM] Update emsdk and nodejs version
  • #16384 - Remove NVIDIA_DISABLE_REQUIRE
  • #16382 - In jenkins.cmd_utils.Sh.tee, check for failing subprocess
  • #16366 - Upgrade sccache version to 0.7.*
  • #16369 - Upgrade Unity ci images
  • #16344 - Update docker images tag to 20240105-165030-51bdaec6
  • #16340 - [Unity][UnitTest] Increase atol to resolve flaky CI failure
  • #16337 - [Hexagon][UnitTest] Disable flaky quantization test
  • #16336 - Upgrade cmake version to 3.24.0

Docker

  • #16755 - [SME] Add Fixed Virtual Platform (FVP) and toolchain install
  • #16348 - Upgrade pip in i386 container

Disco

  • #16618 - [Disco] Propagate structlog configuration to disco workers
  • #16639 - [Disco] Expose functions to query the per-worker device/rank
  • #16617 - [Disco] Implement Session.import_python_module method
  • #16715 - [Disco] Propagate structlog/logging config to workers
  • #16845 - [Debug][Disco] Check if a PackedFunc exists before calling it
  • #16817 - [Disco] Reduce Process/ThreadSession message queue reads and writes
  • #16807 - [Disco] Support setting workers' CPU affinity
  • #16375 - [Unity] Fix creation of disco ProcessSession
  • #16821 - [Fix] Add TVM_DLL to Disco session
  • #16752 - [Fix] Lazy import of "psutil" in disco process pool
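
For context on the entries above, Disco is driven from Python through a Session object. The following is a minimal sketch assuming a TVM build with NCCL enabled and two local GPUs; the worker count, device ids, and method signatures should be treated as assumptions against your exact build.

    from tvm.runtime import disco as di

    # Spawn two worker processes, one per GPU (assumes CUDA devices 0 and 1).
    sess = di.ProcessSession(num_workers=2)
    sess.init_ccl("nccl", 0, 1)  # set up NCCL communicators across the workers

    # Allocate a float32 array on every worker; each worker holds its own copy.
    d_arr = sess.empty((2, 3), "float32")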

Dlight

  • #16775 - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
  • #16429 - [Unity][Dlight][Fix] Reduction rule support dyn-shape epilogue
  • #16351 - [Unity] Add dlight.gpu.Fallback in DispatchSortScan, add argsort, topk, and cumprod
  • #16338 - [Unity][DLight] Introduce Specific Rule for RMSNorm
  • #16251 - [Unity][Dlight] Support dlight gemv rule on nested inner block
  • #16878 - [Dlight] Enhance vectorization loading weight for gemv
  • #16848 - [DLight] Fix a corner case for reduction rule
  • #16701 - [Dlight] Add fallback for low batch gemv with outer reduction
  • #16678 - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
  • #16665 - [Dlight] Skip GeMV when normalization fails
  • #16579 - [Dlight] Scheduling Low batch GEMM using GEMV-like rule
  • #16321 - [DLight] Skip rule if target is not suitable
  • #16731 - [Dlight] Fix GeMV shared memory estimation
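
For context, the dlight rules listed above are applied as an IRModule pass over TIR workloads. Below is a minimal, self-contained sketch assuming a CUDA-enabled build of TVM; the PrimFunc and the rule selection are illustrative only.

    import tvm
    from tvm import dlight as dl
    from tvm.script import ir as I, tir as T

    @I.ir_module
    class Mod:
        @T.prim_func
        def add_one(A: T.Buffer((128, 128), "float32"),
                    B: T.Buffer((128, 128), "float32")):
            # A simple elementwise workload that the Fallback rule can schedule.
            for i, j in T.grid(128, 128):
                with T.block("add_one"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    B[vi, vj] = A[vi, vj] + T.float32(1)

    with tvm.target.Target("cuda"):
        scheduled = dl.ApplyDefaultSchedule(  # tries each rule in order per func
            dl.gpu.GEMV(),
            dl.gpu.Reduction(),
            dl.gpu.Fallback(),                # catch-all for elementwise cases
        )(Mod)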

Docs

  • #16792 - [Doc] Fix set_axis_separator example
  • #16610 - [Doc] Fixed Docstring usage example in tvm.ir.make_node
  • #16572 - [Doc] Remove MxNet related tutorials
  • #16514 - [Unity][Doc] Document passes that depend on DataflowBlocks and encourage using ConvertToDataflow
  • #16482 - [Doc] Fix Docstring in extern.py for Sphinx
  • #16346 - [Doc] Fix minor error in "Expressions in Relay"

Frontend

  • #16001 - [ONNX] Fix interpreting auto_pad parameters in ConvTranspose operator
  • #16651 - [PaddlePaddle] Support quantized PaddlePaddle models with NCHW data format
  • #16616 - [PaddlePaddle] Support conv2d when data_format is NHWC
  • #16526 - [Keras] Enable Dense operator for any input dims
  • #16478 - [PaddlePaddle] Fixed a bug that prevented the model from being successfully converted to microTVM on macOS

Hexagon

  • #16762 - [VM] Cache operations when bypass mode is enabled
  • #16706 - [VM] Add buffers to dma_wait builtin
  • #16448 - [VM] Implement dma_copy and dma_wait builtin for Hexagon

LLVM

  • #16782 - [SVE] Support scalable vectors in LoopVectorizer
  • #16812 - Fix compilation failure due to minor change
  • #16808 - [Runtime] Fix errors during loading of target tags
  • #16748 - Lack of DWARF type is not an error
  • #16696 - [SVE] Add codegen support for scalable buffer accesses
  • #15964 - [RUNTIME] Add optional LLVM ORCJIT runtime executor
  • #16612 - [SVE] Add support for scalable data type strings
  • #16523 - [SVE] Change the dtype of Ramp and Broadcast lanes to PrimExpr
  • #16484 - [SVE] Add vscale builtin
  • #16373 - Update Host.h path

MetaSchedule

  • #16725 - Make the opt_level of tune_relay() adjustable

Metal

  • #16713 - [RUNTIME] Provide richer runtime errors when an error happens
  • #16605 - [RUNTIME] Fix multithreaded access of the Metal runtime
  • #16438 - Dispatch numerically stable tanh for metal

OpenCL & CLML

  • #16854 - [OpenCL] Add OpenCL device for automatic target detection
  • #16846 - [Meta-Schedule][OpenCL] Enable MS tuning for Android OpenCL
  • #16768 - [RUNTIME][OPENCL] Bugfix for clImage create with host ptr
  • #16672 - [CLML] Fix build TVM with CLML on macOS
  • #16328 - [RUNTIME][CLML] Fix for Softmax op for 4D tensors
  • #16394 - [OpenCL][CMake] Fix OpenCL tests compilation

ROCm

  • #16441 - [WebGPU] Intrin Dispatch: tanh, erf, log
  • #16404 - Some fixes of ROCm codegen

Relax

  • #16872 - Enhance symbolic expr estimation in memory planning
  • #16867 - Dispatch sort/scan for non-cuda gpu backends
  • #16852 - Fix EliminateCommonSubexpr removing alloc tensor
  • #16851 - [Relax,Topi] Allow passing workspace to thrust to avoid allocations
  • #16841 - Provide well-formed output in transform.LazyGetInput
  • #16798 - [Transform] Provide callback versions of LazyTransformParams
  • #16801 - Allow DeadCodeElimination within ApplyPassToFunction
  • #16834 - Capture symbolic vars in struct info of weights
  • #16830 - Share storage allocs among functions after cuda graph rewriting
  • #16823 - [VM] Refactor CUDA graph builtins as VM extension
  • #16828 - [Bugfix] Provide the full Expr to pattern-match rewriter
  • #16805 - [Bugfix] BlockBuilder may not assume unique input functions
  • #16815 - Enable capturing symbolic shapes in cuda graph
  • #16642 - Allow R.Prim('bool') in relax::If and assert_op
  • #16796 - Unit-test for structural equal of recursive function
  • #16732 - Allow composition of DFPattern replacements
  • #16783 - Improve CanonicalizeBindings in DataflowVar edge case
  • #16721 - Implement operators to inspect DLTensor::strides and offset
  • #16730 - Refactor PatternRewriter into separate Block/Expr mutators
  • #16756 - [IR] Improve highlighting in assert_structural_equal
  • #16779 - Improve error message for malformed IR
  • #16569 - [Unity][Parser] Check well-formedness in the parser
  • #16759 - [Pass] Lowering passes for GPU IPC memory and allreduce
  • #16697 - Implement relax.transform.TopologicalSort
  • #16658 - Normalize use of void-type variable to inline R.tuple()
  • #16711 - [Frontend] Add op tanh, exp, negative, and permute
  • #16703 - [Fix] Fix top-p/top-k sampling kernel
  • #16669 - [Frontend][Onnx] add sum and globalavgpool 1d/3d op
  • #16691 - CUDA graph rewrite treating StringImm as static
  • #16685 - Implement StructInfoPattern for dataflow pattern matching
  • #16681 - [Frontend][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
  • #16584 - [Unity][TIR] Clear struct info when specializing PrimFunc
  • #16676 - Remove the legalization of cumsum/cumprod
  • #16654 - [Frontend][NN] Add support for Conv3D
  • #16674 - Eager free original weights in transform_params
  • #16675 - add sample_indices in sampling
  • #16648 - [Runtime] Support Unpack API for NDArrayCache
  • #16591 - [Unity][Transform] Handle dynamic shapes in CombineParallelMatmul
  • #16594 - [Transform] Preserve param names in LiftTransformParams
  • #16575 - [Unity] GPU sampling
  • #16574 - Additional unit tests for RemoveUnusedParameters
  • #16585 - [Unity][Analysis] Include impure call in VerifyWellFormed errors
  • #16421 - [Unity][Transform] Raise error in FuseOpsByPattern for SSA violation
  • #16629 - Fix error message in BlockBuilder
  • #16592 - Handle dynamic arguments in legalization of nn.attention
  • #16590 - [Unity][Transform] Check for permute_dims in ExpandMatmulOfSum
  • #16604 - [Frontend][Onnx] fix clip unsqueeze opset implement
  • #16568 - [Runtime] RNNState for State Space Models
  • #16563 - Implement operators to read runtime DLTensor* information
  • #16581 - [Unity][MSC][M4.2][Step2] Enable plugin with manager, test plugins in compile pipeline
  • #16600 - Expose name_hint field for BlockBuilder.match_cast
  • #16601 - [Transform] Canonicalize let var = R.const bindings
  • #16583 - [Unity][VM] Recursively visit match bindings in VMShapeLowerMutator
  • #16586 - Ignore non-relax functions in relax.transform.RunCodegen
  • #16573 - [VM] Re-implementation of callback functions
  • #16561 - [Bugfix] Remove call to tvm.build for empty TIR module
  • #16564 - [Unity] Check for symbolic vars in PrimValue when lowering to TIR
  • #16558 - Minor updates for NN frontend
  • #16542 - Support callback as argument
  • #16487 - [Unity][Transform] Handle call_tir_inplace in FuseTIR and FuseOps
  • #16355 - [Unity] Infer struct info for relax.op.split on dynamic-sized index
  • #16465 - [Redo][Unity] Split DecomposeOpsForTraining into two steps
  • #16495 - [Unity][MSC][M4.2][Step1] Enable plugin with manager, test plugins in compile pipeline
  • #16498 - [Frontend] "tensor_ir_inplace" op
  • #16500 - [Unity] Support storage reuse for dynamic shapes
  • #16493 - [Pass] Skip data type node for CSE pass
  • #16467 - [Unity][MSC][Refactor] Reconstruct BYOC and runner
  • #16422 - [Unity][CodeGen] RunCodegen based on externally-exposed functions
  • #16483 - [Unity][Frontend] Add Sigmoid and Square Op
  • #16472 - [Unity] Improved error message in tvm::relax::UpdateStructInfo
  • #16473 - [Unity] Improve error message in tensor_to_shape struct inference
  • #16466 - Memory planning for "partially dynamic" shapes
  • #16464 - NDArray Cache Update with DLTensor Support
  • #16315 - [Unity][Transform] Implement relax.transform.ReorderTakeAfterMatmul
  • #16313 - [Unity][Transform] Implement relax.transform.ExpandMatmulOfSum
  • #16411 - [Unity][Transform] Handle symbolic variables in LambdaLift
  • #16443 - [Unity][FIX] fix thread dtype mismatch
  • #16442 - Revert "[Unity] Split DecomposeOpsForTraining into two steps"
  • #16437 - [Unity] Improve buffer allocation for handling duplicated buffer names.
  • #16439 - [Unity] Support cumsum with pure int32
  • #16432 - [Unity] downgrade cmake version requirement
  • #16427 - [Unity][Frontend][NN] Better support for dynamic convolutions
  • #16418 - [Unity][Fix] Fix mismatched intrinsic name
  • #16129 - [Unity][Transform] Replace eligible operators with in-place versions in dataflow blocks
  • #16414 - [Bugfix][Unity] Recover MSVC/NVCC/ROCm/Vulkan
  • #15954 - [Unity] Split DecomposeOpsForTraining into two steps
  • #16111 - [Unity][Transform] Memory planning for dynamic-shape func return
  • #16396 - [Unity] PagedKVCache supporting on-the-fly RoPE calculation
  • #16395 - [Frontend][ONNX] Fix ONNX frontend parse
  • #16385 - [Unity][Op] Add Conv3D Operator
  • #16284 - [Unity][nnModule] Dynamic shape support in nn Module
  • #16378 - [Unity][BlockBuilder] Restore bb.get()
  • #16374 - [Unity] Support TIR kernel for PagedKVCache
  • #16314 - [Unity][Transform] Implement relax.transform.AdjustMatmulOrder
  • #16349 - [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate
  • #16376 - [Unity][Contrib] Fix a bug due to typo in vllm reconstruct_from_cache kernel and add test
  • #16388 - [Unity] Update dispatch test cases following the merge from main
  • #16335 - [Unity] Set CMAKE_CUDA_ARCHITECTURES default to native
  • #16306 - [Unity][Transform] Update LambdaLift to use name of lifted lambda
  • #16310 - [Unity][Analysis] Show objects instead of names in WellFormedChecker
  • #16362 - [Unity][Fix] Memory planning check value type of 'tir_var_upper_bound'
  • #16367 - [Unity][Transform] Handle replacement at both var binding and usage
  • #16309 - [Unity][Transform] Use parameter name in BundleModelParams
  • #16307 - [Unity] Improved error message in ExprMutator::ReEmitBinding
  • #16308 - [Unity] Improved error message for matmul shape mismatch
  • #16360 - [Unity] Enhance Torch-consistency in reshape
  • #16350 - [Unity][Contrib] Add vLLM paged attention kernel
  • #16303 - [Unity][NN] Use Linear name for nn.op.permute_dims
  • #16325 - [Unity][MSC][Legalize] legalize codes and mute logging
  • #16312 - [Unity][Analysis] Add utility for collecting compile-time bindings
  • #16330 - [Unity][WEBGPU] Enable wasm exception propagation
  • #16304 - [Unity][Analysis] Handle PrimStructInfo in EraseToWellDefined
  • #16305 - [Unity][Transform] Implement UpdateParamStructInfo
  • #16331 - [Unity] Alter op impl handling empty transform for output
  • #16254 - [Unity] Dispatch cumsum and sort
  • #16120 - [Unity][Transform] Extract partial-tuple-usage from FuseTIR
  • #16311 - [Unity] Validate struct info in relax::Call constructor
  • #16333 - [Unity] Fix nn.op.tensor_ir_op signature
  • #16302 - [Unity] Cutlass kernel compatibility with cmake 3.18+

Relay

  • #16622 - [ONNX] Fix the attribute mode parse of operator Upsample
  • #16626 - [ONNX] Fix the Resize operator in ONNX frontend
  • #16624 - [ONNX] fix the wrong default value about dtype in Multinomial converter
  • #16417 - [Frontend][Torch] fix pytorch frontend linspace op
  • #16400 - [Frontend][Torch] fix pytorch frontend not support logical or
  • #16390 - [Frontend][Torch] fix a typo mistake in nonzero_numpy
  • #16324 - make "ToScalar" support directly obtaining "int64_t"

Runtime

  • #16804 - Introduce MSCCLPP with NCCL equivalent interface
  • #16809 - Add "TVM_DLL" to NVTX header
  • #16750 - CUDA IPC Memory support and custom allreduce kernels
  • #16738 - [Refactor] Always specify device in allocator interface
  • #16716 - Ensure NDArray.CopyTo(Device) always sync
  • #16705 - Add TVM_DLL to memory manager functions
  • #16692 - PagedKVCache execute data copy on a separate stream
  • #16647 - [RPC] Fix FreeObject in minrpc server
  • #16667 - [Builtin] Using float32 accumulation in attention kernel
  • #16635 - [RPC] Enable RPCObjectRef over multi-hop RPC
  • #16630 - Add TVM_DLL to threading backend funcs
  • #16541 - Add "TVM_DLL" to NDArray cache load func
  • #16550 - [ROCM] Properly align rocm parameter buffer
  • #16545 - Fix dtype conversion for bf16 and fp8
  • #16508 - ParallelFor skipping thread backend for unit extent
  • #16486 - KV cache providing workspace for attn kernel
  • #16456 - [KVCache] AttentionWithFusedQKV and RoPE mode
  • #16415 - [Memory] Implement support for non-zero offset within a storage object in AllocNDArr…
  • #16387 - [RPC] Enable RPCObjectRef return in RPC
  • #16377 - Use cudaGetDeviceCount to check if device exists

TIR

  • #16832 - Use constructor for new PrimFunc in TransformLayout
  • #16543 - Fix segfaults from ordering of Let/Assert in MakePackedAPI
  • #16795 - Ramp and Broadcast lanes fixed to int32 dtype
  • #16767 - [Driver] Use BindTarget to specify target for FP8 legalization
  • #16742 - [Bugfix] Fix cache_read update buffer region
  • #16726 - [Bugfix] Avoid overwrite of unmanaged buffer allocations
  • #16548 - [CUDA] Add native FP8 support to codegen
  • #16723 - Implement max/min_value for fp8 data types
  • #16655 - Improve well-formed check's handling of match buffer
  • #16673 - Support Vector Reinterpret Calls
  • #16682 - [Bugfix] Handle AttrStmt of upcoming tir.Var in ConvertSSA
  • #16560 - Enhance and fix tensorize schedule for some cases
  • #16660 - [Bugfix] Fix duplicate AllocateConst in CacheReadWrite schedule primitive
  • #16544 - Expand debug symbol output for CodeGenLLVM
  • #16553 - Fix get_block_access_region for let bindings
  • #16515 - Require exactly same-dtype matching for Vulkan smem reuse
  • #16406 - Fix of inter thread reduction with shared memory prefetch
  • #16293 - Extend DP4A tensor intrin
  • #16345 - Allow sync threads inside condition
  • #16250 - In SplitHostDevice, check for variables in thread extents
  • #16184 - [Transform] Implement InlinePrivateFunctions

TOPI

  • #16652 - improve inclusive_scan for thrust
  • #16383 - [Target] Add fp16 SIMD support for conv2d on arm_cpu targets

TVMC

  • #16261 - Add tvmc flag to print ir before and print ir after named pass

TVMScript

  • #16864 - Add parser and printer support for e4m3/e5m2 fp8
  • #16844 - Produce empty DictAttrs when R.func_attrs is absent
  • #16811 - Do not throw error for duplicate definitions
  • #16641 - Allow use of relax.Expr with void type as a statement
  • #16663 - Infer T.reads() for DeclBuffer nodes
  • #16640 - Represent tir::builtin::ret() using python "return"
  • #16562 - [Bugfix] Handle R.match_cast as last binding in if/else
  • #16593 - [Unity] Parse R.Object return type from call_pure_packed
  • #16356 - [Unity] Optionally hide StructInfo that can be inferred
  • #16379 - [Unity] Update call_packed semantics to support empty sinfo_args

Vulkan

  • #16858 - Fix CLZ support for Vulkan

cuda & cutlass & tensorrt

  • #16865 - [Codegen, CUDA] Add handling of fp8 broadcast / const
  • #16818 - [Cutlass] Fix usage of cuda stream for group gemm
  • #16788 - [Cutlass] Add check for group gemm param shapes
  • #16789 - [Bugfix][Cutlass] Remove a typo in cutlass build
  • #16787 - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
  • #16751 - [Cutlass] Add group gemm kernels
  • #16736 - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
  • #16619 - [Bugfix][Cutlass] Check if function attributes is None
  • #16342 - [CUDA] Simple extend to optimize reuse for static shared memory.

microNPU

  • #16266 - [microNPU][ETHOSU] Add fixed point for tanh
  • #16680 - [microNPU][ETHOSU] Fix LUT size for int16 activations
  • #16401 - [microNPU][ETHOSU] Add fixed point for matmul

web

  • #16733 - Support web IndexedDB cache for larger model storage
  • #16810 - Support building tvm/web on Windows
  • #16825 - Allow custom bc files in emcc making
  • #16791 - Add kv_state and rnn_state to wasm_runtime
  • #16722 - Implement linear congruential generator, make runtime seedable
  • #16650 - Separate parallel shard download and iterative shard loading
  • #16694 - Initial support for asyncify
  • #16631 - Fix NDArrayCache loading report callback
  • #16525 - Move ArtifactCache to Interface, Support Cache delete and Batch Delete, Remove typo
  • #16554 - Compatibility with PagedKVCache in WebGPU
  • #16527 - Revert "[Unity] Temp disable wasm exception (#16444)"
  • #16504 - [Relax] Add ApplyPresenceAndFrequencyPenalty
  • #16485 - [wasm] Enlarge initial memory for emcc
  • #16444 - [Unity] Temp disable wasm exception

Misc

  • #16873 - [Thrust] Fix thrust workspace allocation
  • #16868 - [3rdparty] Bump flashinfer
  • #16871 - [PageKV] allow PopN to pop all the tokens in last block
  • #16866 - [3rdparty] Bump FlashInfer
  • #16863 - [Picojson] Let the key of objects in json be ordered by default
  • #16856 - [Thrust] Use pointer to tls pool to prevent creating new pool
  • #16850 - Fixing probability comment
  • #16849 - [KVCache] Initialize one extra page than specified
  • #16843 - [IR] Provide well-formed intermediate in ApplyPassToFunction
  • #16772 - [MSC][M5.3] Support torch.dynamo for dynamic models
  • #16839 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/cmsisnn
  • #16838 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/ethosu
  • #16831 - [KVCache] Reducing CacheAuxDataManager copy size
  • #16794 - [SME] Target parser support for SME
  • #16824 - [KVCache] Introducing auxiliary data manager
  • #16800 - [BugTIR] Fix error merging shared memory for ptx_cp_async
  • #16822 - [VM] Recycle VMFrame
  • #16813 - [KVCache] Support forking sequence at specific position
  • #16786 - [Codegen] Add check to disable invalid reinterpret
  • #16816 - [Cmake] Allow using custom CCCL path for thrust
  • #16784 - [SLM] Add unit tests for SLM to Relax exporter
  • #16814 - Fix includes of custom allreduce kernel
  • #16806 - [Debug] Improve error message in VMShapeLower
  • #16802 - [Debug] Improve error messages in LiftTransformParams
  • #16425 - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features
  • #16797 - [3rdparty] AUTO mode for custom all-reduce strategy
  • #16761 - [SME] Add support for inserting processor state annotations
  • #16778 - [Analysis] Allow calls to GlobalVar in @R.function
  • #16745 - [IR] Default to empty attributes, instead of NULL
  • #16777 - Revert "[SLM] Allow modules to define pre-processing of weights"
  • #16776 - [Contrib] Remove thrust "built but not used" warning
  • #16757 - [SLM] Allow modules to define pre-processing of weights
  • #16763 - [CONTRIB] Add nm symbol dump
  • #16717 - Enable Shared Function in LiftTransformParam Pass
  • #16729 - [Builtin] Sliding window and sink support for PagedKVCache
  • #16724 - Fix cpp_rtvm cmake build on Windows
  • #16513 - [Target] Automatically detect system triple when not specified by the user
  • #16710 - [CMake] Add "USE_FLASHINFER" to libinfo
  • #16702 - [MSC][M5.2] Enable quantize && prune with gym by wrapper
  • #16699 - [Transform] Remove R.Object parameters after LazyTransformParams
  • #16668 - [MSC][M5.1] Build wrapper to support compression
  • #16693 - [Contrib] Support NDArray cache taking generator
  • #16412 - [Lint] Add check to prevent usage of #include <regex>
  • #16689 - [DeviceAPI] Support "GetCurrentStream"
  • #16690 - Use target name instead of node name as function name
  • #16683 - [skip ci] Fix wasm exception flag
  • #16609 - Minor update docs instructions
  • #16656 - Simplify Windows CMake Command
  • #16666 - [KVCache] Fix the reference counter in sequence fork
  • #16662 - Fixing workload comment
  • #16595 - [Transform] Check for zero-param operators in LiftTransformParams
  • #16599 - [Transform] De-duplicate MatchCast nodes in EliminateCommonSubexpr
  • #16596 - [Transform] Implement relax.transform.ReorderPermuteDimsAfterConcat
  • #16597 - [Transform] Allow explicit name of bundled model parameters
  • #16602 - [Transform] Improvements to LazyTransformParams
  • #16606 - [KVCache] Support passing in attn_score_scaling_factor into KV cache
  • #16608 - Extend gpu memory bandwidth test to work through RPC
  • #16587 - [Debug] Improve error message for codegen pattern mismatches
  • #16570 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 1
  • #16576 - Update the 3rdparty/libflash_attn submodule
  • #16580 - [KVCache] Support mode "None" for Rotary Embedding
  • #16578 - [KVCache] Support returning query positions
  • #16571 - Fix compile warnings
  • #16540 - [Upd] Enable lld search to include /opt/rocm/llvm/bin for rocm
  • #16539 - Improve error message in NDArray::CopyFromTo
  • #16524 - [Build] Improving debug and build-dir options
  • #16551 - [KVCache] Fix attention kernel for ROCm
  • #16512 - Cut pytest-lazy-fixture
  • #16506 - Bump 3rdparty/cutlass_fpA_intB_gemm version
  • #16511 - [Minor] Fix Clang compilation warning in fuse_tir.cc and codegen_c_host.cc
  • #16516 - Add Relax, Unity Tags in make_notes.py
  • #16497 - [Instrument] Add default instrument to print all passes
  • #16494 - [DPL] Support tir_vars field in is_call_tir pattern
  • #16453 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm
  • #16454 - [BugTIR] Fix thread_sync occurring in LetStmt
  • #16468 - [LINT] Fix pylint issues in test_dma_builtin.py
  • #16413 - [Contrib] Workspace for cuBLAS backend
  • #16460 - [Cherry-pick][MSC][M4.1] Add plugin && plugin_builder, enable build and test in different frameworks (#16397)
  • #16461 - [Minor] Fix Docstring for sphinx-build
  • #16431 - [Schedule] Loop-Partition Scheduling Primitive
  • #16451 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/ethosu
  • #16452 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/cmsisnn
  • #16445 - [skip ci] update branch rule to prepare for unity transition
  • #16426 - [CMake] Enable cuda lang if USE_CUDA is on
  • #16407 - Add NVIDIA Hopper H100 target tag
  • #16398 - [DeviceAPI] Support querying total global memory
  • #16357 - [RPC] Fix tuning on macOS and Windows (#15771)
  • #16386 - [Thrust] Use no sync exec policy and caching allocator
  • #16343 - [CMake][MSVC] Disable permissive mode for MSVC builds
  • #16242 - [Codegen] Fix if_then_else codegen
  • #16341 - [CMake] Use ccache as CMAKE_CUDA_COMPILER_LAUNCHER
  • #16332 - Change metal dtype of ceil_log2 to fp32

v0.15.0.rc0

3 months ago

Release notes for v0.15.0.rc0 are identical to those of v0.15.0; see the v0.15.0 section below.

v0.15.0

3 months ago

Introduction

NOTE: This is the last release before the unity branch is switched to become the main branch. It contains no unity features.

The TVM community has worked since the v0.14.0 release to deliver the following exciting improvements! The main tags are below (bold text indicates areas with significant progress):

  • Community, RFCs
  • Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
  • Frontend & Relay
  • Arith, TOPI, TIR, TVMScript
  • Docs, CI, Misc, BugFix

Please visit the full listing of commits for a complete view: v0.14.0...v0.15.0.

Community

  • #16172 - Yixin Dong -> Reviewer
  • #16162 - Shuai Yuan -> Committer
  • #16164 - Qiang Zhang -> Committer
  • #16166 - Bohan Hou -> PMC
  • #16165 - Ruihang Lai -> PMC

RFCs

  • #105 - Add a new backend language: SYCL

Adreno

  • #15991 - [CI] Enhancements to Adreno specific CI utils
  • #15786 - [TOPI] Add conv2d transpose nchw texture schedule

Arith

  • #16227 - Simplify nested if_then_else when a constant appears in then_expr

ArmComputeLibrary

  • #15990 - [ACL] Update Compute Library to v23.08

Metal

  • #16192 - [Device] Fix metal warp size
  • #16033 - [Codegen] Disable cross-function call in Metal codegen

cuda & cutlass & tensorrt

  • #16061 - [CUDA] Add an option for profiling cuda kernels

microNPU

  • #16003 - [microNPU][ETHOSU] Fix ConcatRewriter args processing
  • #15929 - [microNPU][ETHOSU] Fix rounding mode in requantize operation

Runtime

  • #15896 - [CLML] Fix for CLML ops and enable more test case
  • #16133 - Parallel-for with threading backend
  • #16066 - Support clear global memory allocators
  • #16030 - Introduce TVM_MODULE_VTABLE Macros

BugFix

  • #16269 - Update pillow usage
  • #16272 - Fixed Inappropriate Logical Expression
  • #16216 - [TIR] Fix dynamic smem merge leaf alloc
  • #16190 - Fix the error of reloading the model library on the ROCm platform: "MIOpen Error: No invoker was registered for convolution forward."
  • #16167 - [Relay][Pytorch] Fix missing .dtype
  • #16091 - [Fix] Fix topi.rms_norm with float32 upscale
  • #16081 - [Fix] Broken Windows Build with LLVM
  • #16051 - [Fix][TIR] Fix dtype issues for match_buffer and ramp node
  • #14655 - [VTA] Fix FSIM compile error on macOS
  • #16021 - [FFI] Typo fix of IncRef to DecRef
  • #16010 - [Fix][TIR] fix mul dtype mismatch
  • #16000 - [Fix][TIR] fix symbolic strides lower
  • #15970 - [Hotfix] Mark python-FFI handling with TVM_DLL
  • #15965 - [CI] Better to pass the build folder

CI

  • #16110 - Refactor unittest folder
  • #16055 - Fix broken links about Jenkins
  • #16062 - Use LLVM 17 for tests on ci_arm
  • #16018 - [Tests] Fix work_dir location used by test_micro_tuning_with_meta_schedule
  • #16019 - [Tests] Check int8+int32 testcases in test_estimate_peak_flops_cpu
  • #16017 - [Tests] Fix str vs. int comparison in test_num_threads

Docs

  • #16282 - [Doc] Fix minor error in doc (Add an operator to Relay)
  • #16152 - [DOC] Add v0.14.0 docs to site
  • #16127 - Revert "[#15157][Rust][Doc] Re-enable the Rust documentation build (#15213)"
  • #16097 - Add missing backtick to contribute/code_guide.rst
  • #16089 - Fix error on linting by adding --rev argument
  • #16024 - Update release_process.rst about version number modification

Frontend & Relay

  • #16243 - [TFLite] Add support for quantized mirror pad
  • #15914 - [TFLite] Support quantized SQUARE
  • #16159 - [KERAS] Fix bug concat convert for NCHW
  • #16319 - [Torch] Add aten::broadcast_to
  • #16131 - [Pytorch] Add support for aten::unflatten
  • #16105 - [Pytorch] Add support for aten::bitwise_and
  • #16079 - [Pytorch] Add support for aten::swapaxes operator
  • #15502 - [Pytorch] aten::copy_ support for pytorch
  • #16180 - [Pytorch] Fix bug when converting models with torch.nn.ParameterList
  • #16143 - [Pytorch] Add support for aten::scaled_dot_product_attention
  • #16123 - [Pytorch] Add support for aten::linalg_vector_norm
  • #16171 - [Frontend] Preserve Pytorch Span Names
  • #16217 - [Frontend][QNN] fix access param_debug_name_map to node output name in fx-quantized graph node replacement
  • #16199 - [Frontend] Add support for aten::concat
  • #16151 - conv3d depthwise bug fix
  • #15928 - Expose qnn ops directly from relay.qnn module

TOPI

  • #16259 - Add support for group_conv3d_transpose_ncdhw for generic
  • #16052 - Enhance topi.nn.matmul
  • #16080 - Reduce code redundancy in conv2d weights transformation
  • #16248 - [TOPI] Add support for group_conv1d_transpose_ncw for generic
  • #16106 - [TOPI] Add conv2d NHWC hybrid schedule for arm_cpu

TIR

  • #16239 - [Schedule] TileWithTensorIntrin skip incorrect ComputeInline for input-padding
  • #16236 - ConvertSSA process entry func first
  • #16070 - [Transform] Introduce new InjectPermutedLayout pass
  • #16083 - Enhance Python Type Annotations for TIR Expr
  • #16073 - Support more mma intrinsics and get_mma_intrin_group utility
  • #16076 - Enhance Python Type Annotations for TIR stmt
  • #16074 - Fix the thread binding iter_var dtype in Bind primitive
  • #16063 - Fix pass RenewDefs error in gather/take case
  • #16027 - Fix software pipeline with dynamic loop extent

TVMScript

  • #16271 - Disable concise scoping when the scope stmt is explicitly annotated
  • #16041 - Fix mismatched dtype of IterVar in T.thread_binding
  • #15953 - [TIR] Pretty print TIR LLVM function name
  • #15972 - Delete printing of extra info at parsing

Misc

  • #16279 - Replace deprecated np.int with int to avoid crash
  • #16262 - Update conv2d.py
  • #16255 - [Support] Add Interrupt Handling in Pipe
  • #16104 - [LoopPartition] Fix a bug of LoopPartition in single point scenarios
  • #16231 - [Target] Add Jetson AGX Orin tags
  • #16221 - Remove deprecated np.int in slice converter (PyTorch)
  • #16214 - [Python] Fix setup.py for inplace build
  • #16174 - Bump cryptography from 37.0.2 to 41.0.6 in /docker/python
  • #16202 - Fix IRModule initialization with attrs
  • #16176 - Enable ccache to accelerate contrib compilation
  • #15968 - Add missing backtick
  • #16034 - [Packaging] Include BYOC dynamic libraries into wheel
  • #16087 - Add _ffi_api.py under script folder
  • #16039 - [Target] Support obtaining L2 cache size from target
  • #16065 - [Pylint] fix pylint issues from test_random to test_tedd
  • #16031 - [TRT] fix outdated module building method in tensorrt
  • #16032 - [CMake] Use llvm-config to locate Findzstd.cmake
  • #16023 - [Pylint] fix pylint issues for thrust&tflite_runtime&util
  • #15998 - [Codegen] Add shuffle for cuda and metal
  • #16015 - [Pylint] fix pylint issues for cblas
  • #15955 - [FFI][Python] Handle error propagation when line number is missing
  • #15982 - Bump werkzeug from 2.2.3 to 3.0.1 in /apps/microtvm
  • #15966 - [CMake] Fix order of GNUInstallDirs module
  • #15952 - Update ci_arm Docker tag
  • #15940 - [Minor] Fix compilation warnings for clang
  • #15947 - Bump urllib3 from 1.26.9 to 1.26.18 in /docker/python
  • #15835 - [CodeGenC][Redo] Handle GlobalVar callee as internal function call
  • #15945 - Bump urllib3 from 1.26.15 to 1.26.18 in /apps/microtvm

v0.14.0

6 months ago

Introduction

The TVM community has worked since the v0.13.0 release to deliver the following new exciting improvements! The main tags are below (bold text is with lots of progress):

  • Community, RFC
  • Arith, MetaSchedule
  • Adreno, ArmComputeLibrary, Hexagon, Metal, OpenCL & CLML, ROCm, Vulkan, cuda & cutlass & tensorrt, microNPU, web
  • Runtime, TVMC, AOT, LLVM, microTVM, CMSIS-NN
  • Frontend, Relay, BYOC
  • TOPI, TIR, TVMScript
  • Docs, CI, Docker
  • Misc, BugFix

Please visit the full listing of commits for a complete view: v0.13.0...v0.14.0.

Community

  • #15307 - Qingchao Shen -> Reviewer
  • #15619 - community strategy decision process

RFC


AOT

  • #15301 - Avoid call_extern() with incorrect argument count
  • #15181 - Remove workaround to help resolve test flakiness

Adreno

  • #15830 - Minor changes for Adreno docs and help scripts
  • #15671 - [VM] Fix using buffers for weights in VM
  • #15391 - Small fixes in Adreno schedules

Arith

  • #15881 - Simplify the result of non-divisible floordiv
  • #15665 - Fix detection of non-divisible iteration forms like (x % 255) // 16
  • #15638 - MLIR PresburgerSet compile fix for MLIR >= 160
  • #15628 - Added simplification rule for multiple equality compares
  • #15558 - Fix detection of linear equations with uint var
  • #14690 - Add tvm::arith::PresburgerSetNode to work with Presburger Set in MLIR
  • #15555 - Fix handling of overlapping predicates
  • #15471 - Enhance Canonical Simplify for LE
  • #15228 - Enhance buffer shape bound deduction to include offset

ArmComputeLibrary

  • #15600 - [ACL] Update Compute Library to v23.05.1
  • #15344 - [ACL] Update Compute Library to v23.05

BugFix

  • #15891 - [Relay] Fix axis parsing of the repeat converter in the MXNet frontend
  • #15873 - [Fix] Remove duplicated words from comments, NFC
  • #15868 - [Relay] Fix conv transpose with default strides in ONNX frontend
  • #15773 - [CPP] Fix cpp deploy bug
  • #15778 - [Hotfix] Fix Windows Pipe
  • #15748 - Move symbols that are relevant to the runtime from libtvm to…
  • #15752 - [Relay] Fix the wrong calculation logic of operator flip in the PyTorch frontend
  • #15715 - [Relay] Fix the wrong implementation of operator Threshold in OneFlow
  • #15711 - [Strategy] Fix arm_cpu int8 conv2d strategy for dotprod and i8mm targets
  • #15717 - [Relay] Fix the wrong implementation of Softplus in OneFlow
  • #15677 - [Arith] IterMapRewriter abort rewriting once failure
  • #15629 - [VTA] tvm.tir.Call has no name attribute
  • #15584 - [Relay][Strategy] Enable compile time transformation of weights matrix for arm_cpu NHWC quantized conv2d
  • #15542 - [Fix] Fix the typo in compile flag
  • #15484 - [TOPI] Fix a bug in arm_cpu int8 conv2d i8mm schedule
  • #15473 - [Relay] Fix some bugs of dominator pattern
  • #15478 - [TIR] ThreadSync with shared.dyn awareness
  • #15406 - [TIR] Ensure the Var's scope is correct
  • #15399 - [TIR] Fix multi-grouped multi-warp allreduce
  • #15350 - [Relay] fix a bug of printing dataflow pattern
  • #15385 - Work around "Internal Compiler Error" in MSVC
  • #15294 - [Bug][Relay] fix relay frontend pytorch op addmm bug
  • #15323 - [Fix][TIR] LowerThreadAllreduce with correct thread mask
  • #15291 - [Relay][GraphExecutor] Fix set_input_zero_copy() precision bug
  • #15225 - Fix function to read all file

CI

  • #15903 - [Target] Add LLVM functions for current system info
  • #15897 - [ADRENO] Few updates to Adreno docker setup
  • #15836 - Update ci-gpu image
  • #15668 - Allow Limit CPUs in Docker
  • #15568 - [Testing] Allow Capitalized name in CompareBeforeAfter
  • #15519 - [TEST] Run tests/python/relay/aot tests in ci-cortexm
  • #15485 - Remove cython version pin
  • #15421 - Bump Flax and Jaxlib versions to fix Jaxlib install error
  • #15226 - Add ml_dtypes dependency for all docker images
  • #15353 - Pin cython version to fix cython compilation
  • #15352 - Make Graviton3 default AArch64 job runner node
  • #15339 - Update test to include unique attribute
  • #15277 - [Testing] Return BenchmarkResult in local_run and rpc_run
  • #15268 - [Testing] Add tvm.testing.local_run
  • #15136 - [UnitTest][NVPTX] Avoid cascading failures from CUDA postproc

CMSIS-NN

  • #15747 - Move CMSIS_5 from SHA to release based upgrade
  • #15407 - Support for Softmax Int16 operator

Docker

  • #15799 - Add LLVM 17 to the LLVM install script
  • #15862 - Upgrade oneflow to v0.8.0
  • #15819 - Install oneflow from PyPi
  • #15310 - Update ci-cortexm docker image
  • #15293 - tensorflow_aarch64 package upgrade

Docs

  • #15619 - community strategy decision process
  • #15508 - Add v0.13.0 docs to site
  • #15213 - [#15157][Rust][Doc] Re-enable the Rust documentation build

Frontend

  • #15821 - [TFLite] Support quantized ELU
  • #15844 - [TFLite] Fix test failures caused by div-by-zero
  • #15798 - [TFLite] Support quantized Pow
  • #15829 - [Relay][Keras][Bugfix] Fix the converters of GRU and SimpleRNN about the go_backwards attribute
  • #15838 - Fix unnecessary pylint errors
  • #15802 - [SkipCI][Hotfix][TFLite] Disable test of quantized floor mod
  • #15790 - [TFLite] Support quantized LESS_EQUAL
  • #15775 - [TFLite] Support quantized GREATER_EQUAL
  • #15769 - [TFLite] Support quantized NOT_EQUAL
  • #15768 - [TFLite] Support quantized div
  • #15746 - [TFLite] Support quantized LESS
  • #15733 - [TFLite] Support quantized floor_mod
  • #15724 - [TFLite] Support quantized floor_div
  • #15602 - [ONNX][BugFix] Support If body with free variable from graph input
  • #15472 - [Relay][TFLite] Fix in qnn.conv2d when parameter groups not equal to 1
  • #15117 - [TFLITE] Add support for TFLite's regular NMS operator
  • #15415 - [ONNX] add onnx Mish operator
  • #15422 - [Keras] Add support for swish activation
  • #15370 - [Relay][Pytorch] Add aten::view_as
  • #15335 - [Bugfix][Keras] Add a check to reject invalid input shapes
  • #15334 - [Bugfix][Relay][Keras] Add an assertion to reject an invalid value for attribute units in RNN layers
  • #15337 - [Bugfix][Keras] Fix a corner case bug in the softmax converter of the Keras frontend
  • #15259 - [TFLITE][BugFix] Fix variable typo in batchmatmul converting func
  • #15261 - [bugfix][keras] Fix go_backwards attribute of LSTM in keras frontend

Hexagon

  • #15788 - Properly handle RPC server shutdown
  • #15599 - F2qi avgpool bug fix
  • #15414 - Add default vtcm capacity for targets
  • #15367 - Simplify Mul->Sub->Conv to Conv->Add when possible
  • #15258 - Propagate QNN Concat Quantization Params to Inputs

LLVM

  • #15921 - Fix for llvm CodeGenOpt API change

MetaSchedule

  • #15792 - Allow generating uint random data
  • #15574 - Fix metaschedule flop estimation for non-integer loop dimensions
  • #15532 - Enable subprocess to stdout for DEBUG level
  • #15437 - Fix mma default rule and disable tuning abort
  • #15133 - [XGBoost,MetaSchedule] Support xgb set tree method

Metal

  • #15756 - [Unittest] Add minimal Metal functionality test to CI
  • #15749 - [UnitTest] Parametrize allreduce GPU tests
  • #15401 - [Codegen] Support Metal warp-level primitive

OpenCL & CLML

  • #15745 - [OpenCL] Don't initialize OpenCL runtime on host
  • #15400 - [VM][OpenCL] Introduce textures allocation to VM memory manager

ROCm

  • #15777 - [Codegen] Mismatched Dtype of Workgroup/Workitem
  • #15464 - fma intrin
  • #15454 - Fix some ROCm codegen bugs

Relay

  • #15889 - Fix the conflicting documentation description
  • #15648 - [TOPI] Remove input padding for arm_cpu conv2d int8 native schedule in Legalize pass
  • #15386 - Fix an adaptive_max_pool1d operator conversion bug
  • #15533 - Disable exception for ADT in mixed precision pass
  • #15506 - [Strategy] Use x86 pool schedules for arm_cpu
  • #15470 - [Strategy] Use x86 dense schedules for arm_cpu
  • #15392 - add redirecting operation to dataflow pattern graph
  • #15468 - [Strategy] Fix arm_cpu int8 conv2d schedule selection for 32-bit targets
  • #15461 - Stop ToMixedPrecision when constant is out of dtype range
  • #15362 - improve SimplifyClipAndConsecutiveCast pass
  • #15137 - Introduce arguments limit to FuseOps pass
  • #15211 - Fix bug in MergeCompilerRegions pass
  • #15237 - ExprMutator returns the original Expr when no fields are changed
  • #15235 - [QNN] Support Dequantize to "float16" and Quantize to "uint16"

Runtime

  • #15693 - Make CSourceModule and StaticLibraryModule Binary Serializable
  • #15658 - Make export_library parameters after file_name keyword-only
  • #15637 - [Backport] Fix ICE from Clang
  • #15244 - Serialization/Deserialization of runtime module
  • #15630 - Utils to Stringify Device
  • #15623 - Expose ModuleGetFunction as PackedFunc
  • #15595 - Enhance PackedFunc Metaprogramming with PackArgs
  • #15543 - [Minor] Suppress verbose logging in Metal device API
  • #15305 - Flush L2 cache in time eval
  • #15332 - Device API to query L2 cache size

TIR

  • #15913 - Fix offset_factor in cuda tensor core intrins
  • #15906 - Fix the error example in the documentation for pad_einsum
  • #15816 - Revert "[TensorIR][Visitor] Visit buffer members in match_buffer's in block visitor functions (#15153)"
  • #15763 - Do not drop 4th argument to tir.max
  • #15646 - Output DeclBuffer in LowerThreadAllreduce
  • #15493 - Output DeclBuffer in SplitHostDevice
  • #15517 - Shuffle in PointerValueTypeRewrite for scalar reads
  • #15263 - Output DeclBuffer in MakePackedAPI
  • #15465 - [TIR, Schedule] Fix decompose reduction with thread binding loops
  • #15432 - Generalize implementation of T.macro to work with other dialects
  • #15413 - Fix Primitive Rfactor DType
  • #15404 - Allow starred expressions in TIR script
  • #15374 - Finer predicate handling in cross-thread reduction
  • #15373 - Allreduce broadcast result to each thread in multi-warp case
  • #15214 - [UX] Implement privacy annotations in TIR
  • #15241 - Return error code from kernels in SplitHostDevice
  • #15327 - ThreadAllreduce warp-level primitive support with multi-warp
  • #15260 - Implement TIR macros
  • #15253 - Call TVMBackendFreeWorkspace inside LetStmt
  • #15264 - Allow symbolic bounds in IndexMap analysis
  • #15243 - Output DeclBuffer in LowerTVMBuiltin
  • #15236 - [Schedule] Scoped CacheRead/Write producing compact region
  • #15242 - Preserve AllocateNode::annotations
  • #15247 - Allow VerifyWellFormed to accept IRModule
  • #15192 - Support cross-thread reduction lowering with thread-broadcasting rewrite
  • #15210 - [Schedule] Derive Nonnegative Bounds from Shape Var
  • #15207 - [Transform] Add LiftThreadBinding Pass

TOPI

  • #15685 - [Target] Use LLVM for x86 CPU feature lookup
  • #15710 - Ensure vectorization of input padding in arm_cpu int8 conv2d interleaved schedule
  • #15513 - check empty array of x86 injective's iters
  • #15371 - Revert "Add arm_cpu specific pooling schedules"
  • #15311 - Add arm_cpu specific pooling schedules
  • #15286 - Revert "Add arm_cpu specific pooling schedules"
  • #14855 - Add arm_cpu specific pooling schedules

TVMC

  • #15779 - enable dumping imported modules too
  • #15349 - Add tvmc flag to print compilation time per pass

TVMScript

  • #15824 - Preserve traceback across TVMScript parsing
  • #15762 - Use environment variable TVM_BLACK_FORMAT for .show()
  • #15706 - Disable black_format by default
  • #15705 - [FIX] Disable show_object_address in printing by default
  • #15579 - Optionally output the address as part of variable names
  • #15564 - Use triple-quoted python strings for metadata
  • #15547 - Create loop var with min_val dtype in for frame
  • #15492 - Allow use of Python builtins in script
  • #15442 - Support starred indices in for-loop
  • #15249 - Ensure completed root block has no read/write
  • #15239 - Handle parsing of PrimFunc calls with non-void return

cuda & cutlass & tensorrt

  • #15573 - [CUTLASS][Cherry-pick] Introduce several features of cutlass profiler
  • #15480 - [Bugfix][CUTLASS] CUTLASS path finding

microNPU

  • #15780 - [microNPU][ETHOSU] MatMul legalization support
  • #15428 - [microNPU][ETHOSU] Fix concatenation with reused buffers
  • #14909 - [ETHOSU][MicroNPU][Pass] Add a pass to replicate pads
  • #15186 - [microNPU][ETHOSU] Add Vela's logic to select configuration block

microTVM

  • #15667 - Check the output of microNPU demos in CI

web

  • #15218 - Increase default EMCC compilation total memory size

Misc

  • #15934 - [Release] [Dont Squash] Modify version number to 0.14.0 and 0.15.0.dev on main branch
  • #15847 - [release] Update version to 0.14.0 and 0.15.0.dev on main branch
  • #15867 - Bump pillow from 9.3.0 to 10.0.1 in /apps/microtvm/ethosu
  • #15866 - Bump pillow from 9.3.0 to 10.0.1 in /apps/microtvm/cmsisnn
  • #15865 - Bump pillow from 9.2.0 to 10.0.1 in /apps/microtvm
  • #15833 - [VM] Memory Manager moved up to runtime
  • #15859 - [Script] Fix miscs of make_notes.py
  • #15818 - [CLI TOOLS][RTVM] Improve rtvm tool with new options to measure native performance
  • #15761 - [Target] LLVM helper functions for any target info
  • #15672 - [IR] Implemented Variant<...> container
  • #15714 - [Target][Device] Auto detect target and create device from str in torch style
  • #15723 - fix _convert_simple_rnn
  • #15725 - Revert "[CodeGenC] Handle GlobalVar callee as internal function call"
  • #15684 - [Hopper TMA] Add intrinsic to create barriers for synchronization
  • #15683 - Fix a bug caused by PyTorch instance_norm when the input shape is [1,1,1,2]
  • #15596 - [FFI] Propagate Python errors across FFI boundaries
  • #15666 - [Module] Implement custom imported modules serialization
  • #15656 - [Hopper TMA] Add CUDA codegen support for bulk asynchronous copy
  • #15664 - [IR] Use structural equal for Range equality
  • #15649 - Add output_data_sec section in corstone300.ld
  • #15639 - Do not link LLVM libraries into cpptest binary
  • #15631 - [RPC] Enhance RPC Protocol to support TVM Object
  • #15624 - [CMake] Add RCCL to TVM and TVM Runtime
  • #15616 - [Hopper TMA] CUDA codegen for async copy with barrier synchronization
  • #15537 - [CPP_RPC] export listdir for RPC
  • #15605 - [CMake] Add NCCL to TVM and TVM Runtime
  • #15580 - Fix "to" duplicate word in python and C header file
  • #15581 - Remove duplicate load word inside .cc file
  • #15582 - Remove duplicate 'from' word inside python script
  • #15554 - Bump tornado from 6.1 to 6.3.3 in /apps/microtvm
  • #15552 - Bump tornado from 6.1 to 6.3.3 in /apps/microtvm/ethosu
  • #15553 - Bump tornado from 6.1 to 6.3.3 in /apps/microtvm/cmsisnn
  • #15536 - fixed typo [TypoFix]
  • #15529 - [quantize] fix bug of annotate for output of add op
  • #15535 - Fixed search task comment
  • #15530 - Remove duplicate msg word and condition inside the function doc
  • #15511 - Remove IRModule Dependency from Target
  • #15525 - Fix typo mistake and change whethe to whether
  • #15524 - Remove duplicate the word
  • #15103 - [CodeGenC] Handle GlobalVar callee as internal function call
  • #15419 - [VM][Textures] Enable OpenCL textures for VM
  • #15483 - [Script] Be more careful when generating ast.ExtSlice for Subscript
  • #15469 - [CYTHON] Make cython compatible with 3.0
  • #15423 - [Submodule] Add Flash attention v2
  • #15380 - [Target] Add Jetson Orin Nano tag
  • #15359 - [CMAKE] Conditionally link "clog" in NNPack install
  • #15326 - [OP] Add rms_norm into TOPI
  • #15312 - [skipci] Fix typo in docs/arch/index.rst
  • #15298 - [Release] Extend PR tags and Format PR hyper-links in release report
  • #15328 - [Package] Remove cutlass media/docs inside cutlass_fpA_intB_gemm
  • #15321 - [JVM] Fix the Maven pom.xml for OS X arm64 tvm4j build
  • #15265 - Fix keras version problem
  • #15292 - [RPC] Fix socket bind errno on corner case
  • #15287 - [Exec] Add a script to test GPU memory bandwidth
  • #15234 - [Miscs] Enhance script about make release notes
  • #15229 - [CMAKE] Add Vulkan header for Android
  • #15215 - [Android] ndk static build
  • #15208 - Update version to 0.14.dev0 on main branch
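
The torch-style helpers referenced in #15714 above are meant to let a device be built from a single string and a full Target be auto-detected from it. A minimal sketch, assuming `tvm.device` accepts a "type:index" string and that `Target.from_device` is the detection entry point (both need a CUDA device present at runtime):

```python
import tvm
from tvm.target import Target

# Torch-style device construction from a single "type:index" string
# (assumed to be the behaviour #15714 enables).
dev = tvm.device("cuda:0")

# Auto-detect target attributes (arch, max threads per block, ...) from
# the live device; Target.from_device is assumed here.
target = Target.from_device(dev)
print(target)
```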

v0.13.0

9 months ago

Introduction

The TVM community has worked since the v0.12.0 release to deliver the following new exciting improvements! The main tags are below (bold text is with lots of progress):

  • Community, RFC;
  • Frontend: TensorFlow/TFLite, PyTorch/Torch, Paddle, Keras;
  • Runtime: Adreno, OpenCL & CLML, ROCm, CUDA & CUTLASS & TensorRT, Ethosn, Vulkan, Hexagon, Metal, others about runtime;
  • Relay, BYOC, TOPI, Arith, TIR, TVMScript, MetaSchedule;
  • microTVM, AOT, TVMC, LLVM;
  • CI, BugFix, Docs, Docker, Misc;

Please visit the full listing of commits for a complete view: v0.12.0...v0.13.0.

Community

  • #15086 - Aleksei-grovety -> Reviewer
  • #14676 - Jiajun Jiang -> Reviewer
  • #14677 - Qiang Zhang -> Reviewer
  • #14622 - Sunghyun Park -> Reviewer
  • #14578 - Zihao Ye -> Committer
  • #14853 - Anirudh Sundar Subramaniam -> Committer
  • #14772 - Add new key for release signing

RFC


Frontend

  • #14830 - Use f-strings for string formatting, NFC
  • Keras
    • #15122 - [Relay][Keras] Fix SeparableConv2D conversion in dilation_rate attribute
    • #15107 - [Relay][Keras] Fix a wrong variable name in keras frontend
    • #15053 - [Relay][Keras] Fix the wrong implementation logic about cropping2D
    • #15082 - [Relay][Keras] Fix UpSampling2D about the wrong assertion about size
    • #15060 - [Relay][keras] Fix the bug about the attribute 'output_padding' in Deconv
    • #14707 - [Keras]fix a bug about alpha attribute in LeakyReLU which lead to passes conflict
    • #15175 - [Relay][Keras] Fix concatenate convert function in axis parsing
  • Paddle
    • #14801 - [Paddle] [PaddlePaddle Hackathon 4]add attribute support for gaussian_random/softplus/Conv3d/Conv2d
    • #14973 - [Paddle] [PaddlePaddle Hackathon 4] add convert support for tanhshrink/pool3d/set_value ops for paddle frontend
    • #14826 - [Paddle] [PaddlePaddle Hackathon 4] add convert support for p_norm/roi_align/softmax_with_cross_entropy
    • #14575 - [Paddle] [PaddlePaddle Hackathon 4]add attribute support for dropout/hard_sigmoid/pixel_shuffle
  • TFLite
    • #14667 - [TFLite]Support for quantized squared difference
    • #14819 - [TFLite]Generate name when tensor name is missing
    • #15173 - [FRONTEND][TFLITE]Fix int16 transpose conv loading
  • TensorFlow
    • #14546 - [Tensorflow] Fix conv2d_transpose for NHWC layout
  • PyTorch
    • #14747 - [PyTorch] Add aten::new_zeros
    • #14699 - [Torch] fix typo in new_full
    • #14963 - [PyTorch] Support use_input_stats in instance_norm
    • #14930 - Fix pytorch axis
  • ONNX
    • #15017 - [ONNX] Fix bug in scatter_elements

Runtime

  • #15182 - Add weak symbol to builtin fp16
  • #15161 - Clean TVM stacktrace in error messages
  • #15162 - Support void as dtype in FFI
  • #14902 - Update Module and Registry to use String Container
  • #14967 - [Runtime,RPC] Use f-strings for string formatting, NFC
  • #14887 - Make systemlib unique per prefix
  • #14775 - Added str for tvm._ffi.runtime_ctypes.TVMArray
  • #14656 - Fix Can't "query_imports" Bug of VM Executable

Adreno

  • #15061 - [TOPI]Fix problem with ceil_log2
  • #14996 - [OpenCL]Fix conv2d when output channels < 4

CMSIS-NN

  • #15059 - Update CMSIS-NN release to v4.1.0

OpenCL & CLML

  • #14972 - [OPENCL] Always use convert_T for type conversion
  • #14995 - [OpenCL] Improve diagnostic message
  • #14833 - [Codegen][OpenCL] fix ambiguous selection operator call
  • #14792 - [OpenCL] Refactor OpenCL runtime to support SPIRV binary ingestion
  • #14922 - [OpenCLML] Refactor and introduce on-chip memory and memory planner
  • #14949 - [CodegenC] Updated unit test for sorted CodegenC output
  • #14767 - [OpenCLML] Transposed convolution support and other fixes

cuda & cutlass & tensorrt

  • #14751 - [CUDA] Fixed the call of the min function in the schedule for cuda
  • #14798 - [CUTLASS] Add NDEBUG option to CUTLASS compile to speed up attention kernel
  • #14782 - [Bugfix][Codegen][CUDA] Wrong casting in ASM

metal

  • #14962 - Fix int8 vectorized cast
  • #14846 - Fix vectorized select
  • #14727 - Update metal runtime to directly store kernel map
  • #14671 - Fix flaky memory issue due to racing

Vulkan

  • #15035 - [Vulkan] Allow DeclBuffer in CodeGenSPIRV
  • #14817 - [Vulkan] Add cooperative matrix support

Hexagon

  • #14997 - Remove "c" as aot_host_target tvm/contrib/hexagon/pytest_pl…
  • #14948 - Update instructions to compile hexagon runtime
  • #14965 - Add support for v73, make v68 default
  • #14720 - [TIR] Add get_vtcm_allocation_sizes with lowering
  • #14567 - [TIR] Use the "target" value in T.func_attr for VTCM limit

ROCm

  • #15106 - [TensorIR]AMD Matrix Core Support
  • #15088 - [Target]Replace rocm arch parsing from int to string

microTVM

  • #14872 - Use self.close_transport() on error

AOT

  • #15033 - Avoid Var-to-Var Let binding in AOTExecutorCodegen
  • #15032 - Remove duplication in tvm.testing.aot.compile_models
  • #14529 - Fix warning on dropping const in TVMAotExecutor_GetInputName

microNPU

  • #15159 - [microNPU][ETHOSU] Fix compiler attributes types
  • #15147 - [microNPU][ETHOSU] Add option to disable copying constants for case without cascader
  • #15069 - [microNPU][ETHOSU] Fix SoftMax legalization parameters
  • #15115 - [microNPU][ETHOSU] Upgrade to 23.05 version of Arm(R) Ethos(TM)-U NPU drivers
  • #15114 - [microNPU] Upgrade Vela to v3.8.0
  • #15104 - [microNPU][ETHOSU] Fix minimum buffer size
  • #15063 - [microNPU][ETHOSU] Fix CopyComputeReordering pass arguments
  • #14861 - [microNPU][ETHOSU] Add offloading to the NPU the nn.avg_pool2d operator with a stride > 3
  • #14765 - [microNPU][ETHOSU] Channel pad offloaded to NPU
  • #14774 - [microNPU][ETHOSU] Fix Softmax quantization parameters
  • #14629 - [microNPU][ETHOSU] Softmax int8 legalization support
  • #14353 - [microNPU] Add support for MEAN with uint8 ifm
  • #14587 - [microNPU] Fix skip tests when Vela is not present
  • #14464 - [microNPU][ETHOSU] Add restrictions to convert to NHCWB16 layout in LayoutOptimization pass

BYOC

  • #15046 - Add GEMM kernel from FasterTransformer as submodule
  • #15029 - Hide internal cutlass symbols

Relay

  • #15068 - Improve the "clip" op optimization in simplify expr pass
  • #14925 - add a dimension check to reject invalid input
  • #14858 - [simplify_expr]: Add pass to remove trivial transpose ops
  • #14838 - Use f-strings for string formatting, NFC
  • #14831 - [Relay/Op] Use f-strings for string formatting, NFC
  • #14580 - Simplify the square of a binomial
  • #14735 - Handle pad value coming from Tensor instead of scalar
  • #14601 - Enhance type infer for dynamic shape
  • #14885 - [Relay] fix broadcast in PyTorch frontend
  • #15090 - [Relay] Insertion of "device_copy" CallNode to Resolve Device Conflict on Unconstrained Nodes
  • #14845 - [Relay] Fix softplus in paddlepaddle frontend
  • #14837 - [Relay] Fix AdaptiveAvgPool2d about wrong dtype parsing
  • #14821 - [Relay] Fix softplus about the wrong calculation formula in Relay PyTorch frontend
  • #14820 - [Relay] Fix threshold calculation logic in PyTorch frontend
  • #14824 - [Relay] fix a bug about ReLu in the threshold attribute which causes different results with keras
  • #14796 - [relay] fix wrong calculate logic about celu
  • #14773 - [Relay] fix scatter_nd type relation
  • #14742 - [relay] Fix alpha attribute with None in ELU
  • #14740 - [Relay] Fix stride in LpPool for default
  • #14556 - [Relay] fix a bug caused by IncompleteTypeNode in EinsumRel while doing MergeComposite
  • #15057 - [QNN] Implement quantized avg_pool2d
  • #14536 - [QNN] Implement 'qnn.softmax'
  • #14875 - [Quantization]: Update simulated_quantize to infer correct layout

TOPI

  • #15018 - Fix dynamic dimensions support for Dense on TOPI side
  • #14856 - Fix in interpretation of empty axis parameter in reduction fun…
  • #14483 - [Target] Add SVE specific convolution
  • #14839 - Use f-strings for string formatting, NFC
  • #14822 - Use f-strings for string formatting, NFC
  • #14519 - Vectorize depthwise conv2d output operator
  • #14549 - remove the i32 cast for output shape of pool
  • #14566 - [Topi] Output strides in pack_buffer() utility

Arith

  • #15131 - Hotfix flaky test in padded matmul
  • #15120 - NormalizeToIterSum
  • #15081 - Improve arith simplify to handle symbolic reshape pattern
  • #14532 - Implement statistics counters for RewriteSimplifier
  • #14704 - [cherry-pick][BUGFIX] Fix a bug of iter map floormod(x,2) simplify
  • #14849 - [TVMScript] Capture fails if var appears only in annotation
  • #14596 - [TensorIR] Improve CompactBufferRegion for symbolic shape
  • #15129 - [TIR] Recognize empty extents
  • #14982 - [TIR][VTA] Update host-side target, even without device func
  • #14547 - Enhance IterMapSimplify for symbolic
  • #14571 - [BUGFIX] Fix a bug of iter map floormod(x,2) simplify
  • #14582 - Fix solve inequality of unbound var ranges
  • #14538 - Enhance CanonicalSimplify to Simplify ProdDiv

MetaSchedule

  • #14781 - [MetaSchedule] RPC port needs to be an integer
  • #14673 - Introduce MMA Tensor Core Multilevel Tiling
  • #14784 - Enhance tune_tir to tune IRModule of TIR Collections (see the sketch after this list)
  • #14783 - Add an API to dump a pruned database
  • #14785 - Clear screen only when specified
  • #14654 - Handle output cases for InlineConstantScalars
  • #14642 - PostProc not rewriting unroll for purely spatial block
  • #14591 - Handle cases when no features found by FeatureExtractor
  • #14584 - [ARM] Beautification of the function names
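
The tune_tir enhancement above (#14784) tunes a single TIR function or a whole IRModule of TIR functions end to end. A minimal sketch of the v0.13-era flow; the workload, trial budget, and work directory are illustrative, and the entry points are assumed to match this release:

```python
import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle):
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

target = tvm.target.Target("llvm -num-cores 4")
# Search for schedules and record them in a database, then pick the best.
database = ms.tune_tir(mod=matmul, target=target,
                       work_dir="./tune_tmp", max_trials_global=64)
sch = ms.tir_integration.compile_tir(database, matmul, target)
if sch is not None:  # None if no valid record was found
    print(sch.mod.script())
```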

TIR

  • #15153 - [TensorIR][Visitor] Visit buffer members in match_buffer's in block visitor functions
  • #15168 - [Schedule] Support padding-by-factor in PadEinsum
  • #15165 - Expose UndefinedVars to Python
  • #15163 - Fix RenewDef for symbolic input shapes
  • #15142 - [Schedule] Enhance compute-inline for fusion
  • #15150 - Fix typo in code example
  • #15144 - [TensorIR][Schedule] New schedule primitive unsafe_hide_buffer_access
  • #15146 - Block dependence analysis without schedules
  • #15119 - Avoid duplicate GlobalVar names in SplitHostDevice
  • #15037 - Handle DeclBuffer in CacheReadWrite schedule primitive
  • #15098 - [Ethos-U]Handle DeclBuffer in Ethos-U inputs
  • #15044 - [USMP] Preserve DeclBuffer in PoolAllocationToOffsetConverter
  • #15078 - Handle DeclBuffer in LowerThreadAllreduce
  • #15094 - Handle DeclBuffer in MergeDynamicSharedMemoryAllocations
  • #15093 - Handle DeclBuffer in StorageAccessInfoLower
  • #15045 - Handle DeclBuffer in InjectDoubleBuffer
  • #15096 - Handle DeclBuffer in RemoveNoOp
  • #15076 - [CodeGen] Define PackedFunc error code in MakePackedAPI
  • #15102 - Update primfunc host attachment to include host
  • #14854 - [Compute-at] Enable complex floordiv/floormod expressions in compute_at
  • #15041 - Handle DeclBuffer in LowerCustomDatatypes
  • #15038 - Handle DeclBuffer in Inline/ComputeAt/ReverseComputeAt
  • #15052 - [Analysis] Handle DeclBuffer in FlopEstimator
  • #15051 - Handle DeclBuffer in StorageRewrite
  • #15050 - [Schedule] Fix decompose_padding bug with dtypes
  • #15034 - Refactor BlockScope outside schedule
  • #15054 - Handle DeclBuffer in IRSubstitute
  • #14986 - Move SplitHostDevice to before MakePackedAPI
  • #15042 - Handle DeclBuffer in StorageFlatten's input
  • #15040 - Preserve object equality in Buffer::GetFlattenedBuffer
  • #14693 - Enhance TVMScript Buffer Slice Access
  • #14988 - Handle callees on same target, different codegen
  • #14951 - Keep trivial LetStmt in tir.Simplify when used in buffer decl
  • #14944 - Restrict tir.transform.LowerTVMBuiltin to host functions
  • #14990 - [IR,TE,TIR] Use f-strings for string formatting, NFC
  • #14993 - Fix incorrect construction of block frames
  • #14952 - Avoid re-defining var = arg_var in ArgBinder
  • #14918 - SplitHostDevice, handle subroutines
  • #14943 - Restrict tir.transform.InstallDebugSpans to host functions
  • #14942 - Preserve existing kTarget function attribute in BindTarget
  • #14945 - Restrict tir.transform.CombineContextCall to host functions
  • #14914 - Handle subroutine calls in MakeUnpackedAPI
  • #14913 - Handle subroutine calls in MakePackedAPI
  • #14892 - Expand unit tests for ConvertSSA
  • #14866 - Avoid too complex predicate in compaction
  • #14766 - [Schedule] Improve blockize to support blockizing multiple blocks
  • #14776 - Improved parameter name in DLTensor unpacking error messages
  • #14562 - [Driver] Move ShouldAnnotateEntryFunc logic into transform
  • #14741 - Keep block annotations from tensorization
  • #14021 - More flexible buffer compaction
  • #14711 - [Analysis] Calculate allocated memory at module level
  • #14492 - Flatten SeqStmt on construction
  • #14598 - Add CUDA int4 tensor core intrinsics
  • #14593 - [Schedule] Method returning the function being worked on
  • #14592 - [TensorIR] Fix ComputeAt with perfect symbolic bound
  • #14491 - Use String instead of StringImm for AttrStmtNode::node
  • #14626 - [TensorIR]reindex_cache_write do not mutate init statement
  • #14588 - [Fix][TIR] UnifyThreadBinding creating unit loop with annotation
  • #14589 - [Fix][TIR][Analysis] Reduction block checking alloc_buffers

TVMScript

  • #15083 - Avoid visiting repetition tensor in SetCommonPrefix Visitor
  • #15091 - [TIR]Convert tir.op operands to PrimExpr
  • #14919 - [TIR] Parse subroutine calls with no arguments
  • #14941 - Prevent bool to int conversion in T.Assert condition
  • #14915 - Allow T.target("device", host="host") to specify host (see the sketch after this list)
  • #14900 - Round-trip DeclBuffer with undefined data pointer
  • #14889 - [TIR]Added format/parsing of subroutine calls
  • #14874 - Use default fallback for un-registered type
  • #14840 - Print Executor, Runtime, and FunctionInfo as metadata
  • #14812 - Handle AllocatedPoolInfo, ConstantPoolInfo, ConstantInfo
  • #14786 - Add __name__ attr for parsed PrimFunc and IRModule
  • #14531 - Preserve LetStmt of constants
  • #14488 - Distinguish between void* and handle
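
The `T.target("device", host="host")` form above (#14915) lets a TVMScript function carry both its device target and the host used for the calling shell. A hedged sketch; the kernel body and target strings are illustrative:

```python
from tvm.script import tir as T

@T.prim_func
def add_one(A: T.Buffer((8,), "float32"), B: T.Buffer((8,), "float32")):
    # Attach a CUDA target whose host side is compiled with LLVM.
    T.func_attr({"target": T.target("cuda", host="llvm")})
    for i in range(8):
        with T.block("B"):
            vi = T.axis.spatial(8, i)
            B[vi] = A[vi] + T.float32(1)
```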

TVMC

  • #14994 - [Bugfix]Fix tvmc option for printing which operators are offloaded to the Ethos-U

LLVM

  • #15127 - Remove the "ret_void" argument of AddFunction
  • #15139 - Minor refactor to LLVMModuleNode::SaveToFile
  • #14958 - [Codegen]Allow void return type from PackedFunc
  • #14946 - Expose Host CPU Feature Detection
  • #14901 - Codegen subroutine call when CallNode::op is GlobalVar
  • #14570 - Use Var annotation in LetStmt for pointer type
  • #14843 - [RUNTIME] Enable multi systemlib with device code
  • #14564 - Validate generated LLVM module before optimization
  • #14568 - Expand tvm::Type to DWARF conversion
  • #14563 - [Codegen]Remove cast to i8* in builtin::address_of

BugFix

  • #14960 - [Bug] Add typing_extensions requirement again
  • #15015 - [Hotfix] Remove LOG(INFO) from unsupported dtype legalization pass
  • #14991 - Make ThreadAllReduce pass compatible with int64
  • #14950 - Avoid symbol conflicts in MakePackedAPI/MakeUnpackedAPI
  • #14903 - [Test Cases]Add some version check to make test cases run in all PyTorch versions
  • #14890 - [Fix] Fix typo in error message
  • #14879 - fix the undeclared identifier 'f'
  • #14857 - Fix batch_norm
  • #14787 - [FIX] fix typo in comment

CI

  • #15179 - [Testing] Utility method to run TVM on remote device
  • #15138 - [Test] Improve check for TVMError exception in test_cast
  • #15062 - Clone submodule recursively
  • #15065 - Revert "Make Graviton3 default AArch64 job runner node (#14983)"
  • #14983 - Make Graviton3 default AArch64 job runner node
  • #15056 - [Bugfix]Fix CacheControl version constraint violation
  • #14908 - Update the expected CI jobs list in the update_branch script
  • #14847 - Update CPU image to install PyTorch
  • #14808 - [Testing] Use TVMScript's "name" argument for error messages
  • #14780 - fix doc deploy issue
  • #14651 - Modify test cases to accommodate the CI upgrades
  • #14666 - sccache support while using ci.py under multi user environments
  • #14635 - Upgrade CI
  • #14713 - Add PLATFORM env var to builds
  • #14680 - Downgrade ci_cpu llvm version back to 11
  • #14653 - [tests][scripts][release] Optimize release note script about categories etc
  • #14646 - [test][script] Fix release gather_pr.py of script about ghost users or blank PR nodes
  • #14550 - Add JAX deps in Dockerfiles
  • #14466 - Update ci_cpu image and build with llvm-15

Docker

  • #15149 - Fix build.sh environment variables
  • #15105 - Update docker images for llvm-16
  • #15092 - Update ci-cortexm docker image to contain CMSIS-NN release v…
  • #15095 - Add build.sh environment variables
  • #15067 - Migrate arm docker image to use llvm packages
  • #15031 - Update ci_cpu docker image to one containing polly package f…
  • #15003 - [ADRENO] Docker setup changes for multi user environments
  • #14912 - Add polly package
  • #14842 - Install PyTorch on cpu image
  • #14590 - Support rootless docker when using docker/bash.sh

Docs

  • #15126 - [DOC] Add RPC System Setup Document
  • #15071 - Updated the copyright year from 2020 to 2023
  • #15055 - [DOC][TUTORIAL] Fix typo for the 'Making your Hardware Accelerator TVM-ready with UMA'
  • #14504 - [TensorIR][Doc] Docstring of reorder_block_iter_var
  • #14611 - [TIR] Fix unsafe_set_dtype docstring
  • #14585 - Fix typo in the Vitis AI Integration docs

Misc

  • #15267 - [release] Disable git merge to avoid conflict
  • #15187 - [RPC] Report RPC Session Timeout to Client Instead of "kShutdown"
  • #15185 - Update tvm_runtime.h
  • #15164 - [CMake] Support LLVM-16 static linking
  • #15167 - [Python] Enhance Wheel Packaging
  • #15166 - [Target] Add MetaSchedule-compatible attributes to OpenCL
  • #15154 - [Minor] Fix Compilation Warnings
  • #15132 - [NDArray] Allow creating a view from a strided array
  • #15116 - [RPC] Add Missing Option "port_end" to RPC Proxy
  • #15073 - [CodeGenC] Use PrimFuncNode::ret_type in function signature
  • #15036 - [StackVM] Updated CodeGenStackVM to handle DeclBuffer
  • #15022 - [Build] Fix missing virtual destructor in SIBuilder
  • #15016 - Fix type parse error about AdaptiveMaxPool
  • #15007 - [Minor] Fix compilation warnings
  • #15000 - [CMAKE] Introduce dummy build as an option
  • #14863 - [DataType] Initial support of fp8 (e4m3/e5m2)
  • #14975 - [CMAKE] Add a dummy target to defer libtvm dep
  • #14574 - [IR][SIBuilder]
  • #14939 - [Target] Add target to all TVM callbacks
  • #14937 - [BUILD] Enable log before throw message in windows
  • #14934 - [TestCases] fix unreachable test cases due to outside the for-loop
  • #14916 - [TypoFix] fix some typo problem in keras frontend
  • #14893 - [Contrib] Use f-strings for string formatting, NFC
  • #14884 - [AutoTVM] Use f-strings for string formatting, NFC
  • #14876 - [CONTRIB] Enable create_staticlib to take in tar files
  • #14867 - Fix f-string typo
  • #14851 - Add v0.12.0 docs
  • #14813 - [BUILD] Removed the duplicated MACROs in config.cmake
  • #14743 - [SUPPORT] Fix RingBuffer ReadWithCallback
  • #14799 - [LINT] Fix clang-format script for newest clang-format
  • #14797 - [NDArray] Allow arbitrary stride when the corresponding shape is 1
  • #14790 - More clear ref of thirdparty license
  • #14779 - fix: use arm on demand instead of spot
  • #14762 - [Target][Minor] Add A6000 Target Tag
  • #14683 - [AutoTVM] Added Droplet algorithm in TVM
  • #14694 - unify search path approach to various libs
  • #14686 - [CMAKE] Update search pattern of config
  • #14636 - Fix bug about wrong attribute name
  • #14628 - [CODEGEN] Fix metal codegen when with only single working dim
  • #14607 - fix: deploy ci
  • #14569 - [Node] Allow alternative root names in ObjectPath::Root()
  • #14522 - [Object] Implemented .as<T> for ObjectRef param, returns Optional<T>
  • #14477 - feat: use spot instances for ci with on demand as a backup
  • #14468 - [AutoTVM] New rank-binary loss_type for the new xgboost >= 2.0.0 behaviour
  • #14544 - Update to v0.13.dev0
  • #14539 - [Target] Add Apple M1 GPU tag with 256-thread restriction

v0.13.0.rc0

9 months ago

v0.12.0

11 months ago

v0.11.1

1 year ago

Introduction

This is a v0.11.1 bug fix release on top of v0.11.0 (see https://github.com/apache/tvm/issues/13899), incorporating a fix to the Python dependencies description.

What's Changed

Python dependencies

v0.11.0

1 year ago

Introduction

The TVM community has worked since the v0.10.0 release to deliver the following new exciting improvements!

  • Metaschedule
    • Tuning API improvements and anchor-block tuning
  • TVMScript metaprogramming
    • Lots of progress with TVMScript, with the introduction of a core parser, AST, Evaluator, Source and diagnostics (a minimal example follows below)
And many other general improvements to microTVM, code quality, CI, frontends, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.10.0...v0.11.0.
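
As a small taste of the metaprogramming work, here is a TVMScript module that the new parser ingests and the printer emits back; a sketch only, with arbitrary shapes:

```python
import tvm
from tvm.script import tir as T

@tvm.script.ir_module
class Module:
    @T.prim_func
    def main(a: T.handle, b: T.handle):
        A = T.match_buffer(a, (16,), "float32")
        B = T.match_buffer(b, (16,), "float32")
        for i in range(16):
            with T.block("B"):
                vi = T.axis.spatial(16, i)
                B[vi] = A[vi] * T.float32(2)

# Parsed into an IRModule by the new parser; printed back as TVMScript.
print(Module.script())
```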

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list is not comprehensive of all PRs and discussions since v0.10. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.10.0...v0.11.0.

Adreno

  • [Adreno] Add global pooling schedule (#13573)
  • [Adreno] Add documentation for Adreno deployment (#13393)
  • [Adreno] Fix mem_scope annotations for prim funcs having several heads (#13153)
  • [Adreno] Adapt reduction schedule for adreno (#13100)
  • [Adreno] Fix winograd accuracy (#13117)
  • [Adreno][Textures] Fix static memory planner (#13253)
  • [DOCKER][Adreno]Docker infra for Adreno target with CLML support (#12833)

AoT

  • [AOT] Add CreateExecutorMetadata analysis pass (#13250)
  • [AOT] Add CreateFunctionMetadata analysis pass (#13095)
  • [AOT] Sanitize input/output name in runtime (#13046)

Arith

  • [Arith] Add internal NarrowPredicateExpression utility (#13041)
  • [Arith] Optional rewriting and simplification into AND of ORs (#12972)

arm

  • [bfloat16] Fixed dtype conversion in the arm_cpu injective schedule (#13417)

AutoTVM

  • [AutoTVM] Introducing multi_filter into ConfigSpace autotvm (#12545)

Build

  • [BUILD] Re-enable ccache by default (#12839)

CI

  • [ci] Fix docs deploy (#13570)
  • [ci] Split Jenkinsfile into platform-specific jobs (#13300)
  • [ci] Dis-allow any non-S3 URLs in CI (#13283)
  • [ci] Split out C++ unittests (#13335)
  • [CI] Separate the ci scripts into Github and Jenkins scripts (#13368)
  • [ci] Assert some tests are not skipped in the CI (#12915)
  • [ci] Ignore JUnit upload failures (#13142)
  • [ci] Lint for trailing newlines and spaces (#13058)
  • [ci] Template build steps (#12983)
  • [ci][docker] Allow usage of ECR images in PRs (#13590)
  • [ci][docker] Read docker image tags during CI runs (#13572)
  • [ci][wasm] Add package-lock.json to git (#13505)

CL

  • [ACL] Enable int8 data type in pooling operators (#13488)

CMSIS-NN

  • [CMSIS-NN] Support for int16 conv2d (#12950)
  • [CMSIS-NN] Support for int16 in fully connected layer (#13484)

DNNL

  • [AMP] refine AMP and the corresponding tests for bfloat16 (#12787)

Docker

  • [Docker]Refactor timezone script and NRF installation (#13342)

Docs

  • [docs] Fix empty code blocks in tutorials (#13188)

Ethos-N

  • [ETHOSN] Consolidate target string usage (#13159)
  • [ETHOSN] Throw error message when inference fails (#13022)
  • [ETHOSN] Inline non-compute-intensive partitions (#13092)
  • [ETHOSN] Transpose fully connected weights (#12970)
  • [ETHOSN] Support conversion of add/mul to requantize where possible (#12887)

Frontend

  • [TFLite] Enable int64 biases for int16 quantized operators (#12042)

Hexagon

  • [Hexagon] Add HVX quant conv2d implementation (#13256)
  • [Hexagon] Add test to show scheduling of resnet50 with async dma pipe… (#13352)
  • [Hexagon] Enable Hexagon User DMA bypass mode (#13381)
  • [Hexagon] Lint tests part 2 (#13271)
  • [Hexagon] Add pylint on tests (#13233)
  • [Hexagon] Add E2E test demonstrating how to apply blocked layout schedule to conv2d via metaschedule (#13180)
  • [Hexagon] Add a test to show how to use multi input async dma pipelin… (#13110)
  • [Hexagon]: Add upload function to hexagon session (#13161)
  • [Hexagon] Add support for instrumentation based profiling for Hexagon (#12971)
  • [Hexagon] Add power manager (#13162)
  • [Hexagon] Add scripts for e2e MetaSchedule tuning demonstration (#13135)
  • [Hexagon] Add feature to copy logcat to --hexagon-debug and add new --sysmon-profile option to run sysmon profiler during the test (#13107)
  • [Hexagon] Async DMA pipelining test suite (#13005)
  • [Hexagon] Enable multi input Async DMA; same queue / stage (#13037)
  • [Hexagon] Do not use target test fixture in Hexagon tests (#12981)
  • [Hexagon] 3-stage pipeline; multi queue async DMA for cache read / write (#12954)
  • [Hexagon] vrmpy tensorization for e2e compilation of int8 models (#12911)
  • [Hexagon] Support template-free meta schedule tuning (#12854)
  • [Hexagon] depth_to_space slice op (#12669)
  • [Hexagon] Make allocate_hexagon_array a hexagon contrib API (#13336)
  • [Hexagon] Add fix for vtcm allocation searches (#13197)
  • [MetaSchedule][Hexagon] Add postproc for verifying VTCM usage (#13538)
  • [Hexagon][QNN] Add TOPI strategies for qnn ops mul/tanh/subtract (#13416)
  • [Logging][Hexagon] Improve logging on Hexagon (#13072)
  • [Hexagon] [runtime] Per-thread hardware resource management (#13181)
  • [Hexagon] [runtime] Create objects to manage thread hardware resources (#13111)
  • [QNN][Hexagon] Disable QNN canonicalization pass (#12398)
  • [Hexagon] [runtime] Manage RPC and runtime buffers separately (#13028)
  • [Hexagon] [runtime] VTCM Allocator (#12947)
  • [TOPI][Hexagon] Add schedule and test for maxpool uint8 layout (#12826)
  • [TOPI][Hexagon] Implement quantize op for hexagon (#12820)
  • [Meta Schedule][XGBoost] Update the custom callback function of xgboost in meta schedule (#12141)
  • [TIR] [Hexagon] Add vdmpy intrinsic and transform_layout for tests (#13557)
  • [Hexagon] [runtime] Support VTCM alignments of 128 or 2k (#12999)
  • [HEXAGON][QHL] Clipping the inputs of HVX version of QHL Sigmoid operation (#12919)
  • [Hexagon] [runtime] Add user DMA to device API resource management (#12918)

LLVM

  • [LLVM] Emit fp16/fp32 builtins directly into target module (#12877)
  • [LLVM] Switch to using New Pass Manager (NPM) with LLVM 16+ (#13515)

MetaSchedule

  • [MetaSchedule] Make MultiLevelTiling apply condition customizable (#13535)
  • [MetaSchedule] Enhance Database Validation Script (#13459)
  • [MetaSchedule] Fix Dynamic Loop from AutoBinding (#13421)
  • [MetaSchedule] Support schedules with cache read in RewriteLayout (#13384)
  • [MetaSchedule] Improve inlining and VerifyGPUCode for quantized model workload (#13334)
  • [MetaSchedule] Add JSON Database Validation Scripts (#12948)
  • [MetaSchedule] Fix the order of applying AutoInline in ScheduleUsingAnchorTrace (#13329)
  • [MetaSchedule] Refactor ScheduleRule Attributes (#13195)
  • [MetaSchedule] Improve the script for TorchBench model tuning & benchmarking (#13255)
  • [MetaSchedule] Enable anchor-block tuning (#13206)
  • [MetaSchedule] Introduce a variant of ModuleEquality to enable ignoring NDArray raw data (#13091)
  • [MetaSchedule] Consolidate module hashing and equality testing (#13050)
  • [MetaSchedule] Support RewriteLayout postproc on AllocateConst (#12991)
  • [MetaSchedule] Tuning API cleanup & ergonomics (#12895)
  • [MetaSchedule] Fix XGBoost Import Issue (#12936)
  • [MetaSchedule] Add Script for TorchBench Model Tuning & Benchmarking (#12914)
  • [MetaSchedule] Restore num_threads parameter in tuning API (#13561)
  • [MetaSchedule] TorchBench tuning script: add option to disallow operators in sub graph (#13453)
  • [MetaSchedule] Fix segfault in gradient based scheduler (#13399)
  • [MetaSchedule] Add from-target Defaults for x86 VNNI Targets (#13383)
  • [MetaSchedule] Fix Task Hanging in EvolutionarySearch (#13246)
  • [MetaSchedule] Allow skipping exact NDArray rewrite in RemoveWeightLayoutRewriteBlock (#13052)
  • [MetaSchedule][UX] Support Interactive Performance Table Printing in Notebook (#13006)
  • [MetaSchedule][UX] User Interface for Jupyter Notebook (#12866)

microNPU

  • [microNPU] Upgrade Vela to v3.5.0 (#13394)
  • [microNPU] Fixed MergeConstants pass on striped networks (#13281)

microTVM

  • [microTVM] Modernize Arm Cortex-M convolution schedules (#13242)
  • [microTVM] Improve code reuse in Corstone300 conv2d tests (#13051)
  • [microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts (#12969)
  • [microTVM] Use default Project Options in template projects and add Makefile for Arduino template project (#12818)
  • [microTVM] Generalize depthwise_conv2d schedule (#12856)
  • [microTVM] add the option to open a saved micro project for debugging (#12495)
  • Added macro generation in MLF export (#12789)
  • [microTVM][Arduino]Add serial_number to project options and tests (#13518)
  • [microTVM][Zephyr] Add 'serial_number' option (#13377)
  • [microTVM][PyTorch][Tutorial]Adding a PyTorch tutorial for microTVM with CRT (#13324)

Misc

  • [CodegenC] Explicit forward function declarations (#13522)
  • [FQ2I] Support converting dense -> add to qnn.dense -> add -> requantize (#13578)
  • [Minor][Testing] Consolidate IRs into corresponding functions (#13339)
  • Add recursive on loop with marked kUnrolled (#13536)
  • Skip stride check if shape is 1 in IsContiguous (#13121)
  • [TEST] CPU feature detection for x86 and ARM dot product instructions (#12980)
  • [Node] Expose StructuralEqual/Hash handler implementation to header (#13001)
  • [Tensorize] Add logs to comparator to make debugging tensorize failures easier (#13285)
  • [usmp] Also remap VarNode to USMP-allocated buffer (#12880)
  • [Virtual Machine] Implementation of 'set_output_zero_copy' (#11358)

ONNX

  • [ONNX] Add converter for FastGelu from Microsoft onnxruntime contrib opset (#13119)
  • [QNN, ONNX] Extension of QLinearMatMul in ONNX front-end for all ranks of input tensors (#13322)

OpenCL

  • [OpenCL] Introduce OpenCL wrapper to TVM (#13362)
  • [OpenCL] Introduction of weights on buffers (#13563)
  • [OPENCL][TEXTURE] Test case enhancements and fixes for RPC (#13408)

Relay

  • [Relay] Fix CombineParallelDense slicing axis (#13597)
  • [Relay] Refactor constant folding over expr into a utility function (#13343)
  • [Relay] Enhancement for fold_scale_axis and simplify_expr (#13275)
  • [Relay] Add ClipAndConsecutiveCast and CastClip to SimplifyExpr (#13236)
  • [Relay] Rewrite division by constant to multiply (#13182)
  • [Relay] Extend split for blocked ConvertLayout pass (#12886)
  • [Relay][transform][SimplifyExpr] simplify adjacent muls and adds with constants (#13213)
  • [Relay][Hexagon] Add per-channel FixedPointMultiply operation (#13080)
  • [IRBuilder][Minor] Add intrinsics like T.int32x4 (#13361)

roofline

  • [ROOFLINE] Add support for different dtypes (#13003)
  • [Roofline] Add fma (non-tensorcore) peak flops for CUDA (#13419)

RPC

  • [RPC] Fix tracker connection termination (#13420)

Runtime

  • [RUNTIME][CLML] Add fixes to clml runtime api (#13426)
  • [DLPack][runtime] Update DLPack to v0.7 (#13177)

Target

  • [Target] Replace utility functions with target.features (#12455)
  • [Target] Add Target Parser for Arm(R) Cortex(R) A-Profile CPUs (#12454)
  • [Target] Add target_device_type attribute to override default device_type (#12509)

TIR

  • [TIR] Add preserve_unit_iters option to blockize/tensorize (#13579)
  • [TIR] Introduce ReduceBranchingThroughOvercompute (#13299)
  • [TIR] Unify index data type when creating prim func (#13327)
  • [TIR] Remove PrimFuncNode::preflattened_buffer_map (#10940)
  • [TIR] Make syntax of AST nodes different than ops (#13358)
  • [TIR] Update ReductionIterNotIndexOutputBuffer to check BlockRealizeN… (#13301)
  • [TIR] Check producer predicate in ReverseComputeInline (#13338)
  • [TIR] Add utility for anchor block extraction (#13194)
  • [TIR] Allow IndexMap applied to arguments with different dtypes (#13085)
  • [TIR] Fix handling of int64 extent in blockize and tensorize (#13069)
  • [TIR] Refactor NarrowDataType into DataTypeLegalizer (#13049)
  • [TIR] add unit-tests for upcoming primfunc-slicing (#12794)
  • [TIR] Fix plan buffer allocation location for loop carried dependencies (#12757)
  • [TIR] Fix predefined inverse map in layout transform dtype legalization (#13565)
  • [TIR] Preserve loop annotation after loop partitioning (#13292)
  • [TIR] Use IndexMap to transform NDArray (#12949)
  • [TIR] Preserve loop annotations in inject_software_pipeline pass (#12937)
  • [TIR][Schedule] Support for specific consumer block targeting in cache_write (#13510)
  • [TIR][Hexagon] Add vtcm memory capacity verification for Hexagon target (#13349)
  • [TIR][Transform] Optional data-flow analysis in RemoveNoOp (#13217)
  • [TIR][Analysis][Arith] Implement basic data-flow analysis (#13130)
  • [TIR][Bugfix] Fix AXIS_SEPARATORS in tir.Schedule.transform_layout (#13326); a usage sketch follows this list
  • [TIR][Arith] Use TryCompare to narrow inequalities if possible (#13024)
  • [TIR][Primitive] Support rolling_buffer schedule primitive in TensorIR (#13033)
  • [Arith][TIR] Check for constant offsets of known literal constraints (#13023)
  • [TIR][Arith] Implement kApplyConstraintsToBooleanBranches extension (#13129)
  • [TIR][Schedule] Add cache_index to precompute index of buffer load (#13192)
  • [TIR][Schedule] Add cache_inplace primitive to cache opaque buffer (#12939)
  • [UnitTest][TIR] Support IRModule comparisons in CompareBeforeAfter (#12920)
  • [TIR][Arith] Prove conditionals by transitively applying knowns (#12863)
  • [TIR, MetaSchedule] Preserve unit block iters for auto-tensorization (#12974)
  • [TIR][MetaSchedule] Add regression test for layout_rewrite extent=1 (#12916)
  • [TIR][Transform] Keep the allocate buffers order after update buffer allocation location (#13560)
  • [TIR][Schedule] Fix cache_read loc detecting and region_cover checking (#13345)
  • [TIR][Transform] Clear buffer_map during MakeUnpackedAPI (#12891)
  • [TIR][Schedule] Relax cache read/write's restriction and fix unexpected behavior (#12766)
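
Several schedule-level items above revolve around tir.Schedule.transform_layout. A hedged usage sketch with arbitrary tiling factors:

```python
import tvm
from tvm import tir
from tvm.script import tir as T

@T.prim_func
def before(a: T.handle, b: T.handle):
    A = T.match_buffer(a, (64, 64), "float32")
    B = T.match_buffer(b, (64, 64), "float32")
    for i, j in T.grid(64, 64):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] + T.float32(1)

sch = tir.Schedule(before)
block = sch.get_block("B")
# ("write", 0) selects the block's first written buffer (B here); the
# index map rewrites B into a 16x16-tiled physical layout.
sch.transform_layout(block, ("write", 0),
                     lambda i, j: (i // 16, j // 16, i % 16, j % 16))
print(sch.mod.script())
```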

TOPI

  • [TOPI] Implement Einsum with reduction axes (#12913)
  • [TOPI] Add layer norm operator (#12864)
  • [TOPI] Add handwritten matvec for dynamic cases (#13423)
  • [TOPI] Fix dtype legalize logic for CPU dot product instruction (#12865)
  • [TOPI][Hexagon] Implement quantized adaptive_avg_pool1d for hexagon (#13282)
  • [TOPI][Hexagon] Implement quantized depthwise conv2d (#12499)

Torch

  • [TVM PyTorch Integration] optimized_torch & as_torch how-to guide (#12318)
  • [frontend][pytorch]Support aten::Tensor_split operator (#12871)

TVMC

  • [TVMC] Global pass context for compile and tune (#13309)

TVMScript

  • [TVMScript] Improvements tvm.script.highlight (#13438)
  • [TVMScript] Reorganize the folder structure (#12496)
  • [TVMScript] TIR parser (#13190)
  • [TVMScript] IRModule parser (#13176)
  • [TVMScript] Evaluator, core parser, var table (#13088)
  • [TVMScript] AST, Source and diagnostics for Parser (#12978)
  • [TVMScript] Import TIR methods into the IRBuilder (#12900)
  • [TVMScript] Infer T.match_buffer parameters for region (#12890)

v0.11.0.rc0

1 year ago

Introduction

The TVM community has worked since the v0.10.0 release to deliver the following new exciting improvements!

  • MetaSchedule

    • Tuning API improvements and anchor-block tuning
  • TVMScript metaprogramming

    • Lots of progress with TVMScript, with the introduction of a core parser, AST, evaluator, source and diagnostics

And many other general improvements to microTVM, code quality, CI, frontends, and more! Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.10.0...v0.11.0.

RFCs

These RFCs have been merged into apache/tvm-rfcs since the last release.

What's Changed

Note that this list is not comprehensive of all PRs and discussions since v0.10. Please visit the full listing of commits for a complete view: https://github.com/apache/tvm/compare/v0.10.0...v0.11.0.

Adreno

  • [Adreno] Add global pooling schedule (#13573)
  • [Adreno] Add documentation for Adreno deployment (#13393)
  • [Adreno] Fix mem_scope annotations for prim funcs having several heads (#13153)
  • [Adreno] Adapt reduction schedule for adreno (#13100)
  • [Adreno] Fix winograd accuracy (#13117)
  • [Adreno][Textures] Fix static memory planner (#13253)
  • [DOCKER][Adreno] Docker infra for Adreno target with CLML support (#12833)

AoT

  • [AOT] Add CreateExecutorMetadata analysis pass (#13250)
  • [AOT] Add CreateFunctionMetadata analysis pass (#13095)
  • [AOT] Sanitize input/output name in runtime (#13046)

Arith

  • [Arith] Add internal NarrowPredicateExpression utility (#13041)
  • [Arith] Optional rewriting and simplification into AND of ORs (#12972)
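
The two entries above extend TVM's arithmetic simplifier. As a quick illustration of the public surface they sit behind, here is a minimal sketch using the stable tvm.arith.Analyzer API; the AND-of-ORs rewriting in #12972 is an internal simplifier extension, so the general-purpose simplify entry point is shown instead:

    import tvm
    from tvm import tir

    n = tir.Var("n", "int32")
    analyzer = tvm.arith.Analyzer()
    # Register a known constant-integer bound for n ...
    analyzer.update(n, tvm.arith.ConstIntBound(0, 7))
    # ... then simplify a predicate under that knowledge.
    expr = tir.all(n < 8, n >= 0)
    print(analyzer.simplify(expr))  # should fold to True given the bound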

arm

  • [bfloat16] Fixed dtype conversion in the arm_cpu injective schedule (#13417)

AutoTVM

  • [AutoTVM] Introducing multi_filter into ConfigSpace autotvm (#12545)

Build

  • [BUILD] Re-enable ccache by default (#12839)

CI

  • [ci] Fix docs deploy (#13570)
  • [ci] Split Jenkinsfile into platform-specific jobs (#13300)
  • [ci] Dis-allow any non-S3 URLs in CI (#13283)
  • [ci] Split out C++ unittests (#13335)
  • [CI] Separate the ci scripts into Github and Jenkins scripts (#13368)
  • [ci] Assert some tests are not skipped in the CI (#12915)
  • [ci] Ignore JUnit upload failures (#13142)
  • [ci] Lint for trailing newlines and spaces (#13058)
  • [ci] Template build steps (#12983)
  • [ci][docker] Allow usage of ECR images in PRs (#13590)
  • [ci][docker] Read docker image tags during CI runs (#13572)
  • [ci][wasm] Add package-lock.json to git (#13505)

CL

  • [ACL] Enable int8 data type in pooling operators (#13488)

CMSIS-NN

  • [CMSIS-NN] Support for int16 conv2d (#12950)
  • [CMSIS-NN] Support for int16 in fully connected layer (#13484)

DNNL

  • [AMP] refine AMP and the corresponding tests for bfloat16 (#12787)

Docker

  • [Docker] Refactor timezone script and NRF installation (#13342)

Docs

  • [docs] Fix empty code blocks in tutorials (#13188)

Ethos-N

  • [ETHOSN] Consolidate target string usage (#13159)
  • [ETHOSN] Throw error message when inference fails (#13022)
  • [ETHOSN] Inline non-compute-intensive partitions (#13092)
  • [ETHOSN] Transpose fully connected weights (#12970)
  • [ETHOSN] Support conversion of add/mul to requantize where possible (#12887)

Frontend

  • [TFLite] Enable int64 biases for int16 quantized operators (#12042)

Hexagon

  • [Hexagon] Add HVX quant conv2d implementation (#13256)
  • [Hexagon] Add test to show scheduling of resnet50 with async dma pipe… (#13352)
  • [Hexagon] Enable Hexagon User DMA bypass mode (#13381)
  • [Hexagon] Lint tests part 2 (#13271)
  • [Hexagon] Add pylint on tests (#13233)
  • [Hexagon] Add E2E test demonstrating how to apply blocked layout schedule to conv2d via metaschedule (#13180)
  • [Hexagon] Add a test to show how to use multi input async dma pipelin… (#13110)
  • [Hexagon] Add upload function to hexagon session (#13161)
  • [Hexagon] Add support for instrumentation based profiling for Hexagon (#12971)
  • [Hexagon] Add power manager (#13162)
  • [Hexagon] Add scripts for e2e MetaSchedule tuning demonstration (#13135)
  • [Hexagon] Add feature to copy logcat to --hexagon-debug and add new --sysmon-profile option to run sysmon profiler during the test (#13107)
  • [Hexagon] Async DMA pipelining test suite (#13005)
  • [Hexagon] Enable multi input Async DMA; same queue / stage (#13037)
  • [Hexagon] Do not use target test fixture in Hexagon tests (#12981)
  • [Hexagon] 3-stage pipeline; multi queue async DMA for cache read / write (#12954)
  • [Hexagon] vrmpy tensorization for e2e compilation of int8 models (#12911)
  • [Hexagon] Support template-free meta schedule tuning (#12854)
  • [Hexagon] depth_to_space slice op (#12669)
  • [Hexagon] Make allocate_hexagon_array a hexagon contrib API (#13336)
  • [Hexagon] Add fix for vtcm allocation searches (#13197)
  • [MetaSchedule][Hexagon] Add postproc for verifying VTCM usage (#13538)
  • [Hexagon][QNN] Add TOPI strategies for qnn ops mul/tanh/subtract (#13416)
  • [Logging][Hexagon] Improve logging on Hexagon (#13072)
  • [Hexagon] [runtime] Per-thread hardware resource management (#13181)
  • [Hexagon] [runtime] Create objects to manage thread hardware resources (#13111)
  • [QNN][Hexagon] Disable QNN canonicalization pass (#12398)
  • [Hexagon] [runtime] Manage RPC and runtime buffers separately (#13028)
  • [Hexagon] [runtime] VTCM Allocator (#12947)
  • [TOPI][Hexagon] Add schedule and test for maxpool uint8 layout (#12826)
  • [TOPI][Hexagon] Implement quantize op for hexagon (#12820)
  • [Meta Schedule][XGBoost] Update the custom callback function of xgboost in meta schedule (#12141)
  • [TIR] [Hexagon] Add vdmpy intrinsic and transform_layout for tests (#13557)
  • [Hexagon] [runtime] Support VTCM alignments of 128 or 2k (#12999)
  • [HEXAGON][QHL] Clipping the inputs of HVX version of QHL Sigmoid operation (#12919)
  • [Hexagon] [runtime] Add user DMA to device API resource management (#12918)

LLVM

  • [LLVM] Emit fp16/fp32 builtins directly into target module (#12877)
  • [LLVM] Switch to using New Pass Manager (NPM) with LLVM 16+ (#13515)

MetaSchedule

  • [MetaSchedule] Make MultiLevelTiling apply condition customizable (#13535)
  • [MetaSchedule] Enhance Database Validation Script (#13459)
  • [MetaSchedule] Fix Dynamic Loop from AutoBinding (#13421)
  • [MetaSchedule] Support schedules with cache read in RewriteLayout (#13384)
  • [MetaSchedule] Improve inlining and VerifyGPUCode for quantized model workload (#13334)
  • [MetaSchedule] Add JSON Database Validation Scripts (#12948)
  • [MetaSchedule] Fix the order of applying AutoInline in ScheduleUsingAnchorTrace (#13329)
  • [MetaSchedule] Refactor ScheduleRule Attributes (#13195)
  • [MetaSchedule] Improve the script for TorchBench model tuning & benchmarking (#13255)
  • [MetaSchedule] Enable anchor-block tuning (#13206)
  • [MetaSchedule] Introduce a variant of ModuleEquality to enable ignoring NDArray raw data (#13091)
  • [MetaSchedule] Consolidate module hashing and equality testing (#13050)
  • [MetaSchedule] Support RewriteLayout postproc on AllocateConst (#12991)
  • [MetaSchedule] Tuning API cleanup & ergonomics (#12895)
  • [MetaSchedule] Fix XGBoost Import Issue (#12936)
  • [MetaSchedule] Add Script for TorchBench Model Tuning & Benchmarking (#12914)
  • [MetaSchedule] Restore num_threads parameter in tuning API (#13561)
  • [MetaSchedule] TorchBench tuning script: add option to disallow operators in sub graph (#13453)
  • [MetaSchedule] Fix segfault in gradient based scheduler (#13399)
  • [MetaSchedule] Add from-target Defaults for x86 VNNI Targets (#13383)
  • [MetaSchedule] Fix Task Hanging in EvolutionarySearch (#13246)
  • [MetaSchedule] Allow skipping exact NDArray rewrite in RemoveWeightLayoutRewriteBlock (#13052)
  • [MetaSchedule][UX] Support Interactive Performance Table Printing in Notebook (#13006)
  • [MetaSchedule][UX] User Interface for Jupyter Notebook (#12866)
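
With the tuning API cleanup (#12895) and anchor-block tuning (#13206), end-to-end TIR tuning follows roughly the sketch below. This is a minimal sketch against the post-cleanup tvm.meta_schedule surface; exact keyword names may vary between versions, and some older parsers write the buffer annotation as T.Buffer[(128, 128), "float32"]:

    import tvm
    from tvm import meta_schedule as ms
    from tvm.script import tir as T

    @T.prim_func
    def matmul(A: T.Buffer((128, 128), "float32"),
               B: T.Buffer((128, 128), "float32"),
               C: T.Buffer((128, 128), "float32")):
        for i, j, k in T.grid(128, 128, 128):
            with T.block("C"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                with T.init():
                    C[vi, vj] = T.float32(0)
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

    target = tvm.target.Target("llvm -num-cores 4")
    # Search for schedules and record measured candidates in a database.
    database = ms.tune_tir(mod=matmul, target=target,
                           max_trials_global=64, work_dir="./ms_work_dir")
    # Pick the best record from the database and return a scheduled module.
    sch = ms.tir_integration.compile_tir(database, matmul, target)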

microNPU

  • [microNPU] Upgrade Vela to v3.5.0 (#13394)
  • [microNPU] Fixed MergeConstants pass on striped networks (#13281)

microTVM

  • [microTVM] Modernize Arm Cortex-M convolution schedules (#13242)
  • [microTVM] Improve code reuse in Corstone300 conv2d tests (#13051)
  • [microTVM] Add Cortex-M DSP schedules for optimal conv2d layouts (#12969)
  • [microTVM] Use default Project Options in template projects and add Makefile for Arduino template project (#12818)
  • [microTVM] Generalize depthwise_conv2d schedule (#12856)
  • [microTVM] add the option to open a saved micro project for debugging (#12495)
  • Added macro generation in MLF export (#12789)
  • [microTVM][Arduino] Add serial_number to project options and tests (#13518)
  • [microTVM][Zephyr] Add 'serial_number' option (#13377)
  • [microTVM][PyTorch][Tutorial] Adding a PyTorch tutorial for microTVM with CRT (#13324)

Misc

  • [CodegenC] Explicit forward function declarations (#13522)
  • [FQ2I] Support converting dense -> add to qnn.dense -> add -> requantize (#13578)
  • [Minor][Testing] Consolidate IRs into corresponding functions (#13339)
  • Add recursive on loop with marked kUnrolled (#13536)
  • Skip stride check if shape is 1 in IsContiguous (#13121)
  • [TEST] CPU feature detection for x86 and ARM dot product instructions (#12980)
  • [Node] Expose StructuralEqual/Hash handler implementation to header (#13001)
  • [Tensorize] Add logs to comparator to make debugging tensorize failures easier (#13285)
  • [usmp] Also remap VarNode to USMP-allocated buffer (#12880)
  • [Virtual Machine] Implementation of 'set_output_zero_copy' (#11358)

ONNX

  • [ONNX] Add converter for FastGelu from Microsoft onnxruntime contrib opset (#13119)
  • [QNN, ONNX] Extension of QLinearMatMul in ONNX front-end for all ranks of input tensors (#13322)
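
Both entries land in the standard ONNX import path. For context, a minimal sketch of bringing an ONNX graph into Relay; the file name and input shape below are placeholders:

    import onnx
    from tvm import relay

    model = onnx.load("model.onnx")  # placeholder model file
    shape_dict = {"input": (1, 3, 224, 224)}  # placeholder input shape
    mod, params = relay.frontend.from_onnx(model, shape=shape_dict)
    print(mod)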

OpenCL

  • [OpenCL] Introduce OpenCL wrapper to TVM (#13362)
  • [OpenCL] Introduction of weights on buffers (#13563)
  • [OPENCL][TEXTURE] Test case enhancements and fixes for RPC (#13408)

Relay

  • [Relay] Fix CombineParallelDense slicing axis (#13597)
  • [Relay] Refactor constant folding over expr into a utility function (#13343)
  • [Relay] Enhancement for fold_scale_axis and simplify_expr (#13275)
  • [Relay] Add ClipAndConsecutiveCast and CastClip to SimplifyExpr (#13236)
  • [Relay] Rewrite division by constant to multiply (#13182)
  • [Relay] Extend split for blocked ConvertLayout pass (#12886)
  • [Relay][transform][SimplifyExpr] simplify adjacent muls and adds with constants (#13213)
  • [Relay][Hexagon] Add per-channel FixedPointMultiply operation (#13080)
  • [IRBuilder][Minor] Add intrinsics like T.int32x4 (#13361)
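
Several of the entries above grow SimplifyExpr. Applying it follows the usual Relay pass pattern; a minimal sketch on the adjacent-constant-multiply chain that #13213 folds:

    import tvm
    from tvm import relay

    x = relay.var("x", shape=(4,), dtype="float32")
    # Adjacent multiplies by constants, the kind of chain #13213 targets.
    y = x * relay.const(2.0) * relay.const(3.0)
    mod = tvm.IRModule.from_expr(relay.Function([x], y))
    mod = relay.transform.SimplifyExpr()(mod)
    print(mod)  # the two constants should be folded into one multiply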

roofline

  • [ROOFLINE] Add support for different dtypes (#13003)
  • [Roofline] Add fma (non-tensorcore) peak flops for CUDA (#13419)

RPC

  • [RPC] Fix tracker connection termination (#13420)

Runtime

  • [RUNTIME][CLML] Add fixes to clml runtime api (#13426)
  • [DLPack][runtime] Update DLPack to v0.7 (#13177)

Target

  • [Target] Replace utility functions with target.features (#12455)
  • [Target] Add Target Parser for Arm(R) Cortex(R) A-Profile CPUs (#12454)
  • [Target] Add target_device_type attribute to override default device_type (#12509)
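
The target parser work above expands a plain target string into structured attributes. A minimal sketch; the features object comes from #12455 and its attribute names are version-dependent:

    import tvm

    target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a55")
    print(target.kind.name)  # "llvm"
    print(target.keys)       # device keys, e.g. ["arm_cpu", "cpu"]
    # Parsed feature flags replacing the old utility functions (#12455).
    print(target.features.has_dotprod)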

TIR

  • [TIR] Add preserve_unit_iters option to blockize/tensorize (#13579)
  • [TIR] Introduce ReduceBranchingThroughOvercompute (#13299)
  • [TIR] Unify index data type when creating prim func (#13327)
  • [TIR] Remove PrimFuncNode::preflattened_buffer_map (#10940)
  • [TIR] Make syntax of AST nodes different than ops (#13358)
  • [TIR] Update ReductionIterNotIndexOutputBuffer to check BlockRealizeN… (#13301)
  • [TIR] Check producer predicate in ReverseComputeInline (#13338)
  • [TIR] Add utility for anchor block extraction (#13194)
  • [TIR] Allow IndexMap applied to arguments with different dtypes (#13085)
  • [TIR] Fix handling of int64 extent in blockize and tensorize (#13069)
  • [TIR] Refactor NarrowDataType into DataTypeLegalizer (#13049)
  • [TIR] add unit-tests for upcoming primfunc-slicing (#12794)
  • [TIR] Fix plan buffer allocation location for loop carried dependencies (#12757)
  • [TIR] Fix predefined inverse map in layout transform dtype legalization (#13565)
  • [TIR] Preserve loop annotation after loop partitioning (#13292)
  • [TIR] Use IndexMap to transform NDArray (#12949)
  • [TIR] Preserve loop annotations in inject_software_pipeline pass (#12937)
  • [TIR][Schedule] Support for specific consumer block targeting in cache_write (#13510)
  • [TIR][Hexagon] Add vtcm memory capacity verification for Hexagon target (#13349)
  • [TIR][Transform] Optional data-flow analysis in RemoveNoOp (#13217)
  • [TIR][Analysis][Arith] Implement basic data-flow analysis (#13130)
  • [TIR][Bugfix] Fix AXIS_SEPARATORS in tir.Schedule.transform_layout (#13326)
  • [TIR][Arith] Use TryCompare to narrow inequalities if possible (#13024)
  • [TIR][Primitive] Support rolling_buffer schedule primitive in TensorIR (#13033)
  • [Arith][TIR] Check for constant offsets of known literal constraints (#13023)
  • [TIR][Arith] Implement kApplyConstraintsToBooleanBranches extension (#13129)
  • [TIR][Schedule] Add cache_index to precompute index of buffer load (#13192)
  • [TIR][Schedule] Add cache_inplace primitive to cache opaque buffer (#12939)
  • [UnitTest][TIR] Support IRModule comparisons in CompareBeforeAfter (#12920)
  • [TIR][Arith] Prove conditionals by transitively applying knowns (#12863)
  • [TIR, MetaSchedule] Preserve unit block iters for auto-tensorization (#12974)
  • [TIR][MetaSchedule] Add regression test for layout_rewrite extent=1 (#12916)
  • [TIR][Transform] Keep the allocate buffers order after update buffer allocation location (#13560)
  • [TIR][Schedule] Fix cache_read loc detecting and region_cover checking (#13345)
  • [TIR][Transform] Clear buffer_map during MakeUnpackedAPI (#12891)
  • [TIR][Schedule] Relax cache read/write's restriction and fix unexpected behavior (#12766)
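
Many of the TIR entries above are tir.Schedule primitives. The sketch below shows the general pattern on a trivial workload using the long-standing cache_write; the consumer_blocks argument added by #13510 is noted in a comment rather than exercised:

    import tvm
    from tvm.script import tir as T

    @T.prim_func
    def square(A: T.Buffer((16,), "float32"), B: T.Buffer((16,), "float32")):
        for i in T.serial(16):
            with T.block("B"):
                vi = T.axis.spatial(16, i)
                B[vi] = A[vi] * A[vi]

    sch = tvm.tir.Schedule(square)
    block = sch.get_block("B")
    # Stage the output through a cache buffer; #13510 adds an optional
    # consumer_blocks argument to target specific consumer blocks only.
    sch.cache_write(block, 0, "global")
    print(sch.mod.script())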

TOPI

  • [TOPI] Implement Einsum with reduction axes (#12913)
  • [TOPI] Add layer norm operator (#12864)
  • [TOPI] Add handwritten matvec for dynamic cases (#13423)
  • [TOPI] Fix dtype legalize logic for CPU dot product instruction (#12865)
  • [TOPI][Hexagon] Implement quantized adaptive_avg_pool1d for hexagon (#13282)
  • [TOPI][Hexagon] Implement quantized depthwise conv2d (#12499)
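
As a pointer to how the new compute definitions are consumed, a minimal sketch of the einsum compute from #12913 through the legacy te schedule path (newer TVM deprecates te.create_schedule in favor of TIR scheduling):

    import tvm
    from tvm import te, topi

    A = te.placeholder((8, 16), name="A")
    B = te.placeholder((16, 4), name="B")
    # Einsum with a reduction axis (#12913): contract over j.
    C = topi.einsum("ij,jk->ik", A, B)
    s = te.create_schedule(C.op)
    f = tvm.build(s, [A, B, C], target="llvm")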

Torch

  • [TVM PyTorch Integration] optimized_torch & as_torch how-to guide (#12318)
  • [frontend][pytorch] Support aten::Tensor_split operator (#12871)
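
A minimal sketch of the optimize_torch helper that the how-to guide (#12318) covers; this assumes TVM is built with the PyTorch integration enabled (USE_PT_TVMDSOOP), and the exact keyword arguments vary across versions:

    import torch
    from tvm.contrib.torch import optimize_torch

    class MatMul(torch.nn.Module):
        def forward(self, x, y):
            return torch.matmul(x, y)

    example_inputs = (torch.rand(8, 8), torch.rand(8, 8))
    # Returns a torch.nn.Module whose forward pass runs through TVM.
    optimized = optimize_torch(MatMul(), example_inputs)
    print(optimized(*example_inputs))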

TVMC

  • [TVMC] Global pass context for compile and tune (#13309)
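
#13309 makes PassContext settings global across compile and tune. From Python, the equivalent driver calls look roughly like the sketch below; pass_context_configs mirrors the CLI's --pass-config flag, and the model path is a placeholder:

    from tvm.driver import tvmc

    model = tvmc.load("model.onnx")  # placeholder model file
    # Pass configuration applies to the whole compilation pipeline.
    package = tvmc.compile(model, target="llvm",
                           pass_context_configs=["tir.disable_vectorize=1"])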

TVMScript

  • [TVMScript] Improvements to tvm.script.highlight (#13438)
  • [TVMScript] Reorganize the folder structure (#12496)
  • [TVMScript] TIR parser (#13190)
  • [TVMScript] IRModule parser (#13176)
  • [TVMScript] Evaluator, core parser, var table (#13088)
  • [TVMScript] AST, Source and diagnostics for Parser (#12978)
  • [TVMScript] Import TIR methods into the IRBuilder (#12900)
  • [TVMScript] Infer T.match_buffer parameters for region (#12890)
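
The parser stack above (#13088, #13176, #13190) is what runs under the @T.prim_func decorator, and #12890 extends T.match_buffer parameter inference. A minimal roundtrip sketch; note that parser versions differ slightly in the annotation forms they accept:

    import tvm
    from tvm.script import tir as T

    @T.prim_func
    def add_one(a: T.handle, b: T.handle):
        A = T.match_buffer(a, (8,), "float32")
        B = T.match_buffer(b, (8,), "float32")
        for i in T.serial(8):
            with T.block("B"):
                vi = T.axis.spatial(8, i)
                B[vi] = A[vi] + T.float32(1)

    # Round-trip: print the parsed function back as TVMScript.
    print(add_one.script())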