MegEngine Versions

MegEngine is a fast, scalable, easy-to-use deep learning framework with automatic differentiation support.

v1.11.0

1 year ago

MegEngine

HighLight

Add CUDA INT4 support. Validated on cuda11.4 + cudnn8.2.1 + trt7.2.2.3 with an A2 card: compared with Float32, ResNet-50 top-1 accuracy drops by 0.993% and inference is 5.8x faster (557.969ms -> 96.726ms); compared with INT8, ResNet-50 top-1 accuracy drops by 0.131% and inference is 1.3x faster (125.76ms -> 96.726ms). See the MegEngine example for details. To try it early: python3 -m pip install megengine==1.11.0+cu114 -f https://megengine.org.cn/whl/mge.html
Netron can now visualize Traced Module. Try it at https://netron.app/
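The speedup factors quoted for INT4 can be re-derived from the latencies in the note. The snippet below is only an arithmetic check; the constant and function names are made up for illustration.

```python
# Latencies quoted in the release note, in milliseconds.
FLOAT32_MS = 557.969
INT8_MS = 125.76
INT4_MS = 96.726


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


int4_vs_fp32 = speedup(FLOAT32_MS, INT4_MS)
int4_vs_int8 = speedup(INT8_MS, INT4_MS)

print(f"INT4 vs Float32: {int4_vs_fp32:.1f}x")
print(f"INT4 vs INT8:    {int4_vs_int8:.1f}x")
```

Rounded to one decimal place, the two ratios agree with the 5.8x and 1.3x figures above.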

Bugfix

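Release process

Fix an error caused by renaming tensors in traced module.

Common components

Fix the condition for skipping algorithms during fastrun.
Fix OOM errors triggered by excessive GPU memory usage during fastrun.
Fix the process failing to exit under the combination of Windows 7 + 32-bit + multithreading.
Fix tensor format information being lost during parameter initialization.
Change the algorithm selection for the nchw44 broadcast_vec case, fixing a performance defect of nchw44 elemwise.
Fix source tree pollution so that git status again shows only the user's own changes.
Improve the messages printed on convolution channel mismatch and Matmul shape mismatch so they are easier to understand.
Fix occasional data read errors caused by network problems when reading the persist cache.
Fix a hang caused by parameter tensor initialization not taking DTR into account.
Fix record2 optimization being unavailable because softmax dynamically created oprs such as elemwise at runtime.
Fix a forward-compatibility issue caused by elemwise multitype, so that earlier load and run builds can run models dumped by this version.
Fix the Repeat operator not working in trace mode.
Fix settings not taking effect in load_and_run fitting mode when only the input shape or only the input batch size is given, among other issues.
Fix large ReduceMean discrepancies between versions, and between CPU and GPU within the same version.
Fix the increased model memory usage in v1.10.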

CUDA

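Fix cutlass taking too long or failing to compile for SM86.
Change the detection logic for multi-GPU environments. The check and warning at initialization on whether every GPU present supports import megengine is removed; an error is raised only when the GPU actually used at runtime does not support import megengine.
Fix compilation failure with cudnn8.
Fix TensorRT8 builds failing when LIBRARY_PATH is not specified.

Peripheral tools

Fix record_comp_seq not taking effect in load_and_run.
Fix inaccurate FLOPs results in the parameter and FLOPs statistics tool caused by the range limit of the long type.
Fix an error in load_and_run when dumping a model that contains test cases with global graph optimization.
Fix module_stats, the parameter and FLOPs statistics tool, counting shared weights more than once.
Fix megengine.tools.network_visualize raising an error because CondTake was unsupported.
Fix load and run showing no speedup after multithread is set.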

ROCM

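Fix convolution operators failing to run on the ROCM platform because a conv bias implementation was missing.

Distributed training

Fix training hanging in multi-GPU training when async_level is set to 0.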

New Features

Python API

Expose the following new APIs: is_cambricon_available, is_atlas_available, is_rocm_available, what_is_xpu.

Common components

resize backward now supports fp16 and the nhwc data format.
The algo policy cache for CPU and CUDA is now written in append mode.
Add oprs with bool output type to elemwise multitype to speed up megengine.functional.isnan, megengine.functional.not_equal, megengine.functional.less_equal, megengine.functional.greater_equal, megengine.functional.greater, megengine.functional.less, megengine.functional.isinf, and megengine.functional.equal. After the optimization, overall performance is on par with pytorch, and megengine.functional.isinf and megengine.functional.equal outperform pytorch.
Add interfaces for querying the trt, cudnn, and cuda versions bundled in the whl package: megengine.get_cuda_version, megengine.get_cudnn_version, megengine.get_tensorrt_version
Use VF instructions to optimize GI direct convolution, winograd convolution, nchw_nchw44 convolution, and matrix multiplication on X86 and RVV. Validated: ResNet18 is 50ms faster on amax04. Matrix multiplication: 12 Gflops -> 20 Gflops on E5-2620 v4 @ 3.0GHz amax, 0.3 Gflops -> 1.2 Gflops on nezha D1
Remove the FIXLEN dependency from GI algo RVV, avoiding the extra load/store operations that FIXLEN produces and speeding up inference; the resnet18 model is 5%~10% faster on RVV.
Optimize the softmax implementation. On arm devices, the optimized softmax is about 10x faster than the previous proxy implementation.
Add a toolchain that supports building with TensorRT8.
load_and_run adds the optimize_for_inference option for mdl models, which enables optimize-for-inference graph optimizations such as bn fusion.

ARM

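For the pooling operator, support fusing reduce and elemwise operators in nchw44 format.

Third-party hardware

Optimize X86+RISC-V performance; validated at 1.1x faster on resnet18.

Peripheral tools

load and run can now be given the loader init interface at runtime, so that a business-side loader whose init API has been renamed can still be loaded by passing the name as an argument. Option: --c-opr-init-interface. Example: ./load_and_run --c-opr-init-interface="your_loader_init_API". The default value of c-opr-init-interface is mgb_c_opr_init. A value that might be used in production, for example: anc_c_opr_init.
load_nerwork_and_run supports weight preprocessing and setting the number of warm-up iterations.

Release process

Add a build path for cu114 whl packages.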

Improvements

ARM

Optimize forward performance of the reduce Opr on CPU when reducing over the last dimension of shape (xxx, xxx, 2/3/4); about 10x faster.

CUDA

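Optimize conv2d performance when padding mode is reflect; the gain is most noticeable for large shapes, validated at about 50%.

Documentation

Improve the documentation of roi_pooling, roi_align, nms, remap, warp_affine, warp_perspective, and interpolate in the functional.vision module.
Improve the description of the mode parameter in the pad documentation to make it more accurate.
Improve the documentation of dataloader, Dataset, and MNIST dataset to make it more complete and explicit.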

MegEngine Lite

Bugfix

Fix memory leaks in the MegEngine Lite python interface caused by repeated calls to get_io_tensor, slice, and concat.
Fix a crash in lite when fast_run and nchw44 are enabled together.

New Features

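LiteConfig in MegEngine Lite adds the auto_optimize_inference option, which performs device detection and automatically sets the matching layout optimization options based on the CPU information at inference time.
Add an assertion to the Lite set_data_by_share and set_data_by_copy interfaces that a numpy ndarray input must be contiguous.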

v1.10.0

1 year ago

MegEngine

HighLight

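MegEngine models support forward compatibility: a model serialized by a newer version of MegEngine can be loaded by an older version of MegEngine.
- Forward compatibility is available from this version onward.
- Some scenarios are not forward compatible, for example when the model uses an opr newly added in the newer version.
Add python3.9 support.

Known Issue

In v1.10, sublinear and static-graph dtr do not work in trace mode.
ResNet50 inference on 2080ti cuda is slightly slower than v1.9.
VGG inference on Raspberry Pi is slightly slower than v1.9.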

Bugfix

Python API

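Restrict where inputs are automatically converted to tensors: only elemwise converts its inputs automatically.
Fix megengine.functional.matmul crashing during backward in dynamic-graph mode.
Fix incorrect shape inference in megengine.functional.transpose.
Fix empty tensor issues in conv backward and the megengine.random.RNG operators.
Restrict type conversion of non-tensor inputs when megengine.functional.concat is applied in trace mode.
Fix comparison functions in megengine.functional returning a dtype other than bool.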

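AMP

Fix the increased GPU memory usage of some networks on BaseCls in v1.9.

Common components

Fix AMP failing to work with fp16 parameters.
Pin the cpuinfo version to avoid a possible memory leak on dlopen on ARM.
Fix ndim being set incorrectly in adaptive_pooling when the shape cannot be inferred.
Fix errors when building with riscv64 gcc at optimization levels above O0.
Fix errors when reading or writing tensor shape asynchronously.
Fix incorrect gradients in advanced indexing when one element is taken multiple times.
Fix a commit change triggering recompilation of a large number of files.
Fix cache confusion when fastrun and heuristic are mixed.
Fix errors when querying the cuda environment through megengine.get_cuda_compute_capability after fork in some cases.
Fix being unable to attach a Tensor that is already on a gradient path.
Fix crashes after midout of Oprs, such as softmax, that are computed by composing other Oprs.
Fix the missing execution policy in pooling and matmul.
Fix an error after reset memory when running inference with MegEngineLite. Specifically, the reduce opr failed when the memory address of its input changed; an update step is now performed before actual execution.
Fix jit-related functions crashing when nvcc is not in path.
Fix the number of dimensions differing before and after reduce, after the default of the reduce operator's keepdims parameter changed from True to False in v1.9.
Fix unstable layernorm training and slowness when the normalized dimension is small.
Fix a rare hang when getting the shape of a tensor that was created with incomplete shape information.
Fix an exception in adaptive_pooling when a tensor is passed as tshape.
Fix an exception thrown while building the backward graph when reduce does not take part in backward computation and has no gradient.
Make every op that takes an axis option support negative axis values.
Fix a memory leak when running mge computation graphs with GraphInference.
Fix the condition for skipping algorithms during fastrun.
Fix OOM errors triggered by excessive GPU memory usage during fastrun.
Fix the incorrect gradient of maximum(x,x).
Add the MGE_WITH_BENCHMARK option to cmake, allowing BENCHMARK in DNN to be compiled.
Fix inplace operations in Function.
Fix broadcast_to not being traceable.
Check dtype, device, and other arguments when constructing a new tensor from an existing tensor.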

Release process

Fix an error caused by renaming tensors in traced module.
Fix traced module possibly raising an exception incorrectly.
Fix compatibility issues in traced module.

ARM

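Fix the error message when running NHWCD4 models on ARM.

Peripheral tools

Fix an exception in load_and_run fitting mode when const_shape is enabled for a model whose shape changes.
Fix record_comp_seq not taking effect in load_and_run.
Fix the event sync issue of atlas when profiling.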

New Features

Python API

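Remove Symbolvar from the Imperative python interface and implement its functionality in Tensor (compatible with existing mgo graph-surgery code).
Add the lamb optimizer, which supports large-batch-size training.
The megengine.functional.nn.roi_align operator supports empty tensor input.
Add the swapaxes interface for swapping dimensions.

Common components

Streamline third_party preparation and add options to improve the experience of training-only or inference-only users. Adding EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" before cmake automatically adjusts the THIRD_PARTY libraries needed for the build.
Check whether the local CUDA version matches the CUDA version MegEngine depends on, and print a warning on mismatch.
Support astype on uint16 tensors.
Add warmup to the profile mode of fastrun to make the measurement more accurate.
MegEngine models support forward compatibility: a model serialized by a newer version of MegEngine can be loaded by an older version of MegEngine.
Complete gi support for risc-v.
Add python3.9 support.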

ARM

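Add chanwise 9x9 and 11x11 dot product operations to arm_common. The 9x9 case has 25% wasted computation while the 11x11 case has only 8.3%; when alignment is satisfied, 9x9 and 11x11 take about the same time, so the 11x11 version is recommended.
Implement a gi version of non-mk4 gemm under dnn/src/fallback/matrix_mul.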
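The 25% and 8.3% figures for the 9x9 and 11x11 kernels are what one gets if each kernel row is padded up to a multiple of 4 lanes before the dot product. The note does not state this, so the padding rule below is an assumption, and `padded_waste` is a hypothetical helper used only to reproduce the arithmetic.

```python
def padded_waste(kernel_width: int, lanes: int = 4) -> float:
    """Fraction of multiply-accumulates spent on padding.

    Assumption: a kernel row of `kernel_width` elements is rounded up to the
    next multiple of `lanes` so that it fills whole vector registers.
    """
    padded = -(-kernel_width // lanes) * lanes  # ceiling to a multiple of lanes
    return (padded - kernel_width) / padded


for width in (9, 11):
    print(f"{width}x{width}: {padded_waste(width):.1%} wasted")
```

Under that assumption both widths pad to 12 elements per row, which would also explain why the two kernels take about the same time.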

CUDA

Support a basic implementation of int1 conv.

Third-party hardware

Support Atlas710 hardware.

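Peripheral tools

Improve the cmake build instructions; if you run into problems, feel free to submit a PR or give feedback on the forum.
Add the fitting mode interface to load_and_run.
The load_and_run --input option now accepts an input shape. Format: --input="data_name:{d0,d1,d2, ...,dn}".
load_and_run adds the layout_transform_batch_size option, which specifies the input batch size for global graph optimization.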
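To make the `--input="data_name:{d0,d1,d2, ...,dn}"` shape syntax concrete, here is a small Python sketch that splits such a string into a tensor name and a shape tuple. `parse_input_shape` is a hypothetical helper written for illustration; it is not the parser used by load_and_run, and the tensor name `data` is made up.

```python
def parse_input_shape(spec: str):
    """Split 'name:{d0,d1,...}' into (name, (d0, d1, ...))."""
    name, sep, shape_part = spec.partition(":")
    if not sep or not (shape_part.startswith("{") and shape_part.endswith("}")):
        raise ValueError(f"expected 'name:{{d0,d1,...}}', got {spec!r}")
    # int() tolerates surrounding whitespace, so "1, 3" parses like "1,3".
    dims = tuple(int(dim) for dim in shape_part[1:-1].split(","))
    return name, dims


name, shape = parse_input_shape("data:{1,3,224,224}")
print(name, shape)
```

On the command line, the same string would be passed as the value of `--input`.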

Improvements

Python API

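Improve megengine.functional.nn.pixel_shuffle performance on small shapes by up to 500%.
Improve megengine.functional.matmul performance on small shapes by about 15%.

Common components

Optimize cross-stream tensor copies.
Optimize the adaptive_pooling implementation. In imperative mode, megengine.functional.nn.adaptive_avg_pool2d and megengine.functional.nn.adaptive_max_pool2d are about 6.5x faster.
Optimize the megengine.functional.nn.conv_transpose3d implementation. About 2x faster in imperative mode.
Optimize the pooling implementation. In imperative mode, megengine.functional.nn.avg_pool2d and megengine.functional.nn.max_pool2d are about 5x faster.
Optimize the megengine.functional.nn.conv_transpose2d implementation. About 3x faster in imperative mode.
Use a simpler key construction in the heuristic cache for better performance.
Rewrite the custom grad rules of matmul and batchmatmul to speed up their backward computation; compared with v1.9, the time per training iteration of the vit model drops from 354ms to 350ms.
Reduce cuda compile time for a single sm to 2/3 of the original.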

CUDA

Optimize the CUDA direct algorithm for large convolutions; forward speed reaches more than 80% of peak.

MegEngine Lite

Bugfix

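Fix lite_shared.dll missing from the install directory.
Fix an error when copying data from numpy to a device tensor.
Fix MegEngine Lite still using a single thread under multithreaded execution with cpu:default.
Fix the interface name in pylite: set_tensorrt_cache → set_redis_cache
Fix a compatibility issue where older versions of load_and_run could not parse historical packaged models.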

New Features

Add uploading and downloading of redis cache to MegEngine Lite.
Add the LITE_extra_configure interface to MegEngine Lite; users can set whether to use model information for network configuration.

MegEngine

Bugfix

Python API

Restrict the use of convert_inputs in py_apply.
Fix megengine.functional.matmul grad error.
Fix megengine.functional.transpose shape infer.
Fix empty tensor bug of conv_bwd and megengine.random.RNG.
Restrict value converts to tensor for megengine.functional.concat.
Fix return dtype of comparison.
Fix the problem that cuda environment cannot be used after fork.
Fix the problem that tensors already in gradient path cannot be attached.
Fix crashes after midout of Operators that call other Operators to complete their computation, such as softmax.
Fix the problem that policy is missed for pooling and matmul.
Fix the problem of reporting an error when the input memory address changes in reduce opr, and add the update function to fix it before the actual execution.

AMP

Fix the increased memory usage of some networks on basecls in v1.9.

Common components

Fix an amp error occurring when some parameters have float16 dtype.
Fix cpuinfo version to avoid memory leakage when dlopen on arm.
Fix incorrect ndim when the shape cannot be inferred for adaptive_pooling.
Fix riscv64 gcc error when using compilation optimization options greater than O0.
Fix a bug when asynchronously reading or writing a tensor's shape.
Print a warning when the CUDA version on the user's machine does not match the CUDA version that MegEngine depends on.
Fix advanced indexing grad error.
Fix many objects being recompiled when the commit id changes.
Fix lookup heuristic cache even in fastrun.
Fixed the problem that jit related functions will fail when NVCC not in path.
Fix the default behavior of reduce being inconsistent with older versions with respect to keepdims.
Fixed the problem that layernorm training is unstable and the speed is slow with small normalization dimensions.
Fix a rare hang when getting the shape of a tensor that was created with incomplete shape information.
Fix the problem when passing a tensor as tshape in adaptive_pooling.
Fix the problem that reduce does not participate in reverse calculation when constructing backward graphs and throws exceptions when there is no gradient.
Make ops that take an axis option support negative axis.
Fixed memory leak when using GraphInference to run mge calculation graphs.
Fix skip condition in fastrun.
Fix OOM error in fastrun.
Fix grad of maximum(x, x).
Add the MGE_WITH_BENCHMARK option to cmake to allow the compilation of BENCHMARK in DNN.
Fix inplace operation on autodiff.Function.
broadcast_to supports mutable target shape.
Check args when constructing a tensor from an existing tensor.

Release process

Fix the bug occurred when renaming tensor in traced module.
Fix trace_module possibly raising an error in the finally scope.
Fix traced module compatibility issues.

ARM

Fix error message when executing NHWCD4 model on ARM.

Peripheral tools

Fix the problem that the model whose shape changes when the user turns on const_shape in load_and_run fitting mode throws an exception.
Fix the bug that record_comp_seq in load_and_run does not take effect.
Fix the bug of event sync of atlas when profiling.

New Features

Python API

Remove Symbolvar and implement its function in Tensor.
Add lamb optimizer that supports large batch size training.
megengine.functional.nn.roi_align operator supports empty tensor input.
Add swapaxes interface to support dimension swapping.

Common components

Optimize third_party's prepare, add options, and improve the experience of training-only or inference-only users. Adding EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" before cmake will automatically adjust the THIRD_PARTY library required for compilation.
Add warmup before profile in fastrun.
MegEngine models support forward compatibility. That is, the model serialized by the new version of MegEngine can be loaded in the old version of MegEngine.
Complete gi support for risc-v.
Support python3.9.

Third-party hardware

Support Atlas710.

ARM

Added chanwise's 11x11 & 9x9 dot product operation in arm_common.
Implement a gi version of gemm's non-mk4 algorithm under dnn/src/fallback/matrix_mul.

CUDA

Support simple implementation of int1 conv.

Peripheral tools

Improve the cmake build notes; if you have any questions, contributions and feedback are welcome.
Added fitting mode interface for load_and_run.
Add the usage of specifying input shape to the --input option of load_and_run. format: --input="data_name:{d0,d1,d2, ...,dn}".
Add layout_transform_batch_size option for load_and_run to specify global layout transform input batch size.

Improvements

Python API

Speed up megengine.functional.nn.pixel_shuffle on small shapes by up to 500%
Speed up megengine.functional.matmul on small shapes by 15%

CUDA

Speed up CUDA direct large conv.

Common components

Improve cross stream memory borrowing.
Speed up megengine.functional.nn.adaptive_avg_pool2d and megengine.functional.nn.adaptive_max_pool2d on imperative by 6.5 times.
Speed up megengine.functional.nn.conv_transpose3d on imperative by 2 times.
Speed up megengine.functional.nn.avg_pool2d on imperative by 5 times.
Speed up megengine.functional.nn.conv_transpose2d on imperative by 3 times.
Use the simple hash key in heuristic cache.
Rewrite the custom grad rules of matmul and batchmatmul to improve the backward calculation speed. Compared with version 1.9, the training time of one iteration of vit model is reduced from 354ms to 350ms.
Reduced single sm cuda compile time to 2/3.

MegEngine Lite

Bugfix

Fix the bug that lite_shared.dll is not in the install directory.
Fix set data by copy on device tensor.
Fix cpu:default create new thread.
Correct the set_redis_cache API name in pylite.
Fixed the compatibility issue that the packaged model could not be resolved with the old version of load_and_run.

New Features

Add redis cache support for uploading and downloading in MegEngine Lite.
Add LITE_extra_configure interface for Lite. Users can set whether to use model info for network configuration.

v1.9.1

1 year ago

MegEngine

Bugfix

Python API

Fix empty tensor issues in conv backward and the megengine.random.RNG operators.
Restrict type conversion of non-tensor inputs when megengine.functional.concat is applied in trace mode.

MegEngine Lite

Bugfix

Fix MegEngine Lite still using a single thread under multithreaded execution with cpu:default.

v1.9.0

2 years ago

MegEngine

Known Issue

Training fails when an input to megengine.random.RNG contains a 0-dimensional tensor.

HighLight

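This release brings a substantial performance improvement: most networks train about 10% faster, and heavily host-bound scenarios such as detection models and QAT training are 20%~40% faster. The speedup is especially significant with small batches and amp. Validated on multi-GPU BaseCls training, the average speedup is 15.4%.
- Some operators now let the output tensor share data with the input tensor (Memory Forwarding). No data copy takes place; a copy is triggered only when a tensor whose data is shared is modified, so that other tensors sharing that data are unaffected. The operators involved include megengine.functional.transpose, megengine.functional.broadcast_to, megengine.functional.reshape, megengine.functional.expand_dims, megengine.functional.split, tensor indexing, and others. This minimizes data copies and improves performance. To guard against abnormal GPU memory usage in extreme cases, megengine.config.disable_memory_forwarding is provided to disable this feature.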
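The copy-on-write behaviour described for Memory Forwarding can be modelled in a few lines of plain Python. This is a toy sketch and not MegEngine's implementation: `ToyTensor`, `_Storage`, and `forward_view` are hypothetical names, and real tensors track sharing inside the framework rather than with an explicit owner counter.

```python
class _Storage:
    """A buffer plus the number of toy tensors currently sharing it."""

    def __init__(self, data):
        self.data = list(data)
        self.owners = 1


class ToyTensor:
    def __init__(self, data=None, _storage=None):
        self._storage = _storage if _storage is not None else _Storage(data)

    def forward_view(self):
        """Return a new tensor that shares this tensor's buffer (no copy)."""
        self._storage.owners += 1
        return ToyTensor(_storage=self._storage)

    def shares_storage_with(self, other):
        return self._storage is other._storage

    def __setitem__(self, index, value):
        # Copy only when the buffer is shared, so other owners are unaffected.
        if self._storage.owners > 1:
            self._storage.owners -= 1
            self._storage = _Storage(self._storage.data)
        self._storage.data[index] = value

    def tolist(self):
        return list(self._storage.data)


a = ToyTensor([1, 2, 3])
b = a.forward_view()              # shares a's buffer
assert a.shares_storage_with(b)
b[0] = 99                         # first write triggers the copy
print(a.tolist(), b.tolist())
```

The write to `b` leaves `a` untouched, which is the guarantee the note describes; until that write, no data is duplicated.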

Notice

python3.5 is still supported in this release. Support will end starting with the next release, MegEngine v1.10 (MegBrain v8.17), so please prepare in advance.

Bug fixes

Python API

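Make the @ operator behave consistently with megengine.functional.matmul.
Fix megengine.functional.nn.pad possibly producing an all-zero output Tensor.
Add the nearest mode to megengine.functional.nn.remap and megengine.data.transform.Resize.

Common components

Fix megengine.functional.nn.sync_batch_norm being unusable in mixed-precision training.
Fix an error in global optimization when fusing conv with two nonlinear operators.
Fix slow large-kernel convolution when fastrun is off.
Fix taking gradients with respect to a non-float32 input silently producing no gradient instead of raising an error.
Fix IO interruption in RPC communication during distributed training.
Fix BatchNorm's support for second-order derivatives.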

New Features

Python API

megengine.functional.nn.conv1d and megengine.functional.nn.conv2d add the padding_mode parameter, supporting the zeros, reflect, and replicate modes.
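The three padding_mode values differ only in how the border is filled. The 1-D pure-Python sketch below assumes the conventional definitions (reflect mirrors around the edge element without repeating it, as NumPy's `reflect` does; replicate repeats the edge element); `pad_1d` is a hypothetical helper and does not call MegEngine.

```python
def pad_1d(values, pad, mode="zeros"):
    """Pad a 1-D sequence on both sides using the given mode."""
    values = list(values)
    n = len(values)
    if mode == "zeros":
        left = [0] * pad
        right = [0] * pad
    elif mode == "replicate":
        left = [values[0]] * pad
        right = [values[-1]] * pad
    elif mode == "reflect":
        if pad >= n:
            raise ValueError("reflect padding needs pad < len(values)")
        # Mirror around the edge element, excluding the edge itself.
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 1 - i] for i in range(1, pad + 1)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + values + right


row = [1, 2, 3, 4]
for mode in ("zeros", "reflect", "replicate"):
    print(mode, pad_1d(row, 2, mode))
```

For a 2-D convolution the same rule would be applied along both spatial axes before the kernel is slid over the input.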

CUDA

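Add a direct conv implementation for large kernels.
Add an implicit bmm implementation of large-kernel depthwise conv.
The nearest mode of resize on CUDA supports multi-channel inputs beyond 1 and 3 channels.

Common components

Optimize for cd4 based on a production denoising model, mainly by adding conversion between the NHWC and NHWCD4 formats. Validated at about 15% faster on the production denoising model.
Add support for the int1 data type.
Support np.newaxis(None) in tensor indexing.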

Improvements

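Common components

Optimize performance: most networks train about 10% faster; heavily host-bound vit and detection models are 20%~40% faster in QAT scenarios.
Improve the performance of the op dispatch system. The performance problem of the new dispatch system used in v1.8 is fixed; performance is now on par with v1.7.
Improve the jit trace performance of the dispatch system; performance is slightly better than v1.7. With trace enabled, training performance of some models improves as follows: ResNet50 by 0.7%, ShuffleNet by 9%, ATSS by 10%.
subgraph op supports shape inference and jit fusion optimization, and some poorly performing ops composed from elemwise have been rewritten with subgraph op. After the optimization, megengine.functional.nn.hsigmoid, megengine.functional.nn.relu6, megengine.functional.nn.prelu, megengine.module.LeakyReLU, megengine.functional.nn.softplus, megengine.functional.nn.logsigmoid, and megengine.functional.where are on par with pytorch for large input shapes.
Improve batch_norm performance; 4.3x faster at small sizes.
Optimize reduce op performance; 75% faster.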

CUDA

Fuse conv and h_swish; some models run faster.

MegEngine Lite

Bug fixes

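lite: fix incomplete symbolvar replacement in the global graph optimization interface, which made it unusable on cuda devices.
Fix a conflict between the global graph optimization interface and the fast-run interface for lite models in load_and_run.
Fix load_and_run raising an error when the "--cuda" option is used.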

New Features

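Add error codes to the lite-c interface, plus LITE_get_last_error_code for retrieving the last error code globally.
lite adds an interface for querying the physical address from a virtual address.
load_and_run supports global graph optimization of lite models.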

Improvements

Optimize the performance of the get_data_by_share python interface in Lite. Models in the algorithm repository see a slight performance improvement.

MegEngine

Bug fixes

Python API

Make the "@" operator behave consistently with megengine.functional.matmul.
Fix the output tensor of megengine.functional.nn.pad possibly being all 0.
Add the nearest mode for megengine.functional.nn.remap and megengine.data.transform.Resize.

Common components

Fix megengine.functional.nn.sync_batch_norm not being available when training with mixed precision.
Fix a bug when fusing conv bias and two nonlinear oprs.
Fix the problem of poor performance of the large kernel convolution without fastrun.
Fix gm attach of a non-float type not reporting an error and producing no gradient.
Fix the IO interruption in RPC communication during distributed training.
Fix BatchNorm support for higher-order differentiation.

New Features

Python API

Add the padding_mode parameter, supporting the zeros, reflect, and replicate modes, for megengine.functional.nn.conv1d and megengine.functional.nn.conv2d.

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.
The nearest mode of resize on cuda supports multi-channel inputs beyond 1 and 3 channels.

Common components

Add conversion between NHWC and NHWCD4 formats.
Add support for int1 dtype.
Add np.newaxis(None) for tensor indexing.

Improvements

Common components

Optimize performance: most networks speed up by about 10%; host-bound-heavy VIT or detection models in QAT scenarios speed up by 20% to 40%.
Improve the performance of the op dispatch system. Fix the performance problems of the new dispatch system in version 1.8. After the repair, the performance is the same as that of version 1.7.
Improve the jit trace performance of the dispatch system. The performance is slightly improved compared to the 1.7 version. When trace is enabled, the training performance of some models is improved as follows: resnet50 0.7%, shufflenet 9%, and atss 10%.
Subgraph op supports shape infer and jit fusion optimization, and some ops are rewritten with it. Performance of megengine.functional.nn.hsigmoid, megengine.functional.nn.relu6, megengine.functional.nn.prelu, megengine.module.LeakyReLU, megengine.functional.nn.softplus, megengine.functional.nn.logsigmoid, and megengine.functional.where is on par with pytorch for large input shapes.
Improve the performance of the op batch_norm by 4.3 times for small sizes.
Improve the performance of the op reduce; 75% faster.

CUDA

Fusion of conv and h_swish, the performance of some models is improved.

MegEngine Lite

Bug fixes

Fix lite global layout transform symbolvar replace error.
Fix the conflict between load_and_run lite model global layout transform optimization interface and fast-run interface.
Fix load_and_run error when using "--cuda" parameter.

New Features

Add 'LITE_get_last_error_code' interface in lite-c.
Add an interface for getting the physical address in lite.
Load_and_run supports lite model global layout transform optimization.

Improvements

Optimize the get_data_by_share interface of LiteTensor.

v1.8.2

2 years ago

MegEngine

Known Issue

GPU memory usage (MiB) for training and inference increases to varying degrees across models.

New Features

CUDA

Add a direct conv implementation for large kernels.
Add an implicit bmm implementation of large-kernel depthwise conv.

MegEngine

New Features

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.

v1.8.1

2 years ago

MegEngine

Notice

Support for python3.5 will end starting with the next release, MegEngine v1.9; please prepare in advance.

HighLight

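megengine.functional.topk adds "descending" to define the sort order. In this release it defaults to "False", keeping ascending order, and a warning is shown if it is not specified. In v1.12 the default of "descending" will change to true to match the usual definition of topK, i.e. selecting the Top-K largest elements from a 2-D matrix.
MegEngine supports on-device training.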
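The effect of the `descending` flag on top-k selection can be shown with a pure-Python model. This only illustrates the ordering semantics described above; the `topk` function here is a hypothetical stand-in and does not reproduce the signature or return types of megengine.functional.topk.

```python
def topk(values, k, descending=False):
    """Return (top-k values, their indices) in the requested order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    picked = order[:k]
    return [values[i] for i in picked], picked


scores = [5, 1, 9, 3]
print(topk(scores, 2))                   # descending=False: the k smallest
print(topk(scores, 2, descending=True))  # descending=True: the k largest
```

Passing `descending` explicitly, as the note recommends during the transition period, keeps results stable across the planned default change.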

Bug fixes

Python API

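Fix incorrect results of megengine.functional.floor_div for integer inputs with opposite signs.
Let megengine.functional.broadcast_to accept None, meaning that dimension needs no broadcasting, to support automatic inference of -1 shape.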
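Opposite-sign operands are exactly where floor division and truncating division disagree. The note does not say what the incorrect result was, so the comparison below is only an illustration of the expected floor semantics, using plain Python integers rather than MegEngine tensors.

```python
import math


def trunc_div(a: int, b: int) -> int:
    """Divide and round toward zero (C-style integer division)."""
    return math.trunc(a / b)


def floor_div(a: int, b: int) -> int:
    """Divide and round toward negative infinity."""
    return a // b


for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(f"{a:>2} / {b:>2}: trunc={trunc_div(a, b):>2} floor={floor_div(a, b):>2}")
```

The two functions agree when the signs match and differ by one when they do not, for example -7 and 2 give -3 under truncation and -4 under floor division.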

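Release process

Fix graph surgery failing when a TM model serialized by MegEngine v1.7 is loaded by MegEngine v1.8.
TracedModule bug fixes:
- Fix ops in third-party backends not being serializable.
- Fix Input-type exprs not being bound to top_graph.
- Fix the insertion position of an expr being computed incorrectly when a ModuleNode is used as input during graph surgery.
- Fix TracedModule being unable to run models from v1.7 and earlier that contain ones or zeros.
- Fix TracedModule recursing too deeply in some cases.
- Fix TracedModule being unable to trace repeatedly.
- Fix TracedModule not recognizing pad correctly.
- Improve TracedModule's error messages for invalid inputs.
Fix an error when global graph optimization and fastrun are both enabled and naive is the only algorithm selected.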

CUDA

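Check earlier whether the input Tensor is too large and improve the error message, instead of surfacing the raw cuDNN error.
Fix incorrect output inference when tensorrt changes shape.

Common components

Fix the ConvBias operator of MegDNN fallback being unavailable.
Fix matmul and pooling in a model not being fastrun properly after graph optimization.
Fix MegEngine failing to compile in low-memory environments (8G).
Fix extra memory being used when converting a large numpy array to a tensor or a large tensor to a numpy array.
Add checking and reporting of asynchronous errors on compute devices.
Fix indexing operations not being traceable when the tensor's ndim is unknown.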

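Peripheral tools

Fix data passed on the load and run command line not being parsed.
Fix qint4 and bool data types being dumped incorrectly in io dump.
Fix megengine.utils.module_stats being unusable because required libraries were not imported.
Fix an error when building load and run with cuda.
Remove the dump_with_testcase tool.
Fix load and run not recognizing models serialized with flatbuffer.
Fix module_stats, the parameter and FLOPs statistics tool, failing when inputs is a dict.
Fix incorrect weight information reported by the load and run tool with the --get-static-mem-info option.
Fix a parsing error in the load_and_run tool for options of the form --input "ratio:1.0".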

New Features

Python API

Add the megengine.functional.diag operator.

Release process

TracedModule supports renaming Nodes during graph surgery.
Add an enable_expr_checker switch to TracedModule to perform more checks during trace.

ARM

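Optimize some math computations on Arm; slight performance improvement.
The ARM backend supports the rnn_cell/lstm_cell/lstm operators.
Add multithreading support for some elemwise cases, to support performance optimization of some models in the TS project.

Third-party hardware

Add support for Cambricon MLU270.
TensorRT Runtime Opr supports dynamic-shape models and can automatically select the closest "IOptimizationProfile" for the input shape.

Common components

CPU supports running int4 models.
megengine.functional.nn.remap supports gradients with dtype float16.
Optimize typecvt performance in the non-contiguous case.
Add on-device training support.
On windows, load_and_run adds support for dynamically linking MegEngine.

Peripheral tools

Add a cmake formatting tool that formats cmake-related files when run.
Add a JIT build tool for Custom Op; documentation to follow.
Support building Android whl packages.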

Improvements

Python API

Reduce elemwise overhead in the megengine.random.RNG.uniform API when low=0 & high=1; single-operator performance improves by about 75%.

CUDA

Improve megengine.functional.nn.softmax performance on CUDA when axis is a scalar, by about 200%~450%.
Improve megengine.functional.nn.dropout performance on CUDA by up to about 650%.
Improve megengine.functional.nn.layer_norm performance on CUDA by up to about 540%.

ARM

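When a tensor undergoes int16/uint16 → float conversion and the converted data then goes through Mul/ADD, the operations are fused into ElemwiseMultiType; validated on model 369 of project 010 at about 20x faster (23512.8us → 1208 us).

Common components

Dynamic AMP performance improves by about 1%, validated on multiple models.
Reduce jit.trace time on cpu. Validated with VGG16 at bs 256: jit.trace drops from about 4 minutes to 2 minutes.
Fix models running too slowly on cpu; validated on VGG16 at bs 10, from 10 minutes down to about 6s.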

MegEngine Lite

Bug fixes

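Fix the incorrect device id setting of TensorBatchCollector in lite.
The to_numpy method of an empty tensor in Lite now includes the data type of the output Tensor.
Fix inference failing for some models when the user customizes the model's output memory.
Fix the device configuration interface of MegEngine Lite so that only devices whose device type is xpu are set to the user-specified device type.
Fix the MegEngine Lite python interface printing no error log when the batch id of TensorBatchCollector is wrong.
Fix an error when "record level 2" is enabled in MegEngine Lite.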

New Features

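Add Cambricon support to lite.
MegEngineLite adds an interface named get_data_by_share, through which users can obtain the numpy object of a lite tensor with zero copy.
Add cv classification and detection examples.
Add global graph optimization support.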

MegEngine

Notice

Drop support for python3.5 from MegEngine v1.9.

HighLight

megengine.functional.topk will default to descending order in v1.12. Please specify the "descending" argument during the transition period.
MegEngine supports device training.

Bug fixes

Python API

Correct behavior of megengine.functional.floor_div for integers with opposite sign.
Allow passing None to megengine.functional.broadcast_to , meaning the corresponding axis should not broadcast.

Release process

Fix a compatibility issue with TracedModule.
TracedModule bug fixes:
- Fix the problem that ops in third-party backend such as tensorrt can not be serialized.
- Fix the problem that input expr failed to bind top_graph.
- Fix the problem of incorrect calculation of expr insertion position when ModuleNode is used as input of graph operation.
- Fix a bug of v1.7: the model with ones or zeros can't work.
- Fix a recursion too deep issue when copying traced module.
- Fix an error that prevents traced module from tracing a module more than once.
- Fix traced module not recognizing pad.
- Improve error message for illegal inputs feed into traced module.
Fixed the problem that when global graph optimization and fastrun are enabled at the same time, an error will be reported when the selected algorithm is only naive.

CUDA

Check earlier whether the input Tensor is too large and improve the error message, instead of directly surfacing the cuDNN error.
Fix output inference error when tensorrt changes shape.

Common components

Fix the problem that the ConvBias operator of MegDNN fallback is not available.
matmul, pooling operators support fastrun, which will lead to better inference performance for C++ models.
Fix the build failure in low-memory environments (8G).
Reduce memory consumption when a large numpy array is converted to tensor or a large tensor is converted to numpy array.
Add out-of-bound access check for some operators.
Fix the problem that the indexing operation cannot be traced when the ndim of the tensor is unknown.

Peripheral tools

Fixed the problem that the data entered in the load and run command line could not be parsed.
Fix qint4 and bool data type dump errors in io dump.
Fix the problem that megengine.utils.module_stats cannot be used without import related libraries.
Fix load and run build error when build with CUDA.
Remove dump_with_testcase tool.
Fix the problem that load and run cannot recognize the serialized model with flatbuffer.
Fix a bug in megengine.tools.network_visualize when inputs is an instance of dict.
Fix a bug where users get wrong statistics when using --get-static-mem-info.
Fix a parsing error in load_and_run for options like --input "ratio:1.0".

New Features

Python API

Add megengine.functional.diag operator.

Release process

Support modifying the name of a node during graph operation in TracedModule.
Add an enable_expr_checker switch for traced module, which adds more checks during tracing.

ARM

Optimize the implementation of some mathematical calculations in arm, the performance is slightly improved.
Add arm rnn_cell/lstm_cell/lstm operator.
Support part of arm ternary elemwise multithread.

Third-party hardware

Added support for cambricon MLU270.
Support dynamic shape models in TensorRT Runtime Opr and automatically select the closest IOptimizationProfile according to the input shape.

Common components

CPU supports running int4 model.
Support backward computation for float16 dtype in remap.
Optimize the performance of typecvt in non-continuous situations.
Add training based on the cpp interface.
For windows system, load_and_run supports dynamically linking megengine now.

Peripheral tools

Added a cmake formatting tool: cmakeformat.py.
Add the JIT builder for Custom Op.
Support build python wheel for Android(termux env).

Improvements

Python API

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.

CUDA

Improve performance of softmax when axis is scalar on CUDA platforms, by 200% - 450%.
Enhance performance of dropout on CUDA platforms by up to 650%.
Enhance performance of layer_norm on CUDA platforms, by up to 540%.

ARM

Add an operator fusion case for TypeCvt and Elemwise. A pass fuses a TypeCvt (uint16 to float) operator and one Elemwise operator (Mul/ADD) into an ElemwiseMultiType operator, with a corresponding kernel developed on aarch64.

Common components

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.
Optimize the placement order of algorithms in matrixmul under the x86 platform in dnn to improve the dump time of jit.trace(bs256 VGG16, 4min -> 2min).
Fix the problem that the model speed on CPU is too slow (bs10 VGG16,10min -> 6s).

MegEngine Lite

Bug fixes

Fix the device ID setting error of tensorbatchcollector in lite.
Add data type information when call empty tensor to_numpy method.
Fix the problem that some model inferences fail when users customize the output space of the model.
Fix device type configuration for megengine lite. Now only the devices of which the device type is unspecified will be modified.
Add warning for megengine lite python interface, when error of batch indexes occurs in the TensorBatchCollector.
Fix runtime error when record level of megengine lite is 2.

New Features

Add interface for cambricon models in lite.
Add a new interface in megenginelite tensor module named get_data_by_share. A zero-copy numpy object will be returned containing data of a lite tensor object.
Add classification and detection examples in lite.
Add megenginelite Python & c/c++ global graph optimization interface.

v1.8.1.m1

2 years ago

v1.8.0

2 years ago

v1.7.2.m1

2 years ago

v1.7.0

2 years ago

MegEngine

HighLight

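The dump_with_testcase_mge.py script is removed and its functionality is moved into megengine.jit.trace.dump; for usage, see "Exporting serialized model files".
MgeConvert converts mge/TracedModule models into third-party model files, supporting the Caffe, TFLite, and ONNX frameworks. See the ReadMe for usage.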

Bug Fixes

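Common components

Fix syntax issues under python3.8.
Fix the learning rate not being allowed to be 0.
Add special-case handling for exponentiation to ensure consistent results, e.g. between x**2 and x*x, and between x**3 and x*x*x.
Fix an issue with a production det int4 model when dumping with global graph optimization. After the fix, global graph optimization checks the format of each opr and is skipped if the format does not match nchw.
Fix overflow in tensor.mean() when computing in fp32 and outputting fp16.
Fix excessive exceptions after a grad rule fails; a condition check now emits only the necessary exceptions, which makes gdb debugging easier.
Fix tensors not being convertible to the quantized int4 type.
Fix related resources not being released when DTR is disabled.
Fix DTR square-root sampling not being random; after the fix, resnet1202 training can be 5% faster.
Remove all swap interfaces in DTR.
Defragmentation is enabled by default for GPU memory allocation; the enable_defrag interface is removed.
Fix the -n option of the scripts/cmake-build/*.sh scripts.
Build nccl with the official build script.
Make fbs serialization use the correct version number.
Support setting the fast-run workspace limit in imperative, to resolve OOM when training with fastrun.
Fix a runtime error of megengine.functional.nn.layer_norm under AMP.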

CUDA

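When CUBLAS_VERSION < 11200 (CUBLAS older than 11.2) and batch is 1, disable the cublas batched matmul algorithm if the shape is too large, to avoid crashes.
Fix algorithm lookup so that the correct algorithm is found when the kernel size is large (e.g. 160x160).
Fix illegal memory access in convolution under CUDA.
Fix cudnnConvBiasActivation crashing for some models under cuda11.
Fix runtime errors for models with BatchConvBias when nchw32 graph optimization is enabled.

Python API

Add layer_norm API documentation.
megengine.utils.module_stats supports dict input.
megengine.functional.full: the dtype of the return value is now the dtype of the number passed in, rather than defaulting to float32.
Fix models using PReLU not being traceable.

Basic components

Improve the descriptions of low bit types.
Make the tensors and named_tensors interfaces of module return only Tensor-typed data.
Fix accuracy issues caused by mismatched gcc and clang compile options.

Peripheral tools

The load and run tool now supports u16 and s16 input data for inference.
Fix the "--get-static-mem-info" option of load_and_run, which reports model memory, not taking effect.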

x86

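Fix incorrect results of the conv_bias Operator.

Release process

Fix incorrect trace state after inserting a custom QATModule into TracedModule.
Fix some Nodes not being renamed in the graph after TracedModule is flattened.
Fix TracedModule being unable to insert Modules with the same _name attribute.
Make functions not defined in megengine.functional.nn correctly traceable by TracedModule.
Fix module dict iteration not being traceable in TracedModule.
Fix incorrect node.users on flatten when functional calls share the same tensor node.

OpenCL

Change the cache update policy of OpenCL algorithm search: when cache keys are equal, the latest entry is used instead of the previous one, fixing slowdowns caused by kernels being compiled from source after the cache was replaced.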

Compatibility violation

ARM

Optimize Sigmoid compute performance on Arm.

New Features

Python API

Add the pixel_shuffle opr.
megengine.random.permutation supports Tensor input.
Add the megengine.random.shuffle opr.
Add support for layer_norm.
Add megengine.functional.nn.hinge_loss.
Add megengine.functional.nn.pad.
megengine.functional.nn.local_response_norm: add LRN (local response normalization) support in Functional and Module.
megengine.functional.nn.conv_transpose2d and megengine.functional.nn.conv_transpose3d support the group parameter.
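
For readers unfamiliar with pixel_shuffle from the list above, here is a minimal pure-Python sketch of the usual depth-to-space rearrangement from shape (C*r*r, H, W) to (C, H*r, W*r). It assumes the common channel layout in which channel index c*r*r + i*r + j feeds offset (i, j) of each r x r output block; the notes do not state MegEngine's layout, so treat this as an illustration rather than the library's implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange nested lists of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[None] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


# Four 1x1 channels become one 2x2 channel.
x = [[[0]], [[1]], [[2]], [[3]]]
shuffled = pixel_shuffle(x, 2)
print(shuffled)  # [[[0, 1], [2, 3]]]
```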

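Common components

Add the megengine.coalesce_free_memory interface for reclaiming free memory.
Fix an error raised by module.PixelShuffle.
Support constructing uint16 tensors.
BatchNorm supports the NHWC format.
Support building flatbuffer with bazel; a load and run built with --copt "-DMGB_ENABLE_FBS_SERIALIZATION=1" can run models dumped by the open-source version of MegEngine.
Support using MegEngine's dynamic library directly on Windows to reduce the size of the Windows whl package.
Fix a crash when MegEngine and Pytorch are used in the same process.
Enable defrag automatically during training, so that GPU memory fragments can be merged when memory is insufficient and heavily fragmented.
Support customizing the memory address of output tensors in record mode.
Add the bgr2gray mode to the megengine.functional.nn.cvt_color opr.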

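Peripheral tools

Remove dump_with_testcase_mge.py and move its functionality into the jit.dump interface.
Add debugging scripts for MegEngine custom containers such as SmallVector, so that container contents can be inspected in gdb, as shown in the figure below. Note: the scripts take effect automatically only when gdb is run from the MegBrain root directory.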

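Basic components

The Reduce operator supports the nchw44 format.
The Elemwise operator supports the nchw88 format.
Add a graph optimization that fuses conv and typecvt for nhwc int8 conv.
megengine.optimizer.SGD supports nesterov momentum.
Support memory/GPU memory visualization for C++ models.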
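
To make the nesterov momentum item above concrete, the sketch below shows one widely used formulation of the update for a single scalar parameter. The notes do not say which variant megengine.optimizer.SGD uses, so the formula, the function name `sgd_step` and its arguments are assumptions for illustration.

```python
def sgd_step(param, grad, velocity, lr, momentum, nesterov=False):
    """One SGD update for a scalar; returns (new_param, new_velocity)."""
    velocity = momentum * velocity + grad
    if nesterov:
        # Look ahead: combine the current gradient with the updated velocity.
        direction = grad + momentum * velocity
    else:
        # Classical momentum: step along the velocity itself.
        direction = velocity
    return param - lr * direction, velocity


p_nag, v_nag = sgd_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9, nesterov=True)
p_cls, v_cls = sgd_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9, nesterov=False)
```

On the first step from zero velocity, the Nesterov variant takes a larger step than classical momentum because it adds the momentum-scaled velocity on top of the raw gradient.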

ARM

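Optimize transpose performance for non-contiguous inputs, with a 1.3x speedup on aarch64.
The Arm platform supports elemwise with NHWC Int16 input and fp32 output.
Add linear upsample2 support for nchw, nchw44+fp32 and nchw88+fp16, greatly improving linear resize performance on ARM.
Optimize the channel-wise conv implementation, with a 1.5x speedup when the feature map is small.
conv supports fp16 in the nchw88 format.
Optimize the non-contiguous relayout kernel on Arm; with record enabled, average inference time on sdm660 drops by about 50%, and the share of time spent in the subtensor opr drops by about 70%.
Optimize the N1HW broadcast case in Arm Elemwise; the performance of a single computation improves by more than 100%.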

X86

Add 6x16 matmul support with the reproducible attribute on X86.

CUDA

The BatchNorm operator supports the nhwc format.

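Release process

The naming rules of TracedModule nodes have changed, and GetAttr exprs are fused according to their purpose. A node now has two names, name and qualname, which follow the rules below. The usage of node._name is unchanged; uses of node._orig_name must be changed to node.qualname.
qualname:
- The qualname identifies which Module, or which operation within a Module, produced the Tensor or Module.
- When the Tensor or Module corresponding to the Node is an attribute of the model, qualname is the same as the attribute path.
- When the Tensor corresponding to the Node is produced by calling a Module, qualname = the path of that Module in the model + ".[out]".
- When the Tensor corresponding to the Node is produced by calling a function, qualname = the qualname of the Graph containing the Node + ".[func_<function name>_<number of calls to the function in the current Module>]".
- When the Tensor corresponding to the Node is produced by calling a method of a Tensor, qualname = the qualname of the Graph containing the Node + ".[method_<method name>_<number of calls to the method in the current Module>]".
- When the Tensor corresponding to the Node is produced by calling an opdef, qualname = the qualname of the Graph containing the Node + ".[def_<opdef name>_<number of calls to the opdef in the current Module>]".
- Otherwise: qualname = the qualname of the Graph containing the Node + "[.<Node name>]".
name:
- If the Tensor or Module corresponding to the Node is an attribute of the model, name is the "difference" between the Node's qualname and the Graph's qualname, followed by deduplication: name = Node qualname - qualname of the Graph containing the Node. For example: ResNet18.layer1.block0.conv - ResNet18.layer1 = block0.conv → block0_conv → block0_conv + deduplication suffix.
- When the Tensor corresponding to the Node is produced by calling a Module, name = the name of the ModuleNode for that Module + "_out" + deduplication suffix.
- When the Tensor corresponding to the Node is produced by calling a function, name = function name + "_out" + deduplication suffix.
- When the Tensor corresponding to the Node is produced by calling a method of a Tensor, name = method name + "_out" + deduplication suffix.
- When the Tensor corresponding to the Node is produced by calling an opdef, name = opdef name + "_out" + deduplication suffix.
- Otherwise: name = the formal parameter name or a user-defined name.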
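
The naming rules above can be sketched in a few lines of Python. Both helpers are hypothetical illustrations and not part of the TracedModule API. The "_<n>" form of the deduplication suffix is an assumption, since the notes do not specify it, and the call count is passed in as-is because its starting value is not stated either.

```python
def attribute_node_name(node_qualname: str, graph_qualname: str, used: set) -> str:
    """Derive a Node name from its qualname relative to the enclosing Graph."""
    prefix = graph_qualname + "."
    if not node_qualname.startswith(prefix):
        raise ValueError("node qualname is not under the graph qualname")
    # "Subtract" the Graph qualname, then replace dots with underscores.
    base = node_qualname[len(prefix):].replace(".", "_")
    # Deduplicate; the "_<n>" suffix format is an assumption.
    name, n = base, 0
    while name in used:
        n += 1
        name = f"{base}_{n}"
    used.add(name)
    return name


def call_qualname(graph_qualname: str, kind: str, callee: str, count: int) -> str:
    """Build a qualname for a func/method/def call, per the rules above."""
    return f"{graph_qualname}.[{kind}_{callee}_{count}]"


used_names = set()
first = attribute_node_name("ResNet18.layer1.block0.conv", "ResNet18.layer1", used_names)
second = attribute_node_name("ResNet18.layer1.block0.conv", "ResNet18.layer1", used_names)
relu_qualname = call_qualname("ResNet18.layer1", "func", "relu", 0)
```

Here `first` reproduces the block0_conv example from the rules, and `second` shows how a repeated name would pick up a suffix under the assumed scheme.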

MegEngine Lite

Bug Fixes

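Fix the performance profiling data being 0 on the OpenCL platform when the enable_profile_performance interface is called.
Fix tensors and similar objects being unusable by other threads after their thread exits; they are now global variables protected by a lock.
Fix an incorrect tensor attribute type set for rknn in lite.
Fix MegEngineLite failing to compile with midout.
Fix compilation errors in the pure C interface of lite.
Fix incorrect inference of output tensor shapes after a MegEngine Lite model is loaded.
Fix errors in asynchronous callback execution in the lite Python interface.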

New Features

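Add the LITE_get_static_memory_alloc_info interface for static memory analysis and visualization.
Support obtaining the model's output Tensor Shape before execution.
Lite supports user-specified input and output memory addresses.
Load and run has been refactored into Lite, and the original directory "sdk/load-and-run" has been removed. Documentation is being written and will be provided with the official release.
C interface callbacks in Lite support passing a void* parameter.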

MegEngine

HighLight

The dump_with_testcase_mge.py script has been removed and its functionality moved into megengine.jit.trace.dump; see Export Serialized Model File.
MgeConvert can convert mge/TracedModule models into third-party model files and supports the Caffe, TFLite and ONNX formats. For usage, see the ReadMe.

Bug Fixes

Common components

Fix the syntax problem under python3.8.
Fix the problem that the learning rate cannot be 0.
Add special-case handling for exponentiation so that results are consistent, e.g. x**2 matches x*x and x**3 matches x*x*x.
Fix a problem when dumping the business-line det int4 model with global graph optimization. After the fix, global graph optimization checks each opr's format and is skipped if the format does not match nchw.
Fix overflow in tensor.mean() when computing in fp32 and outputting fp16.
Fix the excessive number of exceptions raised after a gradient rule fails; a conditional check now emits only the necessary exceptions, which makes gdb debugging easier.
Fix the inability to convert a tensor to a quantized int4 type.
Fix related resources not being released when DTR is disabled.
Fix DTR square-root sampling not being random; after the fix, resnet1202 training speed can improve by 5%.
Remove all swap interfaces from DTR.
Enable defragmentation by default in GPU memory allocation and remove the enable_defrag interface.
Fix the -n option of the scripts/cmake-build/*.sh scripts.
Build nccl with the official build script.
Fix fbs serialization to use the correct version number.
Support setting the fast-run workspace limit in imperative mode, to resolve OOM when training with fastrun enabled.
Fix the runtime error of megengine.functional.nn.layer_norm under AMP.
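
The tensor.mean() overflow item in the list above is easier to picture with numbers. The sketch below shows one way such an overflow can arise: the sum is narrowed to fp16 before the division, so it exceeds the fp16 maximum (65504) even though the mean itself fits. This mechanism is an assumption for illustration, since the notes do not describe the actual bug. Python floats (fp64) stand in for fp32, and the `struct` "e" format emulates fp16 rounding.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; emulate IEEE overflow to inf.
        return math.copysign(math.inf, x)


def mean_cast_sum_first(values):
    """Narrow the sum to fp16 before dividing (overflow-prone)."""
    total = to_fp16(sum(values))
    return to_fp16(total / len(values))


def mean_cast_last(values):
    """Divide in high precision, then narrow only the final result."""
    return to_fp16(sum(values) / len(values))


values = [1000.0] * 100          # sum is 100000, above the fp16 maximum
overflowed = mean_cast_sum_first(values)
safe = mean_cast_last(values)
```

Narrowing only the final result keeps the intermediate sum in the wider type, which is the general idea behind computing in fp32 and outputting fp16.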

CUDA

When CUBLAS_VERSION < 11200 (CUBLAS older than 11.2) and batch is 1, disable the cublas batched matmul algorithm if the shape is too large, to avoid a crash at runtime.
Fix algorithm selection so that a correct algorithm can be found for large kernel sizes (e.g. 160x160).
Fix an illegal memory access in convolution on CUDA.
Fix a cudnnConvBiasActivation crash in some models under cuda11.
Fix a runtime error in models with BatchConvBias when nchw32 graph optimization is enabled.

Python API

Add API documentation for layer_norm.
megengine.utils.module_stats supports dict input.
megengine.functional.full: the returned dtype now follows the dtype of the value passed in, instead of defaulting to float32.
Fix models that use PReLU not being traceable.

Basic components

Improve the descriptive information for low-bit types.
Fix the tensors and named_tensors interfaces of module so that they return only Tensor-typed data.
Fix precision problems caused by mismatched gcc and clang compile options.

Peripheral tools

Inference with the load and run tool now supports u16 and s16 input data.
Fix the load_and_run option "--get-static-mem-info" for obtaining model memory not taking effect.

x86

Fix a computation error in the conv_bias operator.

Release process

Fix an incorrect trace state after inserting a custom QATModule into a TracedModule.
Fix some Nodes not being renamed in the graph after a TracedModule is flattened.
Fix TracedModule being unable to insert Modules with the same _name attribute.
Fix functions not defined in megengine.functional.nn so that TracedModule can trace them correctly.
Fix traversal of a module dict not being traceable in TracedModule.
Fix node.users being incorrect after flatten when functional calls use the same tensor node.

OpenCL

Change the cache update strategy of the OpenCL algorithm search: when cache keys are identical, the newest entry is used instead of the previous one. This resolves a slowdown in which swapping the cache triggered kernel compilation from source.

Compatibility violation

ARM

Improve Sigmoid performance on Arm.

New Features

Python API

Add the pixel_shuffle opr.
megengine.random.permutation supports Tensor input.
Add the megengine.random.shuffle opr.
Add support for layer_norm.
Add megengine.functional.nn.hinge_loss.
Add megengine.functional.nn.pad.
megengine.functional.nn.local_response_norm: add LRN (local response normalization) support in Functional and Module.
megengine.functional.nn.conv_transpose2d and megengine.functional.nn.conv_transpose3d support the group parameter.
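
As a reference point for the hinge_loss item in the list above, the sketch below computes a multi-class hinge loss in pure Python: max(0, 1 - pred * label) per element, summed over classes and averaged over samples. The "L1"/"L2" `norm` switch and the reduction order are assumptions modelled on common hinge-loss variants; consult the MegEngine API documentation for the exact signature and behaviour.

```python
def hinge_loss(pred, label, norm="L1"):
    """pred, label: N rows of C values each; labels are +1 or -1."""
    per_sample = []
    for p_row, l_row in zip(pred, label):
        # Element-wise margin violation, clamped at zero.
        margins = [max(0.0, 1.0 - p * l) for p, l in zip(p_row, l_row)]
        if norm == "L2":
            margins = [m * m for m in margins]
        per_sample.append(sum(margins))
    return sum(per_sample) / len(per_sample)


pred = [[0.5, -0.5, 0.1], [-0.6, 0.7, 0.8]]
label = [[1, -1, -1], [-1, 1, 1]]
loss = hinge_loss(pred, label)   # (2.1 + 0.9) / 2, i.e. about 1.5
```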

Common components

Add the megengine.coalesce_free_memory interface for reclaiming free memory.
Fix an error raised by module.PixelShuffle.
Support constructing uint16 tensors.
BatchNorm supports the NHWC format.
Support building flatbuffer with bazel; a load and run built with --copt "-DMGB_ENABLE_FBS_SERIALIZATION=1" can run models dumped by the open-source version of MegEngine.
Support using MegEngine's dynamic library directly on Windows to reduce the size of the Windows whl package.
Fix a crash when MegEngine and Pytorch are used in the same process.
Enable defrag automatically during training, so that GPU memory fragments can be merged when memory is insufficient and heavily fragmented.
Support customizing the memory address of output tensors in record mode.
Add the bgr2gray mode to the megengine.functional.nn.cvt_color opr.
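
For context on the bgr2gray mode mentioned above, a BGR-to-gray conversion is commonly a weighted sum of the three channels. The sketch below uses the widely used ITU-R BT.601 luma weights; the notes do not state which coefficients or rounding MegEngine applies, so both the weights and the function name `bgr_to_gray` are assumptions for illustration.

```python
def bgr_to_gray(pixel):
    """Convert one (B, G, R) pixel to a gray value using BT.601 weights."""
    b, g, r = pixel
    return 0.114 * b + 0.587 * g + 0.299 * r


white = bgr_to_gray((255, 255, 255))   # weights sum to 1, so about 255
pure_red = bgr_to_gray((0, 0, 255))    # only the R weight contributes
```

Note that the channel order is B, G, R, so the red weight applies to the last element of the tuple.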

Peripheral tools

Remove dump_with_testcase_mge.py and move its functionality into the jit.dump interface.
Add debugging scripts for MegEngine custom containers such as SmallVector, so that container contents can be inspected in gdb, as shown in the figure below. Note: the scripts take effect automatically only when gdb is run from the MegBrain root directory.

Basic components

The Reduce operator supports the nchw44 format.
The Elemwise operator supports the nchw88 format.
Add a graph optimization that fuses conv and typecvt for nhwc int8 conv.
megengine.optimizer.SGD supports nesterov momentum.
Support memory/GPU memory visualization for C++ models; for usage, see the Static Graph Memory Visualization Tool User Guide.

ARM

Optimize transpose performance for non-contiguous inputs, with a 1.3x speedup on aarch64.
The Arm platform supports elemwise with NHWC Int16 input and fp32 output.
Add linear upsample2 support for nchw, nchw44+fp32 and nchw88+fp16, greatly improving linear resize performance on ARM.
Optimize the channel-wise conv implementation, with a 1.5x speedup when the feature map is small.
conv supports fp16 in the nchw88 format.
Optimize the non-contiguous relayout kernel on Arm; with record enabled, average inference time on sdm660 drops by about 50%, and the share of time spent in the subtensor opr drops by about 70%.
Optimize the N1HW broadcast case in Arm Elemwise; the performance of a single computation improves by more than 100%.

X86

Add 6x16 matmul support with the reproducible attribute on X86.

CUDA

BatchNorm operator supports nhwc format.

Release process

The naming rules of TracedModule nodes have changed, and GetAttr exprs are fused according to their purpose. A node now has two names, name and qualname. The usage of node._name is unchanged; uses of node._orig_name must be changed to node.qualname.

MegEngine Lite

Bug Fixes

Fix the performance profiling data being 0 on the OpenCL platform when the enable_profile_performance interface is called.
Fix tensors and similar objects being unusable by other threads after their thread exits; they are now global variables protected by a lock.
Fix an incorrect tensor attribute type set for rknn in lite.
Fix MegEngineLite failing to compile with midout.
Fix compilation errors in the pure C interface of lite.
Fix incorrect inference of output tensor shapes after a MegEngine Lite model is loaded.
Fix errors in asynchronous callback execution in the lite Python interface.

New Features

Add the LITE_get_static_memory_alloc_info interface for static memory analysis and visualization.
Support obtaining the model's output Tensor Shape before execution.
Lite supports user-specified input and output memory addresses.
Load and run has been refactored into Lite, and the original directory "sdk/load-and-run" has been removed. Documentation is being written and will be provided with the official release.
C interface callbacks in Lite support passing a void* parameter.