MegEngine Versions

MegEngine is a fast, scalable, and easy-to-use deep learning framework with support for automatic differentiation.

v1.7.0

2 years ago

MegEngine

HighLight

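The dump_with_testcase_mge.py script has been removed and its functionality moved into megengine.jit.trace.dump; see "Export Serialized Model File" for usage.
MgeConvert converts mge/TracedModule models into third-party model files and supports the Caffe, TFLite, and ONNX frameworks. See the ReadMe for usage.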

Bug Fixes

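Common components

Fix a syntax issue under Python 3.8.
Fix the issue that the learning rate could not be set to 0.
Add special-case handling for exponentiation so that results are consistent, for example between x**2 and x*x, or x**3 and x*x*x.
Fix an issue when dumping the business-line det int4 model with global graph optimization. Global graph optimization now checks the format of each opr and is skipped if the format is not nchw.
Fix overflow in tensor.mean() when computing in fp32 and outputting fp16.
Fix the excessive number of exceptions raised after a gradient rule fails; a conditional check now emits only the necessary exceptions, which makes gdb debugging easier.
Fix the issue that a tensor could not be converted to the quantized int4 type.
Fix the issue that related resources were not released when DTR was disabled.
Fix the issue that DTR square-root sampling was not random; with the fix, resnet1202 training can be up to 5% faster.
Remove all swap interfaces from DTR.
GPU memory allocation now enables defragmentation by default; the enable_defrag interface is removed.
Fix the -n parameter of the scripts/cmake-build/*.sh scripts.
Build nccl with the official build script.
Fix fbs serialization to use the correct version number.
Imperative mode supports setting a fast-run workspace limit, to address OOM when training with fastrun enabled.
Fix a runtime error of megengine.functional.nn.layer_norm under AMP.

CUDA

When CUBLAS_VERSION < 11200 (cuBLAS older than 11.2) and the batch is 1, disable the cublas batched matmul algorithm if the shape is too large, to avoid a crash.
Fix algorithm selection so that the correct algorithm is found when the kernel size is large (for example, 160x160).
Fix an illegal memory access in convolution under CUDA.
Fix a cudnnConvBiasActivation crash in some models under cuda11.
Fix a runtime error in models containing BatchConvBias when nchw32 graph optimization is enabled.

python API

Add API documentation for layer_norm.
megengine.utils.module_stats supports dict input.
megengine.functional.full: the dtype of the returned value is now the dtype of the number passed in, rather than the default float32.
Fix the issue that models using PReLU could not be traced.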

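Basic components

Improve the description of low-bit types.
Fix the tensors and named_tensors interfaces of module so that they return only data of type Tensor.
Fix a precision issue caused by mismatched gcc and clang compile options.

Peripheral tools

Inference with the load and run tool now supports u16 and s16 input data.
Fix the issue that the load_and_run option "--get-static-mem-info", which reports model memory, did not take effect.

x86

Fix a computation error in the conv_bias operator.

Release process

Fix the incorrect trace state after inserting a custom QATModule into a TracedModule.
Fix the issue that some Nodes in the graph were not renamed after flattening a TracedModule.
Fix the issue that a TracedModule could not insert Modules with the same _name attribute.
Fix tracing so that functions not defined in megengine.functional.nn can be traced correctly by TracedModule.
Fix the issue in TracedModule that iteration over a module dict could not be traced.
Fix incorrect node.users after flatten when functional calls use the same tensor node.

OpenCL

Change the cache update policy of the OpenCL algorithm search: when cache keys are identical, the newest entry is used instead of the earlier one. This fixes a slowdown where swapping the cache triggered kernel compilation from source.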

Compatibility violation

ARM

Optimize the performance of Sigmoid on Arm.

New Features

Python API

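Add the pixel_shuffle opr.
megengine.random.permutation supports Tensor input.
Add the megengine.random.shuffle opr.
Add support for layer_norm.
Add megengine.functional.nn.hinge_loss.
Add megengine.functional.nn.pad.
megengine.functional.nn.local_response_norm: add LRN (local response normalization) support in Functional and Module.
megengine.functional.nn.conv_transpose2d and megengine.functional.nn.conv_transpose3d support the group parameter.

Common components

Add the megengine.coalesce_free_memory interface to reclaim free memory.
Fix an error raised by module.PixelShuffle.
Support constructing uint16 tensors.
BatchNorm supports the NHWC format.
Support building flatbuffer with bazel; load and run built with the compile flag --copt "-DMGB_ENABLE_FBS_SERIALIZATION=1" can run models dumped by the open-source version of MegEngine.
Support using the MegEngine dynamic library directly on Windows to reduce the size of the Windows whl package.
Fix a crash when MegEngine and PyTorch are used in the same process.
Defrag is enabled automatically during training; when GPU memory is insufficient and heavily fragmented, the fragments can be merged.
In record mode, the memory address of output tensors can be customized.
Add the bgr2gray mode to the megengine.functional.nn.cvt_color OPR.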

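Peripheral tools

Remove dump_with_testcase_mge.py; its functionality is now in the jit.dump interface.
Add debugging scripts for MegEngine custom containers such as SmallVector, so that container contents can be inspected during gdb debugging, as shown in the figure below. Note: the scripts take effect automatically only when gdb is run from the MegBrain root directory.

Basic components

The Reduce operator supports the nchw44 format.
The Elemwise operator supports the nchw88 format.
For nhwc int8 conv, add a graph optimization that fuses conv and typecvt.
megengine.optimizer.SGD supports nesterov momentum.
Support visualization of C++ model memory/GPU memory.

ARM

Optimize transpose performance for non-contiguous inputs, with a 1.3x speedup on aarch64.
The Arm platform supports elemwise with NHWC Int16 input and fp32 output.
Add linear upsample2 support for nchw, nchw44+fp32, and nchw88+fp16, greatly improving linear resize performance on ARM.
Optimize the channel wise conv implementation, with a 1.5x speedup when the feature map is small.
conv supports fp16 in the nchw88 format.
Optimize the non-contiguous relayout kernel on Arm. With record enabled, average inference time on sdm660 drops by about 50%, and the share of time spent in the subtensor opr drops by about 70%.
Optimize the N1HW broadcast case in Arm Elemwise; the performance of a single computation more than doubles.

X86

Add support for a 6x16 matmul with the reproducible attribute on X86.

CUDA

The BatchNorm operator supports the nhwc format.

Release process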

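The naming rules for TracedModule nodes have changed, and the GetAttr of an expr is fused according to its purpose. A node now has two names, name and qualname, which follow the rules below. node._name is used as before; node._orig_name must be replaced with node.qualname.
qualname:
- The qualname identifies which Module, or which operation within a Module, produced the Tensor or Module.
- When the Tensor or Module corresponding to the Node is an attribute of the model, qualname matches the attribute path.
- When the Tensor corresponding to the Node is produced by calling a Module, qualname = the path of that Module in the model + ".[out]".
- When the Tensor corresponding to the Node is produced by calling a function, qualname = the qualname of the Graph containing the Node + ".[func_<function name>_<call count of the function in the current Module>]".
- When the Tensor corresponding to the Node is produced by calling a method of a Tensor, qualname = the qualname of the Graph containing the Node + ".[method_<method name>_<call count of the method in the current Module>]".
- When the Tensor corresponding to the Node is produced by calling an opdef, qualname = the qualname of the Graph containing the Node + ".[def_<opdef name>_<call count of the opdef in the current Module>]".
- Otherwise: qualname = the qualname of the Graph containing the Node + "[.<Node name>]".
name:
- If the Tensor or Module corresponding to the Node is an attribute of the model, name is the de-duplicated "difference" between the Node's qualname and the Graph's qualname, that is, name = Node qualname - qualname of the Graph containing the Node. For example: ResNet18.layer1.block0.conv - ResNet18.layer1 = block0.conv → block0_conv → block0_conv + de-duplication suffix.
- When the Tensor corresponding to the Node is produced by calling a Module, name = the name of the ModuleNode corresponding to that Module + "_out" + de-duplication suffix.
- When the Tensor corresponding to the Node is produced by calling a function, name = function name + "_out" + de-duplication suffix.
- When the Tensor corresponding to the Node is produced by calling a method of a Tensor, name = method name + "_out" + de-duplication suffix.
- When the Tensor corresponding to the Node is produced by calling an opdef, name = opdef name + "_out" + de-duplication suffix.
- Otherwise: name = the formal parameter name or a user-defined name.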
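
The attribute rule for name can be sketched in plain Python. This is only an illustration of the rule as written in these notes, not TracedModule code: the helper derive_name and the numeric form of the de-duplication suffix are assumptions, since the notes do not say what the suffix looks like.

```python
def derive_name(node_qualname, graph_qualname, used_names):
    """Derive a Node name from its qualname (hypothetical helper, attribute case only)."""
    prefix = graph_qualname + "."
    if not node_qualname.startswith(prefix):
        raise ValueError("node qualname must be nested under the graph qualname")
    # "Difference" of the two qualnames, e.g. "block0.conv".
    relative = node_qualname[len(prefix):]
    base = relative.replace(".", "_")
    # Assumed de-duplication scheme: append a counter until the name is unused.
    name, counter = base, 0
    while name in used_names:
        counter += 1
        name = f"{base}_{counter}"
    used_names.add(name)
    return name


used = set()
first = derive_name("ResNet18.layer1.block0.conv", "ResNet18.layer1", used)
second = derive_name("ResNet18.layer1.block0.conv", "ResNet18.layer1", used)
print(first, second)
```

The first call yields the bare name from the example above; the second call shows where a de-duplication suffix would come into play when the same qualname is seen again.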

MegEngine Lite

Bug Fixes

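Fix the issue that performance profiling data was 0 on the OpenCL platform when the enable_profile_performance interface was called.
Fix the issue that tensors and other objects could not be used by other threads after their thread ended; they are now global variables protected by a lock.
Fix the wrong tensor attribute type set for rknn in lite.
Fix the midout build failure of MegEngineLite.
Fix a build error in the pure C interface of lite.
Fix incorrectly inferred output tensor shapes after a MegEngine Lite model is loaded.
Fix an error in the callback function for asynchronous execution in the lite Python interface.

New Features

Add the LITE_get_static_memory_alloc_info interface for static memory analysis and visualization.
Support getting the Tensor Shape of model outputs before execution.
Lite supports user-specified input and output memory addresses.
Load and run has been refactored into Lite, and the original directory "sdk/load-and-run" is removed. Documentation is being written and will be available with the official release.
Callback functions in the Lite C interface support passing a void* argument.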

MegEngine

HighLight

The dump_with_testcase_mge.py script has been removed and its functionality moved into megengine.jit.trace.dump; see "Export Serialized Model File" for usage.
MgeConvert converts mge/TracedModule models into third-party model files and supports the Caffe, TFLite, and ONNX frameworks. See the ReadMe for usage.

Bug Fixes

Common components

Fix a syntax issue under Python 3.8.
Fix the issue that the learning rate could not be set to 0.
Add special-case handling for exponentiation so that results are consistent, for example between x**2 and x*x, or x**3 and x*x*x.
Fix an issue when dumping the business-line det int4 model with global graph optimization. Global graph optimization now checks the format of each opr and is skipped if the format is not nchw.
Fix overflow in tensor.mean() when computing in fp32 and outputting fp16.
Fix the excessive number of exceptions raised after a gradient rule fails; a conditional check now emits only the necessary exceptions, which makes gdb debugging easier.
Fix the issue that a tensor could not be converted to the quantized int4 type.
Fix the issue that related resources were not released when DTR was disabled.
Fix the issue that DTR square-root sampling was not random; with the fix, resnet1202 training can be up to 5% faster.
Remove all swap interfaces from DTR.
GPU memory allocation now enables defragmentation by default; the enable_defrag interface is removed.
Fix the -n parameter of the scripts/cmake-build/*.sh scripts.
Build nccl with the official build script.
Fix fbs serialization to use the correct version number.
Imperative mode supports setting a fast-run workspace limit, to address OOM when training with fastrun enabled.
Fix a runtime error of megengine.functional.nn.layer_norm under AMP.

CUDA

When CUBLAS_VERSION < 11200 (cuBLAS older than 11.2) and the batch is 1, disable the cublas batched matmul algorithm if the shape is too large, to avoid a crash.
Fix algorithm selection so that the correct algorithm is found when the kernel size is large (for example, 160x160).
Fix an illegal memory access in convolution under CUDA.
Fix a cudnnConvBiasActivation crash in some models under cuda11.
Fix a runtime error in models containing BatchConvBias when nchw32 graph optimization is enabled.

python API

Add API documentation for layer_norm.
megengine.utils.module_stats supports dict input.
megengine.functional.full: the dtype of the returned value is now the dtype of the number passed in, rather than the default float32.
Fix the issue that models using PReLU could not be traced.

Basic components

Improve the description of low-bit types.
Fix the tensors and named_tensors interfaces of module so that they return only data of type Tensor.
Fix a precision issue caused by mismatched gcc and clang compile options.

Peripheral tools

Inference with the load and run tool now supports u16 and s16 input data.
Fix the issue that the load_and_run option "--get-static-mem-info", which reports model memory, did not take effect.

x86

Fix the calculation error of conv_bias Operator.

Release process

Fix the incorrect trace state after inserting a custom QATModule into a TracedModule.
Fix the issue that some Nodes in the graph were not renamed after flattening a TracedModule.
Fix the issue that a TracedModule could not insert Modules with the same _name attribute.
Fix tracing so that functions not defined in megengine.functional.nn can be traced correctly by TracedModule.
Fix the issue in TracedModule that iteration over a module dict could not be traced.
Fix incorrect node.users after flatten when functional calls use the same tensor node.

OpenCL

Change the cache update policy of the OpenCL algorithm search: when cache keys are identical, the newest entry is used instead of the earlier one. This fixes a slowdown where swapping the cache triggered kernel compilation from source.

Compatibility violation

ARM

Optimize the calculation performance of Sigmoid in Arm.

New Features

Python API

Add the pixel_shuffle opr.
megengine.random.permutation supports Tensor input.
Add the megengine.random.shuffle opr.
Add support for layer_norm.
Add megengine.functional.nn.hinge_loss.
megengine.functional.nn.local_response_norm: add LRN (local response normalization) support in Functional and Module.
megengine.functional.nn.conv_transpose2d and megengine.functional.nn.conv_transpose3d support the group parameter.

Common components

Add the megengine.coalesce_free_memory interface to reclaim free memory.
Fix an error raised by module.PixelShuffle.
Support constructing uint16 tensors.
BatchNorm supports the NHWC format.
Support building flatbuffer with bazel; load and run built with the compile flag --copt "-DMGB_ENABLE_FBS_SERIALIZATION=1" can run models dumped by the open-source version of MegEngine.
Support using the MegEngine dynamic library directly on Windows to reduce the size of the Windows whl package.
Fix a crash when MegEngine and PyTorch are used in the same process.
Defrag is enabled automatically during training; when GPU memory is insufficient and heavily fragmented, the fragments can be merged.
In record mode, the memory address of output tensors can be customized.
Add the bgr2gray mode to the megengine.functional.nn.cvt_color OPR.

Peripheral tools

Remove dump_with_testcase_mge.py; its functionality is now in the jit.dump interface.
Add debugging scripts for MegEngine custom containers such as SmallVector, so that container contents can be inspected during gdb debugging, as shown in the figure below. Note: the scripts take effect automatically only when gdb is run from the MegBrain root directory.

Basic components

The Reduce operator supports the nchw44 format.
The Elemwise operator supports the nchw88 format.
For nhwc int8 conv, add a graph optimization that fuses conv and typecvt.
megengine.optimizer.SGD supports nesterov momentum.
Support visualization of C++ model memory/GPU memory; see the Static Graph Memory Visualization Tool User Guide for usage.
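
For readers unfamiliar with the nesterov momentum option mentioned for SGD above, the following scalar sketch shows one common formulation of the update. It is an assumption that MegEngine uses this exact variant; the function sgd_step and its parameter names are made up for illustration and are not MegEngine APIs.

```python
def sgd_step(param, grad, velocity, lr, momentum=0.9, nesterov=False):
    """One SGD update on a scalar parameter (hypothetical helper)."""
    # Accumulate the gradient into the velocity buffer.
    velocity = momentum * velocity + grad
    if nesterov:
        # Nesterov: look ahead by applying the momentum term once more.
        update = grad + momentum * velocity
    else:
        update = velocity
    return param - lr * update, velocity


p_plain, v_plain = sgd_step(1.0, 0.5, 0.0, lr=0.1, nesterov=False)
p_nag, v_nag = sgd_step(1.0, 0.5, 0.0, lr=0.1, nesterov=True)
print(p_plain, p_nag)
```

With the same gradient and an empty velocity buffer, the Nesterov variant takes a larger first step because the momentum term is applied to the freshly updated velocity.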

ARM

Optimize transpose performance for non-contiguous inputs, with a 1.3x speedup on aarch64.
The Arm platform supports elemwise with NHWC Int16 input and fp32 output.
Add linear upsample2 support for nchw, nchw44+fp32, and nchw88+fp16, greatly improving linear resize performance on ARM.
Optimize the channel wise conv implementation, with a 1.5x speedup when the feature map is small.
conv supports fp16 in the nchw88 format.
Optimize the non-contiguous relayout kernel on Arm. With record enabled, average inference time on sdm660 drops by about 50%, and the share of time spent in the subtensor opr drops by about 70%.
Optimize the N1HW broadcast case in Arm Elemwise; the performance of a single computation more than doubles.

X86

Add support for a 6x16 matmul with the reproducible attribute on X86.

CUDA

BatchNorm operator supports nhwc format.

Release process

The naming rules for TracedModule nodes have changed, and the GetAttr of an expr is fused according to its purpose. A node now has two names, name and qualname. node._name is used as before; node._orig_name must be replaced with node.qualname.

MegEngine Lite

Bug Fixes

Fix the issue that performance profiling data was 0 on the OpenCL platform when the enable_profile_performance interface was called.
Fix the issue that tensors and other objects could not be used by other threads after their thread ended; they are now global variables protected by a lock.
Fix the wrong tensor attribute type set for rknn in lite.
Fix the midout build failure of MegEngineLite.
Fix a build error in the pure C interface of lite.
Fix incorrectly inferred output tensor shapes after a MegEngine Lite model is loaded.
Fix an error in the callback function for asynchronous execution in the lite Python interface.

New Features

Add the LITE_get_static_memory_alloc_info interface for static memory analysis and visualization.
Support getting the Tensor Shape of model outputs before execution.
Lite supports user-specified input and output memory addresses.
Load and run has been refactored into Lite, and the original directory "sdk/load-and-run" is removed. Documentation is being written and will be available with the official release.
Callback functions in the Lite C interface support passing a void* argument.

v1.7.1.m1

2 years ago

v1.6.0

2 years ago

HighLight

trace module

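Trace Module, the MegEngine-based solution for deploying applications, is now officially open source. A Trace Module is obtained by converting an ordinary Module with the trace_module method. It consists only of MegEngine data structures, so it can be trained, serialized and deserialized, and modified by graph surgery independently of the user's source code. Its advantages are as follows.

The graph is built on mge.Module and mge.function, so it maps easily back to the source code and to the OPs of third-party frameworks.
Graph surgery is intuitive: the graph can be inspected directly to check whether the modified graph matches expectations.
Training can be done on a TracedModule.
It is based on the dynamic graph, so debugging is convenient and what you see is what you get.

Custom Op

Provides a tool for quickly integrating user-defined C++/CUDA operators into MegEngine, for better performance.

Global graph optimization

Add global graph optimization, which performs format conversion and padding automatically and improves ease of use. You no longer need to decide which graph optimization option to enable, such as "--enable-nchw32". Validation on detection models shows a speedup of about 5%. This release adds support for the cuda platform; support for more platforms such as ARM/X86 is in progress. To try it out, run: load_and_run /path/to/model --layout-transform cuda

MegEngine Lite

MegEngine Lite is now officially open source; see the user guide. MegEngine Lite is a solution that takes models straight to the SDK. It makes full use of MegEngine's efficient, multi-platform inference capability and provides concise, easy-to-use model inference interfaces in C++, C, and Python. MegEngine Lite can also plug into other inference frameworks, which eases business integration.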

Bug Fixes

Python API

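Fix incorrect output of megengine.functional.nn.dropout in inference mode.
Fix an incorrect formula in the documentation of megengine.functional.vision.warp_perspective.
Fix the API description of megengine.functional.sub.
Remove the deprecated megengine.jit.trace.trace interface.
linspace/arange: change the dtype to fp32 to support more data types.
Fix the incorrect return value of megengine.functional.nn.dropout in inference mode.

Common components

Fix GPU memory OOM after enabling fastrun when training a model.
Fix a recursion stack overflow caused by an overly long recomputation chain when DTR is enabled.
Fix the fp16 midout compilation failure.
Fix a python fastrun error to ensure the cache is read correctly.
Incomplete handling of edge cases in computation graph dump: fix the error where inf or nan was written directly as a number when exporting a json file, which made the MegHair profile_analyze.py analysis unable to parse it.
Fix the issue that a tensor could not be traced after detach.
Fix async errors being reported at unrelated locations, and add guidance to help users understand the cause of the error.
Fix a CMake build failure when LLVM-12-dev is installed on the system.
Fix a CMake build failure when the user specifies link libraries through the LD_LIBRARY_PATH environment variable.
Fix the Tensor Interpreter raising an exception multiple times for a single error. When a tensor that has already failed is used by the current OP, the error is no longer reported again and the OP is skipped.
Fix an error when building llvm with CMake.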
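
The json export item in the list above concerns a general pitfall: inf and nan are not valid JSON numbers, so a strict parser rejects a file that contains them. The sketch below shows one way to avoid this with the Python standard library. It is not the actual MegEngine fix, which these notes do not describe; the sanitize helper and the sample record are hypothetical.

```python
import json
import math


def sanitize(value):
    """Recursively replace non-finite floats with strings so the output is strict JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # "inf", "-inf" or "nan"
    if isinstance(value, dict):
        return {key: sanitize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [sanitize(item) for item in value]
    return value


# Hypothetical profiling record containing non-finite values.
record = {"opr": "conv", "time": float("inf"), "stats": [1.0, float("nan")]}

# allow_nan=False makes json.dumps reject anything that is not strict JSON.
text = json.dumps(sanitize(record), allow_nan=False)
restored = json.loads(text)
print(restored)
```

Passing the unsanitized record to json.dumps with allow_nan=False raises ValueError, which is a cheap way to catch such values at export time instead of at parse time.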

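Peripheral tools

Fix an error in the GPU static memory allocation statistics of the static memory analysis tool.
Shorten the names of gemm kernels in cutlass to fix a Windows build failure caused by too many directory levels.
Fix the issue that MegBrain did not raise an exception promptly when initialization failed while using the SNPE Loader in projects.

CUDA

Disable the TensorCore optimization for some matmul algorithms on the CUDA platform, because the implicit type conversion introduced by the TensorCore optimization has potential precision issues.
Disable the cublasLt algorithm on cuda10, because it implicitly converts fp32 inputs and then calls fp16 tensorcore, which may introduce precision issues.

ARM

Fix a Pooling computation error in ROCm.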

New Features

Python API

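interpolate supports the nearest and bilinear modes.
Add support for data layout transformation under hybrid parallelism. For multi-machine operators, axis can be specified to split or merge a high-dimensional tensor along the corresponding dimension.
Add a Python-side API for getting the compute capability of an NVIDIA GPU: megengine.get_cuda_compute_capability.

Mixed-precision training

Change how weight_scaler/bias_scaler are added: instead of adding them manually inside the model, they are now added automatically through the get_scaled_model interface.

Common components

Conv backward at fp16 precision supports the NHWC format.
Improve how async errors are reported, with more hints.

CUDA

Add the tensorcore fp16 matmul algorithm.
After the sass conv int8 optimization, a single conv is 5% to 10% faster.
Add the cutlass nhwc int8 imma conv kernel, which is about 10%-200% faster than nchw4 dp4a. Once global graph optimization is ready, the corresponding algo can be enabled to benefit from it.

ARM

Add a channel wise convolution implementation for the NCHW88 format, with about a 1.5x speedup in some fp32 scenarios.
Add an arm nchw88 fp16 channel wise convolution kernel.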

Improvements

Python API

DTR optimization: test speed across models improves by about 10% on average. On 8 GPUs, the maximum batch size reaches 500 for ResNet50, 110 for GL, and 300 for ViT.
Add megengine.functional.nn.split, replacing the previous implementation assembled from other ops, with about a 5x speedup.
megengine.functional.split, megengine.functional.cond_take, advanced indexing, Indexing(Set)MultiAxisVec, setsubtensor, and subtensor support empty tensor input.
Add a docstring for megengine.functional.sub.
Convert the python API documentation to Google style; see here for details.

v1.6.0-rc1

2 years ago

HighLight

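MegEngine Lite is a solution that takes models straight to the SDK. It makes full use of MegEngine's efficient, multi-platform inference capability and provides concise, easy-to-use model inference interfaces in C++, C, and Python. MegEngine Lite can also plug into other inference frameworks, which eases business integration.
To try it out, see the user guide and install with: python3 -m pip install megengine==1.6.0rc1 -f https://megengine.org.cn/whl/mge.html

Known Issue

Because some checks were added for special fp16 cases in fastrun, overall AMP performance drops by about 10%.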

Bug Fixes

Python API

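Fix the initialization format error of the LSQ quantization operator.
When the forward of a custom-gradient Function receives non-Tensor inputs, the error message is now more readable.
- Previous error message: RuntimeError: can not find op_trait by GenericPyOp .
- Current error message: TypeError: op GenericPyOp expect type Tensor as inputs, got int actually
Fix a spurious error when a module is auto-named through setattr and the same instance is assigned to the same attribute multiple times.
Fix the check logic in is_single_machine that could cause custom launchers to fail.
Improve the error message to include both the logical and the physical device names, to help debug computation problems caused by mismatched devices.
- Previous error message: RuntimeError: ambiguous device: cpu0:0 vs cpu0:0 .
- Current error message: RuntimeError: ambiguous device: CompNode("cpu0:0" from "xpux:0") vs CompNode("cpu0:0" from "cpux:0") .

Common components

Fix the issue that the macOS wheel package could not be built from the VS Code terminal.
Unify the unit of lb_memory to MB across the environment variable, Python, and C++, fixing an overflow when the int parameter is large.
Downgrade "span dist too large" from a warning to a debug log, to avoid showing a pile of useless warnings.
Fix a recursion stack overflow caused by an overly long recomputation chain when DTR is enabled.
Fix the fp16 midout compilation failure.
Fix a python fastrun error to ensure the cache is read correctly.

CUDA

Fix a crash when linking cutlass.
Fix the issue that related warnings were still reported even though the CUDA_CACHE_PATH environment variable was set.
Shorten cutlass file names to fix a build error on Windows.
Fix the cutlass cmake build dependencies.

Third-party hardware

Third-party loaders now report an error when reading a model fails, instead of returning a null pointer that crashes the program later.

Release process

Fix the issue that an incorrect replace opr in graph surgery changed output names.

Peripheral tools

Fix the Windows build preparation script.
Add MegEngine version information to models cached by hub. Previously, after a MegEngine update with breaking changes, old code left in the user's ~/.cache/megengine/hub could cause compatibility problems in megengine.hub.load because of mismatched API versions.
Fix an error in the GPU static memory allocation statistics of the static memory analysis tool.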

New Features

Python API

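interpolate supports the nearest and bilinear modes.
Add a quantized module for conv_transpose2d.
Add the cumsum operator, which behaves the same as in PyTorch. It differs from MegDL as follows.
- axis must be specified; the input is not silently flattened when it is omitted.
- The exclusive and reverse parameters are not exposed.
Reduce OPR supports empty Tensors, with the following behavior.
- sum: => 0
- mean: => nan
- prod: => 1
- min: raises an exception
- max: raises an exception
- sum_sqr: raises an exception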
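
The empty-Tensor behavior listed above can be mirrored on a flat Python list. This is a sketch for illustration only, not MegEngine's Reduce implementation: reduce_values is a hypothetical helper, and the choice of ValueError is an assumption because the notes only say that an exception is raised.

```python
import math


def reduce_values(values, mode):
    """Reduce a flat list of floats, mirroring the empty-input behavior listed above."""
    if mode == "sum":
        return float(sum(values))        # empty input gives 0.0
    if mode == "prod":
        return float(math.prod(values))  # empty input gives 1.0
    if mode == "mean":
        return sum(values) / len(values) if values else math.nan
    if mode in ("min", "max", "sum_sqr"):
        if not values:
            # Assumed exception type; the notes do not name one.
            raise ValueError(f"{mode} of an empty tensor is not supported")
        if mode == "min":
            return min(values)
        if mode == "max":
            return max(values)
        return sum(v * v for v in values)
    raise KeyError(f"unknown reduce mode: {mode}")


print(reduce_values([], "sum"), reduce_values([], "prod"), reduce_values([], "mean"))
```

sum and prod return their identity elements on empty input, mean has no meaningful value and yields nan, and the remaining modes are rejected, matching the list above.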

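Mixed-precision training

Change how weight_scaler/bias_scaler are added: instead of adding them manually inside the model, they are now added automatically through the get_scaled_model interface.

CUDA

Add the tensorcore fp16 matmul algorithm.

Common components

Add the environment variable MGB_REGISTER_SEGV_HANDLER (off by default). When it is turned on, MegEngine prints only its own stack trace; by default, the SEGV signal is not registered.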

v1.5.0

2 years ago

Compatibility violation

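Remove unnecessary implicit type conversions to address performance and GPU memory issues.
- Impact: numpy arrays are no longer accepted as input to functional; convert them to MegEngine Tensors first.

HighLight

DTR upgrade
- In the static graph mode built by trace, the DTR algorithm can be used to reduce the peak GPU memory of the computation graph. Compared with Sublinear, the maximum batch size for ResNet 50 rises from 350 to 450 on a single GPU and from 300 to 450 on eight GPUs.
- In dynamic graph mode, DTR can be enabled without a threshold; users no longer need to specify eviction_threshold.
Support mixed-precision training.
Add support for higher-order derivatives (experimental).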

Bug Fixes

Python API

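Fix the computation of the nvof output shape.
Check whether CUDA is initialized before forking threads in the launcher.
Fix expand_dims not handling scalars.
Fix kth_only=True being unusable in F.topk.
Do not use the cache when creating the Tensors for model parameters from a state dict, to prevent wrong results caused by in-place modification of parameters.
megengine.random.RNG: fix a system error at program exit when an RNG is defined as a global variable.
megengine.random.seed: fix resetting of the random seed, so that the same seed() value generates the same random numbers every time, consistent with numpy.

CUDA

Fix the tensorRT runtime to support int8 nchw4 input, which can reduce GPU memory usage.
Fix an error when getting the algorithm for cuDNN ConvolutionBackwardData.

Tools

Fix asan not working when enabled through cmake on Windows.
Fix toposort so that oprs are returned in definition order.
Fix an error when computing the std statistic of parameters for quantized models; fix a type error in parameter counting when the pooling kernel size is 2d; support returning dims in the statistics.

General Components

Turn off static memory statistics in TEE mode to keep the TEE environment secure.
Fix a computation error in the x86 matmul operator when the output tensor is non-contiguous.
Fix a compatibility issue in oss model serialization.
Fix the device type when dumping a model.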

New Features

Python API

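Add the lsq operator.
Remove the user-specified threshold from DTR.
Add the _has_inf opr.
Distributed training: add the user_pop function, which releases resources after the user gets a custom key-value pair.
Deprecate get_device_count_by_fork.
Add single-machine allreduce through cpu shared memory. Setting backend="auto" in the launcher enables cpu shared memory on GPUs that do not support p2p communication.
Add unfold.
Add silu and gelu.
interpolate supports the nearest and bicubic modes for inputs with channel=1 or 3.
Add ops implementing random operators such as gamma, beta, poisson, and permutation.

ARM

Add an optimization for a first-layer convolution of K1x1S1 in the nchw44 layout.

CUDA

CUDA topK supports the FP16 data type.

General Components

Fix precision jitter across multiple batches; fast-run adds an option to ignore batch size.
Modify the CUDA JIT configuration interface.
Add a feature to collect memory usage information in the computation graph.
Collective communication supports uint8.
trace, PowC, and elemwise operators support empty inputs and outputs.
Add gradient backpropagation for bn in inference mode.
Support serialization of metadata.
Add RelayoutEmitter to make complex Layout transformations on Tensors easier.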

Improvements

ARM

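Optimize ARM pooling and multithread performance.

CUDA

Refactor the generation logic of cutlass-related kernels.
Refactor the CUDA relayout format kernels.
CUDA topK supports the fp16 data type.

General Components

The Pooling operator supports fast-run parameter search.
Refactor the profiler and add support for trace.
Group convolution supports the new version of fast-run.
Optimize x86 pooling performance.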

Compatibility violation

Remove unnecessary implicit type conversions.
- Impact: functional no longer accepts numpy arrays as input; convert them to the Tensor type first.

HighLight

DTR
- Support DTR memory optimization for static graphs under trace mode. Compared with Sublinear, the maximum batch size for training ResNet 50 increases from 350 to 450 on 1 GPU, and from 300 to 450 on 8 GPUs.
- In dynamic graph mode, DTR can be used without specifying a memory eviction threshold.
Add mixed-precision training.
Support higher-order differentiation (experimental).

Bug Fixes

Python API

Fix nvof output shape computation.
Add CUDA env check before fork thread in launcher.
Fix expand_dims for scalar.
Fix F.topk with kth_only.
The cache is not used when creating the tensors corresponding to the model parameters from the state dict to prevent incorrect results caused by inplace modification of the parameters.
megengine.random.RNG : Fix the system error during the program exit.
megengine.random.seed : Fix the reset of random seed when using the same seed value.

CUDA

Fix the tensorRT runtime to support input in the nchw4 format with int8 dtype, which may reduce memory usage.
Fix an error when getting the algorithm for cuDNN ConvolutionBackwardData.

Tools

Fix asan not working on Windows when building with cmake.
Fix toposort to get definition order.
Fix module status error.

General Components

Turn off the static memory statistics function in TEE to ensure the safety of the TEE environment.
Fix the compute error of X86 matmul operator when output tensor is not continuous
Fix compatibility error of oss model.
Fix dump device error with const.

New Features

Python API

Add lsq opr.
Remove eviction threshold in DTR.
Add _has_inf opr.
Add user_pop function to get user defined key-value pair and delete the resources when the get is done.
Deprecate get_device_count_by_fork.
Enable shared memory allreduce on a single machine.
Add unfold.
Add silu and gelu.
Interpolate supports nearest and bicubic modes for tensors with the channel as 1 or 3.
Add random op's including gamma, beta, poisson and permutation.

ARM

Add optimization of first layer Convolution with param K1x1S1 in nchw44.

CUDA

CUDA topK operator supports fp16 data types.

General Components

Fix precision jitter across multiple batches, and add an option in fast-run to ignore batch size.
Modify the CUDA JIT configuration interface.
Add a feature for recording memory usage information in the computation graph.
Enable uint8 for collective communication.
Add more support for empty inputs and outputs.
Add bn inference backward.
Add support for serializing metadata.
Add RelayoutEmitter to facilitate complex layout transformations of tensors.

Improvements

ARM

Optimize ARM pooling and multithread performance.

CUDA

Refactor the generation logic of cutlass related kernels.
Refactor CUDA relayout format related kernels
CUDA topK operator supports fp16 data types

General Components

The pooling operator supports fast-run.
Refactor the profiler function and add support for MegEngine trace.
The group Convolution operator supports the new version of fast-run.
Add algo for x86 max pooling for W13S1 under NCHW88.

v1.4.0

2 years ago

Highlights

Python API

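Refactor the DTR-related APIs and fix a bug related to random operators.
Add gradient clip APIs.
Add the correlation operator.
Add the sliding_window operator.
Refactor the random number generation code to fix slow random number generation.
Add a tool for parameter/computation statistics and visualization.

Bug Fixes

CUDA

Fix an error when fusing Z in convbias with the NCHW4 layout on cuDNN8.0.4.
Fix potential hidden linking issues caused by multiple libraries using different versions of cub.

Python API

The axis parameter of argmax and argmin accepts negative values.
Raise an error if the parameters attached to the grad manager are not an iterable of tensors.
Delete existing tensors before exit to avoid a non-zero exit code.
Fix slow algorithm queries when using fastrun in dynamic graph mode.
Raise an error if the mode of batch normalization is training at dump time.
Fix a UAF in param pack.
Fix a hang when taking the gradient of gather.
Fix a bug in trace when tensors share storage.

Distributed Training

Fix a multi-thread race condition when releasing resources at program exit.

Tools

Fix bugs in the dump_with_mge.py script.
Fix CUDA pytest occasionally hanging.
Fix the missing class_colors in the VOC dataset.
Fix a bug in inplace operations on VarNode.

Documents

Fix the description of BN momentum in the API documentation.

General components

Fix a bug in the fast-run workspace limit.
Fix a bug when comparing enum and string types in elemwise mode.
Fix the optimizer modifying param.grad during step.
Fix the scale range in Lighting.
Fix missing filter rules when fast run looks up algorithms in the cache.
Model dump supports backward oprs.
Change how TensorRT Runtime Opr decides whether there is a batch dimension: the decision is now based only on the input.
Fix an error when printing a module, caused by module setattr.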

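New Features

General components

Output key log information when a dnn assert fires.
Refactor the fastrun cache structure to better support upcoming features.

Tools

Add the command-line tool mge for conveniently calling the scripts in the tools directory, with bash completion support.

Parameter/computation statistics and visualization tool

Statistics now cover norm and pooling.
Add statistics for activations (which reflect GPU performance better than flops).
Add recursive changing of module status.
The input_size parameter of module_stats is renamed to input_shapes, indicating that multiple shapes are supported.
flops/parameters/activations can be returned as values.
Support turning off printed output, for easier integration into automation scripts.

Release workflow

Migrate the build system from make to Ninja, reducing release time.

Python API

Refactor the random number generation code to fix slow random number generation.
Add MegEngine.ConvTranspose3D.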

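Improvements

CUDA

Add precision jitter tests for the CUDA conv_bwd_data and conv_bwd_filter algorithms.

Python API

Add a reduction option when computing loss.
trace can return outputs of any type.

Tools

Add the __repr__ method to qat and quantized modules.
Avoid unnecessary "span dist too large" warnings in load_graph.
Improve the user experience of the statistics tool.

General components

Add a cache for CUDA API calls.

Known Issue

When training with DTR enabled, error logs about failed GPU memory allocation may appear. This means defragmentation is in progress; the program may be able to continue after defragmentation.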

MegEngine Release Notes

Highlights

Python API

The API of Dynamic Tensor Rematerialization is refactored and there is a minor bug fix to work around random operators.
Add gradient clip.
Add correlation operator.
Add sliding_window operator.
Refactor random operator and resolve the performance issue.
Add parameter/calculation statistics and visualization.
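
As a rough illustration of what the gradient clip feature in the highlights above typically does, the sketch below rescales a list of scalar gradients so that their global L2 norm does not exceed a limit. It assumes clipping by norm; the helper clip_grad_norm here is written from scratch for illustration and its signature is not taken from the MegEngine API.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale gradients so their global L2 norm is at most max_norm (illustrative helper)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    # Return the pre-clipping norm as well, which is handy for logging.
    return grads, total_norm


clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm)
```

Gradients whose norm is already within the limit are returned unchanged, so the direction of the update is always preserved and only its length is bounded.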

Bug Fixes

CUDA

Fix an error when fusing Z in convbias with the nchw4 layout on cuDNN8.0.4.
Fix potential linking issues caused by multiple libraries using different versions of cub.

Python API

Argmax and argmin accept a negative axis.
Error out if parameters attached to the grad manager are not an iterable of tensors.
Delete existing tensors before compnode finalizing to avoid a non-zero exit code.
Fix the overhead of querying algorithms when the execution strategy is set to PROFILE.
Error out if batch normalization is dumped in training mode.
Fix UAF in param pack.
Fix a hang when taking the gradient of the gather operator.
Fix a bug in trace when memory is shared between tensors.

Distributed Training

Fix multi-thread race condition when the program exits.

Tools

Minor fixes to the dump_with_mge.py script
Fix CUDA pytest random hang.
Fix missed class_colors in VOC dataset.
Fix bug of VarNode inplace operations.

Documents

Fix BN momentum description of API documentation.

General components

Fix fast run workspace limitation bug.
Fix comparison between enum and string types for elemwise mode.
Fix the problem that the optimizer modifies param.grad during step.
Fix the scale range of Lighting.
Fix the problem of incomplete filtering rules when fast run searches for algorithms from the cache.
Support dumping backward opr.
Modify TensorRT Runtime Opr to determine whether there is a batch dim, only set based on input dim.
Fix the problem of printing module error caused by module setattr.

New Features

General components

Add more critical log when assert occurs.
In order to better support future work, the fastrun cache structure has been refactored.

Tools

Add the command line tool mge to conveniently call the scripts in the tools directory, and support bash completion.

Parameter/calculation statistics and visualization

Statistics now cover norm and pooling.
Add support for activation statistics.
Change module status recursively.
The input_size parameter of module_stats is renamed to input_shapes.
Support returning the values of flops/parameters/activations.
Support turning off printed output to ease integration into automation scripts.

Release workflow

Migrate the build system from make to Ninja.

Python API

Refactor random operator and resolve the performance issue.
Add MegEngine.ConvTranspose3D.

Improvements

CUDA

Add precision jitter checks for the CUDA conv_bwd_data and conv_bwd_filter algorithms.

Python API

Add reduction choices to loss functions.
Make trace return any kind of output.
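
A reduction option on a loss function usually selects how the per-element losses are combined. The sketch below shows the idea on an L1 loss over plain Python lists. The option names "none", "sum", and "mean" follow a common convention and are an assumption here, since these notes do not list the accepted values; l1_loss is a stand-in written for illustration, not the MegEngine function.

```python
def l1_loss(pred, label, reduction="mean"):
    """L1 loss over two equal-length lists with a selectable reduction (illustrative)."""
    losses = [abs(p - t) for p, t in zip(pred, label)]
    if reduction == "none":
        return losses                     # keep the per-element losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unsupported reduction: {reduction}")


pred, label = [1.0, 2.0, 4.0], [1.0, 0.0, 1.0]
per_element = l1_loss(pred, label, reduction="none")
total = l1_loss(pred, label, reduction="sum")
average = l1_loss(pred, label, reduction="mean")
print(per_element, total, average)
```

Keeping the unreduced losses available is useful when samples need individual weights before they are summed.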

Tools

Add repr method for qat and quantized module.
Avoid unnecessary "span dist too large" warning in load_graph.
Improve statistical tools' user experience .

General components

Add CUDA API cache.

Known Issue

When DTR training is enabled, there may be some error logs showing that allocating memory has failed. This indicates that the program is currently in defragmentation phase and the program may continue to run after defragmentation.

v1.4.0-rc1

3 years ago

Highlights

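Add a feature for optimizing GPU memory usage through recomputation in dynamic graph mode. With 2 extra lines of code, a model 3 times larger can be trained with the same GPU memory.
To try it out: pip3 install megengine==1.4.0rc1 -f https://megengine.org.cn/whl/mge.html

Bug Fixes

General components

Fix MatMul still searching for parameters after no-profiling-on-shape-change is set.
Fix training getting slower over time because of the const tensor cache.

CUDA

Fix the order in which MegEngine destroys cuda and cuDNN.
Fix the CUTLASS GEMM crash by adding a block size limit.
Fix the profiling feature of the TensorRT runtime opr.

Python API

Fix side effects of the optimizer's state_dict.
Fix gopt level in trace.

Quantization

Fix forward of Quantized.Concat.
Fix zero scale in easy quant.

Tools

Fix node display in TensorBoard.
Fix auto-naming when a Module's structure is extended.
Fix the FLOPs calculation for group conv in module stats.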

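New Features

General components

dnn enables logging by default and prints error information; an interface is provided for users to set the log level.

ARM

dot support is enabled by default on arm, while remaining compatible with machines that do not support the 8.2 instruction set.

CUDA

CUDA compnode can directly obtain memory-related information.

Python API

Add a feature for optimizing GPU memory usage through recomputation in dynamic graph mode.
Add the AdamW optimizer.
Add the array method to varnode.
warp_perspective supports mat_idx.

Tools

Add the repr method to NetworkNode.
Add the optimize-for-inference interface to opgraph.
Add summary output to module_stats and net_visualize.
Add receptive_field statistics to NetWorkNode.
Make log_path an optional parameter of network_visualize.

Improvements

Tools

Improve the auto-naming rules for operators.

Others

General components

Refactor CPU CompNode so that default_cpu does not support record.
Replace the algo reproducible attribute with algo attribute.

Python API

Move nvof to vision, while remaining compatible with the original usage.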

Highlights

Dynamic Tensor Rematerialization [Kirisame et al., 2021] is implemented in MegEngine. With two more lines of code, you can train a model twice larger given the same memory budget.
Welcome to try it out through : pip3 install megengine==1.4.0rc1 -f https://megengine.org.cn/whl/mge.html
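
To give a feel for how Dynamic Tensor Rematerialization decides what to drop, the sketch below implements the basic heuristic from the cited paper in simplified form: evict the tensor with the smallest recompute cost per byte and per unit of staleness. It is a toy model under stated assumptions, not MegEngine's implementation; TensorInfo, eviction_score, pick_victim, and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TensorInfo:
    name: str
    compute_cost: float  # time needed to recompute the tensor
    size: int            # memory footprint in bytes
    last_access: float   # timestamp of the most recent use


def eviction_score(tensor, now):
    """Lower is better to evict: cheap to recompute, large, and not used recently."""
    staleness = max(now - tensor.last_access, 1e-9)
    return tensor.compute_cost / (tensor.size * staleness)


def pick_victim(tensors, now):
    """Choose the tensor to evict when memory runs out."""
    return min(tensors, key=lambda t: eviction_score(t, now))


tensors = [
    TensorInfo("conv1_out", compute_cost=5.0, size=400, last_access=1.0),
    TensorInfo("relu1_out", compute_cost=0.5, size=400, last_access=2.0),
    TensorInfo("fc_out", compute_cost=2.0, size=10, last_access=9.0),
]
victim = pick_victim(tensors, now=10.0)
print(victim.name)
```

In this toy data the activation output is the preferred victim: it is as large as the convolution output but far cheaper to recompute, while the small, recently used tensor is kept.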

Bug Fixes

General components

Fix MatMul opr tuning bug when setting no-profiling-on-shape-change.
Fix const tensor cache.

CUDA

Fix the destroying order of cudnn and cuda in MegEngine.
Fix the cutlass gemm crash by limiting the block size.
Fix TensorRT runtime opr profiling.

Python API

Fix bugs in optimizer's state_dict.
Fix gopt level in trace.

Quantization

Fix quantized concat forward.
Fix zero scale bug of easy quant.

Tools

Fix node display bug in tensorboard.
Fix auto naming bug when expanding structure.
Fix the FLOPs calculation bug for group conv in module stats and remove the model status change.
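
For reference, the usual way to count the work of a grouped convolution is sketched below. This is a generic formula, not the module stats source: the function name and arguments are hypothetical, the count is in multiply-accumulates, and bias is ignored. Whether module stats reports MACs or twice that number is not stated in these notes.

```python
def conv2d_macs(in_channels, out_channels, kernel_hw, out_hw, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channels must be divisible by groups")
    kh, kw = kernel_hw
    oh, ow = out_hw
    # Each output element only sees in_channels / groups input channels.
    macs_per_output = (in_channels // groups) * kh * kw
    return out_channels * oh * ow * macs_per_output


dense = conv2d_macs(64, 128, (3, 3), (56, 56))
grouped = conv2d_macs(64, 128, (3, 3), (56, 56), groups=4)
```

With groups=4 the count is a quarter of the dense case, which is the factor a FLOPs counter gets wrong if it forgets to divide the input channels by the group count.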

New Features

General components

DNN turns on log by default, prints error information, and provides an interface for users to set the log level.

ARM

dot is turned on by default on ARM and remains compatible with machines that do not support the 8.2 instruction set.

CUDA

Enable the CUDA compnode to obtain memory-related information directly.

Python API

Add dynamic tensor rematerialization.
Add AdamW optimizer.
Add array method for varnode.
Support F.warp_perspective with mat_idx.
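
As background on the AdamW item above, the sketch below shows one scalar AdamW update with decoupled weight decay. It is a plain-Python illustration of the general algorithm, not MegEngine's optimizer code; the function name, argument names and default values are assumptions chosen for the example.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay: applied to the parameter, not added to grad.
    param = param - lr * weight_decay * param
    # Adam step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient only the weight decay term changes the parameter.
decayed, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.5)
# With no weight decay the first step moves by roughly lr.
stepped, _, _ = adamw_step(1.0, 1.0, 0.0, 0.0, step=1, lr=0.1, weight_decay=0.0)
```

The difference from Adam with L2 regularization is the decay line: the decay does not pass through the adaptive scaling by v_hat.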

Tools

Add repr for NetworkNode.
Add optimize-for-inference interface for opgraph.
Add summary print for module_stats and network_visualize.
Add support of receptive_field stats for NetworkNode.
Set network_visualize's log_path as an optional flag.
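
The receptive field statistic mentioned above can be computed layer by layer. The following is a generic sketch of that recurrence, not the NetworkNode implementation; the function name and the (kernel_size, stride) input format are hypothetical, and dilation and padding are ignored.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel_size, stride)."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        # A kernel of size k widens the field by (k - 1) input-space steps.
        rf += (kernel - 1) * jump
        # Stride multiplies the distance between neighbouring outputs.
        jump *= stride
    return rf


two_3x3 = receptive_field([(3, 1), (3, 1)])
strided = receptive_field([(3, 2), (3, 1)])
stem = receptive_field([(7, 2), (3, 2)])
```

Two stacked 3x3 layers with stride 1 see a 5x5 input patch; putting a stride of 2 on the first layer grows that to 7x7.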

Improvements

Tools

Optimize the op's auto naming rules.

Others

General components

Refactor the CPU compnode so that default_cpu does not support record.
Replace the algo reproducible attribute with algo attributes.

Python API

Move nvof to vision, compatible with old usage.

v1.3.1

3 years ago

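Compatibility violation

Bug Fixes

General components

Fix the broadcast op raising an error under trace when the input and output shapes are identical.
Fix the multi-GPU crash caused by the asynchronous queue.
Fix automatic naming errors when a module contains containers such as list.
Fix module state_dict modifying internal state values when keep_var is set.
Fix the spelling of OPTIMIZED in the Strategy enum.

Tools

Fix node display in the visualization tool.
Fix the FLOPs calculation in the statistics tool.
Fix the statistics tool modifying module state internally.

Quantization

Fix the forward error of the quantized concat module.

New Features

Tools

Add log_path as an optional parameter to the visualization tool.
Add summary output to the visualization and statistics tools.
Add receptive field support to the statistics tool.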

Compatibility violation

Bug Fixes

General components

Fix the broadcast op reporting an error under trace when the input and output shapes are the same.
Fix the distributed training crash caused by the asynchronous queue.
Fix automatic naming errors when the module contains containers such as a list.
Fix the internal state values of module state_dict being modified when keep_var is set.
Fix the spelling error of OPTIMIZED in the Strategy enumeration.

Tools

Fix node display in the network visualization tool.
Fix the FLOPs calculation in the statistics tools.
Fix the module_stats tools modifying the module status.

Quantization

Fix the forward of the quantized concat module.

New Features

Tools

Add log_path as an optional parameter to the network visualization tool.
Add a summary print to the network visualization tool and module_stats.
Add receptive field support to the statistics tool.

v1.3.0

3 years ago

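Compatibility violation

C++ serialization adds an opname field, so older versions cannot load serialized files produced by the new version.
set/get_conv_execution_strategy is deprecated; use the new interface set/get_execution_strategy instead.

Others

interpolate/roi_pooling/roi_align/nms/remap/warp_affine/warp_perspective/cvt_color are moved from the functional.nn module to the functional.vision module.
sigmoid/hsigmoid/relu/relu6/hswish are moved from the functional.elemwise module to the functional.nn module.
topk_accuracy is moved from the functional.utils module to the functional.metric module.
copy is moved from the functional.utils module to the functional.tensor module.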

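Bug Fixes

General components

Fix the trace error caused by incorrect shape inference in reshape.
Fix the memory leak in trace.
Fix the trace error caused by linspace.
Fix scalar parameters turning into 1-dim tensors after differentiation.
Fix the NCHW-to-NCHW4 conversion error in graph optimization.
Fix the memory leak caused by dispatching tasks too quickly under asynchronous execution.
Fix the segfault caused by a pyobject reference counting problem.
Fix the out-of-bounds memory access in roialign.
Fix the load error of CompNode reuse in some cases.
Fix the graph optimization errors of NormalizeArithChainPass and WarpFusion.
Fix the device parameter in linspace.

Python API

Fix the error in F.full/F.ones/F.zeros when the input shape is a scalar tensor.

Quantization

Fix the error raised when comparing quantized data types for equality in some cases.
Fix the checkpoint loading error in quantized training.
Fix parameters not being updated in TQT quantized training.
Fix the backward gradient computation in TQT quantized training.
Fix quantized training not converting user-defined quantized Modules.

Others

Fix set_mgb_log_level not taking effect.
Fix the freeze parameter in batch normalization.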

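New Features

General components

Support computing small tensors on the host to reduce host-device synchronization.
Add a fast profile mode to fastrun.
fast-run supports recursive search.
The Matmul opr supports fast-run algorithm search.
Add the disable-optimize-for-inference parameter to load_and_run.
Add automatic naming of ops based on module structure during trace.
Add static shape inference to Reshape.

Python API

Add new operators including TensorRT/Atlas/Cambricon (third-party hardware), cvt_color, matinv, resize, warp_affine, deformable_conv2d, deformable_psroi_pooling, repeat, and tile.
Add support for naming tensors.

Distributed training

Add scalar support to distributed communication operators.

Tools

Add GraphInference to cgtools with support for specifying output nodes.
Add tools for visualization and for parameter and computation statistics based on .mge files.
Add a Python version of the load_and_run tool.

Dataloader

The stream dataloader supports setting a timeout and a callback function invoked after the timeout.

ARM

Automatically detect ARM platform features and enable the corresponding optimizations.
Add ARM64 CUDA inference support.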

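Improvements

General components

Traced functions can now return a dict.

Python API

Module supports getattr with complex keys.
Module repr supports list/dict.

Distributed training

Add return values for distributed training.

Quantization

Adjust the fake quantization strategy for bias: bias is fake-quantized only when both weight and activation are quantized.
Optimize the quantized data type structure so that the quantization framework supports third-party quantized data types.

ARM

Add a tiled implementation of Matmul to improve performance for some shapes.

Thanks to our Contributors

Many thanks to @jia-kai for submitting a PR in this release. We look forward to more developers building MegEngine together!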

Compatibility violation

Since C++ serialization adds a new opname field, C++ serialization files dumped by this version cannot be loaded by earlier releases.
set/get_conv_execution_strategy is deprecated; use set/get_execution_strategy instead.

Additional Note

Some functionals are moved to new modules for better organization. Backward compatibility is guaranteed, so the change is not expected to affect existing usage. The moved functionals include:
interpolate/roi_pooling/roi_align/nms/remap/warp_affine/warp_perspective/cvt_color are moved from functional.nn to functional.vision.
sigmoid/hsigmoid/relu/relu6/hswish are moved from functional.elemwise to functional.nn.
topk_accuracy is moved from functional.utils to functional.metric.
copy is moved from functional.utils to functional.tensor.

Bug Fixes

General components

Fix shape inference in reshape which may lead to error in trace.
Fix the problem of trace memory leak.
Fix trace error caused by linspace.
Fix the bug in automatic differentiation which turns a scalar into an 1-dim tensor.
Fix NCHW-to-NCHW4 layout transform in gopt.
Fix memory leak when python frontend runs much faster without synchronization to the device.
Fix segfault caused by pyobject reference counting error.
Fix the illegal memory access in ROIAlign operator.
Fix CompNode reuse load error in some cases.
Fix the graph optimization error of NormalizeArithChainPass and WarpFusion.
Fix the device parameter in linspace.

Python API

Fix scalar as the input shape of F.full/F.ones/F.zeros.

Quantization

Fix comparison error of quantized data type.
Fix checkpoint loading error in quantized training.
Fix parameters which cannot be updated in TQT.
Fix gradient calculation in TQT.
Fix bug in user-defined TQT module.

Others

Fix set_mgb_log_level malfunction.
Fix freeze parameter in batch normalization.

New Features

General components

Support host computation for small tensors to reduce synchronization between host and device.
Add fast profile mode for fastrun.
Support recursive search in fastrun.
Add matmul support in fastrun.
Add disable-optimize-for-inference parameter to load_and_run.
Add automatic naming of ops based on module structure during trace.
Add static shape inference for reshape operator.
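
As an illustration of what static shape inference for reshape involves, the sketch below resolves a target shape from the source shape alone, without running the operator. It is a generic plain-Python example, not MegEngine's inference code; the function name is hypothetical, and the use of -1 as the placeholder for a single unknown dimension is an assumption borrowed from common array libraries.

```python
from math import prod


def infer_reshape(src_shape, target_shape):
    """Resolve target_shape (at most one -1 entry) against src_shape."""
    total = prod(src_shape)
    unknown = [i for i, d in enumerate(target_shape) if d == -1]
    if len(unknown) > 1:
        raise ValueError("at most one dimension may be -1")
    known = prod(d for d in target_shape if d != -1)
    if unknown:
        # The unknown dimension must absorb the remaining elements exactly.
        if known == 0 or total % known != 0:
            raise ValueError("cannot infer the unknown dimension")
        resolved = list(target_shape)
        resolved[unknown[0]] = total // known
        return tuple(resolved)
    if known != total:
        raise ValueError("element count mismatch")
    return tuple(target_shape)


flattened = infer_reshape((2, 3, 4), (-1, 4))
explicit = infer_reshape((2, 3, 4), (4, 6))
```

Knowing the output shape ahead of execution is what lets a tracer check downstream operators without first computing the tensor.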

Python API

Add new operators: TensorRT/Atlas/Cambricon (third-party hardware), cvt_color, matinv, resize, warp_affine, deformable_conv2d, deformable_psroi_pooling, repeat, tile.
Enable tensor naming.

Distributed training

Support scalar tensors for distributed operators.

Tools

Add GraphInference in cgtools and support specifying output nodes.
Support model visualization and parameter statistics from .mge files.
Add python load_and_run.

Dataloader

Support setting timeout and callback function after timeout in stream dataloader.
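
The timeout-plus-callback pattern can be sketched with the standard library alone. This is only an illustration of the behaviour described above, not the stream dataloader's interface: fetch_with_timeout, batch_queue and on_timeout are hypothetical names, and a queue.Queue stands in for the data stream.

```python
import queue


def fetch_with_timeout(batch_queue, timeout, on_timeout):
    """Return the next batch, or the callback's result if none arrives in time."""
    try:
        return batch_queue.get(timeout=timeout)
    except queue.Empty:
        # No batch within `timeout` seconds: let the caller decide what to do.
        return on_timeout()


stream = queue.Queue()
stream.put({"image": [0, 1, 2]})

got = fetch_with_timeout(stream, timeout=0.05, on_timeout=lambda: None)
fallback = fetch_with_timeout(stream, timeout=0.05, on_timeout=lambda: "skipped")
```

The first call returns the queued batch; the second finds the queue empty, waits for the timeout, and returns whatever the callback produces.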

ARM

Automatically detect ARM platform features and enable the corresponding optimizations.
Support inference on ARM64 with CUDA.

Improvements

General components

Support dict as returned value for traced function.

Python API

Add get/set_expand_structure to deal with complex key.
Support list and dict in module repr methods.

Distributed training

Add return values for distributed training.

Quantization

Adjust the fake quantization method so that bias is fake-quantized only when both weight and activation are quantized.
Support user-defined quantized data type in quantized training.
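
The bias policy above can be sketched as follows. This is a plain-Python illustration, not MegEngine's quantization code. The function names are hypothetical, a scale of None is used here to mean "not quantized", and the bias scale (weight scale times activation scale, with a 32-bit integer range) is a common convention assumed for the example rather than something these notes state.

```python
def fake_quant(x, scale, qmin=-128, qmax=127):
    """Quantize x to an integer grid with step `scale`, then dequantize."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def maybe_fake_quant_bias(bias, weight_scale, act_scale):
    """Fake-quantize bias only if both weight and activation are quantized."""
    if weight_scale is None or act_scale is None:
        return bias  # leave bias in floating point
    # Assumed convention: bias scale = weight scale * activation scale, int32.
    return fake_quant(bias, weight_scale * act_scale,
                      qmin=-2**31, qmax=2**31 - 1)


untouched = maybe_fake_quant_bias(0.3, None, 0.1)
quantized = maybe_fake_quant_bias(0.3, 0.5, 0.5)
```

When only one of the two scales is available, the bias passes through unchanged, which matches the adjusted policy.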

ARM

Add more tiled kernels of Matmul to improve performance.

Thanks to our Contributors

Many thanks to @jia-kai for submitting a PR in this release. We welcome more developers to build MegEngine together!

v1.2.0

3 years ago

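Bug Fixes

Fix errors reported by ASAN.
Fix cross-compute-node copy on Cambricon.
Fix the device memory blow-up caused by profiling.
Fix device memory not being reclaimed correctly in the Cambricon environment.
Fix GPU 0 running out of device memory in distributed training because CUDA environment variables were not set correctly.
Fix tensor split.
Fix excessive memory usage of ARM test cases.
Fix excessive device memory usage of Fastrun.
Fix the issue where the batch size specified when dumping an Atlas model exceeds the model's maximum batch size.
Fix MLIR failing to handle different shapes correctly.
Fix the dangling pointer when MLIR executes on CUDA.
Fix weight preprocessing not accounting for ConvBias without bias.
Fix garbled logs caused by a second crash while printing the error stack.

New Features

Perform a full sync when Python exits.
Add subpackages to MegEngine.
Print a warning when the pooling window size is smaller than the padding size.
Add Atlas Stub to support dumping Atlas models on x86 platforms.
Add memory forwarding to the JITExecutor opr.
Add the option for load_and_run to write results to stdout/stderr.
Add the EasyQuant quantization method.
Support tensor swap-in/swap-out and recomputation.
Optimizer supports inplace add_update.

Optimization

Add preprocessing fusion optimization for common Video Detection networks.
Add fusion of DimShuffle and Reformat with ConvBias.
Add fusion of WarpPerspective and DimShuffle.
Move tensor, differentiation, and trace from Python to C++ implementations to improve performance.
Modify the gradient rules of some oprs to save device memory.
Optimize the performance and device memory usage of QAT and TQT quantized training.
Adjust the algorithm selection strategy for CUDA chanwise Convolution.
Optimize the performance of the NCHW32 pooling operator.
Optimize the performance of the CallbackCaller operator.
Optimize CUDA IO communication.

Compatibility violation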

Bug Fixes

Fix errors reported by ASAN
Fix the problem of cross compute node copy in Cambricon
Fix out of memory error caused by profiling
Fix device memory not being released correctly on Cambricon
Fix out of memory error during distributed training due to the incorrect setting of CUDA environment variables
Fix tensor split
Reduce the memory usage of ARM testcase
Reduce the memory usage of Fastrun
Fix the issue that the batch size specified when dumping the Atlas model exceeds the maximum batch size of the model
Fix the problem that MLIR cannot handle different shapes correctly
Fix the problem of Dangling Pointer when MLIR executes CUDA
Fix the weight pre-processing to handle ConvBias without bias correctly
Fix garbled logs caused by a second crash while printing the error stack

New Features

Perform a full sync when Python exits
Add sub-packages to MegEngine
Print warning message when pooling window size is smaller than padding size
Add Atlas Stub, enabling Atlas models to be dumped on the X86 platform
Add memory forwarding to JITExecutor operator
Make load_and_run print the result to stdout/stderr not just files
Add EasyQuant quantization method
Support tensor swap-in/swap-out recalculation
Optimizer supports inplace add_update

Optimization

Optimize common Video Detection network by pre-processing fusion
Optimize performance by fusing DimShuffle and Reformat with Convolution
Fuse WarpPerspective with DimShuffle
Improve performance by rewriting tensor, differentiation and trace in cpp
Refactor the gradient rules of some oprs to reduce memory usage
Optimize QAT and TQT quantized training in terms of both performance and memory usage
Adjust the CUDA chanwise Convolution algorithm selection strategy
Optimize the performance of NCHW32 pooling operator
Optimize the performance of CallbackCaller operator
Optimize CUDA IO communication