MegEngine Versions

MegEngine is a fast, scalable, easy-to-use deep learning framework with support for automatic differentiation.

v1.13.4

3 weeks ago

MegBrain

Bug fixes

Common components

Fixed a dump failure with CD4 + FP16 enabled, where graph optimization in the clip stage was abnormal and a MIN op related bug caused the dump to fail.
Fixed an issue where index operations on a megengine tensor of type bool failed to locate the correct address.

XLA

Fixed incorrect device settings during multi-machine training.

CUDA

Fixed a compilation failure caused by a missing void ** cast.

New Features

Python API

Added the FillPoly operator.
Added the erf interface.

CUDA

Added support for Hopper series GPUs.

Common components

Fixed an issue where the reduce operator could not run in io16xc32 mode.

XLA

XLA now supports the FP16 data type.
Added a script for packaging xla. From v8.20.3 onward (inclusive), xla can be installed with: megbrain[xla]==8.20.3+cu111

Dataloader

Dataloader supports cuda data conversion.
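For readers unfamiliar with the function behind the new erf interface, erf(x) is 2/sqrt(pi) times the integral of exp(-t^2) from 0 to x. The sketch below is not MegEngine code. It uses only Python's standard library to compare math.erf with a Simpson's-rule approximation of that integral; the helper name erf_simpson is made up for this illustration.

```python
import math


def erf_simpson(x, n=200):
    """Approximate erf(x) with composite Simpson's rule (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    h = x / n
    # Endpoints f(0) + f(x), with f(t) = exp(-t^2).
    total = 1.0 + math.exp(-x * x)
    for i in range(1, n):
        weight = 4 if i % 2 else 2
        t = i * h
        total += weight * math.exp(-t * t)
    return 2.0 / math.sqrt(math.pi) * total * h / 3.0


if __name__ == "__main__":
    for x in (0.25, 0.5, 1.5):
        print(x, erf_simpson(x), math.erf(x))
```

The two columns printed by the script should agree to many decimal places, which is a quick sanity check when validating an erf implementation on a new backend.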

v1.13.3

5 months ago

MegEngine

Highlight

Added support for training and inference on Cambricon MLU series AI chips.

Known issues

When dumping with CD4 + FP16 enabled, graph optimization in the clip stage is abnormal and a MIN op related bug causes the dump to fail. A fix is expected in the next release, v1.13.4 (MegBrain v8.20.4).

Bug fixes

Third-party hardware

Fixed a rocm compilation failure.
Fixed an issue where the checksum_kernel_union4 kernel could not be found on Cambricon 590.

Common components

Fixed an issue where the reshape operator did not support int64 shape input in trace mode.
Fixed incorrect workspace calculation in the tile operator.
Fixed an issue where the seg transformer model could not be dumped because the NHWCD4 optimization pass handled it incorrectly.
Fixed the pinned megfile version dependency.
Fixed an error raised by the module_stats function when computing the parameter count and computation cost of a traced_module model.
Improved the error message shown when asynchronous execution fails, giving users a way to locate the problem further.
Provided more error information before an exception is thrown when graph execution fails.
Fixed a compilation error caused by the missing header file limits.

Release process

Fixed an issue where the megbrain cuda backend failed to compile without the MGE_WITH_CUSTOM_OP build option.

XLA

Fixed unstable GPU memory usage with xla.
Fixed indexing errors in XLA.
Fixed an issue where XLA could not trace GradManager Callback.
Fixed an issue where XLA could not trace modules that use the property decorator.

CUDA

Temporarily disabled two algorithms that call cudnn-v8 (AlgoCUDNNConvV8, AlgoCUDNNConvBiasActivationV8) to fix mismatched computation results.
Fixed known issues; cuda11.8 is now officially supported.

Documentation

Fixed the missing _mgb.so in megengine.

New Features

Python API

Added the einsum operator.
Added support for the exponential opr.
Added support for multinomial distribution sampling.
Added support for the Remap operator.
Added support for the GaussianBlur operator.

Third-party hardware

The Cambricon platform supports neuware version 1.13.0.
Added support for training and inference on the Cambricon platform.

Common components

Added support for the dilate operator.
Fixed a memory leak in OHOS thread local storage.

XLA

Added the fake_quant and tqt operators to the xla backend.
XLA supports the linspace, stack, resize, and resize backward operators.
Added the lsq operator to the XLA backend.

Improvements

Dataloader

The dataset and transform times reported by datamonitor are now the total time for one batch, consistent with how collator time and ipc time are measured.

MegEngine Lite

Bug fixes

Documentation

Fixed an inconsistency between the documentation and the implementation of the get_elem_size method in lite.
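As a rough illustration of what the multinomial distribution sampling added to the Python API in this release means statistically, the sketch below draws category indices with given probabilities and counts how often each one comes up. It relies only on Python's standard library; the function name sample_multinomial and its parameters are hypothetical and do not reflect MegEngine's actual interface.

```python
import random
from collections import Counter


def sample_multinomial(probs, num_samples, seed=None):
    """Draw `num_samples` category indices according to `probs`
    and return how many times each category was drawn."""
    rng = random.Random(seed)
    draws = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    counts = Counter(draws)
    return [counts.get(i, 0) for i in range(len(probs))]


if __name__ == "__main__":
    # With enough samples the counts approach probs * num_samples.
    print(sample_multinomial([0.2, 0.3, 0.5], 10_000, seed=0))
```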

v1.13.2

6 months ago

MegEngine

Highlight

The official cuda118 build is now supported; see Known issues for the remaining problems.
MegEngine-XLA is officially released. With XLA optimization, typical basecls/basedet networks run 10%~90% faster on cuda11.8/cudnn8.6.0.

Known issues

cuda118 may hit a resource destruction exception when TensorRT is used for inference.
In the training benchmark, the avg_cpu_usage metric is on average 32.4% higher than in the previous two versions.

Bug fixes

Python API

Fixed an issue where the arange function could not set device to cpu.

Third-party hardware

Fixed an issue where atlas reported insufficient event resources in multi-model, multi-thread environments.
Fixed an issue where atlas_env had to be activated during atlas synchronization; fixed a memory leak caused by tensordesc not being released; fixed repeated aclInit calls.

Common components

Fixed the wrong initialization order of static variables when a custom op implements a builtin op.
Fixed a risk of memory corruption on android caused by megengine's dependency on setenv.

XLA

Fixed increased GPU memory usage, ptxas not being found, and incorrect rng seed settings when using xla.

ARM

Upgraded the ndk to r25c so that -D_FORTIFY_SOURCE=2 takes effect on armv7, which it did not with the old ndk; fixed out-of-bounds memory access in the conv_backdata operator; improved build speed, so android builds can be 30% faster.

New Features

Python API

Added support for high-dimensional sort on the python side.
Added the flip, rotate, resize, and rot90 operators.

Peripheral tools

Added forward compatibility of dumped models with MegBrain v8.14.

Common components

Added the kernel implementation of where.

XLA

Functions wrapped by partial_trace fall back to the original python function when the input shape changes; partial_trace can compile collective communication operators such as all_reduce into the xla executable to improve the performance of xla-traced models; partial_trace supports trace, optimizer._update, and can accelerate the optimizer step method.

CUDA

Added gpu implementations of three mixup variants (cutmix, fmix, mixup).
Added support for the cropandpad operator.
Added support for elemwise uint16 dtype computation.

Dataloader

Added dataloader monitoring for each stage of data processing, enabled with the environment variable os.environ['MGE_DATA_MONITOR'] = '1'. With num_workers = 0, it reports the data fetching time dataset_time, the data conversion time transform_time, and the batching time collate_time. With num_workers > 0, it additionally reports the inter-process communication time IPC_time.

Improvements

Documentation

Improved the docstrings of existing APIs.

MegEngine Lite

Bug fixes

Common components

Fixed intermittent errors when calling get_io_tensor to obtain the device type.
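The Dataloader monitoring entry in this release names an environment variable and three per-stage timings. The sketch below shows how the variable is set and, separately, what those three timings measure for a single batch. Only the variable name MGE_DATA_MONITOR and the metric names come from the release note; timed_batch and its arguments are hypothetical stand-ins written with the standard library, not MegEngine's implementation.

```python
import os
import time

# From the release note: the monitor is switched on through this variable.
os.environ["MGE_DATA_MONITOR"] = "1"


def timed_batch(dataset, indices, transform, collate):
    """Build one batch and return it with per-stage wall-clock totals."""
    timings = {"dataset_time": 0.0, "transform_time": 0.0, "collate_time": 0.0}
    samples = []
    for i in indices:
        t0 = time.perf_counter()
        raw = dataset[i]                 # fetch one sample
        t1 = time.perf_counter()
        samples.append(transform(raw))   # convert it
        t2 = time.perf_counter()
        timings["dataset_time"] += t1 - t0
        timings["transform_time"] += t2 - t1
    t3 = time.perf_counter()
    batch = collate(samples)             # assemble the batch
    timings["collate_time"] = time.perf_counter() - t3
    return batch, timings


if __name__ == "__main__":
    batch, timings = timed_batch([1, 2, 3, 4], [0, 2], lambda x: x * 10, list)
    print(batch, timings)
```

Summing the fetch and transform times over the whole batch, as done here, matches the per-batch reporting that the later v1.13.3 release describes for datamonitor.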

v1.13.1

8 months ago

MegEngine

Highlight

This version and subsequent versions no longer support cuda10.1.

Bug fixes

Third-party tools

Added support on atlas for a static input batch with a dynamic output batch; fixed an issue where atlas reported insufficient event resources with multiple models and multiple threads.

Common components

Made the error message for device check failures more explicit.
Fixed an error caused by incorrect shape handling when ConvLike and AxisAddRemove are combined.

Mixed precision

Fixed errors in some operators when the input is explicitly specified as nchw format, for example inp=tensor(a, format="nchw").

New Features

Python API

Added support for the cross operator.
Added support for Gaussian, Poisson, and Laplace noise operators (AdditiveGaussianNoise, AdditivePoissonNoise, AdditiveLaplaceNoise).

Common components

Added the emboss, sharpen, and linearcontrast operators.

XLA

Added XLA lowering rules for normalization operators (currently batch norm, layer norm, group norm, and instance norm).
Added XLA lowering rules for sorting operators such as sort/argsort/topK.

Improvements

Documentation

Added usage instructions for the complex number and DLPack interfaces.

MegEngine Lite

Bug fixes

Peripheral tools

Fixed an issue in MegEngine Lite where calling get_io_tensor intermittently returned the wrong device type.
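For reference, the cross operator added to the Python API in this release computes the vector cross product. The plain-Python sketch below spells out the formula for two 3-element vectors; it is an illustration of the math only, and the function shown here is not MegEngine's interface.

```python
def cross(a, b):
    """Cross product of two 3-element vectors."""
    if len(a) != 3 or len(b) != 3:
        raise ValueError("cross expects 3-element vectors")
    ax, ay, az = a
    bx, by, bz = b
    return [ay * bz - az * by, az * bx - ax * bz, ax * by - ay * bx]


if __name__ == "__main__":
    print(cross([1, 0, 0], [0, 1, 0]))  # x cross y gives z
```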

v1.13.0

10 months ago

MegEngine

Highlight

MegEngine supports compiling, optimizing, and executing traced graphs with XLA. Typical classification networks on cuda11.8/cudnn8.6.0 run 10%~80% faster. This feature is experimental. For more information about it, see the documentation.
Subsequent versions will no longer support cuda10.1.

Bug fixes

Dataloader

Improved dataloader error handling so that Dataloader workers no longer crash silently or hang.
Removed the future warning from pyarrow.SerializationContext().
Fixed repeated warnings when the pyarrow version is higher than 1.12.

Third-party hardware

With aipp enabled on atlas, the input format can be one of several types (nhwc, nchw, nc1hwc0).

Common components

Fixed wrong index results when the start of a slice is negative.
Fixed a TracedModule compatibility issue caused by parameter type information in ArgSpec being serialized.

New Features

Python API

Added conversion between megengine tensors and dlpack.
Added a trilinear mode to the interpolate op.

CUDA

Added a cuda/naive mha proxy implementation.

Common components

jit.trace supports a without host mode, currently used mainly to plug in other deep learning compilers such as xla. When without host is True, the function wrapped by trace no longer runs its original python code after compilation, and the operator sequence is not checked against the sequence recorded by trace. You must ensure that the traced part is completely static.
Computation between external framework tensors and mge tensors is supported; for example, mge.tensor(torch.tensor)+mge.tensor returns the sum of the two.

XLA

Implemented lowering rules from mge ops to XLA HLO IR, so that XLA can be compiled and called from within MegEngine.
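The slice fix in this release concerns a negative start value. As a reminder of the expected semantics, the sketch below uses Python's built-in slice.indices to resolve a slice against a sequence length, which is the behavior numpy-style indexing follows. The helper resolve_slice is a hypothetical name for this illustration and says nothing about how MegEngine implements indexing.

```python
def resolve_slice(start, stop, step, length):
    """Return the concrete indices selected by start:stop:step
    on a sequence of the given length."""
    return list(range(*slice(start, stop, step).indices(length)))


if __name__ == "__main__":
    data = list(range(10))
    idx = resolve_slice(-3, None, None, len(data))
    # A negative start counts from the end of the sequence.
    print(idx, [data[i] for i in idx], data[-3:])
```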

v1.12.4

11 months ago

MegEngine

Highlight

CUDA_MODULE_LOADING is enabled by default on the training side, which saves the CUDA memory overhead of fatbin loading (effective for packages built with cuda 118 and above) and leaves more GPU memory available. The fewer kernel types you use, the larger the saving, up to 900MB.
The two most recent versions including this one (v1.12.4 and v1.13) keep support for cuda10.1 and cuda11.4. Later versions will no longer support cuda10.1.

Bug fixes

Python API

Fixed mismatched signatures between F.flatten and Tensor.flatten. Both are now flatten(start_axis, end_axis).
Changed the interface format of the python layer multiheadattention functional/module, in preparation for resolving problems in the original interface such as the intermediate attn matrix not being available and qkvo projection bias not being composable.
Changed the interface format of the c++ layer multiheadattention functional/module, in preparation for resolving problems in the original interface such as the intermediate attn matrix not being available and qkvo projection bias not being composable.

Dataloader

Fixed a warning caused by dataloader not closing the relevant file after reading the system memory size.

Common components

Fixed an unclear error message during trace when the return value of traced_function is a complex nested type.
Fixed garbled error messages printed when gitlab logs in to a windows environment.
Fixed an intermittent crash in multi-GPU training with DTR enabled.

Peripheral tools

Improved the environment dependencies of the whl package on the windows platform.

ARM

Fixed a compilation failure on macos aarch64 with fp16 enabled.

Documentation

Fixed a typo in the readme.

New Features

Python API

The profiler adds a scope to functional to record the hierarchy of its calls (functional/module scope is currently supported).

CUDA

Added build support for cuda11.8 on aarch64.
Added and improved support for the windows cuda118 toolchain.
CUDA_MODULE_LOADING is enabled by default on the training side, which saves the CUDA memory overhead of fatbin loading (effective for packages built with cuda 118 and above) and leaves more GPU memory available.

Common components

The profiler adds two metrics that give a more direct view of training performance (see the MR for details). GPU busy ratio, gpu_usage_ratio: the share of total training time spent in gpu training. model.step time ratio, train_time_ratio: the share of total training time actually spent training, where the time spent training is the sum over epochs of the time from the start of the first step to the end of the last step.
Improved the error log for unsupported oprs so that you can see directly which oprs in the input model are not implemented.
Added support for complex numbers, including basic operations such as arithmetic, differentiation, unpacking, and packing (new ops: F.polar, F.imag, F.real, F.complex; existing ops with added complex support: add, sub, mul, negate, reshape).

Improvements

Common components

Improved the error message in symbolic trace when a tensor whose value cannot be statically inferred calls a numpy method, making it more complete and reasonable.

Quantization

Quantization adds support for linear_bn and linear_bn_relu.
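The two profiler metrics introduced in this release are simple ratios. The sketch below computes them from hypothetical timing records, reading gpu_usage_ratio as gpu time over total training time and train_time_ratio as the per-epoch span from the first step's start to the last step's end, summed over epochs and divided by total training time. The metric names come from the release note; the function signatures and the (start, end) record layout are assumptions made for illustration and are not the profiler's real data format.

```python
def gpu_usage_ratio(gpu_time, total_time):
    """Share of the total training time spent in gpu training."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return gpu_time / total_time


def train_time_ratio(epoch_steps, total_time):
    """epoch_steps: one list per epoch of (start, end) times for each step.

    Training time per epoch is measured from the start of its first step
    to the end of its last step."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    train_time = sum(
        steps[-1][1] - steps[0][0] for steps in epoch_steps if steps
    )
    return train_time / total_time


if __name__ == "__main__":
    epochs = [[(0.0, 1.0), (1.5, 2.0)], [(3.0, 4.0)]]
    print(gpu_usage_ratio(2.5, 5.0), train_time_ratio(epochs, 5.0))
```

Under this reading, time between epochs (for example evaluation or checkpointing) lowers train_time_ratio, while gaps between steps inside an epoch do not.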

v1.12.3

1 year ago

MegEngine

Highlight

Added the general_norm operator, which supports the norm operation on specified axes. For example, with shape=[1,3,256,256] and the list [0,3], norm is applied over dimensions 0 and 3.
Added a cuda backend implementation of multiattention.

Bug fixes

Python API

Fixed an assertion in general-norm that was always true.

CUDA

Fixed a build failure of MegEngine CUDA on ubuntu 22.04.
Fixed a crash in some models after enabling ION on the mali 2.0 driver.
Fixed an issue where some user environments could not detect the CUDA card.

Quantization

Fixed an issue where lsq fakequant could not obtain quantization parameters from an ordinary observe.

Common components

Fixed an overflow in the backward pass of logsigmoid in some cases.
Fixed a calculation error in float16 winograd f43 tiling.
Fixed an issue where enabling sublinear did not save memory.
Fixed an intermittent crash when loading models from multiple threads.
Fixed a model inference crash caused by a race condition in ParameterizedDType initialization.
Fixed a Segmentation fault caused by the destruction order when the imperative runtime exits.
Fixed an issue where DeformableConv did not support the algorithms interface on the cpu backend.
Improved the error message from trace when the traced function has no output.
Fixed the sensitivity of the GeneralNorm weight and bias parameters to initialization timing, so that they are correctly attached before forward is called.
Improved the error message for illegal trace input; fixed duplicate OpNode and VarNode names that could appear in models exported by jit.dump (for example, when apply_on_var_node is called to build an operator whose subgraph contains multiple operator nodes).
Fixed a memory leak in distributed training caused by multi-stream memory management.

New Features

CUDA

Added the general_norm operator, which supports the norm operation on specified axes. For example, with shape=[1,3,256,256] and the list [0,3], norm is applied over dimensions 0 and 3.
Added a cuda backend implementation of multiattention.

Common components

Added a meaningful error message when an elemwise operand is None.
The profiler can record python and dispatcher call stacks.
Added an interface to Lite::TensorBatchCollector for getting the corresponding tensor by id.
Improved the PyMegEngineLite development experience: after the build completes, run PYTHONPATH=lite/pylite:$PYTHONPATH python3 to start using the MegEngineLite python interface.

Improvements

Documentation

Added a description of build toolchain selection to the readme.
Fixed some errors in the api documentation.

MegEngine Lite

Bug fixes

Common components

Fixed an issue where the lar fitting mode in MegEngineLite load and run did not support ioc16.
MegEngineLite python3 supports loading models from network interface files (direct reading from oss and fileobject are currently supported).
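The general_norm entry in this release gives an example of norm over axes [0,3]. The sketch below shows one plausible reading of that on a small tensor: values are normalized to zero mean and unit variance over the listed axes, independently for every index of the remaining axes. This is an assumption about the semantics, as are the eps value and the absence of weight and bias; the function norm_over_axes is hypothetical, works on a flat row-major list, and uses only the standard library.

```python
import itertools
import math


def norm_over_axes(flat, shape, axes, eps=1e-5):
    """Normalize row-major `flat` data of `shape` over `axes`."""
    # Row-major strides: the last axis is contiguous.
    strides = [1] * len(shape)
    for d in range(len(shape) - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    kept = [d for d in range(len(shape)) if d not in axes]
    out = list(flat)
    # One normalization group per index combination of the kept axes.
    for kept_idx in itertools.product(*(range(shape[d]) for d in kept)):
        base = sum(i * strides[d] for i, d in zip(kept_idx, kept))
        offsets = [
            base + sum(i * strides[d] for i, d in zip(norm_idx, axes))
            for norm_idx in itertools.product(*(range(shape[d]) for d in axes))
        ]
        values = [flat[o] for o in offsets]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        for o in offsets:
            out[o] = (flat[o] - mean) / math.sqrt(var + eps)
    return out


if __name__ == "__main__":
    shape = [1, 2, 2, 3]
    data = [float(v) for v in range(12)]
    print(norm_over_axes(data, shape, axes=[0, 3]))
```

With shape [1,2,2,3] and axes [0,3], each group holds the three values along the last axis, so the output consists of four identically normalized triples.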

v1.12.2

1 year ago

MegEngine

Highlight

ARM CPU FP16 inference is much faster: for the vgg16 model on a mi9 device, the elapsed time drops from 481.252ms to 168.300ms. Add --enable-ioc16 when dumping to enable it.
Added support for CUDA118_CUDNN860_TRT8531 builds.
Added the ConcatDataset data type for merging multiple existing datasets; see the ConcatDataset documentation.

Bug fixes

Release process

Fixed a dump failure when split does not specify an output shape.

Python API

Fixed a crash in indexing (getitem) when start is empty and step is negative (for example a[::-1]).
Fixed dtype promotion in indexing (setitem) behaving differently from numpy.

CUDA

Fixed the log not showing error information when TRT fails to load or run.

ARM

Fixed a thread_local memory leak on the android platform.

Common components

Aligned the dtype attribute of tensor with np.dtype.
Fixed a pass error in channel padding for channel wise conv.
Fixed MGB_USE_MEGDNN_DBG not taking effect in cpp opt builds. Before the fix, this env was available only in python whl builds and c++ debug builds; after the fix it takes effect in all builds.
Fixed cudnn intermittently failing to select an algorithm when the no_profile_when_shape_change option is enabled.
Fixed a crash in the channel padding pass for the nchw44 layout when the reduce axis is negative.
Fixed a crash when using the stack/concat operators with DTR enabled.
Fixed parameter information of some ops (ConvTranspose, MatrixMul) being lost after graph surgery on a c++ model.
Fixed compatibility problems in some traced module APIs (topk, arange, full, linspace, conv_transpose2d/3d, quantized.convtranspose2d), so that the new version (v8.19.1) can load .tm models from earlier versions.
Fixed an infinite loop that could occur (depending on model complexity) in the BackwardFoldScale pass of TracedModule.

New Features

Dataloader

Added the ConcatDataset data type for merging multiple existing datasets.

Python API

Added the F.nn.instance_norm interface.

CUDA

Added support for CUDA118_CUDNN860_TRT8531 builds and for Nvidia 4X sm_89 cards.

ARM

Added fp16 hybrid direct convolution.
Added fp16 nchw88 support to conv1x1.
Added the Float16 NCHW88 Winograd algorithm for the ARM CPU platform to improve Float16 performance. For the Vgg16 model, the elapsed time drops from 481.252ms to 168.300ms.
Added the Float16 MK8 8x8 matmul algorithm.

Common components

megengine operators accept tensors whose shape contains 0 as input.

Distributed training

Removed the shared memory backend for distributed training.

Improvements

Common components

For deployments that use dlopen/dlclose, linking against c++_shared is recommended. Internal megvii3 users: set is_linking_system_dynamic_library = True on the BUILD target. CMake users: append -DANDROID_STL=c++_shared to EXTRA_CMAKE_ARGS; for example, to build the android version, run EXTRA_CMAKE_ARGS=" -DANDROID_STL=c++_shared" ./scripts/cmake-build/cross_build_android_arm_inference.sh.

ARM

Improved ARM FP16 gevm performance (measured in gflops over different shapes on aarch64, 95% of shapes improve by 11% to 156%).

MegEngine Lite

Bug fixes

Common components

Fixed the output attribute set through the lite io interface not taking effect for multi-output models.
Load and run automatically recognizes mgv2 format models. Fixed some optimization options being unavailable when inference went through the megenginelite interface; inference now goes through the megengine interface.

New Features

Common components

load_and_run supports online float32 to float16 conversion (enabled with --enable-ioc16).
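The ConcatDataset added in this release merges several existing datasets into one. A common way to build such a wrapper is to keep cumulative sizes and locate the owning dataset with a binary search, as sketched below. This is a generic illustration written with the standard library; the class ConcatDatasetSketch is hypothetical and is not MegEngine's implementation or interface.

```python
import bisect
import itertools


class ConcatDatasetSketch:
    """Present several indexable datasets as a single one."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # cumulative_sizes[i] is the number of items in datasets[0..i].
        self.cumulative_sizes = list(
            itertools.accumulate(len(d) for d in self.datasets)
        )

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # First dataset whose cumulative size exceeds the index.
        which = bisect.bisect_right(self.cumulative_sizes, index)
        offset = index - (self.cumulative_sizes[which - 1] if which else 0)
        return self.datasets[which][offset]


if __name__ == "__main__":
    merged = ConcatDatasetSketch([[1, 2, 3], ["a", "b"]])
    print(len(merged), [merged[i] for i in range(len(merged))])
```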

v1.12.1

1 year ago

MegEngine

Bug fixes

Dataloader

Fixed Dataloader not accepting Infinite as input in v1.12.0 (v8.19.0).

CUDA

Fixed compilation errors when the cuda/cudnn header files are in CPATH and MGE_WITH_CUDA=OFF.

Common components

Explicitly set the cpp standard for android builds to c++17, fixing a failure to compile libion when a third party invokes the MegEngine build through add_custom_command; fixed garbled log output for load_and_run --iter 0 (the lar summary log is no longer printed when iter is 0).
Fixed an error caused by incorrectly resetting the layout of a low-bit quantized Tensor when no_profiling_on_shape_change is enabled.
Fixed an assertion error in channel padding for dynamic shape subtensor.

Full Changelog: https://github.com/MegEngine/MegEngine/compare/v1.12.0...v1.12.1

v1.12.0

1 year ago

MegEngine

Highlight

Optimized several heavily host-bound operators in BaseDet. Compared with the previous version, models are on average 12% faster in fp32 and 19% faster in fp16, and networks containing the group_norm operator use 20% less GPU memory. For networks that have a corresponding pytorch model in cvpack2, the speed gap is within 2%, roughly on par with the pytorch model.
Changed the default value of "descending" to true to match the usual definition of topK; the default topk order changes from ascending to descending.
Added support for python 3.10.

Bug fixes

Dataloader

Fixed Infinite sampler being unable to get the batchsize, and added usage examples and parameter descriptions.
Fixed ReplacementSampler producing unexpected sampling results after sampling weights are set.
Fixed ReplacementSampler outputting unexpected indices when it has weight.

Python API

Fixed an error in fusing deconv with bn.
Fixed incorrect softmax results on cpu.
Fixed the wrong extraction path for ImageNet.

Quantization

Fixed matmul inferring quantized dtypes incorrectly.
Models can no longer run inference in asymmetric qint8 quantization mode; removed the assert in fake_quant_bias to support more QAT quantization modes.

CUDA

Fixed Region Restricted Conv not supporting inputs whose group dimension equals 1.
Fixed an undefined error when building with the --copt "-DMEGDNN_DISABLE_FLOAT16=0" option.

ARM

Fixed the fallback im2col operator requesting a larger workspace than it actually needs.

X86

Fixed a performance regression in the x86 INT8 matmul operator introduced during code refactoring.

Common components

Enabling the subgraph jit feature in an environment without cuda may cause some functional API calls to raise errors, so subgraph jit is temporarily disabled by default.
Fixed occasional inconsistent memory usage when a model is initialized multiple times.
Fixed an intermittent segmentfault and a memory leak when a tensor is converted to a quantized type with astype.
Fixed the loader and dumper functions of Elemwise multitype not being forward compatible in v1.11.0 and later.
Added the tensor_value_dumper and tensor_value_loader user interfaces for the mge (fbs) format, so that users can customize behavior at the model dump and load stages, such as compressing and decompressing the model.
Fixed parameters missing from statistics because a model's parameters could only be counted through its forward function (module_stats did not support statistics for axion models).
Fixed random runtime errors caused by a data race under the default async_level during megengine training.
Fixed load and run jit settings having no effect on non-CUDA backends, and added jit support for the CPU backend.
Fixed shape or channel mismatch errors when dumping a quantized model with options such as enable_nchw4/32/64.
Adjusted the build configuration to be friendlier to developer mode: you only need to point PYTHONPATH at imperative/python. See scripts/whl/BUILD_PYTHON_WHL_README.md for details.
Removed SyntaxWarning on python3.8 and later.
Fixed inconsistent mod results between MegEngine and python.
Fixed incorrect output shape calculation for matmul with 3-dimensional input during symbolic trace.
Fixed an intermittent training crash caused by a layout deduction error in the ConvolutionBackwardData operator.
Sped up the reshape, setsubtensor, subtensor, concat, and stack operators.
Fixed the named_args interface not being updated in NormElemwisePass.

Documentation

Fixed a documentation error for warp_affine.

New Features

Python API

deconv supports fusing bn.

CUDA

customop on CUDA supports the new RuntimeArgs parameter.
Removed the restriction that input and output tensor sizes must be a multiple of 4 when the RegionRestrictedConv mask type is uint8.

ARM

The ARM platform supports the fp16 nchw88 im2col algorithm, which is about 2x faster than fp32 nchw44 and is mainly used to speed up ARM fp16 model inference.
Added the ARM NCHW88 fp16 pooling algorithm.

Common components

Region Restricted Conv supports bias.
The nchw44/nchw88/nchw44-dot layouts pad the channel when the channel count does not meet their requirements.
Added the groupnorm operator.
Added support for python 3.10.
Added cuda helper functions for custom op so that custom ops can execute asynchronously.

Improvements

Python API

Changed the default value of descending to true; the default topk order changes from ascending to descending.

CUDA

Improved the error message shown when dump and load use different tensorrt versions.

MegEngine Lite

Bug fixes

Common components

Fixed zero copy not taking effect when lite runs a model that spans compnodes.
Fixed a UAF triggered by the lite zero copy pass.

Peripheral tools

Fixed fast-run not working in load_and_run fitting mode.
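The topk change in this release flips the default order from ascending to descending. The plain-Python sketch below shows what that means for callers who rely on the default. The parameter name descending comes from the release note; the function itself, its return layout (values, then indices), and its tie-breaking are assumptions made for illustration rather than MegEngine's actual signature.

```python
def topk(values, k, descending=True):
    """Return the k largest values (descending=True) or the k smallest
    values (descending=False), together with their original indices."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=descending
    )
    picked = order[:k]
    return [values[i] for i in picked], picked


if __name__ == "__main__":
    scores = [3, 1, 4, 1, 5]
    print(topk(scores, 2))                    # new default: largest first
    print(topk(scores, 2, descending=False))  # previous default behavior
```

Code that depended on the old ascending default needs to pass descending=False explicitly after upgrading.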