Paddle

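PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)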

v2.2.2

2 years ago

2.2.2 Release Note

1. 重要更新

我们很高兴的发布飞桨框架2.2.2版本,主要是对2.2.1中一些功能和性能问题的修复,并对部分功能点做了增强。

2. 训练框架(含分布式)

(1)新功能

API

  • 新增paddle.nn.Mishpaddle.nn.functional.mish,支持逐元素计算mish激活函数。 (#38803)

其他

  • paddle.nn.PReLUpaddle.nn.functional.prelupaddle.nn.static.prelu 新增支持 data_format 参数,可以设置输入的数据类型。 (#38495)
  • paddle.index_select 新增支持 float16 数据类型。(#38751)
  • 优化 paddle.multiplexinputs中张量 size 为 0 时的报错信息。(#38757)
  • paddle.fluid.contrib.slim.quantization.PostTrainingQuantization 新增初始化参数data_loader,支持传入 paddle.io.DataLoader 对象或者Python Generator 。(#38729)

(2)问题修复

API

  • 修复paddle.max在输入x.ndim > 6 and axis < 0时运行出错的问题。(#38070)
  • 修复paddle.maxpaddle.min的bug:在CPU设备上,当参数axis是list类型且len(axis) == x.ndim and axis[i] < 0时,结果出错。(#38478)
  • 修复paddle.nn.functional.unfold在InferShape计算时不区分compile time和runtime的问题。(#38925)
  • 修复paddle.nn.functional.cross_entropy在对labels进行检查时,存在不必要的GPU与CPU同步的问题。(#38849
  • 修复paddle.distributed.split在沿列切分FC时,反向计算时得到的输入梯度结果异常的问题。(#38724)
  • 修复 paddle.nn.Layer.to 不支持 paddle.dtype 类型的问题。(#38108)
  • 修复静态图下 paddle.linalg.svdfull_matrics=True 时,输出tensor的shape在动态图和静态图下不同的问题。(#37744)
  • 修复Tensor切片索引使用多个None类型索引时结果维度异常的问题。(#37400)
  • 修复Tensor索引赋值在部分场景下显存泄露的问题。(#38098)
  • 修复模型使用 save_inference_model 导出后,添加反向 pass 做训练,conv2d 缺失属性报错的问题。 (#38832)

IR(Intermediate Representation)

  • 动态图转静态图

    • 修复了部分初始化相关 API 动静行为不统一的问题。(#37827)
    • 修复动转静代码转写时会将 paddle 作为变量的问题。(#37999)
    • 修复动转静代码转写时,突出的代码注释导致转写报错的问题。(#38003)
    • 修复 for ... zip... 语句在动转静中死循环的问题。(#37846)
  • 模型量化

    • 修复动态图量化训练导出模型多余节点问题。(#38122) (#38025)
    • 针对量化模型在Paddle Lite上无法预测的问题,去除量化导出模型的 clip_extra 设置。 (#38343)
    • 针对 flatten_contiguous_range 算子在量化中输出配置错误的问题,修复 flatten_contiguous_range 量化设置。 (#37741)

其他

  • 自定义OP

    • 修复了自定义算子在多进程下加载Python API 时,可能因文件不完整导致报错的问题。(#38128)
    • 修复了在CentOS平台上编译时,D_GLIBCXX_USE_CXX11_ABI未按预期生效导致的编译失败问题。(#37878)
  • 动态图Inplace策略

    • 修复了多个inplace op连续执行时,accumulator 报错的问题。(#38406)
    • 修复了 Tensorsetitem 方法,对叶子节点进行inplace操作时,导致反向图构建错误的bug。(#38014)
  • NHWC 策略

    • 修复 batchnorm_op 中,当数据类型为 FP32 ,且数据维度 dims = 2,data_layout = NHWC 时,反向 Op 内中间变量未定义问题。 (#37020)

3. 部署方向(Paddle Inference)

(1)功能优化

框架及API更新

  • C API支持对c++ std::string的处理。(#38667)

后端能力增强

  • GPU 及 TensorRT 子图引擎相关更新
    • 支持 relu、relu6、tanh、sigmoid、pool2d、concat、batch_norm、split、gelu、scale、swish、prelu、clip、reduce_sum、reduce_mean 算子在静态 shape 且2维输入情况下调用 TensorRT 推理。(#37773)
    • 支持mish激活函数调用 TensorRT 推理。 (#38866)

(2)问题修复

框架及API修复

  • 算子修复

    • 修复roi_align算子在使用 TRT 时不兼容的问题。(#38788)
    • 增加elementwise在维度相同时广播的功能。(#37908)
  • 框架功能修复

    • 修复动态图转静态图时的模型剪裁逻辑,使得包含 subblock 的算子在动态图转静态图时可以正确剪裁。(#37579)
    • 修复多线程下 CreatePredictor 接口的报错问题,当前的 CreatePredictor 接口允许在多线程中调用而不会导致推理异常。(#37894)
    • 配置config时,对于没有权重的模型,支持 params file 传空字符串。(#38579)
    • 修复Paddle-TRT engine直接输入cpu tensor没有进行gpu数据拷贝的问题。(#37427)

后端能力修复

  • TensorRT 子图引擎修复

    • 修复pool2d在某些参数组合的情况下运行TensorRT出错的问题。(#37929)
  • MKLDNN引擎修复

    • 修复 matmul_v2 的 mkldnn kernel 不支持两个输入的shape长度不同的问题。 (#38733)

其他修复

  • 修复ERNIE模型在TRT8下可能出现的hang死问题。(#37839)

2.2.2 Release Note

1. Important Updates

This version fixes some functional and performance issues in PaddlePaddle 2.2.1 and enhances several features.

2. Training Framework (distributed included)

(1)New functions

API

  • Add paddle.nn.Mish and paddle.nn.functional.mish, which compute the mish activation function element-wise. (#38803)
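
As a reference for what "element-wise mish" means, the following is a minimal pure-Python sketch of the standard definition mish(x) = x * tanh(softplus(x)). It does not call Paddle; the helper names `softplus`, `mish`, and `mish_elementwise` are hypothetical and exist only for this illustration.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    # Standard definition: mish(x) = x * tanh(softplus(x)).
    return x * math.tanh(softplus(x))


def mish_elementwise(values):
    # Apply mish independently to every element of a flat list.
    return [mish(v) for v in values]


print(mish_elementwise([-1.0, 0.0, 1.0]))
```

Each output element depends only on the input element at the same position, which is what distinguishes an element-wise activation from a reduction.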

Others

  • paddle.nn.PReLU, paddle.nn.functional.prelu, and paddle.static.nn.prelu now support the data_format parameter, which sets the data layout of the input. (#38495)
  • paddle.index_select now supports the float16 data type. (#38751)
  • Improve the error message of paddle.multiplex when a tensor in inputs has size 0. (#38757)
  • Add the data_loader initialization parameter to paddle.fluid.contrib.slim.quantization.PostTrainingQuantization, which accepts a paddle.io.DataLoader object or a Python generator. (#38729)

(2)Bug Fixes

API

  • Fix a runtime error in paddle.max when the input has x.ndim > 6 and axis < 0. (#38070)
  • Fix a bug in paddle.max and paddle.min: on CPU, the result was incorrect when axis is a list with len(axis) == x.ndim and axis[i] < 0. (#38478)
  • Fix the issue that paddle.nn.functional.unfold does not distinguish between compile time and runtime in its InferShape calculation. (#38925) (#38834)
  • Fix an unnecessary GPU-CPU synchronization when paddle.nn.functional.cross_entropy checks labels. (#38849)
  • Fix incorrect input gradients in the backward pass when paddle.distributed.split splits an FC layer along columns. (#38724)
  • Fix the issue that paddle.nn.Layer.to does not support the paddle.dtype type. (#38108)
  • Fix the issue that, in static graph mode, the output tensor shape of paddle.linalg.svd with full_matrices=True differs from the shape in dynamic graph mode. (#37744)
  • Fix incorrect result dimensions when a Tensor slice index uses multiple None indices. (#37400)
  • Fix a GPU memory leak in Tensor index assignment in some scenarios. (#38098)
  • Fix the missing-attribute error reported by conv2d when a model exported with save_inference_model has a backward pass added for training. (#38832)

IR(Intermediate Representation)

  • Dynamic Graph to Static Graph

    • Fix inconsistent behavior between dynamic and static graphs for some initialization-related APIs. (#37827)
    • Fix the issue that paddle is treated as a variable during dynamic-to-static code transcription. (#37999)
    • Fix transcription errors caused by code comments that stick out of the surrounding indentation during dynamic-to-static code transcription. (#38003)
    • Fix the endless loop of for ... zip ... statements in dynamic-to-static conversion. (#37846)
  • Model quantization

    • Fix redundant nodes in models exported from dynamic graph quantization training. (#38122) (#38025)
    • Remove the clip_extra setting when exporting quantized models, so that quantized models can run inference on Paddle Lite. (#38343)
    • Fix the quantization settings of the flatten_contiguous_range operator, whose output was configured incorrectly during quantization. (#37741)

Others

  • Custom OP

    • Fix bug that user-defined operator may report an error due to incomplete files when loading Python APIs under multiple processes. (#38128)
    • Fix compilation failure caused by D_GLIBCXX_USE_CXX11_ABI not taking effect as expected when compiling on CentOS platforms. (#37878)
  • Dynamic graph inplace strategy

    • Fix the accumulator error raised when multiple inplace OPs execute consecutively. (#38406)
    • Fix the issue that the __setitem__ method of Tensor builds an incorrect backward graph when it performs an inplace operation on a leaf node. (#38014)
  • NHWC strategy

    • Fix bug of undefined intermediate variables in backward Op in batchnorm_op when data type is FP32, with dims = 2 and data_layout = NHWC. (#37020)

3. Paddle Inference

(1)Function Optimization

Framework and API updates

  • C API supports processing of c++ std::string. (#38667)

Back-end capability enhancement

  • GPU and TensorRT subgraph engine related updates
    • Support TensorRT inference for the relu, relu6, tanh, sigmoid, pool2d, concat, batch_norm, split, gelu, scale, swish, prelu, clip, reduce_sum, and reduce_mean operators with static shapes and 2-dimensional inputs. (#37773)
    • Support TensorRT inference for the mish activation function. (#38866)

(2)Bug Fixes

Framework and API fixes

  • Operator fixes

    • Fix the incompatibility of the roi_align operator when used with TRT. (#38788)
    • Add broadcasting support for elementwise operators whose inputs have the same number of dimensions. (#37908)
  • Framework function fixes

    • Fix the model pruning logic in dynamic-to-static conversion so that operators containing a subblock are pruned correctly. (#37579)
    • Fix errors in the CreatePredictor interface under multiple threads. CreatePredictor can now be called from multiple threads without causing inference exceptions. (#37894)
    • Support passing an empty string as the params file in config for models without weights. (#38579)
    • Fix the missing copy to GPU when a CPU tensor is fed directly into the Paddle-TRT engine. (#37427)

Back-end capability fixes

  • TensorRT subgraph engine fixes

    • Fix TensorRT errors triggered by pool2d with certain parameter combinations. (#37929)
  • MKLDNN engine fixes

    • Fix the issue that the mkldnn kernel of matmul_v2 does not support two inputs whose shapes have different lengths. (#38733)

Others

  • Fix the possible hang bug of ERNIE model under TRT8. (#37839)

v2.2.1

2 years ago

2.2.1 Release Note

1. 重要更新

我们很高兴的发布飞桨框架2.2.1版本,主要是对2.2.0中一些功能和性能问题的修复,并对部分功能点做了增强,重点如下:

  • 新增 paddle.linalg.triangular_solve,用于计算带有三角系数矩阵的线性方程组。
  • 新增 paddle.device.cuda.graphs.CUDAGraph API,支持NVIDIA的CUDA Graph功能,注意目前该API还处于实验阶段,尚未稳定。
  • 修复了基础API、Tensor 索引中的已知问题。

2. 训练框架(含分布式)

(1)新功能

API

  • 新增paddle.linalg.triangular_solve API,用于计算带有三角系数矩阵的线性方程组。(#36714)
  • 新增paddle.device.cuda.graphs.CUDAGraph API,支持NVIDIA的CUDA Graph功能,可以将GPU计算全部捕捉到一张CUDA Graph中,往后多次调用,可以去除框架的额外开销,提升运行性能。注意目前该API还处于实验阶段,尚未稳定。(#37109)
  • 新增paddle.incubate.graph_send_recv API,主要应用于图学习领域,目的是为了减少在消息传递过程中带来的中间变量显存或内存的损耗,包含 SUM、MEAN、MIN、MAX 共四种更新模式。(#37205)
  • 新增paddle.incubate.operators.ResNetUnit API,用于 ResNet 网络里的卷积、批归一化、shortcut/bottleneck操作融合。(#37109)

(2)功能优化

API

  • paddle.incubate.FusedTransformerEncoderLayer,添加 src_mask=None 的支持,添加pure fp16的支持。 (#37229)

IR(Intermediate Representation)

  • 动态图转静态图
    • 使用@paddle.jit.to_static装饰单独的 function 时,提供 train()、eval() 函数支持切换到 train、eval 模式。(#37383)

分布式训练

  • 异构参数服务器完善任意次切图能力,增加流水线训练功能,提升训练吞吐。(#37446)

其他

  • 针对 paddle.scatterindex 越界导致 core dump 的问题,加强了越界检查,并完善对应的报错信息。(#37431)

(3)性能优化

  • 优化 paddle.top_k,根据 k 的大小和 input_width 大小进行选择不同的实现方案,当 k>=75% input_width 时选择 cub 实现,否则选择手写 kernel 实现。(#37325)
  • 优化paddle.fluid.optimizer.LarsMomentumOptimizer,通过 optimizer 算子融合 + CUDA Cooperative Groups的方式提高OP性能。(#37109)

(4)问题修复

API

  • 修复paddle.nn.ELUpaddle.nn.functional.elu 的计算公式,解决 alpha<0 时结果错误的问题;paddle.nn.functional.elu_不支持 alpha<0 的场景,在 alpha<0 时会报错。(#37437)
  • 修复paddle.slice反向执行时出现 out_of_range 的问题。(#37584)
  • paddle.shape 没有反向,显式设置 stop_gradientTrue。(#37412)
  • paddle.arange 没有反向,显式设置 stop_gradientTrue。(#37486)
  • paddle.shard_index 在输入数据的最后一维不为1时进行报错提示。(#37421)
  • 修复 paddle.matmul 使用int8量化,反量化时维度错误的问题。(#36982)
  • 修复 paddle.nn.Dropouteval 模式下不计算梯度的问题。(#37305)
  • 修复 paddle.nn.functional.dropout 在静态图下输入 Tenor 形状中有 -1 并指定 drop 该维时报错的问题。(#37223)
  • 修复RNN类API paddle.nn.LSTM,paddle.nn.GRU, paddle.nn.SimpleRNN在CPU训练时多层RNN(dropout设置为0)反向计算出错的问题。(#37086)
  • 修复 paddle.incubate.FusedTransformerEncoderLayer 反向计算梯度错误、pre_layer_norm 处理不正确、参数处理不正确,漏传参数、 add_bias 计算错误等问题。 (#37229)
  • 修复 paddle.incubate.fused_multi_head_attention 不支持 biasNone 的问题。(#37411, #37566)
  • 修复paddle.vision.datasets.Cifar10, paddle.vision.datasets.Cifar100加载数据没有顺序的问题。 (#37528)
  • 修复一维Tensor在使用省略号(...)索引时维度检测异常报错的问题。(#37192)
  • 修复Tensor索引赋值(setitem)梯度属性无法传播的问题,详见issue。(#37028)

IR(Intermediate Representation)

  • 动态图转静态图
    • 动转静后的模型调用 paddle.flops 能够正确统计模型参数。(#36852)
    • 动转静模块能够正确转换for i in [1, 2, 3]循环语句。(#37259)

分布式训练

  • fleet.load_model: 修复参数服务器模式下模型加载API不可用问题。(#37461)
  • fleet.save_inference_model: 修复参数服务器模式下模型保存 dense 参数前,未从 server 端拉取参数的问题。(#37461)

其他

  • 修复动态图 inplace 操作的问题:对一个非叶子节点进行 inplace 操作后,立即执行 backward,该节点及更前的节点的梯度计算错误。(#37420)

3. 部署方向(Paddle Inference)

(1)问题修复

  • 在明确关闭日志的情况下,进一步去除冗余的调试日志。(#37212)
  • 修复内存/显存优化策略,避免因不当的内存/显存优化导致预测结果有误或崩溃。(#37324, #37123)
  • 修复 Transformer 模型的 MultiHead 结构中融合后 QkvToContextPluginDynamicscale 的 scale 计算错误问题,这是由于 cuda 函数的 block 和 thread 设置错误引起的。(#37096)
  • 将所有的推理OP在int8量化的功能中注册:解决因历史原因有些推理OP没有在int8量化中注册的问题。(#37266)

2.2.1 Release Note

1. Important Updates

This version fixes some functional and performance issues in PaddlePaddle 2.2.0 and enhances several features. The highlights are as follows:

  • Add paddle.linalg.triangular_solve to calculate linear equations with triangular coefficient matrices.
  • Add paddle.device.cuda.graphs.CUDAGraph API that supports the CUDA Graph function of NVIDIA. Note that this API is still experimental and not yet stable.
  • Fix known issues of basic API and Tensor index.
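
To make "linear equations with triangular coefficient matrices" concrete, here is a minimal pure-Python sketch of back substitution for an upper-triangular system U x = b. It illustrates the mathematical problem only; it is not Paddle's implementation, and the function name `solve_upper_triangular` is hypothetical.

```python
def solve_upper_triangular(U, b):
    # Solve U x = b by back substitution, where U is an n x n
    # upper-triangular matrix given as a list of rows.
    n = len(U)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if U[i][i] == 0:
            raise ZeroDivisionError("matrix is singular: zero on the diagonal")
        # Subtract the contribution of the already-solved unknowns.
        known = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / U[i][i]
    return x


# 2*x0 + 1*x1 = 5 and 4*x1 = 8
print(solve_upper_triangular([[2.0, 1.0], [0.0, 4.0]], [5.0, 8.0]))
```

Because each row introduces only one new unknown, the system is solved in a single backward sweep with no factorization step.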

2. Training Framework(Distributed Included)

(1)New Functions

API

  • Add the paddle.linalg.triangular_solve API to solve systems of linear equations with triangular coefficient matrices. (#36714)
  • Add the paddle.device.cuda.graphs.CUDAGraph API, which supports NVIDIA's CUDA Graph feature. It captures all GPU computation into a single CUDA Graph that can then be called many times, removing framework overhead and improving runtime performance. Note that the API is still experimental and not yet stable. (#37109)
  • Add the paddle.incubate.graph_send_recv API for graph learning. It reduces the GPU or host memory consumed by intermediate variables during message passing and provides four update modes: SUM, MEAN, MIN, and MAX. (#37205)
  • Add the paddle.incubate.operators.ResNetUnit API, which fuses the convolution, batch normalization, and shortcut/bottleneck operations in ResNet networks. (#37109)
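
The four update modes of message passing can be sketched in pure Python as follows. This is an illustration of the semantics only, under two assumptions that are not stated in the release note: node features are scalars, and a node that receives no message gets 0. The function name `send_recv` is hypothetical and does not call Paddle.

```python
def send_recv(x, src_index, dst_index, pool_type="sum"):
    # Gather the feature of each source node and reduce the messages
    # that arrive at each destination node.
    reducers = {
        "sum": sum,
        "mean": lambda msgs: sum(msgs) / len(msgs),
        "min": min,
        "max": max,
    }
    reduce_fn = reducers[pool_type]
    inbox = [[] for _ in x]
    for src, dst in zip(src_index, dst_index):
        inbox[dst].append(x[src])
    # Assumption: nodes with an empty inbox receive 0.
    return [reduce_fn(msgs) if msgs else 0 for msgs in inbox]


x = [1.0, 2.0, 3.0]
src, dst = [0, 1, 2, 0], [1, 2, 1, 0]
for mode in ("sum", "mean", "min", "max"):
    print(mode, send_recv(x, src, dst, mode))
```

Note that this sketch materializes every message in `inbox` before reducing; that intermediate storage is what the release note describes the fused API as reducing.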

(2)Function Optimization

API

  • paddle.incubate.FusedTransformerEncoderLayer now supports src_mask=None and pure fp16. (#37229)

IR(Intermediate Representation)

  • Dynamic Graph to Static Graph
    • When @paddle.jit.to_static decorates a standalone function, train() and eval() functions are provided to switch between train and eval modes. (#37383)

Distributed Training

  • Improve the heterogeneous parameter server's ability to split the graph an arbitrary number of times and add pipeline training, which increases training throughput. (#37446)

Others

  • Strengthen the out-of-bounds check for the index of paddle.scatter, which could cause a core dump, and improve the corresponding error message. (#37431)

(3)Performance Optimization

  • Optimize paddle.top_k by choosing an implementation according to the sizes of k and input_width: the cub implementation when k >= 75% of input_width, and the handwritten kernel otherwise. (#37325)
  • Optimize paddle.fluid.optimizer.LarsMomentumOptimizer to improve OP performance through optimizer operator fusion and CUDA Cooperative Groups. (#37109)
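
The top_k selection rule stated above can be written as a small dispatch function. This is a sketch of the rule as described in the release note, not the actual kernel-selection code; the function name `choose_topk_impl` and the returned labels are hypothetical.

```python
def choose_topk_impl(k: int, input_width: int) -> str:
    # Pick an implementation from the sizes of k and input_width,
    # following the rule: cub when k >= 75% of input_width.
    if not 0 < k <= input_width:
        raise ValueError("k must satisfy 0 < k <= input_width")
    # Integer form of k >= 0.75 * input_width, avoiding float rounding.
    if 4 * k >= 3 * input_width:
        return "cub"
    return "handwritten_kernel"


print(choose_topk_impl(75, 100))
print(choose_topk_impl(10, 100))
```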

(4)Bug Fixes

API

  • Fix the calculation formula of paddle.nn.ELU and paddle.nn.functional.elu, which gave wrong results when alpha<0. Note that the inplace version, paddle.nn.functional.elu_, does not support alpha<0 and raises an error in that case. (#37437, https://github.com/PaddlePaddle/Paddle/pull/37437)
  • Fix the out_of_range error in the backward pass of paddle.slice. (#37584)
  • paddle.shape has no backward, so stop_gradient is now explicitly set to True. (#37412)
  • paddle.arange has no backward, so stop_gradient is now explicitly set to True. (#37486)
  • paddle.shard_index now reports an error if the last dimension of the input data is not 1. (#37421)
  • Fix the wrong dimension in dequantization when paddle.matmul uses int8 quantization. (#36982)
  • Fix the issue that paddle.nn.Dropout does not compute gradients in eval mode. (#37305)
  • Fix the error raised by paddle.nn.functional.dropout in static graph mode when the input Tensor shape contains -1 and that dimension is specified to be dropped. (#37223)
  • Fix backward calculation errors of multi-layer RNNs (with dropout set to 0) in CPU training with the RNN APIs paddle.nn.LSTM, paddle.nn.GRU, and paddle.nn.SimpleRNN. (#37086)
  • Fix several issues in paddle.incubate.FusedTransformerEncoderLayer: wrong gradients in the backward pass, incorrect handling of pre_layer_norm, incorrect parameter handling, missing parameters, and add_bias calculation errors. (#37229)
  • Fix the issue that paddle.incubate.fused_multi_head_attention does not support bias being None. (#37411, #37566)
  • Fix the unordered data loading of paddle.vision.datasets.Cifar10 and paddle.vision.datasets.Cifar100. (#37528)
  • Fix the dimension-check error raised when a one-dimensional Tensor is indexed with an ellipsis (...). (#37192)
  • Fix the issue that the gradient attribute cannot be propagated through Tensor index assignment (setitem); see the issue for details. (#37028)
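
The ELU item above concerns alpha<0. The release note does not say which formula was used before the fix, so the following pure-Python sketch only illustrates why negative alpha is delicate: the piecewise definition and a commonly seen max/min formulation agree for alpha>=0 but diverge for alpha<0. Both function names are hypothetical, and the max/min variant is an assumption made for illustration, not a statement about Paddle's earlier code.

```python
import math


def elu_piecewise(x: float, alpha: float = 1.0) -> float:
    # Piecewise definition: x for x > 0, alpha * (exp(x) - 1) otherwise.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_max_min(x: float, alpha: float = 1.0) -> float:
    # A common single-expression formulation (illustrative assumption).
    # It matches the piecewise form only when alpha >= 0.
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))


for alpha in (1.0, -1.0):
    for x in (-1.0, 1.0):
        print(alpha, x, elu_piecewise(x, alpha), elu_max_min(x, alpha))
```

With alpha<0, alpha * (exp(x) - 1) changes sign relative to the alpha>=0 case, so the min(0, ...) term clips the negative branch to 0 and leaks into the positive branch.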

IR(Intermediate Representation)

  • Dynamic Graph to Static Graph
    • Models converted from dynamic to static graph can now call paddle.flops to count model parameters correctly. (#36852)
    • The dynamic-to-static module now correctly converts for i in [1, 2, 3] loop statements. (#37259)

Distributed Training

  • fleet.load_model: Fix the issue that the model loading API is unavailable in parameter server mode. (#37461)
  • fleet.save_inference_model: Fix the issue that parameters are not pulled from the server side before the model's dense parameters are saved in parameter server mode. (#37461)

Others

  • Fix the problem of inplace operation of dynamic graph: after performing inplace operation on a non-leaf node, followed by immediate execution of backward, the gradient of this node and the nodes before is calculated incorrectly. (#37420)

3. Paddle Inference

(1)Bug Fixes

  • Further remove redundant debug logs when logging is explicitly disabled. (#37212)
  • Fix the memory/GPU memory optimization strategies to avoid incorrect inference results or crashes caused by improper memory/GPU memory optimization. (#37324, #37123)
  • Fix the scale calculation error of the fused QkvToContextPluginDynamic in the MultiHead structure of Transformer models, which was caused by wrong block and thread settings of the cuda function. (#37096)
  • Register all inference OPs for int8 quantization, which resolves the issue that some inference OPs were not registered for int8 quantization for historical reasons. (#37266)

v2.2.0

2 years ago

v2.2.0-rc0

2 years ago

v2.1.3

2 years ago

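This release mainly fixes some issues in 2.1.2. The highlights are as follows:

  • Add paddle.disable_signal_handler, which turns off the signal-capture mechanism in Paddle so that Paddle and TVM can be used together (#35366).
  • Fix the incorrect compile-time dimension calculation of paddle.flatten in static graph mode (#35398).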

v2.1.2

2 years ago

2.1.2 Release Note

重要更新

本版本主要是对2.1.1中一些功能和性能问题的修复,重点如下:

  • 修复了基础API中的已知问题。
  • 修复了动转静语法转写已知的若干问题。
  • 自定义OP编译时C++版本检查由C++11升级为C++14。

训练框架

功能优化(含分布式)

基础API

  • 修复 paddle.vision 路径下部分API无法访问的问题。(#34489)
  • 修复paddle.concat在应用到多个大shape 的Tensor时溢出的问题。(#34396)
  • paddle.flip支持输入axis为整型,并提升了动态图模式下的性能。(#34477)
  • 修复paddle.slice 输入输出地址相同时越界访问问题。(#34265)
  • 修复 paddle.nn.Unfold 的输入参数顺序错误的问题。(#34251)
  • 新增了静态图下 Tensor 的若干接口,如 size()、detach()等。 (#33330)
  • Tensor.grad 的 Warning内容中增加了不兼容升级的说明。(#34262)
  • 下线 paddle.save 保存 Layer 的功能。(#34039)
  • 修复 paddle.jit.save在Mac系统上保存的模型,在Linux平台上无法对模型进行重训练的问题。(#34154)
  • 修复 layer_norm 在大 size 输入时 cuda kernel 参数错误的问题。(#33893)
  • 修复paddle.io.DataLoader误报不兼容升级warning问题。(#34001)
  • 修复paddle.io.DataLoader内存泄漏问题。(#34301)

动态图转静态图

  • 新增对 Sequential 容器类嵌套使用时的语法支持。(#34246)
  • 新增对 Python3 type hint 语法的兼容支持。(#33745)
  • @to_staticinput_spec 参数新增支持非 Tensor 类型,如 int、float、string、bool等。(#33464)
  • 修复了动转静语法转写已知的若干问题。(#33963)

自定义OP

  • 自定义OP编译时C++版本检查由C++11升级为C++14。 (#30415)

推理部署

Paddle Inference

问题修复

  • 修复batch_size > 1时ERNIE模型计算结果错误的问题。(#33784)
  • 修复windows下TensortRT推理路径用右斜杠分割导致的崩溃。(#33885)
  • 修复MKLDNN elementwise系列OP的X不支持广播的问题。(#33845

环境适配

编译安装

  • 限定了依赖的 Gast 库的版本范围。( gast>=0.3.3, <=0.4.0)(#33850)
  • 优化了Avx/No-Avx相关的安装报错信息,减少了冗余的Warning信息。(#33885)

新硬件适配

昆仑硬件训练支持

  • 修改昆仑的cmake文件,统一更新昆仑的算子库。(#34000

Important Updates

This release mainly fixes some functional and performance issues in 2.1.1. The highlights are as follows:

  • Fix several known issues in basic APIs.
  • Fix several known issues with dynamic-to-static syntax transcription.
  • Upgrade the C++ version check during custom OP compilation from C++11 to C++14.

Training framework

Functional optimization (including distributed)

Basic API

  • Fix the issue that some APIs under paddle.vision are not accessible. (#34489)
  • Fix the overflow of paddle.concat when applied to multiple Tensors with large shapes. (#34396)
  • paddle.flip now accepts an integer axis, with improved performance in dynamic graph mode. (#34477)
  • Fix the out-of-bounds access in paddle.slice when the input and output addresses are the same. (#34265)
  • Fix the wrong order of input parameters of paddle.nn.Unfold. (#34251)
  • Add several Tensor interfaces in static graph mode, such as size() and detach(). (#33330)
  • Add a note about the incompatible upgrade to the Warning of Tensor.grad. (#34262)
  • Remove the ability of paddle.save to save a Layer. (#34039)
  • Fix the issue that models saved with paddle.jit.save on Mac cannot be retrained on Linux. (#34154)
  • Fix the wrong cuda kernel parameters of layer_norm for large inputs. (#33893)
  • Fix the false incompatible-upgrade warning reported by paddle.io.DataLoader. (#34001)
  • Fix the memory leak in paddle.io.DataLoader. (#34301)

Dynamic Graph to Static Graph

  • Add syntax support for nested use of Sequential container classes. (#34246)
  • Add compatibility with Python3 type hint syntax. (#33745)
  • The input_spec argument of @to_static now supports non-Tensor types such as int, float, string, and bool. (#33464)
  • Fix several known issues with dynamic-to-static syntax transcription. (#33963)

Custom OP

  • Upgrade the C++ version check during custom OP compilation from C++11 to C++14. (#30415)

Inference Deployment

Paddle Inference

Bug fixes

  • Fix the wrong calculation results of the ERNIE model when batch_size > 1. (#33784)
  • Fix the crash caused by TensorRT inference paths split with right slashes on Windows. (#33885)
  • Fix the issue that the X input of the MKLDNN elementwise series OPs does not support broadcasting. (#33845)

Environment adaptation

Compile and install

  • Restrict the version range of dependent Gast libraries ( gast>=0.3.3, <=0.4.0). (#33850)
  • Optimize Avx/No-Avx related installation error messages, reduce redundant Warning messages. (#33885)

New Hardware Adaptation

Kunlun hardware training support

  • Modify the cmake file of Kunlun to unify and update its operator library. (#34000)

Thanks to our Contributors

This release contains contributions from:

0x45f, Aurelius84, Chen Weihang, chentianyu03, HexToString, iducn, Jacek Czaja, Kaipeng Deng, Leo Chen, lzzyzlbb, Peihan, taixiurong, tianshuo78520a, WeiXin, wenbin, Wilber, wuhuachaocoding, xiongkun, Zhou Wei, winter-wang.

v2.1.1

2 years ago

2.1.1 Release Note

重要更新

本版本主要是对2.1.0中一些功能和性能问题的修复,并对部分功能点做了增强,重点如下:

  • 完成了 paddle.distributed、paddle.device、paddle.vision 目录API的可见性优化。
  • 动态图转静态图新增对 paddle.nn.Sequential容器内 sublayer 的用户代码的动静转换。
  • 动态图增加 SyncBatchNorm 对AMP的支持,提升动态图 SyncBatchNorm 层在AMP模式的性能。

训练框架

功能优化(含分布式)

基础API

  • paddle.distributed、paddle.device、paddle.vision 等层级新增推荐使用方式,推荐使用方式的具体说明请见下文2.1.0 Release Note。(#33420)
  • 新增 paddle.is_compiled_with_rocm 。(#33228)
  • 新增 paddle.strided_slice bool type输入的支持。(#33373
  • 新增 paddle.equal_all、paddle.equal、paddle.greater_equal、paddle.greater_than、paddle.less_equal、paddle.less_than、paddle.not_equal bool type输入的支持。 (#33551
  • 修复 paddle.utils.download 在ConnectionError异常时不进行Retry逻辑。(#33454
  • 修复 paddle.gather 在axis不等于0下,infershape错误的问题。(#33553
  • 修复 paddle.io.DataLoadernum_workers=0Dataset 生成GPU Tensor 送入DataLoader 时导致的段错误。(#33487, #33249
  • 修复 slice 操作结果作为左值使用inplace操作时,反向运行报错提示与错误无关的问题。(#32981
  • 修复 paddle.concat 动态图支持 uint8 出错的问题。(#33667)
  • 修复 paddle.grid_sample 显存溢出和输出结果异常的问题。(#33100#33232
  • 修复 roi_align 中align=True模式下输入为0时的问题。(#33446
  • 修复了在特定情况下 log_softmax 会把输入改为nan的问题。(#32937

动态图转静态图

  • 新增支持对 paddle.nn.Sequential容器内 sublayer 的用户代码的动静转换。(#33065
  • 修复了在控制流 for 语句转换中,在变量静态类型分析阶段未正确处理 Subscript 语法的问题。(#32969
  • 重构了动转静 param_guard 逻辑代码,全面解决动静态图 Tensor 类型互转问题。(#32985

分布式训练

  • 修复 paddle.distributed.spawn 在使用默认 nprocs 参数时出错的问题。(#33249
  • 修复流水线并行通信组创建不一致导致训练启动hang住的问题。(#32890#33473
  • 修复混合并行中保存参数失败的问题。(#33595#33588
  • 修复Fleet API无法直接运行 Program 的问题。(#33511
  • 修复异构参数服务器纯GPU训练模式中样本分桶不均导致hang住的问题。(#32957
动态图混合并行
  • 修复 TensorParallel 的精度问题。改变 TensorParallel 的参数初始化方式,保证参数切分后的随机性。(#33087
  • 修复 PipeLineParallel 的精度问题。解决 PipeLineParallelmicrobatch 使用不正确的问题。(#33097
  • 修复 new_group API创建多个通信组,会hang的问题。(#33553

混合精度训练

  • 动态图增加 SyncBatchNorm 对AMP的支持,提升动态图 SyncBatchNorm 层在AMP模式的性能,在PaddleSegDeepLabV3P模型上8卡AMP模式加速比提升19%。(#33709)

自定义OP

  • 移除了自定义OP编译时对 PADDLE_WITH_MKLDNN 宏的依赖。(#32903
  • 默认设置 GLIBCXX_USE_CXX11_ABI=1 以解决GCC版本过低导致编译时可能报错的问题。(#33185
  • 新增支持c++14的语法特性,默认开启-std=c++14编译选项。 (#33227

其他

  • 修复了多线程下 LoDTensorArray 作为Op输入时,训练会随机出段错误的问题。(#32984
  • 修复 paddle.ParamAttr 的 regularizer 和 paddle.optimizer.Momentumweight_decay 同时被指定为 L2Decay 时,参数正则化被执行2次的问题。(#32881
  • 修复windows系统下warning信息可能显示乱码问题。(#33689

推理部署

模型量化

  • 修复动态图量化训练功能中跳过OP量化的问题。(#32879
  • 修复量化模型保存时 layer_norm不保存 out_threahold 属性的问题。(#33610

Paddle Inference

功能升级

  • Paddle-TRT新增 gather_ndreduce_sum 的converter/plugin。(#33365
  • Paddle-TRT新增 reshape 。(#33372

性能优化

  • 增加TensorRT的 layer_norm 动态shape plugin,提升模型动态shape推理性能。(#33448

易用性优化

  • 新增 Paddle Inference ROCm 版的预测示例文档以及增加C++预测库的version.txt中与ROCM相关版本信息 (#33290)
  • 更新了XPU的编译选项,具体编译选项请参考 #33581

问题修复

  • 修复 fused_fc_elementwise_layernorm 在海光DCU下的线程数过大导致的计算结果错误问题。 (#33299)
  • 修复yolov3模型在Jetson Nano和Jetson TX2上开启gpu后运行失败的问题。(#33442)
  • Paddle-TensorRT plugin multihead_matmul 修复当seq_len > 1024的计算错误。(#33365
  • 修复了ERNIE 模型变长情况下,输入的顺序不一致导致输出结果不对的问题。(#33622
  • 修复OCR模型在GPU上预测报错问题。(#33431)
  • 修复 paddle.static.io.normalize_program 没有导出 paddle.static.normalize_program 的问题。(#33408
  • 修复TensorRT6.0使用stride > 1的conv失败的问题。(#33198 )
  • 修复批量推理图片时的显存访问越界错误。(#33370 )(#33531 )
  • 修复X86 CPU上MKLDNN缓存大小设置失效的问题。 (#33571
  • 修复TensorRT conv2d_transpose op converter维度错误设置问题。(#33242
  • 修复Jetson 设备上分CUDA Arch编译出的预测库结果错误的问题,本版本将发布分Arch编译的Jetson预测库,供对预测库体积有需求的用户使用。(#33269
  • 修复使用PaddleSlim量化模型从内存加载预测时,仍会因未设置校准表路径而报错的问题。(#33629
  • 修复BERT/ERNIE在非0号卡上使用TensorRT预测时报错cuda error 400的问题。(#33706
  • 修复在Linux下设置自定义编译参数时引发的cmake语法错误。(#33621
  • 优化 layer_norm 计算精度,修复大数据输入时输出Nan的问题。(#33420)
  • 修复windows下,TensorRT推理传入左斜杠做分隔符的模型路径时,opt路径错误问题。(#33885)

环境适配

新硬件适配

昆仑硬件训练支持

  • 修复 gather op,新增支持 logsumexp 。 (#32931)

2.1.1 Release Note

Important Updates

This version fixes some functional and performance issues in PaddlePaddle 2.1.0 and enhances several features. The highlights are as follows:

  • Optimize the API visibility of paddle.distributed, paddle.device, and paddle.vision.
  • Add dynamic-to-static conversion of user code in sublayers of paddle.nn.Sequential.
  • Add SyncBatchNorm support for AMP in dynamic graph mode, improving the performance of the dynamic graph SyncBatchNorm layer in AMP mode.

Training Framework

Functional optimization (including distributed)

Basic API

  • Optimize the API visibility of paddle.distributed, paddle.device, and paddle.vision. For more information, see the 2.1.0 Release Note. (#33420)
  • Add paddle.is_compiled_with_rocm. (#33228)
  • Add bool type input support to paddle.strided_slice. (#33373)
  • Add bool type input support to paddle.equal_all, paddle.equal, paddle.greater_equal, paddle.greater_than, paddle.less_equal, paddle.less_than, and paddle.not_equal. (#33551)
  • Fix the issue that paddle.utils.download does not retry on ConnectionError. (#33454)
  • Fix the infershape error of paddle.gather when axis is not 0. (#33553)
  • Fix the segmentation fault caused by paddle.io.DataLoader when num_workers=0 and the Dataset produces GPU Tensors that are fed to the DataLoader. (#33487, #33249)
  • Fix the issue that, when a slice result is used as an lvalue of an inplace operation, the error message raised in the backward pass is unrelated to the actual error. (#32981)
  • Fix the error of paddle.concat with uint8 in dynamic graph mode. (#33667)
  • Fix the GPU memory overflow and abnormal output of paddle.grid_sample. (#33100, #33232)
  • Fix a bug in roi_align: when the input width or height of rois is 0, the output feature should be 0. (#33446)
  • Fix a bug where log_softmax changed the input to nan in some corner cases. (#32937)

Dynamic Graphs to Static Graphs

  • Add dynamic-to-static conversion of user code in sublayers of paddle.nn.Sequential. (#33065)
  • Fix the incorrect handling of Subscript syntax during static type analysis of variables when converting control-flow for statements. (#32969)
  • Refactor the dynamic-to-static param_guard logic to comprehensively resolve Tensor type conversion problems between dynamic and static graphs. (#32985)

Distributed Training

  • Fix the error in paddle.distributed.spawn when using the default nprocs argument. (#33249)
  • Fix the hang at training startup caused by inconsistent creation of pipeline parallel communication groups. (#32890, #33473)
  • Fix the failure to save parameters in hybrid parallelism. (#33595, #33588)
  • Fix the issue that the Fleet API cannot run a Program directly. (#33511)
  • Fix the hang caused by uneven sample bucketing in the pure GPU training mode of the heterogeneous parameter server. (#32957)

Hybrid Parallelism with Dynamic Graph

  • Fix the accuracy error of TensorParallel. The parameter initialization method of TensorParallel is changed to ensure the randomness of the parameters after slicing. (#33087)
  • Fix the accuracy error of PipeLineParallel, which was caused by incorrect use of microbatch. (#33097)
  • Fix the issue that the new_group API hangs when creating multiple communication groups. (#33553)

Mixed Precision Training

  • Add SyncBatchNorm support for AMP in dynamic graph mode, improving the performance of the dynamic graph SyncBatchNorm layer in AMP mode. On the DeepLabV3P model of PaddleSeg, the speedup ratio in 8-card AMP mode improves by 19%. (#33709)

Custom OP

  • Remove the dependency on the PADDLE_WITH_MKLDNN macro for custom OP compilation. (#32903)
  • Set GLIBCXX_USE_CXX11_ABI=1 by default to resolve compile-time errors that may be caused by a GCC version that is too low. (#33185)
  • Add support for c++14 syntax features and enable the -std=c++14 compile option by default. (#33227)

Others

  • Fix the random segmentation fault in training when LoDTensorArray is used as an Op input under multi-threading. (#32984)
  • Fix the issue that parameter regularization is executed twice when both the regularizer of paddle.ParamAttr and the weight_decay of paddle.optimizer.Momentum are specified as L2Decay. (#32881)
  • Fix the garbled characters in warning messages on Windows. (#33689)

Inference Deployment

Model Quantization

  • Fix the issue of skipping OP quantization in dynamic graph quantization training. (#32879)
  • Fix the issue that layer_norm does not save the out_threahold attribute when a quantized model is saved. (#33610)

Paddle Inference

Function Upgrades

  • Add converters/plugins for gather_nd and reduce_sum in Paddle-TRT. (#33365)
  • Add reshape in Paddle-TRT. (#33372)

Performance Optimization

  • Add the dynamic shape plugin of TensorRT layer_norm to improve dynamic shape inference performance. (#33448)

Usability Optimization

  • Add an inference example document for the ROCm version of Paddle Inference, and add ROCm-related version information to version.txt of the C++ inference library. (#33290)
  • Update XPU compilation options. Please refer to #33581 for the specific options.

Bug Fixes

  • Fix the calculation error of fused_fc_elementwise_layernorm caused by an excessive number of threads on Hygon DCU. (#33299)
  • Fix the issue that the yolov3 model fails to run with GPU enabled on Jetson Nano and Jetson TX2. (#33442)
  • Fix the computation error of the Paddle-TRT multihead_matmul plugin when seq_len > 1024. (#33365)
  • Fix the incorrect output of the ERNIE model with variable-length input caused by an inconsistent input order. (#33622)
  • Fix the error reported by OCR models during inference on GPU. (#33431)
  • Fix the issue that paddle.static.io.normalize_program is not exported as paddle.static.normalize_program. (#33408)
  • Fix the issue that conv with stride > 1 fails in TRT6.0 and below. (#33198)
  • Fix the out-of-bounds GPU memory access during batch inference on images. (#33370) (#33531)
  • Fix the issue that the MKLDNN cache size setting does not take effect on X86 CPU. (#33571)
  • Fix the wrong dimension setting of the TRT conv2d_transpose op converter. Models containing the conv2d_transpose op now work normally on TRT. (#33242)
  • Fix the wrong results of inference libraries compiled per CUDA Arch on Jetson devices. This version will release Jetson inference libraries compiled per Arch for users who need a smaller inference library. (#33269)
  • Fix the issue that loading a PaddleSlim quantized model from memory for inference still reports an error because the calibration table path is not set. (#33629)
  • Fix the cuda error 400 reported by BERT/ERNIE when running TRT inference on a non-0 card. (#33706)
  • Fix the cmake syntax error caused by setting custom compilation parameters on Linux. (#33621)
  • Improve the calculation accuracy of layer_norm and fix the NaN output for large inputs. (#33420)
  • Fix the wrong opt path on Windows when a model path that uses left slashes as separators is passed to TensorRT inference. (#33885)

Environment Adaptation

Compile and Install

Support of New Hardware Training

Support of Kunlun Chips

  • Fix the gather op and add support for the logsumexp op. (#32931)

Thanks to our Contributors

This release contains contributions from: Aurelius84, cc, ceci3, Chen Weihang, danleifeng, feng_shuai, houj04, jiangcheng, JZ-LIANG, Kaipeng Deng, lidanqing, LielinJiang, Lijunhui, lilong12, liuyuhui, liym27, Pei Yang, Peihan, Qi Li, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, TeslaZhao, tianshuo78520a, TTerror, wangguanzhong, Wangzheee, wawltor, WeiXin, wenbin, Wenyu, whs, Wilber, wuhuanzhou, Zhang Ting, zhiboniu, Zhou Wei, zhoujun, 李季, 王明冬

v2.1.0

2 years ago

v2.0.2

3 years ago

2.0.2 Release Note

Important Updates

This version fixes some functional and performance issues in PaddlePaddle 2.0.1 and enhances some features. The important updates are as follows:

  • Add the use_softmax parameter to paddle.nn.functional.cross_entropy, which controls whether softmax is applied before the cross entropy is calculated; mark paddle.nn.functional.softmax_with_cross_entropy as deprecated, since this API will be removed in a future version.
  • Fix multiple issues of distributed training in parameter server mode.
  • Upgrade Paddle's oneDNN version to 2.2, which improves the inference performance of multiple models.
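
The effect of a use_softmax-style switch can be sketched in plain Python. This is a minimal illustration, not Paddle's implementation: the helper names `softmax` and `cross_entropy` below are hypothetical stand-ins, and only the parameter name use_softmax comes from the note above.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(inputs, label, use_softmax=True):
    """Hypothetical stand-in for a cross entropy with a use_softmax switch.

    use_softmax=True:  `inputs` are raw logits; softmax is applied first.
    use_softmax=False: `inputs` are assumed to already be probabilities.
    """
    probs = softmax(inputs) if use_softmax else inputs
    return -math.log(probs[label])

logits = [2.0, 1.0, 0.1]
loss_from_logits = cross_entropy(logits, label=0)
loss_from_probs = cross_entropy(softmax(logits), label=0, use_softmax=False)
```

Both calls give the same loss. The switch only decides whether the function applies the normalization itself or the caller has already done so.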

Training Framework

Function Optimization

API

  • Add paddle.io.random_split and paddle.io.Subset. (#32090)
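
As a rough illustration of what a random split into subsets does, here is a stdlib-only sketch. The `Subset` class and `random_split` function below are hypothetical stand-ins that mirror the names in the note. The `seed` argument is an assumption made for reproducibility here and is not taken from the Paddle API.

```python
import random

class Subset:
    """Hypothetical stand-in: a view of `dataset` restricted to `indices`."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __getitem__(self, i):
        return self.dataset[self.indices[i]]

    def __len__(self):
        return len(self.indices)

def random_split(dataset, lengths, seed=None):
    """Shuffle the indices once, then cut them into consecutive chunks."""
    if sum(lengths) != len(dataset):
        raise ValueError("lengths must sum to len(dataset)")
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    subsets, start = [], 0
    for n in lengths:
        subsets.append(Subset(dataset, order[start:start + n]))
        start += n
    return subsets

data = list(range(10))
train, val = random_split(data, [7, 3], seed=0)
```

The subsets are disjoint and together cover the whole dataset, because each index appears exactly once in the shuffled order.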

Bug Fixes

API

  • Fix the issue that the stride and padding of paddle.nn.MaxPool3D and paddle.nn.AvgPool3D do not have default values. (#32014)
  • Fix the issue that a cuDNN-enabled RNN reports duplicate creation when it creates parameters. (#31916)
  • Fix the issue that an error is reported when the soft_label of paddle.nn.functional.cross_entropy is True and the weight parameter is specified; add the use_softmax parameter to paddle.nn.functional.cross_entropy, which controls whether softmax is applied before the cross entropy is calculated; mark paddle.nn.functional.softmax_with_cross_entropy as deprecated, since this API will be removed in a future version. (#31953, #32105, #32035)
  • Fix the issue that paddle.nn.ClipByNorm generates NaN values when the gradients are all zero, which leads to non-convergence in mixed precision training. (#32038)
  • Fix the out-of-bounds memory access in paddle.stack. (#32005)
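
The all-zero gradient case in the ClipByNorm fix can be pictured with a small norm-clipping sketch. The release note does not say what caused the NaN in Paddle, so the code below is only an assumption about one common pitfall (dividing by a zero norm) and one way to guard against it. The function name `clip_by_norm` is hypothetical.

```python
import math

def clip_by_norm(grad, clip_norm):
    """Scale `grad` so that its L2 norm is at most `clip_norm` (illustrative only)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        # Covers the all-zero case: nothing to scale, so norm is never a divisor.
        return list(grad)
    return [g * clip_norm / norm for g in grad]

zero_grad = clip_by_norm([0.0, 0.0, 0.0], clip_norm=1.0)
big_grad = clip_by_norm([3.0, 4.0], clip_norm=1.0)  # norm 5 is scaled down to 1
```

Because the division happens only when the norm exceeds the threshold, a zero gradient passes through unchanged.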

Distributed Training

  • Fix computation graph splitting in parameter server mode so that it supports the GradClip strategy. (#31945)
  • Fix the initialization of the truncated Gaussian distribution in parameter server mode. (#31945)
  • Fix the issue that the Profiler's multi-threaded information is printed incorrectly in parameter server mode. (#31945)
  • Fix the Python 3 incompatibility when data are read by Dataset and output by zip. (#31945)
  • Clean up redundant log information and optimize the output format of exe.train_from_dataset. (#32009)

Inference Deployment

Paddle Inference

Function Upgrades

  • Paddle-TRT adapts to ERNIE/BERT models trained and saved by PaddlePaddle 2.0. (#31959)

Performance Optimization

  • Upgrade oneDNN to version 2.2, which improves the inference performance of many models. (#31270)
  • Add hard_swish oneDNN support and conv + hard_swish fusion, which improves ocr_det model inference performance by 18% on SkyLake. (#31870)
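
For readers unfamiliar with hard_swish, the sketch below uses its usual definition, x * min(max(x + 3, 0), 6) / 6. It also shows, on a toy 1-D convolution, that applying the activation inside the convolution loop gives the same values as applying it in a separate pass. The `conv1d` helper is hypothetical and has nothing to do with the oneDNN kernels. It only illustrates the idea of fusion, where the saving comes from avoiding a second pass over the output.

```python
def hard_swish(x, threshold=6.0, scale=6.0, offset=3.0):
    # x * relu6(x + 3) / 6 with the usual default constants.
    return x * min(max(x + offset, 0.0), threshold) / scale

def conv1d(signal, kernel, activation=None):
    """Toy valid 1-D cross-correlation; applies `activation` as each output is produced."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        acc = sum(signal[i + j] * kernel[j] for j in range(k))
        out.append(activation(acc) if activation else acc)
    return out

signal = [-4.0, -1.0, 0.0, 2.0, 5.0]
kernel = [1.0, 1.0]

unfused = [hard_swish(v) for v in conv1d(signal, kernel)]  # two passes
fused = conv1d(signal, kernel, activation=hard_swish)      # one pass
```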

Bug Fixes

  • Fix the runtime crash of an RNN model that is exported and saved after dynamic graph to static graph conversion. (#31846)
  • Fix the error when running inference on multiple consecutive images with oneDNN enabled. (#31837)
  • Fix the accuracy difference between fake quantized models and the oneDNN int8 models deployed on CPU. (#31810)
  • Remove the redundant constraints of SkipLayerNorm fusion. (#32082, #32119)

v2.0.1

3 years ago

2.0.1 Release Note

Important Updates

This version fixes some functional and performance issues in PaddlePaddle 2.0.0 and enhances some features. The important updates are as follows:

  • Provide a new scheme for customizing operators outside the framework, which simplifies how custom operators are written and how they are used in training, inference, and deployment.
  • paddle.save/paddle.static.save allow users to choose the pickle version, which can improve the efficiency of saving models under Python 3.
  • During inference, NVIDIA's deep learning accelerator (DLA) can be used on top of TensorRT.
  • The C++ and Python inference APIs of Paddle Inference natively support Kunlun XPU, in line with PaddlePaddle's training support for XPU.

Training Framework

Function Optimization

API

  • To improve performance, add the aligned parameter to roi_align and the pixel_offset parameter to generate_proposals and distribute_fpn_proposals.
  • paddle.nn.functional.cross_entropy supports float type labels on Kunlun (XPU) devices.
  • Add label error checks to paddle.nn.functional.softmax_with_cross_entropy and improve its error messages.
  • paddle.nn.LayerList supports paddle.nn.LayerList([None]).

Dynamic Graph to Static Graph

  • Add support for a tuple as the loop variable in a for-loop.
  • Add support for indexing a Tensor with an unspecified start or stop, such as x[:] and x[2:].
  • Tensor slices can now be used as lvalues in static graph, so dynamic graph code that uses slicing converts correctly to static graph. Tensor data can be modified by indexing or slicing: the index can be a Python.int, Tensor, or Python.slice; the step can be 1, greater than 1, or negative; the assigned value can be a NumPy.array or a Tensor.
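
The index forms listed above (an integer index, and slices with step 1, a step greater than 1, or a negative step) can be tried out on an ordinary Python list. A list is only a stand-in here: Tensor assignment has its own shape and dtype rules, which this sketch does not model. The `assign` helper is hypothetical.

```python
def assign(values, index, rhs):
    """Return a copy of `values` after performing `values[index] = rhs`."""
    out = list(values)
    out[index] = rhs
    return out

base = [0, 1, 2, 3, 4, 5]

by_int = assign(base, 2, 9)                                  # integer index
by_slice = assign(base, slice(1, 3), [9, 9])                 # step 1
by_stride = assign(base, slice(0, 6, 2), [9, 9, 9])          # step greater than 1
by_negative = assign(base, slice(None, None, -3), [9, 9])    # negative step: positions 5 and 2
```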

Mixed Precision Training

  • Mixed precision training in dynamic graph supports paddle.nn.LayerNorm, which improves efficiency by reducing the number of cast operations.

Distributed Training Optimization

  • Add a pure fp16 strategy to amp in paddle.distributed.fleet.DistributedStrategy.
  • Add paddle.distributed.ProbabilityEntry and paddle.distributed.CountFilterEntry for sparse parameter training.
  • Reduce the number of communications in pipeline parallelism.
  • In parameter server mode, fields such as count/unseen_day can be saved with the model.
  • Add an elimination strategy for sparse parameters in parameter server mode.

Model Saving and Loading

  • paddle.save and paddle.static.save allow users to select the pickle version; the default version is 2. Under Python 3, choosing pickle version 4 or higher can increase saving speed and lets a single file exceed 4 GB. Note that models saved this way must also be loaded and used under Python 3.
  • Formalize the paddle.static.normalize_program interface for users who need to obtain the pruned inference computation graph directly.
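
The pickle version trade-off can be seen with the standard library alone. The sketch below serializes a made-up `state` dictionary with protocol 2 and protocol 4. It does not call paddle.save, and the dictionary contents are purely illustrative.

```python
import pickle

# Illustrative stand-in for a model state; not a real Paddle state dict.
state = {"linear.weight": [[0.1, 0.2], [0.3, 0.4]], "linear.bias": [0.0, 0.0]}

blob_v2 = pickle.dumps(state, protocol=2)  # readable by Python 2 and Python 3
blob_v4 = pickle.dumps(state, protocol=4)  # Python 3.4+ only, supports very large objects

restored = pickle.loads(blob_v4)
```

Protocol 4 was added in Python 3.4 and adds support for very large objects, which is why data saved with it cannot be read by Python 2.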

Complex Number Operation

  • paddle.abs supports Complex64 and Complex128 types.
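
For a complex input a + bi, the absolute value is the magnitude sqrt(a² + b²), which is a real number. A short check with Python's built-in complex type, used here as a stand-in for a complex tensor element:

```python
import math

z = complex(3.0, 4.0)

# Magnitude computed from the real and imaginary parts.
magnitude = math.hypot(z.real, z.imag)

# Python's built-in abs() uses the same definition for complex numbers.
builtin_magnitude = abs(z)
```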

Customized Operator

  • Provide a new scheme for customizing operators outside the framework, which simplifies how custom operators are written and used. It supports two ways of compiling, installing, and calling operators, and it supports both Linux and Windows. Operators customized with the new scheme can be used in dynamic graph, static graph, dynamic-to-static, and inference scenarios. For specific instructions, please refer to the file: Customizing External Operators.

Bug Fixes

API

  • Fix the multi_precision function of paddle.optimizer.AdamW so that regularization is applied to the FP32 master weights, preventing abnormal convergence.
  • Fix the issue that the output of paddle.nn.ELU is nan when the input is nan.
  • Fix the gradient calculation error in dynamic graph multi-card training when Tensor.backward() is used for gradient accumulation.
  • Fix the integer overflow when paddle.nn.functional.softmax_with_cross_entropy processes a Tensor with more than 2^31 elements.
  • Fix the overflow crash when iterating over paddle.nn.Sequential in a for-loop.
  • Fix the incorrect error message of dynamic graph slicing.
  • Fix the issue that batch_size=-1 cannot be used when paddle.nn.functional.local_response_norm is used in static graph or in dynamic-to-static conversion.
  • Fix the computation error of paddle.nn.LayerNorm when the data type is float64.

Distributed Training

  • Fix the issue that the entry (admission) configuration has no effect in parameter server mode.
  • Fix the issue that saved model parameters cannot be loaded in parameter server mode.
  • Fix the abnormal behavior of the profiler in parameter server mode.
  • Fix abnormal training in parameter server mode when data beyond the INT32 type is used.
  • Fix the issue that a long string IP cannot be bound in parameter server mode.
  • Fix excessive log output in distributed training caused by a low LOG level setting.
  • Fix the issue that parameters become inconsistent across cards when if-else control flow is used in dynamic graph distributed training.
  • Fix the issue that FLAGS settings in multi-machine distributed training are inconsistent with those in single-machine training.

Others

  • Fix the error raised by metric_learning finetune under the PaddlePaddle/models repository.
  • Fix the issue that weights are out of sync in XPU static graph multi-card training because operators are missing during scheduling.
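
Two of the fixes in this release concern values beyond the 32-bit integer range (a Tensor with more than 2^31 elements, and data beyond INT32 in parameter server mode). The notes do not describe the underlying code, so the sketch below is only a generic illustration of why such values are a problem when held in a signed 32-bit integer. The `as_int32` helper is hypothetical.

```python
import ctypes

INT32_MAX = 2**31 - 1

def as_int32(n):
    """Reinterpret an integer as a signed 32-bit value, as a C int32_t would store it."""
    return ctypes.c_int32(n).value

num_elements = 2**31            # one more than INT32_MAX
wrapped = as_int32(num_elements)  # wraps around to a negative number
```

A count or offset that wraps to a negative number like this can lead to invalid indexing, which is why code handling very large tensors typically uses 64-bit integers.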

Inference Deployment

Model Quantization

  • Support TRT quantized inference for models quantized with the per-layer method.

Paddle Inference

API

  • Add the API paddle_infer::Config::EnableTensorRtDLA(), which allows NVIDIA's DLA hardware accelerator to be used on top of TensorRT.
  • Paddle-TRT now checks the model inputs. If an input has a variable length, the improved error message prompts users to enable dynamic_shape.

Function Upgrades

  • Support running inference deployment models that contain user-defined operators, and provide User Documentation.
  • The C++ and Python inference APIs add native support for Kunlun XPU, which gives users more complete operator coverage.

Performance Optimization

  • Paddle-TRT adds support for the group_norm op, which speeds up solov2_r50_fpn_1x as follows: on T4 with CUDA 11, cuDNN 8.1, and TRT 7.1.3, compared with version 2.0.0, TRT FP32 inference time drops from 87.019 ms to 75.13 ms (a 13% improvement), and TRT FP16 inference time drops from 72.9253 ms to 44.149 ms (a 65% improvement).
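
The two percentages can be checked against the quoted timings. A latency change can be expressed either as the relative reduction in time or as the relative gain in throughput, and the two conventions give different numbers. The sketch below computes both from the figures above. The function names are illustrative.

```python
def latency_reduction(before_ms, after_ms):
    # Fraction of the original time that was saved.
    return (before_ms - after_ms) / before_ms

def throughput_gain(before_ms, after_ms):
    # Relative increase in inferences per unit time.
    return before_ms / after_ms - 1.0

fp32 = (87.019, 75.13)
fp16 = (72.9253, 44.149)

fp32_reduction = latency_reduction(*fp32)  # about 0.137
fp32_gain = throughput_gain(*fp32)         # about 0.158
fp16_reduction = latency_reduction(*fp16)  # about 0.395
fp16_gain = throughput_gain(*fp16)         # about 0.652
```

On these numbers, the quoted 13% for FP32 is closest to the latency reduction, while the quoted 65% for FP16 matches the throughput gain. The two figures therefore appear to follow different conventions and are not directly comparable.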

Bug Fixes

  • Fix the issue that some operators fail to run under TensorRT 7.1+ (for example, TensorRT inference of the ERNIE model).
  • Fix possible errors when using the Python pass_builder API.
  • Because memory on Jetson is limited, the default GPU/CPU memory allocation strategy is set to auto_growth, which resolves failures of some models caused by insufficient resources.
  • Work around the memory leak of cuDNN 8.0 to ensure availability; this change does not affect other cuDNN versions.
  • Fix the missing MakeCipher symbol in the inference dynamic library.
  • Fix the incorrect mask prediction results of the mask_rcnn_r50_1x_coco model converted from dynamic graph to static graph.
  • Fix the inference failure of segmentation models caused by adaptive pooling not being fully supported by oneDNN.
  • Fix the incorrect results of OCR model inference with oneDNN when batch_size > 1.
  • Fix the freeze_model inference failure caused by an error in the CPU implementation of ReLU.
  • Fix the Python 3 incompatibility of the image-to-binary conversion script used for BF16.

Environment Adaptation

Training Framework

  • Upgrade GCC from 4.8.2 to 5.4 in the Paddle docker images for CUDA 9.0 and CUDA 10.0.
  • Add a Windows wheel package for the Paddle develop version, released daily. Windows users can now run pip --pre for installation.

Paddle Inference

  • Fix the issue that the CUDA 10.2 development docker image cannot compile Paddle with TensorRT, by replacing the powerpc-architecture TensorRT7 with the x86-64-architecture TensorRT6.
  • Rename the Paddle Inference library: the dynamic link library changes from libpaddle_fluid.so to libpaddle_inference.so.