OneFlow Versions

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

v0.9.0

1 year ago

Version 0.9.0

OneFlow v0.9.0 release note

OneFlow v0.9.0 has been released. We welcome you to install the new version for a better experience.

Highlights
Backwards Incompatible Change
New Features
Performance
Improvements
Bug fixes
Documentation
Edge Tools

Highlights

This update contains 640 commits and the following highlights:

With the addition of 86 new API interfaces and operators aligned with PyTorch and the fix of 104 bugs related to operator compatibility, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.
- Allows one-click migration of Stable Diffusion, GLM, YOLOv5, and other models to OneFlow.
- More convenient model migration: oneflow.load can directly load models saved with torch.save.
- With the newly added oneflow.mock_torch module and mock method, OneFlow can migrate complex PyTorch models that span multiple scripts with one click, without changing the original PyTorch scripts.
Global Tensor has added a series of interfaces and methods that are convenient for distributed programming, and fixed known related bugs.
The Graph released a new feature of automatic parallelism (version 1), which supports automatic search for the fastest SBP with a specified Placement. When writing distributed models with Global Tensor, users do not need to consider parallelism.
The Graph adds a series of optimizations related to memory, execution speed, pipeline masking, and compilation speed to improve performance and reduce memory overhead.
The Graph provides a series of functions to aid debugging, including analyzing memory logs, displaying the progress during the compilation stage, and the computation graph.
OneFlow IR provides more compilation optimization functions.
OneFlow's error messages are more user-friendly: they highlight the error content and omit unnecessary internal details, so you can see the location and type of an error at a glance.
A series of operator optimizations and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernel, opening up of multiple memory pools, etc.

Backwards Incompatible Change

To resolve a possible name conflict between Graph.Block.config and a user-defined module attribute module.config, OneFlow redesigned the abstraction of the Graph proxy Module/Tensor, which introduces a breaking change: (https://github.com/Oneflow-Inc/oneflow/pull/9351 , https://github.com/Oneflow-Inc/oneflow/pull/9437 , https://github.com/Oneflow-Inc/oneflow/pull/9607)

The attr and config attributes on Block are removed, and Block is renamed to Proxy.
Implementation plan: when an Eager Module or Tensor is added as a member of nn.Graph, it is wrapped in a Proxy class, and the corresponding GraphModule or GraphTensor is generated. nn.Graph then uses the Proxy for proxy execution when building the graph. While the proxy executes, both the original eager type and the graph type can be obtained from the Proxy. The naming follows that of torch.fx.

	Eager primitive type	Graph type, base class Graph Block	Proxy execution type, the base class is called Proxy
Function	Supporting to get the original eager type	A Graph code block corresponding to GraphBlock stores the information required for graph execution, such as name/scope/lazy op or tensor and optimization switches of some sub-modules on the graph.	Proxy execution capability, using the same execution interface as Module and Tensor, but the behavior has changed, such as lazy, and the op that may be executed has also been rewritten.
Module type	Module	GraphModule	ProxyModule contains a Module member and a GraphModule member
Tensor type	Tensor	GraphTensor	ProxyTensor contains a Tensor member and a GraphTensor member

Here is an example:

    import oneflow as flow
    import oneflow.nn as nn
    from oneflow.nn.graph import GraphModule
    linear = flow.nn.Linear(3, 8, False)
    class LinearGraph(nn.Graph):
        def __init__(self):
            super().__init__()
            # linear is an nn.Module. When added as an attribute of nn.Graph, it is registered with nn.Graph.
            # self.linear is wrapped as a ProxyModule.
            # self.linear.weight is wrapped as a ProxyTensor.
            # nn.Graph uses the ProxyModule to build the graph.
            self.linear = linear
            # There are two parts in ProxyModule, one is the original module and the other is GraphModule.
            self.linear.to(GraphModule)  # Get the corresponding GraphModule, on which you can do configuration related to graph optimization.
            # such as setting a pipeline stage for a module, and enabling pipeline parallelism. 
            self.linear.to(GraphModule).set_stage(id, placement)
            self.linear.to(nn.Module)  # get the corresponding original nn.Module.
            self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor.

Outdated interface in OneFlow v0.8.0:

import oneflow as flow
import oneflow.nn as nn
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.config.set_stage(id, placement)  # set stage
        self.linear.config.activation_checkpointing = True  # set activation checkpointing
        self.linear.origin  # get the corresponding original nn.Module
        self.linear.weight.origin # get the corresponding original Tensor

New interface in OneFlow v0.9.0:

import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.graph import GraphModule
linear = flow.nn.Linear(3, 8, False)
class LinearGraph(nn.Graph):
    def __init__(self):
        super().__init__()
        self.linear = linear
        self.linear.to(GraphModule).set_stage(id, placement)  # set stage
        self.linear.to(GraphModule).activation_checkpointing = True  # set activation checkpointing
        self.linear.to(nn.Module)  # get the corresponding original nn.Module
        self.linear.weight.to(flow.Tensor)  # get the corresponding original Tensor
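
The proxy idea behind this change can be pictured with a small, framework-free sketch. It is only an illustration under stated assumptions: `Proxy`, `GraphConfig`, and `EagerLayer` are hypothetical stand-ins, not OneFlow classes. It shows why selecting a member by type through `to(...)` avoids clashing with a user attribute named `config`.

```python
class GraphConfig:
    """Hypothetical stand-in for graph-only settings (not OneFlow's GraphModule)."""

    def __init__(self):
        self.stage_id = None
        self.activation_checkpointing = False

    def set_stage(self, stage_id):
        self.stage_id = stage_id


class EagerLayer:
    """Hypothetical stand-in for a user module that defines its own `config`."""

    def __init__(self):
        self.config = {"hidden_size": 8}  # user attribute; the proxy never shadows it


class Proxy:
    """Holds the eager object and its graph-side counterpart; `to` picks one by type."""

    def __init__(self, eager):
        self._eager = eager
        self._graph = GraphConfig()

    def to(self, kind):
        if kind is GraphConfig:
            return self._graph
        if isinstance(self._eager, kind):
            return self._eager
        raise TypeError(f"unsupported target type: {kind!r}")


proxy = Proxy(EagerLayer())
proxy.to(GraphConfig).set_stage(1)                      # graph-side configuration
proxy.to(GraphConfig).activation_checkpointing = True
original = proxy.to(EagerLayer)                         # the untouched eager object
```

Because graph settings live on a separate object reached through `to(...)`, the user's own `config` attribute on the eager object stays reachable and unchanged.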

New Features

Graph

Adds the first version of the automatic parallelism feature to Graph: (https://github.com/Oneflow-Inc/oneflow/pull/8891, https://github.com/Oneflow-Inc/oneflow/pull/9172 , https://github.com/Oneflow-Inc/oneflow/pull/9288)
- Automatic parallelism can be enabled by configuring self.config.enable_auto_parallel(True) in Graph. After it is enabled, you don't have to configure sbp, and the Graph will automatically find the optimal sbp combination.
- Here is an example:
```
import oneflow as flow
class SubclassGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__() # MUST be called
        # auto parallelism configuration
        self.config.enable_auto_parallel(True)
        # other configurations about auto parallelism
        # ......

    def build(self):
        pass
```
- For documentation see: https://oneflow.readthedocs.io/en/master/auto_parallel.html
Graph supports a memory-first straighten algorithm, which adjusts the execution order to shorten the lifetime of each Tensor and thereby lower peak memory. (https://github.com/Oneflow-Inc/oneflow/pull/9094)
- The memory-optimized straighten algorithm is enabled with self.config.enable_straighten_algorithm("MemoryFirst").
- The available modes are: "MemoryFirst" / "SpeedFirst" / "Disable" / "OverlapCpuGpu"
- Graph also adds the "OverlapCpuGpu" algorithm, which makes CPU and GPU kernels overlap with each other as much as possible. (https://github.com/Oneflow-Inc/oneflow/pull/9278)
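
To see why execution order alone can change peak memory, here is a toy model. It is not OneFlow's straighten algorithm; the op names, sizes, and the `peak_memory` helper are assumptions made for illustration. Each op allocates its output, and an output is freed right after its last consumer runs.

```python
def peak_memory(order, out_size, inputs):
    """Peak live memory when ops run in `order`.

    out_size[op] is the size of the op's output; inputs[op] lists its producer ops.
    An output is freed right after its last consumer has run.
    """
    position = {op: i for i, op in enumerate(order)}
    last_use = {}
    for op in order:
        for producer in inputs[op]:
            last_use[producer] = max(last_use.get(producer, -1), position[op])

    live = peak = 0
    for i, op in enumerate(order):
        live += out_size[op]
        peak = max(peak, live)
        for producer in inputs[op]:
            if last_use[producer] == i:
                live -= out_size[producer]
    return peak


# Two independent branches (a1 -> a2, b1 -> b2) joined by c.
out_size = {"a1": 4, "a2": 1, "b1": 4, "b2": 1, "c": 1}
inputs = {"a1": [], "a2": ["a1"], "b1": [], "b2": ["b1"], "c": ["a2", "b2"]}

interleaved = peak_memory(["a1", "b1", "a2", "b2", "c"], out_size, inputs)
branch_by_branch = peak_memory(["a1", "a2", "b1", "b2", "c"], out_size, inputs)
```

In this toy graph, finishing one branch before starting the other keeps only one large intermediate alive at a time, so the second order has the lower peak.
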
Graph provides generalized basic transmission, using nccl send/recv to realize fast communication for any NdSbp (2d, 3d,...), thus minimizing the transmission volume. (https://github.com/Oneflow-Inc/oneflow/pull/8437 , https://github.com/Oneflow-Inc/oneflow/pull/8783)
With autograd.Function, Graph is allowed to use custom op (https://github.com/Oneflow-Inc/oneflow/pull/8843).
The Graph optimizer supports param_group["lr_scale"], which lets you configure the learning rate for the parameters of each module/layer. (https://github.com/Oneflow-Inc/oneflow/pull/9138)
Adds the enable_multi_tensor_update optimization. When enabled with self.config.enable_multi_tensor_update(True), it reduces the overhead of updating many small, fragmented parameters during model updates. (https://github.com/Oneflow-Inc/oneflow/pull/9209, https://github.com/Oneflow-Inc/oneflow/pull/9252)
Adds the enable_fused_model_update_cast optimization. When enabled with self.config.enable_fused_model_update_cast(True), it speeds up training by fusing the Optimizer and the fp16 cast when AMP is on. (https://github.com/Oneflow-Inc/oneflow/pull/9209)
Graph supports non-uniform segmentation under ND-SBP. (https://github.com/Oneflow-Inc/oneflow/pull/9310)
Graph supports LazyTensor's indexing feature. (https://github.com/Oneflow-Inc/oneflow/pull/9334)
Adds the enable_compress_memory interface. When enabled with self.config.enable_compress_memory(True), it tries to optimize the device memory of the computation graph by iterating for up to half an hour, eventually finding a minimum close to the lower bound. (https://github.com/Oneflow-Inc/oneflow/pull/9509)
Adds oneflow.utils.global_view.global_mode, which supports smooth migration from single-GPU code to multi-GPU code. global_mode creates a global context that can be switched on and off. It sets the default placement and sbp within the context and supports LocalTensor syntax such as Tensor.device and Tensor.to(device). Source ops created in this context automatically generate a GlobalTensor and fill in the default placement and sbp. This context lets the local-tensor logic in a module be converted to global logic in a non-invasive manner.
- Here is an example:
```
import oneflow as flow
from oneflow.utils.global_view import global_mode

P_C = flow.placement("cpu", ranks=[0, 1])
P = flow.placement("cuda", ranks=[0, 1])
B = flow.sbp.broadcast
S0 = flow.sbp.split(0)
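x = flow.ones((6, 8), placement=P_C, sbp=S0)
# Assumed definition: linear_dp is not defined in the original notes.
linear_dp = flow.nn.Linear(8, 8).to_global(placement=P, sbp=B)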

with global_mode(True, placement=P, sbp=B):
    device = linear_dp.weight.device
    x = x.to(device) # global tensor to device
    out = linear_dp(x)

    # The local tensor will be converted to global
    sample = flow.randn(out.shape, device="cpu").to(device)
```

Debug

Provides comprehensive memory analysis logs V2.0 (https://github.com/Oneflow-Inc/oneflow/pull/8565)
- Setting the environment variable with export GLOG_v=3 shows the full memory analysis log in oneflow.INFO.
- Adds the shape, dtype, life cycle, and allocation/release order of all tensors in each memory block (Chunk, MemBlock), which helps to quickly check whether the tensors that most affect the memory occupied by each block are normal.
- The Checkpointing pass provides a log, recording tensors with Checkpoint.
Adds time_util to record the execution time of each module, the actual physical memory occupied, and the virtual memory occupied. (https://github.com/Oneflow-Inc/oneflow/pull/9164 , https://github.com/Oneflow-Inc/oneflow/pull/9245)
When debug(0) is enabled and the environment variable ONEFLOW_NNGRAPH_ENABLE_PROGRESS_BAR=1 is set, Graph displays a progress bar while the computation graph is compiled on rank 0. (https://github.com/Oneflow-Inc/oneflow/pull/9537)
The default log directory is removed (by default, the directory is no longer created and no log files are written). The log directory and log files are generated only when ONEFLOW_DEBUG_MODE=1 is set. (https://github.com/Oneflow-Inc/oneflow/pull/9552 , https://github.com/Oneflow-Inc/oneflow/pull/9575)

Eager

Adds the map_location parameter to oneflow.load to specify the placement or device onto which the model's tensors are loaded. (https://github.com/Oneflow-Inc/oneflow/pull/8666)
Adds the oneflow.async.thread to allow users to create a new thread for asynchronous programming. (https://github.com/Oneflow-Inc/oneflow/pull/8866 , https://github.com/Oneflow-Inc/oneflow/pull/9039 , https://github.com/Oneflow-Inc/oneflow/pull/9270)
oneflow.save supports saving ddp Module objects directly. (https://github.com/Oneflow-Inc/oneflow/pull/8856)
Adds oneflow.utils.checkpoint to support Checkpointing optimization under eager. (https://github.com/Oneflow-Inc/oneflow/pull/9053)
With the newly added oneflow.mock_torch module and mock method, the effect of one-click migration to oneflow can be realized without changing the original script of import torch. The benefit of this method is that all you need to do is add a new line instead of modifying the imports of files one by one (https://github.com/Oneflow-Inc/oneflow/pull/9160 , https://github.com/Oneflow-Inc/oneflow/pull/9256 , https://github.com/Oneflow-Inc/oneflow/pull/9442 , https://github.com/Oneflow-Inc/oneflow/pull/9473). You can use it with the following code:
```
import torch
from oneflow.mock_torch import mock
mock()
# torch code
# ...
```
- Supports mocks with scope, such as:
```
import torch
from oneflow.mock_torch import mock
with mock.enable():
    # torch code
    # ...
```
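
One general way such a scoped mock can work in Python is to temporarily point a module name in `sys.modules` at a different module. The sketch below illustrates that mechanism only. It is not how oneflow.mock_torch is implemented; `alias_module` and the name `fake_torch` are hypothetical, and the standard-library `math` module stands in for the replacement backend.

```python
import importlib
import sys
from contextlib import contextmanager


@contextmanager
def alias_module(alias, target):
    """Temporarily make `import <alias>` resolve to the module named `target`."""
    missing = object()
    previous = sys.modules.get(alias, missing)
    sys.modules[alias] = importlib.import_module(target)
    try:
        yield
    finally:
        # Restore whatever was registered under the alias before entering.
        if previous is missing:
            del sys.modules[alias]
        else:
            sys.modules[alias] = previous


with alias_module("fake_torch", "math"):
    import fake_torch  # resolves to the standard-library math module

    inside = fake_torch.sqrt(16.0)

still_aliased = "fake_torch" in sys.modules
```

Inside the `with` block the import statement is served from `sys.modules`, so existing `import` lines need no edits. On exit the previous state is restored.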
Supports visual debugging of autograd's backward graph: when the ONEFLOW_DEBUG_MODE=1 environment variable is set, each backward computation writes the AutogradEngine execution graph to a dot file in the log directory. The graph shows the operators executed in the backward pass and their topology, which gives algorithm and R&D personnel an easy way to debug backward problems. (https://github.com/Oneflow-Inc/oneflow/pull/9412)

v0.7.0

2 years ago

OneFlow v0.7.0 Release Notes

OneFlow v0.7.0 has been released. You are welcome to try it, and we would love to hear your feedback!

Chinese version of this article

https://mp.weixin.qq.com/s/dSR-2Xw92eoFhF0c6MtutQ

Highlights

This release has the following highlights:

Provides a Tensor that can be executed in multi-node, multi-GPU scenarios: Global Tensor. It is an easy-to-use solution for distributed execution. It makes it easier to implement various distributed parallel strategies and enables more flexible and user-friendly distributed implementation. It supports models including ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, etc.
Continues to improve nn.Graph. Supports advanced features such as ZeRO, GradAcc, Checkpointing, and Pipelining, and enriches the graph.debug mode. Supports arbitrary 2D SBP conversion, semi-automatic derivation of 2D SBP, resuming training from the last checkpoint, etc. Adds OneFlow Feature Stages identifications and labels each feature of nn.Graph with its stage. For nn.Graph, the basic features are at Beta Stage, which meets most user requirements; advanced features are at Alpha Stage, meeting standard requirements.
Deeply optimizes the performance of Eager mode. The performance of the Swin-Transformer model is 3 times higher than that of v0.6.0 when tested on the V100.
Operator-related improvements: In the single-node single-GPU scenario, OneFlow's compatibility with PyTorch is further improved. The interfaces, semantics, and results of the operators supported by OneFlow are consistent with those of PyTorch, and an automatic testing framework is designed to verify this consistency. With common models, you can accomplish the migration by running import oneflow as torch. Compared with v0.6.0, OneFlow adds 16 operators, optimizes the performance of 6 operators, and fixes bugs in 16 operators.
Supports Einsum and View mechanism.
Compiler-related improvements: OneFlow is officially connected to the MLIR ecosystem.
Releases OneFlow-Serving v0.1.0: We provide an out-of-the-box Triton OneFlow backend docker image.
Releases LiBai v0.1.0, a toolbox for massively distributed parallel training of Transformer. Compared with customized code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode.
Releases Flow-Vision v0.1.0: adds DeiT, ConvNeXt, ReXNet, and other models and updates tutorials and documentation.

OneFlow Feature Stages identifications

OneFlow Feature Stages identifies the maturity level of OneFlow features. It gives users a status description of each feature, covering aspects such as completeness, API stability, and documentation. It also provides OneFlow developers with a standard for feature refinement, which facilitates further improvement.

OneFlow Feature Stages

Stable Stage
- Purpose: release for production use
- Audience: all users
- Functionality: same as RC
- Testing: same as RC
- Performance: same as RC
- API: same as RC, with stability within long cycles (e.g., 1 year) and large versions (e.g., 1.0)
- Documentation: same as RC
Release Candidate (RC) Stage
- Purpose: release for deployment evaluation in production environments
- Audience: all users, including those who want to deploy production environments
- Functionality: being able to handle exceptions as well as normal inputs.
- Testing: end-to-end deployment validated in external environment with good experience
- Performance: provide evaluation reports and documentation to evaluate performance and scalability in external environments
- API: API for external user evaluation
- Documentation: features in this stage are added to the core-feature-set documentation
Beta Stage
- Purpose: release to provide a relatively stable, complete, and available version
- Audience: all users, especially those with strong feature demands, little concern for unknown trivial issues, and willingness to provide feedback
- Functionality: complete functionalities addressing the needs of various possible scenarios
- Testing: complete, covering various corner test cases, and various end-to-end integration tests
- Performance: performance evaluation and scalability evaluation
- API: recognized as complete and stable by seed users after full review
- Documentation: tutorials that describe the usage process
Alpha Stage
- Purpose: release to get early feedback for experimental features
- Audience: developers and expert users
- Functionality: core functionality completed
- Testing: unit testing completed for core requirements of the feature, possibly with unknown bugs
- Performance: evaluated
- API: well-defined but not rigorously reviewed, possibly requiring further changes
- Documentation: API documentation is a must to provide feature definitions
Pre-alpha Stage
- Purpose: release to validate feature prototypes or address urgent needs
- Audience: feature developers
- Functionality: limited prototype functionalities
- Testing: limited testing, possibly with many bugs
- Performance: unknown
- API: prone to changes
- Documentation: possibly none

OneFlow Framework

1. Distribution

Global Tensor

Global Tensor is a newly released set of distributed computing interfaces. It can easily support any parallelism including data parallelism, model parallelism, and pipeline parallelism. Unlike a normal Tensor (hereafter called Local Tensor), Global Tensor is a Tensor with a global view, whose data is distributed in a specific way across a set of devices in a cluster, and each node stores some or all of the Global Tensor's data. Placement and SBP are the basic properties of the Global Tensor that describe the distribution of the data in clusters.

Global Tensor's data distribution

Global Tensor supports three different ways of data distribution, which we collectively refer to as SBP.

Split (dim): The data is split equally along the dim dimension and distributed to each device.
Broadcast: The data is replicated on each device.
PartialSum: The global data is the element-wise sum of the local data on each device.
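
The three distributions can be simulated with plain Python lists standing in for per-device data. This is a minimal sketch, not OneFlow code: the helper names are hypothetical, a 1-D list plays the role of the tensor, and two list entries play the role of two devices.

```python
def split(data, num_devices):
    """Split(0): cut a 1-D list into equal contiguous shards, one per device."""
    size = len(data) // num_devices
    return [data[i * size:(i + 1) * size] for i in range(num_devices)]


def broadcast(data, num_devices):
    """Broadcast: every device holds a full copy."""
    return [list(data) for _ in range(num_devices)]


def gather_split(shards):
    """Rebuild the global data from Split(0) shards by concatenation."""
    return [value for shard in shards for value in shard]


def reduce_partial_sum(shards):
    """Rebuild the global data from PartialSum shards by element-wise addition."""
    return [sum(values) for values in zip(*shards)]


global_data = [1.0, 2.0, 3.0, 4.0]
split_shards = split(global_data, 2)
broadcast_shards = broadcast(global_data, 2)
# One possible PartialSum decomposition: each device holds a full-shape partial value.
partial_shards = [[1.0, 0.0, 3.0, 0.0], [0.0, 2.0, 0.0, 4.0]]
```

Note that under PartialSum every device holds a tensor of the full shape, and only the sum across devices equals the global data.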

Consistent computational interfaces

Global Tensor has basically the same computational interfaces as Local Tensor. With only small changes, you can convert single-GPU code to distributed code.

Local Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x * x

Global Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
            placement=flow.placement("cuda", ranks=[0, 1]),
            sbp=flow.sbp.split(0))
>>> y = x * x
# This multiplication is performed on both rank 0 and rank 1

Supporting conversion between Local Tensor and Global Tensor

With Tensor.to_global interface, you can create a Global Tensor based on a Local Tensor, and regard this tensor as the local tensor of the Global Tensor on the present device.
With Tensor.to_local interface, you can return the local tensor of the Global Tensor on the present device.

Local Tensor To Global Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0])
>>> y = x.to_global(
            placement=flow.placement("cuda", ranks=[0, 1]),
            sbp=flow.sbp.split(0))
>>> y.size()
oneflow.Size([4])
>>> y
tensor([1., 2., 1., 2.],
       placement=oneflow.placement(type="cuda", ranks=[0, 1]),
       sbp=(oneflow.sbp.split(axis=0),), dtype=oneflow.float32)

Global Tensor To Local Tensor:

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0],
            placement=flow.placement("cuda", ranks=[0, 1]),
            sbp=flow.sbp.split(0))
>>> y = x.to_local()
>>> y.size()
oneflow.Size([1])
>>> y
tensor([1.], device='cuda:0', dtype=oneflow.float32)
# tensor([2.], device='cuda:0', dtype=oneflow.float32) if rank is 1

Supporting redistribution of Global Tensor in clusters

With the Tensor.to_global interface, you can redistribute the data of a Global Tensor across the cluster. The data can be moved to another set of nodes, and the way it is distributed within that set can also be changed (i.e., the SBP can be changed). Redistribution usually involves inter-process data communication, but the Tensor.to_global interface hides the complicated low-level communication details.

>>> import oneflow as flow
>>> x = flow.tensor([1.0, 2.0], placement=flow.placement("cuda", ranks=[0, 1]), sbp=flow.sbp.split(0))
>>> y = x.to_global(placement=flow.placement("cuda", ranks=[2, 3]), sbp=flow.sbp.broadcast)

Each OneFlow operator defines a set of SBP signatures for its input and output tensors. Global Tensor supports automatic redistribution to satisfy the SBP signature that an interface requires, as in the following code:

>>> import oneflow as flow
>>> x = flow.randn(4, 4, 
            placement=flow.placement("cuda", ranks=[0, 1]), 
            sbp=flow.sbp.split(0))
>>> y = flow.randn(4, 4, 
            placement=flow.placement("cuda", ranks=[0, 1]), 
            sbp=flow.sbp.split(1))
>>> z = x + y

When x + y is executed, x is split along dimension 0 while y is split along dimension 1, so their local tensors on each device cannot be added directly. Therefore, either x's SBP is automatically converted to flow.sbp.split(1) or y's SBP is converted to flow.sbp.split(0), and the SBP of the result z is flow.sbp.split(1) or flow.sbp.split(0), respectively.
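
The conversion described above can be mimicked with nested lists. In this sketch (hypothetical helpers, two simulated devices, no real communication) y's column shards are turned into row shards so that each device can add its local pieces directly.

```python
def shard_rows(matrix, num_devices):
    """Split(0): each device holds a contiguous block of rows."""
    rows = len(matrix) // num_devices
    return [matrix[d * rows:(d + 1) * rows] for d in range(num_devices)]


def shard_cols(matrix, num_devices):
    """Split(1): each device holds a contiguous block of columns."""
    cols = len(matrix[0]) // num_devices
    return [[row[d * cols:(d + 1) * cols] for row in matrix] for d in range(num_devices)]


def cols_to_rows(col_shards):
    """Convert Split(1) shards to Split(0) shards (stands in for real communication)."""
    num_rows = len(col_shards[0])
    full = [sum((shard[r] for shard in col_shards), []) for r in range(num_rows)]
    return shard_rows(full, len(col_shards))


def add_local(a, b):
    """Element-wise addition of two local shards with identical shapes."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
y = [[float(10 * r + c) for c in range(4)] for r in range(4)]

x_shards = shard_rows(x, 2)                 # like sbp=split(0)
y_col_shards = shard_cols(y, 2)             # like sbp=split(1)
y_shards = cols_to_rows(y_col_shards)       # redistributed to split(0)
z_shards = [add_local(a, b) for a, b in zip(x_shards, y_shards)]
z = [row for shard in z_shards for row in shard]
```

Here the result ends up split along dimension 0. Converting x to column shards instead would give the mirror-image outcome.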

Notes

Global Tensor currently cannot be mixed with the DDP interface.
Global Tensor requires all devices to execute simultaneously, so code with branches can deadlock the processes when their execution paths diverge. We will continue working on this problem.

2. Continued improvement of nn.Graph's features

Overview of the development of nn.Graph v0.7.0

Fundamental features enter Beta Stage, meeting most user requirements;
Advanced features enter Alpha Stage, meeting standard user requirements;
ResNet50, Wide and Deep, GPT, Bert, Swin-Transformer, InsightFace, and other models are supported;

Feature of nn.Graph

Static and dynamic casting of operators under Static Graph moves from Alpha Stage to Beta Stage.
- Adds unit tests of static execution for all legal operators under nn.Graph; automated unit testing is ready;
- Supports more flexible inputs and outputs, including List/Tuple/Dict and their nesting, and fixes the problem of a Tuple return value of size 1;
- Adds automatic backward tests;
Optimizer and LR Scheduler under Static Graph move from Alpha Stage to Beta Stage.
- Adds more built-in LR schedulers, including WarmupLR, CosineAnnealingWarmRestarts, and other common schedulers, and provides SequentialLR and ChainedScheduler so that schedulers can be combined in different ways;
- Refactors the scheduler's get_lr function into a pure function. Computing lr analytically rather than iteratively makes it possible to use schedulers in combination;
- Adds the "is_sparse" parameter to the add_optimizer interface, supporting sparse updates under graph mode. Optimizers that support sparse updates include Adam and SGD, while optimizers under Eager mode don't support sparse updates yet. A subsequent version will support both sparse updates and sparse tensors. The feature is at Pre-alpha Stage;
- Adds a debug print feature for LR and Step, which only requires turning on the LR Scheduler's verbose option.
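
The benefit of a pure, analytical get_lr can be sketched without any framework. The functions below are hypothetical illustrations, not OneFlow's schedulers: each learning rate depends only on the step index, so schedules can be chained simply by choosing which function handles a given step.

```python
def step_lr(base_lr, step, step_size, gamma):
    """Closed-form step decay: depends only on the step index, not on earlier calls."""
    return base_lr * gamma ** (step // step_size)


def linear_warmup(base_lr, step, warmup_steps):
    """Closed-form linear warmup from base_lr / warmup_steps up to base_lr."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def sequential(step, schedules, milestones):
    """Pick the schedule active at `step`; each one sees steps counted from its own start."""
    start = 0
    for schedule, milestone in zip(schedules, milestones):
        if step < milestone:
            return schedule(step - start)
        start = milestone
    return schedules[-1](step - start)


schedules = [
    lambda s: linear_warmup(0.1, s, warmup_steps=5),
    lambda s: step_lr(0.1, s, step_size=10, gamma=0.5),
]
lrs = [sequential(step, schedules, milestones=[5]) for step in range(30)]
```

Because no schedule keeps hidden state, evaluating any step in any order gives the same value, which an iterative `lr = lr * gamma` update cannot guarantee.
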
state_dict and load_state_dict under Static Graph are newly added, which allow resuming training from the last checkpoint. The feature is at Beta Stage;
Debug under Static Graph moves from Alpha Stage to Beta Stage;
- Adds debug(2) and debug(3), which help find problems in nn.Module by locating the Python code of operators at the C++ layer and locating forward graph creation and inference for operators;
- Adds the display of memory overhead;
ZeRO-DP under Static Graph is newly added, which reduces the Optimizer-related memory overhead under data parallelism. The feature is at Alpha Stage;
Global Tensor under Static Graph supports multiple parallel methods, and the feature is between Alpha Stage and Beta Stage;
- It is utilized in LiBai and other model libraries;
- It is widely utilized in OneFlow's model libraries, and unit test coverage is still being extended;
- 1D Global Tensor lets you define only the input tensor's SBP, while the output tensor's SBP is derived automatically with good results. The feature is at Beta Stage;
- 2D Global Tensor lets you define only the input tensor's SBP, while the output tensor's SBP is derived automatically with good results. The feature is at Alpha Stage;
- Conversion from 1D to ND or ND to 1D is newly supported, and the feature is at Alpha Stage;
- Arbitrary conversion of 2D SBP is newly supported, and the feature is at Alpha Stage;
- Testing of 1D&2D single operators is still ongoing, and the feature is at Pre-alpha Stage;
- Selecting SBP with semi-automatic derivation is supported, and the feature is at Pre-alpha Stage;
For Gradient Accumulation under Static Graph, support for Reshape is refactored and repaired, and API documentation is added. The interface currently takes a mini-batch as input; a future version will accept micro-batch input for a better experience. The feature moves from Pre-alpha Stage to Alpha Stage;
For pipeline parallelism under Static Graph, the tutorial is improved, and pipeline parallelism is available in LiBai and other model libraries. The feature is at Beta Stage;
For automatic mixed precision (AMP) under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha Stage to Alpha Stage;
For Activation Checkpointing under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha Stage to Alpha Stage;
For Op Fuse optimization under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha Stage to Alpha Stage;
For XLA/TensorRT/OpenVINO execution under Static Graph, the API documentation is newly added. The feature moves from Pre-alpha Stage to Alpha Stage;
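
As background for the gradient accumulation item above, the following framework-free sketch shows the arithmetic it relies on: with a mean loss and equally sized micro-batches, averaging the micro-batch gradients reproduces the mini-batch gradient. The scalar linear model and function names are assumptions for illustration, not nn.Graph code.

```python
def grad_mse(w, xs, ts):
    """Gradient of mean((w * x - t) ** 2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


def accumulated_grad(w, xs, ts, micro_batch_size):
    """Average the gradients of equally sized micro-batches."""
    starts = range(0, len(xs), micro_batch_size)
    grads = [
        grad_mse(w, xs[s:s + micro_batch_size], ts[s:s + micro_batch_size])
        for s in starts
    ]
    return sum(grads) / len(grads)


xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ts)                                     # whole mini-batch at once
accumulated = accumulated_grad(0.5, xs, ts, micro_batch_size=2)  # two micro-batches
```

The equality holds only when the micro-batches have equal size. With unequal sizes, the gradients would need to be weighted by sample count.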

Tutorials

API Documentation

Tutorials of pipeline parallelism

Model support under nn.Graph

Training ResNet50 with single-node single-GPU or single-node multi-GPU is supported, https://github.com/Oneflow-Inc/models/tree/main/Vision/classification/image/resnet50
Wide and Deep model is supported, https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems/wide_and_deep
GPT, Bert, and Swin Transformer in LiBai are supported, https://github.com/Oneflow-Inc/libai
Functional problems in support for the above models are resolved;

3. Performance optimization of Eager

The performance of Eager is deeply optimized. When running the Swin-Transformer model on V100 GPUs, OneFlow delivers a 25% speedup over PyTorch on a single GPU and a 10% speedup on 8 GPUs;
The communication scheduling policy for NCCL in DDP is optimized;
DDP supports the optimization of AllReduce fuse, reducing additional overhead generated by fragmented AllReduce, with a 5% performance speedup when it is tested on ResNet50;
VM supports the optimization of instruction fusion, significantly saving scheduling overhead of Kernel;
The additional memory overhead incurred when CPU load is too high is reduced;
Eager DataLoader supports the optimization of inter-process memory sharing;
The performance of Clip Grad is optimized;
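
The AllReduce fusion mentioned in this list can be pictured as bucketing: many small gradients are packed into a few larger buffers, and each buffer needs only one communication call. The greedy packer below is a toy under stated assumptions. `bucket_gradients`, the sizes, and the capacity are hypothetical and do not reflect OneFlow's DDP implementation.

```python
def bucket_gradients(grad_sizes, bucket_capacity):
    """Greedily pack gradient sizes (in elements) into buckets of at most bucket_capacity.

    Each bucket stands for one fused AllReduce call. A gradient larger than the
    capacity gets a bucket of its own.
    """
    buckets, current, used = [], [], 0
    for size in grad_sizes:
        if current and used + size > bucket_capacity:
            buckets.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        buckets.append(current)
    return buckets


grad_sizes = [64, 64, 4096, 128, 128, 9000, 32]
buckets = bucket_gradients(grad_sizes, bucket_capacity=4096)
calls_without_fusion = len(grad_sizes)
calls_with_fusion = len(buckets)
```

Fewer calls means less per-call launch and synchronization overhead, which matters most when a model has many small parameters.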

4. Improvements of operators

OneFlow is successfully adapted to oneDNN for CPU operators acceleration.

The performance of CPU operators such as unary and binary element-wise is improved by 4 times, and the speed of Swin-Transformer's dataloader is improved by 2.5 times. https://github.com/Oneflow-Inc/oneflow/pull/7319

Adds the functionality of inter-process shared memory to Dataloader, which greatly improves the performance of DataLoader in DDP.
Adds Bool type Tensor. https://github.com/Oneflow-Inc/oneflow/pull/7523
Implements to_contiguous, which the view mechanism relies on. https://github.com/Oneflow-Inc/oneflow/pull/7670
Adds Scalar div operators. https://github.com/Oneflow-Inc/oneflow/pull/7483
Adds Lamb optimizer. https://github.com/Oneflow-Inc/oneflow/pull/7389
Adds Polynomial Learning Rate Scheduler. https://github.com/Oneflow-Inc/oneflow/pull/7260
Adds tensor_split and as_strided operators. https://github.com/Oneflow-Inc/oneflow/pull/7258 & https://github.com/Oneflow-Inc/oneflow/pull/7275
Adds cumprod operators. https://github.com/Oneflow-Inc/oneflow/pull/7278
Adds Tensor.T() and oneflow.t() operators. https://github.com/Oneflow-Inc/oneflow/pull/7269
Adds normalize operators. https://github.com/Oneflow-Inc/oneflow/pull/7113
Adds the inplace version of div and sub operators. https://github.com/Oneflow-Inc/oneflow/pull/7293
Adds the feature of Module.zero_grad. https://github.com/Oneflow-Inc/oneflow/pull/7587/
Adds the feature of Scalar Tensor being the index to do list indexing. https://github.com/Oneflow-Inc/oneflow/pull/7597
Adds support for Leaky ReLU operators half type. https://github.com/Oneflow-Inc/oneflow/pull/7569
Adds support for mask select operators. https://github.com/Oneflow-Inc/oneflow/pull/7492
Adds non-reduce communication operations such as Bool type Broadcast and Allgather. https://github.com/Oneflow-Inc/oneflow/pull/7366
Develops autotest that supports eager global based on an autotest framework. https://github.com/Oneflow-Inc/oneflow/pull/7204
Optimizes performance for ReduceSum CUDA Kernel. https://github.com/Oneflow-Inc/oneflow/pull/7684
Optimizes CUDA Kernel of gather operators. https://github.com/Oneflow-Inc/oneflow/pull/7351
Optimizes the performance for CUDA Kernel of MaxPool and AvgPool operators in NCHW. https://github.com/Oneflow-Inc/oneflow/pull/7426 & https://github.com/Oneflow-Inc/oneflow/pull/7451
Optimizes the backward computing of PReLU operators, which can save more memory in general. https://github.com/Oneflow-Inc/oneflow/pull/7600
Optimizes backward Kernel of LayerNorm to further save memory. https://github.com/Oneflow-Inc/oneflow/pull/6996
Supports passing single int in stride and dilation in Conv1D/2D/3D and DeConv1D/2D/3D Kernel. Adds Tensor.zero_() interface that aligns with PyTorch tensor.norm, torch.max and torch.min. Supports inplace in flow.nn.functional.dropout. https://github.com/Oneflow-Inc/oneflow/pull/7593
Fixes bug where the BatchNorm module raises an error when affine=False. https://github.com/Oneflow-Inc/oneflow/pull/7755
Fixes Maximum and Minimum backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7519
Fixes bug where the result of var operators is unexpected in some cases. https://github.com/Oneflow-Inc/oneflow/pull/7517
Fixes incorrect behavior of Tensor deepcopy. https://github.com/Oneflow-Inc/oneflow/pull/7490
Fixes bug in slice operators where the input index is a scalar tensor. https://github.com/Oneflow-Inc/oneflow/pull/7479
Fixes bug where BinaryCrossEntropy can produce nan in half. https://github.com/Oneflow-Inc/oneflow/pull/7476
Fixes bug where an error is raised when the base and exponent of pow operators are respectively real number type and Tensor type. https://github.com/Oneflow-Inc/oneflow/pull/7729
Fixes stack operators backward bug. https://github.com/Oneflow-Inc/oneflow/pull/7363
Fixes inefficiency problem caused by CPU synchronization when clip grad is executed on CUDA with the default configuration. https://github.com/Oneflow-Inc/oneflow/pull/7304
Fixes the SBP inference of Batch Gather and Unsorted Batch Segment Sum operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7590
Fixes Physical Shape inference of Affine Grid operators, fixes the unexpected result bug in some SBP cases, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7578
Fixes the problem that arange operators don't support generating 0 size tensor, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7576
Fixes the incorrect SBP inference of flip operators, and runs the global unittest successfully. https://github.com/Oneflow-Inc/oneflow/pull/7496
Fixes advanced indexing and zeroslike operators SBP bugs. https://github.com/Oneflow-Inc/oneflow/pull/7238
Fixes bug where Eager global inplace might not be successful. https://github.com/Oneflow-Inc/oneflow/pull/7348
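
As a rough illustration of the Polynomial Learning Rate Scheduler listed above, the sketch below implements the commonly used polynomial decay formula in plain Python. The function name `polynomial_lr` and its parameters (`decay_steps`, `end_lr`, `power`) are assumptions made for this example and are not taken from the OneFlow API; the linked PR is the reference for the actual interface.

```python
def polynomial_lr(base_lr, step, decay_steps, end_lr=0.0, power=1.0):
    """Common polynomial decay from base_lr down to end_lr.

    Hypothetical helper for illustration; names are assumptions, not OneFlow's.
    """
    # Clamp the step so the rate stays at end_lr once decay_steps is reached.
    step = min(step, decay_steps)
    remaining = 1.0 - step / decay_steps
    return (base_lr - end_lr) * remaining ** power + end_lr


# With power=1.0 the rate falls linearly from 0.1 to 0.0 over 100 steps.
schedule = [polynomial_lr(0.1, s, decay_steps=100) for s in (0, 50, 100, 150)]
```

A larger `power` makes the rate fall faster early on and flatten near the end of the schedule.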

5. Supporting einsum & view mechanism

Adds einsum operators. einsum provides a concise and elegant set of rules that can express tensor operations including, but not limited to, inner product, outer product, tensor multiplication, tensor transposition, and tensor contraction. Proficient use of einsum makes complex tensor operations easier to implement and less error-prone. https://github.com/Oneflow-Inc/oneflow/pull/7526
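
To make the subscript rules concrete, here is a minimal pure-Python reference over nested lists. It is only a sketch of the notation, not OneFlow's implementation: `tiny_einsum` is a hypothetical helper, it requires an explicit `->`, and it does not support ellipsis. In practice the einsum operator added in the PR above would be used with the same kind of equation strings.

```python
from itertools import product


def tiny_einsum(equation, *operands):
    """Reference einsum over nested lists (illustrative, explicit '->' only)."""
    inputs, output = equation.replace(" ", "").split("->")
    terms = inputs.split(",")

    # Record the size of every index label from the operand shapes.
    sizes = {}
    for term, operand in zip(terms, operands):
        level = operand
        for label in term:
            sizes[label] = len(level)
            level = level[0]
    # Labels missing from the output are summed over.
    summed = [label for label in sizes if label not in output]

    def element(operand, term, env):
        for label in term:
            operand = operand[env[label]]
        return operand

    def build(labels, env):
        if not labels:
            total = 0
            for combo in product(*(range(sizes[label]) for label in summed)):
                env.update(zip(summed, combo))
                value = 1
                for term, operand in zip(terms, operands):
                    value *= element(operand, term, env)
                total += value
            return total
        head, rest = labels[0], labels[1:]
        return [build(rest, {**env, head: i}) for i in range(sizes[head])]

    return build(output, {})


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
inner = tiny_einsum("i,i->", [1, 2, 3], [4, 5, 6])   # 32
outer = tiny_einsum("i,j->ij", [1, 2], [3, 4])       # [[3, 4], [6, 8]]
matmul = tiny_einsum("ij,jk->ik", a, b)              # [[19, 22], [43, 50]]
transposed = tiny_einsum("ij->ji", a)                # [[1, 3], [2, 4]]
```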

Adds view mechanism. The view mechanism allows common operators to reuse/share a Tensor's memory, which saves memory and avoids the Kernel Launch/Compute process. At present, view operators that do not change the tensor.is_contiguous() property have been added, such as reshape, view, squeeze, and unsqueeze: https://github.com/Oneflow-Inc/oneflow/pull/7503. More view operators will be added later (such as transpose, permute, narrow, expand, and unfold).
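
The idea behind a view can be sketched without any framework: a view is only shape and stride metadata over a shared buffer, so creating one copies no data. The `StridedView` class below is a hypothetical illustration in plain Python and does not reflect OneFlow's internal data structures.

```python
class StridedView:
    """Shape/stride metadata over a shared flat buffer (illustrative only)."""

    def __init__(self, storage, shape, strides=None):
        self.storage = storage  # shared Python list, never copied
        self.shape = tuple(shape)
        if strides is None:
            strides = self._contiguous_strides(self.shape)
        self.strides = tuple(strides)

    @staticmethod
    def _contiguous_strides(shape):
        # Row-major layout: the last dimension moves one element at a time.
        strides, step = [], 1
        for dim in reversed(shape):
            strides.append(step)
            step *= dim
        return tuple(reversed(strides))

    def is_contiguous(self):
        return self.strides == self._contiguous_strides(self.shape)

    def reshape(self, *shape):
        # Only metadata changes; the new view points at the same storage.
        assert self.is_contiguous(), "this sketch only reshapes contiguous views"
        size = 1
        for dim in shape:
            size *= dim
        assert size == len(self.storage), "element count must stay the same"
        return StridedView(self.storage, shape)

    def _offset(self, index):
        return sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.storage[self._offset(index)]

    def __setitem__(self, index, value):
        self.storage[self._offset(index)] = value


base = StridedView(list(range(6)), (2, 3))
flat = base.reshape(6)      # no copy: same storage, new shape
flat[(4,)] = 40             # a write through the view ...
shared = base[(1, 1)]       # ... is visible through the original (40)
```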

6. Improvements of the compiler

OneFlow is officially connected to the MLIR ecosystem, and the OneFlow Dialect component is complete. Successfully completes the RoundTrip between OneFlow Job (the computation graph of OneFlow nn.Graph) and MLIR, and runs RoundTrip tests on all OneFlow operators in the CI process.
Implements static graph optimization with a series of automatic fused operators based on MLIR DRR to accelerate OneFlow model training and inference.

7. OneFlow Serving

OneFlow Serving v0.1.0 comes out with the following features:

Provides OneFlow C++ API used for inference, supporting model loading and static graph inference.
The model weights and the computation graph in MLIR format can be saved simultaneously by running flow.save(graph) in Python. They can be loaded through the C++ API (loading the computation graph is not yet supported in the Python API).
Supports inference of OneFlow model using TensorRT and OpenVINO automatically without model conversion (based on OneFlow XRT module), achieving better acceleration on NVIDIA GPU and Intel CPU.
Implements Triton OneFlow backend
- Provides out-of-the-box Docker image.
- Supports auto configuration: only the model path needs to be given, and no Triton configuration file needs to be written.
You are welcome to try the project deployed with the Triton OneFlow backend on OneFlow Cloud Platform.

8. LiBai

LiBai is a toolbox for massively distributed parallel training of Transformer. Compared with custom code bases such as Megatron-LM, LiBai provides a series of models and training components for distributed training based on a modular design, aiming to make models trained in distributed mode as convenient as in single-GPU mode. The 0.1.0 version mainly supports the following features and models:

Features:

Data Parallelism
1D Tensor Parallelism
Pipeline Parallelism
Unified Distributed Layers
Extensible for new parallelism
Mixed Precision Training
Activation Checkpointing
Gradient Accumulation
Gradient Clip
ZeRO
More flexible "LazyConfig" configuration system
Easy-to-use Trainer and Evaluator
Data preprocessing supporting images and texts

Models:

Bert (3D Parallelism)
GPT-2 (3D Parallelism)
ViT (3D Parallelism)
Swin-Transformer (Data Parallelism)
Supports fine-tuning tasks in projects/
Supports text classification tasks in projects/

9. flow-vision

flowvision 0.1.0 stable version comes out with the following improvements based on the previous version:

Adds initialization method trunc_normal_
Adds DeiT model and rebuilds VisionTransformer model
Adds ConvNeXt model
Adds ReXNet model
Supports the learning rate schedulers PolyLRScheduler and TanhLRScheduler
Fixes the use of F.normalize in SSD model
Fixes bugs in EfficientNet and Res2Net
Fixes weights problem in vit_small_patch32_384 and res2net50_48w_2s models
Rebuilds model zoo and runs more complete tests on existing models
Rebuilds load_state_dict_from_url method to automatically save the downloaded weights in the cache folder
Improves documents about Getting Started and flowvision.models

The 0.2.0 version of flowvision is already in progress. A large number of new models will be added based on the 0.1.0 version, and the documentation will be improved, so stay tuned.

v0.6.0

2 years ago

OneFlow v0.6.0 Release Notes

OneFlow has been open source for 528 days, since July 31, 2020. Today OneFlow v0.6.0 is released. You are welcome to try it, and we would love to hear your feedback!

This version mainly updates three parts: framework, model, and OneFlow-ONNX. Highlights include:

Performance optimization in static graphs, dynamic graphs, operators, memory occupation, etc
A larger number of common operators
Improvements in static graphs and ConsistentTensor
Serving functionality as Nvidia Triton's backend
Richer visual pre-training models similar to torchvision and timm
Better OneFlow-ONNX conversion functionality

The following are the detailed release notes.

Framework

1. Performance Optimization of nn.Graph

Compared to v0.5.0, nn.Graph in v0.6.0 delivers a 10% speedup in training on models such as ResNet AMP and WDL, etc
- Optimized nn.Graph's performance in high frequency iterative training scenarios
- Redesigned the scheduling instructions of nn.Graph and refactored the interaction logic between Actor Graph and Eager VM so that the runtime execution of the Graph is asynchronous and parallel to Python input/output Tensor as much as possible

2. Performance Optimization of Eager

Compared to v0.5.0, v0.6.0 OneFlow Eager's training speed increases dramatically in small batch scenarios
- Optimized the scheduling logic for virtual machines
- Optimized get/set item
- Optimized tensor.numel()
- Optimized oneflow.Size()

3. Performance Optimization of Operators

Optimized some operators that affect the performance of new models, significantly improving the training speed of these models
- Added fused dropout operators
- Added CPU-version group deconv and optimized its performance
- Added inplace-version implementation for operators mul, hard_sigmoid, and sin
- Optimized performance for linalg.vector_norm when ord=2.0, making it 4 times faster than before
- Deeply optimized the LayerNorm operator, making its performance greatly better than PyTorch and Apex implementation. For more information, refer to How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization
- Realized automatic type promotion of operators. For more information, refer to Automatic Type Promotion of Operators in OneFlow

4. Performance Optimization of Eager's Memory Occupation

Optimized some operators' memory occupation during network training, allowing the same computing device to run bigger models or more data
- Optimized the backward memory occupation of broadcast binary operators
- Optimized the backward memory occupation of Slice operator
- Optimized the memory occupation of LayerNorm operator

5. More Useful Features to Static Computation Graph (nn.Graph)

The newly added features are related to the efficiency, debugging, completeness, and usability of static graphs
- To help the debugging of static graphs, we added the following features:
  - Debug mode supports graph.debug(1), which prints more information about graph construction
  - Provided the environment variable ONEFLOW_DEBUG_PASS to show the changes in the computation graph before and after compile-time optimization
  - Added user-readable thread naming information to Nsight Profile for locating and retrieving target key thread locations
  - Added many static graph test cases and added automatic nn.Graph tests that accompany Eager tests
- Provided graph.save() and load() interfaces to support the deployment of models (Serving) using nn.Graph
- To accelerate AMP on GPUs that use TensorCore, the environment variable ONEFLOW_ENABLE_NHWC is provided to make CNN-related operators compute in channels-last format
- Enabled nn.Graph to support more usage scenarios:
  - Supported Sparse Update Optimizer for sparse update of parameters in WDL scenarios
  - Supported using the following nn.Module Containers with nn.Graph: Sequential, ModuleList, ModuleDict, ParameterList, and ParameterDict
  - Supported creating Optimizer in the init function of nn.Graph
  - Supported multiple parameters sharing the same Tensor with nn.Graph
  - Supported scenarios where the actual number of processes is greater than the number of GPU devices
  - Supported more Inplace execution for Consistent SBP inference under nn.Graph

6. A Larger Number of Operators

Newly added operators: cumsum, meshgrid, linspace, diagonal, movedim, roialign, nms, arccos, and roll
Newly added operators: masked_fill, floordiv, glu, pool1d, pool2d, and pool3d
Newly added unfold and fold operators: Adding Unfold and Fold Ops into OneFlow
Achieved automatic data type promotion of operators: Automatic Type Promotion of Operators in OneFlow
Added expand and repeat operators: Added Expand and Repeat Operators into OneFlow
Supported one-click switching for the current torchvision library models by the command import oneflow as torch

7. User-Defined autograd.Function

Users can customize autograd.Function just like using Torch.

8. Added Basic Serving Functionality

Serving functionality of models is provided by OneFlow as Nvidia Triton's backend.

9. Added Some Functionalities of Tensor (ConsistentTensor)

Supported Tensor using 2-D SBP to represent arbitrary hybrid parallelism (such as a Linear operation that runs data parallelism in the row direction of the device matrix and model parallelism in the column)
Supported Tensor's conversion from arbitrary 1-D SBP to 2-D SBP (the network consists of a mixture of 1-D parallel and 2-D parallel)
Supported constructing ConsistentTensor from numpy
oneflow.from_numpy()
oneflow.numel()
tensor.expand_as()
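
The hybrid-parallel Linear mentioned in the first item above can be pictured with a small simulation. The sketch below uses plain nested lists and a simulated 2x2 device matrix; it calls no OneFlow API, and all helper names are hypothetical. The input is split by rows across device rows (data parallelism) and the weight is split by columns across device columns (model parallelism), which roughly corresponds to a 2-D SBP of (split(0), broadcast) for the input and (broadcast, split(1)) for the weight.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_rows(m, parts):
    size = len(m) // parts
    return [m[i * size:(i + 1) * size] for i in range(parts)]


def split_cols(m, parts):
    size = len(m[0]) // parts
    return [[row[i * size:(i + 1) * size] for row in m] for i in range(parts)]


x = [[1, 2], [3, 4], [5, 6], [7, 8]]   # batch of 4 samples, 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 2]]       # 2 input features, 4 output features

rows, cols = 2, 2                       # simulated 2x2 device matrix
x_shards = split_rows(x, rows)          # batch split across device rows
w_shards = split_cols(w, cols)          # weight columns split across device columns

# Device (r, c) holds x_shards[r] and w_shards[c] and computes one output tile.
tiles = [[matmul(x_shards[r], w_shards[c]) for c in range(cols)] for r in range(rows)]

# Stitch the tiles: concatenate along columns within a device row, then stack rows.
y = [
    sum((tiles[r][c][i] for c in range(cols)), [])
    for r in range(rows)
    for i in range(len(tiles[r][0]))
]
```

The stitched result `y` equals `matmul(x, w)` computed on a single device, while each simulated device only ever touches half of the batch and half of the weight.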

Model

Released flowvision 0.0.54.

1. Richer Visual Pre-training Models

Image Classification

CNN series: ResNet, DenseNet, VGG, ResNext, EfficientNet, etc
Vision Transformer series: ViT, PVT, Swin-Transformer, etc
Vision MLP series: Mlp-Mixer, Res-MLP, g-MLP, etc

Object Detection

SSD, SSDLite
Faster R-CNN
RetinaNet

Image Segmentation

FCN
DeepLabV3

Style Transfer

StyleNet: Supports the styles sketch, candy, mosaic, rain_princess, and undie

2. Implemented Data Augmentation Operations Similar to torchvision

For data augmentation operations such as CenterCrop and ColorJitter, which mirror torchvision, developers can run import flowvision as torchvision to use them in most scenarios.

3. Implemented Advanced Data Augmentation Operations Similar to timm

Advanced data augmentation operations implemented in flowvision.data:

Mixup
CutMix
Random-Erasing
AutoAugment
RandAugment
AugMix
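
For readers unfamiliar with these augmentations, the sketch below shows the standard Mixup formulation on plain Python lists: two samples and their one-hot labels are blended with a weight drawn from a Beta distribution. The function `mixup` and its parameters are hypothetical and written for this example; this is not the flowvision.data interface.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels (standard Mixup, sketch only)."""
    # The mixing weight lies in [0, 1] and is drawn from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Two toy "images" (flattened pixels) with one-hot labels for two classes.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0], rng=rng)
```

Because the labels are mixed with the same weight as the inputs, the mixed label still sums to 1 and can be used directly as a soft target.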

4. Separated the Layers Module and Provided a Plug-and-play Block when Building a Model

flowvision.layers.attention

Implemented plug-and-play attention models like Non-Local, SELayer, CBAM, BAM, ECA, etc

flowvision.layers.blocks

Provided modules that might be used for model building like PatchEmb, Pooler, ConvBnAct, etc

flowvision.layers.regularization

Provided regularization modules such as drop-path, drop-block, and stochastic depth to improve model generalization ability
Provided separate files such as activation and weight_init to improve components like activation functions and initialization methods
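
As an illustration of what a drop-path style regularizer does, the following plain-Python sketch zeroes the whole residual-branch output of a sample with probability `drop_prob` during training and rescales the surviving samples. The function `drop_path` is a hypothetical stand-in written for this example and is not the flowvision.layers.regularization implementation.

```python
import random


def drop_path(branch_outputs, drop_prob, training=True, rng=random):
    """Randomly zero per-sample branch outputs (illustrative sketch)."""
    if not training or drop_prob == 0.0:
        return [list(sample) for sample in branch_outputs]
    keep_prob = 1.0 - drop_prob
    result = []
    for sample in branch_outputs:
        if rng.random() < keep_prob:
            # Rescale survivors so the expected value is unchanged.
            result.append([value / keep_prob for value in sample])
        else:
            result.append([0.0 for _ in sample])
    return result


batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dropped = drop_path(batch, drop_prob=0.5, rng=random.Random(0))
unchanged = drop_path(batch, drop_prob=0.5, training=False)
```

At evaluation time (`training=False`) the branch output passes through untouched, so the regularizer only affects training.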

OneFlow-ONNX Conversion

Updated OneFlow to ONNX toolkit:

Supported OneFlow model converting to ONNX model in CPU or GPU mode
Added test cases for operators and models to align all classification models in OneFlowVision library
Fixed onnx-runtime bugs during PReLU conversion
Compatible with v1.9.0 onnx-runtime library or later versions
Released v0.5.4 oneflow-onnx package; developers can run pip install oneflow-onnx to try it

v0.5.0

2 years ago

v0.5rc2

2 years ago

v0.5rc1

2 years ago

v0.3.0

2 years ago

v0.5.0b1

2 years ago