DeepSparse Versions

Sparsity-aware deep learning inference runtime for CPUs

v1.4.0

1 year ago

New Features:

OpenPifPaf deployment pipelines support (#788)
VITPose example deployment pipeline (#794)
DeepSparse Server logging with support for metrics, timings, and input/output values through Prometheus (#821, #791)

Changes:

Inference speed improved by up to 20% on dense FP32 BERT models.
Inference speed improved by up to 50% on quantized EfficientNetV1 and by up to 10% on quantized EfficientNetV2.
YOLOv5 integration upgraded to the latest upstream.

Resolved Issues:

DeepSparse no longer improperly detects each core as belonging to its own socket on some virtual machines, including those on OVHcloud.
Running networks that contain a quantized depthwise convolution with a nontrivial w_zero_point parameter no longer produces an assertion failure. Here, trivial means that the zero point equals 128 for uint8 data or 0 for int8 data.
An assertion failure in executable_buffer.cpp no longer occurs (see https://github.com/neuralmagic/deepsparse/issues/899).
In quantized transformer models, a rare assertion failure no longer occurs.
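For context on the zero-point fix: in ONNX's QLinearConv, w_zero_point is the zero point of the quantized weights, and linear dequantization (as in ONNX DequantizeLinear) computes (q - zero_point) * scale. The minimal Python sketch below checks whether a zero point is "trivial" in the sense defined above. It assumes NumPy arrays for the parameters, and the helper names are hypothetical, not part of DeepSparse.

```python
import numpy as np

# Zero points the release note calls "trivial": the centre of the uint8 range
# (128) and 0 for int8, i.e. the values used by symmetric quantization.
TRIVIAL_ZERO_POINT = {np.dtype(np.uint8): 128, np.dtype(np.int8): 0}


def is_trivial_zero_point(w_zero_point: np.ndarray) -> bool:
    """Return True if every zero point equals the trivial value for its dtype.

    Raises KeyError for dtypes other than uint8 and int8.
    """
    expected = TRIVIAL_ZERO_POINT[w_zero_point.dtype]
    return bool(np.all(w_zero_point == expected))


def dequantize(q: np.ndarray, scale: float, zero_point: np.ndarray) -> np.ndarray:
    """Linear dequantization, (q - zero_point) * scale, computed in int32 to avoid wraparound."""
    return (q.astype(np.int32) - zero_point.astype(np.int32)) * scale
```

A nontrivial zero point (for example, 3 for int8) shifts every weight before scaling, so a kernel that assumes the trivial value cannot simply skip the subtraction.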

Known Issues:

None

v1.3.2

1 year ago

This is a patch release for 1.3.0 that contains the following changes:

Softmax operators from ONNX Opset 13 and later now behave correctly in DeepSparse. Previously, the semantics of Softmax from ONNX Opset 11 were applied, which would result in incorrect answers in some cases.
Quantized YOLOv8 models are now supported in DeepSparse. Previously, the user would have encountered an assertion failure.
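The Softmax difference comes from the ONNX operator specification. Up to opset 11, the input is coerced into a 2D tensor at `axis` (default 1), and the normalization runs over all of the flattened trailing dimensions. From opset 13, the normalization runs along the single given axis (default -1). The NumPy sketch below reproduces both definitions to show where they diverge. It illustrates the ONNX semantics only, not DeepSparse's implementation.

```python
import numpy as np


def softmax_opset13(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Opset 13+: normalize along a single axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)


def softmax_opset11(x: np.ndarray, axis: int = 1) -> np.ndarray:
    """Opset <= 11: coerce to 2D [prod(shape[:axis]), prod(shape[axis:])]
    and normalize over the whole flattened second dimension."""
    axis = axis if axis >= 0 else axis + x.ndim
    rows = int(np.prod(x.shape[:axis]))
    flat = x.reshape(rows, -1)
    e = np.exp(flat - flat.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)).reshape(x.shape)


x = np.arange(4, dtype=np.float64).reshape(1, 2, 2)
print(softmax_opset11(x, axis=1).sum())  # all 4 values normalized together
print(softmax_opset13(x, axis=1).sum())  # each of the 2 columns sums to 1
```

For a 2D input with axis=1 (the last axis), the two definitions coincide. This is consistent with the note that applying the old semantics gave incorrect answers only in some cases.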

v1.3.1

1 year ago

This is a patch release for 1.3.0 that contains the following changes:

Performance on some unstructured sparse quantized YOLOv5 models has been improved. This fixes a performance regression compared to DeepSparse 1.1.
DeepSparse no longer throws an exception when it cannot determine L3 cache information and instead logs a warning message.
An assertion failure on some compound sparse quantized transformer models has been fixed.
Models with ONNX opset 13 Squeeze operators no longer exhibit poor performance, and DeepSparse now benefits from sparsity when running them.
NumPy version pinned to <=1.21.6 to avoid deprecation warnings and index errors in pipelines.
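The 1.3.1 change replaces an exception with a logged warning when L3 cache information is unavailable. The minimal Python sketch below shows that fallback pattern. It assumes a Linux host that exposes cache details under /sys/devices/system/cpu/cpu0/cache/. The function names and the choice to return None are hypothetical, and this is not how DeepSparse itself queries the cache.

```python
import glob
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)


def parse_cache_size(text: str) -> int:
    """Convert a sysfs cache size string such as "36608K" to bytes."""
    text = text.strip()
    units = {"K": 1024, "M": 1024 ** 2}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)  # raises ValueError on empty or malformed input


def read_l3_cache_bytes(cache_dir: str = "/sys/devices/system/cpu/cpu0/cache") -> Optional[int]:
    """Return the L3 cache size in bytes, or log a warning and return None."""
    try:
        for index_dir in sorted(glob.glob(os.path.join(cache_dir, "index*"))):
            with open(os.path.join(index_dir, "level")) as f:
                if f.read().strip() != "3":
                    continue
            with open(os.path.join(index_dir, "size")) as f:
                return parse_cache_size(f.read())
        raise FileNotFoundError(f"no level-3 cache entry under {cache_dir}")
    except (OSError, ValueError) as exc:
        # Degrade gracefully: warn instead of propagating the exception.
        logger.warning("Could not determine L3 cache information: %s", exc)
        return None
```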

v1.3.0

1 year ago

New Features:

Bfloat16 is now supported on CPUs with the AVX512_BF16 extension. Users can expect performance improvements of up to 30% for sparse FP32 networks and up to 75% for dense FP32 networks. This feature is opt-in and is specified with the default_precision parameter in the configuration file.
Several options can now be specified using a configuration file.
Max and min operators are now supported for performance.
SQuAD 2.0 support provided.
NLP multi-label and eval support added.
Fraction of supported operations property added to engine class.
New ML Ops logging capabilities implemented, including metrics logging, custom functions, and Prometheus support.
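Bfloat16 keeps FP32's sign bit and 8 exponent bits but only 7 mantissa bits, so converting a value amounts to keeping the upper 16 bits of its FP32 bit pattern after rounding. The stdlib sketch below shows that conversion with round-to-nearest-even. It illustrates the number format only, not DeepSparse's kernels, and it does not special-case NaN inputs.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round an FP32 value to bfloat16 (round-to-nearest-even) and return its 16-bit pattern.

    NaN inputs are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(b: int) -> float:
    """Widen a bfloat16 bit pattern back to FP32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625
```

The range of FP32 is preserved, but only about three significant decimal digits survive, which is the precision given up in exchange for the speedup.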

Changes:

Minimum Python version set to 3.7.
The default logging level has been changed to warn.
Timing functions and a default no-op deallocator have been added to improve usability of the C++ API.
DeepSparse now accepts the axes parameter as either an input or an attribute in several ONNX operators.
Model compilation times have been improved on machines with many cores.
YOLOv5 pipelines upgraded to the latest upstream version from Ultralytics.
Transformers pipelines upgraded to the latest upstream version from Hugging Face.
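Several ONNX operators moved axes from an attribute to an input over time. For example, Squeeze and Unsqueeze take axes as an input starting with opset 13. The sketch below normalizes both forms into one list of non-negative axes. It uses a hypothetical dict-and-list node representation rather than the onnx protobuf classes, and `resolve_axes` is an illustrative name, not a DeepSparse function.

```python
from typing import Dict, List, Optional, Sequence

import numpy as np


def resolve_axes(attributes: Dict[str, Sequence[int]],
                 inputs: Sequence[Optional[np.ndarray]],
                 rank: int,
                 axes_input_index: int = 1) -> Optional[List[int]]:
    """Return sorted, non-negative axes from either form, or None if absent."""
    if "axes" in attributes:  # older opsets: axes stored as an attribute
        axes = [int(a) for a in attributes["axes"]]
    elif len(inputs) > axes_input_index and inputs[axes_input_index] is not None:
        axes = [int(a) for a in inputs[axes_input_index]]  # opset 13+: axes as an input
    else:
        return None
    return sorted(a + rank if a < 0 else a for a in axes)


x = np.zeros((1, 3, 1))
axes = resolve_axes({}, [x, np.array([0, -1])], rank=x.ndim)
print(np.squeeze(x, axis=tuple(axes)).shape)  # (3,)
```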

Resolved Issues:

DeepSparse no longer crashes with an assertion failure for softmax operators on dimensions with a single element.
DeepSparse no longer crashes with an assertion failure on some unstructured sparse quantized BERT models.
Image classification evaluation script no longer crashes for larger batch sizes.

Known Issues:

None

v1.2.0

1 year ago

New Features:

DeepSparse Engine Trial and Enterprise Editions now available, including license key activations.
DeepSparse Pipelines now support the NLP document classification use case.

Changes:

Mock engine tests added to enable faster and more precise unit tests in pipelines and Python code.
DeepSparse Engine benchmarking updated to use time.perf_counter for more accurate benchmarks.
Dynamic batch reimplemented more generically so that it can support any pipeline.
Minimum Python version changed to 3.7 as 3.6 reached EOL.
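Python documents time.perf_counter as the clock with the highest available resolution for measuring short durations, which suits timing individual inference calls better than time.time. The minimal benchmarking loop below uses it. `run_inference` stands in for any callable (hypothetical), and the warmup and iteration counts are arbitrary choices rather than DeepSparse's defaults.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run_inference: Callable[[], object],
              warmup: int = 5,
              iterations: int = 50) -> Dict[str, float]:
    """Time a callable with time.perf_counter and return latency statistics in ms."""
    for _ in range(warmup):  # untimed runs to absorb one-off initialization costs
        run_inference()
    latencies_ms: List[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "min_ms": min(latencies_ms),
    }


print(benchmark(lambda: sum(range(10_000))))
```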

Performance:

Performance improved for unstructured sparse quantized convolutional neural networks in throughput use cases.

Resolved Issues:

In the C++ interface, the engine no longer crashes with a segmentation fault when the num_streams provided to the engine_context_t is greater than the number of physical CPU cores.
The engine no longer crashes with assertion failures when running YOLOv4.
YOLACT pipelines fixed: dynamic batch now works, and exported images no longer have improperly swapped color channels.
DeepSparse Server no longer crashes for hyphenated task names such as "question-answering."
Computer vision pipelines now additionally accept single NumPy array inputs.
Protobuf version pinned for ONNX 1.12 compatibility, preventing installation failures on some systems.

Known Issues:

None

v1.1.0

1 year ago

New Features:

Python 3.10 support added.
Zero-shot text classification pipeline implemented.
Haystack Information Retrieval pipeline implemented.
Native YOLACT pipeline integration is now available for deployments.
DeepSparse pipelines now support dynamic batch, dynamic shape through bucketing, and asynchronous execution.
CustomTaskPipeline added to enable easier custom pipeline creation.
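Dynamic shape through bucketing generally means preparing a few fixed input shapes and padding each request up to the smallest one that fits. The minimal sketch below applies that idea to token sequences. The bucket sizes, pad value, and function names are hypothetical and do not describe DeepSparse's internal implementation.

```python
import bisect
from typing import List, Sequence, Tuple


def choose_bucket(length: int, buckets: Sequence[int]) -> int:
    """Return the smallest bucket >= length; buckets must be sorted ascending."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"sequence length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens: List[int], buckets: Sequence[int],
                  pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Pad tokens to the chosen bucket and return (padded_tokens, attention_mask)."""
    size = choose_bucket(len(tokens), buckets)
    padding = size - len(tokens)
    return tokens + [pad_id] * padding, [1] * len(tokens) + [0] * padding


print(pad_to_bucket([5, 6, 7], buckets=[4, 8]))  # ([5, 6, 7, 0], [1, 1, 1, 0])
```

Using a handful of buckets bounds the number of shapes that have to be prepared while keeping the wasted padding per request small.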

Changes:

The behavior of the Multi-stream scheduler is now identical to the Elastic scheduler, and the old Multi-stream scheduler has been removed.
NLP pipelines for question answering, text classification, and token classification upgraded to improve accuracy and better match the SparseML training pathways.
Updates made across the repository for new SparseZoo Python APIs.
Max torchvision version increased to 0.12.0 for computer vision deployment pathways.

Performance:

Inference performance improvements for
- unstructured sparse quantized Transformer models.
- slow activation functions (such as Gelu or Swish) when they follow a QuantizeLinear operator.
- some sparse 1D convolutions. Speedups of up to 3x are observed.
- Squeeze, when operating on a single axis.

Resolved Issues:

An assertion error no longer occurs when a node has multiple inputs that both come from the same node.
An assertion error no longer occurs when a MatMul operator follows a Transpose or Reshape operator.
Pipelines now support hyphenated versions of standard task names such as question-answering.

Known Issues:

In the C++ interface, the engine will crash with a segmentation fault when the num_streams provided to the engine_context_t is greater than the number of physical CPU cores.

v1.0.2

1 year ago

This is a patch release for 1.0.0 that contains the following changes:

Question answering pipeline pre-processing now exactly matches the SparseML training pre-processing. Previously, differences between the two implementations led to minor drops in accuracy.

v1.0.1

1 year ago

This is a patch release for 1.0.0 that contains the following changes:

Crashes with an assertion failure no longer happen in the following cases:

during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides.
when setting the num_streams parameter to fewer than the number of NUMA nodes.

The engine no longer enters an infinite loop when an operation has multiple inputs coming from the same source.

Error messaging improved for installation failures on unsupported operating systems.

Supported transformers datasets version capped for compatibility with pipelines.

v1.0.0

1 year ago

New Features:

Support added for running multiple models with the same engine when using the Elastic Scheduler.
When using the Elastic Scheduler, the caller can now use the num_streams argument to tune the number of requests that are processed in parallel.
Pipeline and annotation support added and generalized for transformers, yolov5, and torchvision.
Documentation added for transformers, yolov5, torchvision, and serving, focusing on model deployment for each integration.
AWS SageMaker example created.
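With the Elastic Scheduler, num_streams bounds how many requests are processed in parallel. The stdlib sketch below models that concept with a thread pool whose worker count plays the role of num_streams. It is an analogy in plain Python, not the DeepSparse API, and `fake_engine` is a hypothetical stand-in for one inference call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def run_requests(requests: List[int], num_streams: int) -> Tuple[List[int], int]:
    """Process requests with at most num_streams in flight; return (results, peak concurrency)."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def fake_engine(x: int) -> int:  # hypothetical stand-in for one inference call
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # simulate work
        with lock:
            active -= 1
        return x * 2

    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        results = list(pool.map(fake_engine, requests))
    return results, peak


results, peak = run_requests(list(range(8)), num_streams=2)
print(results, peak)  # peak never exceeds num_streams
```

Raising num_streams allows more requests in flight at once, trading per-request latency for overall throughput.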

Changes:

Click added as a root dependency and is now the preferred route for CLI invocation and argument management.

Performance:

Inference performance has been improved for unstructured sparse quantized models on AVX2 and AVX-512 systems that do not support VNNI instructions, with speedups of up to 20% on BERT and 45% on ResNet-50.

Resolved Issues:

Potential crashes no longer occur when a layer operates on a dataset larger than 2GB.
Assertion error addressed for Reduce operations where the reduction axis is of length 1.
Rare assertion failure addressed related to Tensor Columns.
When running the DeepSparse Engine on a system with a non-uniform system topology, model compilation now properly terminates.

Known Issues:

In rare cases, the engine may crash with an assertion failure during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides; hotfix forthcoming.
The engine will crash with an assertion failure when setting the num_streams parameter to fewer than the number of NUMA nodes; hotfix forthcoming.
In rare cases, the engine may enter an infinite loop when an operation has multiple inputs coming from the same source; hotfix forthcoming.

v0.12.2

2 years ago

This is a patch release for 0.12.0 that contains the following changes:

Protobuf is restricted to version < 4.0 as the newer version breaks ONNX.