oneDNN Versions

oneAPI Deep Neural Network Library (oneDNN)

v3.4.1

4 weeks ago

This is a patch release containing the following changes to v3.4:

  • Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a1e5688022a59444059e53a6a7967f679a)
  • Introduced memory descriptor serialization API (4cad420e673f4cd49568ea7c4dd6a55e6f55794e, 929a27ae0412a0851629da70916eee360a39baac, 9b848c859a6b1d046dd63cf20f817aa9428fb483)
  • Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b566bb1cd273e9bda99cc62063b7c2a7e45, 0b399ac42740a9c6ed458aacafdb31ce16205cbd, d748d642d7871608e09f5cee5d964ddcfc8a42ef, 9f4f3d510ddc9d639db052302be579621d46bb1f, 21a8caebb34a85074f3f8a5cef35ed85532a5bbe)
  • Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e6d835f8632ea571f3ea0e273b22488d37, 4b7236134bde1c1a71859a844eae860a71670b97, 74a343bf66a1c8f113fa8e025391aba5015c6e48)
  • Reduced creation time for deconvolution primitive on Intel CPUs (bec487e4ae16b3e88382adf9574e9c62cc76d1bd, 1eab00586881f4fb6966a16f71216528ec549c11)
  • Fixed performance regression in deconvolution on Intel CPUs (fbe5b97c966696a3f5be2240c0eb4592ed548036, 1dd3c6af03addefcf92ac45eddeb8becf63d6a6e)
  • Removed dangling symbols from static builds (e92c4041b12e55837452327c3ebd9411dbc2e861, 6f5621aed75226b93f07879fafa6fb799a36f042)
  • Fixed crash during platform detection on some AArch64-based systems (406a0798c1c5b939726a892ad5a96e20298396ca)
  • Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e152f21a79978b8910260e042b43941b601c)
  • Fixed handling of zero points for matmul in verbose logs converter (15c791686f94291eddda7a2e24835ba1113c530a)

v3.3.6

1 month ago

This is a patch release containing the following changes to v3.3.5:

  • Fixed crash during platform detection on some AArch64-based systems (3e0e69b21ba0694db95bd2af0877f936dcc86dd2)
  • Improved inner product performance with Arm Compute Library (ACL) (e7abee2d883d41613cf243c135037fc68d2dacd0, 214fb9e14227880097729ffffac3b666a0febcd7, 8aacc8ff0dfefddfae30681d056757dba1fb0815)
  • Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e04df62cf3042ebdc578a72883bde35079a)
  • Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad7234741459bab6afc21f571ddb645bcae)

v3.4

1 month ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14 (see the post-ops sketch after this list).
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.
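
The batch matmul optimization with a binary post-op noted in the Intel Architecture Processors list above is expressed through the post-ops attribute API. A minimal sketch of such a fusion follows; the shapes and the choice of a batch-broadcast second input are illustrative and not taken from the release notes:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Batched matmul: src [B, M, K] x weights [B, K, N] -> dst [B, M, N].
    memory::desc src_md({8, 128, 64}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc wei_md({8, 64, 256}, memory::data_type::f32, memory::format_tag::abc);
    memory::desc dst_md({8, 128, 256}, memory::data_type::f32, memory::format_tag::abc);

    // Binary add post-op whose second input is broadcast over the batch
    // dimension (shape [1, M, N]).
    memory::desc bin_md({1, 128, 256}, memory::data_type::f32, memory::format_tag::abc);
    post_ops po;
    po.append_binary(algorithm::binary_add, bin_md);

    primitive_attr attr;
    attr.set_post_ops(po);

    // The fused primitive; at execution time the extra input is passed as
    // DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1.
    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    auto prim = matmul(pd);
    (void)prim;
    return 0;
}
```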

Functionality

  • Introduced GPT-Q support to improve Large Language Model (LLM) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio); see the fp8 sketch after this list.
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products
    • Introduced support for Intel Data Center GPU Max 1550VG.
    • Introduced PReLU post-op support for inner product and matmul primitives.
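
The fp8 support listed above adds two new data types to the memory API. A minimal sketch of describing fp8 tensors for a matmul is shown below; only Intel Data Center GPU Max Series has an optimized implementation in this release, so primitive creation is wrapped in a try/catch because other engines may report the configuration as unimplemented:

```cpp
#include <iostream>
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    // fp8 comes in two flavors: f8_e4m3 (more mantissa bits) and
    // f8_e5m2 (more exponent bits).
    memory::desc src_md({128, 64}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc wei_md({64, 256}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    // The destination typically stays in a wider type such as f32.
    memory::desc dst_md({128, 256}, memory::data_type::f32, memory::format_tag::ab);

    engine eng(engine::kind::cpu, 0);
    try {
        auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md);
        std::cout << "fp8 matmul is available on this engine\n";
    } catch (const dnnl::error &) {
        std::cout << "fp8 matmul is not implemented for this engine\n";
    }
    return 0;
}
```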

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment (see the attribute sketch after this list).
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping patterns more selectively.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.
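
Both the deterministic mode and the accumulation mode controls are exposed through primitive attributes. A minimal sketch follows, assuming the set_deterministic and set_accumulation_mode attribute setters introduced alongside these features; shapes are illustrative:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    memory::desc src_md({128, 64}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc wei_md({64, 256}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({128, 256}, memory::data_type::f32, memory::format_tag::ab);

    primitive_attr attr;
    // Opt into deterministic mode: results are bitwise identical between runs
    // in a fixed environment (assumed setter, see the note above).
    attr.set_deterministic(true);
    // Accumulation mode control: relaxed allows implementations to trade
    // accumulation precision for speed; strict keeps the default behavior.
    attr.set_accumulation_mode(accumulation_mode::relaxed);

    auto pd = matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
    (void)pd;
    return 0;
}
```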

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Known Limitations

  • Intel Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
  • int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
  • fp32 pooling primitive may produce incorrect results in rare conditions on Intel Data Center GPU Max Series with current GPU driver.
  • reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
  • fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel Core processors (code-named Arrow Lake).
  • int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.

Breaking Changes

  • Updated the minimum supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

v3.3.5

2 months ago

This is a patch release containing the following changes to v3.3.4:

  • Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec145f8c64818fd5c3ed2cb9e2ae69daef887)
  • Added warning for ACL versions newer than maximum supported (7473012743ae3227dbfa208cad260d29d86d5080)
  • Added citation file (fea9f88fa7f8056a5addedfdebdb2dda35ee7a9d)
  • Fixed SEGFAULT in int8 convolution on processors with Intel AMX support (2a8e122b63b55f897c470d23f21003bb70f0e839)

v3.4-rc

2 months ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Model (LLM) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products
    • Introduced PReLU post-op support for inner product and matmul primitives.

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping patterns more selectively.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

  • Updated the minimum supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

v3.3.4

3 months ago

This is a patch release containing the following changes to v3.3.3:

  • Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c5aeb6be1ce992d799943fdc4f3123905f)
  • Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38cdf1201caf8ffd2906077defdfe7f4aaa3, fa4364057891fdec528d9442c88d0715306bff2d)
  • Fixed SEGFAULT in 3D convolutions with different h and w parameters on Intel CPUs (b5f916ec068f783dbba2cd4f04a673e996f9efba)
  • Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d5388d7d749a120cf8522efd6f5aeecc09)
  • Reduced benchdnn memory consumption on Intel GPUs (84a8f57d45f215cf89d0f80a57a66b78eaf9b440)

v3.3.3

4 months ago

This is a patch release containing the following changes to v3.3.2:

  • Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661ff735e5448ef3a80e4e2df7a1556f8a84f)
  • Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd116e245e4a167a64bd39a24e957d2b939de)
  • Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7efadeaf42d75f75e64d095635458836cd7)
  • Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b23d57c38a439c50232783f654b96f575c)
  • Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9312a9b76a1880e1aaac513188793ecaa7)
  • Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84c4f09398858393035eafa2bd4a29ec0b0, 79bc6cc0477db1ce7e732f20d005ff2b9e88390e, c9c0b09c5e64114eada1b6beb7f6db36331e0fac)

v3.3.2

4 months ago

This is a patch release containing the following changes to v3.3.1:

  • Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980286c506908f98819e068a047a1d268842, ed9de2afd1fede32a317cbc5df953dfe997e78ea, 0c6bda10b3ea760205d4707a554b76045ef6f964)
  • Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f01ec5cf8b30ee0b474aa25417f0493897)
  • Updated compiler optimization flags for AArch64 processors to make build portable (8829c249b713dddc87c2669120a9798e202ac633)
  • Fixed segmentation fault during library initialization on AArch64 processors (3e15c6113ffeff3545775cbcca9bd84911856cb9)

v3.3.1

5 months ago

This is a patch release containing the following changes to v3.3:

  • Fixed int8 convolution accuracy issue on Intel GPUs (09c87c79bccbad8fa451b224a0f07f87095e3907)
  • Switched internal stream to in-order mode for NVIDIA and AMD GPUs to avoid synchronization issues (db01d62b3fc80897d88dc42f4dcdfcb0d90c131a)
  • Fixed runtime error for avgpool_bwd operation in Graph API (d025ef6620b131f3487bb748866ddd9d7225c09f, 9e0602ad37afa18d46f407cb52577f1afead238b, e0dc1b3d070313052f5fd6ac739778d45b57859c)
  • Fixed benchdnn error reporting for some Graph API cases (98dc9dbecb3f36234474c9d6e96ab6571497633b)
  • Fixed accuracy issue in experimental Graph Compiler for int8 MHA variant from StarCoder model (5476ef7c165d943fbce94ca0f44a13d6868e65f3)
  • Fixed incorrect results for layer normalization with trivial dimensions on Intel GPUs (a2ec0a0c5805314220db925e1323e4675e3ca379)
  • Removed redundant synchronization for out-of-order SYCL queues (a96e9b1a6769171e74b0b8e031489303438906e5)
  • Fixed runtime error in experimental Graph Compiler for int8 MLP subgraph from LLAMA model (595543dd093df3e92621c253d6da3f9092ec7ff8)
  • Fixed SEGFAULT in experimental Graph Compiler for fp32 MLP subgraph (42071057abb2fcbbca6ed67117bdb1a5ee3dc0cd)
  • Fixed incorrect results in experimental Graph Compiler for MLP subgraph (57e14b56d4e6fab2ab49dbd47fd579482d79535a)
  • Fixed the issue with f16 inner product primitive with s8 output returning unimplemented on Intel GPUs (bf12207b0312c0174f0c47ae0d3abd70edc31957, 800b5e9613bd0994af82706ef024ad2b453be2b6, ec7054a2c79ae33d3db4ff04ce11360c2c896d56)
  • Fixed incorrect results for int8 deconvolution with zero-points on processors with Intel AMX instructions support (55d2cecd698f865efac2e1dbf2f701b4b8095df1)

v3.3

6 months ago

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
    • Improved performance for future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
    • Improved s32 binary primitive performance.
    • Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
    • Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
    • Improved performance of convolution for depthwise cases with Graph API.
    • [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
  • Intel Graphics Products:
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced RNN primitive initialization time on Intel GPUs.
  • AArch64-based Processors:
    • Improved fp32 to bf16 reorder performance.
    • Improved max pooling performance with Arm Compute Library (ACL).
    • Improved dilated convolution performance for depthwise cases with ACL.

Functionality

  • Introduced group normalization primitive support. The functionality is currently available on CPUs.
  • Intel CPUs:
    • Introduced support for zero points in int8 convolution with groups and 3D spatial.

Usability

  • Extended verbose mode output:
    • Improved diagnostics on engine creation errors.
    • Added information on Graph API calls.
    • Added information on strides for non-dense memory objects.
    • Added values of runtime dimensions.
    • Added indication that primitive descriptor was created with any memory format tag.
  • Introduced examples for Graph API.
  • Graph API constant tensor cache is now disabled by default and requires opt-in with a dnnl::graph::set_constant_tensor_cache() call (see the sketch after this list).
  • Reduced oneDNN Graph API memory consumption in certain scenarios.
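
A minimal sketch of the opt-in call named above; the int flag signature is an assumption based on the C++ Graph API header:

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

int main() {
    // The Graph API constant tensor cache is now off by default; enable it
    // explicitly before compiling partitions that reuse constant weights.
    dnnl::graph::set_constant_tensor_cache(1);
    return 0;
}
```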

Validation

  • Extended benchdnn performance reporting with primitive creation time.
  • Introduced cold cache mode in benchdnn.

Known Limitations

  • Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
  • Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
  • Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel Data Center GPU Max Series.
  • Extensive use of the RNN primitive on Intel GPUs with the default primitive cache setting may lead to a device reboot. Workaround: consider reducing the primitive cache size to 100 (see the sketch after this list).
  • Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel AMX support.
  • Int8 softmax may crash on Windows in SYCL debug configuration.
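
For the RNN workaround above, the primitive cache capacity can be lowered at runtime through the standard cache control API; the value 100 follows the suggestion in the note:

```cpp
#include "dnnl.hpp"

int main() {
    // Workaround for the RNN-on-GPU limitation: cap the primitive cache at
    // 100 entries (the default is 1024). The same effect can be achieved
    // without code changes via the ONEDNN_PRIMITIVE_CACHE_CAPACITY
    // environment variable.
    dnnl::set_primitive_cache_capacity(100);
    return 0;
}
```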

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.