Intel Extension For Transformers Versions Save

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

v1.0.1

11 months ago

Bug Fixing
Improvement

Bug Fixing

Fix BERT Large accuracy issue (commit ddc4a5)
Fix Dynamic Quantization UnitTest (commit d83040)

Improvement

Enable new fusion patterns for GPT-J (commit c73605 )
ChatBot Refine Data Load and Data Clean (commit f70205, commit 0997ac)

Validated Configurations

Centos 8.4 & Ubuntu 22.04 & Windows 10
Python 3.8, 3.9
TensorFlow 2.10.1
PyTorch 1.13.1+cpu
Intel® Extension for PyTorch 1.13.1+cpu

v1.0.0

1 year ago

Highlights
Features
Productivity
Examples
Bug Fixing
Documentation

Highlights

Provide the optimal model packages for large language model (LLM) such as GPT-J, GPT-NEOX, T5-large/base, Flan-T5, and Stable Diffusion
Provide the end-to-end optimized workflows such as SetFit-based sentiment analysis, Document Level Sentiment Analysis (DLSA), and Length Adaptive Transformer for inference
Support NeuralChat, a custom Chatbot based on domain knowledge fine-tuning and demonstrate less than one hour fine-tuning with PEFT on 4 SPR nodes
Demonstrate the industry-leading sparse model inference solution in MLPerf v3.0 open submission with up to 1.6x over other submissions

Features

Model Optimization
- LLM quantization including GPT-J (6B), GPT-NEOX (2.7B), T5-large, T5-base, Flan-T5, BLOOM-176B
- Enable basic Neural Architecture Search (commit 6cae)
Transformers-accelerated Neural Engine
- Support runtime dynamic quantization (commit 46fa 41c4)
- Enable GPT-J FP32/BF16/INT8 text generation inference (commit ac2c)
- Enable Stable Diffusion BF16/FP32 text-to-image inference (commit 56cf)
- Support OpenNMT FP32 to ONNX with good accuracy (commit 34d8)
Transformers-accelerated Libraries
- CPU Backend: MHA fusion for LLM to improve performance (commit 7c3d)
- GPU Backend: Supports OpenCL infrastructure, and provides matmul implementation (commit 5a60)

Productivity

Support native PyTorch model as input of Neural Engine (commit bc38)
Refine the Benchmark API to provide apple-to-apple benchmark ability. (commit e135)
Simplify end-to-end example usage (commit 6b9c)
N in M/ N x M PyTorch Pruning API enhancement (commit da4d)
Deliver engine-only wheel with size reduce 60% (commit 02ac)

Examples

End-to-end solution for Length Adaptive with Neural Engine, achieves over 11x speed up compared with BERT Base on SPR (commit 95c6)
End-to-end Documentation Level Sentiment Analysis(DLSA) workflow (commit 154a)
N in M/ N x M BERT Large and BERT Base pruning in PyTorch (commit da4d)
Sparse pruning example for Longformer with 80% sparsity (commit 5c5a)
Distillation for quantization for BERT and Stable Diffusion (commit 8856 4457)
Smooth quantization with BLOOM (commit edc9)
Longformer quantization with question-answering task (commit 8805)
Provide SETFIT workflow notebook (commit 6b9c 2851)
Support Text Generation task (commit c593)

Bug Fixing

Enhance BERT QAT tuning duration (commit 6b9c)
Fix Length Adaptive Transformer regression (commit 5473)
Fix accelerated lib compile error when enabling Vtune (commit b5cd)

Documentation

Refine contents of all readme files
API Helper based on GitHub io page (commit e107 )
devcatalog for Mt. Whitney (commit acb6)

Validated Configurations

Centos 8.4 & Ubuntu 20.04 & Windows 10
Python 3.7, 3.8, 3.9, 3.10
Intel® Extension for TensorFlow 2.10.1, 2.11.0
PyTorch 1.12.0+cpu, 1.13.0+cpu
Intel® Extension for PyTorch 1.12.0+cpu,1.13.0+cpu

v1.0b

1 year ago

Highlights
Features
Productivity
Examples
Bug Fixing
Documentation

Highlights

Intel® Extension for Transformers provides more compression examples for popular applications like Stable Diffusion. For Stable Diffusion, we support INT8 quantization with PyTorch and BF16 fine-tune with Intel ® Extension for PyTorch.

Features

Pruning/Sparsity
- Support structured sparsity pattern N:M on PyTorch (https://github.com/intel/intel-extension-for-transformers/commit/25d5e4b59fdacb9b98dda8a4bad5baf9ecb08eda)
- Support structured sparsity pattern NxM on PyTorch (https://github.com/intel/intel-extension-for-transformers/commit/25d5e4b59fdacb9b98dda8a4bad5baf9ecb08eda)
Transformers-accelerated Neural Engine
- Support inference on Windows (https://github.com/intel/intel-extension-for-transformers/commit/fc580d59c917e1b1ec0d52be38a106a50078a5f9)
Transformers-accelerated Libraries
- Support INT8 Softmax operator (https://github.com/intel/intel-extension-for-transformers/commit/fece837d545603a8a92554c1bbd704d45142db56)

Productivity

Simplify the integration with Alibaba BladeDISC

Examples

Support INT8 quantization for large language model (T5-base example) with PyTorch
Support INT8 Vision Transformer examples (ViT-base and ViT-large) in Neural Engine
Support FP32 LAT example in Neural Engine
Support INT8 quantization of 5 top HuggingFace TensorFlow models

Bug Fixing

Fix Protobuf and Onnx version dependency issue
Fix memory leak in Neural Engine

Documentation

Create Notebook for Pruning/Compression Orchestration/IPEX Quantization
Refine the user guide and compression example

Validated Configurations

Centos 8.4 & Ubuntu 20.04 & Windows 10
Python 3.7, 3.8, 3.9
Intel® Extension for TensorFlow 2.9.1, 2.10.0
PyTorch 1.11.0+cpu,1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0+cpu ,1.13.0+cpu

v1.0a

1 year ago

Highlights
Features
Productivity
Examples

Highlights

Intel® Extension for Transformers provides a rich set of model compression techniques and a leading sparsity-aware libraries and neural engine to accelerate the inference of Transformer-based models on Intel platforms. We published 2 papers on NeurIPS’2022 with the source code released:
- Fast DistilBERT on CPUs: outperform existing state-of-the-art Neural Magic's DeepSparse runtime performance by up to 50%, and deliver 7x better performance on c6i.12xlarge (Ice Lake) than c6a.12xlarge (AMD Milan)
- QuaLA-MiniLM: outperform BERT-base with ~3x reduced size and demonstrate up to 8.8x speedup with <1% accuracy loss on SQuAD1.1 task

Features

Pruning/Sparsity
- Support Distributed Pruning on PyTorch
- Support Distributed Pruning on TensorFlow
Quantization
- Support Distributed Quantization on PyTorch
- Support Distributed Quantization on TensorFlow
Distillation
- Support Distributed Distillation on PyTorch
- Support Distributed Distillation on TensorFlow
Compression Orchestration
- Support Distributed Orchestration on PyTorch
Neural Architecture Search (NAS)
- Support auto distillation with NAS and flash distillation on PyTorch
Length Adaptive Transformer (LAT)
- Support Dynamic Transformer on SQuAD1.1 on PyTorch
Transformers-accelerated Neural Engine
- Support inference with sparse GEMM fusion patterns
- Support automatic benchmarking of sparse and dense mixed model
Transformers-accelerated Libraries
- Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
- Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops

Productivity

Support seamless Transformers-extended APIs
Support experimental model conversion from PyTorch INT8 model to ONNX INT8
Support VTune performance tracing for sparse GEMM kernels

Examples

LAT examples for MiniLM (NeurIPS’2022)
Fast DistilBert on CPUs (NeurIPS’2022)
PyTorch distributed compression orchestration examples
Post-training quantization for Transformers non-trainer API
PyTorch auto distillation (NAS based) examples
Multiple examples of Quantization/Pruning/Distillation on PyTorch and TensorFlow
Post-training static quantization via Intel® Extension for PyTorch examples and deployment examples

Validated Configurations

Centos 8.4 & Ubuntu 20.04
Python 3.7, 3.8, 3.9, 3.10
TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0