Intel Extension for Transformers Release Notes

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

v1.4.1

3 weeks ago

  • Highlights
  • Improvements
  • Examples
  • Bug Fixing

Highlights

  • Support Weight-only Quantization on MTL iGPU
  • Upgrade lm-eval to 0.4.2
  • Support Llama3

Improvements

  • Support TPP for Xeon tensor parallelism (5f0430f)
  • Refine model from_pretrained when use_neural_speed is enabled (39ecf38e); a usage sketch follows below
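
A minimal sketch of the refined from_pretrained path with Neural Speed; the model name, flags, and generation settings here are assumptions for illustration, not release-tested values:

```python
# Sketch only: model choice and flags are assumptions.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"  # Llama3 support is new in this release
tokenizer = AutoTokenizer.from_pretrained(model_name)
# use_neural_speed routes inference through the Neural Speed backend;
# load_in_4bit applies weight-only quantization on load.
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=32)[0]))
```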

Examples

  • Add vision front-end demo (1c6550)
  • Add example for table extraction and enable a multi-page table handling pipeline (db9e6fb)
  • Adapt the textual inversion distillation-for-quantization example to the latest transformers and diffusers packages (0ec83b1)
  • Update NeuralChat notebooks (83bb65a, 629b9d4)

Bug Fixing

  • Fix QBits actshuf buffer overflow under large batches (a6f3ab3)
  • Fix TPP support for single socket (a690072)
  • Fix retrieval dependency (281b0a3)
  • Fix loading issue of WOQ model with parameters (37f9db25)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • PyTorch 2.2.0+cpu
  • Intel® Extension for PyTorch 2.2.0+cpu

v1.4

1 month ago

  • Highlights
  • Productivity
  • Bug Fixing

Highlights

  • AutoRound is a SOTA weight-only quantization (WOQ) algorithm for low-bit LLM inference on typical LLMs. This release adds support for AutoRound quantization and for inference with INT4 models quantized by AutoRound, as sketched below.
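
A minimal sketch of AutoRound quantization through the extended Transformers API; AutoRoundConfig's defaults and the demo model are assumptions based on this release's WOQ config family:

```python
# Sketch only: AutoRoundConfig arguments and the demo model are assumptions.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    AutoRoundConfig,
)

# Quantize on load with the AutoRound WOQ algorithm (INT4 weights).
woq_config = AutoRoundConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # small model used purely for illustration
    quantization_config=woq_config,
)
```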

Productivity

  • Add BM25 algorithm into retrievers (a19467d0)
  • Add evaluation perplexity during training (2858ed1)
  • Enhance embedding to support JIT model (588c60)
  • Update the character-checking function to enable Chinese characters (0da63fe1)
  • Enlarge the context window for HPU graph recompile (dcaf17ac)
  • Support IPEX BF16 & FP32 optimization for embedding model (b51552)
  • Enable lm_eval during training (2de883)
  • Refine setup.py and requirements.txt (436847)
  • Improve WOQ model saving and loading (30d9d10, 1065d81c); see the sketch after this list
  • Add layerwise quantization for WOQ RTN & GPTQ (15a848f3)
  • Update SparseGPT example (3ae0cd0)
  • Change regular expression to support Unicode characters (fd2516b)
  • Check and convert contiguous tensors when saving a model (d21bb3e)
  • Support loading models from ModelScope using Neural Speed (20ae00)
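
A sketch of the improved WOQ save/load round trip mentioned above; RtnConfig, the demo model, and the local path are assumptions for illustration:

```python
# Sketch only: config values and paths are placeholders.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

# Quantize once with RTN (round-to-nearest) weight-only quantization...
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=RtnConfig(bits=4),
)
model.save_pretrained("./opt-125m-woq")  # ...persist the quantized weights...

# ...and reload them later without re-quantizing.
model = AutoModelForCausalLM.from_pretrained("./opt-125m-woq")
```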

Bug Fixing

  • Fix CLM tasks when transformers >= 4.38.1 (98bfcf8)
  • Fix distilgpt2 TF signature issue (a7c15a9f)
  • Add error response when user input plus requested max tokens exceeds the model context window (ae91bf8)
  • Fix audio plugin sample code issue and provide a way to set TTS/ASR model path (db7da09)
  • Fix modeling_auto trust_remote_code issue (3a0987)
  • Fix lm-eval Neural Speed model loading (cd6e488)
  • Fix weight-only config save issue (5c92fe31)
  • Fix index error in child-parent retriever (8797cfe)
  • Fix WOQ INT8 unpack weight (edede4)
  • Fix GPTQ desc_act and static_group (528d7de)
  • Fix request.client=None issue (494a571)
  • Fix WOQ Hugging Face model loading (01b1a44)
  • Fix SQ model restore loading (1e00f29)
  • Remove redundant parameters from WOQ saving config and fix GPTQ issue (ef0882f6)
  • Fix example error for Intel GPU WOQ (8fdde06)
  • Fix WOQ AutoRound last-layer quantization issue (d21bb3e)
  • Fix code-generation params (ab2fd05)

Validated Configurations

  • Python 3.8, 3.9, 3.10, 3.11
  • Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.13.0, 2.14.0
  • PyTorch 2.2.0+cpu 2.1.0+cpu
  • Intel® Extension for PyTorch 2.2.0+cpu, 2.1.0+cpu

Thanks to these contributors: dillonalaird, igeni, sramakintel, alexsin368 and huiyan2021.
Welcome to contribute to our project and report issues to us.

v1.3.2

2 months ago

Highlights

  • Support NeuralChat-TGI serving with Docker (8ebff39)
  • Support NeuralChat-vLLM serving with Docker (1988dd)
  • Support SQL generation in NeuralChat (098aca7)
  • Enable llava mmmu evaluation on Gaudi2 (c30353f)
  • Improve LLM INT4 inference on Intel GPUs

Improvements

  • Minimize dependencies for running a chatbot (a0c9dfe)
  • Remove redundant knowledge id in audio plugin API (9a7353)
  • Update parameters for NeuralSpeed (19fec91)
  • Integrate backend code of Askdoc (c5d4cd)
  • Refine finetuning data preprocessing with static shape for Gaudi2 (3f62ceb)
  • Sync RESTful API with the latest OpenAI protocol (2e1c79); see the request sketch after this list
  • Support WOQ model save and load (1c8078f)
  • Extend API for GGUF (7733d4)
  • Enable OpenAI-compatible audio API (d62ff9e)
  • Add pack_weight info acquisition interface (18d36ef)
  • Add customized system prompts (04b2f8)
  • Support WOQ asym scheme (c7f0b70)
  • Update code_lm_eval to bigcode_eval (44f914e)
  • Enable retrieval of PDF figures as text (d6a66b3)
  • Enable retrieve-then-rerank pipeline (15feadf)
  • Enable grammar check and query polishing to enhance RAG performance (a63ec0)
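
A sketch of calling a NeuralChat server through the OpenAI-compatible protocol; the host, port and model name are assumptions for illustration:

```python
# Sketch only: endpoint address and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # standard OpenAI-style route
    json={
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```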

Examples

  • Add Rank-One Model Editing (ROME) implementation and example (8dcf0ea7)
  • Support GPTQ and AWQ models in NeuralChat (5b08de)
  • Add Neural Speed example scripts (6a97d15, 3385c42)
  • Add langchain extension example and update notebook (d40e2f1)
  • Support deepseek-coder models in NeuralChat (e7f5b1d)
  • Add autoround examples (71f5e84)
  • BGE embedding model finetuning (67bef24)
  • Support DeciLM-7B and DeciLM-7B-instruct in NeuralChat (e6f87ab)
  • Support GGUF model in NeuralChat (a53a33c)

Bug Fixing

  • Add trust_remote_code arg for lm_eval in WOQ example (9022eb)
  • Fix CPU WOQ accuracy issue (e530f7)
  • Change the default value for XPU weight-only quantization (4a78ba)
  • Fix whisper forced_decoder_ids error (09ddad)
  • Fix off-by-one error in masking (525076d)
  • Fix backprop error for text only examples (9cff14a)
  • Use unk token instead of eos token (6387a0)
  • Fix errors in trainer save (ff501d0)
  • Fix Qdrant bug caused by langchain_core upgrade (eb763e6)
  • Set trainer.save_model state_dict format to safetensors (2eca8c)
  • Fix text-generation example accuracy scripts (a2cfb80)
  • Resolve WOQ quantization error when running neuralchat (6c0bd77)
  • Fix response issue of model.predict (3068496)
  • Fix pydub library import issues (c37dab)
  • Fix chat history issue (7bb3314)
  • Update gradio APP to sync with backend change (362b7af)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

v1.3.1

3 months ago

  • Highlights
  • Improvements
  • Examples
  • Bug Fixing
  • Validated Configurations

Highlights

  • Support experimental INT4 inference on Intel GPUs (ARC and PVC) with Intel Extension for PyTorch as the backend; a usage sketch follows this list
  • Enhance LangChain to support new vectorstore (e.g., Qdrant)
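
A sketch of the experimental INT4 path on an Intel GPU; the model and exact argument names follow the extended Transformers API and should be treated as assumptions for this release:

```python
# Sketch only: the model and exact flags are assumptions.
import intel_extension_for_pytorch as ipex  # noqa: F401 (registers the "xpu" device)
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="xpu",     # target an Intel ARC/PVC GPU
    load_in_4bit=True,    # experimental INT4 weight-only inference
)
```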

Improvements

  • Improve error code handling coverage (dd6dcb4)
  • Refine NeuralChat documentation (aabb2fc)
  • Improve text-generation API (a4aba8)
  • Refactor transformers-like API to adapt to the latest transformers version (4e6834a)
  • Integrate GGML INT4 into NeuralChat (29bbd8)
  • Enable Qdrant vectorstore (f6b9e32)
  • Support LLaMA-series models for LLaVA finetuning (d753cb)

Examples

  • Support GGUF Q4_0, Q5_0 and Q8_0 models from Hugging Face (1383c7); see the loading sketch after this list
  • Support GPTQ model inference on CPU (f4c58d0)
  • Support SOLAR-10.7B-Instruct-v1.0 model (77fb81)
  • Support Magicoder model and refine model loading (f29c1e)
  • Support Mixtral-8x7B model (9729b6)
  • Support Phi-2 model (04f5ef6c)
  • Evaluate perplexity with Neural Speed (b0b381)
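
A sketch of loading one of these GGUF models; the repo and file names are real Hugging Face artifacts, but treat the exact signature as an assumption:

```python
# Sketch only: model_file selects the GGUF weight file inside the Hub repo.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_0.gguf",
)
# The tokenizer still comes from the original (non-GGUF) model repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```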

Bug Fixing

  • Fix GPTQ model loading issue (226e08)
  • Fix TTS crash with messy retrieval input and enhance normalizer (4d8d9a)
  • Support compatible stats format (c0a89c5a)
  • Fix RAG example for retrieval plugin parameter change (c35d2b)
  • Fix Magicoder tokenizer issue and redundant end token in streaming output (2758d4)

Validated Configurations

  • Python 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

v1.3

4 months ago

  • Highlights
  • Features
  • Examples
  • Bug Fixing
  • Incompatible Changes

Highlights

  • LLM Workflow/Neural Chat
    • Achieved top-1 7B LLM on the Hugging Face Open LLM Leaderboard in Nov'23
    • Released DPO dataset to Hugging Face Space for fine-tuning
    • Published the blog and fine-tuning code on Gaudi2
    • Supported fine-tuning and inference on Gaudi2 and Xeon
    • Updated notebooks for chatbot development and deployment
    • Provided customizable RAG-based chatbot applications
    • Published INT4 chatbot on Hugging Face Space
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Supported INT4/NF4/FP4/FP8 LLM inference; see the sketch after this list
    • Improved StreamingLLM for efficient endless text generation
    • Demonstrated up to 40x better performance than llama.cpp on Intel Xeon Scalable Processors
    • Supported QLoRA fine-tuning on CPU
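
A minimal sketch of selecting one of these low-bit data types with the v1.3-era WOQ config; WeightOnlyQuantConfig and its fields are assumptions recorded from that API generation:

```python
# Sketch only: WeightOnlyQuantConfig fields reflect the v1.3-era API.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

config = WeightOnlyQuantConfig(weight_dtype="nf4", compute_dtype="int8")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustration only
    quantization_config=config,
)
```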

Features

  • LLM Workflow/Neural Chat
    • Support Gaudi model parallelism serving (7f0090)
    • Add PEFT model support in deepspeed sharded mode (370ca3)
    • Support returning error codes (ea173a)
    • Enhance NeuralChat security (ab43c7, 43e8b9, 6e0386)
    • Support assisted generation for NeuralChat (5ba797)
    • Add codegen RESTful API in NeuralChat (0c77b1)
    • Support multi-card streaming inference on Gaudi (9ad75c)
    • Support multi-CPU RESTful API serving (fec4bb4)
    • Support IPEX INT8 model (e13363)
    • Enable retrieval with URLs as inputs (9d90e1d)
    • Add NER plugin to NeuralChat (aa5d8a)
    • Integrate PhotoAI backend into NeuralChat (da138c, d7a1d8)
    • Support image-to-image plugin as a service (12ad4c)
    • Support optimized SadTalker video plugin in NeuralChat (7f24c79)
    • Add AskDoc retrieval API & example (89cf76)
    • Add side-by-side UI (dbbcc2b)

Examples

  • LLM Workflow/Neural Chat
    • Add Mistral, Code-Llama, NeuralChat-7B, Qwen (fcee612, 7baa96b, d9a864, 698e58)
    • Add StarCoder, CodeLlama, Falcon and Mistral finetuning examples (477018)
    • Add fine-tuning with Deepspeed example (554fb9)
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Add ChatGLM and Code-Llama example (130b59)
    • Add WOQ to code-generation example (65a645f)
    • Add ChatGLM2&3 support to the text-generation example (4525b)
    • Support Qwen in the text-generation example (8f41d4)
    • Add INT4 ONNX Whisper example (c7f8173c, e9fc4c2)
    • Support DPO on Habana Gaudi (98d3ce3)
    • Enable finetuning for Qwen-7B-Chat on CPU (6bc938)
    • Enable Whisper C++ API (74e92a)
    • Apply the STS task to BAAI/BGE models (0c4c5ed, c399e38)
    • Enable Qwen graph (381331c)
    • Add instruction_tuning Stable Diffusion examples (17f01c6)
    • Enable Mistral-7B (7d1495)
    • Support Falcon-180B (900ebf4)
    • Add Baichuan/Baichuan2 example (98e5f9)

Bug Fixing

  • LLM Workflow/Neural Chat
    • Enhance SafetyChecker to resolve missing stopword.txt (5ba797)
    • Enhance multilingual ASR (62d002)
    • Remove haystack dependency (16ff4fb)
    • Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
    • Remove OneDNN env setting for BF16 inference (59ab03)
    • Fix ChatGLM2 model loading issue (4f2169)
    • Fix init issue of langchain chroma (fdefe27)
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Fix bug for WOQ with AWQ (565ab4b)
    • Use validation dataset for evaluation (e764bb)
    • Fix gradient issue for QLoRA on seq2seq (ff0465)
    • Fix post-processing with top-k/top-p in Python API (7b4730)
    • Fix PC codegen streaming issue (0f0bf22)
    • Fix Jblas stack overflow on Windows (65af04)

Incompatible Changes

  • [Neural Chat] Optimize the structure of NeuralChat example directories (1447e6f)
  • [Transformers Extension for Low-bit Inference] Update baichuan/baichuan2 API (98e5f9)

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.13.0, 2.14.0
  • PyTorch 2.1.0+cpu 2.0.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu, 2.0.0+cpu

v1.2.2

5 months ago

Bug Fixing & Improvements

  • Replace test dataset with validation dataset when do_eval (e764bb5)
  • Fix save issue of DeepSpeed ZeRO-3 (cf5ff82)
  • Fix UT issues on Nvidia GPU (464962e)
  • Fix NER nightly UT bug (9e5a6b3)
  • Escape SQL strings for SDL (43e8b9a)
  • Fix added_tokens error (fd74a9a)

Validated Configurations

  • Python 3.9, 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

v1.2.1

6 months ago
  • Examples
  • Bug Fixing & Improvements

Examples

  • Add docker for code-generation (dd3829)
  • Enable Qwen-7B-Chat for NeuralChat (698e58)
  • Enable Baichuan & Baichuan2 CPP inference (98e5f9)
  • Add side-by-side UI for NeuralChat (dbbcc2)
  • Support Falcon-180B CPP inference (900ebf)
  • Add StarCoder finetuning example (073bdd)
  • Enable text-generation using Qwen (8f41d4)
  • Add docker for NeuralChat (a17d952)

Bug Fixing & Improvements

  • Fix bug for WOQ with AWQ caused by calib_iters not being set when calib_dataloader is not None (565ab4)
  • Fix init issue of langchain chroma (fdefe2)
  • Fix NeuralChat StarCoder MHA fusion issue (ce3d24)
  • Fix setuptools version limitation for build (2cae32)
  • Fix post-processing with top-k/top-p in Python API (7b4730)
  • Fix MSVC compile issues (87b00d)
  • Refine notebook and fix RESTful API issues (d8cc11)
  • Upgrade QBits backend (45e03b)
  • Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
  • Fix ChatGLM2 model loading issue (4f2169)
  • Remove OneDNN graph env setting for BF16 inference (59ab03)
  • Improve database security by escaping SQL strings (be6790)
  • Fix QBits backend getting wrong workspace malloc size (6dbd0b)

Validated Configurations

  • Python 3.9, 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

v1.2

7 months ago

  • Highlights
  • Features
  • Productivity
  • Bug Fixing
  • API Modification

Highlights

  • NeuralChat was showcased in the Intel Innovation'23 keynote and at Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM-as-a-service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on the latest Intel Xeon Scalable Processors and Gaudi2 in Intel Developer Cloud.
  • NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrails. These features let you create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
  • LLM Runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. We improved LLM Runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM Runtime delivers 25 ms/token with INT4 LLaMA and 22 ms/token with INT4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures; a usage sketch follows this list.
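
A sketch of the extended Transformers API described above, using the INT4 GPT-J setup behind the latency claim; the flag names and generation settings are assumptions for this release generation:

```python
# Sketch only: with the v1.2-era runtime, load_in_4bit enables weight-only INT4.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=30)[0]))
```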

Features

  • Neural Chat
    • Support ASR/TTS on CPU and HPU (fb619e5, 56685a)
    • Add docker for chatbot on Xeon SPR and Habana Gaudi (59fc92e, ad2ee1)
    • Refine chatbot workflow and use NeuralChat API (53bed4, e95fc32)
    • Implement Python SDK API, weight-only quantization and AMP for NeuralChat (08ba5d85)
  • Model Optimization
    • Add GPTQ/TEQ/WOQ quantization with plenty of examples (b4b2fcc, 1bcab14)
    • Enhance the ITREX quantization API as well as LLM Runtime; users can now obtain a quantized model using AutoModelForCausalLM.from_pretrained (be651b, f4dc78)
    • Support GPT-J pruning (802ec0d2)
  • LLM Runtime
    • Enable FFN fusion for LLMs (277108)
    • Enable tensor parallelism for 4-bit GPT-J on 2 sockets (fe0d65c)
    • Implement AMX INT8/BF16 MHA (c314d6c)
    • Support asymmetric models in LLM Runtime (93ca55)
    • Support NF4 and S4-fullrange weight compression in Jblas and QBits (ff7af86)
    • Enhance beam-search early-stopping mechanisms (cd4c33d)

Productivity

  • ITREX moved to fully public development; contributions are welcome (90ca31)
  • Support streaming mode for NeuralChat (f5892ec)
  • Support Direct Preference Optimization to improve accuracy (50b5b9)
  • Support query cache for chatbot (1b4463)
  • Support weight-only quantization for the PyTorch framework (3a064fa)
  • Provide mixed INT8 & BF16 inference mode for Stable Diffusion (bd2973)
  • Support Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/Windows (02cc59)
  • Update oneDNN to v3.3 (e6d8a4)
  • Support INT8 quantization in weight-only kernels (6ce8b13)
  • Enable flash-attention-like kernel in weight-only (0ef3942)
  • Add ISA-based dispatcher for weight-only kernels (ff7af86)
  • Support 4-bit per-channel quantization (4e164a8)

Bug Fixing

  • Fix issues reported by Cobalt, a third-party company hired by Intel for penetration testing (51a1b88)
  • Fix Windows compile issues (bffa1b0)
  • Fix ordinals and conjunctions in TTS normalizer (0892f8a)
  • Fix Habana finetuning issues (2bbcf51)
  • Fix bugs in RAG code for converting the prompt (bfad5c)
  • Fix normalizer: year, punctuation after number, end token (775a12)
  • Fix graph model quantization on AVX2-only platforms (3c84ec6)

API Modification

  • Update the 'device' parameter in the NeuralChat fine-tuning API, changing it from 'habana' to 'hpu' (96dabb0)
  • Change default values of do_lm_eval, lora_all_linear and use_fast_tokenizer in ModelArguments from False to True (52f9f74)

Validated Configurations

  • Python 3.8, 3.9, 3.10
  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.12.0, 2.11.0
  • PyTorch 2.0.0+cpu, 1.13.1+cpu
  • Intel® Extension for PyTorch 2.0.0+cpu, 1.13.100+cpu

v1.1.1

8 months ago
  • Highlights
  • Bug Fixing & Improvements
  • Tests & Tutorials

Highlights

In this release, we improved NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers. NeuralChat is now available for you to create your own chatbot within minutes on multiple architectures; see the sketch below.
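
A minimal sketch of creating a chatbot with the NeuralChat API; build_chatbot with its default config follows the project README, and the prompt is illustrative:

```python
# Sketch only: the default config selects a supported chat model automatically.
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```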

Bug Fixing & Improvements

  • Fix the code structure and the plugin in NeuralChat (commit 486e9e)
  • Fix bug in retrieval chat (commit d2cee0)
  • Return the correct input length to the user without padding in NeuralChat inference (commit 18be4c)
  • Fix MPT left-padding support issue (commit 24ae58)
  • Fix dataset columns being removed twice during concatenation (commit 67ce6e)
  • Fix DeepSpeed and use_cache issue (commit 4675d4)
  • Fix bugs in predict_stream (commit e1da7e)
  • Fix docker CPU issues (commit 8fa0dc)
  • Fix reading HuggingFaceH4/oasst1_en dataset issue (commit 76ee68)
  • Modify Dockerfile for finetuning (commit 797aa2)
  • Fix LLaMA2 performance via static_shape in optimum-habana (commit 481f38)
  • Remove NeuralChat redundant code and hard-coded values (commit 0e1e4d, 037ce8, 10af3c)
  • Refine NeuralChat finetuning config (commit e372cf)

Tests & Tutorials

  • Add inference test for LLaMA2 and MPT on HPU (commit 5c4f5e)
  • Add inference test for LLaMA2 and MPT on Intel CPUs (commit ad4bec, 2f6188)
  • Add finetuning test for MPT (commit 72d81e, 423242)
  • Add GHA unit tests (commit 49336d)
  • NeuralChat finetuning tutorial for LLaMA2 and MPT (commit d156e9)
  • NeuralChat deployment tutorial for Intel CPU, Habana HPU and Nvidia GPU (commit b36711)

Validated Configurations

  • Centos 8.4 & Ubuntu 22.04
  • Python 3.9
  • PyTorch 2.0.0
  • TensorFlow 2.12.0

Acknowledgements

Thanks for the contributions from sywangyi, jiafuzha and itayariel, and thanks to all the participants in Intel Extension for Transformers.

v1.1

10 months ago
  • Highlights
  • Features
  • Productivity
  • Examples
  • Bug Fixing
  • Documentation

Highlights

  • Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
  • Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
  • Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
  • Supported quantization for broad LLMs with the improved lm-evaluation-harness for multiple frameworks and data types

Features

  • Model Optimization
    • Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
    • Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
    • Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
    • Enable QAT for Stable Diffusion (commit 2e2efd)
    • Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
  • Transformers-accelerated Libraries
    • MHA kernels for static quantization, dynamic quantization and BF16 (commit 0d0932, e61e4b)
    • Support dynamic quantization matmul and post-op (commit 4cb9e4, cf0400, 9acfe1)
    • INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87)
    • Support dynamic quantization op (commit 6fcc15)
    • Add AVX2 kernels for Windows (commit bc313c)

Productivity

  • Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd) and Xeon/Habana inference (commit 8ea55b) for Chatbot
  • Enable docker for Chatbot (commit 6b9522, 37b455)
  • Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f); see the sketch after this list
  • Update Torch and TensorFlow (commit f54817)
  • Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
  • Add summarization evaluation for PyTorch (commit 062e62)
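
A sketch of the PEFT integration above using the Hugging Face peft library directly; the base model and LoRA hyperparameters are placeholders:

```python
# Sketch only: hyperparameters are illustrative, not release defaults.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```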

Examples

  • Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
  • Electra FP32 & BF16 inference (commit e09c96)
  • GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
  • Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
  • ONNX Whisper-large quantization (commit 038be0)
  • 8-layer MiniLM inference (commit 0dd104)
  • Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)

Bug Fixing

  • Fix Neural Engine error with gcc13 (commit 37a4a3) and GPU compilation error (commit 0f38eb)
  • Fix quantization for transformers 4.30 (commit 256c1d)
  • Fix error of missing metric when QAT on PyTorch model (commit c7e665)

Documentation

  • Refine doc of NeuralChat (commit 2580f3)
  • Update performance data of LLM and Stable Diffusion (commit 523fe5)

Validated Configurations

  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Python 3.8, 3.9, 3.10
  • Intel® Extension for TensorFlow 2.11.0, 2.12.0
  • PyTorch 1.13.1+cpu, 2.0.0+cpu
  • Intel® Extension for PyTorch 1.13.1+cpu, 2.0.0+cpu