Intel Extension for Transformers Release Notes

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

v1.4.1

3 weeks ago

  • Highlights
  • Improvements
  • Examples
  • Bug Fixing

Highlights

  • Support Weight-only Quantization on MTL iGPU
  • Upgrade lm-eval to 0.4.2
  • Support Llama3

Improvements

  • Support TPP for Xeon tensor parallelism (5f0430f)
  • Refine model from_pretrained when use_neural_speed is enabled (39ecf38e); a usage sketch follows below
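
A minimal sketch of the refined from_pretrained path with Neural Speed; the model name, flags, and generation settings here are assumptions for illustration, not release-tested values:

```python
# Sketch only: model choice and flags are assumptions.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B"  # Llama3 support is new in this release
tokenizer = AutoTokenizer.from_pretrained(model_name)
# use_neural_speed routes inference through the Neural Speed backend;
# load_in_4bit applies weight-only quantization on load.
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, use_neural_speed=True
)

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=32)[0]))
```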

Examples

  • Add vision front-end demo (1c6550)
  • Add example for table extraction and enable a multi-page table handling pipeline (db9e6fb)
  • Adapt the textual inversion distillation-for-quantization example to the latest transformers and diffusers packages (0ec83b1)
  • Update NeuralChat notebooks (83bb65a, 629b9d4)

Bug Fixing

  • Fix QBits actshuf buffer overflow under large batches (a6f3ab3)
  • Fix TPP support for single socket (a690072)
  • Fix retrieval dependency (281b0a3)
  • Fix loading issue of WOQ model with parameters (37f9db25)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • PyTorch 2.2.0+cpu
  • Intel® Extension for PyTorch 2.2.0+cpu

v1.4

1 month ago

  • Highlights
  • Productivity
  • Bug Fixing

Highlights

  • AutoRound is a SOTA weight-only quantization (WOQ) algorithm for low-bit LLM inference on typical LLMs. This release adds support for AutoRound quantization and for inference with INT4 models quantized by AutoRound, as sketched below.
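
A minimal sketch of AutoRound quantization through the extended Transformers API; AutoRoundConfig's defaults and the demo model are assumptions based on this release's WOQ config family:

```python
# Sketch only: AutoRoundConfig arguments and the demo model are assumptions.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    AutoRoundConfig,
)

# Quantize on load with the AutoRound WOQ algorithm (INT4 weights).
woq_config = AutoRoundConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # small model used purely for illustration
    quantization_config=woq_config,
)
```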

Productivity

  • Add BM25 algorithm into retrievers (a19467d0)
  • Add evaluation perplexity during training (2858ed1)
  • Enhance embedding to support JIT model (588c60)
  • Update the character-checking function to enable Chinese characters (0da63fe1)
  • Enlarge the context window for HPU graph recompile (dcaf17ac)
  • Support IPEX BF16 & FP32 optimization for embedding model (b51552)
  • Enable lm_eval during training (2de883)
  • Refine setup.py and requirements.txt (436847)
  • Improve WOQ model saving and loading (30d9d10, 1065d81c); see the sketch after this list
  • Add layerwise quantization for WOQ RTN & GPTQ (15a848f3)
  • Update SparseGPT example (3ae0cd0)
  • Change regular expression to support Unicode characters (fd2516b)
  • Check and convert contiguous tensors when saving a model (d21bb3e)
  • Support loading models from ModelScope using Neural Speed (20ae00)
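
A sketch of the improved WOQ save/load round trip mentioned above; RtnConfig, the demo model, and the local path are assumptions for illustration:

```python
# Sketch only: config values and paths are placeholders.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

# Quantize once with RTN (round-to-nearest) weight-only quantization...
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=RtnConfig(bits=4),
)
model.save_pretrained("./opt-125m-woq")  # ...persist the quantized weights...

# ...and reload them later without re-quantizing.
model = AutoModelForCausalLM.from_pretrained("./opt-125m-woq")
```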

Bug Fixing

  • Fix CLM tasks when transformers >= 4.38.1 (98bfcf8)
  • Fix distilgpt2 TF signature issue (a7c15a9f)
  • Add error response when user input plus requested max tokens exceeds the model context window (ae91bf8)
  • Fix audio plugin sample code issue and provide a way to set TTS/ASR model path (db7da09)
  • Fix modeling_auto trust_remote_code issue (3a0987)
  • Fix lm-eval Neural Speed model loading (cd6e488)
  • Fix weight-only config save issue (5c92fe31)
  • Fix index error in child-parent retriever (8797cfe)
  • Fix WOQ INT8 unpack weight (edede4)
  • Fix GPTQ desc_act and static_group (528d7de)
  • Fix request.client=None issue (494a571)
  • Fix WOQ Hugging Face model loading (01b1a44)
  • Fix SQ model restore loading (1e00f29)
  • Remove redundant parameters from WOQ saving config and fix GPTQ issue (ef0882f6)
  • Fix example error for Intel GPU WOQ (8fdde06)
  • Fix WOQ AutoRound last-layer quantization issue (d21bb3e)
  • Fix code-generation params (ab2fd05)

Validated Configurations

  • Python 3.8, 3.9, 3.10, 3.11
  • Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.13.0, 2.14.0
  • PyTorch 2.2.0+cpu 2.1.0+cpu
  • Intel® Extension for PyTorch 2.2.0+cpu, 2.1.0+cpu

Thanks to these contributors: dillonalaird, igeni, sramakintel, alexsin368 and huiyan2021.
Welcome to contribute to our project and report issues to us.

v1.3.2

2 months ago

Highlights

  • Support NeuralChat-TGI serving with Docker (8ebff39)
  • Support NeuralChat-vLLM serving with Docker (1988dd)
  • Support SQL generation in NeuralChat (098aca7)
  • Enable llava mmmu evaluation on Gaudi2 (c30353f)
  • Improve LLM INT4 inference on Intel GPUs

Improvements

  • Minimize dependencies for running a chatbot (a0c9dfe)
  • Remove redundant knowledge id in audio plugin API (9a7353)
  • Update parameters for NeuralSpeed (19fec91)
  • Integrate backend code of Askdoc (c5d4cd)
  • Refine finetuning data preprocessing with static shape for Gaudi2 (3f62ceb)
  • Sync RESTful API with the latest OpenAI protocol (2e1c79); see the request sketch after this list
  • Support WOQ model save and load (1c8078f)
  • Extend API for GGUF (7733d4)
  • Enable OpenAI-compatible audio API (d62ff9e)
  • Add pack_weight info acquisition interface (18d36ef)
  • Add customized system prompts (04b2f8)
  • Support WOQ asym scheme (c7f0b70)
  • Update code_lm_eval to bigcode_eval (44f914e)
  • Enable retrieval of PDF figures as text (d6a66b3)
  • Enable retrieve-then-rerank pipeline (15feadf)
  • Enable grammar check and query polishing to enhance RAG performance (a63ec0)
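
A sketch of calling a NeuralChat server through the OpenAI-compatible protocol; the host, port and model name are assumptions for illustration:

```python
# Sketch only: endpoint address and model name are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # standard OpenAI-style route
    json={
        "model": "Intel/neural-chat-7b-v3-1",
        "messages": [{"role": "user", "content": "Summarize RAG in one sentence."}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```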

Examples

  • Add Rank-One Model Editing (ROME) implementation and example (8dcf0ea7)
  • Support GPTQ and AWQ models in NeuralChat (5b08de)
  • Add Neural Speed example scripts (6a97d15, 3385c42)
  • Add langchain extension example and update notebook (d40e2f1)
  • Support deepseek-coder models in NeuralChat (e7f5b1d)
  • Add autoround examples (71f5e84)
  • BGE embedding model finetuning (67bef24)
  • Support DeciLM-7B and DeciLM-7B-instruct in NeuralChat (e6f87ab)
  • Support GGUF model in NeuralChat (a53a33c)

Bug Fixing

  • Add trust_remote_code arg for lm_eval in WOQ example (9022eb)
  • Fix CPU WOQ accuracy issue (e530f7)
  • Change the default value for XPU weight-only quantization (4a78ba)
  • Fix whisper forced_decoder_ids error (09ddad)
  • Fix off-by-one error in masking (525076d)
  • Fix backprop error for text only examples (9cff14a)
  • Use unk token instead of eos token (6387a0)
  • Fix errors in trainer save (ff501d0)
  • Fix Qdrant bug caused by langchain_core upgrade (eb763e6)
  • Set trainer.save_model state_dict format to safetensors (2eca8c)
  • Fix text-generation example accuracy scripts (a2cfb80)
  • Resolve WOQ quantization error when running neuralchat (6c0bd77)
  • Fix response issue of model.predict (3068496)
  • Fix pydub library import issues (c37dab)
  • Fix chat history issue (7bb3314)
  • Update gradio APP to sync with backend change (362b7af)

Validated Configurations

  • Python 3.10
  • Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

v1.3.1

3 months ago

  • Highlights
  • Improvements
  • Examples
  • Bug Fixing
  • Validated Configurations

Highlights

  • Support experimental INT4 inference on Intel GPUs (ARC and PVC) with Intel Extension for PyTorch as the backend; a usage sketch follows this list
  • Enhance LangChain to support new vectorstore (e.g., Qdrant)
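
A sketch of the experimental INT4 path on an Intel GPU; the model and exact argument names follow the extended Transformers API and should be treated as assumptions for this release:

```python
# Sketch only: the model and exact flags are assumptions.
import intel_extension_for_pytorch as ipex  # noqa: F401 (registers the "xpu" device)
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="xpu",     # target an Intel ARC/PVC GPU
    load_in_4bit=True,    # experimental INT4 weight-only inference
)
```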

Improvements

  • Improve error code handling coverage (dd6dcb4)
  • Refine NeuralChat documentation (aabb2fc)
  • Improve text-generation API (a4aba8)
  • Refactor transformers-like API to adapt to the latest transformers version (4e6834a)
  • Integrate GGML INT4 into NeuralChat (29bbd8)
  • Enable Qdrant vectorstore (f6b9e32)
  • Support LLaMA-series models for LLaVA finetuning (d753cb)

Examples

  • Support GGUF Q4_0, Q5_0 and Q8_0 models from Hugging Face (1383c7); see the loading sketch after this list
  • Support GPTQ model inference on CPU (f4c58d0)
  • Support SOLAR-10.7B-Instruct-v1.0 model (77fb81)
  • Support Magicoder model and refine model loading (f29c1e)
  • Support Mixtral-8x7B model (9729b6)
  • Support Phi-2 model (04f5ef6c)
  • Evaluate perplexity with Neural Speed (b0b381)
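
A sketch of loading one of these GGUF models; the repo and file names are real Hugging Face artifacts, but treat the exact signature as an assumption:

```python
# Sketch only: model_file selects the GGUF weight file inside the Hub repo.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_0.gguf",
)
# The tokenizer still comes from the original (non-GGUF) model repo.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
```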

Bug Fixing

  • Fix GPTQ model loading issue (226e08)
  • Fix TTS crash with messy retrieval input and enhance normalizer (4d8d9a)
  • Support compatible stats format (c0a89c5a)
  • Fix RAG example for retrieval plugin parameter change (c35d2b)
  • Fix Magicoder tokenizer issue and redundant end token in streaming output (2758d4)

Validated Configurations

  • Python 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu

v1.3

4 months ago

  • Highlights
  • Features
  • Examples
  • Bug Fixing
  • Incompatible Changes

Highlights

  • LLM Workflow/Neural Chat
    • Achieved top-1 7B LLM on the Hugging Face Open LLM Leaderboard in Nov'23
    • Released DPO dataset to Hugging Face Space for fine-tuning
    • Published the blog and fine-tuning code on Gaudi2
    • Supported fine-tuning and inference on Gaudi2 and Xeon
    • Updated notebooks for chatbot development and deployment
    • Provided customizable RAG-based chatbot applications
    • Published INT4 chatbot on Hugging Face Space
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Supported INT4/NF4/FP4/FP8 LLM inference; see the sketch after this list
    • Improved StreamingLLM for efficient endless text generation
    • Demonstrated up to 40x better performance than llama.cpp on Intel Xeon Scalable Processors
    • Supported QLoRA fine-tuning on CPU
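
A minimal sketch of selecting one of these low-bit data types with the v1.3-era WOQ config; WeightOnlyQuantConfig and its fields are assumptions recorded from that API generation:

```python
# Sketch only: WeightOnlyQuantConfig fields reflect the v1.3-era API.
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

config = WeightOnlyQuantConfig(weight_dtype="nf4", compute_dtype="int8")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustration only
    quantization_config=config,
)
```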

Features

  • LLM Workflow/Neural Chat
    • Support Gaudi model parallelism serving (7f0090)
    • Add PEFT model support in deepspeed sharded mode (370ca3)
    • Support returning error codes (ea173a)
    • Enhance NeuralChat security (ab43c7, 43e8b9, 6e0386)
    • Support assisted generation for NeuralChat (5ba797)
    • Add codegen RESTful API in NeuralChat (0c77b1)
    • Support multi-card streaming inference on Gaudi (9ad75c)
    • Support multi-CPU RESTful API serving (fec4bb4)
    • Support IPEX INT8 model (e13363)
    • Enable retrieval with URLs as inputs (9d90e1d)
    • Add NER plugin to NeuralChat (aa5d8a)
    • Integrate PhotoAI backend into NeuralChat (da138c, d7a1d8)
    • Support image-to-image plugin as a service (12ad4c)
    • Support optimized SadTalker video plugin in NeuralChat (7f24c79)
    • Add AskDoc retrieval API & example (89cf76)
    • Add side-by-side UI (dbbcc2b)

Examples

  • LLM Workflow/Neural Chat
    • Add Mistral, Code-Llama, NeuralChat-7B, Qwen (fcee612, 7baa96b, d9a864, 698e58)
    • Add StarCoder, CodeLlama, Falcon and Mistral finetuning examples (477018)
    • Add fine-tuning with Deepspeed example (554fb9)
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Add ChatGLM and Code-Llama example (130b59)
    • Add WOQ to code-generation example (65a645f)
    • Add ChatGLM2&3 support to the text-generation example (4525b)
    • Support Qwen in the text-generation example (8f41d4)
    • Add INT4 ONNX Whisper example (c7f8173c, e9fc4c2)
    • Support DPO on Habana Gaudi (98d3ce3)
    • Enable finetuning for Qwen-7B-Chat on CPU (6bc938)
    • Enable Whisper C++ API (74e92a)
    • Apply the STS task to BAAI/BGE models (0c4c5ed, c399e38)
    • Enable Qwen graph (381331c)
    • Add instruction_tuning Stable Diffusion examples (17f01c6)
    • Enable Mistral-7B (7d1495)
    • Support Falcon-180B (900ebf4)
    • Add Baichuan/Baichuan2 example (98e5f9)

Bug Fixing

  • LLM Workflow/Neural Chat
    • Enhance SafetyChecker to resolve missing stopword.txt (5ba797)
    • Enhance multilingual ASR (62d002)
    • Remove haystack dependency (16ff4fb)
    • Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
    • Remove OneDNN env setting for BF16 inference (59ab03)
    • Fix ChatGLM2 model loading issue (4f2169)
    • Fix init issue of langchain chroma (fdefe27)
  • Transformer Extension for Low-bit Inference and Fine-tuning
    • Fix bug for WOQ with AWQ (565ab4b)
    • Use validation dataset for evaluation (e764bb)
    • Fix gradient issue for QLoRA on seq2seq (ff0465)
    • Fix post-processing with top-k/top-p in Python API (7b4730)
    • Fix PC codegen streaming issue (0f0bf22)
    • Fix Jblas stack overflow on Windows (65af04)

Incompatible Changes

  • [Neural Chat] Optimize the structure of NeuralChat example directories (1447e6f)
  • [Transformers Extension for Low-bit Inference] Update baichuan/baichuan2 API (98e5f9)

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.13.0, 2.14.0
  • PyTorch 2.1.0+cpu 2.0.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu, 2.0.0+cpu

v1.2.2

5 months ago

Bug Fixing & Improvements

  • Replace test dataset with validation dataset when do_eval (e764bb5)
  • Fix save issue of DeepSpeed ZeRO-3 (cf5ff82)
  • Fix UT issues on Nvidia GPU (464962e)
  • Fix NER nightly UT bug (9e5a6b3)
  • Escape SQL strings for SDL (43e8b9a)
  • Fix added_tokens error (fd74a9a)

Validated Configurations

  • Python 3.9, 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

v1.2.1

6 months ago
  • Examples
  • Bug Fixing & Improvements

Examples

  • Add docker for code-generation (dd3829)
  • Enable Qwen-7B-Chat for NeuralChat (698e58)
  • Enable Baichuan & Baichuan2 CPP inference (98e5f9)
  • Add side-by-side UI for NeuralChat (dbbcc2)
  • Support Falcon-180B CPP inference (900ebf)
  • Add StarCoder finetuning example (073bdd)
  • Enable text-generation using Qwen (8f41d4)
  • Add docker for NeuralChat (a17d952)

Bug Fixing & Improvements

  • Fix bug for WOQ with AWQ caused by calib_iters not being set when calib_dataloader is not None (565ab4)
  • Fix init issue of langchain chroma (fdefe2)
  • Fix NeuralChat StarCoder MHA fusion issue (ce3d24)
  • Fix setuptools version limitation for build (2cae32)
  • Fix post-processing with top-k/top-p in Python API (7b4730)
  • Fix MSVC compile issues (87b00d)
  • Refine notebook and fix RESTful API issues (d8cc11)
  • Upgrade QBits backend (45e03b)
  • Fix StarCoder issues for IPEX INT8 and weight-only INT4 (e88c7b)
  • Fix ChatGLM2 model loading issue (4f2169)
  • Remove OneDNN graph env setting for BF16 inference (59ab03)
  • Improve database security by escaping SQL strings (be6790)
  • Fix QBits backend getting wrong workspace malloc size (6dbd0b)

Validated Configurations

  • Python 3.9, 3.10
  • Centos 8.4 & Ubuntu 22.04
  • Intel® Extension for TensorFlow 2.13.0
  • PyTorch 2.1.0+cpu
  • Intel® Extension for PyTorch 2.1.0+cpu
  • Transformers 4.34.1

v1.2

7 months ago

  • Highlights
  • Features
  • Productivity
  • Bug Fixing
  • API Modification

Highlights

  • NeuralChat was showcased in the Intel Innovation'23 keynote and at Google Cloud Next '23 to demonstrate GenAI/LLM capabilities on Intel Xeon Scalable Processors. The chatbot solution has been integrated into LLM-as-a-service (LLMaaS), providing a smooth user experience for building GenAI/LLM applications on the latest Intel Xeon Scalable Processors and Gaudi2 in Intel Developer Cloud.
  • NeuralChat offers a comprehensive pipeline for building end-to-end chatbot applications with a rich set of pluggable features such as speech cloning & interaction (EN/CN), knowledge retrieval, query caching, and security guardrails. These features let you create a custom chatbot from scratch within minutes, significantly improving chatbot development productivity.
  • LLM Runtime extends the Transformers API to provide seamless weight-only low-precision inference for Hugging Face transformer-based models, including LLMs. We improved LLM Runtime with more comprehensive kernel support for low precisions (INT8/FP8/INT4/FP4/NF4) while keeping full compatibility with GGML. LLM Runtime delivers 25 ms/token with INT4 LLaMA and 22 ms/token with INT4 GPT-J on Intel Xeon Scalable Processors, providing a complementary and highly optimized LLM runtime solution for Intel architectures; a usage sketch follows this list.
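
A sketch of the extended Transformers API described above, using the INT4 GPT-J setup behind the latency claim; the flag names and generation settings are assumptions for this release generation:

```python
# Sketch only: with the v1.2-era runtime, load_in_4bit enables weight-only INT4.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

ids = tokenizer("The weather today is", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids, max_new_tokens=30)[0]))
```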

Features

  • Neural Chat
    • Support ASR/TTS on CPU and HPU (fb619e5, 56685a)
    • Add docker for chatbot on Xeon SPR and Habana Gaudi (59fc92e, ad2ee1)
    • Refine chatbot workflow and use NeuralChat API (53bed4, e95fc32)
    • Implement Python SDK API, weight-only quantization and AMP for NeuralChat (08ba5d85)
  • Model Optimization
    • Add GPTQ/TEQ/WOQ quantization with plenty of examples (b4b2fcc, 1bcab14)
    • Enhance the ITREX quantization API as well as LLM Runtime; users can now obtain a quantized model using AutoModelForCausalLM.from_pretrained (be651b, f4dc78)
    • Support GPT-J pruning (802ec0d2)
  • LLM Runtime
    • Enable FFN fusion for LLMs (277108)
    • Enable tensor parallelism for 4-bit GPT-J on 2 sockets (fe0d65c)
    • Implement AMX INT8/BF16 MHA (c314d6c)
    • Support asymmetric models in LLM Runtime (93ca55)
    • Support NF4 and S4-fullrange weight compression in Jblas and QBits (ff7af86)
    • Enhance beam-search early-stopping mechanisms (cd4c33d)

Productivity

  • ITREX moved to fully public development; contributions are welcome (90ca31)
  • Support streaming mode for NeuralChat (f5892ec)
  • Support Direct Preference Optimization to improve accuracy (50b5b9)
  • Support query cache for chatbot (1b4463)
  • Support weight-only quantization for the PyTorch framework (3a064fa)
  • Provide mixed INT8 & BF16 inference mode for Stable Diffusion (bd2973)
  • Support Stable Diffusion v1.4/v1.5/v2.1 and QAT inference on Linux/Windows (02cc59)
  • Update oneDNN to v3.3 (e6d8a4)
  • Support INT8 quantization in weight-only kernels (6ce8b13)
  • Enable flash-attention-like kernel in weight-only (0ef3942)
  • Add ISA-based dispatcher for weight-only kernels (ff7af86)
  • Support 4-bit per-channel quantization (4e164a8)

Bug Fixing

  • Fix issues reported by Cobalt, a third-party company hired by Intel for penetration testing (51a1b88)
  • Fix Windows compile issues (bffa1b0)
  • Fix ordinals and conjunctions in TTS normalizer (0892f8a)
  • Fix Habana finetuning issues (2bbcf51)
  • Fix bugs in RAG code for converting the prompt (bfad5c)
  • Fix normalizer: year, punctuation after number, end token (775a12)
  • Fix graph model quantization on AVX2-only platforms (3c84ec6)

API Modification

  • Update the 'device' parameter in the NeuralChat fine-tuning API, changing it from 'habana' to 'hpu' (96dabb0)
  • Change default values of do_lm_eval, lora_all_linear and use_fast_tokenizer in ModelArguments from False to True (52f9f74)

Validated Configurations

  • Python 3.8, 3.9, 3.10
  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Intel® Extension for TensorFlow 2.12.0, 2.11.0
  • PyTorch 2.0.0+cpu, 1.13.1+cpu
  • Intel® Extension for PyTorch 2.0.0+cpu, 1.13.100+cpu

v1.1.1

8 months ago
  • Highlights
  • Bug Fixing & Improvements
  • Tests & Tutorials

Highlights

In this release, we improved NeuralChat, a customizable chatbot framework under Intel® Extension for Transformers. NeuralChat is now available for you to create your own chatbot within minutes on multiple architectures; see the sketch below.
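
A minimal sketch of creating a chatbot with the NeuralChat API; build_chatbot with its default config follows the project README, and the prompt is illustrative:

```python
# Sketch only: the default config selects a supported chat model automatically.
from intel_extension_for_transformers.neural_chat import build_chatbot

chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
```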

Bug Fixing & Improvements

  • Fix the code structure and the plugin in NeuralChat (commit 486e9e)
  • Fix bug in retrieval chat (commit d2cee0)
  • Return the correct input length to the user without padding in NeuralChat inference (commit 18be4c)
  • Fix MPT left-padding support issue (commit 24ae58)
  • Fix dataset columns being removed twice during concatenation (commit 67ce6e)
  • Fix DeepSpeed and use_cache issue (commit 4675d4)
  • Fix bugs in predict_stream (commit e1da7e)
  • Fix docker CPU issues (commit 8fa0dc)
  • Fix reading HuggingFaceH4/oasst1_en dataset issue (commit 76ee68)
  • Modify Dockerfile for finetuning (commit 797aa2)
  • Fix LLaMA2 performance via static_shape in optimum-habana (commit 481f38)
  • Remove NeuralChat redundant code and hard-coded values (commit 0e1e4d, 037ce8, 10af3c)
  • Refine NeuralChat finetuning config (commit e372cf)

Tests & Tutorials

  • Add inference test for LLaMA2 and MPT on HPU (commit 5c4f5e)
  • Add inference test for LLaMA2 and MPT on Intel CPUs (commit ad4bec, 2f6188)
  • Add finetuning test for MPT (commit 72d81e, 423242)
  • Add GHA unit tests (commit 49336d)
  • NeuralChat finetuning tutorial for LLaMA2 and MPT (commit d156e9)
  • NeuralChat deployment tutorial for Intel CPU, Habana HPU and Nvidia GPU (commit b36711)

Validated Configurations

  • Centos 8.4 & Ubuntu 22.04
  • Python 3.9
  • PyTorch 2.0.0
  • TensorFlow 2.12.0

Acknowledgements

Thanks for the contributions from sywangyi, jiafuzha and itayariel, and thanks to all the participants in Intel Extension for Transformers.

v1.1

10 months ago
  • Highlights
  • Features
  • Productivity
  • Examples
  • Bug Fixing
  • Documentation

Highlights

  • Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
  • Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
  • Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
  • Supported quantization for broad LLMs with the improved lm-evaluation-harness for multiple frameworks and data types

Features

  • Model Optimization
    • Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
    • Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
    • Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
    • Enable QAT for Stable Diffusion (commit 2e2efd)
    • Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
  • Transformers-accelerated Libraries
    • MHA kernels for static quantization, dynamic quantization and BF16 (commit 0d0932, e61e4b)
    • Support dynamic quantization matmul and post-op (commit 4cb9e4, cf0400, 9acfe1)
    • INT4 weight-only kernels (commit 3b7665) and fusion (commit f00d87)
    • Support dynamic quantization op (commit 6fcc15)
    • Add AVX2 kernels for Windows (commit bc313c)

Productivity

  • Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd) and Xeon/Habana inference (commit 8ea55b) for Chatbot
  • Enable docker for Chatbot (commit 6b9522, 37b455)
  • Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f); see the sketch after this list
  • Update Torch and TensorFlow (commit f54817)
  • Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
  • Add summarization evaluation for PyTorch (commit 062e62)
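
A sketch of the PEFT integration above using the Hugging Face peft library directly; the base model and LoRA hyperparameters are placeholders:

```python
# Sketch only: hyperparameters are illustrative, not release defaults.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections in OPT
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```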

Examples

  • Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
  • Electra FP32 & BF16 inference (commit e09c96)
  • GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
  • Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
  • ONNX Whisper-large quantization (commit 038be0)
  • 8-layer MiniLM inference (commit 0dd104)
  • Add compression-aware training (commit dfb53f), sparsity-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)

Bug Fixing

  • Fix Neural Engine error with gcc13 (commit 37a4a3) and GPU compilation error (commit 0f38eb)
  • Fix quantization for transformers 4.30 (commit 256c1d)
  • Fix error of missing metric when QAT on PyTorch model (commit c7e665)

Documentation

  • Refine doc of NeuralChat (commit 2580f3)
  • Update performance data of LLM and Stable Diffusion (commit 523fe5)

Validated Configurations

  • Centos 8.4 & Ubuntu 20.04 & Windows 10
  • Python 3.8, 3.9, 3.10
  • Intel® Extension for TensorFlow 2.11.0, 2.12.0
  • PyTorch 1.13.1+cpu, 2.0.0+cpu
  • Intel® Extension for PyTorch 1.13.1+cpu, 2.0.0+cpu