Provides efficient implementations of the collectives used in deep learning training: allgatherv, allreduce, alltoall(v), broadcast, reduce, and reduce_scatter
Provides a C++ API and interoperability with DPC++
Deep learning optimizations include:
Asynchronous progress for compute-communication overlap
Dedicated cores to ensure optimal network use
Message prioritization, persistence, and out-of-order execution
Collectives over a range of data types (int[8,16,32,64], fp[32,64]), including the low-precision bf16 type