Library for specialized dense and sparse matrix operations, and deep learning primitives.
This release accumulated more than 1200 changes since the last release and is a major preparation for the future v2 of the library. Besides stability improvements, refinements of existing functionality, and bug fixes, several pieces of new functionality were introduced: packed/compact data-layout functions for solving linear equations, new flavors of SMM kernels along with relaxed limitations (TransB), and overall support for low precision based on the Bfloat16 floating-point format.
The Deep Learning (DL) domain is still under active research and development, including co-design. The API, however, is rather stable (DLv2 since v1.8), with an implementation that continues to receive major development. Towards LIBXSMM v2, the DL domain will undergo a major code reduction (implementation) while providing the same or more functionality (a first sign is the removal of the Winograd code in this release).
THANK YOU FOR YOUR CONTRIBUTION - we had again several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!
INTRODUCED
- TransB=T is now allowed (in addition to TransB=N).
- LIBXSMM_DUMP_BUILD=1 (…).
IMPROVEMENTS
CHANGES
FIXES
Note about platform support: an explicit compile error (error message) is generated on platforms other than Intel (or compatible) processors, since upstreamed code was reported to produce a "compilation failure". Aside from this artificial error, any platform is supported with generic code (tested with an ARM cross-compiler). Of course, any Open Source contribution adding JIT support is welcome.
Note about binary compatibility: LIBXSMM's API for Small Matrix Multiplications (SMMs) is stable, and all major known applications (e.g., CP2K, EDGE, NEK5K, and SeisSol) either rely on SMMs or are able (and willing) to benefit from an improved API in the other domains (e.g., DL). Until at least v2.0, binary compatibility is not maintained (the SONAME version follows the semantic version).
Development accumulated many changes since the last release (v1.9), as this version (v1.10) kept slipping because validation could not keep up and was restarted several times. On the positive side, this may allow calling it the "Supercomputing 2018 Edition", complemented by an updated list of references including the SC'18 paper "Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures". Among several external articles, the Parallel Universe Magazine published "LIBXSMM: An Open Source-Based Inspiration for Hardware and Software Development at Intel".
The intense development of LIBXSMM brought many improvements and detailed features across domains, as well as end-to-end support for Bfloat16 in LIBXSMM's Deep Learning (DL) domain. The latter can already be exercised with the GxM framework, which was added to the collection of sample codes. Testing and validation were updated for the latest compilers and upcoming Linux distributions. FreeBSD is now formally supported (previously it was only tested occasionally). RPM, Debian, and FreeBSD package updates will benefit from the smoothed default build targets and compiler flags.
LIBXSMM supports "one build for all" while exploiting the existing instruction set extensions (CPUID-based code dispatch). Developers may enjoy support for pkg-config (.pc files in the lib folder) for easier linkage when using the Classic ABI (e.g., PKG_CONFIG_PATH=/path/to/libxsmm/lib pkg-config libxsmm --libs).
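The pkg-config based linkage described above can be sketched as follows; this is a build fragment, not part of the release notes, and the install prefix /path/to/libxsmm and the source file my_app.c are placeholders:

```shell
# Point pkg-config at the .pc files shipped in LIBXSMM's lib folder
# (/path/to/libxsmm is a hypothetical install prefix).
export PKG_CONFIG_PATH=/path/to/libxsmm/lib

# Compile and link an application against the Classic ABI using the
# flags reported by pkg-config (my_app.c is a placeholder source file).
cc my_app.c $(pkg-config libxsmm --cflags --libs) -o my_app
```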
THANK YOU FOR YOUR CONTRIBUTION - we had several direct (and indirect) contributions, reports, and involvement from people who came across the project. We would like to thank you all for the effort and time you spent working on Open Source!
INTRODUCED
IMPROVEMENTS / CHANGES
- … (make install).
FIXES
This release enables JIT code generation of small matrix multiplications for SSE3 targets. Previously, only AVX and beyond had been supported by JIT code. SSE JIT code generation is only supported for the MM domain (matrix multiplication). The compatibility of the library has been further refined and fine-tuned. The application binary interface (ABI) narrowed from more than 500 functions down to roughly half due to adjusted symbol visibility. This revision prepares for a smooth transition to v2.0 and internalizes low-level details (descriptor handling, etc.), and two deprecated functions have been removed. More prominently, prefetch enumerators have been renamed, e.g., LIBXSMM_PREFETCH_AL2 became LIBXSMM_GEMM_PREFETCH_AL2.
INTRODUCED
IMPROVEMENTS / CHANGES
FIXES
Overview: while v1.9 is in the works, this release fixes two issues and pushes for improved (OSX with the Intel Compiler) and wider OS/compiler coverage (MinGW, BSD; see Compatibility). Among the minor or exotic issues resolved in this release, the stand-alone JIT-generated matrix transposes (out-of-place) are now limited to matrix shapes such that only reasonable amounts of code are generated. There was also a rare synchronization issue, reproduced with CP2K/smp in LIBXSMM v1.8.1 (and likely earlier), which has been resolved since the previous release (v1.8.2).
JIT code generation/dispatch performance: JIT-generating code (non-transposed GEMMs) is known to be blazingly fast, which this release (re-)confirms with the extended dispatch microbenchmark. Single-threaded (uncontended) code generation of matrix kernels with M,N,K := 4...64 (equally distributed random numbers) takes less than 25 µs on typical systems. Non-cached code dispatch takes less than 50x longer than calling a function that does nothing, whereas cached code dispatch takes less than 15x longer than an empty function; in other words, code dispatch is roughly three orders of magnitude faster than code generation (nanoseconds vs. microseconds).
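The generate-then-dispatch path being measured can be sketched with LIBXSMM's classic double-precision dispatch API; this is a minimal illustration, not the benchmark itself: the shapes and input values are made up, and the program assumes LIBXSMM is installed and linked (e.g., with -lxsmm plus a BLAS library or -lxsmmnoblas for the fallback):

```c
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  /* Illustrative kernel shape (column-major); NULL arguments select the
     defaults: tight leading dimensions, alpha=1, beta=1, no prefetch. */
  const libxsmm_blasint m = 16, n = 16, k = 16;
  double a[16*16], b[16*16], c[16*16];
  int i;
  for (i = 0; i < 16*16; ++i) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }

  /* The first dispatch for a given shape JIT-generates the kernel
     (microseconds); subsequent dispatches of the same shape hit the
     code registry/cache (nanoseconds). */
  const libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(m, n, k,
    NULL/*lda*/, NULL/*ldb*/, NULL/*ldc*/, NULL/*alpha*/, NULL/*beta*/,
    NULL/*flags*/, NULL/*prefetch*/);

  if (NULL != kernel) { /* NULL when no JIT target is available */
    kernel(a, b, c); /* C += A * B */
    printf("c[0]=%f\n", c[0]);
  }
  return 0;
}
```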
INTRODUCED
IMPROVEMENTS / CHANGES
FIXES
This last release of the 1.8.x line (before 1.9) accumulated a large number of changes to tweak interfaces and to generally improve usability. The documentation was vastly improved and extended, is more structured, and is also available on Read the Docs (with online full-text search). In preparation for a fully revised implementation of the DNN API (rewrite), the interface of the DNN domain (Tensor API) changed in an incompatible way (our policy should have delayed this to v1.9). However, the current main user of the DNN API has been updated (integration with TensorFlow). Also notable, v1.8.2 introduces JIT code generation with the Windows calling convention (support is limited to 4-argument kernels, i.e., no prefetch signature for the MM domain and no support for DNN/convolution kernels).
INTRODUCED
CHANGES
FIXES
This release brings some new features (matcopy/2d-copy and tcopy based on JIT-generated code) as well as a number of bug fixes (TGEMM), improvements (KNM), and refinements (LIBXSMM_GEMM_WRAP control, etc.). Given the completed copy/transpose support, this release prepares for complete stand-alone GEMM routines.
INTRODUCED
CHANGES
FIXES
This set of changes brings the Padding API to life and implements the necessary mechanisms to cover a wider range of cases. This may allow running a larger variety of TensorFlow workloads with LIBXSMM. The implementation also brings Winograd-based convolutions (chosen automatically when using LIBXSMM_DNN_CONV_ALGO_AUTO). Moreover, support for the Intel Xeon Phi processor code-named "Knights Mill" ("KNM") has been added (QFMA and VNNI instructions can be executed using the Intel SDE).
INTRODUCED
CHANGES
FIXES
This release finishes the memory allocation interface and documents the two memory allocation domains (default and scratch). Otherwise this release focuses on code quality (sample code) with no fixes or breaking changes when compared to version 1.7.
INTRODUCED
CHANGES
FIXES
This version releases a revised DNN API to better suit an upcoming TensorFlow integration. There is also some foundation laid to distinguish scratch memory from regular/default memory buffers.
INTRODUCED
CHANGES
FIXES
This is a bug-fix release with focus on the SPMDM domain. There are also a number of code-quality improvements. This is potentially the last 1.6.x release, with a number of API changes scheduled for the DNN domain (v1.7).
INTRODUCED
CHANGES
FIXES