Tensile Versions Save

Stretching GPU performance for GEMMs and tensor contractions.

rocm-6.1.1

1 week ago

Tensile code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

rocm-6.1.0

1 month ago

Additions

new DisableKernelPieces values to invalidate local read, local write, and global read
stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
feature to allow testing stream-k grid multipliers
debug output to check occupancy for Stream-K
reject condition for FractionalLoad + DepthU!=power of 2
new TENSILE_DB debugging value to dump the common kernel parameters
predicate for APU libs
new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
initialization type and general batched options to the rocblas-bench input creator script

Optimizations

enabled MFMA + LocalSplitU=4 for MT16x16
enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
auto LdsPad calculation for TileMajorLds + MI16x16
auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth

Changes

cleared hipErrorNotFound error since it is an expected part of the search
modified hipcc search path for Linux
changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
Replaced std::regex with fnmatch()/PathMatchSpec as a workaround to std::regex stack overflow known bug

Fixes

hipcc compile append flag parallel-jobs=4
race condition in Stream-K that appeared with large grids and small sizes
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
small fix for LdsPad optimization (LdsElement calculation)

rocm-6.0.2

3 months ago

Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

rocm-6.0.0

5 months ago

Added

Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
Added/updated tuning scripts
Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
Added asmcap check for MFMA + const src
Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
Added a new parameter to increase miLatencyLeft

Optimizations

Enabled InitAccVgprOpt for MatrixInstruction cases
Implemented local read related parameter calculations with DirectToVgpr
Adjusted miIssueLatency for gfx940
Enabled dedicated vgpr allocation for local read + pack
Optimized code initialization
Optimized sgpr allocation
Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Changed

Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
Removed unused CustomKernels and ReplacementKernels
Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
Removed unused code for DirectToLds
Updated test cases for DTV + TransposeLDS=False
Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
Changed how to calculate latencyForLR for miLatency
Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
Supported multi-gpu for different architectures in lazy library loading
Enabled dtree library for batch > 1
Added problem scale feature for dtree selection
Enabled ROCm SMI for gfx940/941.
Modified non-lazy load build to skip experimental logic

Fixed

Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
Fixed boundary check for mirror dims and re-enable disabled mirror dims test cases
Fixed merge error affecting i8 with wmma
Fixed mismatch issue with DTLds + TSGR + TailLoop
Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
Fixed override for unloaded solutions when lazy loading
Fixed build some errors (adding missing headers)
Fixed boost link for a clean build on ubuntu22
Fixed bug in forcestoresc1 arch selection
Fixed compiler directive for gfx941 and gfx942
Fixed formatting for DecisionTree_test.cpp

rocm-5.7.1

7 months ago

Tensile code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.

rocm-5.7.0

8 months ago

Added

Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
Added user-driven solution selection feature

Optimizations

Enabled LocalSplitU with MFMA for I8 data type
Optimized K mask code in mfmaIter
Enabled TailLoop code in NoLoadLoop to prefetch global/local read
Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
Optimized DirectToLds test cases to reduce the test duration

Changed

Removed DGEMM NT custom kernels and related test cases
Changed noTailLoop logic to apply noTailLoop only for NT
Changed the range of AssertFree0ElementMultiple and Free1
Unified aStr, bStr generation code in mfmaIter

Fixed

Fixed LocalSplitU mismatch issue for SGEMM
Fixed BufferStore=0 and Ldc != Ldd case
Fixed mismatch issue with TailLoop + MatrixInstB > 1

rocm-5.6.1

8 months ago

Tensile code for ROCm 5.6.1 did not change. The library was rebuilt for the updated ROCm 5.6.1 stack.

rocm-5.6.0

10 months ago

Added

Added user driven tuning API
Added decision tree fallback feature
Added SingleBuffer + AtomicAdd option for GlobalSplitU
DirectToVgpr support for fp16 and Int8 with TN orientation
Added new test cases for various functions
Added SingleBuffer algorithm for ZGEMM/CGEMM
Added joblib for parallel map calls
Added support for MFMA + LocalSplitU + DirectToVgprA+B
Added asmcap check for MIArchVgpr
Added support for MFMA + LocalSplitU
Added frequency, power, and temperature data to the output

Optimizations

Improved the performance of GlobalSplitU with SingleBuffer algorithm
Reduced the running time of the extended and pre_checkin tests
Optimized the Tailloop section of the assembly kernel
Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
Improved the performance of the second kernel of MultipleBuffer algorithm

Changed

Updated custom kernels with 64-bit offsets
Adapted 64-bit offset arguments for assembly kernels
Improved temporary register re-use to reduce max sgpr usage
Removed some restrictions on VectorWidth and DirectToVgpr
Updated the dependency requirements for Tensile
Changed the range of AssertSummationElementMultiple
Modified the error messages for more clarity
Changed DivideAndReminder to vectorStaticRemainder in case quotient is not used
Removed dummy vgpr for vectorStaticRemainder
Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndReminder
Removed qReg parameter from vectorStaticRemainder

Fixed

Fixed tmp sgpr allocation to avoid over-writing values (alpha)
64-bit offset parameters for post kernels
Fixed gfx908 CI test failures
Fixed offset calculation to prevent overflow for large offsets
Fixed issues when BufferLoad and BufferStore are equal to zero
Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
Fixed the memory access error related to StaggerU + large stride
Fixed ZGEMM 4x4 MatrixInst mismatch
Fixed DGEMM 4x4 MatrixInst mismatch
Fixed ASEM + GSU + NoTailLoop opt mismatch
Fixed AssertSummationElementMultiple + GlobalSplitU issues
Fixed ASEM + GSU + TailLoop inner unroll

rocm-5.5.1

11 months ago

Tensile code for ROCm 5.5.1 did not change. The library was rebuilt for the updated ROCm 5.5.1 stack.

rocm-5.5.0

1 year ago

Added

Add functions for user-driven tuning
Add GFX11 support: HostLibraryTests yamls, rearragne FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimize, separate vscnt and vmcnt, dual mac
Add binary search for Grid-Based algorithm
Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)
Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)
Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)
Add GSU SingleBuffer algorithm for HSS/BSS
Add gfx900:xnack-, gfx1032, gfx1034, gfx1035
Enable gfx1031 support

Optimizations

Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1
Improve InitAccVgprOpt

Changed

Use global_atomic for GSU instead of flat and global_store for debug code
Replace flat_load/store with global_load/store
Use global_load/store for BufferLoad/Store=0 and enable scheduling
LocalSplitU support for HGEMM+HPA when MFMA disabled
Update Code Object Version
Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss
Update asm cap cache arguments
Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead
Change checks, error messages, assembly syntax, and coverage for DirectToLds
Remove unused cmake file
Clean up the LLVM dependency code
Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2
Update sgemm/hgemm test cases for DirectToLds and ThreadSepareteGlobalRead

Fixed

Add build-id to header of compiled source kernels
Fix solution index collisions
Fix h beta vectorwidth4 correctness issue for WMMA
Fix an error with BufferStore=0
Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)
Fix MoveMIoutToArch bug
Fix flat load correctness issue on I8 and flat store correctness issue
Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes
Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop
Fix issues with DirectToVgpr + ScheduleIterAlg<3
Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2
Fix mismatch issue with PrefetchGlobalRead=2
Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size
Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1
Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case
Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical
Fix for failing CI tests due to CpuThreads=0
Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2
Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)
Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)