LongNet Versions

Implementation of the plug-and-play attention mechanism from "LongNet: Scaling Transformers to 1,000,000,000 Tokens"

0.4.8

9 months ago

Changelog for DilatedAttention with ParallelWrapper:

1. Added ParallelWrapper Class

  • Introduced a ParallelWrapper class to simplify the usage of data parallelism.
  • The ParallelWrapper class:
    • Takes a neural network model as input.
    • Allows the user to specify a device ("cuda" or "cpu").
    • Contains a flag use_data_parallel to enable or disable data parallelism.
    • Checks if multiple GPUs are available and applies nn.DataParallel to the model accordingly.
    • Redirects attribute accesses to the internal model for seamless usage.

2. Modified Usage of DilatedAttention Model

  • Wrapped the DilatedAttention model using the ParallelWrapper class.
  • Enabled the model to be run on multiple GPUs if available.

3. Device Assignment

  • Explicitly defined a device and used it to specify where the DilatedAttention model should be loaded.
  • The device defaults to GPU (cuda:0) if CUDA is available; otherwise, it defaults to CPU.

4. Example Usage

  • Provided an example of how to initialize and use the ParallelWrapper with the DilatedAttention model (a minimal sketch follows the summary below).

Summary:

The key addition was the ParallelWrapper class, which makes data parallelism easy to configure for the provided DilatedAttention model. It enables scaling across multiple GPUs without any significant change to the existing workflow; data parallelism can now be enabled or disabled with a single flag.
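
For reference, a minimal sketch of what the ParallelWrapper described above might look like. The class name matches the changelog, but the constructor signature and internals are assumptions rather than the repository's actual code:

    import torch
    import torch.nn as nn

    class ParallelWrapper:
        # Hypothetical reconstruction based on the changelog entries above.
        def __init__(self, model, device="cuda", use_data_parallel=True):
            self.device = device
            self.model = model.to(device)
            # Apply nn.DataParallel only when requested and more than one GPU is visible.
            if use_data_parallel and torch.cuda.device_count() > 1:
                self.model = nn.DataParallel(self.model)

        def forward(self, *args, **kwargs):
            return self.model(*args, **kwargs)

        def __getattr__(self, name):
            # Redirect attribute access to the wrapped model for seamless usage.
            return getattr(self.model, name)

    # Usage sketch; the DilatedAttention constructor arguments are illustrative only.
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    # attention = DilatedAttention(d_model=512, num_heads=8)
    # model = ParallelWrapper(attention, device=device, use_data_parallel=True)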

0.4.3

9 months ago

Changelog:

  1. Tensor Shape Adjustments:

    • Ensured consistent tensor shapes across all operations.

    • Squeezed a_indices to 2D to match dimensions of att_denom_sums.

      a_indices = a_indices[:, :, 0].squeeze(-1).squeeze(-1)
      
    • Sliced a_indices to the unpadded sequence length before scattering.

      a_indices = a_indices[:, :unpadded_seq_len]
      
  2. Scatter and Gather Operations:

    • Scattered with the squeezed 2D a_indices and gathered the sparse sums back with the same indices (a toy illustration follows the list below).

      att_denom_sums.scatter_add_(1, a_indices, a_denoms)
      sparse_att_denom_sum = torch.gather(att_denom_sums, 1, a_indices)
      
  3. DataType Handling:

    • Converted the 'sparse indices' tensors to torch.int64 (or torch.long) to ensure compatibility with PyTorch's indexing operations.
    • Retained the torch.float16 dtype for the 'X' tensor to make it memory-efficient.
  4. Code Cleaning:

    • Removed repeated lines that print the shape and datatype of "sparse indices" to declutter the code.
    • Standardized debug print statements to have a consistent format.
    • Printed tensor shapes before scattering to verify that dimensions match.
    • Added comments explaining dimension squeezing, slicing, and other adjustments for clarity.
  5. Validation Checks:

    • Added checks to ensure tensors are on the same device (either all on CPU or all on CUDA).
    • Checked whether the size of the tensor 'X' matches the expected shape before operations.
  6. Enhanced Error Messages:

    • Improved the debug error messages to be more descriptive.
  7. Optimizations:

    • Removed unnecessary tensor operations that don't contribute to the final result.
    • Optimized tensor slicing and indexing operations to be more memory efficient.
  8. Edge Case Handling:

    • Handled the edge case of a negative head_idx.
  9. Other Minor Fixes:

    • Ensured that the code uses math- or memory-efficient attention only when the input tensor is on CUDA and a non-A100 GPU is detected.
    • Made sure tensor operations are consistent with PyTorch best practices.
  10. Documentation:

    • Added comments to highlight important changes and to explain certain decisions in the code.
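
As a rough illustration of the scatter/gather change in item 2 above, here is a toy, self-contained version of the pattern; the tensor shapes are invented for the example and are not the shapes used in the repository:

    import torch

    batch, seq_len, unpadded_seq_len = 2, 8, 6

    att_denom_sums = torch.zeros(batch, seq_len)                       # dense per-position denominator sums
    a_denoms = torch.rand(batch, unpadded_seq_len)                     # sparse denominators
    a_indices = torch.randint(0, seq_len, (batch, unpadded_seq_len))   # 2D indices, torch.int64 as required

    # Accumulate the sparse denominators into the dense sums ...
    att_denom_sums.scatter_add_(1, a_indices, a_denoms)
    # ... then read the summed denominator back for every sparse position.
    sparse_att_denom_sum = torch.gather(att_denom_sums, 1, a_indices)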

0.4.2

9 months ago
  • New sparsify function
  • Improved concat ops

0.4.1

9 months ago

Changelog

Bug Fixes

  1. Bug: Size mismatch in tensor operations in the forward method of the DilatedAttentionLLAMA class.

    • Root Cause: The tensors being operated on did not have matching dimensions due to incorrect striding operations.
    • Resolution: We modified the dilation process by introducing an inner loop over split tensors to handle each part separately, which resolved the dimension mismatch issues.
  2. Bug: Index out of range error while transposing tensors.

    • Root Cause: The index provided to the transpose operation was larger than the total number of dimensions in the tensor.
    • Resolution: Corrected the index passed to the transpose operation to fit within the number of dimensions in the tensor.

Improvements

  1. Optimized Tensor Operations: The tensor operations in the forward method were optimized to ensure they all operate on tensors with matching dimensions, improving the efficiency of the model.

  2. Added Error Handling: We added checks for dimension mismatches in tensor operations to throw useful error messages when the input data does not match the expected shape (a minimal sketch of such a check follows below).
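
A minimal sketch of the kind of dimension check described in item 2; the helper name and error message are hypothetical, not the repository's actual code:

    import torch

    def check_shape(tensor: torch.Tensor, expected: tuple, name: str = "tensor") -> None:
        # Raise a descriptive error instead of letting a bare size-mismatch surface later.
        if tuple(tensor.shape) != expected:
            raise ValueError(f"{name} has shape {tuple(tensor.shape)}, expected {expected}")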

Features

  1. DilatedAttentionLLAMA Class: Introduced a new DilatedAttentionLLAMA class that uses a dilated attention mechanism in its forward method. This new implementation is designed to be more efficient for longer sequence lengths.

  2. Performance Testing: Added a simple performance test to benchmark the speed of the forward method in the DilatedAttentionLLAMA class (a rough timing sketch follows below).
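
One way such a quick benchmark could look; this is a generic wall-clock timing helper and not the actual test shipped with the release:

    import time
    import torch

    def time_forward(model, x, n_iters: int = 10) -> float:
        # Average forward-pass time in seconds, with one warm-up pass excluded from timing.
        with torch.no_grad():
            model(x)
            start = time.time()
            for _ in range(n_iters):
                model(x)
        return (time.time() - start) / n_iters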

0.4.0

10 months ago

Changelog

Bug Fixes

  1. Issue: ValueError: too many values to unpack (expected 3)

    • Root Cause: The attention function was returning more than three values, but the code was trying to unpack its return values into only three variables.
    • Resolution: Modified the line where the attention function is called to collect all additional return values into a list using the * operator (a toy example follows at the end of this entry).
  2. Issue: RuntimeError: The size of tensor a (64) must match the size of tensor b (2) at non-singleton dimension 1

    • Root Cause: The code was trying to add two tensors of different sizes in the forward method of the DynamicDilatedAttention class.
    • Resolution: Modified the line where the tensors are added so that attn_output has the same size as the corresponding slice of outputs before the addition.
  3. Issue: ValueError: not enough values to unpack (expected 7, got 6)

    • Root Cause: The flash_attn function in the FlashAttention class was trying to unpack the shape of the q tensor into seven variables, but the q tensor only had six dimensions.
    • Resolution: Modified the forward method of the DilatedAttention class to reshape the x tensor correctly before passing it to the attention function.

Improvements

  1. Added assertions to check the types and values of the parameters in the __init__ method of the DilatedAttention class to prevent incorrect usage.

  2. Added a check for the Distributed parameter in the __init__ method of the DilatedAttention class to decide whether to use the DataParallel wrapper for the FlashAttention modules.

  3. Modified the forward method of the DilatedAttention class to process each segment of the input separately for each attention head, allowing the attention heads to share information between different segments.

  4. Modified the forward method of the DilatedAttention class to use a buffer for the attn_output_resized tensor instead of creating a new tensor of zeros in every forward pass, improving efficiency.
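
To illustrate the unpacking fix for the first issue above, a toy example of collecting extra return values with the * operator; the stub function is invented for illustration:

    def attention_stub():
        # Stands in for an attention function that returns more than three values.
        return "output", "weights", "cache", "extra_1", "extra_2"

    # Unpack the first three values and collect any additional ones into a list.
    attn_output, attn_weights, cache, *rest = attention_stub()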

0.3.9

10 months ago

0.3.8

10 months ago

0.3.7

10 months ago

0.3.6

10 months ago

0.3.5

10 months ago