Kyegomez Sophia Versions

Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.

e11

11 months ago

Decoupled Sophia

Algorithmic Pseudocode for Decoupled Sophia

1. Create a new class DecoupledSophia that inherits from torch.optim.Optimizer.
2. Initialize the optimizer with the model, input data, and other necessary parameters.
3. Implement the step method:
   - If a closure is provided, compute the loss.
   - Iterate through the parameter groups and their parameters.
   - If the gradient is not available for a parameter, skip it.
   - Initialize the state for the parameter if it doesn't exist.
   - Update the biased first moment estimate.
   - Update the Hessian estimate every k steps using the chosen estimator.
   - Update the parameters using the decoupled update rule.
4. Implement the Hessian estimators as separate methods, e.g., hutchinson and gauss_newton_bartlett.
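A minimal sketch of this structure follows. It passes the Hessian estimator in as an object with an estimate(p, grad) method, anticipating the decoupled design described in the e4 and e3 notes further down; the constructor arguments, defaults, and hook names are illustrative assumptions rather than the repository's final API.

import torch

# Skeleton only: argument names and the estimator interface are illustrative assumptions.
class DecoupledSophia(torch.optim.Optimizer):
    def __init__(self, params, hessian_estimator, lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.0, k=10, rho=1.0):
        self.hessian_estimator = hessian_estimator
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, k=k, rho=rho)
        super().__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            beta1, beta2 = group['betas']
            for p in group['params']:
                if p.grad is None:
                    continue  # skip parameters without gradients
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p)  # first moment (EMA of gradients)
                    state['h'] = torch.zeros_like(p)  # diagonal Hessian estimate
                state['step'] += 1
                m, h = state['m'], state['h']

                # Biased first moment estimate.
                m.mul_(beta1).add_(p.grad, alpha=1 - beta1)

                # Refresh the Hessian estimate every k steps via the plugged-in estimator.
                if state['step'] % group['k'] == 1:
                    h_hat = self.hessian_estimator.estimate(p, p.grad)
                    h.mul_(beta2).add_(h_hat, alpha=1 - beta2)

                with torch.no_grad():
                    # Decoupled weight decay, then the clipped preconditioned update.
                    p.mul_(1 - group['lr'] * group['weight_decay'])
                    p.addcdiv_(m, h.clamp(min=group['eps'], max=group['rho']), value=-group['lr'])

        return loss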

e10

11 months ago

Here are five optimization suggestions for the Sophia class:

1. Use torch.einsum to compute the dot product in the hutchinson method.
2. Use torch.no_grad() to avoid unnecessary gradient computations during the parameter update.
3. Use in-place operations for updating the parameters.
4. Cache the result of group['eps'] and group['rho'] to avoid repeated computations.
5. Use a more efficient method to compute the softmax and loss in the gauss_newton_bartlett method.

Pseudocode

1. Modify the hutchinson method to use torch.einsum for the dot product.
2. Use torch.no_grad() in the step method during the parameter update.
3. Replace add_ with addcdiv_ for in-place operations in the step method.
4. Cache the result of group['eps'] and group['rho'] in the step method.
5. Compute the softmax and loss more efficiently in the gauss_newton_bartlett method.

PyTorch Python Code

import torch


class Sophia(torch.optim.Optimizer):
    def __init__(self, model, input_data, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0, k=10, estimator="Hutchinson", rho=1):
        self.model = model
        self.input_data = input_data
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
                        k=k, estimator=estimator, rho=rho)
        super(Sophia, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            # Cache eps and rho once per group to avoid repeated lookups.
            eps = group['eps']
            rho = group['rho']
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError("Sophia does not support sparse gradients")

                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['m'] = torch.zeros_like(p.data)
                    state['h'] = torch.zeros_like(p.data)

                m, h = state['m'], state['h']
                beta1, beta2 = group['betas']
                state['step'] += 1

                if group['weight_decay'] != 0:
                    grad = grad.add(p.data, alpha=group['weight_decay'])

                # Biased first moment estimate (EMA of gradients).
                m.mul_(beta1).add_(grad, alpha=1 - beta1)

                # Refresh the diagonal Hessian estimate every k steps.
                if state['step'] % group['k'] == 1:
                    if group['estimator'] == "Hutchinson":
                        # Pass the graph-connected gradient; requires backward(create_graph=True).
                        hessian_estimate = self.hutchinson(p, p.grad)
                    elif group['estimator'] == "Gauss-Newton-Bartlett":
                        hessian_estimate = self.gauss_newton_bartlett(p, grad)
                    else:
                        raise ValueError("Invalid estimator choice")
                    h.mul_(beta2).add_(hessian_estimate, alpha=1 - beta2)

                with torch.no_grad():
                    # Decoupled weight decay, then the clipped preconditioned update.
                    p.data.mul_(1 - group['lr'] * group['weight_decay'])
                    p.data.addcdiv_(m, h.add(eps).clamp(max=rho), value=-group['lr'])

        return loss

    def hutchinson(self, p, grad):
        # Hutchinson estimator: E[u * (H u)] with u ~ N(0, I) equals the Hessian diagonal.
        u = torch.randn_like(grad)
        grad_dot_u = torch.einsum("...,...->", grad, u)
        hessian_vector_product = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        return u * hessian_vector_product

    def gauss_newton_bartlett(self, p, grad):
        # Gauss-Newton-Bartlett estimator; assumes self.loss_function is provided elsewhere.
        B = len(self.input_data)
        logits = [self.model(xb) for xb in self.input_data]
        y_hats = [torch.softmax(logit, dim=0) for logit in logits]
        g_hat = torch.autograd.grad(
            sum(self.loss_function(logit, y_hat) for logit, y_hat in zip(logits, y_hats)) / B,
            p, retain_graph=True)[0]
        return B * g_hat * g_hat

This updated Sophia class incorporates the suggested optimizations, making the code more efficient and potentially faster.
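For context, here is a small, illustrative usage sketch of the class above; the toy model, data, targets, and hyperparameters are assumptions, and backward is called with create_graph=True so the Hutchinson estimator can differentiate through the gradient.

import torch
import torch.nn as nn

# Toy wiring example; model, data, and hyperparameters are illustrative only.
model = nn.Linear(10, 2)
input_data = [torch.randn(10) for _ in range(4)]
targets = [torch.randn(2) for _ in range(4)]
criterion = nn.MSELoss()

optimizer = Sophia(model, input_data, model.parameters(), lr=1e-3, k=10, estimator="Hutchinson")

for xb, yb in zip(input_data, targets):
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    # create_graph=True keeps the graph so the Hutchinson Hessian-vector product can be taken.
    loss.backward(create_graph=True)
    optimizer.step()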

e9

11 months ago

can't spell weight decay

e8

11 months ago

The provided code for the Hutchinson estimator assumes that the input tensors are 1D. However, in many network architectures, the parameters can be multi-dimensional tensors. To handle this case, we need to modify the Hutchinson estimator to compute the dot product and Hessian-vector product correctly for multi-dimensional tensors.

class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        u = torch.randn_like(grad)
        grad_dot_u = torch.sum(grad * u)
        hessian_vector_product = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        return u * hessian_vector_product
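As a quick, self-contained check (with a stand-in base class, since HessianEstimator is defined elsewhere in the decoupled design), the estimator returns a per-element estimate of the right shape for a 2-D parameter:

import torch

class HessianEstimator:  # minimal stand-in for the base class used above
    def estimate(self, p, grad):
        raise NotImplementedError

class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        u = torch.randn_like(grad)
        grad_dot_u = torch.sum(grad * u)  # dot product works for tensors of any shape
        hessian_vector_product = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        return u * hessian_vector_product

# 2-D parameter with a simple quadratic loss; the true diagonal Hessian is 2 everywhere.
p = torch.randn(4, 3, requires_grad=True)
loss = (p ** 2).sum()
grad = torch.autograd.grad(loss, p, create_graph=True)[0]  # keep the graph for the HVP

h_hat = HutchinsonEstimator().estimate(p, grad)
print(h_hat.shape)  # torch.Size([4, 3]); values average to about 2 over many samples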

e7

11 months ago

fixing relative import

e6

11 months ago

import layer

e5

11 months ago

Research Analysis: Sophia Paper's Training Strategy

Architecture
- Model: autoregressive models on OpenWebText
- Context length: 1024
- Model type: decoder-only Transformers
- Model sizes: 125M (small), 355M (medium), and 770M (large)

Datasets
- OpenWebText (Gokaslan & Cohen, 2019)

Baselines
- Adam with decoupled weight decay (AdamW) (Loshchilov & Hutter, 2017)
- Lion (Chen et al., 2023)

Algorithmic Pseudocode
1. Initialize the model (GPT-2) with the desired number of parameters (small, medium, or large).
2. Load the OpenWebText dataset.
3. Set the context length to 1024.
4. Set the batch size to 480.
5. Use a cosine learning rate schedule with the final learning rate equal to 0.05 times the peak learning rate.
6. Apply gradient clipping with a threshold of 1.0.
7. Use a fixed 2k steps of learning rate warm-up.
8. Train the model using the Sophia optimizer with the chosen Hessian estimator (Sophia-H or Sophia-G) and hyperparameters.
9. Train the model for 100K, 200K, or 400K steps.
10. Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.

Training Code with Hugging Face Transformers API
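A hedged sketch of how these settings might map onto Hugging Face TrainingArguments. The output path, peak learning rate, and the split of the 480-example global batch into per-device batch times accumulation steps are assumptions, and the 0.05x-peak final learning rate would need a custom scheduler, since the stock cosine schedule decays to zero.

from transformers import TrainingArguments

# Illustrative mapping of the settings above; exact values are assumptions.
training_args = TrainingArguments(
    output_dir="sophia-gpt2-small",
    max_steps=100_000,                  # 100K / 200K / 400K step budgets
    per_device_train_batch_size=12,     # 12 x 40 accumulation steps ~ 480 global batch (assumed split)
    gradient_accumulation_steps=40,
    learning_rate=6e-4,                 # placeholder peak LR; tune per model size
    lr_scheduler_type="cosine",         # decays to 0, not to 0.05x peak, without a custom scheduler
    warmup_steps=2_000,                 # fixed 2k-step warm-up
    max_grad_norm=1.0,                  # gradient clipping threshold
    logging_steps=100,
    save_steps=10_000,
)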

High-Level Architecture
1. Load the OpenWebText dataset from Hugging Face Datasets.
2. Preprocess the dataset:
   - Tokenize the text using a tokenizer.
   - Group the tokenized text into chunks of a specified sequence length.
3. Save the preprocessed dataset.

Algorithmic Pseudocode
1. Load the OpenWebText dataset.
2. Initialize the tokenizer.
3. Define a tokenize function that tokenizes the text and adds an end-of-sequence token.
4. Apply the tokenize function to the dataset using the map function.
5. Define a group_texts function that concatenates all texts and splits them into chunks of the specified sequence length.
6. Apply the group_texts function to the tokenized dataset using the map function.
7. Save the preprocessed dataset.
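A minimal preprocessing sketch along these lines, assuming the GPT-2 tokenizer, a 1024-token block size, and a local save path of our choosing:

from datasets import load_dataset
from transformers import GPT2TokenizerFast

block_size = 1024
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(examples):
    # Append an end-of-sequence token to each document before tokenizing.
    return tokenizer([t + tokenizer.eos_token for t in examples["text"]])

def group_texts(examples):
    # Concatenate all token lists, then split them into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

dataset = load_dataset("openwebtext", split="train")
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)
lm_dataset.save_to_disk("openwebtext-gpt2-1024")  # save path is an assumption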

Algorithmic Pseudocode
1. Load the OpenWebText dataset.
2. Preprocess the dataset: tokenize the text using a tokenizer, then group the tokenized text into chunks of a specified sequence length.
3. Initialize the GPT-2 model and tokenizer.
4. Set up the training arguments.
5. Create the Trainer with the model, training arguments, and preprocessed dataset.
6. Train the model using the DecoupledSophia optimizer with the chosen Hessian estimator and hyperparameters.
7. Evaluate the model using log perplexity on OpenWebText and in-context learning results on SuperGLUE.
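A sketch of the end-to-end wiring, assuming the preprocessed dataset from the sketch above has been saved to disk; the DecoupledSophia and HutchinsonEstimator import is hypothetical and left commented because the repository's module layout is not shown here.

from datasets import load_from_disk
from transformers import (GPT2Config, GPT2LMHeadModel, Trainer,
                          TrainingArguments, default_data_collator)

# Hypothetical import from this repository; adjust to the actual module layout.
# from decoupled_sophia import DecoupledSophia, HutchinsonEstimator

lm_dataset = load_from_disk("openwebtext-gpt2-1024")   # produced by the preprocessing sketch above
model = GPT2LMHeadModel(GPT2Config())                  # ~124M-parameter GPT-2, trained from scratch

training_args = TrainingArguments(
    output_dir="sophia-gpt2-small",
    max_steps=100_000,
    per_device_train_batch_size=12,
    gradient_accumulation_steps=40,
    # schedule, warm-up, and clipping settings as in the hyperparameter sketch further up
    lr_scheduler_type="cosine",
    warmup_steps=2_000,
    max_grad_norm=1.0,
)

# optimizer = DecoupledSophia(model.parameters(), hessian_estimator=HutchinsonEstimator(), lr=6e-4)
optimizer = None   # placeholder: with None the Trainer falls back to its default AdamW

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset,
    data_collator=default_data_collator,
    optimizers=(optimizer, None),   # plug the custom optimizer in here; the scheduler is built from args
)
trainer.train()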

e4

11 months ago

To make Sophia decoupled, we can separate the Hessian estimation from the main optimizer. This will allow users to plug in different Hessian estimators without modifying the core optimizer code. Here's the research analysis, algorithmic pseudocode, and Python implementation for a decoupled Sophia optimizer.

Architectural Analysis
1. Create a base Hessian estimator class that defines the interface for all Hessian estimators.
2. Implement specific Hessian estimators (e.g., Hutchinson, Gauss-Newton-Bartlett) as subclasses of the base Hessian estimator class.
3. Modify the Sophia optimizer to accept a Hessian estimator object during initialization.
4. Update the optimizer's step method to use the provided Hessian estimator object for Hessian estimation.

Algorithm Pseudocode

Base Hessian Estimator
- Define an abstract method estimate that takes the parameter θ and gradient as input and returns the Hessian estimate.

Hutchinson Estimator
- Inherit from the base Hessian estimator class.
- Implement the estimate method using the Hutchinson algorithm.

Gauss-Newton-Bartlett Estimator
- Inherit from the base Hessian estimator class.
- Implement the estimate method using the Gauss-Newton-Bartlett algorithm.

Decoupled Sophia Optimizer
- Modify the Sophia optimizer to accept a Hessian estimator object during initialization.
- Update the optimizer's step method to use the provided Hessian estimator object for Hessian estimation.
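A compact sketch of that class hierarchy, assuming an estimate(p, grad) interface; a Gauss-Newton-Bartlett variant would subclass the same base but also needs the model, data, and loss function, so it is only noted in a comment here.

from abc import ABC, abstractmethod
import torch

class HessianEstimator(ABC):
    """Interface every Hessian estimator must implement."""
    @abstractmethod
    def estimate(self, p, grad):
        """Return a per-element (diagonal) Hessian estimate for parameter p."""

class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        # u ~ N(0, I); the expectation of u * (H u) is the Hessian diagonal.
        u = torch.randn_like(grad)
        grad_dot_u = torch.sum(grad * u)
        hvp = torch.autograd.grad(grad_dot_u, p, retain_graph=True)[0]
        return u * hvp

# A Gauss-Newton-Bartlett estimator would follow the same pattern, holding references to the
# model, input data, and loss function, and its instance would be passed to the optimizer, e.g.
# DecoupledSophia(model.parameters(), hessian_estimator=HutchinsonEstimator(), lr=1e-3)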

e3

11 months ago


e2

11 months ago

Sophiaaaa