An Introduction to LoRA: Unpacking the Theory and Practical Implementation

12 min read · Feb 17, 2024

A blue neon light covered stairway converging on the horizon — Photo by Ashley Jurius on Unsplash

The emergence of Large Language Models (LLMs) has increased public awareness of the practical applications of AI. In the past, human interaction with AI was mainly mediated through software algorithms; today this interaction is more personal and direct.

Large language models are now used in a plethora of tasks, creating a need for LLMs specialized in diverse domains such as accounting, medicine, and software. Fortunately, an already pre-trained language model can be finetuned to specialize in a specific domain. For example, an LLM could be finetuned to specialize in generating code, interpreting legal documents, or generating poetry.

Fully finetuning a Large Language Model (LLM) is quite resource-intensive, given the large number of parameters these models contain. If not done appropriately, it can also lead to the pre-trained model losing some of its base language understanding. For example, GPT-3 is a model with 175 billion parameters, and finetuning such a model is a non-trivial task.

Considering the immense benefit of LLMs, it is a worthy endeavor to research more efficient ways of finetuning such large language models. These methods are referred to as parameter-efficient finetuning (PEFT). Most of the methods discovered so far involve making some modifications to the original LLM architecture, hence introducing a fresh set of parameters that will encapsulate knowledge of the new domain to further complement the already existing knowledge of the pre-trained model.

LoRA (Low-Rank Adaptation) is a PEFT method that achieves this with minimal computational resources while, according to the LoRA paper, performing on par with full finetuning on many tasks.

In this article, we will delve deeper into the world of LoRA, focusing on the concept, benefits, and practical application of LoRA.

Parameter Efficient Fine-Tuning

The concept of Parameter-Efficient Fine-Tuning (PEFT) has significantly lowered the economic barrier to entry for LLM application development and has fueled research into a diverse array of approaches. These methods can be grouped into three categories:

Selective: This approach involves fine-tuning a carefully chosen subset of the pre-trained model’s weights.
Reparametrization: Reparametrization methods such as Low-Rank Adaptation (LoRA) learn a low-dimensional representation of the update to a specific module (a set of parameters, e.g. the query projection matrix of a transformer model) in the original LLM.
Additive: This approach involves adjusting the pre-trained model by adding new modules for fine-tuning. These modules are then trained to incorporate knowledge of the new domain into the pre-trained LLM.

The reparametrization method known as Low-Rank Adaptation (LoRA) offers a very efficient way of fine-tuning. It trains a much smaller set of parameters that approximates the update to the larger weight matrix in the pre-trained model. This significantly reduces the number of trainable parameters, saving time and computational resources.

Introduction to LoRA (Low-Rank Adaptation)

The idea behind Low-Rank Adaptation (LoRA) builds on the observation that the weights learned by Large Language Models (LLMs) often contain redundancies. Therefore, instead of fine-tuning the entire set of weights in the LLM, we can streamline the process by learning a low-rank approximation of the weight update: essentially, a much smaller set of weights that leaves out these redundancies.

According to the LoRA paper, this idea comes from the simple hypothesis that the change in weights during adaptation has a low “intrinsic rank” (simply meaning the update contains redundancy we can do without).

We take inspiration from Li et al. (2018a); Aghajanyan et al. (2020) which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the change in weights during model adaptation also has a low “intrinsic rank”, leading to our proposed Low-Rank Adaptation (LoRA) approach

During the fine-tuning process, the pre-trained weights remain frozen, which ensures that the learned weights are not altered. This approach not only makes the process more efficient but also reduces the risk of overfitting and of “catastrophic forgetting”, a phenomenon where the knowledge encapsulated in the pre-trained model is lost during fine-tuning.

Theoretical Fundamentals behind LoRA

Transitioning from a large parameter space to a low-rank approximation that is capable of introducing new information into the pre-trained Large Language Model (LLM) is a key aspect of LoRA. To understand this, we can consider the statement below from the LoRA paper.

Inspired by this, we hypothesize the updates to the weights also have a low “intrinsic rank” during adaptation. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$.

The following equation succinctly captures the LoRA-based fine-tuning process:

$$W = W_0 + \Delta W = W_0 + BA$$

In this equation,

$W_0$ denotes the pre-trained parameter weights
$\Delta W$ denotes the learned weights to be used in adjusting the original weights
$W$ is the final fine-tuned weight that will be used during inference
$B$ is a matrix of dimension $d \times r$ and $A$ is a matrix of dimension $r \times k$
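
As a rough illustration of the equation above, the NumPy sketch below builds a frozen pre-trained matrix together with the two trainable factors. The matrix sizes and the random seed are arbitrary choices for this example. The initialization (A random, B zero, so that the update starts at zero) follows what the LoRA paper describes, but this is only a sketch and not the paper's or any library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 6, 5, 2  # illustrative sizes only; in practice r << min(d, k)

W0 = rng.normal(size=(d, k))  # pre-trained weights, kept frozen
B = np.zeros((d, r))          # trainable factor, starts at zero
A = rng.normal(size=(r, k))   # trainable factor, random initialization

delta_W = B @ A               # low-rank update with the same shape as W0
W = W0 + delta_W              # fine-tuned weights used at inference

x = rng.normal(size=(k,))
# The forward pass does not need to form delta_W explicitly: W0 x + B (A x)
h = W0 @ x + B @ (A @ x)

print(delta_W.shape)          # (6, 5)
print(np.allclose(W @ x, h))  # True
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only B and A would receive gradient updates during training.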

The approach is to fine-tune the matrix decomposition of ∆W, i.e. the matrices B and A, with a rank r significantly smaller than min(d, k) of the original matrix. This reduces the number of parameters we need to fine-tune. Consider the following statement from the LoRA paper:

On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With r = 4 and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000× (from 350GB to 35MB). This allows us to train with significantly fewer GPUs and avoid I/O bottlenecks.

To underscore this point further, let’s consider an initial matrix with dimensions 500 x 400, yielding a total parameter count of 200,000. Employing matrix decomposition with a rank r of 4 produces two matrices: a 500 x 4 matrix (B) and a 4 x 400 matrix (A). By training these two matrices instead, the number of training parameters is dramatically reduced from 200,000 to a mere 3,600 (500 x 4 + 4 x 400 = 2,000 + 1,600)!
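
The arithmetic in this example can be checked with a few lines of Python. The helper name `lora_param_counts` is made up for this sketch.

```python
def lora_param_counts(d, k, r):
    """Return (full, low_rank) trainable parameter counts for a d x k matrix."""
    full = d * k
    low_rank = d * r + r * k  # the d x r factor plus the r x k factor
    return full, low_rank


full, low_rank = lora_param_counts(d=500, k=400, r=4)
print(full, low_rank)            # 200000 3600
print(f"{low_rank / full:.1%}")  # 1.8%
```

In other words, the two low-rank factors hold under 2% of the parameters of the matrix they adapt.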

Also, the product of matrices B and A, with dimensions d×r and r×k respectively, yields a matrix with dimensions d×k. Hence ∆W has the same dimensions as the original weight matrix W0, and so does the fine-tuned weight W: dim(W0) = dim(∆W) = dim(W) = d×k.

This simplifies the computation of the final fine-tuned weight W, as it becomes a straightforward task of matrix addition.

The plot below shows this process.

Fine-tuning the matrices A & B gives the following advantages:

Finetuning takes significantly less memory and storage, since r << min(d, k)
The risk of catastrophic forgetting is reduced, since the rest of the model parameters remain untouched
Switching between different downstream tasks is significantly easier: we just have to subtract BA from the merged weights W to recover W0, then add the new downstream decomposition B’A’
There is no additional inference latency once BA is merged into W0, since the merged model has the same modules and parameter count as the original model
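
The task-switching and latency points can be sketched with NumPy as follows. The two adapters here are random stand-ins for trained factors, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 6, 5, 2  # illustrative sizes only

W0 = rng.normal(size=(d, k))                               # frozen base weights
B1, A1 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # stand-in adapter for task 1
B2, A2 = rng.normal(size=(d, r)), rng.normal(size=(r, k))  # stand-in adapter for task 2

W_task1 = W0 + B1 @ A1         # merge: inference uses one d x k matrix
recovered = W_task1 - B1 @ A1  # unmerge: recover the base weights
W_task2 = recovered + B2 @ A2  # swap in the adapter for another task

print(np.allclose(recovered, W0))   # True
print(W_task1.shape == W0.shape)    # True
```

Since the merged matrix has the same shape as the original one, a merged model performs the same amount of computation per forward pass as the base model.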

To finetune, we need to choose which parameters in the pretrained LLM to adapt with a low-rank decomposition. In the LoRA paper, the authors focused on the weights in the transformer self-attention module, but the LoRA method can be applied to any set of weight matrices within the model. This still leaves us with the following questions:

Which of the weights in the self-attention module to finetune?
What rank should the matrix decomposition of these weights have?

Finetuned model performance for different sets of weights and ranks

To answer these questions, we can refer to the data provided in the original paper. Looking at the table above, we can make the following deductions:

We get better finetuned model performance by adapting multiple weight matrices with a lower r than by adapting a single type of weight matrix with a higher r
A rank as small as two can result in very good model performance, which makes it possible to spread the trainable parameter budget across several weight matrices.
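
To make the idea of a parameter budget concrete, the sketch below counts LoRA parameters for different combinations of adapted attention matrices and ranks. The helper `lora_budget` is hypothetical, and the model shape (96 layers, hidden size 12,288, square attention projections) is an assumption based on the commonly reported GPT-3 175B architecture rather than something taken from this article.

```python
def lora_budget(n_layers, d_model, n_matrices, r):
    """Trainable LoRA parameters when n_matrices square (d_model x d_model)
    projections per layer are each adapted with rank r."""
    per_matrix = r * (d_model + d_model)  # one d_model x r factor plus one r x d_model factor
    return n_layers * n_matrices * per_matrix


n_layers, d_model = 96, 12288  # assumed GPT-3 175B shape

one_matrix_r8 = lora_budget(n_layers, d_model, n_matrices=1, r=8)
two_matrices_r4 = lora_budget(n_layers, d_model, n_matrices=2, r=4)
four_matrices_r2 = lora_budget(n_layers, d_model, n_matrices=4, r=2)

print(one_matrix_r8, two_matrices_r4, four_matrices_r2)
# 18874368 18874368 18874368
```

Under these assumptions, halving the rank frees up enough budget to adapt twice as many matrices, which is the trade-off the deductions above refer to.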

Implementation with 🤗 and Ray for Hyperparameter tuning

Implementing PEFT fine-tuning with LoRA is quite straightforward thanks to the Hugging Face libraries. One only needs to select the model to finetune, prepare a LoRA config (the parameters needed for LoRA finetuning, e.g. target modules, rank, etc.), and finally create and run a training job.

The code below details these steps:

Import the load_dataset function to help with collecting data. For this demo we use the rotten_tomatoes movie review dataset for sentiment classification
id2label and label2id map between the model's integer class ids and the corresponding sentiments, either positive or negative
Load the pre-trained model distilbert-base-uncased, a lighter version of the bert-base-uncased model with 40% fewer parameters, which fits our goal of sequence classification
Use AutoTokenizer.from_pretrained to get the tokenizer that matches our model checkpoint, ensuring we use the same vocabulary that was used during model pretraining
Create a tokenizer function to enable tokenization over batches and a DataCollatorWithPadding to be used later when training over data batches, so that tokenized data is padded before being passed to training
Create peft_model with the desired lora_config (settings of LoRA parameters, e.g. rank, target modules, alpha, etc.) and print its trainable parameter count to check that it fits expectations
Initialize TrainingArguments with the training configuration (e.g. learning rate, number of training steps, etc.) and a Trainer with peft_model, the tokenized data, and the data collator
Train the model with trainer.train()

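import time  # used below to timestamp the output directory

from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,  # used below to pad each batch dynamically
    Trainer,
    TrainingArguments,
)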

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}


model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", id2label=id2label, label2id=label2id
)
dataset = load_dataset("rotten_tomatoes")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

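def tokenizer_func(batch):
    # Tokenize a batch of reviews; padding is left to the data collator
    return tokenizer(batch["text"], truncation=True)


# Pads every batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_tokenized = dataset.map(tokenizer_func, batched=True)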

train_data_tokenized = data_tokenized["train"].remove_columns(["text"]).rename_column("label", "labels")
val_data_tokenized = data_tokenized["validation"].remove_columns(["text"]).rename_column("label", "labels")

lora_config = LoraConfig(
    r=8, # rank - see hyperparameter section for more details
    lora_alpha=32, # scaling factor - see hyperparameter section for more details
    target_modules=["q_lin", "v_lin"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925332946813067
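
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes the DistilBERT-base shape (6 transformer layers, hidden size 768) and assumes that both layers of the classification head (pre_classifier and classifier) are left trainable alongside the LoRA factors. Neither assumption is stated in the output itself.

```python
# Assumed DistilBERT-base shape: 6 transformer layers, hidden size 768
n_layers, hidden, r, num_labels = 6, 768, 8, 2
adapted_per_layer = 2  # q_lin and v_lin

# Each adapted hidden x hidden projection gets a (hidden x r) and an (r x hidden) factor
lora_params = n_layers * adapted_per_layer * r * (hidden + hidden)

# Assumption: both head layers stay trainable (weights plus biases)
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

trainable = lora_params + pre_classifier + classifier
print(lora_params, trainable)  # 147456 739586
```

Under these assumptions, most of the trainable parameters belong to the classification head, and the LoRA factors themselves account for only 147,456 of them.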

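# The timestamp keeps output directories from different runs apart
output_dir = f'./rotten-tomatoes-classification-training-{int(time.time())}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    logging_steps=1,
    max_steps=10,  # short demo run; increase for real training
)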

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_data_tokenized,
    eval_dataset=val_data_tokenized,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

Step Training Loss
1 0.658800
2 0.689900
3 0.669800
4 0.694300
5 0.681500
6 0.675700
7 0.706700
8 0.663600
9 0.696000
10 0.672500
TrainOutput(global_step=10, training_loss=0.6809009850025177, metrics={'train_runtime': 1.9137, 'train_samples_per_second': 41.805, 'train_steps_per_second': 5.226, 'total_flos': 1000019035200.0, 'train_loss': 0.6809009850025177, 'epoch': 0.01})

Hyper-parameter tuning of LoRA parameters

To further improve model performance, we could tune the LoRA parameters, i.e. r and target modules, as well as other training parameters, e.g. learning rate.

r: a lower r leads to a smaller number of parameters to finetune. A pragmatic approach is to start with a small r and gradually increase it until the performance gains diminish.
target modules/layers: target modules should be selected based on the model architecture and task. We use hyperparameter tuning to explore which modules result in the best performance.
alpha: alpha is a scaling factor that controls how strongly the update from the low-rank modules is weighted relative to the already learned weights. The update is scaled by alpha/rank, which means a scaling factor of 1 if alpha = rank. During hyperparameter tuning, we will set alpha equal to rank, since we are not learning something entirely new and the base model has enough understanding of language to make the right prediction.
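
The alpha/rank relationship can be illustrated with a tiny helper. The function name `lora_scaling` is made up for this sketch, and it covers only the plain alpha/rank scaling described above.

```python
def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the frozen weights."""
    return lora_alpha / r


print(lora_scaling(lora_alpha=32, r=8))  # 4.0
print(lora_scaling(lora_alpha=8, r=8))   # 1.0
print(lora_scaling(lora_alpha=16, r=4) == lora_scaling(lora_alpha=32, r=8))  # True
```

With r=8 and lora_alpha=32, as in the earlier config, the update is scaled by 4. Setting alpha equal to r, as done in the tuning code, fixes the scale at 1 so that only r, the target modules, dropout, and the learning rate vary between trials.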

import os
import time
from datetime import datetime

import numpy as np
import ray
import torch
from peft import LoraConfig, TaskType, get_peft_model
from ray import train, tune
from ray.tune import TuneConfig, with_resources
from ray.tune.schedulers import ASHAScheduler
from ray.tune.tuner import Tuner
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from torch.utils.data import DataLoader, Subset
from torch.utils.tensorboard import SummaryWriter

def create_data_loaders(dataset, batch_size, sample_pct, collate_fn):
    # Sample a random subset so the example also runs with limited resources
    subset_size = int(len(dataset) * sample_pct)

    # Convert to plain Python ints so the dataset is indexed with integers
    indices = torch.randperm(len(dataset))[:subset_size].tolist()
    subset = Subset(dataset, indices)

    data_loader = DataLoader(subset, collate_fn=collate_fn, batch_size=batch_size, shuffle=True)

    return data_loader


def compute_metrics(predicted, actual):
    # Logits -> predicted class ids; .cpu() keeps this working when training on a GPU
    predicted = np.argmax(predicted.detach().cpu().numpy(), axis=1)
    actual = actual.detach().cpu().numpy()
    f1 = f1_score(actual, predicted, average='weighted')
    precision = precision_score(actual, predicted, average='weighted')
    recall = recall_score(actual, predicted, average='weighted')
    accuracy = accuracy_score(actual, predicted)
    return f1, precision, recall, accuracy

def train_model(config, model, train_data, val_data, sample_pct, batch_size):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    writer = SummaryWriter(f"runs at {timestamp}")
    best_eval_loss = float('inf')
    NUM_EPOCHS = 2 #adjust to suit needs
    alpha = config["r"]  # alpha = r gives a scaling factor alpha / r of 1

    lora_config = LoraConfig(
        r=config["r"],
        lora_alpha=alpha,
        target_modules=config["target_modules"],
        lora_dropout=config["lora_dropout"],
        bias="none",
        task_type=TaskType.SEQ_CLS)

    model = get_peft_model(model, lora_config)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    train_loader = create_data_loaders(train_data, batch_size=batch_size, sample_pct=sample_pct, collate_fn=data_collator)
    val_loader = create_data_loaders(val_data, batch_size=batch_size, sample_pct=sample_pct, collate_fn=data_collator)

    if hasattr(train, "get_checkpoint") and train.get_checkpoint():
        loaded_checkpoint = train.get_checkpoint()
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state, optimizer_state, epoch_start = torch.load(
                os.path.join(loaded_checkpoint_dir, "checkpoint.pt")
            )
            model.load_state_dict(model_state)
            optimizer.load_state_dict(optimizer_state)
    else:
        epoch_start = 0

    for epoch in range(epoch_start, NUM_EPOCHS):
        print(f"Epoch {epoch + 1}")
        model.train()
        running_loss = 0

        for j, batch in enumerate(train_loader):
            # Move the batch to the same device as the model
            batch = {key: value.to(device) for key, value in batch.items()}
            output = model(**batch)
            loss = output.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            running_loss += loss.item()

            if j % 10 == 9:
                # Average over the 10 batches accumulated since the last reset
                last_loss = running_loss / 10
                print(f"loss at batch {j + 1} = {last_loss}")
                tb_x = epoch * len(train_loader) + j + 1
                writer.add_scalar("loss", last_loss, tb_x)
                running_loss = 0

        model.eval()
        running_eval_loss = 0
        running_eval_f1 = 0
        running_eval_auc = 0
        running_eval_precision = 0
        running_eval_recall = 0
        running_eval_accuracy = 0

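        # Gradients are not needed for evaluation
        with torch.no_grad():
            for i, batch in enumerate(val_loader):
                batch = {key: value.to(device) for key, value in batch.items()}
                output = model(**batch)
                running_eval_loss += output.loss
                # Metrics are computed with NumPy and scikit-learn, so move tensors to the CPU
                f1, precision, recall, accuracy = compute_metrics(output.logits.cpu(), batch['labels'].cpu())
                running_eval_f1 += f1
                running_eval_precision += precision
                running_eval_recall += recall
                running_eval_accuracy += accuracy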

        avg_eval_loss = running_eval_loss / len(val_loader)
        avg_eval_f1 = running_eval_f1 / len(val_loader)
        avg_eval_precision = running_eval_precision / len(val_loader)
        avg_eval_recall = running_eval_recall / len(val_loader)
        avg_eval_accuracy = running_eval_accuracy / len(val_loader)

        print(f"Avg validation loss ==>: {float(avg_eval_loss)}, F1 Score ==> {float(avg_eval_f1)}, Precision ==> {float(avg_eval_precision)}, Recall ==> {float(avg_eval_recall)}, Accuracy ===> {float(avg_eval_accuracy)}")
        writer.add_scalar("Eval Loss", avg_eval_loss, epoch + 1)
        writer.add_scalar("F1 Score", avg_eval_f1, epoch + 1)
        writer.add_scalar("Precision", avg_eval_precision, epoch + 1)
        writer.add_scalar("Recall", avg_eval_recall, epoch + 1)
        writer.add_scalar("Accuracy", avg_eval_accuracy, epoch + 1)
        writer.flush()

        # Report the validation loss to Ray Tune after every epoch
        train.report({"loss": avg_eval_loss.item()})

    print("Finished Training")

def create_search_space():
    return {
        "lr": tune.loguniform(1e-4, 1e-1),
        "r": tune.choice([2, 4, 6, 8, 10, 16]), 
        "target_modules": tune.choice([["q_lin"], ["v_lin"], ["q_lin", "v_lin"]]), 
        "lora_dropout": tune.uniform(0.1, 0.5), 
    }

# batch_size needs a default to follow sample_pct; 16 is an assumed value, adjust to suit needs
def main(data, sample_pct=0.5, batch_size=16, max_num_epochs=10, num_samples=5):
    model = data[0]
    train_data = data[1]
    val_data = data[2]

    config = create_search_space()

    scheduler = ASHAScheduler(
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)

    tuner = Tuner(
        with_resources(
            tune.with_parameters(train_model, model=model, train_data=train_data, 
                            val_data=val_data, sample_pct=sample_pct, batch_size=batch_size),
            resources={"cpu": 2}),
        tune_config=TuneConfig(
            metric="loss",
            mode="min",
            scheduler=scheduler,
            num_samples=num_samples,
        ),
        param_space=config,
        
    )
    results = tuner.fit()
    
    best_result = results.get_best_result("loss", "min", filter_nan_and_inf=True)

    print("Best trial config: {}".format(best_result.config))
    print("Best trial final validation loss: {}".format(
        best_result.metrics["loss"]))

# tune.with_parameters already places these objects in the Ray object store,
# so the model and datasets are passed directly rather than via ray.put
main((model, train_data_tokenized, val_data_tokenized))

(train_model pid=4656) Epoch 1
(train_model pid=4633) loss at batch 10 = 5.138177904817793
(train_model pid=4634) Epoch 1 [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(train_model pid=4656) loss at batch 20 = 1.369561609468962 [repeated 5x across cluster]
(train_model pid=4656) loss at batch 30 = 1.2327737828780865 [repeated 5x across cluster]

Trial status: 5 RUNNING
Current time: 2024-01-28 21:30:01. Total running time: 30s
Logical resource usage: 10.0/40 CPUs, 0/0 GPUs
+--------------------------------------------------------------------------------------------+
| Trial name                status              lr     r   target_modules       lora_dropout |
+--------------------------------------------------------------------------------------------+
| train_model_5270c_00000   RUNNING    0.0120706       8   ['q_lin']                0.353063 |
| train_model_5270c_00001   RUNNING    0.0164629      10   ['q_lin']                0.125637 |
| train_model_5270c_00002   RUNNING    0.000325552     4   ['v_lin']                0.35453  |
| train_model_5270c_00003   RUNNING    0.00854054     10   ['v_lin']                0.473791 |
| train_model_5270c_00004   RUNNING    0.000217799     4   ['q_lin']                0.357543 |
+--------------------------------------------------------------------------------------------+
(train_model pid=4632) loss at batch 40 = 1.2112806278925676 [repeated 5x across cluster]
(train_model pid=4632) loss at batch 50 = 1.1326625201166893 [repeated 5x across cluster]
(train_model pid=4634) loss at batch 60 = 1.0870521836361642 [repeated 5x across cluster]
(train_model pid=4634) loss at batch 70 = 1.084065737499707 [repeated 5x across cluster]
(train_model pid=4634) loss at batch 80 = 1.0654285089879096 [repeated 5x across cluster]
(train_model pid=4634) loss at batch 90 = 1.057629942893982 [repeated 5x across cluster]
Trial status: 5 RUNNING
Current time: 2024-01-28 21:30:31. Total running time: 1min 0s
Logical resource usage: 10.0/40 CPUs, 0/0 GPUs

Best trial config: {'lr': 0.0003255517115863258, 'r': 4, 'target_modules': ['v_lin'], 'lora_dropout': 0.35452975368914386}
Best trial final validation loss: 0.3918921649456024

The code above does the following:

Creates a data loader function that enables sampling a subset of the data, which helps run the example on small data when resources are limited
Defines the metrics we want to track during training
Defines a training function that takes a config sampled from the hyperparameter space and runs model training for NUM_EPOCHS
Defines a hyperparameter tuning function where we set the scheduler, parameter space, resources, etc.

Feel free to play around with the parameter space to get a better performance depending on your use case.

Conclusion

In this blog post, we unpacked the relevant theoretical fundamentals in the LoRA paper and went through a practical implementation using Hugging Face and Ray for hyperparameter tuning.

I hope you enjoyed the article…

References

LoRA: Low-Rank Adaptation of Large Language Models (arxiv.org)

Generative AI with Large Language Models (www.coursera.org)

https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html#tune-pytorch-cifar-ref