Fig 1: LoRA

How it works

LoRA keeps the pretrained weights frozen and learns the update through a low-rank decomposition: two small matrices, commonly called A and B, whose product has the same shape as the weight being adapted. The rank r of this decomposition is configurable; more on this in the Setup section below.

Below is the forward function of the PEFT LoRA layer, with only the important parts extracted. The full code can be found in lora.py in the peft repository.

class ...:
    def forward(self, x):
        ...
        # forward pass through the frozen pretrained weights
        result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)

        # cast x to the dtype of the LoRA weights
        x = x.to(self.lora_A[self.active_adapter].weight.dtype)

        # add the low-rank update: B(A(dropout(x))) * scaling
        result += (
            self.lora_B[self.active_adapter](
                self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
            )
            * self.scaling[self.active_adapter]
        )
        ...
        return result

Here is a pseudo-code breakdown of the above:

  1. pretrained_result = multiply x with the frozen pretrained weights

  2. Apply dropout to x

  3. Multiply x by lora_A, then by lora_B

  4. lora_result_after_scaling = lora_result multiplied by scaling

  5. final_result = pretrained_result + lora_result_after_scaling

Here, scaling = lora_alpha / r and represents the weightage of the LoRA weights relative to the frozen pretrained output.
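
To make the arithmetic concrete, here is a minimal standalone sketch of the same computation in plain PyTorch. The tensor names and sizes are illustrative only and not part of the peft API; dropout is omitted for brevity.

import torch
import torch.nn.functional as F

d_in, d_out, r, alpha = 16, 16, 2, 64
x = torch.randn(1, d_in)

W = torch.randn(d_out, d_in)     # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01  # lora_A: projects down to rank r
B = torch.zeros(d_out, r)        # lora_B: projects back up, initialised to zero

scaling = alpha / r
pretrained_result = F.linear(x, W)                        # step 1
lora_result = F.linear(F.linear(x, A), B)                 # steps 2-3 (dropout omitted)
final_result = pretrained_result + lora_result * scaling  # steps 4-5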

Setup

There are two main packages: loralib by Microsoft and peft by Hugging Face. For transformers models, the Hugging Face package is recommended for easier usage. They can be installed via pip install loralib or pip install peft respectively.

Model loading

from transformers import AutoModel
from peft import LoraConfig, PeftModel, get_peft_model

# Load any model from the transformers package
model = AutoModel.from_pretrained('llama2') # this is just an example

# Prepare a peft config
peft_config = LoraConfig(
    inference_mode=False,
    r=1,
    lora_alpha=64,
    lora_dropout=0.15,
    # target_modules=['linear'] # usually not needed, see the Common issues section below
    )

# Get the peft model - two equivalent options
p_model = get_peft_model(model, peft_config, adapter_name=ADAPTER_NAME)
# or
p_model = PeftModel(model, peft_config, adapter_name=ADAPTER_NAME)
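
After wrapping, it is worth checking that only a small fraction of the parameters is trainable; PeftModel provides print_trainable_parameters() for this.

p_model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...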

Model training

from transformers import Trainer, TrainingArguments # or Seq2SeqTrainer / Seq2SeqTrainingArguments

# for the full list of arguments, refer to the Hugging Face Trainer documentation
args = TrainingArguments(output_dir='outputs')
# pass the peft model here, along with the usual datasets and data collator
trainer = Trainer(model=p_model, args=args)

trainer.train()
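
For reference, here is a minimal end-to-end sketch that runs as written; gpt2 and the toy two-sentence dataset are purely illustrative stand-ins, to be replaced with your own model, tokenizer and dataset.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # small model used only so the sketch runs quickly
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
base_model = AutoModelForCausalLM.from_pretrained(model_name)

peft_config = LoraConfig(r=1, lora_alpha=64, lora_dropout=0.15, task_type="CAUSAL_LM")
p_model = get_peft_model(base_model, peft_config)

# toy dataset - replace with a real tokenized dataset
ds = Dataset.from_dict({"text": ["hello world", "low-rank adapters are small"]}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=16),
    remove_columns=["text"],
)

args = TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=p_model, args=args, train_dataset=ds,
                  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()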

Saving and loading

When saving, peft stores only the adapter weights, so the full model is not saved. In my testing, layers that were modified via the model config, such as a changed classifier head, are saved as well.

p_model.save_pretrained(ADAPTER_PATH)

Loading requires the original base model to be loaded first; the adapter weights are then applied on top of it.

# Load any model from the transformers package
model = AutoModel.from_pretrained('llama2') # same as above

p_model = PeftModel.from_pretrained(model, ADAPTER_PATH, adapter_name=ADAPTER_NAME)
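
A self-contained round trip, using gpt2 as a stand-in base model, illustrates both points: only the adapter files end up on disk, and the adapter re-attaches to a freshly loaded base model.

import os, tempfile
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
p_model = get_peft_model(base, LoraConfig(r=1, lora_alpha=64, task_type="CAUSAL_LM"))

with tempfile.TemporaryDirectory() as adapter_path:
    p_model.save_pretrained(adapter_path)       # writes only the adapter weights
    print(os.listdir(adapter_path))             # adapter_config.json + adapter weights
    fresh_base = AutoModelForCausalLM.from_pretrained("gpt2")
    reloaded = PeftModel.from_pretrained(fresh_base, adapter_path, adapter_name="default")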

Common issues

  1. Target modules not found

    Usually, peft is able to find the appropriate attention layers on its own. In some cases, such as when using a custom model, the layer names need to be defined explicitly. The mapping peft uses internally is shown below.

    TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING = {
        "t5": ["q", "v"],
        "mt5": ["q", "v"],
        "bart": ["q_proj", "v_proj"],
        "gpt2": ["c_attn"],
        "bloom": ["query_key_value"],
        "blip-2": ["q", "v", "q_proj", "v_proj"],
        "opt": ["q_proj", "v_proj"],
        "gptj": ["q_proj", "v_proj"],
        "gpt_neox": ["query_key_value"],
        "gpt_neo": ["q_proj", "v_proj"],
        "bert": ["query", "value"],
        "roberta": ["query", "value"],
        "xlm-roberta": ["query", "value"],
        "electra": ["query", "value"],
        "deberta-v2": ["query_proj", "value_proj"],
        "deberta": ["in_proj"],
        "layoutlm": ["query", "value"],
        "llama": ["q_proj", "v_proj"],
        "chatglm": ["query_key_value"],
        "gpt_bigcode": ["c_attn"],
        "mpt": ["Wqkv"],
        "RefinedWebModel": ["query_key_value"],
        "RefinedWeb": ["query_key_value"],
        "falcon": ["query_key_value"],
        "btlm": ["c_proj", "c_attn"],
        "codegen": ["qkv_proj"],
    }
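
    If your architecture is not in this mapping, a quick way to find suitable names is to list the model's modules and pass the matching ones explicitly. The names below are examples only, not a recommendation.

    import torch

    # inspect the layer names of the loaded base model
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            print(name)

    # then declare the matching names in the config
    peft_config = LoraConfig(
        r=1,
        lora_alpha=64,
        lora_dropout=0.15,
        target_modules=["q_proj", "v_proj"],  # example names; use the ones printed above
        )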

QLoRA - Extension

Setup

For QLoRA, an additional package, bitsandbytes, is required.

pip install bitsandbytes

Usage

8-bit with Hugging Face

# Load any model from the transformers package in 8-bit
from transformers import AutoModel
from peft import (LoraConfig, PeftModel,
                  get_peft_model, prepare_model_for_kbit_training)

model = AutoModel.from_pretrained('llama2',
                                  load_in_8bit=True)

4-bit with a bitsandbytes configuration

This was tested on a free Colab instance as a proof of concept.

import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
# a sharded bf16 / fp16 checkpoint is necessary, otherwise disk storage will not be enough
model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id,
                                              quantization_config=bnb_config,
                                              device_map={"": 0}
                                              )
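
As a quick sanity check, the memory footprint of the quantized model can be printed; get_memory_footprint() is a standard transformers model method, and the 4-bit model should report a far smaller footprint than a full-precision load.

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")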

Convert to peft model

# Prepare a peft config
peft_config = LoraConfig(
    inference_mode=False,
    r=1,
    lora_alpha=64,
    lora_dropout=0.15,
    # target_modules=['linear'] # usually not needed, see the Common issues section above
    )
# here an additional step is required for quantized models
model = prepare_model_for_kbit_training(model)
# Get the peft model - two equivalent options
p_model = get_peft_model(model, peft_config, adapter_name=ADAPTER_NAME)
# or
p_model = PeftModel(model, peft_config, adapter_name=ADAPTER_NAME)

Training, saving and loading

The process is the same as for default LoRA.

Advanced methods

  1. Selective layers

    LoraConfig allows LoRA to be applied to individual layers, so it may be possible to train only a subset of them, for example layers 0-4 instead of all layers. I have not seen any research on this. Here is an implementation:

     peft_config = LoraConfig(
     ...
     layers_to_transform=[0, 1, 2, 3, 4]
     )
    
  2. Custom modules

    This could be an excellent tool for reducing the trainable layers even further, but it requires more research on the representations of q, k and v. Note that some models use a fused attention projection (a single qkv layer) instead of separate ones.

     peft_config = LoraConfig(
     ...
     target_modules=['q_proj']
     )
    

Notebooks

Personal notebook: FLAN-T5 XXL with Colab
Guide from Hugging Face: Notebook