How it works
LoRA keeps the pretrained weights frozen and instead trains a pair of decomposed matrices, commonly called A and B, which act as the adapter weights. The decomposed matrices have a configurable rank r. More on this in the sections below.
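As a rough sketch of the idea (this is not the peft code, and the sizes and rank below are arbitrary examples), a frozen weight matrix W of shape (d_out, d_in) receives a trainable low-rank update B @ A, which has far fewer parameters than W itself:
import torch

d_in, d_out, r = 4096, 4096, 8
W = torch.randn(d_out, d_in)          # frozen pretrained weight
A = torch.randn(r, d_in)              # trainable, rank r
B = torch.zeros(d_out, r)             # trainable, initialised to zero so training starts from W

full_params = W.numel()               # ~16.8M parameters
lora_params = A.numel() + B.numel()   # ~65.5K parameters, roughly 0.4% of W

x = torch.randn(2, d_in)
y = x @ (W + B @ A).T                 # forward pass with the adapted weight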
Below is the forward function of the peft LoRA layer; only the important parts have been extracted. The full code can be found in lora.py.
class ...:
    def forward(self, x):
        ...
        # forward pass through the frozen pretrained weights
        result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        # match x's dtype to the adapter weights
        x = x.to(self.lora_A[self.active_adapter].weight.dtype)
        # add the scaled LoRA update: lora_B(lora_A(dropout(x))) * scaling
        result += (
            self.lora_B[self.active_adapter](
                self.lora_A[self.active_adapter](self.lora_dropout[self.active_adapter](x))
            )
            * self.scaling[self.active_adapter]
        )
        ...
        return result
Here is a pseudo-code breakdown of the above:
- pretrained_result = multiply x with the pretrained weights
- Apply dropout to x first
- Multiply x with lora_A, then lora_B
- lora_result_after_scaling = lora_result multiplied by scaling
- final_result = add pretrained_result and lora_result_after_scaling

scaling = lora_alpha / r, and represents the weightage of the LoRA weights.
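The same steps in plain PyTorch, as a sketch that mirrors the pseudo-code above (not the library code; the layer sizes are arbitrary):
import torch

r, lora_alpha = 8, 16
scaling = lora_alpha / r                       # weightage of the LoRA weights

pretrained = torch.nn.Linear(512, 512)         # frozen pretrained layer
lora_A = torch.nn.Linear(512, r, bias=False)   # trainable down-projection
lora_B = torch.nn.Linear(r, 512, bias=False)   # trainable up-projection
dropout = torch.nn.Dropout(p=0.15)

x = torch.randn(4, 512)
pretrained_result = pretrained(x)              # multiply x with pretrained weights
lora_result = lora_B(lora_A(dropout(x)))       # dropout -> lora_A -> lora_B
final_result = pretrained_result + lora_result * scaling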
Setup
There are two main packages: loralib by Microsoft and peft by Hugging Face. For transformers models, the Hugging Face package is recommended as it is easier to use. They can be installed via pip install loralib or pip install peft.
Model loading
# Load any model from the transformers package
from transformers import AutoModel
from peft import LoraConfig, PeftModel, get_peft_model

model = AutoModel.from_pretrained('llama2')  # this is just an example

# Prepare a peft config
peft_config = LoraConfig(
    inference_mode=False,
    r=1,
    lora_alpha=64,
    lora_dropout=0.15,
    # target_modules=['linear']  # this is commonly not needed, see below
)

# get the peft model -- two equivalent options
ADAPTER_NAME = 'default'  # any string label for this adapter
p_model = get_peft_model(model, peft_config, ADAPTER_NAME)
# or
p_model = PeftModel(model, peft_config, ADAPTER_NAME)
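A quick sanity check after wrapping is to print the trainable parameter count; only the adapter weights should be trainable. print_trainable_parameters is part of the peft API.
# prints something like "trainable params: ... || all params: ... || trainable%: ..."
p_model.print_trainable_parameters()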
Model training
from transformers import Trainer, TrainingArguments  # or Seq2SeqTrainer

# for the other arguments, refer to the Hugging Face Trainer documentation
args = TrainingArguments(output_dir='outputs')  # example output directory

# pass the peft model to the trainer, plus datasets, collator, etc. as usual
trainer = Trainer(model=p_model, args=args)
trainer.train()
Saving and loading
When saving, peft saves only the adapter weights, so the full model is not written to disk. I have also tested that modifying the model config, such as changing the classifier head, saves the relevant layers as well.
p_model.save_pretrained(ADAPTER_PATH)
Loading requires the original model to be loaded, then the adapter weights can be applied.
# Load any model from the transformers package
model = AutoModel.from_pretrained('llama2') # same as above
p_model = PeftModel.from_pretrained(model, ADAPTER_PATH, adapter_name=ADAPTER_NAME)
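For deployment, the adapter can optionally be merged back into the base weights so inference runs without any extra LoRA computation. merge_and_unload is part of the peft LoRA API; the output path below is just a placeholder.
# fold the LoRA weights into the base model and drop the peft wrappers
merged_model = p_model.merge_and_unload()
merged_model.save_pretrained('merged-model')  # saved as a plain transformers checkpoint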
Common issues
- Target modules not found
Usually, peft is able to find the appropriate attention layers on its own using the mapping below. In some cases, such as when using a custom model, the names of the layers need to be defined explicitly (see the example after the mapping).
TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING = {
"t5": ["q", "v"],
"mt5": ["q", "v"],
"bart": ["q_proj", "v_proj"],
"gpt2": ["c_attn"],
"bloom": ["query_key_value"],
"blip-2": ["q", "v", "q_proj", "v_proj"],
"opt": ["q_proj", "v_proj"],
"gptj": ["q_proj", "v_proj"],
"gpt_neox": ["query_key_value"],
"gpt_neo": ["q_proj", "v_proj"],
"bert": ["query", "value"],
"roberta": ["query", "value"],
"xlm-roberta": ["query", "value"],
"electra": ["query", "value"],
"deberta-v2": ["query_proj", "value_proj"],
"deberta": ["in_proj"],
"layoutlm": ["query", "value"],
"llama": ["q_proj", "v_proj"],
"chatglm": ["query_key_value"],
"gpt_bigcode": ["c_attn"],
"mpt": ["Wqkv"],
"RefinedWebModel": ["query_key_value"],
"RefinedWeb": ["query_key_value"],
"falcon": ["query_key_value"],
"btlm": ["c_proj", "c_attn"],
"codegen": ["qkv_proj"],
}
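If a model is not in this mapping, or uses different layer names, the target modules can be set explicitly in the config. The names below are only an example and must match the module names in your model.
peft_config = LoraConfig(
    inference_mode=False,
    r=1,
    lora_alpha=64,
    lora_dropout=0.15,
    target_modules=['query', 'value'],  # must match your model's attention projection names
)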
QLoRA - Extension
Setup
For QLoRA, an additional package, bitsandbytes, is required.
pip install bitsandbytes
Usage
8-bit with Hugging Face
# Load any model from the transformers package in 8-bit
from transformers import AutoModel
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training

model = AutoModel.from_pretrained('llama2', load_in_8bit=True)
4-bit with a bitsandbytes configuration
This was tested on a free Colab instance as a proof of concept.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# a sharded bf16 / fp16 checkpoint is necessary, else disk storage will not be enough
model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)
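To verify that quantization took effect, the model's memory footprint can be checked. get_memory_footprint is a transformers utility; the exact number depends on the model.
# footprint in GB; 4-bit loading should be roughly a quarter of fp16
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")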
Convert to peft model
# Prepare a peft config
peft_config = LoraConfig(
    inference_mode=False,
    r=1,
    lora_alpha=64,
    lora_dropout=0.15,
    # target_modules=['linear']  # commonly not needed, see Common issues above
)

# here an additional step is required before wrapping
model = prepare_model_for_kbit_training(model)

# get the peft model -- two equivalent options
p_model = get_peft_model(model, peft_config, ADAPTER_NAME)
# or
p_model = PeftModel(model, peft_config, ADAPTER_NAME)
Training, saving and loading
The process is the same as for the default LoRA.
Advanced methods
- Selective layers
LoraConfig allows individual layers to be selected for training. It may be possible to train only a subset of layers, for example layers 0-4 instead of all of them, but I have not seen any research on this. Here is an implementation:
peft_config = LoraConfig( ... layers_to_transform=[0, 1, 2, 3, 4] )
- Custom modules
Targeting specific modules could be an excellent tool to reduce the trainable layers further, but it requires more research on the representations of q, k and v. Note that some models use a fused attention layer; a way to inspect the module names is shown after this list.
peft_config = LoraConfig( ... target_modules=['q_proj'] )
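One way to find the right names to target, especially for models with fused attention, is to list the model's linear modules. This is plain PyTorch inspection, not a peft API.
import torch

# print candidate layer names; pick the attention projections from this list
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)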
Notebooks
Personal notebook: FLAN-T5 XXL with Colab
Guide from Hugging Face: Notebook