

📚 Compressing Large Language Models (LLMs)



Make LLMs 10X smaller without sacrificing performance

This article is part of a larger series on using large language models (LLMs) in practice. While the immense scale of LLMs is responsible for their impressive performance across a wide range of use cases, this presents challenges in their application to real-world problems. In this article, I discuss how we can overcome these challenges by compressing LLMs. I start with a high-level overview of key concepts and then walk through a concrete example with Python code.

Image from Canva.

The AI mantra of 2023 was "Bigger is Better," where the equation for improving language models was pretty simple: more data + more parameters + more compute = better performance [1].

While this is likely still the case (GPT-5 coming soon?), there are obvious challenges to working with 100B+ parameter models. For example, a 100B parameter model stored in FP16 (2 bytes per parameter) requires 200GB just for storage!

Needless to say, most consumer devices (e.g. phones, tablets, laptops) can’t handle models this big. But what if we could make them smaller?

Model Compression

Model compression aims to reduce the size of machine learning models without sacrificing performance [2]. This works for (big) neural networks because they are often over-parameterized (i.e. consist of redundant computational units) [3].

The key benefit of model compression is lower inference costs. This means wider accessibility of powerful ML models (i.e. running LLMs locally on your laptop), lower-cost integration of AI into consumer products, and on-device inference, which supports user privacy and security [3].

3 Ways to Compress Models

There is a wide range of techniques for model compression. Here, I will focus on 3 broad categories.

  • Quantization — Representing models with lower precision data types
  • Pruning — Removing unnecessary components from a model
  • Knowledge Distillation — Training a smaller model using a bigger one

Note: these approaches are independent of one another. Thus, techniques from multiple categories can be combined for maximum compression!

1) Quantization

While quantization might sound like a scary and sophisticated word, it is a simple idea. It consists of lowering the precision of model parameters. You can think of this as converting a high-resolution image to a lower-resolution one while still maintaining the picture’s core properties.

Quantization analogy. Image by author.
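To make this concrete, here is a minimal sketch (not part of the original example code) of symmetric 8-bit quantization applied to a single weight tensor in PyTorch. The toy tensor and the scale computation are illustrative assumptions.

import torch

# toy FP32 "weight" tensor (illustrative values)
w = torch.randn(4, 4)

# symmetric INT8 quantization: map [-max|w|, +max|w|] onto [-127, 127]
scale = w.abs().max() / 127
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# dequantize to approximate the original values for computation
w_dequant = w_int8.float() * scale

print("max absolute error:", (w - w_dequant).abs().max().item())

Each value is now stored in 1 byte instead of 4, at the cost of a small rounding error.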

Two common classes of quantization techniques are Post-training Quantization (PTQ) and Quantization-Aware Training (QAT).

1.1) Post-training Quantization (PTQ)

Given a trained neural network, Post-training Quantization (PTQ) compresses the model by replacing its parameters with a lower-precision data type (e.g. FP16 to INT8). This is one of the fastest and simplest ways to reduce a model’s computational requirements because it requires no additional training or data labeling [4].

While this is a relatively easy way to cut model costs, excessive quantization of this kind (e.g. FP16 to INT4) often leads to performance degradation, which limits the potential gains of PTQ [3].
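As a rough illustration of PTQ, the sketch below applies PyTorch’s dynamic quantization to a toy model, storing the Linear weights as INT8 with no retraining. The toy model is an assumption for illustration, not the classifier used later in this article.

import torch
import torch.nn as nn

# toy FP32 model standing in for a trained network
model_fp32 = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# post-training dynamic quantization: Linear weights stored as INT8
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

print(model_int8)  # Linear layers are replaced by dynamically quantized versions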

1.2) Quantization-Aware Training (QAT)

For situations where greater compression is needed, PTQ's limitations can be overcome by training models (from scratch) with lower-precision data types. This is the idea behind Quantization-Aware Training (QAT) [5].

While this approach is more technically demanding, it can lead to a significantly smaller, well-performing model. For instance, the BitNet architecture used a ternary data type (i.e. 1.58-bit) to match the performance of the original Llama LLM [6]!

Of course, a large technical gap exists between PTQ and from-scratch QAT. An approach between the two is Quantization-aware Fine-tuning, which consists of additional training of a pre-trained model after quantization [3].
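To give a feel for what “quantization-aware” means during training, here is a minimal sketch of fake quantization with a straight-through estimator, the trick that lets gradients flow through the rounding step. The 8-bit scheme and the fake_quantize helper are my own illustrative assumptions, not a specific QAT library API.

import torch

def fake_quantize(w, num_bits=8):
    # simulate integer quantization in the forward pass (illustrative helper)
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp((w / scale).round(), -qmax, qmax) * scale
    # straight-through estimator: gradients flow as if no rounding happened
    return w + (w_q - w).detach()

w = torch.randn(3, 3, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()   # gradients still reach the full-precision weights
print(w.grad)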

QLoRA — How to Fine-Tune an LLM on a Single GPU

2) Pruning

The goal of pruning is to remove model components that have little impact on performance [7]. This is effective because ML models (especially large ones) tend to learn redundant and noisy structures [3].

An analogy is clipping dead branches from a tree: removing them reduces the tree’s size without harming it.

Pruning analogy. Image by author.

Pruning approaches can be categorized into two buckets: Unstructured and Structured Pruning.

2.1) Unstructured Pruning

Unstructured pruning removes unimportant weights from a neural network (i.e. setting them to zero). For example, early works such as Optimal Brain Damage and Optimal Brain Surgeon computed a saliency score for each parameter in the network by estimating the impact of pruning it on the loss function [7].

More recently, magnitude-based approaches (i.e. removing weights with the smallest absolute value) have become more popular due to their simplicity and scalability [7].
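As a minimal illustration (assuming a toy Linear layer rather than a full LLM), PyTorch’s pruning utilities can zero out the smallest-magnitude weights in exactly this way.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# zero out the 50% of weights with the smallest absolute value (L1 magnitude)
prune.l1_unstructured(layer, name="weight", amount=0.5)

# fraction of weights that are now exactly zero
sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2f}")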

While the granularity of unstructured pruning can significantly decrease parameter count, these gains require specialized hardware to be realized [7]. Unstructured pruning results in sparse matrix operations (i.e. multiplying matrices with lots of zeros), which standard hardware cannot execute more efficiently than dense operations.

2.2) Structured Pruning

Alternatively, structured pruning removes entire structures from a neural network (e.g. attention heads, neurons, and layers) [5]. This avoids the sparse matrix operation problem because entire matrices can be dropped from the model rather than individual parameters.

While there are various ways to identify structures for pruning, in principle, they all seek to remove structures with the smallest impact on performance. A survey of structured pruning approaches is available in reference [5].
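For example (again assuming a toy Linear layer for illustration), PyTorch’s structured pruning utility can zero out entire rows of a weight matrix, i.e. whole neurons, based on their norm.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# remove entire output neurons: prune 25% of rows with the smallest L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# count fully zeroed rows (each corresponds to a removable neuron)
zero_rows = (layer.weight.abs().sum(dim=1) == 0).sum().item()
print(f"pruned neurons: {zero_rows} / {layer.weight.shape[0]}")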

3) Knowledge Distillation

Knowledge Distillation transfers knowledge from a (larger) teacher model to a (smaller) student model [5]. One way to do this is to generate predictions with a teacher model and use them to train a student model. Learning from the teacher model’s output logits (i.e., probabilities for all possible next tokens) provides richer information than the original training data, which improves student model performance [8].

More recent distillation applications drop the need for logits altogether and instead learn from synthetic data generated by the teacher model. A popular example is Stanford’s Alpaca model, which fine-tuned the LLaMA 7B (foundation) model using synthetic data from OpenAI’s text-davinci-003 (a GPT-3.5 model), enabling it to follow user instructions [9].

Example code: Compressing a Text Classifier with Knowledge Distillation + Quantization

With a basic understanding of various compression techniques, let’s see a hands-on example of how to do this in Python. Here, we will compress a 100M parameter model that classifies URLs as safe or unsafe (i.e. phishing).

We first use knowledge distillation to compress the 100M parameter model into a 50M parameter one. Then, using 4-bit quantization, we further reduce the memory footprint by 3X, resulting in a final model that is 7X smaller than the original.

The example code is available on GitHub. The models (Teacher, Student, Student-4bit) and dataset are freely available on the Hugging Face Hub.

We start by importing a few helpful libraries.

from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DistilBertForSequenceClassification, DistilBertConfig

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

Then, we load our dataset from the Hugging Face Hub. This includes training (2100 rows), testing (450 rows), and validation (450 rows) sets.

data = load_dataset("shawhin/phishing-site-classification")

Next, we load our teacher model. To help speed up training, I loaded the model onto a T4 GPU that was freely available on Google Colab.

# use Nvidia GPU
device = torch.device('cuda')

# Load teacher model and tokenizer
model_path = "shawhin/bert-phishing-classifier_teacher"

tokenizer = AutoTokenizer.from_pretrained(model_path)
teacher_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

The teacher model is a fine-tuned version of Google’s bert-base-uncased that performs binary classification on phishing website URLs. The code to train the teacher model is available on GitHub.

For the student model, we initialize a new model from scratch based on distilbert-base-uncased. We modify the architecture by removing two layers and four attention heads from the remaining layers.

# Load student model
my_config = DistilBertConfig(n_heads=8, n_layers=4) # drop 4 heads per layer and 2 layers

student_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    config=my_config,
).to(device)

Before we can train our student model, we will need to tokenize the dataset. This is important because the models expect input text to be represented in a particular way.

Here, I pad and truncate each example to a fixed maximum length so that every batch can be represented as a PyTorch tensor.

# define text preprocessing
def preprocess_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)

# tokenize all dataset splits
tokenized_data = data.map(preprocess_function, batched=True)
tokenized_data.set_format(type='torch',
                          columns=['input_ids', 'attention_mask', 'labels'])

Another important step before training is defining an evaluation strategy for our models. Below, I define a function that computes the accuracy, precision, recall, and F1 score for a given model and dataloader.

# Function to evaluate model performance
def evaluate_model(model, dataloader, device):
    model.eval()  # Set model to evaluation mode
    all_preds = []
    all_labels = []

    # Disable gradient calculations
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass to get logits
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # Get predictions
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())

    # Calculate evaluation metrics
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels,
                                                               all_preds,
                                                               average='binary')

    return accuracy, precision, recall, f1

Now, we are ready to begin the training process. To allow our student model to learn from both the ground truth labels in the training set (i.e., hard targets) and the teacher model’s logits (i.e., soft targets), we must construct a special loss function that considers both targets.

This is done by combining the KL divergence between the student’s and teacher’s output probability distributions with the cross-entropy loss between the student’s logits and the ground-truth labels.

# Function to compute distillation and hard-label loss
def distillation_loss(student_logits, teacher_logits,
                      true_labels, temperature, alpha):
    # Compute soft targets from teacher logits
    soft_targets = nn.functional.softmax(teacher_logits / temperature, dim=1)
    student_soft = nn.functional.log_softmax(student_logits / temperature, dim=1)

    # KL Divergence loss for distillation
    distill_loss = nn.functional.kl_div(student_soft,
                                        soft_targets,
                                        reduction='batchmean') * (temperature ** 2)

    # Cross-entropy loss for hard labels
    hard_loss = nn.CrossEntropyLoss()(student_logits, true_labels)

    # Combine losses
    loss = alpha * distill_loss + (1.0 - alpha) * hard_loss

    return loss

Next, we define our hyperparameters, optimizer, and train/test datasets.

# hyperparameters
batch_size = 32
lr = 1e-4 #5e-5
num_epochs = 5
temperature = 2.0
alpha = 0.5

# define optimizer
optimizer = optim.Adam(student_model.parameters(), lr=lr)

# create training data loader
dataloader = DataLoader(tokenized_data['train'], batch_size=batch_size)
# create testing data loader
test_dataloader = DataLoader(tokenized_data['test'], batch_size=batch_size)

Finally, we train our student model using PyTorch.

# put student model in train mode
student_model.train()

# train model
for epoch in range(num_epochs):
    for batch in dataloader:
        # Prepare inputs
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Disable gradient calculation for teacher model
        with torch.no_grad():
            teacher_outputs = teacher_model(input_ids,
                                            attention_mask=attention_mask)
            teacher_logits = teacher_outputs.logits

        # Forward pass through the student model
        student_outputs = student_model(input_ids,
                                        attention_mask=attention_mask)
        student_logits = student_outputs.logits

        # Compute the distillation loss
        loss = distillation_loss(student_logits, teacher_logits, labels,
                                 temperature, alpha)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"Epoch {epoch + 1} completed with loss: {loss.item()}")

    # Evaluate the teacher model
    teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(
        teacher_model, test_dataloader, device)
    print(f"Teacher (test) - Accuracy: {teacher_accuracy:.4f}, "
          f"Precision: {teacher_precision:.4f}, "
          f"Recall: {teacher_recall:.4f}, "
          f"F1 Score: {teacher_f1:.4f}")

    # Evaluate the student model
    student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(
        student_model, test_dataloader, device)
    print(f"Student (test) - Accuracy: {student_accuracy:.4f}, "
          f"Precision: {student_precision:.4f}, "
          f"Recall: {student_recall:.4f}, "
          f"F1 Score: {student_f1:.4f}")
    print("\n")

    # put student model back into train mode
    student_model.train()

The training results are shown in the screenshot below. Remarkably, by the end of training, the student model outperformed the teacher across all evaluation metrics!

Knowledge distillation training results. Image by author.

As a final step, we can evaluate the models on the independent validation set, i.e., data not used in training model parameters or tuning hyperparameters.

# create validation data loader
validation_dataloader = DataLoader(tokenized_data['validation'], batch_size=8)

# Evaluate the teacher model
teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(
    teacher_model, validation_dataloader, device)
print(f"Teacher (validation) - Accuracy: {teacher_accuracy:.4f}, "
      f"Precision: {teacher_precision:.4f}, "
      f"Recall: {teacher_recall:.4f}, "
      f"F1 Score: {teacher_f1:.4f}")

# Evaluate the student model
student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(
    student_model, validation_dataloader, device)
print(f"Student (validation) - Accuracy: {student_accuracy:.4f}, "
      f"Precision: {student_precision:.4f}, "
      f"Recall: {student_recall:.4f}, "
      f"F1 Score: {student_f1:.4f}")

Here, again, we see the student outperform the teacher.

Model performances on the validation set. Image by author.

So far, we’ve reduced our model from 109M parameters (438 MB) to 52.8M parameters (211 MB). However, we can go one step further and quantize the student model.
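As a quick sanity check on these numbers, here is a small sketch (my own helper, assuming FP32 storage at 4 bytes per parameter) that estimates parameter counts and sizes directly from the models defined above.

def model_size(model, bytes_per_param=4):
    # total parameter count and approximate storage at the given precision
    n_params = sum(p.numel() for p in model.parameters())
    return n_params, n_params * bytes_per_param / 1e6  # params, size in MB

for name, m in [("teacher", teacher_model), ("student", student_model)]:
    n, mb = model_size(m)
    print(f"{name}: {n / 1e6:.1f}M parameters, ~{mb:.0f} MB in FP32")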

First, we push the model to the Hugging Face Hub.

student_model.push_to_hub("shawhin/bert-phishing-classifier_student")

Then, we can load it back in using 4-bit quantization. For that, we can use the BitsAndBytes integration in the transformers library.

We set up the config to store model parameters using the 4-bit NormalFloat data type described in the QLoRA paper, with bfloat16 for computation [10].

from transformers import BitsAndBytesConfig

# config to load the model with 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# load the student model we just pushed to the Hub
model_id = "shawhin/bert-phishing-classifier_student"
model_nf4 = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    device_map=device,
    quantization_config=nf4_config
)

We can then evaluate our quantized model on the validation set.

# Evaluate the quantized student model
quantized_accuracy, quantized_precision, quantized_recall, quantized_f1 = evaluate_model(
    model_nf4, validation_dataloader, device)

print("Post-quantization Performance")
print(f"Accuracy: {quantized_accuracy:.4f}, "
      f"Precision: {quantized_precision:.4f}, "
      f"Recall: {quantized_recall:.4f}, "
      f"F1 Score: {quantized_f1:.4f}")

Student model performance on the validation set after quantization. Image by author.

Once again, we see a small performance improvement after compression. An intuitive explanation comes from Occam’s Razor, which suggests that, all else being equal, simpler models are preferable.

In this case, the model may be overparameterized for this binary classification task. Thus, simplifying the model results in better performance.

Recap

While modern large language models (LLMs) demonstrate impressive performance on various tasks, their scale presents challenges in deploying them in real-world settings.

Recent innovations in model compression techniques help mitigate these challenges by reducing the computational cost of LLM solutions. Here, we discussed three broad categories of compression techniques (Quantization, Pruning, and Knowledge Distillation) and walked through an example implementation in Python.

More on LLMs 👇

Large Language Models (LLMs)

My website: https://www.shawhintalebi.com/

[1] Scaling Laws for Neural Language Models

[2] A Survey of Model Compression and Acceleration for Deep Neural Networks

[3] Model Compression in Practice: Lessons Learned from Practitioners Creating On-device Machine Learning Experiences

[4] Model Compression for Deep Neural Networks: A Survey

[5] A Survey on Model Compression for Large Language Models

[6] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

[7] To prune, or not to prune: exploring the efficacy of pruning for model compression

[8] Distilling the Knowledge in a Neural Network

[9] Alpaca: A Strong, Replicable Instruction-Following Model

[10] QLoRA: Efficient Finetuning of Quantized LLMs


Compressing Large Language Models (LLMs) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
