
Local deployment of Llama 3.2 3B vs. GPT-4o: A deep dive into fine-tuning for specialized content creation

In our previous article, we explored how fine-tuning OpenAI's GPT-4o model enabled the generation of engaging, journalist-like UEFA Champions League match reports from play-by-play commentary. Building on that foundation, we set out to replicate these capabilities with a locally deployed Llama 3.2 model, with the goal of assessing its performance on our specific application both before and after fine-tuning.

For this post, we’ll use Ollama to locally deploy a pre-trained Llama 3.2 3B model and generate a report for the UEFA Champions League final, using the match’s play-by-play commentary as input. We’ll then use the Unsloth library to perform parameter-efficient fine-tuning (PEFT) of the model with the Low-Rank Adaptation (LoRA) technique. Finally, we’ll compare the results of both models and discuss different aspects of the fine-tuning process, such as hardware requirements and the most relevant hyperparameters.

Hardware used

Deploying and fine-tuning a large language model (LLM) like Llama-3.2 demands substantial computational resources. For these experiments, we performed the model deployment and fine-tuning on a machine with the following specifications:

  • GPU: NVIDIA RTX 3090 with 24 GB VRAM
  • RAM: 36 GB
  • CPU: Intel Core i5-14600K

Llama 3.2

Llama 3.2 is a collection of multilingual pretrained and instruction-tuned generative LLMs in 1B and 3B sizes. The instruction-tuned, text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks, and this family of models supports a 128k-token context length.

Llama 3.2 was pretrained on up to 9 trillion tokens of data from publicly available sources. For the 1B and 3B models, logits from the Llama 3.1 8B and 70B models were incorporated into the pretraining stage, with the larger models’ outputs used as token-level targets, and knowledge distillation was applied after pruning to recover performance. Post-training followed a recipe similar to Llama 3.1’s, producing the final chat models through several rounds of alignment on top of the pretrained model.

Llama is designed with fine-tuning in mind, making it highly adaptable to specific use cases. The model can be fine-tuned on domain-specific datasets, allowing users to customize its performance for specialized tasks. This capability is particularly beneficial for those looking to implement an LLM in resource-constrained real-world applications, where general-purpose language models might fall short in terms of accuracy or relevance. Fine-tuning Llama 3.2 allows users to train the model on a relatively small dataset, achieving high-quality results tailored to their needs.

Local deployment with Ollama

Ollama is an open-source tool designed to provide a user-friendly platform for running LLMs directly on local systems. It bridges the gap between the complexities of advanced LLM technology and the need for an accessible, customizable AI experience.

Ollama makes it easy to download, install, and use a variety of LLMs, allowing users to experiment with these models without requiring deep technical knowledge or depending on cloud infrastructure. Some examples of models that can be deployed through Ollama include Llama 3.2, Phi 3, Mistral, and Gemma 2, giving users the flexibility to explore cutting-edge models locally with ease and control.

Installing Ollama and deploying Llama 3.2 3B to be accessed through the CLI on our Linux device requires only two commands:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Download the model (if not already present) and start an interactive session
ollama run llama3.2:3b

After Ollama is correctly installed, using its API from Python is also very simple:

pip install ollama

import ollama

# Send a chat request to the locally served Llama 3.2 3B model
response = ollama.chat(model='llama3.2:3b',
                       messages=[{'role': 'user', 'content': 'Why is the sky blue?'}])
print(response['message']['content'])

Deploying Llama 3.2 3B with Ollama as shown in this section required 3.7 GB of VRAM.
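
For our use case, the same API call can carry the match context and play-by-play commentary. Below is a minimal sketch of how the pre-trained model can be prompted for a report; the file name and exact prompt wording are illustrative, not the ones used in our experiments:

import ollama

# Hypothetical file holding the UCL final play-by-play commentary
with open('ucl_final_commentary.txt') as f:
    commentary = f.read()

messages = [
    {'role': 'system',
     'content': 'You are a football journalist who writes engaging match reports '
                'based on play-by-play commentary.'},
    {'role': 'user',
     'content': 'UEFA Champions League 2023-24 final, Borussia Dortmund vs Real Madrid:\n'
                + commentary},
]

response = ollama.chat(model='llama3.2:3b', messages=messages)
print(response['message']['content'])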

Parameter-Efficient Fine-Tuning (PEFT)

LLMs built on transformer architectures have achieved leading performance across a range of natural language processing tasks. The common approach to obtaining these models involves large-scale pretraining on vast amounts of data, followed by fine-tuning for specific downstream tasks. This fine-tuning significantly improves performance over using the pretrained models directly.

However, as models grow larger, fully fine-tuning them becomes impractical for consumer-level hardware, and storing a separate fine-tuned model for each task is resource-intensive, as they are the same size as the original model. This is where PEFT methods come into play to address both computational and storage challenges.

PEFT methods allow only a small subset of parameters to be fine-tuned, while the majority of the pretrained model remains frozen. This approach dramatically reduces the resources needed for both training and storage. Additionally, PEFT mitigates the problem of catastrophic forgetting, which can occur in full fine-tuning. PEFT proves especially useful in low-data situations and can generalize better to unseen domains. It enables portability by producing small, task-specific weight updates that can be added to the pretrained model without needing to replace the entire model, allowing efficient handling of multiple tasks with minimal storage requirements.

LoRA: Low-Rank Adaptation

LoRA is a PEFT technique which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each target layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks with no additional inference latency.

Instead of modifying the entire set of weights in the model, LoRA decomposes the weight update into two low-rank matrices, denoted as A and B. These matrices have much smaller dimensions than the original weight matrix, reducing the number of trainable parameters significantly. As displayed in Figure 1, matrix A projects the input into a lower-dimensional space, reducing the number of trainable parameters, while matrix B projects it back into the original space, allowing the model to adjust its predictions. The rank of these matrices controls their size and determines how much information is compressed during these projections. Together, they allow for efficient fine-tuning by training far fewer parameters than traditional methods, while keeping the majority of the pretrained model weights frozen.

Figure 1: Diagram showing low rank reparametrization matrices A and B.
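
Conceptually, for a frozen weight matrix W of shape (d_out, d_in), LoRA learns an update ΔW = B·A, where A has shape (r, d_in) and B has shape (d_out, r), scaled by a factor alpha/r. A minimal NumPy sketch of this idea (dimensions chosen only for illustration):

import numpy as np

d_in, d_out, r, alpha = 3072, 3072, 16, 16

W = np.random.randn(d_out, d_in)      # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01   # trainable down-projection to rank r
B = np.zeros((d_out, r))              # trainable up-projection, starts at zero so ΔW = 0

def lora_forward(x):
    # Original frozen path plus the scaled low-rank update
    return W @ x + (alpha / r) * (B @ (A @ x))

y = lora_forward(np.random.randn(d_in))

# Trainable parameters per adapted matrix: r * (d_in + d_out) instead of d_in * d_out
print(r * (d_in + d_out), 'vs', d_in * d_out)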

Important LoRA Parameters:

  • Rank (r): Determines the size of the adaptation matrices. It can be any positive integer; commonly suggested values are 8, 16, 32, 64, and 128.
  • target_modules: The LLM modules to which LoRA is applied. For Llama 3.2 3B the following modules were selected: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

Unsloth library

To facilitate the fine-tuning process, we utilize the Unsloth library, which offers several advantages over the standard Hugging Face Transformers library:

  • Optimized for Speed: Faster training times due to efficient memory management and computation.
  • Support for Quantization: Seamlessly integrates 4-bit quantization methods. Quantization reduces the precision of the model's weights, significantly lowering memory consumption without drastically affecting performance.
  • Advanced PEFT Features: Provides built-in support for LoRA and other PEFT techniques.
  • User-Friendly API: Simplifies the fine-tuning workflow with intuitive functions.

By using Unsloth and following the provided Colab notebook as a guide, we were able to streamline the fine-tuning process and achieve better performance on our hardware.
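
The following sketch condenses the main steps, based on the Unsloth Colab notebook; the exact hyperparameter values (sequence length, learning rate, LoRA alpha) are illustrative rather than the precise ones from our runs, and dataset refers to the formatted dataset described in the next sections:

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load Llama 3.2 3B Instruct with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Supervised fine-tuning on the formatted "text" field of the dataset
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=8192,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        num_train_epochs=10,
        learning_rate=2e-4,
        optim="adamw_8bit",
        logging_steps=1,
        output_dir="outputs",
    ),
)
trainer.train()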

Dataset Formatting and Preparation

The stored match report dataset adheres to OpenAI's dataset format guidelines: it is structured as a JSONL file containing a list of message exchanges formatted as conversations, where each message has a defined role (system, user, assistant) and content. Each conversation is a JSON object, and consecutive objects are separated by line breaks.

Every JSONL entry contains a structured dialogue sequence in which the "messages" list includes interactions between the system, which sets the scenario (a football journalist who writes engaging reports based on football match play-by-play commentary), and the user, who provides the play-by-play commentary preceded by a brief context of the match being played (e.g. the first leg of the Champions League semifinal between Borussia Dortmund and Paris Saint-Germain). The assistant message then contains the corresponding match report written by a sports journalist.
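
An abridged, illustrative entry looks like the following; the commentary and report text are shortened and paraphrased, not quoted verbatim from the dataset, and the object is shown across multiple lines for readability, whereas in the JSONL file each conversation occupies a single line:

{"messages": [
  {"role": "system", "content": "You are a football journalist who writes engaging match reports based on play-by-play commentary."},
  {"role": "user", "content": "First leg of the Champions League semifinal between Borussia Dortmund and Paris Saint-Germain.\n\n1' Kick-off at Signal Iduna Park...\n36' GOAL! Füllkrug puts Dortmund ahead...\n90+4' Full-time: Dortmund 1-0 PSG."},
  {"role": "assistant", "content": "Niclas Füllkrug's first-half strike gave Borussia Dortmund a narrow 1-0 lead over Paris Saint-Germain in the first leg of their Champions League semifinal..."}
]}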

More details about the JSONL dataset content can be found in our previous article.

Converting Dataset Format

Fine-tuning Llama 3.2 requires adapting our match report dataset to the model's expected input format. To convert the existing dataset, we used a custom method that takes the path to the JSONL file and the Llama 3.2 tokenizer as inputs. The code for this conversion can be found in this GitHub Gist.

The converted dataset is ready to be used with Hugging Face TRL's SFTTrainer: the returned text follows the Llama 3.1 prompt format, which adds the special tokens that structure model interactions (a sketch of the conversion is shown after this list):

  • <|begin_of_text|>: Marks the start of the input text.
  • <|start_header_id|>system<|end_header_id|>: Denotes the system's message, which sets the context or role.
  • <|eot_id|>: Marks the end of a section or a segment of the conversation (used multiple times to separate sections).
  • <|start_header_id|>user<|end_header_id|>: Signals the start of the user's input, which in this case is the play-by-play commentary for the football match.
  • <|start_header_id|>assistant<|end_header_id|>: Indicates the assistant’s response, where the model generates the match report based on the user’s input.
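
The Gist contains the exact implementation; a minimal sketch of the same idea, using the tokenizer's chat template to add these special tokens (the function and file names here are illustrative), could look like this:

import json
from datasets import Dataset

def jsonl_to_llama_dataset(jsonl_path, tokenizer):
    """Convert an OpenAI-style JSONL chat dataset into a 'text' column
    formatted with the Llama 3.1 special tokens."""
    texts = []
    with open(jsonl_path) as f:
        for line in f:
            messages = json.loads(line)["messages"]
            # apply_chat_template inserts <|begin_of_text|>, the header tokens and <|eot_id|>
            texts.append(tokenizer.apply_chat_template(messages, tokenize=False))
    return Dataset.from_dict({"text": texts})

dataset = jsonl_to_llama_dataset("match_reports.jsonl", tokenizer)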

Pre-trained Llama 3.2 output

The following figures show the outputs of the pre-trained Llama 3.2 3B model and GPT-4o-2024-08-06 for the same prompt containing the UEFA Champions League 2023-24 final play-by-play commentary, as described in the previous sections:

Figure 2: Llama-3.2 3B output
Figure 3: GPT-4o-2024-08-06 output

Out of the box, the pre-trained Llama 3.2 3B model produces subpar match reports. The model doesn’t seem to understand that it’s supposed to report on the whole match; instead, it takes the role of a sports commentator narrating what happened up to half time. This narration offers a concise, straightforward account that mentions specific events seemingly picked at random. The generated text lacks narrative cohesion, resulting in a report that feels disjointed, and the language is simple, lacking the expressive flair typical of professional sports journalism and providing minimal insight into the significance of the events described.

In contrast, GPT-4o-2024-08-06 delivers a rich and compelling narrative of the entire match. It sets the scene with an engaging introduction, highlights key moments with vivid descriptions, and provides context about players' performances and tactical decisions. The language is dynamic and captures the excitement of the match, employing expressive phrases that draw the reader in. It concludes with reflections on the implications of the result for both teams, embodying the depth and style of professional sports journalism.

Training Process and Results

During the training process, the Llama 3.2 3B model was fine-tuned using 4-bit quantization and the default 16-bit precision, batch sizes of 1 and 2, and LoRA ranks of 16 and 32 to evaluate performance in terms of training time, memory consumption, and checkpoint size. These results, obtained by training for 10 epochs on a dataset with 28 entries, are summarized in the following table:

The 4-bit quantization with a batch size of 1 and LoRA rank 16 achieved the lowest VRAM usage at 6.332 GB, though it had the longest training time at 372 seconds. In contrast, the 16-bit configuration with rank 32 produced a larger checkpoint of 185.6 MB and increased VRAM usage to 13.049 GB, with a faster training time of 311 seconds. Training times of around 30 seconds per epoch are a notable improvement over the roughly 140 seconds per epoch of GPT-4o.

Overall, the 4-bit quantization configurations were more resource-efficient, with lower memory footprints. However, the 16-bit configurations with higher ranks, such as Rank 32, traded increased memory and storage requirements for potentially better model performance. 

The checkpoint size remained consistent at 92.8 MB for LoRA rank 16 but doubled to 185.6 MB for rank 32. This doubling is expected: increasing the rank of the adaptation matrices proportionally increases the number of trainable parameters and, with it, the checkpoint size. Specifically, a LoRA rank of 16 trains 24,313,856 parameters, while rank 32 trains 48,627,712.
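
These counts can be reproduced from the Llama 3.2 3B architecture (hidden size 3072, intermediate size 8192, 8 key/value heads of dimension 128, 28 layers): each targeted projection of shape (d_out, d_in) contributes r·(d_in + d_out) LoRA parameters.

# LoRA parameter count for Llama 3.2 3B with the targeted projection modules
hidden, intermediate, kv_dim, layers, r = 3072, 8192, 1024, 28, 16

shapes = {                                # (d_out, d_in)
    "q_proj": (hidden, hidden),
    "k_proj": (kv_dim, hidden),           # 8 KV heads x head dim 128 = 1024
    "v_proj": (kv_dim, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (intermediate, hidden),
    "up_proj": (intermediate, hidden),
    "down_proj": (hidden, intermediate),
}

per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes.values())
print(per_layer * layers)                 # 24,313,856 for r=16; doubling r doubles the count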

Loss Metrics

Throughout training, across all configurations, the training loss starts at about 2.3 and falls to around 1.2 after 10 epochs.

This reduction indicates that the model effectively learned from the dataset and improved its ability to generate match reports that align with the style of the train dataset.

Fine-tuned Llama 3.2 output

The following image shows the output of a fine-tuning run with batch size 1, LoRA rank 16 and 4-bit quantization:

Figure 4: UCL Final report generated by the fine-tuned Llama 3.2 3B model. Green highlights indicate correct context added by the model, red highlights incorrect context, and yellow highlights represent fabricated quotes from players and coaches.

The fine-tuned model output shows noticeable transformations:

  • Correct report format: The output reads like a journalistic article reporting on the entire match, unlike the previous output, which generated only a partial, half-time-like commentary.

  • Contextual depth: The model attempted to include historical context, player backgrounds, and implications of the match outcome, but most of these details were incorrect, as highlighted in Figure 4. Some of them barely make sense, e.g. “The match was the first final to feature teams that started the campaign in different groups, with Madrid and Dortmund initially in different parts of the table.”

  • Quotes from players and managers: Mirroring the reports in the training dataset, the model generated quotes from players and managers (highlighted in yellow). However, all of these quotes are hallucinations; while some make sense within the match context, others are unrealistic. For example, "I'd been waiting a long time for a title, but I didn't want it at the expense of the title won next year," Real captain Sola Ruíz said. "We want to add more titles." Here the quote makes little sense, and the player it is attributed to is not even part of the lineup.

Overall, the fine-tuned model grasped, at a high level, the format, style, and content required for the match report, but it mostly fails to generate accurate, well-written content that is not already present in the input commentary.

The output of the fine-tuned GPT-4o model for the same UCL final match commentary input can be seen here.

Inference Time

Generating a full match report with the fine-tuned model took approximately 13 seconds. This duration is acceptable for most applications and demonstrates the practicality of deploying the fine-tuned model in real-world scenarios.
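
For reference, inference with a saved LoRA checkpoint can be run directly through Unsloth. A minimal sketch follows; the checkpoint path, prompt wording, and generation length are illustrative, and commentary holds the play-by-play text loaded beforehand:

import time
from unsloth import FastLanguageModel

# Load the fine-tuned LoRA checkpoint (path is illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-280",
    max_seq_length=8192,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)   # enable Unsloth's faster inference mode

messages = [
    {"role": "system", "content": "You are a football journalist who writes engaging match reports based on play-by-play commentary."},
    {"role": "user", "content": "UEFA Champions League 2023-24 final, Borussia Dortmund vs Real Madrid:\n" + commentary},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

start = time.perf_counter()
output = model.generate(input_ids=input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
print(f"Generation took {time.perf_counter() - start:.1f} s")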

Conclusions

Fine-tuning Llama 3.2 3B using LoRA and the Unsloth library significantly enhanced the model's ability to generate UEFA Champions League match reports. This project demonstrates that, with the right techniques and tools, open-source models can be tailored to meet specific needs, offering a cost-effective alternative to proprietary models like GPT-4o.

By leveraging PEFT and quantization, the fine-tuning process of an LLM is accessible even on consumer-grade hardware. The improvements in output quality highlight the effectiveness of fine-tuning in expanding the capabilities of LLMs for specialized applications.

Unfortunately, while Llama 3.2 3B fine-tuning works well for adapting the model’s output to a target style and format, the difference between a fine-tuned GPT-4o and Llama 3.2 is very noticeable when it comes to generating unseen content. This means that the scope of a Llama application requires careful planning, and the fine-tuning process demands more expertise in hyperparameter tuning to get the most out of the model; even then, it will probably remain less accurate than a fine-tuned GPT-4o model, which can be easily trained through OpenAI’s platform.

However, having full control over the model, with no reliance on external servers or third-party infrastructure, still makes local deployment and fine-tuning of Llama 3.2 3B a strong choice. The enhanced data security and reduced latency are key benefits for applications with data privacy requirements or limited cloud access. Additionally, the ability to fine-tune on consumer-grade hardware with significantly faster training times than GPT-4o encourages extended experimentation and model adaptation, opening the door to production-grade solutions that work within the model’s limitations.
