With OpenAI’s recent release of a fine-tuning platform for GPT-4o, generating engaging, journalist-style football match reports is more achievable than ever. In this post, we explore the process of fine-tuning GPT-4o to create authentic UEFA Champions League reports from play-by-play commentary.
The primary motivation behind this project was to explore the capabilities, limitations, requirements and pricing of fine-tuning OpenAI's GPT models. Specifically, the aim was to enhance GPT-4o's ability to produce match reports that not only recount the most relevant events but also capture the essence and context of the match, much like a human sports journalist would.
The default GPT-4o model, while powerful, tended to generate match narrations that strictly followed the play-by-play inputs, lacking the contextual depth and storytelling flair characteristic of professional journalism. The goal was to have the AI produce reports that:
- recount the most relevant events of the match;
- frame those events with context and stakes; and
- read with the narrative flair of a professional sports journalist.
Fine-tuning a large language model (LLM) and using prompt engineering are two different approaches to modifying how an AI model generates responses. Fine-tuning involves retraining the model on a specific dataset, allowing it to learn new patterns, styles, or domains. This process alters the parameters of the model, which means the changes are baked into how the model processes input going forward.
On the other hand, prompt engineering manipulates the input prompts to obtain the desired responses without altering the model's parameters. By crafting prompts carefully, users can nudge the model to generate responses that better fit their needs. However, prompt engineering is often less reliable when complex changes in style or deep domain-specific expertise are required.
Prompt engineering is best suited for situations where quick adjustments or minor changes are needed without modifying the underlying model. For instance, if you're working on generating responses with a slightly different tone or encouraging the model to focus on specific details, prompt engineering can be a cost-effective and efficient solution. It’s especially useful for tasks that don’t require high consistency or deep domain-specific knowledge, such as ad-hoc content generation, general-purpose Q&A systems, or modifying how the model structures its responses.
On the other hand, fine-tuning is the go-to option when more significant, consistent changes are required. If your use case demands content generated in a specific format or style, such as technical documentation, legal text, or highly specialized fields like medical reports, fine-tuning allows the model to learn these patterns directly from the dataset. Fine-tuning is also crucial when the model needs to consistently follow a complex set of guidelines that prompt engineering alone cannot achieve. A good practice is to start with prompt engineering, and if it falls short of your goals, fine-tuning becomes the logical next step.
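To make that starting point concrete, here is a minimal sketch of the prompt-engineering approach for this use case: steering the base model's style purely through the prompt, with no retraining involved. The system prompt wording below is illustrative, not the exact one used later for fine-tuning.

```python
# Prompt engineering: nudge the base model toward a journalistic style
# using only the system prompt. No model parameters are changed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a veteran football journalist. Open with the story "
                "of the match, weave in context and stakes, and keep the "
                "language vivid but concise."
            ),
        },
        # The play-by-play commentary would be pasted here.
        {"role": "user", "content": "<play-by-play commentary>"},
    ],
)
print(response.choices[0].message.content)
```

If the output still drifts back toward a dry, event-by-event recap, as it did in this project, that is the signal to move on to fine-tuning.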
To train the model, the match reports and play-by-play commentary were obtained by browsing ESPN's UEFA Champions League section for the specific dates of the target matches. The dataset consisted of the 2023-24 season round of 16, quarterfinals, and semifinals: 16, 8 and 4 games respectively across both legs, for a total of 28 games. The championship final was held out for validation.
Each match entry in the dataset requires both the play-by-play commentary and a journalist's report of the match. The play-by-play commentary consists of timestamps followed by brief descriptions of the events at each moment. ESPN’s website provides commentary for either every event in a match or just the key moments. For our dataset, we used the full commentary. The following image shows an example of the key events from the second leg of the semifinal match between Borussia Dortmund and Paris Saint-Germain:
The writing style of the journalistic reports is direct and focused, providing a clear summary of the match's key actors, moments, stakes, and context. Some reports use straightforward, neutral English, while others lean on British English colloquialisms. The text combines action-packed descriptions with quotes from players and coaches. The reports usually open with a brief summary of the match, followed by an in-depth chronological account of the events, using concise language that keeps the reader engaged.
Adhering to OpenAI's dataset format guidelines, the data was structured as a JSONL file: a list of message exchanges formatted as conversations, where each message has a defined role (system, user, or assistant) and content. Each conversation is a JSON object on its own line, with line breaks separating the objects.
For this fine-tuning, every JSONL entry contains a structured dialogue in its "messages" list: the system message sets the scenario (a football journalist who writes engaging reports based on football match play-by-play commentary), the user message provides the play-by-play commentary preceded by brief context for the match being played (e.g., first leg of the Champions League semifinal between Borussia Dortmund and Paris Saint-Germain), and the assistant message contains the report as written by a sports journalist.
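For illustration, a single entry looks roughly like this (pretty-printed here for readability; in the actual file each entry sits on one line). The commentary and report text below are abridged placeholders, not actual dataset content:

```json
{"messages": [
  {"role": "system",
   "content": "You are a football journalist who writes engaging match reports based on play-by-play commentary."},
  {"role": "user",
   "content": "First leg of the Champions League semifinals between Borussia Dortmund and Paris Saint-Germain.\n\n1' - Kickoff at Signal Iduna Park.\n36' - Goal! Füllkrug finishes from close range...\n90'+4' - Full-time."},
  {"role": "assistant",
   "content": "Niclas Füllkrug's first-half strike gave Borussia Dortmund the edge over Paris Saint-Germain..."}
]}
```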
To validate the dataset, ensure format compatibility and count tokens, code snippets from OpenAI's cookbook were used. Token counting is necessary to keep prompts and responses within the model's context window, and it also drives the cost estimate.
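A minimal sketch of that validation and counting step, adapted from the cookbook's approach; the dataset filename is a placeholder, and the per-message token overheads are the cookbook's heuristic rather than exact values:

```python
import json

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    """Approximate token count for one conversation (cookbook heuristic)."""
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    return num_tokens + 3  # every reply is primed with <|start|>assistant<|message|>

with open("matches.jsonl", encoding="utf-8") as f:  # placeholder filename
    dataset = [json.loads(line) for line in f]

# Basic format checks: each entry needs a "messages" list of role/content pairs.
for i, entry in enumerate(dataset):
    assert "messages" in entry, f"entry {i} is missing 'messages'"
    for message in entry["messages"]:
        assert message.get("role") in {"system", "user", "assistant"}, f"bad role in entry {i}"
        assert isinstance(message.get("content"), str), f"bad content in entry {i}"

total = sum(num_tokens_from_messages(e["messages"]) for e in dataset)
print(f"{len(dataset)} examples, ~{total} tokens per epoch")
```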
With the dataset prepared, a fine-tuning job is created using OpenAI's fine-tuning dashboard:
The gpt-4o-2024-08-06 model was fine-tuned with 3 epochs, a batch size of 1 and a learning-rate multiplier of 2. The run reported 255,387 trained tokens, close to the token-counting script's estimate of 265,644 (88,548 tokens per epoch across the 3 epochs).
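Although this post uses the dashboard, the same job can be launched programmatically; a minimal sketch, where the two JSONL filenames are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training set and the held-out final used for validation
training_file = client.files.create(
    file=open("matches.jsonl", "rb"), purpose="fine-tune"
)
validation_file = client.files.create(
    file=open("final.jsonl", "rb"), purpose="fine-tune"
)

# Create the job with the hyperparameters quoted above
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    validation_file=validation_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```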
The entire fine-tuning task took around 7 minutes. The training loss decreased gradually throughout, while the validation loss went down over the first two epochs but started climbing in the third, an early sign of overfitting that suggests the automatic setting of 3 epochs was about right.
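Once the job succeeds, the fine-tuned model can be called through the chat completions API like any other model, which also makes it easy to script the base-versus-tuned comparison described next. A sketch, where the ft: model ID is a placeholder for the one shown on your dashboard:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a football journalist who writes engaging match reports "
    "based on play-by-play commentary."
)
USER_PROMPT = (
    "Champions League final between Borussia Dortmund and Real Madrid.\n\n"
    "<play-by-play commentary>"
)

# Send the same prompt to the base model and the fine-tuned model
for model in ("gpt-4o-2024-08-06", "ft:gpt-4o-2024-08-06:my-org::abc123"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
    )
    print(f"--- {model} ---\n{response.choices[0].message.content}\n")
```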
Using OpenAI's platform playground, we can compare our custom model with the gpt-4o-2024-08-06 model through a feature that allows sending the same prompt to both models and viewing their outputs side-by-side. As with the rest of the dataset, the play-by-play commentary of the championship final was sourced from ESPN's website. Here are the results:
The fine-tuned model output shows noticeable transformations:
To estimate the cost of a fine-tuning run, OpenAI's pricing page is your friend. As of the time of this post, fine-tuning for GPT-4o is free up to a daily limit of 1 million training tokens; any excess is charged at $25 per 1 million tokens. Excluding the free daily limit, a single 3-epoch fine-tuning run over our 28-match dataset totals 255,387 trained tokens and would cost 255,387 × $25/1M ≈ $6.38.
Once a model is fine-tuned, OpenAI will bill you only for the tokens used in requests to that model, at a rate of $3.75 per 1 million input tokens and $15 per 1 million output tokens as of the time of this post. For comparison, the default gpt-4o-2024-08-06 costs $2.50 per 1 million input tokens and $10 per 1 million output tokens, so inference on a fine-tuned model is 50% more expensive.
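For budgeting, that premium is easy to project; a quick back-of-the-envelope helper, where the request volume and per-request token counts are hypothetical:

```python
def monthly_cost(requests, in_tokens, out_tokens, price_in, price_out):
    """Monthly inference cost in USD, with prices given per 1M tokens."""
    return requests * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# Hypothetical volume: 1,000 reports/month, ~8,000 input + ~700 output tokens each
base = monthly_cost(1_000, 8_000, 700, 2.50, 10.00)
tuned = monthly_cost(1_000, 8_000, 700, 3.75, 15.00)
print(f"base: ${base:.2f}/month, fine-tuned: ${tuned:.2f}/month (+{tuned / base - 1:.0%})")
```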
A breakdown of the platform's usage and the billed amounts associated with that use can be found on OpenAI's usage page.
Fine-tuning the GPT-4o model led to a noticeable shift in output style, bringing it much closer to the tone and structure typical of professional journalism. The model successfully adopted the narrative techniques and language styles from the training data, demonstrating how effectively a small, focused dataset of just 28 matches can drive significant transformation.
However, the process also highlighted some limitations. The model occasionally overemphasized certain undesired aspects of the training data, underscoring the need for careful dataset curation. Ensuring that the data is well-structured and free from unintended biases is crucial to achieving consistent, high-quality results.
Lastly, while fine-tuning delivers results that prompt engineering alone cannot achieve, it comes with added costs. It's important to weigh the improved performance against the financial investment, especially when working within budget constraints or in high-volume content generation scenarios.