With OpenAI’s recent release of a fine-tuning platform for GPT-4o, generating engaging, journalist-style football match reports is more achievable than ever. In this post, we explore the process of fine-tuning GPT-4o to create authentic UEFA Champions League reports from play-by-play commentary.
The primary motivation behind this project was to explore the capabilities, limitations, requirements and pricing of fine-tuning OpenAI's GPT models. Specifically, the aim was to enhance GPT-4o's ability to produce match reports that not only recount the most relevant events but also capture the essence and context of the match, much like a human sports journalist would.
The default GPT-4o model, while powerful, tended to generate match narrations that strictly followed the play-by-play inputs, lacking the contextual depth and storytelling flair characteristic of professional journalism. The goal was to have the AI produce reports that:
- recount the most relevant events of the match;
- frame those events with context and stakes; and
- read with the narrative flair of a professional sports journalist.
Fine-tuning a large language model (LLM) and using prompt engineering are two different approaches to modifying how an AI model generates responses. Fine-tuning involves retraining the model on a specific dataset, allowing it to learn new patterns, styles, or domains. This process alters the parameters of the model, which means the changes are baked into how the model processes input going forward.
On the other hand, prompt engineering manipulates the input prompts to obtain the desired responses without altering the model's parameters. By crafting prompts carefully, users can nudge the model to generate responses that better fit their needs. However, prompt engineering is often less reliable when complex changes in style or deep domain-specific expertise are required.
Prompt engineering is best suited for situations where quick adjustments or minor changes are needed without modifying the underlying model. For instance, if you're working on generating responses with a slightly different tone or encouraging the model to focus on specific details, prompt engineering can be a cost-effective and efficient solution. It’s especially useful for tasks that don’t require high consistency or deep domain-specific knowledge, such as ad-hoc content generation, general-purpose Q&A systems, or modifying how the model structures its responses.
On the other hand, fine-tuning is the go-to option when more significant, consistent changes are required. If your use case demands content generated in a specific format or style, such as technical documentation, legal text, or highly specialized fields like medical reports, fine-tuning allows the model to learn these patterns directly from the dataset. Fine-tuning is also crucial when the model needs to consistently follow a complex set of guidelines that prompt engineering alone cannot achieve. A good practice is to start with prompt engineering, and if it falls short of your goals, fine-tuning becomes the logical next step.
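To make that starting point concrete, here is a minimal sketch of the prompt-engineering approach for this use case: steering the base model's style purely through the prompt, with no retraining involved. The system prompt wording below is illustrative, not the exact one used later for fine-tuning.

```python
# Prompt engineering: nudge the base model toward a journalistic style
# using only the system prompt. No model parameters are changed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a veteran football journalist. Open with the story "
                "of the match, weave in context and stakes, and keep the "
                "language vivid but concise."
            ),
        },
        # The play-by-play commentary would be pasted here.
        {"role": "user", "content": "<play-by-play commentary>"},
    ],
)
print(response.choices[0].message.content)
```

If the output still drifts back toward a dry, event-by-event recap, as it did in this project, that is the signal to move on to fine-tuning.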
To train the model, the match reports and play-by-play commentary were obtained by browsing ESPN's UEFA Champions League section for the specific dates of the target matches. The dataset consisted of the 2023-24 season round of 16, quarterfinals, and semifinals: 16, 8 and 4 games respectively across both legs, for a total of 28 games. The championship final was held out for validation.
Each match entry in the dataset requires both the play-by-play commentary and a journalist's report of the match. The play-by-play commentary consists of timestamps followed by brief descriptions of the events at each moment. ESPN’s website provides commentary for either every event in a match or just the key moments. For our dataset, we used the full commentary. The following image shows an example of the key events from the second leg of the semifinal match between Borussia Dortmund and Paris Saint-Germain:
The writing style of the journalistic reports is direct and focused, providing a clear summary of the match's key actors, moments, stakes, and context. Some reports use straightforward, neutral English, while others lean on British English colloquialisms. The text combines action-packed descriptions with quotes from players and coaches. The reports usually open with a brief summary of the match, followed by an in-depth chronological account of the events, using concise language that keeps the reader engaged.
Adhering to OpenAI's dataset format guidelines, the data was structured as a JSONL file: a list of message exchanges formatted as conversations, where each message has a defined role (system, user, or assistant) and content. Each conversation is a JSON object on its own line, with line breaks separating the objects.
For this fine-tuning, every JSONL entry contains a structured dialogue in its "messages" list: the system message sets the scenario (a football journalist who writes engaging reports based on football match play-by-play commentary), the user message provides the play-by-play commentary preceded by brief context for the match being played (e.g., first leg of the Champions League semifinal between Borussia Dortmund and Paris Saint-Germain), and the assistant message contains the report as written by a sports journalist.
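For illustration, a single entry looks roughly like this (pretty-printed here for readability; in the actual file each entry sits on one line). The commentary and report text below are abridged placeholders, not actual dataset content:

```json
{"messages": [
  {"role": "system",
   "content": "You are a football journalist who writes engaging match reports based on play-by-play commentary."},
  {"role": "user",
   "content": "First leg of the Champions League semifinals between Borussia Dortmund and Paris Saint-Germain.\n\n1' - Kickoff at Signal Iduna Park.\n36' - Goal! Füllkrug finishes from close range...\n90'+4' - Full-time."},
  {"role": "assistant",
   "content": "Niclas Füllkrug's first-half strike gave Borussia Dortmund the edge over Paris Saint-Germain..."}
]}
```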
To validate the dataset, ensure format compatibility and count tokens, code snippets from OpenAI's cookbook were used. Token counting is necessary to keep prompts and responses within the model's context window, and it also drives the cost estimate.
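A minimal sketch of that validation and counting step, adapted from the cookbook's approach; the dataset filename is a placeholder, and the per-message token overheads are the cookbook's heuristic rather than exact values:

```python
import json

import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    """Approximate token count for one conversation (cookbook heuristic)."""
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    return num_tokens + 3  # every reply is primed with <|start|>assistant<|message|>

with open("matches.jsonl", encoding="utf-8") as f:  # placeholder filename
    dataset = [json.loads(line) for line in f]

# Basic format checks: each entry needs a "messages" list of role/content pairs.
for i, entry in enumerate(dataset):
    assert "messages" in entry, f"entry {i} is missing 'messages'"
    for message in entry["messages"]:
        assert message.get("role") in {"system", "user", "assistant"}, f"bad role in entry {i}"
        assert isinstance(message.get("content"), str), f"bad content in entry {i}"

total = sum(num_tokens_from_messages(e["messages"]) for e in dataset)
print(f"{len(dataset)} examples, ~{total} tokens per epoch")
```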
With the dataset prepared, a fine-tuning job is created using OpenAI's fine-tuning dashboard:
The gpt-4o-2024-08-06 model was fine-tuned with 3 epochs, a batch size of 1 and a learning-rate multiplier of 2. The run reported 255,387 trained tokens, close to the token-counting script's estimate of 265,644 (88,548 tokens per epoch across the 3 epochs).
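Although this post uses the dashboard, the same job can be launched programmatically; a minimal sketch, where the two JSONL filenames are placeholders:

```python
from openai import OpenAI

client = OpenAI()

# Upload the training set and the held-out final used for validation
training_file = client.files.create(
    file=open("matches.jsonl", "rb"), purpose="fine-tune"
)
validation_file = client.files.create(
    file=open("final.jsonl", "rb"), purpose="fine-tune"
)

# Create the job with the hyperparameters quoted above
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    validation_file=validation_file.id,
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 1,
        "learning_rate_multiplier": 2,
    },
)
print(job.id, job.status)
```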
The entire fine-tuning task took around 7 minutes. The training loss decreased gradually throughout, while the validation loss went down over the first two epochs but started climbing in the third, an early sign of overfitting that suggests the automatic setting of 3 epochs was about right.
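Once the job succeeds, the fine-tuned model can be called through the chat completions API like any other model, which also makes it easy to script the base-versus-tuned comparison described next. A sketch, where the ft: model ID is a placeholder for the one shown on your dashboard:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a football journalist who writes engaging match reports "
    "based on play-by-play commentary."
)
USER_PROMPT = (
    "Champions League final between Borussia Dortmund and Real Madrid.\n\n"
    "<play-by-play commentary>"
)

# Send the same prompt to the base model and the fine-tuned model
for model in ("gpt-4o-2024-08-06", "ft:gpt-4o-2024-08-06:my-org::abc123"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
    )
    print(f"--- {model} ---\n{response.choices[0].message.content}\n")
```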
Using OpenAI's platform playground, we can compare our custom model with the gpt-4o-2024-08-06 model through a feature that allows sending the same prompt to both models and viewing their outputs side-by-side. As with the rest of the dataset, the play-by-play commentary of the championship final was sourced from ESPN's website. Here are the results:
The fine-tuned model output shows noticeable transformations:
To estimate the cost of a fine-tuning run, OpenAI's pricing page is your friend. As of the time of this post, fine-tuning for GPT-4o is free up to a daily limit of 1 million training tokens; any excess is charged at $25 per 1 million tokens. Excluding the free daily limit, a single 3-epoch fine-tuning run over our 28-match dataset totals 255,387 trained tokens and would cost 255,387 × $25/1M ≈ $6.38.
Once a model is fine-tuned, OpenAI will bill you only for the tokens used in requests to that model, at a rate of $3.75 per 1 million input tokens and $15 per 1 million output tokens as of the time of this post. For comparison, the default gpt-4o-2024-08-06 costs $2.50 per 1 million input tokens and $10 per 1 million output tokens, so inference on a fine-tuned model is 50% more expensive.
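For budgeting, that premium is easy to project; a quick back-of-the-envelope helper, where the request volume and per-request token counts are hypothetical:

```python
def monthly_cost(requests, in_tokens, out_tokens, price_in, price_out):
    """Monthly inference cost in USD, with prices given per 1M tokens."""
    return requests * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

# Hypothetical volume: 1,000 reports/month, ~8,000 input + ~700 output tokens each
base = monthly_cost(1_000, 8_000, 700, 2.50, 10.00)
tuned = monthly_cost(1_000, 8_000, 700, 3.75, 15.00)
print(f"base: ${base:.2f}/month, fine-tuned: ${tuned:.2f}/month (+{tuned / base - 1:.0%})")
```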
A breakdown of the platform's usage and the billed amounts associated with that use can be found on OpenAI's usage page.
Fine-tuning the GPT-4o model led to a noticeable shift in output style, bringing it much closer to the tone and structure typical of professional journalism. The model successfully adopted the narrative techniques and language styles from the training data, demonstrating how effectively a small, focused dataset of just 28 matches can drive significant transformation.
However, the process also highlighted some limitations. The model occasionally overemphasized certain undesired aspects of the training data, underscoring the need for careful dataset curation. Ensuring that the data is well-structured and free from unintended biases is crucial to achieving consistent, high-quality results.
Lastly, while fine-tuning delivers results that prompt engineering alone cannot achieve, it comes with added costs. It's important to weigh the improved performance against the financial investment, especially when working within budget constraints or in high-volume content generation scenarios.