Predicting butter prices

Case study assignment for the Certified Data Science Professional course at DIKW Academy

Setting the stage

Last fall (2019) I enrolled in the CDSP class to enrich my data science knowledge. Besides learning lots of theory about data mining, clustering, the semantic web, time series analysis, neural networks and other state-of-the-art techniques, what set CDSP apart for me were the practical exercises that go along with the theory. At the end of the course, all that knowledge comes together in a final case study on a business problem of your own choosing.

I paired up with Mels, a trader in dairy products at Hoogwegt International, to investigate possible predictors for the price of butter. 

In this blog I will explain the process we went through and how applying what we learned during the course to a real-world problem not only enhanced the learning experience but also immediately resulted in a useful business case.

First a little bit about me. I have a background in Business Intelligence. This entails extracting and transforming data, creating business dashboards, interacting with databases, writing the odd Python script and other general data-wrangling. I do not have much knowledge about data science. Also I don’t know much about the in-depth workings of the stock market in general, or trading dairy products in particular. I know my milk comes from a cow and I get it from my local grocery store at a price that seems reasonable to me. 

My team partner Mels, however, knows everything about cows and dairy products: how much milk cows produce at what age, how milk should be stored, all the derivative products you can make from it, which countries are net importers of these products and which are net exporters, what makes a good milk season, and how sunshine in New Zealand influences the price of my chocolate bar in the Netherlands.

The objective

Together we sat down and thought about what would be a good case to apply our newly acquired data knowledge to. Mels had an interesting idea. In trading, there is traditionally a strong correlation between stocks and prices. When the price is high, stocks are low because everyone wants to sell at these high prices. Vice versa, at low prices everyone stocks their product and waits for prices to rise again. In one particular market, however, the butter market, this correlation is not that evident: stocks and prices do not seem correlated at all.

Could we use our new Data Science knowledge to quantify the seemingly low correlation between price and stock in the butter market and find alternative predictors?

Choosing the right technique

In the course we learned about a variety of data science techniques, but which ones were appropriate to apply?

  • General statistics
  • Clustering
  • Decision trees
  • Linear regression
  • Logistic regression
  • Generalized Linear Models 
  • Random Forest
  • Reinforcement learning
  • Text mining
  • Time series and forecasting
  • Model selection techniques

We decided, after some deliberation with our very helpful teachers, to apply four of them. General statistics would be needed anyway to show correlation and significance. We would use time series analysis to find trends and seasonality, and generalized linear models to find predictors for the butter price. Lastly, we would use the Akaike information criterion for model selection, a different but established method for finding predictors.

Quantifying the correlation

This was the easiest one. Applying the theory and calculating the correlation between the butter price series and the stock series in R gave us a Pearson correlation coefficient of 0.19. Since this is nowhere near 0.70, a common rule of thumb for a strong correlation, we could confidently say that the two series are at most weakly correlated.
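For readers who want to see what goes into that single number, here is a minimal sketch of the Pearson correlation coefficient. We did our calculation in R; this is a plain Python illustration, and the `price` and `stock` values are made up for the example, not our real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the common 1/(n-1) factor,
    # which cancels out in the ratio below.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers (NOT the real butter data): prices go up
# while stocks go down, the "traditional" pattern Mels described.
price = [100, 102, 101, 105, 107, 110]
stock = [50, 48, 49, 45, 44, 40]
r = pearson(price, stock)
print(f"Pearson r = {r:.2f}")
```

In this toy example the coefficient comes out strongly negative, which is what the traditional relation between stocks and prices would look like. Our butter data gave nothing of the sort.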

Time series analysis

Next we applied time series decomposition to see if there were seasonal effects. We split the time series into three parts: trend effects, seasonal effects and the remaining effects, grouped together as random effects. The results are shown in the figure below.

The increasing trend line indicates that the average butter price increased in the period shown, which suggests an increasing general demand for butter.

What clearly stands out is the seasonal effect. Mels had a logical explanation for this from a business perspective. Before the festive holidays in December, prices go up since everybody wants butter for cooking and snacks. After the holidays, in January, there is much less demand, since New Year's resolutions rarely entail eating lots of butter. So everyone who has a surplus of butter dumps it at low prices to avoid storage costs, hence the post-December dump prices.

Taking a closer look at the y-axis, however, reveals that the random effects dominate. The trend shows a somewhat steady increase from 160 to 220, and where the seasonal variation lies between approximately -10 and +20, the random variation lies between -30 and +60.
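To make the three parts a bit more tangible, here is a minimal sketch of a classical additive decomposition: a centered moving average for the trend, the average deviation per position in the cycle for the seasonal effect, and whatever is left as the random part. This is an illustration in plain Python under my own assumptions, not the exact routine we ran in R, and the series is synthetic with a period of 4.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal and remainder (additive model)."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centered moving average over one full period.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window has period + 1 points,
            # so the two end points each count for one half.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Seasonal: average detrended value per position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]

    # Remainder: what trend and season do not explain.
    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder


# Synthetic example: a straight-line trend plus a repeating pattern, no noise.
pattern = [3, -1, -4, 2]
series = [10 + 2 * t + pattern[t % 4] for t in range(12)]
trend, seasonal, remainder = decompose_additive(series, period=4)
print(seasonal[:4])
```

Because the synthetic series contains no noise, the remainder is zero wherever the trend is defined. In our butter data it was the other way around: the remainder was the largest of the three parts.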

So effects other than trend and seasonality are affecting the price. The question now is: can we find useful predictors using Mels's knowledge of the dairy business?

Splitting the data in training and validation sets

First, being good students, we made a point of splitting the data into a training set, on which we would train our algorithms, and a validation set, which we would use to validate the outcome by comparing it with the predictions of the algorithms.

We made the split such that we had six years of training data and one year of validation data. Predicting further into the future did not seem necessary, and we had to be careful not to make our training set too small, to prevent overfitting.
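With time series the split has to follow the calendar: the most recent period is held back and everything before it is used for training. A minimal sketch of such a split follows. The helper name `chronological_split` is hypothetical, and the monthly frequency (84 observations for seven years) is an assumption made only for this example.

```python
def chronological_split(observations, n_validation):
    """Hold back the most recent n_validation observations for validation."""
    if not 0 < n_validation < len(observations):
        raise ValueError("n_validation must be between 1 and len(observations) - 1")
    return observations[:-n_validation], observations[-n_validation:]


# Assumption for illustration: seven years of monthly observations.
# The values are just month indices, standing in for real prices.
monthly_observations = list(range(7 * 12))
training, validation = chronological_split(monthly_observations, n_validation=12)
print(len(training), len(validation))
```

Shuffling the rows before splitting, as is common for other kinds of data, would leak information from the future into the training set.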

Finding predictors using Generalized Linear Models

Using Mels's business knowledge we selected 21 variables possibly related to the butter price. Since every business has its secrets I cannot fully disclose what each of them was, but suffice it to say that we selected some time variables and variations of stock indicators, as well as some other possible predictors.

Fitting using a generalized linear model

Fitting all 21 variables to the training data resulted in the left figure below. Using only the variables that were deemed significant, five in total, gave us the figure on the right.

[Figure: fit on the training data using all 21 variables (left) and using only the five significant variables (right)]

The better fit, and thus lower root mean squared error (RMSE), on the left does not necessarily indicate a better model, since a low RMSE on the training data can also be attributed to overfitting: adding more variables never makes the fit on the training data worse, and in the extreme case, with as many variables as data points, you get a perfect fit, no matter whether these variables are in any way logically related to the target variable.
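The extreme case is easy to demonstrate. The sketch below fits five made-up data points with a model that has five free parameters (a polynomial through every point) and with a model that has only one (the average). It is a toy illustration with invented numbers, not our butter model.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def interpolating_polynomial(xs, ys):
    """Return the polynomial that passes exactly through all (x, y) points."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


# Invented data: noisy values around a flat level, plus one later observation.
train_x = [0, 1, 2, 3, 4]
train_y = [10.0, 12.0, 9.0, 11.0, 10.0]
new_x, new_y = 5, 10.5

flexible = interpolating_polynomial(train_x, train_y)  # 5 parameters, 5 points
simple_level = sum(train_y) / len(train_y)             # 1 parameter

train_rmse_flexible = rmse(train_y, [flexible(x) for x in train_x])
train_rmse_simple = rmse(train_y, [simple_level] * len(train_y))
error_flexible = abs(new_y - flexible(new_x))
error_simple = abs(new_y - simple_level)

print(train_rmse_flexible, train_rmse_simple)
print(error_flexible, error_simple)
```

The flexible model has a training RMSE of zero, yet its prediction for the next point is far off, while the one-parameter model is close. This is exactly why a low RMSE on the training data alone says little.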

Fitting using a generalized linear model with L1 regularization

The next step was testing a different technique where, starting from zero, variables are added one at a time to see if the fit improves. We learned this can be done using a generalized linear model with L1 regularization. This comes down to varying the weight of one variable at a time until no further improvement in the fit is obtained, then adding a variable and again varying the weight of each variable until the fit no longer improves. The resulting mean squared error is calculated each time. Below on the left you can see the coefficients entering the model one at a time, starting from zero, until all are added, and on the right the resulting mean squared error. One can see that adding more than the first 12 variables no longer decreases the mean squared error, so 12 variables give the optimum fit.

[Figure: coefficient of each variable as variables are added (left) and the resulting mean squared error (right)]

The resulting fit on our training data is plotted in the figure below. The RMSE in this case is 21.

[Figure: fit of the L1-regularized model on the training data]
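To give an idea of how L1 regularization lets variables enter one at a time, here is a minimal sketch of a lasso fit by coordinate descent on a tiny made-up data set. Everything in it is an assumption for illustration: the function names, the two toy variables and the penalty values. It also assumes centered data without an intercept and no all-zero columns. It is not the routine we used in R.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return exactly 0.0 if it is within reach."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimise (1/2n) * sum of squared errors + penalty * sum(|coef|).

    Assumes centered columns, no intercept and no all-zero columns.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # how strongly variable j explains what the others leave over
            z = 0.0    # squared length of column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * coef[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            coef[j] = soft_threshold(rho / n, penalty) / (z / n)
    return coef


# Toy data: the first variable matters a lot, the second only a little.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [3.0 * x1 + 0.1 * x2 for x1, x2 in X]

# Going from a strong penalty to a weak one, variables enter one by one.
path = {pen: lasso_coordinate_descent(X, y, pen) for pen in (4.0, 1.0, 0.05)}
n_selected = {pen: sum(1 for c in coef if c != 0.0) for pen, coef in path.items()}
print(n_selected)
```

With a strong penalty no variable is used at all. As the penalty is relaxed, the important variable enters first and the weak one last. Picking the penalty at which the mean squared error stops improving is what gives you the number of variables to keep.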

Using Akaike information criterion

Lastly we applied stepwise AIC to find an optimum model. This not only tries to find the best fit by adding one variable at a time but also penalizes the use of too many variables, which increases the risk of overfitting, thus finding a balance between goodness of fit and model complexity. This resulted in 12 variables. The resulting fit gave an RMSE of 12 and is shown below.

[Figure: fit of the model selected by stepwise AIC on the training data]
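The trade-off that AIC makes can be shown with a few lines of arithmetic. For a least-squares fit with normally distributed errors, the AIC equals n * ln(RSS / n) + 2k up to a constant, where RSS is the residual sum of squares, n the number of observations and k the number of fitted parameters. The sketch below uses that form. The two candidate models and their numbers are hypothetical, chosen only to show the penalty at work.

```python
from math import log


def aic_least_squares(rss, n_obs, n_params):
    """AIC (up to an additive constant) of a least-squares fit with Gaussian errors."""
    return n_obs * log(rss / n_obs) + 2 * n_params


# Hypothetical candidates: the larger model fits slightly better (lower RSS)
# but needs seven extra parameters to do so.
n_obs = 72
aic_small = aic_least_squares(rss=10000.0, n_obs=n_obs, n_params=5)
aic_large = aic_least_squares(rss=9900.0, n_obs=n_obs, n_params=12)

preferred = "small" if aic_small < aic_large else "large"
print(f"AIC small: {aic_small:.1f}, AIC large: {aic_large:.1f}, preferred: {preferred}")
```

The lower AIC wins, so here the small model is preferred: the tiny improvement in fit does not justify the extra variables. A stepwise procedure repeats this comparison each time it considers adding or dropping a variable.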

Reviewing the fitted models

Each technique selected slightly different variables for its optimum model, resulting in different fits to the training data. We saw that the model selected using the Akaike information criterion gave the lowest root mean squared error and is thus, on the training data, the best performing model with the best selection of predictor variables.

Sometimes an algorithm is just a black box and one can only evaluate a model by looking at its performance, for example in neural networks, but in this case the variables selected and weights attributed to them by each model can be freely inspected. 

So next we reviewed the variables selected by the models to see if Mels could understand the selection and if it made any business sense. Unfortunately I cannot share this analysis since it contains confidential information. But evaluating the outcome of computer models to see if it also makes logical sense is always an important step.

Validating the models 

Remember we split the data into a training and a validation set? Up to now we had only used the training set. To validate the algorithms and see how well they predict the butter price, we compared their predictions with the validation set. The outcomes are shown below.


[Figure: predictions of the fitted models compared with the validation data]

We now see that although the model selected with AIC scored well on the training data, it performs worse on the validation data. The best model seems to be the generalized linear model with L1 regularization. An explanation for this could be that the pattern in the training set is very different from the one in the validation set. If you look at the split we made in the figure below, we see a flat price development instead of a volatile one just where we put the split.

[Figure: butter price over time with the split between training and validation data]

Since this pattern was not in the training set it is hard to predict it. 

Then again, the hope was that where we cannot see a pattern, an algorithm just might. And maybe the predictions of the algorithms show the same flat pattern, thus still resulting in a good fit.

Is any model good enough to trade on? Well, Mels does not think so. He concluded that none of the selected variables was good enough to predict the butter price and that the market was apparently driven by things other than the selected variables. He was, however, very enthusiastic because he now had a useful and reliable tool to test whether a variable is a good predictor. He no longer has to rely on his gut feeling alone but can back it up with numbers and explain it to his colleagues.


In the Certified Data Science Professional course we learned a lot of techniques to extract information from data. We applied some of these to predict butter prices, which could be used by Mels, a dairy product trader, in his daily work. Although we did not find any clear predictors, we did find a reliable and reproducible technique to check if a variable is a good predictor.

It is important to mention that neither of us, Mels nor I, had any extensive knowledge of data science prior to the course. Yet by applying the newly acquired knowledge directly in the course, and with the help of our teachers, we could investigate a real-world practical question where business value is immediately evident. And we were not alone. Other teams worked at banks and used their final assignment to predict the chance that defined groups of people would be eligible for a mortgage, thus reducing the cost of the intake process. Another team predicted the occupancy of cells in police stations to improve the flow of short-stay inmates.

Of course we could further improve our techniques and evaluate other predictors, or maybe even apply text mining to newspaper articles to predict the butter price. Now, thanks to the course, we can!

I would like to thank our teachers, Hugo Koopmans and Koen de Koning, for their great course and their patience in explaining all the techniques to us!
