# Predicting butter prices

## Setting the stage

Last fall (2019) I enrolled in the CDSP class to enrich my data science knowledge. The course covers lots of theory about data mining, clustering, the semantic web, time series analysis, neural networks and other state-of-the-art techniques, but what set CDSP apart for me were the practical exercises that go along with the theory. At the end of the course all that knowledge comes together in a final case study on a business problem of your own choosing.

I paired up with Mels, a trader in dairy products at Hoogwegt International, to investigate possible predictors for the price of butter.

In this blog I will explain the process we went through, and how applying what we learned during the course to a real-world problem not only enhanced the learning experience but also immediately resulted in a useful business case.

First a little bit about me. I have a background in Business Intelligence. This entails extracting and transforming data, creating business dashboards, interacting with databases, writing the odd Python script and other general data-wrangling. I do not have much knowledge about data science. Also I don’t know much about the in-depth workings of the stock market in general, or trading dairy products in particular. I know my milk comes from a cow and I get it from my local grocery store at a price that seems reasonable to me.

My team partner Mels, however, knows everything about cows and dairy products: how much milk cows produce at what age, how it should be stored, all the derivative products you can make from milk, which countries are net importers of these products and which are net exporters, what makes a good milk season, and how sunshine in New Zealand influences the price of my chocolate bar in the Netherlands.

## The objective

Together we sat down and thought about what would be a good case to apply our newly acquired data knowledge to. Mels had an interesting idea. In trading there is traditionally a strong correlation between stocks (the amount of product held in storage) and prices. When the price is high, stocks are low because everyone wants to sell at these high prices. Vice versa, at low prices everyone stocks their product and waits for prices to rise again. In one particular market however, the butter market, this correlation is not that evident: stocks and prices do not seem correlated at all.

Could we use our new Data Science knowledge to quantify the seemingly low correlation between price and stock in the butter market and find alternative predictors?

## Choosing the right technique

In the course we learned about a variety of data science techniques, but which ones were appropriate to apply here?

- General statistics
- Clustering
- Decision trees
- Linear regression
- Logistic regression
- Generalized Linear Models
- Random Forest
- Reinforcement learning
- Text mining
- Time series and forecasting
- Model selection techniques

After some deliberation with our very helpful teachers, we decided to apply four of them: general statistics to show correlation and significance, time series analysis to find trends and seasonality, and generalized linear models to find predictors for the butter price. Lastly we would use the Akaike information criterion for model selection, giving us a different but established method for finding predictors.

## Quantifying the correlation

This was the easiest one. Calculating the correlation between the butter price series and the stock series in R gave us a Pearson correlation coefficient of 0.19. This is nowhere near 0.70, a common rule of thumb for a strong correlation, so we could confidently say the two series are at best weakly correlated.
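As an illustration, the Pearson coefficient can be computed in a few lines. The sketch below is in Python rather than R, it is not the code from the case study, and the `price` and `stock` values are made up purely to show the call.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration values, NOT the real price and stock series.
price = [160.0, 172.0, 168.0, 181.0, 190.0, 185.0]
stock = [40.0, 35.0, 42.0, 38.0, 41.0, 36.0]
r = pearson_r(price, stock)
print(f"Pearson r = {r:.2f}")
```

A value close to +1 or -1 means the two series move together (or in opposite directions) almost linearly; a value close to 0 means there is little linear relationship.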

## Time series analysis

Next we applied time series decomposition to see if there were seasonal effects. We split the time series into three parts: trend effects, seasonal effects and the remaining effects, grouped together as random effects. The figure below shows the results.

The increasing trend line indicates that the average butter price increased over the period shown, which points to an increasing general demand for butter.

What clearly stands out is the seasonal effect. Mels had a logical explanation for this from a business perspective. Before the festive holidays in December prices go up, since everybody wants butter for cooking and snacks. After the holidays, in January, there is much less demand, since New Year's resolutions rarely entail eating lots of butter. Everyone with a surplus of butter then dumps it at low prices to avoid storage costs, hence the post-December dump prices.

Taking a closer look at the y-axis, however, reveals that the random effects dominate. The trend shows a somewhat steady increase from 160 to 220, and where the seasonal variation lies between approximately -10 and +20, the random variation lies between -30 and +60.
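The split into trend, seasonal and random parts can be sketched as a classical additive decomposition. The post does not say which decomposition was used, so the additive model, the period of 12 (monthly data) and the synthetic series below are all assumptions for illustration.

```python
def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + random."""
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    trend = [None] * n  # undefined at both ends, where the window does not fit
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Centered moving average for an even period: half weight on both ends.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values per position in the cycle (e.g. per month).
    sums = [0.0] * period
    counts = [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += series[t] - trend[t]
            counts[t % period] += 1
    raw = [sums[i] / counts[i] for i in range(period)]
    offset = sum(raw) / period  # center so the seasonal effects sum to zero
    seasonal = [raw[t % period] - offset for t in range(n)]

    # Whatever trend and season do not explain ends up in the random part.
    random_part = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, random_part

# Synthetic series (made up): linear trend plus a repeating yearly pattern.
pattern = [-10, -8, -3, 0, 2, 4, 5, 3, 0, -2, 4, 5]
series = [100 + 2 * t + pattern[t % 12] for t in range(48)]
trend, seasonal, random_part = decompose_additive(series, period=12)
```

Comparing the ranges of the three returned parts is what shows which effect dominates: if the random part spans a wider range than the seasonal part, as in our case, trend and season alone are poor predictors.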

Effects other than trend and seasonality are clearly affecting the price. The question now is: can we find useful predictors using Mels's knowledge of the dairy business?

## Splitting the data into training and validation sets

First, being good students, we made a point of splitting the data into a training set, on which we would train our algorithms, and a validation set, which we would use to check the predictions of the algorithms against data they had not seen.

We made the split such that we had six years of training data and one year of validation data. Predicting further into the future did not seem necessary, and we had to be careful not to make our training set too small, since a small training set increases the risk of overfitting.
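For time series the split has to respect time order: the validation year comes after the training years. A minimal sketch, assuming monthly observations (so seven years is 84 rows) and using plain row indices as stand-ins for the real data:

```python
def chronological_split(rows, validation_size):
    """Split time-ordered rows; the last `validation_size` rows form the validation set."""
    if not 0 < validation_size < len(rows):
        raise ValueError("validation_size must be between 1 and len(rows) - 1")
    return rows[:-validation_size], rows[-validation_size:]

# Assumption: one row per month, seven years in total. The integers are
# placeholders for the real observations.
months = list(range(7 * 12))
train, validation = chronological_split(months, validation_size=12)
print(len(train), "training rows,", len(validation), "validation rows")
```

A random shuffle before splitting, which is common for non-temporal data, would leak future information into the training set and make the validation score look better than it is.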

## Finding predictors using Generalized Linear Models

Using Mels's business knowledge we selected 21 variables possibly related to the butter price. Since every business has its secrets I cannot fully disclose what each of them was, but suffice it to say that we selected some time variables and variations of stock indicators as well as some other possible predictors.

### Fitting using a generalized linear model

Fitting all 21 variables to the training data resulted in the left figure below. Using only the variables that were deemed significant, five in total, gave us the figure on the right.

The better fit, and thus lower root mean squared error (RMSE), on the left does not necessarily mean that model is better, since a low RMSE can also be the result of overfitting. Adding variables never makes the fit on the training data worse, and in the extreme case of as many variables as data points you get a perfect fit, no matter whether these variables are in any way logically related to the target variable.
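The extreme case is easy to demonstrate. The sketch below fits a made-up target with exactly as many pure-noise predictors as there are data points. All numbers are random and unrelated to the case study; `solve_square` is a small helper written for this example.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def solve_square(a, b):
    """Solve a @ w = b for a square matrix (Gauss-Jordan with partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def predict(rows, weights):
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

rng = random.Random(42)
n_points = 8

# Each row is one data point with n_points predictor values of pure noise.
target = [rng.uniform(150, 250) for _ in range(n_points)]
noise_rows = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_points)]

weights = solve_square(noise_rows, target)
in_sample_rmse = rmse(target, predict(noise_rows, weights))

# The same weights applied to fresh noise and a fresh target.
new_target = [rng.uniform(150, 250) for _ in range(n_points)]
new_rows = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_points)]
out_of_sample_rmse = rmse(new_target, predict(new_rows, weights))

print(f"in-sample RMSE:     {in_sample_rmse:.6f}")
print(f"out-of-sample RMSE: {out_of_sample_rmse:.2f}")
```

The in-sample error is zero up to rounding even though the predictors are meaningless, which is exactly why a low training RMSE on its own says little about predictive value.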

### Fitting using a generalized linear model with L1 regularisation

The next step was testing a different technique where, starting from zero, variables are added one at a time to see whether the fit improves. We learned this can be done using a generalized linear model with L1 regularisation. The weight of one variable is varied until the fit no longer improves, then another variable is added and the weights are varied again until the fit stops improving, and so on. The resulting mean squared error is calculated each time. Below on the left you can see each coefficient being varied, starting from zero, until all variables are added, and on the right the resulting mean squared error. One can see that adding more than the first 12 variables no longer decreases the mean squared error, so the optimum fit uses 12 variables.

The resulting fit on our training data is plotted in the figure below. The RMSE in this case is 21.
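To give an idea of what L1 regularisation does, here is a minimal sketch of a lasso fit using coordinate descent. This is one common way to fit such a model, not necessarily what the software in the course does internally, and the data is synthetic: two real predictors and two noise columns.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value towards zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=200):
    """Minimise (1/(2n)) * sum of squared residuals + lam * sum(|w_j|).

    `columns` is a list of predictor columns. Predictors and target are
    centered first, so no intercept term is needed.
    """
    n = len(y)
    y_mean = sum(y) / n
    xc = [[v - sum(col) / n for v in col] for col in columns]
    weights = [0.0] * len(xc)
    residual = [v - y_mean for v in y]  # residual for all-zero weights
    for _ in range(sweeps):
        for j, col in enumerate(xc):
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # a constant column carries no information
            # Covariance of column j with the residual that ignores column j.
            rho = sum(v * (r + v * weights[j]) for v, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                weights[j] = new_w
    return weights

# Synthetic data (made up): y depends on columns 0 and 1 only.
rng = random.Random(0)
n = 60
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [3 * a - 2 * b + rng.gauss(0, 0.1) for a, b in zip(columns[0], columns[1])]

weights = lasso_coordinate_descent(columns, y, lam=0.5)
all_zero = lasso_coordinate_descent(columns, y, lam=1e6)
print("lam=0.5 :", [round(w, 2) for w in weights])
print("lam=1e6 :", all_zero)
```

With a very large penalty `lam` every weight stays at zero; lowering it lets variables enter the model one after another. Plotting the weights and the error against `lam` gives the kind of figures described above.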

## Using the Akaike information criterion

Lastly we applied stepwise AIC to find an optimum model. This not only tries to find the best fit by adding one variable at a time, but also applies a penalty for every extra variable, because using too many variables increases the risk of overfitting. It thus strikes a balance between goodness of fit and model complexity. This resulted in 12 variables. The resulting fit gave an RMSE of 12 and is shown below.
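The trade-off between fit and complexity can be made concrete with the AIC formula for a least-squares fit with Gaussian errors, which up to an additive constant is n * ln(RSS / n) + 2k. The sketch below only scores candidate models. The residual sums of squares and parameter counts are invented for illustration and are not results from our case study.

```python
import math

def aic_gaussian(rss, n_obs, n_params):
    """AIC of a least-squares fit with Gaussian errors, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

def best_by_aic(candidates, n_obs):
    """`candidates` maps a model name to (residual sum of squares, number of parameters)."""
    scores = {name: aic_gaussian(rss, n_obs, k) for name, (rss, k) in candidates.items()}
    return min(scores, key=scores.get), scores

# Invented numbers: model_c fits slightly better than model_b,
# but needs twice as many parameters to do so.
candidates = {
    "model_a": (30000.0, 3),
    "model_b": (12000.0, 6),
    "model_c": (11800.0, 12),
}
best, scores = best_by_aic(candidates, n_obs=72)
for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print("lowest AIC:", best)
```

A stepwise procedure repeats this comparison: at each step it adds (or removes) the variable that lowers the AIC most and stops when no change lowers it further.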

## Reviewing the fitted models

Each technique selected slightly different variables for its optimum model, resulting in different fits to the training data. The model selected using the Akaike information criterion gave the lowest root mean squared error on the training data and thus looked like the best performing model with the best selection of predictor variables.

Sometimes an algorithm is just a black box and one can only evaluate a model by looking at its performance, for example in neural networks, but in this case the variables selected and weights attributed to them by each model can be freely inspected.

So next we reviewed the variables selected by the models to see if Mels could understand the selection and if it made any business sense. Unfortunately I cannot share this analysis since it contains confidential information. But checking whether the outcome of a computer model also makes logical sense is always an important step.

## Validating the models

Remember we split the data into a training and a validation set? Up to now we had only used the training set. To validate the algorithms and see how well they predict the butter price, we compared their predictions with the validation set. The outcomes are shown below.

We now see that although the Akaike model scored well on the training data, it performs worse on the validation data. The best model seems to be the generalized linear model with L1 regularisation. An explanation for this could be that the pattern in the training set is very different from that in the validation set. If you look at the split in the figure below, you see a flat price development instead of a volatile one just where we put the split.

Since this pattern was not in the training set, it is hard to predict.
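Comparing models then comes down to computing the RMSE on both sets and ranking on the validation score. A minimal sketch with invented prices and predictions (the model names and all numbers are hypothetical, chosen to mimic a volatile training period followed by a flat validation period):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rank_by_validation(models, train_actual, val_actual):
    """`models` maps a name to (training predictions, validation predictions)."""
    rows = [
        (name, rmse(train_actual, train_pred), rmse(val_actual, val_pred))
        for name, (train_pred, val_pred) in models.items()
    ]
    return sorted(rows, key=lambda row: row[2])  # best validation RMSE first

# Invented values: volatile prices in training, flat prices in validation.
train_actual = [200.0, 210.0, 190.0, 220.0]
val_actual = [205.0, 206.0, 204.0, 205.0]
models = {
    "flexible": ([200.0, 210.0, 190.0, 220.0], [230.0, 180.0, 225.0, 185.0]),
    "regularised": ([203.0, 206.0, 196.0, 214.0], [207.0, 203.0, 206.0, 202.0]),
}

ranking = rank_by_validation(models, train_actual, val_actual)
for name, train_rmse, val_rmse in ranking:
    print(f"{name:12s} train RMSE = {train_rmse:6.2f}  validation RMSE = {val_rmse:6.2f}")
```

In this toy example the model that reproduces the training data perfectly comes last on the validation data, the same kind of reversal we saw between the Akaike model and the L1-regularised model.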

Then again, even where we cannot see a pattern, the hope was that an algorithm just might. And maybe the predictions of the algorithms show the same flat pattern, thus still resulting in a good fit.

Is any model good enough to trade on? Well, Mels does not think so. He concluded that none of the selected variables was good enough to predict the butter price and that the market is apparently driven by things other than the selected variables. He was however very enthusiastic, because he now had a useful and reliable tool to test whether a variable is a good predictor. He no longer has to rely on his gut feeling alone, but can back it up with numbers and explain this to his colleagues.

## Conclusion

In the Certified Data Science Course we learned a lot of techniques to extract information from data. We applied some of these to predict butter prices, which could be used by Mels, a dairy product trader, in his daily work. Although we did not find any clear predictors we did find a reliable and reproducible technique to check if a variable is a good predictor.

It is important to mention that neither Mels nor I had any extensive knowledge of data science prior to the course. Yet by applying the newly acquired knowledge directly in the course, and with the help of our teachers, we could investigate a real-world practical question where the business value is immediately evident. And we were not alone. Other teams worked at banks and used their final assignment to predict the chance that defined groups of people would be eligible for a mortgage, thus reducing the cost of the intake process. Another team predicted the occupancy of cells in police stations to improve the flow of short-stay inmates.

Of course we could further improve our techniques and evaluate other predictors, or maybe even use text mining on newspaper articles to predict the butter price. Now, thanks to the course, we can!

I would like to thank our teachers, Hugo Koopmans and Koen de Koning, for their great course and their patience in explaining all the techniques to us!
