Predicting butter prices

Case study assignment for the Certified Data Science Professional course, DIKW Academy

Setting the stage

Last fall (2019) I enrolled in the CDSP class to enrich my Data Science knowledge. Besides learning lots of theory about data mining, clustering, the semantic web, time series analysis, neural networks and other state-of-the-art techniques, what set CDSP apart for me were the practical exercises that go along with the theory. At the end of the course the acquired knowledge culminates in a final case study into a business problem of your own choosing.

I paired up with Mels, a trader in dairy products at Hoogwegt International, to investigate possible predictors for the price of butter. 


In this blog I will explain the process we went through, and how applying the learnings from the course to a real-world problem not only enhanced the learning experience, but also immediately resulted in a useful business case.

First a little bit about me. I have a background in Business Intelligence. This entails extracting and transforming data, creating business dashboards, interacting with databases, writing the odd Python script and other general data wrangling. I do not have much knowledge of data science. I also don't know much about the in-depth workings of the stock market in general, or trading dairy products in particular. I know my milk comes from a cow and I get it from my local grocery store at a price that seems reasonable to me.

My team partner Mels, however, knows everything about cows and dairy products: how much milk cows produce at what age, how it should be stored, all the derivative products you can make from milk, which countries are net importers of these products and which are net exporters, what makes a good milk season, and how sunshine in New Zealand influences the price of my chocolate bar in the Netherlands.

The objective

Together we sat down and thought about what would be a good case to apply our newly acquired data knowledge to. Mels had an interesting idea: in trading, there is traditionally a strong inverse relationship between stocks and prices. When the price is high, stocks are low, because everyone wants to sell at these high prices; vice versa, at low prices everyone stockpiles their product and waits for prices to rise again. In one particular market however, the butter market, this relationship is not that evident: stocks and prices do not seem correlated at all.

Could we use our new Data Science knowledge to quantify the seemingly low correlation between price and stock in the butter market and find alternative predictors?

Choosing the right technique

In the course we learned about a variety of data science techniques, but which ones were appropriate to apply?

  • General statistics
  • Clustering
  • Decision trees
  • Linear regression
  • Logistic regression
  • Generalized Linear Models 
  • Random Forest
  • Reinforcement learning
  • Text mining
  • Time series and forecasting
  • Model selection techniques

We decided, after some deliberation with our very helpful teachers, to apply four of them. Some general statistics were needed to show correlation and significance, time series analysis to find trends and seasonality, and generalized linear models to find predictors for the butter price. Lastly we would use the Akaike information criterion for model selection, thus using a different but established method for finding predictors.

Quantifying the correlation

This was the easiest one. Applying the theory and calculating the correlation between the butter price series and the stock series in R gave us a Pearson correlation coefficient of 0.19. Since this is nowhere near 0.70, the usual rule of thumb for a strong correlation, we could confidently say the two series are at best weakly correlated.
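
For illustration, here is a minimal sketch of this calculation in R, assuming the two series are numeric vectors of equal length (the variable names are hypothetical):

    # Pearson correlation between the butter price and stock series
    cor(butter_price, butter_stock, method = "pearson")

    # cor.test additionally reports a p-value and a confidence interval
    cor.test(butter_price, butter_stock)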

Time series analysis

Next we applied time series decomposition to see if there were seasonal effects. We split the time series into three parts: a trend component, a seasonal component, and the remaining variation, collected in a random component. The results are shown in the figure below.
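
A minimal sketch of such a decomposition in R, assuming the monthly butter prices are in a numeric vector called prices (the name and the start date are placeholders):

    # Build a monthly time series object
    price_ts <- ts(prices, frequency = 12, start = c(2012, 1))

    # STL splits the series into seasonal, trend and remainder components
    decomposition <- stl(price_ts, s.window = "periodic")
    plot(decomposition)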

The increasing trend line indicates that the average butter price increased over the period shown: there is a growing general demand for butter.

What clearly stands out is the seasonal effect. Mels had a logical explanation for this from a business perspective. Before the festive holidays in December the prices go up, since everybody wants butter for cooking and snacks. After the holidays, in January, there is much less demand, since New Year's resolutions rarely entail eating lots of butter. So everyone with a surplus of butter dumps it at low prices to avoid storage costs, hence the post-December dump prices.

Taking a closer look at the y-axis, however, reveals that the random effects dominate. The trend shows a somewhat steady increase from 160 to 220, and where the seasonal variation is between approximately -10 and +20, the random variation is between -30 and +60.

Other effects besides trend and seasonality are driving the price, and now the question is: can we find useful predictors using Mels' knowledge of the dairy business?

Splitting the data into training and validation sets

 

First, being good students, we made a point of splitting the data into a training set, on which we would train our algorithms, and a validation set, which we would use to validate the outcomes and compare them with the algorithms' predictions.

We made the split such that we had six years of training data and one year of validation data. Predicting further into the future did not seem necessary, and we had to be careful not to make our training set too small, to limit the risk of overfitting.
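
As an illustration, a chronological split in R, reusing the price_ts series from the sketch above (the exact years are placeholders):

    # Six years for training, the final year for validation
    train_ts <- window(price_ts, end = c(2017, 12))
    validation_ts <- window(price_ts, start = c(2018, 1))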

Finding predictors using Generalized Linear Models

Using Mels' business knowledge we selected 21 variables possibly related to the butter price. Since every business has its secrets I cannot fully disclose what each of them was, but suffice it to say that we selected some time variables and variations of stock indicators, as well as some other possible predictors.

Fitting a generalized linear model

Fitting all 21 variables to the training data resulted in the left figure below. Using only the variables that were deemed significant, five in total, gave us the figure on the right.

The better fit, and thus lower root mean squared error (RMSE), on the left does not mean that model is necessarily better, since a low RMSE can also be a sign of overfitting: the more variables you use relative to the number of data points, the better the fit always gets, and in the extreme case of as many variables as data points you get a perfect fit, regardless of whether those variables are in any way logically related to the target variable.
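
A minimal sketch of how such fits can be made in R, assuming a data frame train_df containing the butter price and the candidate predictors; all variable names below are hypothetical:

    # Full model: all 21 candidate variables
    full_model <- glm(price ~ ., data = train_df, family = gaussian)
    summary(full_model)  # shows which coefficients are deemed significant

    # Reduced model: only the significant predictors (names are made up)
    reduced_model <- glm(price ~ stock_eu + milk_price + month + export_volume + season_index,
                         data = train_df, family = gaussian)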

Fitting a generalized linear model with L1 regularization

The next step was a different technique where, starting from zero, variables are added one at a time to see if the fit improves. We learned this can be done with a generalized linear model with L1 regularization. The weight of each variable is varied until no further improvement in the fit is obtained; then another variable is added and the weights are varied again, until the fit no longer improves. The resulting mean squared error is calculated at each step. Below on the left you can see each coefficient being varied, starting from zero, until all are added, and on the right the resulting mean squared error. One can see that adding more than the first 12 variables no longer decreases the mean squared error, thus giving an optimum fit.

Left: the L1 norm applied to our variables (the variable names are blurred due to business non-disclosure). Right: the resulting mean squared error as variables are added.

The resulting fit on our training data is plotted in the figure below. The RMSE in this case is 21.
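
Such an L1-regularized (lasso) fit can be produced with the glmnet package; a minimal sketch, assuming a numeric predictor matrix x_train and a response vector y_train (hypothetical names):

    library(glmnet)

    # alpha = 1 selects the L1 (lasso) penalty
    lasso_fit <- glmnet(x_train, y_train, alpha = 1)
    plot(lasso_fit, xvar = "lambda", label = TRUE)  # coefficient paths

    # Cross-validation finds the penalty with the lowest mean squared error
    cv_fit <- cv.glmnet(x_train, y_train, alpha = 1)
    plot(cv_fit)
    coef(cv_fit, s = "lambda.min")  # the selected variables and their weights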

Using the Akaike information criterion

Lastly we applied stepwise AIC to find an optimum model. This not only tries to find the best fit by adding one variable at a time, but also penalizes the use of too many variables, which increases the risk of overfitting, thus finding a balance between goodness of fit and model complexity. This resulted in 12 variables. The resulting fit gave an RMSE of 12 and is shown below.
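
A minimal sketch of stepwise AIC selection in R, using the stepAIC function from the MASS package on the full_model fitted earlier:

    library(MASS)

    # Stepwise search that adds and drops variables, scored by AIC
    aic_model <- stepAIC(full_model, direction = "both", trace = FALSE)
    summary(aic_model)  # the variables that survive the selection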

Reviewing the fitted models

Each technique we used selected slightly different variables for its optimum fitted model, resulting in different fits to the training data. We saw that the model selected using the Akaike information criterion gave the lowest root mean squared error and was thus the best-performing model, with the best selection of predictor variables.

Sometimes an algorithm is just a black box and one can only evaluate a model by looking at its performance, as with neural networks for example, but in this case the variables selected by each model, and the weights attributed to them, can be freely inspected.
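
For instance, the fitted weights of the models sketched above can be read directly:

    coef(aic_model)                 # stepwise-AIC model coefficients
    coef(cv_fit, s = "lambda.min")  # lasso model coefficients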

So next we reviewed the variables selected by the models to see if Mels could understand the selection and whether it made any business sense. Unfortunately I cannot share this analysis since it contains confidential information. But the step of evaluating the outcome of computer models and checking whether it also makes logical sense is always important.

Validating the models 

Remember we split the data into a training and a validation set? Up to now we had only used the training set. To validate the algorithms and see how well they predict the butter price, we compared their predictions with the validation set. The outcomes are shown below.
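
A minimal sketch of this comparison, assuming a data frame validation_df that holds the held-out year (a hypothetical name):

    # Predict the held-out year and compute the root mean squared error
    predictions <- predict(aic_model, newdata = validation_df)
    sqrt(mean((validation_df$price - predictions)^2))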

 

We now see that although the Akaike model scored well on the training data, it performs worse on the validation data. The best model seems to be the generalized linear model with L1 regularization. An explanation for this could be that the pattern in the training set is very different from that in the validation set. If you look at the split we made in the figure below, we see a flat price development instead of a volatile one right where we put the split.

 

Since this pattern was not present in the training set, it is hard to predict.

Then again, just because we cannot see a pattern, the hope was that an algorithm just might. And maybe the predicting algorithms produce the same flat pattern, thus still resulting in a good fit.

Is any model good enough to trade on? Well, Mels does not think so. He concluded that none of the selected variables was good enough to predict the butter price, and that the market is apparently driven by other things than the selected variables. He was nevertheless very enthusiastic, because he now has a useful and reliable tool to test whether a variable is a good predictor. He no longer has to rely on gut feeling alone; he can back it up with numbers and explain it to his colleagues.

Conclusion

In the Certified Data Science Professional course we learned a lot of techniques to extract information from data. We applied some of these to predict butter prices, which could be used by Mels, a dairy product trader, in his daily work. Although we did not find any clear predictors, we did find a reliable and reproducible technique to check whether a variable is a good predictor.

It is important to mention that neither of us, neither Mels nor I, had any extensive knowledge of data science prior to the course. Yet by applying the newly acquired knowledge directly during the course, and with the help of our teachers, we could investigate a real-world practical question where the business value is immediately evident. And we were not alone: other teams worked at banks and used their final assignment to predict the chance that defined groups of people would qualify for a mortgage, thus reducing the cost of the intake process. Another team predicted cell occupancy in police stations to improve the flow of short-stay inmates.

Of course we could further improve our techniques and evaluate other predictors, or maybe even use text mining of newspaper articles to predict the butter price. Now, thanks to the course, we can!

I would like to thank our teachers, Hugo Koopmans and Koen de Koning, for their great course and their patience in explaining all the techniques to us!

