SERVAL Open Ears AI machine listening

Building artificial ears for (urban) jungle applications

Setting the stage

Back in 2016 we were in Bardia National Park, Nepal, when I first understood the impact of human-wildlife conflict. The night we arrived, a local villager was attacked in his house, together with his wife. Fortunately they were unharmed, but their home was ruined. What struck me most was the respect this man had for the elephant that ruined his house. I will never forget it.

In this blog I will explain the process we went through while developing a sensor solution built on artificial intelligence that can process sounds the same way we can. It can classify what it hears, and these signals can be used in real applications, from noise pollution in the city of Amsterdam to human-wildlife conflict mitigation in Nepal.

The objective

The objective is to develop a listening device that does not record and store sounds, which would raise all kinds of privacy issues, but instead processes and analyses the sound on the spot and only sends the labels of the sounds it has identified.

The end-goal is to make the sensor solar powered and deploy it anywhere in the world to mitigate human-wildlife conflicts.

Could we use our new Artificial Intelligence techniques to make this work?

Choosing the right technique

When we first came back from Nepal, I spent some time on the web and found two great projects.

The first project was the Sounds of New York City project (SONYC). What I specifically liked about this project was its inherent collaboration with the citizens of New York. I recommend watching their video explaining the project; it inspired me to believe this could be possible.

See the overview picture below explaining this great setup:

CPS_diagram-1024x685.png


The second project was the work of Karol Piczak, then a PhD student doing his research on sound classification. I first found his work on GitHub and contacted him, explaining my ambition and goals. This was around January 2017. When I explained that I wanted to run his code on an edge device like a Raspberry Pi, he laughed and said that would be really challenging. By Easter that year he had bought himself a Raspberry Pi, and after that holiday he had his code running on the Pi. Impressive...

His research helped me get our ambitions off the ground. His Jupyter notebooks are available online and are a great resource for anyone who wants to start analysing sounds. The picture below shows the initial deep neural network structure he used in his research.

piczal-research-network-topology.png

So this made me wonder whether it would be possible to build a device that can hear elephants coming to town and help locals all over the world react effectively to mitigate human-wildlife conflicts. Every year, in India alone, 400 people are killed by elephants.

Google AudioSet

In October 2017 Google released its AudioSet and the accompanying deep learning neural network models. This again shed new light on our options. We leveraged the work of the authors of this blog, who explained how to apply transfer learning to these pre-trained models and train the model on data that we collected for our specific use case.

The audio set collected by Google is, as you would expect, huge: over 2 million tagged YouTube recordings. By now (2020) they have improved the dataset over a couple of versions.

histogram.svg


Sensing Clues

As a partner, DIKW Intelligence invests in the development of Artificial Intelligence applications for use in the Sensing Clues Wildlife Intelligence platform. The SERVAL sensor is one of the technologies we are working on.

As our ambition is high, we look forward to training this sensor to also identify sounds produced by (big) wild animals such as elephant, lion, wild boar, and other species. Recent research shows that many of these animals communicate at very low frequencies, not detectable by the human ear. While we have advanced quite a lot, we still have work to do to achieve this goal. The big challenges we face include: most audio equipment filters out (or simply does not record) the sub-frequency sounds we are interested in, and knowledge and samples of sounds within this frequency domain are scarce. For example, researchers learned only a few years ago that giraffes, too, produce sounds that are hardly audible to the human ear. So distinguishing sub-frequency sounds is truly an enormous task lying ahead.

Meanwhile in the urban jungle ....

Amsterdam Sounds project

The city of Amsterdam has set out to fight sound pollution and noise disturbance. To this end it initiated the Amsterdam Sounds project, in which the SERVAL sound sensor plays an important role! What I especially like in this collaboration is the work we do together with the Sensemakers of Amsterdam. Together we built a dedicated version of the SERVAL sound sensor for detecting sources of sound pollution: the Open Ears sensor.

Putting it all together

So let's put together all the bits and pieces and see where we are.

The basic idea is that sound can be transformed into an image by applying a Fourier transformation, thus creating a spectrogram of the sound.
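
To make this concrete, here is a minimal sketch of how a spectrogram can be computed with a short-time Fourier transform in plain NumPy. It is not the code running on the sensor: the function name, frame length, hop size and the synthetic 1 kHz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Return a (frequency bins x time frames) magnitude spectrogram in dB."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One Fourier transform per frame; keep only the magnitudes.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Convert to decibels; the small constant avoids log(0).
    return 20 * np.log10(magnitude + 1e-10).T

# Illustrative input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

image = spectrogram(tone)
print(image.shape)  # (frequency bins, time frames)
```

Each column of `image` is the spectrum of one short slice of sound, so a steady tone shows up as a horizontal line and an alarm that sweeps in pitch shows up as a repeating pattern.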

Some examples taken from the samples collected in Amsterdam:

A "brommer alarm" (a moped alarm) sounds like this

and looks like this

brommer-alarm.png

A car horn sounds like this:

And a car horn looks like this

claxon.png

Now the challenge for the classifier is to see the sound structure in the spectrograms... so we are back at image classification again!

We could talk about all the nuts and bolts of the machine learning parts of this (and I am very happy to do so, drop me a note!), but the proof of the pudding is in the eating...

One way to show how well a classifier has learned a certain task is to look at the confusion matrix of all the classes it is trained to recognise, and compare what the model classifies with the so-called ground truth: in this case, a set of sound examples the model has never seen during training.

So, to show you some preliminary results, let's have a look at the confusion matrix given the default cutoff for all the class probabilities.

  cm-0.4.png

So how do we interpret this result?

Let's look at the class "Brommer Alarm". On the horizontal axis we see the predicted classes, so this is what the model says it hears when we play an example sound. In total the model raised the flag "I heard a brommer alarm" 26 times, and in 17 cases it was actually right. The ground truth is on the vertical axis, showing that 17 times it was indeed the Brommer Alarm, but also some mistakes: 2 times it was actually a gunshot. You can have a look at the confusion matrix yourself and see whether the mistakes it makes actually make sense.
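
For readers who want to build such a matrix themselves, here is a small sketch with the same orientation as the picture: ground truth on the rows (vertical axis) and predictions on the columns (horizontal axis). The function and the six toy labels are made up for illustration; they are not our test set.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are the ground truth, columns are what the model predicted."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth], index[predicted]] += 1
    return matrix

classes = ["brommer alarm", "car horn", "gunshot"]

# Toy example data, invented for this sketch.
truth = ["brommer alarm", "brommer alarm", "gunshot",
         "car horn", "car horn", "gunshot"]
predicted = ["brommer alarm", "brommer alarm", "brommer alarm",
             "car horn", "brommer alarm", "gunshot"]

matrix = confusion_matrix(truth, predicted, classes)

# Reading one column: how often did the model say "brommer alarm",
# and how often was that correct (the diagonal cell)?
fired = matrix[:, 0].sum()
correct = matrix[0, 0]
print(matrix)
print(fired, correct)
```

Summing a column gives the number of times the model fired for that class, and the diagonal cell in that column gives the number of times it was right.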

So what would we actually be interested in when deploying such a sensor in the city?

To be able to answer the question of whether and how such a sensor can be useful, we need to talk a little bit about precision and recall. Wikipedia has a great explanation of this topic, so please read that if you need to refresh your memory on the subject ;-).

Here I just use the great picture that comes with it:

Precisionrecall.png

So when classifying in this context we are interested in the precision of the model. Why? Let me put it this way: if the model signals something, it had better be right; otherwise it had better keep quiet!

So how can we influence precision? We tell the model only to shout if she thinks there is a high probability she is right.

We can do this by increasing the cutoff probability at which the model signals a class. See the example below, where we have increased the threshold to speak from 0.45 to 0.9 (extreme, but just to make the point).


cm-0.9.png

Now we see the model hardly ever dares to speak, but when she does she is more often right.
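
In code, the cutoff is nothing more than a filter on the class probabilities. The sketch below assumes the model returns one probability per class for each sound clip; the function name and the probabilities are hypothetical.

```python
def signalled_classes(probabilities, cutoff):
    """Return the labels whose probability reaches the cutoff."""
    return [label for label, p in probabilities.items() if p >= cutoff]

# Hypothetical model output for one sound clip (illustrative values only).
clip = {"brommer alarm": 0.62, "car horn": 0.30, "gunshot": 0.08}

print(signalled_classes(clip, cutoff=0.45))  # ['brommer alarm']
print(signalled_classes(clip, cutoff=0.9))   # []
```

With the lower cutoff the model speaks up for this clip; with the higher cutoff she stays quiet.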

So to stick to our example of the "Brommer Alarms": in the first case the model shouted 26 times, of which 17 times correctly, a precision of 17/26, around 0.65. When we restrict the model with a higher cutoff, the model only speaks 15 times, of which 14 times correctly, a precision of 0.93.

Of course this comes at the cost of missing cases it should have classified, the so-called false negatives.
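
The trade-off can be sketched in a few lines. The two precision values are the counts from the confusion matrices above; the scores and ground truth for the eight clips are invented for illustration, since recall cannot be read from the numbers quoted in this post.

```python
def precision_recall(scores, is_target, cutoff):
    """Precision and recall for one class at a given cutoff.

    scores: model probability for the class, one per clip.
    is_target: True where the clip really belongs to the class.
    """
    fired = [s >= cutoff for s in scores]
    true_pos = sum(f and t for f, t in zip(fired, is_target))
    false_pos = sum(f and not t for f, t in zip(fired, is_target))
    false_neg = sum((not f) and t for f, t in zip(fired, is_target))
    # Precision is undefined if the model never fired, recall if there
    # were no true examples; None marks those cases.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else None
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else None
    return precision, recall

# Counts quoted in the text: 17 of 26 correct, then 14 of 15 correct.
print(round(17 / 26, 2), round(14 / 15, 2))

# Hypothetical scores and ground truth for eight clips.
scores = [0.95, 0.92, 0.80, 0.70, 0.60, 0.55, 0.50, 0.30]
is_target = [True, True, True, False, True, False, True, False]

for cutoff in (0.45, 0.9):
    print(cutoff, precision_recall(scores, is_target, cutoff))
```

On this toy data, raising the cutoff from 0.45 to 0.9 lifts precision from about 0.71 to 1.0, while recall drops from 1.0 to 0.4: the model makes fewer mistakes when she speaks, but misses three of the five real alarms.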

Next steps

Some of the next things we are working on are:

- Improving classification of the same sound farther away from the source. To do this we have resampled our training samples and lowered the volume by 6 dB, which results in another recording of the sound as if it were roughly 2x farther from the source. We have done this recursively 3 times, so each sample has 4 loudness variants (0 dB, -6 dB, -12 dB and -18 dB). Field tests we are currently running show improvements in the classification of the same sounds farther away from the sensor.

- Data augmentation in general is a hot topic in deep learning (some nice links here and here). As samples are hard to come by and expensive to collect, we need to be creative in generating as much augmented data as possible, in such a way that it increases model performance in the end. We are working on a data augmentation strategy for sound samples.
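
The loudness variants from the first item can be sketched as follows. This is an illustration, not our augmentation pipeline: the function names and the synthetic test tone are assumptions. A gain of -6 dB roughly halves the amplitude, which is what doubling the distance to the source does in open air.

```python
import numpy as np

def attenuate(samples, gain_db):
    """Scale the amplitude by a gain in decibels (negative is quieter)."""
    return samples * 10 ** (gain_db / 20)

def loudness_variants(samples, gains_db=(0, -6, -12, -18)):
    """Return one copy of the sample per gain, keyed by the gain in dB."""
    return {gain: attenuate(samples, gain) for gain in gains_db}

# Illustrative input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440 * t)

variants = loudness_variants(clip)
for gain, audio in variants.items():
    print(gain, round(float(np.max(np.abs(audio))), 3))
```

Each training sample then appears four times in the training set, once at its original loudness and three times as if it were recorded from farther away.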

Conclusion

In this blog post I described our quest at DIKW to develop useful applications of Artificial Intelligence. This particular journey has been a great one; it brought me into contact with some great people. I hope you enjoyed the read. We are far from done: we keep pursuing the goal of applying this technology to the Sensing Clues SERVAL sound sensor in the field, most likely starting in the urban jungle of Amsterdam, but hopefully soon thereafter somewhere in Nepal, Kenya or any other place where we can turn wild spaces into safe havens.

Please feel free to contact me, leave a message, or share your insights on how to take this further.