Summary: Neural networks and deep learning are everywhere these days, and everybody is talking about them. Most articles list only the advantages: neural networks are magical, they can learn anything, they can be used for everything… But are they really that perfect? What are their problems? Are we blinded by the advertising?
Why are they so popular? Are they that unique, something totally new?
The answer to this question is NO: the idea of neural networks is old. The basic idea dates back to 1943, when it was called threshold logic. And that is the problem. “Threshold logic”? Seriously? The name is practical and describes how the model works, but it sounds too scientific; you cannot build good advertising around it. (And in our era, advertising and media are the key to success… everything must be magical, perfect, fancy…)
It was a primitive version of today’s neural networks, but it was the first step.
The next big step was the invention of backpropagation in 1975, but even with that, neural networks were not very popular; for a long time simple linear classifiers were more practical and more popular.
Now the question is: why are neural networks so popular today? Because today we have real computational power, even in our pockets. Our smartphones have more computational power than the most powerful computer from 1975, and still nobody talks about the hard work and innovation of hardware manufacturers. You can read lots of articles praising neural networks, but very few about hardware innovations. Why is that? Because the hardware is invisible; it just works. (But what if the hardware stopped working? We notice meaningful and valuable things only when they fail, only when we feel their absence.)
Hardware manufacturers cannot create advertising as powerful as AI’s, because their work is invisible to most people. But the real innovations are built on hardware development; without hardware advances, neural networks would be nowhere. Is AI everywhere? Is AI the biggest invention of our era? What about quantum computing? But that is another interesting topic, and I will have a follow-up discussion about it!
What is the truth about deep learning?
I think everybody has seen a figure like this:
Yes, it looks complicated to implement something like that: you have to know lots of mathematical formulas, derivatives, error functions, etc. But I can tell you that it is not as hard as you think; it is no harder than developing a website. Just think about it: to build a website, you have to know multiple frameworks, different design patterns, multiple programming languages, scripting, deployment, maybe load balancing, and so on. It is the same with neural networks: you have to know your tools, and that’s all. Yes, it is better to have mathematical knowledge, but that is true in every engineering field.
I will write a follow-up article about useful tools (frameworks, if you want) for easily implementing neural networks. But for now, let me show you how easy it is to implement an extremely simple, yet powerful, two-layer neural network!
Let’s implement our own Neural Network!
The first step is to define the “activation function”. But what is an activation function? It has an elegant name, but it is just a simple function that takes an input and generates an output, often between 0 and 1, which can be read as the confidence of the model. The simplest activation function is the step function: it has a threshold, and if the input is less than the threshold the result is 0, otherwise the result is 1. It is about as simple as a mathematical function gets, and everybody can understand it. As you see, this isn’t rocket science! There are lots of powerful activation functions, but we will talk about those in later articles. For our example we will use the sigmoid function, which has the following formula:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
And the derivative of the sigmoid is:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$
Even if you are thinking “Oh man! This is too hard!”, it isn’t, as you will see in the next paragraphs!
Implementation of the Sigmoid Function
How can we define the sigmoid function and its derivative? It’s simple, like this:
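A minimal sketch of such a definition, assuming NumPy. The function names `step`, `sigmoid` and `sigmoid_derivative` are illustrative choices and not necessarily the ones used in the original listing; the step function is included only for comparison.

```python
import numpy as np


def step(x, threshold=0.0):
    """Step activation: 0 below the threshold, 1 otherwise."""
    return np.where(np.asarray(x) < threshold, 0.0, 1.0)


def sigmoid(x):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def sigmoid_derivative(x):
    """Slope of the sigmoid at x: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
print(step([-1.0, 0.0, 2.0]))   # [0. 1. 1.]
```

The slope is largest at x = 0 (where it is 0.25) and shrinks towards 0 for large positive or negative inputs.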
What do you think, is that so hard? I don’t think so!
Developing our two-layer network
OK, now let’s see the code for the two-layer neural network:
The input matrix X:
I initialized the weights with random values (with mean 0, which is just an optimization to get better results) and trained my model for 10000 steps. The next step is to calculate the sum of the inputs multiplied by the weights, which is basically a dot product; that is why I used the dot(x1_array, x2_array) function from the numpy library. After that we apply the sigmoid function, our activation function, to the dot product to calculate the output, the result predicted by our model. This is forward propagation, which means predicting results.
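Forward propagation on its own can be sketched as follows. This is only an illustration under assumptions: the 4x3 input matrix `X` is a hypothetical toy dataset (not necessarily the one used in this article), and the names `weights` and `rng` are my own choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical toy input: 4 samples with 3 features each.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
# Random weights in [-1, 1), i.e. with mean 0; one weight per input feature.
weights = 2 * rng.random((3, 1)) - 1

# Forward propagation: weighted sum of the inputs, then the activation.
l1 = sigmoid(np.dot(X, weights))

print(l1.shape)  # (4, 1): one prediction per sample
```

Because the weights are still random at this point, the predictions are essentially guesses; training is what makes them useful.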
Reducing the errors
To optimize our results, we have to use backpropagation, which in the simplest case just means comparing the results predicted by our model with the ground truth values and calculating the difference, which is the error of our prediction. The goal is to reduce this error, that is, the difference between y (the ground truth) and the predicted values l1. How can we do that? Let’s see the diagram of the sigmoid function:
When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions. Look at the sigmoid picture again! If the slope was really shallow (close to 0), then the network either had a very high value, or a very low value. This means that the network was quite confident one way or the other. However, if the network guessed something close to (x=0, y=0.5) then it isn’t very confident. We update these non-confident predictions most heavily, and we tend to leave the confident ones alone by multiplying them by a number close to 0.
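The error-times-slope update described above can be sketched as a complete training loop. This is a minimal sketch, not the article's original listing: the toy data `X` and `y` are assumptions, and the names `weights`, `slope` and `l1_delta` are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical toy data: the target equals the first input column.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0, 0, 1, 1]], dtype=float).T

np.random.seed(1)                            # repeatable run
weights = 2 * np.random.random((3, 1)) - 1   # mean-0 initial weights

for _ in range(10000):
    l1 = sigmoid(np.dot(X, weights))         # forward propagation
    l1_error = y - l1                        # how far off each prediction is
    slope = l1 * (1 - l1)                    # sigmoid slope at each prediction
    l1_delta = l1_error * slope              # confident predictions change least
    weights += np.dot(X.T, l1_delta)         # update the weights

predictions = sigmoid(np.dot(X, weights))
print(np.round(predictions, 3).ravel())
```

Note that the slope is computed from the prediction itself (`l1 * (1 - l1)`), which works because the derivative of the sigmoid can be written in terms of its own output.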
Results of our network
Even though it is a very simple model, it gives us a very precise result. The output prediction vector is:
And the ground truth vector was:
As you can see 0.993 is very close to 1, and 0.007 is very close to 0, which shows that our model generates good results.
What do you think? Isn’t it easy?
Deep Learning explained
OK, so what is the difference between this easy model and a model called deep learning? The difference is mostly its sophisticated name: “deep” basically means that the model has lots of hidden layers. If you have more layers, the summing and the activation function are applied multiple times, and backpropagation is a little bit harder: the error has to be passed backwards through every layer with the chain rule, and the weights are updated with gradient descent, which is a fancy name for repeatedly nudging each weight in the direction that reduces the error, based on the (multivariate) derivatives. As you can see, neural networks are not as hard as you think, and there are lots of good libraries implementing them (like sklearn). As in website development, the only things you have to know are the names of the libraries, how to use them and, of course, what the parameters mean. (I will have another article about libraries, so follow the blog!)
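To make the step from one layer to several more concrete, here is a minimal sketch of a network with a single hidden layer, assuming NumPy. The toy data, the hidden layer size of 4, the learning rate of 1 and the names `w0` and `w1` are all assumptions chosen for illustration; real deep networks stack many such layers.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Hypothetical toy data: the target is the XOR of the first two columns,
# which a network without a hidden layer cannot fit.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([[0, 1, 1, 0]], dtype=float).T

np.random.seed(1)
w0 = 2 * np.random.random((3, 4)) - 1   # input -> hidden layer (4 neurons)
w1 = 2 * np.random.random((4, 1)) - 1   # hidden layer -> output

for _ in range(20000):
    # Forward pass: sum and activation are applied once per layer.
    l1 = sigmoid(np.dot(X, w0))
    l2 = sigmoid(np.dot(l1, w1))

    # Backward pass: output error first, then passed back to the hidden layer.
    l2_delta = (y - l2) * l2 * (1 - l2)
    l1_delta = np.dot(l2_delta, w1.T) * l1 * (1 - l1)

    # Gradient descent step (learning rate 1).
    w1 += np.dot(l1.T, l2_delta)
    w0 += np.dot(X.T, l1_delta)

predictions = sigmoid(np.dot(sigmoid(np.dot(X, w0)), w1))
mean_error = float(np.mean(np.abs(y - predictions)))
print(mean_error)
```

The only structural changes compared with a single-layer network are the second weight matrix and the extra line that carries the error back through `w1`.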
What are the disadvantages of Deep Learning?
As you have seen, “deep” means lots of hidden layers of artificial neurons, which brings lots of disadvantages. Let’s see a few of those:
- It’s far more complicated than many other models, such as a decision tree. It’s hard to interpret the weights and to understand why they have the values they do. Weights in a regression, by contrast, have a simple statistical meaning.
- It’s harder to visualize and present your NN model, in particular to a non-technical audience. A decision tree is simple for them, and a regression can also be visualized.
- Easier to overfit your model.
- It’s harder to get confidence and prediction intervals.
- Long training times for deep networks, which are often the most accurate architecture for complex problems. This is especially true if you’re training on a CPU instead of a specialized GPU instance.
- Architectures have to be fine-tuned to achieve the best performance. There are many design decisions to make, from the number of layers to the number of nodes in each layer to the activation functions, and an architecture that works well for some problems very often does not generalize well to others.
- Needs lots of data, especially for architectures with many layers. This is a problem for most ML algorithms, of course, but is especially relevant for NNs because of the vast number of weights and connections in NNs.
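To get a feeling for how quickly the number of weights grows, here is a small sketch that counts the trainable parameters of a fully connected network. The helper `count_weights` and the layer sizes in the second example are hypothetical, chosen only to illustrate the arithmetic.

```python
def count_weights(layer_sizes, with_biases=True):
    """Number of trainable parameters in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # one weight per connection
        if with_biases:
            total += n_out             # one bias per neuron
    return total


print(count_weights([3, 1], with_biases=False))   # 3
print(count_weights([784, 512, 512, 10]))         # 669706
```

A network with three inputs and one output has just 3 weights, while a still modest network with two hidden layers of 512 neurons already has several hundred thousand parameters to fit.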
The main question is: is it OK to use NNs for every classification or regression problem? NO, don’t use NNs for everything. There are lots of simpler models, like Decision Trees, kNN, Random Forests, SVM, Isolation Forests, etc., which can give you shorter training times and better results than Neural Networks. Before you apply an NN, please take its disadvantages into consideration and analyse your problem, the platform on which you want to run the training process, the computational power you have on that platform, the quantity of your training data and the nature of the data. How to choose the best algorithm is a very important and interesting topic, so I will have a follow-up article explaining the steps you should consider when choosing a classification or regression model.
If you liked this article, if you think that this article contains useful information, then please like it and share it, help others to find useful information!
Thank you for reading!