SVM and Kernel SVM

Summary

In this article you will learn about SVM or Support Vector Machine, which is one of the most popular AI algorithms (it’s one of the top 10 AI algorithms), and about the Kernel Trick, which deals with non-linearity and higher dimensions. We will touch on topics like hyperplanes and Lagrange Multipliers, and we will have visual examples and code examples (similar to the code example used in the KNN chapter) to better understand this very important algorithm.

SVM Explained

The Support Vector Machine is a supervised learning algorithm mostly used for classification, but it can also be used for regression. The main idea is that based on the labeled data (training data) the algorithm tries to find the optimal hyperplane which can be used to classify new data points. In two dimensions the hyperplane is a simple line.

Usually a learning algorithm tries to learn the most common characteristics of a class (what differentiates one class from another), and the classification is based on those representative characteristics (so classification is based on differences between classes). SVM works the other way around: it finds the examples of each class that are most similar to the other class. Those will be the support vectors.

As an example, let’s consider two classes, apples and lemons.

Other algorithms will learn the most evident, most representative characteristics of apples and lemons, like apples being green and round while lemons are yellow and elliptical.

In contrast, SVM will search for apples that are very similar to lemons, for example apples which are yellow and elliptical. This will be a support vector. The other support vector will be a lemon similar to an apple (green and round). So other algorithms learn the differences, while SVM learns similarities.

If we visualize the example above in 2D, we will have something like this:

As we go from left to right, all the examples will be classified as apples until we reach the yellow apple. From this point, the confidence that a new example is an apple drops while the lemon class confidence increases. When the lemon class confidence becomes greater than the apple class confidence, the new examples will be classified as lemons (somewhere between the yellow apple and the green lemon).

Based on these support vectors, the algorithm tries to find the best hyperplane that separates the classes. In 2D the hyperplane is a line, so it would look like this:

Ok, but why did I draw the blue boundary like in the picture above? I could also draw boundaries like this:

As you can see, we have an infinite number of possibilities to draw the decision boundary. So how can we find the optimal one?

Finding the Optimal Hyperplane

Intuitively the best line is the line that is far away from both apple and lemon examples (it has the largest margin). To have an optimal solution, we have to maximize the margin in both directions (if we have multiple classes, then we have to maximize it considering each of the classes).

So if we compare the picture above with the picture below, we can easily observe that the first is the optimal hyperplane (line) and the second is a sub-optimal solution, because its margin is much smaller.

Because we want to maximize the margins taking into consideration all the classes, instead of using one margin for each class, we use a “global” margin, which takes into consideration all the classes. This margin would look like the purple line in the following picture:

This margin is orthogonal to the boundary and equidistant to the support vectors.

So where do the vectors come in? Each of the calculations (calculating distances and optimal hyperplanes) is made in vector space, so each data point is considered a vector. The dimension of the space is defined by the number of attributes of the examples. To understand the math behind it, please read this brief mathematical description of vectors, hyperplanes and optimization: SVM Succinctly.

All in all, support vectors are data points that define the position and the margin of the hyperplane. We call them “support” vectors because they are the representative data points of the classes: if we move one of them, the position and/or the margin of the hyperplane will change. Moving other data points has no effect on the margin or the position of the hyperplane.

To make classifications, we don’t need all the training data points (like in the case of KNN); we only have to save the support vectors. In the worst case all the points will be support vectors, but this is very rare, and if it happens you should check your model for errors or bugs.

So basically learning is equivalent to finding the hyperplane with the best margin, which makes it an optimization problem.

Basic Steps

The basic steps of the SVM are:

  1. select two hyperplanes (in 2D) which separate the data with no points between them (red lines)
  2. maximize their distance (the margin)
  3. the average line (here the line halfway between the two red lines) will be the decision boundary

This sounds nice and easy, but finding the best margin is an optimization problem that is not trivial (it is easy in 2D, when we have only two attributes, but what if we have N dimensions with N a very big number?).

To solve the optimization problem, we use Lagrange Multipliers. To understand this technique you can read the following two articles: Duality Lagrange Multiplier and A Simple Explanation of Why Lagrange Multipliers Work.

Until now we had linearly separable data, so we could use a line as class boundary. But what if we have to deal with non-linear data sets?

SVM for Non-Linear Data Sets

An example of non-linear data is:

In this case we cannot find a straight line to separate apples from lemons. So how can we solve this problem? We will use the Kernel Trick!

The basic idea is that when a data set is inseparable in the current dimensions, we add another dimension; maybe that way the data will become separable. Just think about it: the example above is in 2D and it is inseparable, but maybe in 3D there is a gap between the apples and the lemons, maybe there is a level difference, so apples are on level one and lemons are on level two. In this case we can easily draw a separating hyperplane (in 3D a hyperplane is a plane) between level 1 and level 2.

Mapping to Higher Dimensions

To solve this problem we shouldn’t just blindly add another dimension; we should transform the space so that we generate this level difference intentionally.

Mapping from 2D to 3D

Let’s assume that we add another dimension called x3. The important part of the transformation is that in the new dimension the points are organized using the formula x3 = x1^2 + x2^2.

If we plot the surface defined by the x1^2 + x2^2 formula, we will get something like this:

Now we have to map the apples and lemons (which are just simple points) to this new space. Think about it carefully: what did we do? We just used a transformation which adds levels based on the distance from the origin. If a point is at the origin, it will be on the lowest level. As we move away from the origin, we are climbing the hill (moving from the center of the plane towards the margins), so the level of the points will be higher. Now if we consider that the origin is the lemon in the center, we will have something like this:

Now we can easily separate the two classes. These transformations are called kernels. Popular kernels are: Polynomial Kernel, Gaussian Kernel, Radial Basis Function (RBF), Laplace RBF Kernel, Sigmoid Kernel, ANOVA RBF Kernel, etc (see Kernel Functions or a more detailed description Machine Learning Kernels).
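To make this mapping concrete, here is a minimal NumPy sketch (the point coordinates below are made up just for illustration) that adds the third dimension x3 = x1^2 + x2^2 to a handful of 2D points:

```python
import numpy as np

# Made-up 2D points: "lemons" clustered near the origin, "apples" further away
points = np.array([
    [0.1, 0.2], [-0.2, 0.1], [0.0, -0.1],   # lemons (near the center)
    [1.5, 1.2], [-1.4, 1.6], [1.3, -1.5],   # apples (far from the center)
])

# Kernel-style mapping from 2D to 3D: x3 = x1^2 + x2^2
x3 = np.sum(points ** 2, axis=1)
mapped = np.column_stack([points, x3])
print(mapped)
```

Points close to the origin get a small x3 (a low level) and points far from the origin get a large x3 (a high level), so a horizontal plane in 3D can now separate the two classes.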

Mapping from 1D to 2D

Another, easier example in 2D would be:

After using the kernel and after all the transformations we will get:

So after the transformation, we can easily delimit the two classes using just a single line.

In real life applications we won’t have a simple straight line, but lots of curves and high dimensions. In some cases we won’t have two hyperplanes which separate the data with no points between them, so we need some trade-offs, some tolerance for outliers. Fortunately the SVM algorithm has a so-called regularization parameter to configure the trade-off and to tolerate outliers.

Tuning Parameters

As we saw in the previous section, choosing the right kernel is crucial, because if the transformation is incorrect, the model can have very poor results. As a rule of thumb, always check whether your data is linear and in that case use a linear SVM (linear kernel). A linear SVM is a parametric model, but an RBF kernel SVM isn’t, so the complexity of the latter grows with the size of the training set. Not only is it more expensive to train an RBF kernel SVM, but you also have to keep the kernel matrix around, and the projection into this “infinite” higher dimensional space where the data becomes linearly separable is more expensive as well during prediction. Furthermore, you have more hyperparameters to tune, so model selection is more expensive as well! And finally, it’s much easier to overfit a complex model!

Regularization

The Regularization Parameter (in Python it’s called C) tells the SVM optimization how much you want to avoid misclassifying each training example.

If C is higher, the optimization will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower.

On the other hand, if C is low, then the margin will be big, even if there are misclassified training examples. This is shown in the following two diagrams:

As you can see in the image, when C is low, the margin is bigger (so implicitly we don’t have so many curves, and the line doesn’t strictly follow the data points), even though two apples were classified as lemons. When C is high, the boundary is full of curves and all the training data was classified correctly. Don’t forget, even if all the training data was correctly classified, this doesn’t mean that increasing C will always increase the precision (because of overfitting).

Gamma

The next important parameter is Gamma. The gamma parameter defines how far the influence of a single training example reaches. This means that a high Gamma will consider only points close to the plausible hyperplane, while a low Gamma will also consider points at a greater distance.

As you can see, decreasing the Gamma means that points at greater distances are also considered when finding the correct hyperplane, so more and more points are used (the green lines indicate which points were considered when finding the optimal hyperplane).

Margin

The last parameter is the margin. We’ve already talked about the margin: a higher margin results in a better model, so better classification (or prediction). The margin should always be maximized.

SVM Example using Python

In this example we will use the Social_Networks_Ads.csv file, the same file as we used in the previous article, see KNN example using Python.

In this example I will write down only the differences between SVM and KNN, because I don’t want to repeat myself in each article! If you want the whole explanation of how we read the data set, how we parse and split our data, or how we evaluate the model or plot the decision boundaries, then please read the code example from the previous chapter (KNN)!

Because the sklearn library is a very well written and useful Python library, we don’t have too much code to change. The only difference is that we have to import the SVC class (SVC = SVM in sklearn) from sklearn.svm instead of the KNeighborsClassifier class from sklearn.neighbors.

After importing SVC, we can create our new model using the predefined constructor. This constructor has many parameters, but I will describe only the most important ones; most of the time you won’t use the other parameters (see the sketch after the list below).

The most important parameters are:

  1. kernel: the kernel type to be used. The most common kernels are rbf (this is the default value), poly or sigmoid, but you can also create your own kernel.
  2. C: this is the regularization parameter described in the Tuning Parameters section
  3. gamma: this was also described in the Tuning Parameters section
  4. degree: it is used only if the chosen kernel is poly and sets the degree of the polynomial
  5. probability: this is a boolean parameter, and if it’s true, then for each prediction the model will return the vector of probabilities of belonging to each class of the response variable. So basically it gives you the confidence for each prediction.
  6. shrinking: this shows whether or not you want a shrinking heuristic used in your optimization of the SVM, which is used in Sequential Minimal Optimization. Its default value is true, and if you don’t have a good reason, please don’t change it to false, because shrinking greatly improves performance for very little loss in accuracy in most cases.
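Putting these parameters together, the model creation and training step could look like the following sketch (it assumes X_train, y_train and X_test were prepared exactly as in the KNN chapter; the parameter values match the ones discussed in the next paragraphs):

```python
from sklearn.svm import SVC

# rbf is the default kernel; C and gamma are the values discussed below
classifier = SVC(kernel='rbf', C=0.1, gamma=0.1, random_state=0)
classifier.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = classifier.predict(X_test)
```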

Now let’s see the output of running this code. The decision boundary for the training set looks like this:

As we can see, and as we’ve learnt in the Tuning Parameters section, because C has a small value (0.1), the decision boundary is smooth.

Now if we increase the C from 0.1 to 100 we will have more curves in the decision boundary:

What would happen if we use C=0.1 but now we increase Gamma from 0.1 to 10? Let’s see!

What happened here? Why do we have such a bad model? As you’ve seen in the Tuning Parameters section, a high gamma means that when calculating the plausible hyperplane we consider only points which are close. Because the density of the green points is high only in the selected green region, in that region the points are close enough to the plausible hyperplane, so those hyperplanes were chosen. Be careful with the gamma parameter, because it can have a very bad influence on the results of your model if you set it to a very high value (what counts as a “very high value” depends on the density of the data points).

For this example the best values for C and Gamma are 1.0 and 1.0. Now if we run our model on the test set we will get the following diagram:

And the Confusion Matrix looks like this:

As you can see, we’ve got only 3 False Positives and only 4 False Negatives. The Accuracy of this model is 93%, which is a really good result; we obtained a better score than with KNN (which had an accuracy of 80%).

NOTE: accuracy is not the only metric used in ML and also not the best metric to evaluate a model, because of the Accuracy Paradox. We use this metric for simplicity, but later, in the chapter Metrics to Evaluate AI Algorithms we will talk about the Accuracy Paradox and I will show other very popular metrics used in this field.

Conclusions

In this article we’ve seen a very popular and powerful supervised learning algorithm, the Support Vector Machine. We’ve learnt the basic idea, what a hyperplane is, what support vectors are and why they are so important. We’ve also seen lots of visual representations, which helped us to better understand all the concepts.

Another important topic that we touched on is the Kernel Trick, which helped us to solve non-linear problems.

To have a better model, we saw techniques to tune the algorithm. At the end of the article we had a code example in Python, which showed us how we can use the SVM algorithm.

As final thoughts, I would like to give some pros & cons and some popular use cases.

Pros

  1. SVM can be very efficient, because it uses only a subset of the training data, only the support vectors
  2. Works very well on smaller data sets, on non-linear data sets and high dimensional spaces
  3. It is very effective in cases where the number of dimensions is greater than the number of samples
  4. It can have high accuracy, sometimes can perform even better than neural networks
  5. Not very sensitive to overfitting

Cons

  1. Training time is high when we have large data sets
  2. When the data set has more noise (i.e. target classes are overlapping) SVM doesn’t perform well

Popular Use Cases

  1. Text Classification
  2. Detecting spam
  3. Sentiment analysis
  4. Aspect-based recognition
  5. Handwritten digit recognition

References

  1. Understanding SVM
  2. SVM: A Simple Explanation
  3. SVM: Understand the math
  4. SVM Theory
  5. Udemy: Machine Learning A-Z

Stupid Simple AI Series


Summary

This is the first post of my SERIES OF ARTICLES to explain all the most popular Artificial Intelligence algorithms. This is an introduction, in which I will present the roadmap of this series of articles, an overview of what’s coming. If you want to learn about AI for FREE, then please SUBSCRIBE!

Structure of this course

  1. Classification:
    1. KNN
    2. SVM and Kernel SVM
    3. Decision Tree
    4. Random Forest
    5. Isolation Forest
    6. Naive Bayes
  2. Metrics to Evaluate AI Algorithms
  3. Regression:
    1. Simple and Multi Linear Regression
    2. Polynomial Regression
    3. Decision Tree and Random Forest
    4. SVR
  4. Clustering:
    1. K-Means
    2. Mean Shift
    3. Hierarchical Clustering
    4. Density Based Clustering
    5. Expectation-Maximization Clustering using Gaussian Mixture
  5. Dimensionality Reduction:
    1. Principal Component Analysis
    2. Linear Discriminant Analysis
    3. Non-Negative Matrix Factorization
    4. Generative Topographic Maps
    5. Self Organizing Maps
    6. Factor Analysis
  6. Natural Language Processing
  7. Deep Learning:
    1. Main Concepts and Theory
    2. Introduction to Neural Networks
    3. Feed-Forward Neural Networks
    4. Recurrent Neural Networks
    5. Auto Encoders
    6. Generative Adversarial Networks
    7. Convolutional Neural Networks:
      1. AlexNet
      2. VGG
      3. GoogLeNet
      4. ResNet
      5. Inception V4
      6. SqueezeNet
      7. ENet
    8. Implemented APPLICATIONS:
      1. Colorization of Black and White Images
      2. Automatic Machine Translation
      3. Object Classification in Photographs
      4. Automatic Handwriting Generation
      5. Character Text Generation
      6. Automatic Game Playing
  8. Reinforcement Learning:
    1. Upper Confidence Bound
    2. Thompson Sampling
    3. Markov Decision Processes
    4. Q-Learning
    5. Policy Learning
    6. Deep Reinforcement Learning
  9. Boosting the AI Algorithms
  10. AI in Real Life

IMPORTANT: Each bullet point will be a link to the actual content. If you subscribe to my channel, you will get a mail each time I add a new chapter to this course!

The first article is already added, in which you can find a detailed but clear explanation of KNN or K-Nearest Neighbors algorithm. Click on the first link and learn!

 

KNN in Python

Summary

In this article you will learn about a very simple yet powerful algorithm called KNN or K-Nearest Neighbors. The first sections will contain a detailed yet clear explanation of this algorithm. At the end of this article you can find an example using KNN (implemented in Python).

KNN Explained

KNN is a very popular algorithm; it is one of the top 10 AI algorithms (see Top 10 AI Algorithms). Its popularity springs from the fact that it is very easy to understand and interpret, yet many times its accuracy is comparable to or even better than that of other, more complicated algorithms.

KNN is a supervised algorithm (which means that the training data is labeled, see Supervised and Unsupervised Algorithms), it is non-parametric and lazy (instance based).

Why is it lazy? Because it does not explicitly learn a model; it saves all the training data and uses the whole training set for classification or prediction. This is in contrast to other techniques like SVM, where you can discard all non-support vectors without any problem.

This means that the training process is very fast; it just saves all the values from the data set. The real problem is the huge memory consumption (because we have to store all the data) and the time complexity at testing time (since classifying a given observation requires a run through the whole data set). But in general it’s a very useful algorithm for small data sets (or if you have lots of time and memory) or for educational purposes.

Another important assumption is that this algorithm requires the data to be in a metric space. This means that we can define metrics for calculating the distance between data points. Defining distance metrics can be a real challenge (see Nearest Neighbor Classification and Retrieval). An interesting idea is to learn the distance metric using machine learning (mainly by converting the data to a vector space, representing the differences between objects as distances between vectors and learning those differences, but this is another topic; we will talk about it later).

The most used distance metrics are the following (a short code sketch comes after the list):

  • Euclidean Distance:

This is the geometric distance that we use in our daily life. It’s calculated as the square root of the sum of the squared differences between the coordinates of the two points of interest.

The formula in 2D space is: d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2).

  • Manhattan Distance:

Calculates the distance between real vectors using the sum of their absolute differences. Also called City Block Distance. You can imagine this as walking in a city which is organized as a matrix (or walking in Manhattan). The streets are the edges of the little squares of the matrix. If you want to go from square A to square B, you have to walk along the edges of the little squares. This is usually longer than the Euclidean distance, because you are not going straight from A to B, but in a zigzag.

The formula in 2D space is: d(p, q) = |q1 - p1| + |q2 - p2|.

  • Minkowski Distance: Generalization of the Euclidean and Manhattan distances. It is a general formula to calculate distances in N dimensions (see Minkowski Distance).
  • Hamming Distance: Calculates the distance between binary vectors (see Hamming Distance).
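Here is a short NumPy sketch of these distance metrics (the two example points are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))                  # city block distance: 3 + 4 = 7.0
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # Minkowski generalization with p = 3

print(euclidean, manhattan, minkowski)
```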

KNN for classification

Informally, classification means that we have some labeled examples (training data) and for new unlabeled examples (the test set) we want to assign labels based on what was learned from the training set.

As I said earlier, the KNN algorithm is lazy, there is no learning or generalization in the training phase. The actual work is done at classification or prediction time.

The steps of the KNN algorithm are (formal pseudocode):

  1. Initialize selected_i = 0 for all data points i in the training set
  2. Select a distance metric (let’s say we use Euclidean Distance)
  3. For each training data point i, calculate distance_i = the distance between the new data point and training point i
  4. Choose the K parameter of the algorithm (K = the number of neighbors considered); usually it’s an odd number, this way avoiding ties in majority voting
  5. For j = 1 to K, loop through the training data points and in each step select the not-yet-selected point with the minimum distance_i to the new observation
  6. For each existing class, count how many of the K selected data points are part of that class (voting)
  7. Assign to the new observation the class with the maximum count (highest vote) – this is majority voting

Ok, maybe the formal pseudocode above is a little bit hard to understand, so let’s see the informal explanation.

The main idea is that for a new observation we search for the K nearest points (with minimum distance). These points will define the class of the new observation by majority voting.

For example, if we have two classes, red and green, and after calculating the distances the 3 nearest points are 2 red and 1 green, then the class selected by majority voting is red (2 > 1).
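A minimal sketch of this nearest-neighbor voting in plain NumPy (the training points and labels below are made up just for illustration):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, new_point, k=3):
    # Step 3: Euclidean distance from the new point to every training point
    distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
    # Step 5: indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # Steps 6-7: majority voting among the labels of the K nearest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Made-up example: red and green points in 2D
X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 6]])
y_train = np.array(['red', 'red', 'red', 'green', 'green'])
print(knn_classify(X_train, y_train, np.array([1.5, 1.5]), k=3))  # -> 'red'
```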

If we have the following situation:

We have two classes, red and green and a new observation, the black star. We choose K = 3, so we will consider the 3 points with minimum distances from the new observation. The star is close to 3 red points, so it is obvious that this new observation will be classified as a red point.

In the picture above I moved the star a little bit closer to the green points. In this case we have 2 red points and 1 green point selected. The star will still be classified as red, because 2 > 1.

As we move closer to the green points, the confidence of selecting red as the label drops, until the new observation is classified as green. This is the boundary between the red class and the green class, as in the case of different countries. So from a different point of view, we can say that with this algorithm we build the boundaries of the classes. The boundaries are controlled by the value of K. A small K results in sharp boundaries, while a big K results in smooth boundaries. The most interesting and most important question is how to choose the best value of K for your specific data set; we will see how to do this in the following sections.

What I described above doesn’t necessarily mean that the KNN algorithm would always linearly compare the test samples to the training data as if it were a list. The training data can be represented with different structures, like K-D Tree or Ball Tree.

Another improvement is that we can assign weights to the attributes which are more important for the classification. This way, if we know that our data set has some important attributes to consider, we can assign them higher weights. This causes them to have a greater impact when assigning labels to new observations. The neighbors can also be weighted: either uniformly (equal dominance of all neighbors) or inversely proportional to the distance of the neighbor from the test sample. You can also devise your own weight assignment algorithm (for example using another AI algorithm to find the most important attributes and assign them higher weights).

KNN for Prediction

The KNN algorithm can also be used to predict new values. The most common example is to use KNN to predict the price of something (a house, a car, etc.) based on the training data. To predict new values, the KNN algorithm is almost the same. In the case of prediction we calculate the K most similar points (the metric for similarity must be defined by the user) and based on these points we can predict new values using formulas like the average, the weighted average, etc. So the idea is the same: define the metric to calculate the distance (here similarity), choose the K most similar points and then use a formula to predict the new value based on the selected K points.
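A small sketch of KNN used for prediction, here simply averaging the values of the K most similar points (the house sizes and prices are invented for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, new_point, k=3):
    # Find the K most similar (closest) points and average their target values
    distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    return np.mean(y_train[nearest])

# Made-up training data: house size (m^2) -> price
sizes = np.array([[50], [60], [80], [100], [120]])
prices = np.array([100000, 120000, 160000, 200000, 240000])
print(knn_predict(sizes, prices, np.array([90]), k=3))  # average of the 3 closest prices
```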

Computational Complexity

To calculate the computational complexity of KNN, let’s consider a d-dimensional space, where k is the number of neighbors and n is the total number of training data points.

To understand how we can calculate the complexity of this algorithm, please take a look at the formal pseudocode! Each distance computation requires O(d) runtime, so the third step requires O(nd) work. For each iteration in step five, we perform O(n) work by looping through the training set observations, so that step overall requires O(nk) work. The first step only requires O(n) work, so we get an O(nd + kn) runtime. We can reduce this runtime complexity to O(nd) if we use the quickselect algorithm to choose the K points with minimum distances.

How to choose the best K for your data set

You are probably wondering how you can find the best value of K to maximize the accuracy of the classification or prediction. Firstly, I have to mention that K is a hyperparameter, which means that this parameter is chosen by you, the developer. As I said earlier, you can imagine K as the parameter that controls the decision boundary. For example:

As you can see, if K = 1 the border is very sharp and zigzagged, but if K = 7 the border is smoother. So as we increase the value of K, the boundary becomes smoother. If K were infinity, everything would be blue or red, based on the overall majority.

The training error rate and the validation error rate are two quantities that we need to take into consideration when choosing the right K. If the training error is very low but the test error is high, then we are dealing with overfitting.

Overfitting happens when a model learns the details and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model.

Comparing an overfitted and a regular boundary:

The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and it is likely to have a higher error rate on new, unseen data, compared to the black line.

Underfitting refers to a model that can neither model the training data nor generalize to new data. For example using a linear algorithm on non-linear data will have poor performance.

We have to find the golden mean: a model that generalizes well to new data but does not follow the training data too closely, thus avoiding learning the noise and avoiding overfitting.

If we plot the training error rate and the test error rate, we will get diagrams like the following:

(Diagram: training and test error rates as a function of K)

As you can see, when K = 1 the training error is 0, because the closest point to any training data point is itself. If we look at the test error, when K = 1 the error is very high, which is normal because we have overfitting. As we increase the value of K, the test error rate drops until it reaches its minimum point, after which the error starts to increase again. So basically finding the best K is an optimization problem: finding the minimum value on a curve. This is called the Elbow Method, because the test error curve looks like an elbow.

The conclusion is that in order to find the best K you should use the elbow method and find the minimum on the curve. This can easily be done by brute force, by running the model multiple times, each time increasing the value of K. An efficient way to find the best K is to use K-Fold Cross Validation, but we will talk about this in the last chapter (Boosting the AI Algorithms).
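A brute-force sketch of the elbow method (it assumes the X_train, X_test, y_train, y_test split prepared in the example of the next section):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 31)
test_errors = []

for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # error rate on the test set = 1 - accuracy
    test_errors.append(1 - model.score(X_test, y_test))

plt.plot(k_values, test_errors)
plt.xlabel('K')
plt.ylabel('Test error rate')
plt.show()  # the minimum of this curve (the "elbow") suggests the best K
```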

KNN example using Python

In this example we will use the Social_Networks_Ads.csv file, which contains information about the users, like Gender, Age and Salary. The Purchased column contains the labels for the users. This is a binary classification (we have two classes). If the label is 1 it means that the user has bought product X, and 0 means that the user hasn’t bought that specific product.

Here you can download the file: Social_Network_Ads.

In this example we will use the following libraries: numpy, pandas, sklearn and matplotlib.

The first step is to import our dataset.

This is an easy task, because the pandas library contains the read_csv method which reads our data and saves it in a data structure called DataFrame.

Most of the algorithms from the sklearn library require that the attributes and the labels are in separate variables, so we have to parse our data.

In this example (because we want to represent the data in a 2D diagram) we will use only the Age and the Salary to train our model. If you open the file, you can see that the first two columns are the ID and the Gender of the user. We don’t want to take these attributes into consideration.

X contains the attributes. Because we don’t want to take the first two columns into consideration, we copy only columns 2 and 3.

The labels are in the 4th column, so we copy that column into the variable y.
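A sketch of this loading and parsing step (it assumes the CSV file is in the working directory):

```python
import pandas as pd

# Read the data set into a DataFrame
dataset = pd.read_csv('Social_Network_Ads.csv')

# Attributes: Age and Salary (columns 2 and 3); labels: Purchased (column 4)
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
```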

The next step is to split our data into two different chunks: one will be used to train our model and one will be used to test the results (the test attributes will be the new observations, and the predicted labels will be compared with the labels from the test set).

This is another easy task, because sklearn has the method train_test_split, which will split our data set, returning 4 values: the train attributes (X_train), test attributes (X_test), train labels (y_train) and test labels (y_test). A usual setup is to use 25% of the data set for testing and 75% for training. You can use another setup if you like.

Now take another look at the data set. You can observe that the values in the Salary column are much higher than in the Age column. This can be a problem, because the impact of the Salary column would be much higher. Just think about it: if you have two fairly close salaries like 10000 and 9000, the distance between them is 10000 – 9000 = 1000. Now if you take the Age column with values like 10 and 9, the difference is 10 – 9 = 1, which has a much lower impact (1 is very small compared to 1000; it’s as if the attributes were weighted).

In order to resolve this magnitude problem, we have to scale the attributes. For this we use the StandardScaler from sklearn.
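A sketch of the splitting and scaling steps:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 75% of the data for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale Age and Salary to comparable magnitudes
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```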

The next step is to train the model:

We import the KNeighborsClassifier from sklearn. This takes multiple parameters. The most important parameters are:

  1. n_neighbors: the value of k, the number of neighbors considered
  2. weights: if you want to weight the neighbors’ votes, here you can configure the weights. This takes values like uniform, distance (inverse of the distance to the new point) or a callable defined by the user. The default value is uniform.
  3. algorithm: if you want a different representation of the data, here you can use values like ball_tree, kd_tree or brute, default is auto which tries to automatically select the best representation for the current data set.
  4. metric: the distance metric (Euclidean, Manhattan, etc), default is Euclidean.

We leave all the default parameters, but for n_neighbors we will use 2 (the default is 5).
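A sketch of this training step:

```python
from sklearn.neighbors import KNeighborsClassifier

# All defaults except n_neighbors (the K of the algorithm)
classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(X_train, y_train)
```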

If you want to predict the classes of new observations, you can use the predict method of the classifier.
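A sketch of this step, continuing the code above (the single observation at the end is a made-up example):

```python
# Predict the labels of the whole test set
y_pred = classifier.predict(X_test)

# Or predict a single new observation (age 30, salary 87000), scaled like the training data
print(classifier.predict(sc.transform([[30, 87000]])))
```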

The next step is to evaluate our model. For this we will use a Confusion Matrix.
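A sketch of this evaluation step (using the y_pred from the sketch above):

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
```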

The resulting confusion matrix is:

As you can see we have only 5 False Positives and only 7 False Negatives, which is a really good result (here the accuracy is 80%).

The last step is to visualize the decision boundaries. Let’s start with the decision boundaries of the training set.

The meshgrid function creates a rectangular grid out of an array of x values and an array of y values (here x = X1 and y = X2).

The contourf method draws filled contours, so we use this method to fill the background with the color of the class associated with it. The trick is that for each pixel we make a prediction and color that pixel with the color associated with the predicted class. Here we have two classes, so we use red and green for 0 and 1.

We then loop over all the training data points and color each of them red if its label is 0 and green if its label is 1.
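A sketch of this visualization, along the lines described above, for the (scaled) training set:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

X_set, y_set = X_train, y_train

# Rectangular grid covering the (scaled) attribute space
X1, X2 = np.meshgrid(
    np.arange(X_set[:, 0].min() - 1, X_set[:, 0].max() + 1, 0.01),
    np.arange(X_set[:, 1].min() - 1, X_set[:, 1].max() + 1, 0.01))

# Color the background with the class predicted for each grid point
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.5, cmap=ListedColormap(('red', 'green')))

# Draw the training points in the color of their class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=('red', 'green')[i], label=j)

plt.title('KNN (training set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Salary (scaled)')
plt.legend()
plt.show()
```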

The result of this code is:

To visualize the boundaries of the test set, we use the same code, but changing the X_set, y_set to X_test, y_test. The result of running the code to visualize the boundaries of the test set is:

As you can also see in the figure above, we have a really good model; only a few points were wrongly classified.

Conclusions

  1. The KNN algorithm is very intuitive and easy to understand
  2. The training time is minimal; the model doesn’t actually learn or generalize
  3. The testing time can be very long, because the algorithm loops over the whole training dataset and calculates the distances (distance calculation can be hard work, depending on the type of distance metric and on the type of the dataset)
  4. For small K values the algorithm can be noise sensitive and easy to overfit
  5. The data set should be numeric, or a distance metric should exist to calculate distances between points
  6. Doesn’t perform too well on unbalanced data

References

  1. KNearestNeighbors Explained
  2. Complete Guide to KNN
  3. Overfitting and Underfitting in Machine Learning
  4. Detailed Introduction to KNN
  5. Udemy – Machine Learning from A-to-Z
  6. KNN Computational Complexity

If you LIKED this article and you want to LEARN more interesting AI algorithms for FREE, please SUBSCRIBE!

 

 

The Truth about Neural Networks

Summary: Nowadays Neural Networks and Deep Learning are everywhere; everybody is talking about deep learning. If you read an article, you will find only the advantages of neural networks: these networks are magical, they can learn everything, they can be used for everything… But are they that perfect? What are the problems with them? Are we blinded by the advertisements?

Why are they so popular? Are they that unique, something totally new?

The answer to this question is NO, the idea of neural networks is old. The basic idea was invented in 1943 and it was called threshold logic. Yes, this is the problem: “threshold logic”? Seriously? Its name is very practical and describes the mechanism of the model, but the problem is that it sounds too scientific; you cannot create good advertisements for that. (And in our era, to have success, advertisements and media are the key… everything must be magical, perfect, fancy…)
This was only a primitive version of the current neural networks, but it was the first step.

The next big step was the invention of backpropagation in 1975, but even with that, neural networks were not very popular; simple linear classifiers were more practical and more popular.

Now the question is, why are neural networks so popular today? Because today we have real computational power even in our pockets. Our smartphones have more computational power than the most powerful computer from 1975, and still, nobody is talking about the hard work and the innovations made by hardware manufacturers. You can read lots of articles exalting neural networks, but there are very few articles about hardware innovations. Why is that? Because the hardware part is invisible; it just works well (but what if the hardware doesn’t work? We notice meaningful and valuable things only when they don’t work, only when we feel their absence).

Hardware manufacturers cannot create advertisements as powerful as those for AI, because they are invisible to most people. But the real innovations are based on hardware development; without hardware inventions, Neural Networks would be nowhere. Is AI everywhere? Is AI the biggest invention of our era? What about Quantum Computing? But this is another interesting topic! I will have a follow-up discussion about that!

What is the truth about deep learning?

I think that everybody has seen a figure like this:
Yes, it seems complicated to implement something like that: you have to know lots of mathematical formulas, derivatives, error functions, etc. But I can tell you that it is not as hard as you think; it is not harder than developing a website. Just think about it: to implement a website, you have to know multiple frameworks, different design patterns, multiple programming languages, scripting, deployment, maybe load balancing and so on. It is the same in the case of neural networks: you have to know your tools to implement networks, and that’s all. Yes, it is better to have mathematical knowledge, but this is true in every engineering field.

I will have a follow-up article about useful tools (frameworks, if you want) for easily implementing neural networks. But for now, let me show you how easy it is to implement an extremely simple, yet powerful, two-layer neural network!

Let’s implement our own Neural Network!

Theory

The first step is to define the “activation function“. But what is an activation function? Yes, it has an elegant name, but basically it is just a simple function that gets an input and generates an output, usually between 0 and 1, which is the confidence of the model. The simplest activation function is the step function, which has a threshold: if the input is less than the threshold, the result is 0, otherwise the result is 1. This is about the simplest mathematical function there is; everybody can understand it. As you see, this isn’t rocket science! There are lots of powerful activation functions, but we will talk about those in later articles. For our example we will use the sigmoid function, which has the following formula: sigmoid(x) = 1 / (1 + e^(-x)).
And the derivative of the sigmoid is: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
Even if you think “Oh man, this is too hard!”, it isn’t, as you will see in the next paragraphs!

Implementation of the Sigmoid Function

How can we define the sigmoid function and its derivative? It’s simple, like this:
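A minimal NumPy version of these two functions (note that the derivative here takes the sigmoid’s output as its argument, which is the convenient form for the network below):

```python
import numpy as np

def sigmoid(x):
    # maps any input to a value between 0 and 1
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(output):
    # slope of the sigmoid, expressed in terms of its output
    return output * (1 - output)
```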

 

 

What do you think, is that so hard? I don’t think so!

Developing our two-layer network

Ok, now let’s see the code for the two-layer neural network:

The input matrix X contains the training examples (a made-up toy version is shown in the sketch below).

I initialized the weights with random values (with mean 0, which is just an optimization to get better results) and trained the model for 10000 steps. The next step is to calculate the sum of the inputs multiplied by the weights, which is basically a dot product, and that is why I used the dot() function from the numpy library. After that we apply the sigmoid function to the dot product, so we use the activation function to calculate the output, the result predicted by our model. This is the forward propagation, which means predicting results.
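A compact sketch of such a two-layer network, reusing the sigmoid and sigmoid_derivative functions defined above (the input matrix X and the target vector y are toy values chosen just for illustration):

```python
import numpy as np

# Toy training data: 4 examples with 3 input attributes each
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])

np.random.seed(1)
weights = 2 * np.random.random((3, 1)) - 1   # random weights with mean 0

for step in range(10000):
    # Forward propagation: weighted sum of the inputs, then the sigmoid activation
    l1 = sigmoid(np.dot(X, weights))

    # Error: difference between the ground truth and the prediction
    l1_error = y - l1

    # Weight the error by the slope of the sigmoid (the confidence of the prediction)
    l1_delta = l1_error * sigmoid_derivative(l1)

    # Update the weights
    weights += np.dot(X.T, l1_delta)

print(l1)  # predictions after training, close to the ground truth y
```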

Reducing the errors

To optimize our results, we have to use backpropagation, which, in the easiest case, is just comparing the results predicted by our model with the ground truth values and calculating the difference, which is basically the error of our prediction. The goal is to reduce this error, i.e. to reduce the difference between y (the ground truth) and the predicted values l1. How can we do that? Let’s look at the diagram of the sigmoid function:

When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions. Look at the sigmoid picture again! If the slope was really shallow (close to 0), then the network either had a very high value, or a very low value. This means that the network was quite confident one way or the other. However, if the network guessed something close to (x=0, y=0.5) then it isn’t very confident. We update these non-confident predictions most heavily, and we tend to leave the confident ones alone by multiplying them by a number close to 0.

Results of our network

This model gives us a very precise result, even if it is a very simple model. The output prediction vector is:
And the ground truth vector was:
As you can see 0.993 is very close to 1, and 0.007 is very close to 0, which shows that our model generates good results.
What do you think? Isn’t it easy?

Deep Learning explained

Ok, so what is the difference between this easy model and a model called Deep Learning? Yes, the difference is its sophisticated name: basically “deep” means that the model has lots of hidden layers. If you have more nodes, then the summing and the activation function are applied multiple times, and the backpropagation is a little bit harder: it uses Gradient Descent, which is another fancy name for calculating derivatives (multivariate derivatives). As you can see, Neural Networks are not as hard as you think, and there are lots of good libraries implementing Neural Networks (like sklearn). As in the case of website development, the only thing you have to know is the names of the libraries, how to use them and of course what the parameters mean (I will have another article about libraries, so follow the blog!)

What are the disadvantages of Deep Learning?

As you see, “deep” means lots of artificial neurons in the hidden layers, which comes with lots of disadvantages. Let’s see a few of them:

  1. It’s far more complicated than many other models, such as a decision tree. It’s hard to interpret and understand the weights and why they have to be like that. Weights in regression have a simple statistical meaning.
  2. It’s harder to visualize and present your NN model, in particular to a non-technical audience. A decision tree is simple for them. Regression can also be visualized.
  3. It’s easier to overfit your model.
  4. It’s harder to get confidence and prediction intervals.
  5. Long training times for deep networks, which are the most accurate architecture for most problems. This is especially true if you’re training on a CPU instead of a specialized GPU instance.
  6. Architectures have to be fine-tuned to achieve the best performance. There are many design decisions that have to be made, from the number of layers to the number of nodes in each layer to the activation functions, and an architecture that works well for some problems very often does not generalize well to others.
  7. Needs lots of data, especially for architectures with many layers. This is a problem for most ML algorithms, of course, but is especially relevant for NNs because of the vast number of weights and connections in NNs.

Conclusions

The main question is: is it ok to use a NN for every classification or regression problem? NO, don’t use NNs for everything; there are lots of simpler models like Decision Trees, KNN, Random Forests, SVM, Isolation Forests, etc. that can be used, which can give you better performance, shorter training time and better results than Neural Networks. Before you apply a NN, please take into consideration the disadvantages of NNs, analyse your problem, the platform on which you want to run the training process, the computational power that you have on that platform, the quantity of your training data and the nature of the data. How to choose the best algorithm is a very important and interesting topic, so I will have a follow-up article explaining the steps you should consider when choosing a classification or regression model.


If you liked this article, if you think that this article contains useful information, then please like it and share it, help others to find useful information!
Thank you for reading!

References:

  1. https://beckernick.github.io/sigmoid-derivative-neural-network/
  2. https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f
  3. https://iamtrask.github.io/2015/07/12/basic-python-network/