Support Vector Machines Deep Intuition PART-II (Math)

Vishnu vardhan Varapalli
5 min read · Aug 3, 2021

This post is a continuation of Part I. Make sure you've gone through that article before proceeding.

Let’s see a diagram:-

[Diagram: SVM hyperplane with marginal lines — image from ResearchGate]

Math:-

We know that, for Logistic regression, the equation of the line that’ll be formed is,

y = transpose(W)X + b    …(1)

As the line slopes down towards the bottom right, let's take the slope = -1 (if you check with real coordinates on a graph, you'll get a negative value; I assume slope = -1 to keep the calculations simple).

Also, consider that the line passes through the origin (0, 0) (again, to simplify the calculations). Then (1) becomes,

y = transpose(W)X (when the line passes through the origin, the b value becomes zero)    …(2)

Now, let's consider the point X1 = (-4, 0), which lies on the left side of line (2).

Substituting (-4, 0) into (2),

transpose(W) = [ -1  0 ] (where -1 is the slope and 0 is just a placeholder for the calculation; you can keep any value there), written as a row vector

X1 = transpose([ -4  0 ]), written as a column vector

Now, when we substitute the values (it's a simple matrix multiplication), (2) becomes [ (-1 × -4) + (0 × 0) ], so the y value is 4, which is a +ve number. (y = 4)
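If you prefer to see the arithmetic run, here is a minimal NumPy sketch of that substitution (W and X1 are the illustrative values chosen above, not learned parameters):

```python
import numpy as np

# Illustrative values from the text, not trained weights.
W = np.array([-1, 0])      # transpose(W) = [ -1  0 ]
X1 = np.array([-4, 0])     # a point on the left side of the line

y = W @ X1                 # (-1 * -4) + (0 * 0)
print(y)                   # 4 -> positive, so X1 falls on the +ve side
```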

Whatever coordinates you pick on the left side of line (2), the result will always be +ve. So, to keep the calculations uniform, let's fix the value y = +1 for that side.

Then (1) becomes,

transpose(W)X1 + b = 1    …(3)

Similarly, if we take a point X2 on the other (right) side of the line and repeat the above process, we'll end up with the equation

transpose(W)X2 + b = -1    …(4)

Subtracting (4) from (3),

transpose(W)[ X1 - X2 ] = 2    …(5)
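Spelled out in standard notation, that subtraction step (note how the b terms cancel) is:

```latex
% Subtracting (4) from (3); the b terms cancel:
\begin{aligned}
(W^{T}X_{1} + b) - (W^{T}X_{2} + b) &= 1 - (-1)\\
W^{T}(X_{1} - X_{2}) &= 2
\end{aligned}
```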

Now, we need to divide (5) by the norm of W, written ||W||.

Note:- W, like a complex number x + iy, is a vector in the Euclidean plane; since W possesses magnitude & direction in the Euclidean plane, the quantity square_root(x² + y²) is the Euclidean norm of W.

Now, dividing (5) by ||W||,

transpose(W) / ||W|| * ( X1 - X2 ) = 2 / ||W||    …(6)

Here, 2 / ||W|| is our optimization objective: it is the width of the margin, and it is the quantity we want to maximize.
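As a quick sanity check, here is a minimal NumPy sketch that computes ||W|| and the margin width 2 / ||W|| for the toy W used earlier (illustrative values only, not trained weights):

```python
import numpy as np

# Toy W from the example above; not a trained weight vector.
W = np.array([-1.0, 0.0])

norm_W = np.linalg.norm(W)     # Euclidean norm: sqrt((-1)^2 + 0^2) = 1.0
margin_width = 2.0 / norm_W    # the quantity we want to maximize

print(norm_W, margin_width)    # 1.0 2.0
```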

To get the best hyperplane, we need (6) to take its maximum value. We also observed in the process above that y is +ve if transpose(W)X + b ≥ 1, and -ve if transpose(W)X + b ≤ -1.

So what we can say is: if we multiply y and transpose(W)X + b, the product should be +ve for a correct prediction and -ve for a wrong prediction. [Check the above paragraph if in doubt.]

This is how predictions will be made.
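As a rough sketch (W, b, and the sample point below are made-up illustrative values, not trained parameters), the prediction rule looks like this:

```python
import numpy as np

# Made-up parameters, only for illustration.
W = np.array([-1.0, 0.0])
b = 0.0

def predict(x):
    """Predict +1 or -1 from the sign of transpose(W)x + b."""
    return 1 if (W @ x + b) >= 0 else -1

# For a correct prediction, y_true * (transpose(W)x + b) is +ve.
x, y_true = np.array([-4.0, 0.0]), 1
print(predict(x), y_true * (W @ x + b) > 0)   # 1 True
```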

So, for now,

Things to keep in mind, as a rough summary:

  1. There will be many hyperplanes that can give 100% accuracy.
  2. We need to select a hyperplane in such a way that the distance of closest points from the hyperplane should be maximum.
  3. The distance between the closest points (the ones lying on the marginal lines) and the hyperplane is calculated as the perpendicular projection of the point onto the hyperplane (see the sketch after this list).
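The perpendicular distance in point 3 works out to |transpose(W)x + b| / ||W||; here is a minimal sketch with made-up values:

```python
import numpy as np

def distance_to_hyperplane(x, W, b):
    """Perpendicular distance from point x to the hyperplane transpose(W)x + b = 0."""
    return abs(W @ x + b) / np.linalg.norm(W)

# Made-up hyperplane and point, only for illustration.
W = np.array([-1.0, 0.0])
b = 0.0
x = np.array([-4.0, 0.0])

print(distance_to_hyperplane(x, W, b))   # 4.0
```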

Now, before proceeding further, we need to discuss a concept called margins.

SVMs are built of 2 types of margins:-

  1. Hard margins
  2. Soft margins

With a hard margin, the hyperplane (sitting in the middle of the marginal lines) is selected as the final hyperplane when the data is linearly separable.

When the data is linearly separable and we don't want any misclassifications, we use SVMs with a hard margin. Hard margins are sensitive to outliers & noise in the data.

However, for non-linearly separable data, we can allow some misclassifications in the hope of achieving better generalization; in that case we use a soft margin for our classifier. Note that soft-margin SVMs are less affected by outliers.

For hard margins, i.e., for linearly separable data, the optimization function is,

Optimization function for linear classifiers = MAX(2 / ||W||), subject to y(transpose(W)X + b) ≥ 1 for every training point    …(7)
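As a practical sketch (the toy data and the exact C value below are assumptions, not something from this article), a near hard-margin linear SVM can be built with scikit-learn by choosing a linear kernel and a very large C, so margin violations are effectively forbidden:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X = np.array([[-4, 0], [-3, 1], [-2, -1], [2, 1], [3, -1], [4, 0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates a hard margin: misclassifications are (almost) not tolerated.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

W = clf.coef_[0]
print("margin width ≈", 2 / np.linalg.norm(W))
print("support vectors:\n", clf.support_vectors_)
```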

  • In the real world, the data will not be linear in nature. There will be many overlapping data points and many characteristics, which can lead to overfitting (I hope you know about overfitting).

To deal with non-linear data, i.e., soft-margin classifiers, the Zeta(i) and C factors come into the picture in the optimization function (read ahead to understand 😊).

Let’s think while training with the data,

C -> represents the number of errors your hyperplane can tolerate and still be considered as the final hyperplane.

Let's say we've assigned C a value of 5, which means up to 5 errors can be tolerated; if more than 5 errors are encountered, we'll change the hyperplane.

Zeta(i) -> represents the error of the i-th point (like the error terms in linear regression); the optimization uses the sum of these errors.

Now, the Optimization function becomes,

Optimization function = MIN[ ||W||/2 + C * sum(Zeta(i)) ]    …(8)
(minimizing ||W||/2 is the same as maximizing 2 / ||W||, so this is consistent with (7))

(8) represents the optimization function for non-linear data, i.e., soft-margin classifiers. But this is not the end of the story for non-linear data, as kernels come into the picture (we'll discuss them later).
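A hedged sketch of the soft-margin idea (the toy data and C = 1.0 are assumptions for illustration): fit a linear SVM on slightly overlapping data and compute the slack Zeta(i) = max(0, 1 - y_i(transpose(W)x_i + b)) for every point.

```python
import numpy as np
from sklearn.svm import SVC

# Slightly overlapping toy data (made up), so a hard margin is impossible.
X = np.array([[-4, 0], [-3, 1], [-1, 0], [0.5, 0], [1, -1], [3, 0], [4, 1]])
y = np.array([1, 1, 1, -1, 1, -1, -1])   # note the overlapping +1 point at (1, -1)

# Moderate C: some margin violations (non-zero slack) are tolerated.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

W, b = clf.coef_[0], clf.intercept_[0]
slack = np.maximum(0, 1 - y * (X @ W + b))   # Zeta(i) for every point
print("sum of slack:", slack.sum())
```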

According to the optimization function, the hyperplane will be selected: the hyperplane that gives the best value of the objective (maximum margin with minimum total error) is the one finalized for the model.

This is how the Model will get built with support vector machines.

For now, you can confidently say, “I know the math & stats behind how a model gets built in support vector machines with linearly separable data.” Yes, real-world data will be completely different and non-linear.

So, to deal with non-linear data, we have SVM kernels. In PART-3, we'll discuss the different types of SVM kernels, the math behind them, and the code implementation of SVMs.

That’s all for now. See you in the next Article😁.

Thank you & Happy Learning✌.

If something is missing, let me know by contacting me from here. You can follow me on different platforms: LinkedIn, Github, and Medium.

