Let’s Learn: AI Chapter 4 | Math Behind Linear Regression Explained

Prakashmani Awasthi
11 min read · Jul 12, 2024


Explore the mathematical foundations of linear regression, comparing traditional and iterative methods to build a solid understanding of this crucial AI algorithm.

Let’s Learn AI Chapter 4 (Image by Ryan Parker)

Introduction

Linear Regression is a supervised machine learning algorithm known for its simple implementation and easy-to-follow working. It is so commonly used, and relevant enough, that it often serves as a base model or starting point in ML use cases that call for a regressor. However, the algorithm comes with a set of conditions the dataset should satisfy, such as linearity, normality, homoscedasticity, independence, and no multicollinearity. To learn more about the basics, refer to Mastering Linear Regression for Predictive Analytics [1].

When I started studying machine learning, some of the problems I frequently stumbled upon were a lack of understanding of why an algorithm has certain conditions, how one decides the parameters, why a certain type of use case fits a specific algorithm, and so on. Moreover, I was fascinated by how people on the Internet were so good with these concepts, and I wondered how much knowledge it takes to build something new rather than just importing libraries all the time. If you have thought so too, then today we will deepen our understanding of the mathematics behind this algorithm and solve some example questions. Knowing how this ML algorithm works will help you utilize it to its full extent and start building the foundation to develop such algorithms yourself in the future.

Methods to approach the Linear Regression

If you have read my previous article [1], you might be wondering why I used the word “Methods” and not “Method”. If you didn’t, well… don’t worry, I’ve got you!

You see, there are two ways in which we can implement Linear Regression:
1. Using the traditional method, we directly calculate the slope and intercept, or the weight and bias, found in the Linear Regression algorithm. This is called the Traditional method.
2. The other is the Iterative method, which minimizes the same ordinary least squares cost but with the gradient descent approach: it takes in the X and y values, produces predictions ŷᵢ, evaluates them with a cost function, and then applies an optimization step that updates the weight and bias until they stop changing or a maximum number of iterations is reached.

I learned machine learning on my own, so it took me by surprise when I first encountered the traditional way of solving things in Linear Regression. Until then, I didn’t even know it existed, but you may come across this method if you are learning AI at college. Colleges use the traditional method because the datasets in question are small, while the iterative method is not reliable for a small number of iterations.

Traditional Method

The traditional method is quite straightforward: we have a formula, and we put the corresponding values in to get our answer. Simple, isn’t it? Let me give you the formula first so that you can question your existence.

Given that

ŷᵢ = w·xᵢ + b,

the formulas for the weight and bias are:

w = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)
b = (Σyᵢ·Σxᵢ² − Σxᵢ·Σxᵢyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)

Even though this thing looks menacing, it is actually simple.

Approach to solving a question

  1. Make a table and write the values of x and y.
  2. Then write the formulas for both weight and bias below the table.
  3. We now have the values of x and y and the formulas, so we can see exactly what needs to be computed for the formulas to work.
  4. Thus, we make a column each for xᵢ, yᵢ, xᵢ², and xᵢyᵢ.
  5. Once that is done, we find each column’s sum: the first column gives us the summation of xᵢ, followed by the summation of yᵢ, the summation of xᵢ², and the summation of xᵢyᵢ.
  6. It will look like the table built step by step in the Numericals section below.
  7. Now put the values into the formulas and you have your weight and bias.
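To make these steps concrete, here is a minimal Python sketch of the traditional method, written against the formulas above. The function name fit_closed_form and the use of plain Python lists are illustrative choices of mine, not code from this article:

```python
def fit_closed_form(x, y):
    """Simple linear regression using the traditional (closed-form) formulas."""
    n = len(x)
    sum_x = sum(x)                                   # Σxᵢ
    sum_y = sum(y)                                   # Σyᵢ
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # Σxᵢyᵢ
    sum_x2 = sum(xi ** 2 for xi in x)                # Σxᵢ²

    denom = n * sum_x2 - sum_x ** 2
    w = (n * sum_xy - sum_x * sum_y) / denom         # weight (slope)
    b = (sum_y * sum_x2 - sum_x * sum_xy) / denom    # bias (intercept)
    return w, b

# Using the data from the Numericals section below:
w, b = fit_closed_form([3, 9, 5, 3], [8, 6, 4, 2])
print(w, b)  # ≈ 0.167 and 4.167
```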

Derivation of the traditional formula

After helping you overcome your fear of this formula, let’s push you off the cliff again by deriving it. Remember, our goal is to understand how AI and its concepts work, so we can’t leave any stone unturned.

We know that

ŷᵢ = w·xᵢ + b

Where:
yᵢ are the observed values,
ŷᵢ are the corresponding predicted values,
xᵢ are the corresponding independent variable values,
b represents the bias,
while w is the weight.

The goal is to minimize the error. Taking the sum of squared errors, we have:

E = Σ (yᵢ − ŷᵢ)² = Σ (yᵢ − (w·xᵢ + b))²

The formula takes the difference between the original and the predicted values of y, squares it, and sums it over all data points.

Like any other mathematician, we apply partial differentiation when we see something differentiable. We need to calculate both b and w; thus, we will first partially differentiate with respect to b and then with respect to w. This will give us two equations from which to find b and w.

a. Partial differentiation with respect to b (equation 1):

∂E/∂b = −2·Σ (yᵢ − w·xᵢ − b)

b. Partial differentiation with respect to w (equation 2):

∂E/∂w = −2·Σ xᵢ·(yᵢ − w·xᵢ − b)

In partial differentiation, the variables other than the one we are differentiating with respect to are treated as constants. For example, in equation 1 above, yᵢ and w·xᵢ are treated as constants, while in equation 2, yᵢ and b are.

To proceed further, let’s equate both equations to zero, because our goal is to minimize the sum of squared errors.

To find the maximum or minimum of a function, we first differentiate it and set the derivative equal to zero to obtain the candidate values of the variable. Whether such a point is a minimum or a maximum is decided by the second derivative (positive for a minimum, negative for a maximum); here the cost is a sum of squares, so the stationary point we find is a minimum.

In equation 1, move the −2 to the other side and open the brackets; you will get

Σyᵢ − w·Σxᵢ − Σb = 0

If a constant is added to itself n times it becomes n times that constant, so Σb becomes n·b:

Σyᵢ − w·Σxᵢ − n·b = 0

Put this value in the above equation and divide the whole equation by n:

(Σyᵢ)/n − w·(Σxᵢ)/n − b = 0

Note that the sum of a variable divided by its count gives its average. So the means of yᵢ and xᵢ appear in the equation:

ȳ = (Σyᵢ)/n … equation a
x̄ = (Σxᵢ)/n … equation b

The equation now is

ȳ − w·x̄ − b = 0 … equation 3

From equation 3 we get the value of b; let’s call this equation 4:

b = ȳ − w·x̄ … equation 4

Moving back to equation 2, after setting it equal to zero and opening the brackets, we have

Σxᵢyᵢ − w·Σxᵢ² − b·Σxᵢ = 0 … equation 5

Substituting the value of b from equation 4 into equation 5:

Σxᵢyᵢ − w·Σxᵢ² − (ȳ − w·x̄)·Σxᵢ = 0

Simplify this further to get

Σxᵢyᵢ − ȳ·Σxᵢ = w·(Σxᵢ² − x̄·Σxᵢ)

In terms of w, the equation is

w = (Σxᵢyᵢ − ȳ·Σxᵢ) / (Σxᵢ² − x̄·Σxᵢ)

Now we replace ȳ and x̄ with their expanded values from equations a and b, and simplify the numerator and denominator by multiplying both by n. We end up with

w = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²) … equation 6

To get b, put the value of w from equation 6 into equation 4:

b = ȳ − w·x̄

Put the values of ȳ and x̄ from equations a and b:

b = (Σyᵢ)/n − w·(Σxᵢ)/n

and then substitute w from equation 6:

b = (Σyᵢ)/n − [(n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)]·(Σxᵢ)/n

and finally, after taking a common denominator and simplifying,

b = (Σyᵢ·Σxᵢ² − Σxᵢ·Σxᵢyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)

The final formulas we have are

w = (n·Σxᵢyᵢ − Σxᵢ·Σyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)
b = (Σyᵢ·Σxᵢ² − Σxᵢ·Σxᵢyᵢ) / (n·Σxᵢ² − (Σxᵢ)²)
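If you would rather not push the algebra around by hand, the two equations can also be solved symbolically. The sketch below uses SymPy (an extra dependency, not something used in this article) with the sums written as standalone symbols, and it recovers the same expressions for w and b:

```python
import sympy as sp

n = sp.symbols("n", positive=True)
w, b, Sx, Sy, Sxy, Sx2 = sp.symbols("w b Sx Sy Sxy Sx2")
# Sx = Σxᵢ, Sy = Σyᵢ, Sxy = Σxᵢyᵢ, Sx2 = Σxᵢ²

eq1 = sp.Eq(Sy - w * Sx - n * b, 0)      # equation 1 after dropping the −2
eq2 = sp.Eq(Sxy - w * Sx2 - b * Sx, 0)   # equation 2 after dropping the −2

solution = sp.solve([eq1, eq2], [w, b])
print(sp.simplify(solution[w]))  # equivalent to (n·Sxy − Sx·Sy) / (n·Sx2 − Sx²)
print(sp.simplify(solution[b]))  # equivalent to (Sy·Sx2 − Sx·Sxy) / (n·Sx2 − Sx²)
```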

Numericals

The question is as follows

We have x with values [3, 9, 5, 3] and y with values [8, 6, 4, 2].

Step 1: Make the table in your notebook first, containing all the values of x and y in their respective columns.

Step 2: Write the formulas below the table so we don’t miss anything.

Step 3: Make a column each for x² and xy.

Step 4: In the last row, sum the elements of each column to obtain the summation of x, y, x², and xy, and also find the square of the summation of x.

The table now becomes:

x      y      x²       xy
3      8      9        24
9      6      81       54
5      4      25       20
3      2      9        6
Σx=20  Σy=20  Σx²=124  Σxy=104     (Σx)² = 400

Step 5: Now that we have the values, we put them into the formula for b.

b = {20(124) − 20(104)} / {4(124) − 400}
b = 400/96 = 4.167

Step 6: Now we calculate the value of w.

w = {4(104) − 20(20)} / {4(124) − 400}
w = 16/96 = 0.167

Step 7: Write the final equation for the best-fit line

y = 0.167x + 4.167

If you are having a hard time remembering the formulas for both w and b, you can just remember the formula for w and then substitute its value into equation 4, b = ȳ − w·x̄. But make sure not to forget that the y and x in that formula are the mean of y and the mean of x.
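As a quick sanity check on this arithmetic, the same numbers can be reproduced with NumPy’s polyfit routine (a generic library fit, not part of the original worked example):

```python
import numpy as np

x = np.array([3, 9, 5, 3])
y = np.array([8, 6, 4, 2])

# polyfit with degree 1 returns the coefficients [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # ≈ 0.167 and 4.167, matching the table calculation
```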

The graphs will be:

[x vs y graph]
[x vs ŷ (predicted) graph]

Advantages:

  1. Exact Solution: It provides a precise solution for the regression coefficients.
  2. Fast for Small Datasets: For small to moderate-sized datasets, this method is computationally efficient and straightforward.
  3. Simple Calculation: Straightforward to implement for simple linear regression without involving iterative processes.

Disadvantages:

  1. Limited to Simple Linear Regression: This formula is specifically for simple linear regression (one predictor). For multiple predictors, the normal equation (see the note after this list) or gradient descent is used.
  2. Numerical Stability: In practice, the calculations might suffer from numerical instability for large datasets or when the predictors are highly correlated.
  3. Not Scalable: As the size of the dataset grows, the computation of sums can become infeasible.
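For completeness, the normal equation mentioned in the first point is the standard matrix form of this same least-squares solution (not something derived in this article): with X as the n × (p + 1) matrix of inputs padded with a leading column of ones, and y as the vector of targets,

θ = (XᵀX)⁻¹ Xᵀ y

where θ stacks the bias and the weights; the simple-regression formulas above are the p = 1 special case.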

Iterative Method

This is the method most commonly found when we search for the mathematical intuition behind linear regression. In this method, instead of directly calculating the values of W and b, we go through a series of steps that repeat until a specified limit is reached or an optimal value of W and b is obtained. Let’s discuss this in further detail.

  • We start by taking random values of W and b.
  • The cost function will be the mean of the sum of squared errors, or Mean Squared Error (MSE): J(W, b) = (1/n)·Σ(yᵢ − ŷᵢ)², where ŷᵢ = W·xᵢ + b.
  • Our task is to minimize this function. For this, we use an algorithm called gradient descent. The update formula is: W := W − α·dW and b := b − α·db.
  • The α in the formula stands for the learning rate; it can be anywhere from 0 to 1. It decides how fast or slowly we update the values of W and b.
  • Similar to how we partially differentiated with respect to w and b earlier, we do the same here to obtain the partial-derivative terms in the formula.
  • Let’s solve to get dW: dW = ∂J/∂W = −(2/n)·Σ xᵢ·(yᵢ − ŷᵢ)
  • Now to obtain db: db = ∂J/∂b = −(2/n)·Σ (yᵢ − ŷᵢ)
  • Now that we have these values, we just put them into the update formula: W := W − α·dW and b := b − α·db.

Another point to remember: if you want, you can define the cost function with a 2 in the denominator, i.e. J(W, b) = (1/2n)·Σ(yᵢ − ŷᵢ)², which makes it easy to cancel the 2 that the power of 2 introduces into the gradients. You may find this implementation in some places.

  • Now that we have the formulas, we put all the values of x and y into them to calculate the gradients and update W and b. The first answer will most probably be wrong, so we repeat this process until either the values of W and b stop changing or a specified number of iterations has been reached; a from-scratch sketch of this loop follows below.
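Here is that loop as a minimal from-scratch sketch in Python. The learning rate, iteration cap, and stopping tolerance are illustrative choices, not the exact settings behind the outputs shown in the next section:

```python
def fit_gradient_descent(x, y, lr=0.01, n_iters=100_000, tol=1e-10):
    """Simple linear regression trained with batch gradient descent on MSE."""
    n = len(x)
    w, b = 0.0, 0.0  # the starting values can be anything; zeros keep it simple

    for _ in range(n_iters):
        y_pred = [w * xi + b for xi in x]

        # Gradients of MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
        dw = (-2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))
        db = (-2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))

        new_w, new_b = w - lr * dw, b - lr * db

        # Stop early once the parameters effectively stop changing
        if abs(new_w - w) < tol and abs(new_b - b) < tol:
            w, b = new_w, new_b
            break
        w, b = new_w, new_b

    return w, b

w, b = fit_gradient_descent([3, 9, 5, 3], [8, 6, 4, 2])
print(w, b)  # slowly approaches ≈ 0.167 and 4.167, the closed-form answer
```

Notice how many iterations this needs compared with the one-shot traditional formula on the same four points.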

Example

Solving with this algorithm by hand takes a lot of time because we need to go through many iterations, so working them out here wouldn’t give you much insight. Therefore, I coded the algorithm from scratch so we can observe how the values are updated and after how many iterations we reach the solution. Below is the output from the program.

Output from the first 3 iterations
Output from the last 3 iterations

Well, that took a long time, but you can see we finally reached the values we calculated using method 1, which confirms that our derivation is consistent.

Important Points

  • The first values of W and b can be anything random; the learning rate is taken as 0.5 if it is not given explicitly.
  • If the learning rate is too high, the algorithm may overshoot the minimum, and if it is too low, the algorithm may take too long to converge (a tiny demo of both cases follows below).
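To see that behaviour for yourself, here is a small self-contained experiment; the data and the two learning rates are illustrative choices only:

```python
def run_gd(x, y, lr, n_iters=20):
    """Run a fixed number of gradient descent steps and return (w, b)."""
    n = len(x)
    w = b = 0.0
    for _ in range(n_iters):
        y_pred = [w * xi + b for xi in x]
        dw = (-2 / n) * sum(xi * (yi - ypi) for xi, yi, ypi in zip(x, y, y_pred))
        db = (-2 / n) * sum(yi - ypi for yi, ypi in zip(y, y_pred))
        w, b = w - lr * dw, b - lr * db
    return w, b

x, y = [3, 9, 5, 3], [8, 6, 4, 2]
print(run_gd(x, y, lr=0.001))  # too low: after 20 steps it has barely moved toward (0.167, 4.167)
print(run_gd(x, y, lr=0.1))    # too high: the updates overshoot and the values blow up
```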

Advantages:

  1. Scalable: Efficiently handles large datasets and high-dimensional data.
  2. Flexible: Can be used for simple and multiple linear regression, as well as other machine learning models.
  3. No Closed-Form Solution Needed: This does not require solving equations analytically; it approximates the solution iteratively.

Disadvantages:

  1. Approximate Solution: Convergence to the exact solution depends on the learning rate and number of iterations.
  2. Hyperparameter Tuning: Requires careful tuning of learning rate and other hyperparameters.
  3. Convergence Issues: Can converge slowly or get stuck in local minima, requiring monitoring and adjustments.

Key Differences

  • Computation:
  1. Traditional: One-time calculation using sums.
  2. Gradient Descent: Iterative updates to the model parameters.
  • Solution:
  1. Traditional: Exact solution.
  2. Gradient Descent: Approximate solution.
  • Scalability:
  1. Traditional: Limited to small datasets.
  2. Gradient Descent: Scales well with large and high-dimensional datasets.
  • Flexibility:
  1. Traditional: Limited to simple linear regression.
  2. Gradient Descent: Flexible and applicable to various regression models and other machine learning algorithms.

Conclusion

Understanding the mathematical concepts or mathematical intuition behind the various algorithms helps build a solid foundation for machine learning in your mind. It helps you visualize how its various aspects work, what the hyperparameters are, how the values are updated, how to use the algorithm to its full extent, and how to make changes to the algorithm if required by our use case. Although it may seem too overwhelming at first, once you get used to these concepts it becomes a walk in the park.
