Remember linear regression as Walmart predicting their Spectre 32" TV sales based on ad spend

Mireille Bobbert
16 min read · Jun 8, 2020



Two years ago I was hugely inspired by a machine-learning art piece by Studio Drift at the Stedelijk Museum in Amsterdam. Enthusiastically, I started Stanford’s Machine Learning course and managed to get a business case at work on natural language processing. Throughout the process of getting into machine learning I got discouraged (some concepts, like recurrent neural networks, are extremely hard to grasp, and I felt like an imposter) and wondered where exactly the challenge was in implementing machine learning models if tools do the work for you (e.g. Azure’s Machine Learning Studio or Automated ML let you implement a bunch of models and spit out the best one). I put it aside for a year, until I recently got re-inspired by ML. I realized there were so many cool things I learned during Andrew Ng’s intro to machine learning and other books and courses, things I lived and breathed at the time, that it would be a shame to lose this knowledge. Hence, I’ve decided to make a recap of some of ML’s core concepts. Who knows where this journey might take me. Let’s find out! First up: linear regression. I recently bought a house, so let’s take house hunting as an example use case.

Linear regression is a form of supervised learning

Linear regression belongs to the supervised learning branch of machine learning. First we’ll discuss univariate linear regression, meaning we’re dealing with only one input variable and one output variable. With our house hunting use case, we can ask ourselves the question: “To what extent does house size (independent variable) affect house price (dependent variable or target variable)?”


The goal of linear regression is to find the hypothesis function that best fits the historical data points

The goal of linear regression is to find the fitted line that best represents the training data points, which usually consist of historical data. It aims to draw a straight line that fits those historical data points as closely as possible, in order to predict an outcome based on one input variable. In the case of house hunting, the training data will consist of an overview of house sizes (independent variable) and the corresponding sold house prices (dependent variable). Applying univariate linear regression to determine a house price is not very realistic; when I bought a house I took various factors besides house size into account to determine my bid, like location, construction year and external living space. But for the calculation of my final bidding price I went back to basics and looked up an overview of house prices and their sizes on kadaster.nl, the website of the Dutch land registry, which records all sold house prices. Let’s pretend Kadaster was my only data source to determine my bid.


Use linear regression to predict a continuous output variable

We use linear regression to predict a continuous output variable. A continuous output variable means that the output can take any real value. For example, based on the number of square metres, we want to predict the house price, which can range from 1 euro (not very likely) to millions of euros.

Determine the fitted line that best represents the historical data

For linear regression, our fitted line is straight. You might remember from school, a long time ago, that a straight line is represented by the formula y = a + bx, where a is the point where the line crosses the y-axis and b is the slope of the line. The goal is to find the values of a and b that best represent our historical house prices. Once we have these values, we can plug any value into the hypothesis and get a predicted value. The values a and b are called parameters. In the field of machine learning these parameters are called thetas, and the hypothesis is written as h(x) = theta0 + theta1 * x. To distinguish the two parameters, we call ‘a’ theta zero and ‘b’ theta one. I agree that ‘theta’ isn’t a word that rings many bells, but for now let’s just embrace this odd naming and move on to more exciting insights.
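To make this concrete, here’s a tiny Python sketch of the hypothesis (the function name and example numbers are my own, purely for illustration):

```python
def hypothesis(theta0, theta1, x):
    """Predict an output (e.g. house price) from one input (e.g. house size)."""
    return theta0 + theta1 * x

# Made-up parameter values: a base price of 50,000 euros plus 3,000 euros
# per square metre predicts a 100 m2 house at 350,000 euros.
print(hypothesis(50_000, 3_000, 100))  # 350000
```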

I’ve now given you the hypothesis function, but how exactly can we calculate these optimal values of theta zero and theta one?

Minimize the cost J to come up with the best fitted line

In order to come up with the fitted line that best represents our historical data, we want to measure to what extent our hypothesis is actually doing a good job of fitting our data points. We use the cost function J to help us with this exercise: J(theta0, theta1) = 1/(2m) * the sum of (h(x(i)) - y(i))^2 over all m training examples. In other words, J sums the squared differences between the predicted housing prices and the actual housing prices. The higher the cost J, the worse our line fits our data. Hence, to find our best fitted line, we want to minimize the cost! Btw, the name ‘J’ doesn’t sound very intuitive either. Just remember that each line you plot to check whether it fits your historical data has a certain cost, and that cost happens to be called J.

Here is how to calculate the cost J, step by step.

Step 1. Generate the hypothesis for the specific parameter values of Theta 0 and Theta 1.

Step 2. Draw the line that belongs to those parameters.

Step 3. Calculate the difference between the predicted output value based on the hypothesis, and the actual (historical) training data point. Repeat for all training data points: see how far the line is from each training data point.

Step 4. Square those differences.

Step 5. Add these squared differences over all training data points.

Step 6. Multiply the sum by 1/(2m), where m is the number of training examples.

Step 7. The outcome is your cost J for these parameter values.

Step 8. Plot J against your parameters. A convex (bowl-shaped) function will appear.

Step 9. Repeat with new parameter values until you’ve reached the minimum cost J.
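Translated into code, those nine steps boil down to just a few lines. Here’s a minimal Python sketch (the function and variable names are my own, not the course’s):

```python
import numpy as np

def cost_J(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2) over all m examples."""
    m = len(y)                            # number of training examples
    predictions = theta0 + theta1 * x     # steps 1-2: hypothesis for every example
    errors = predictions - y              # step 3: distance from each data point
    return (errors ** 2).sum() / (2 * m)  # steps 4-6: square, sum, multiply by 1/(2m)
```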


In order to calculate the cost you need specific parameter values for theta 0 and theta 1. Let’s start with a simple example and come up with our own parameter values.

Simple use case: we’re on the dollhouse housing market.

Let’s say we have 3 historical data points about housing prices, and let’s pretend we’re into the dollhouse housing market. Last month, a house of 1 square metre was sold for 1 euro, a house of 2 square metres was sold for 2 euros, and a house of 3 square metres was sold for 3 euros. We plot these data points in a graph to see what they look like. Check the plot in the infographic below.

We then come up with values for theta 0 and theta 1 that we plug into the hypothesis. We start with theta 0 = 0 & theta 1 = 0 and calculate the cost. We plot the cost against our thetas and see what happens to it. We repeat this process for theta 0 = 0 & theta 1 = 0.5, and theta 0 = 0 & theta 1 = 1. Eventually, we come to realize that our cost J is minimized for theta 1 = 1. Hence, for theta 1 = 1, we can draw the line that best represents our historical data points. To do so, we plug this theta value into our hypothesis function and generate our fitted line h(x) = x. Let’s say we’re currently looking at a superdeluxe dollhouse in the center of Amsterdam with a view over the Amstel river. It’s 10 square metres. How much will the house price be? Plug it into the hypothesis and we’ll find out: h(10) = 10. The house price will be 10 euros!
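Here’s that dollhouse search in code, an illustrative sketch using the same cost formula as above:

```python
import numpy as np

def cost_J(theta0, theta1, x, y):               # same formula as the sketch above
    errors = (theta0 + theta1 * x) - y
    return (errors ** 2).sum() / (2 * len(y))

x = np.array([1.0, 2.0, 3.0])  # dollhouse sizes in square metres
y = np.array([1.0, 2.0, 3.0])  # sold prices in euros

for theta1 in [0.0, 0.5, 1.0]:
    print(theta1, cost_J(0.0, theta1, x, y))
# theta1 = 0.0 -> J ~ 2.33
# theta1 = 0.5 -> J ~ 0.58
# theta1 = 1.0 -> J = 0.0   (the minimum: the line h(x) = x fits the data exactly)
```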

Linear regression step by step

Learning via gradient descent automatically brings us to the lowest cost J

The goal of gradient descent is to find the parameter values you need to get to the minimum of the cost function J

As you might have noticed, to calculate the cost for a specific hypothesis, we take into account all our training examples (e.g. all our historical sold house prices). In the previous example we chose 3 sets of values for theta 0 and theta 1 (theta 0 = 0 & theta 1 = 0; theta 0 = 0 & theta 1 = 0.5; theta 0 = 0 & theta 1 = 1) and happened to find the parameter values for which the cost was minimized (theta 0 = 0 and theta 1 = 1). But for a dataset with millions of records, guessing combinations of parameter values over and over, and recalculating the cost across every single record each time, would take forever: it’s impossible to stumble on the optimal parameter values by coincidence. So we really need some help. Luckily, gradient descent comes to the rescue.

What is gradient descent?

Gradient descent is a procedure that helps us find the minimum of the cost function J, enabling us to obtain the optimal theta parameter values to represent our training data. It consists of two repeated steps: first, we calculate the change (a step, based on the gradient) to our theta parameter values; then, we update the parameter values. If gradient descent works correctly, our steps should become smaller and smaller. To calculate a step, we use the slope of J, which we get by summing the prediction errors (the differences between each training example’s predicted value and its actual value) over all training examples. We have reached the global minimum of J when its slope is close to 0.

Why do we use gradient descent?

We use gradient descent to find the parameter values of theta 0 and theta 1 for which the cost J is minimised. These parameter values enable us to create the best fitted line through the training data, which allows us to predict an output variable for new incoming input data (i.e. to predict the house price (target variable) based on a specific house size (input variable / independent variable)). We come up with a new line at each step of gradient descent, but once we reach the global minimum, we can fix the fitted line, because we have found the parameter values of theta 0 and theta 1 that best represent the historical data set.

Gradient descent step-by-step

Step 1. Calculate the hypothesis (prediction)

Step 2. Use that to calculate the cost J. Note that the cost takes into account ALL training examples (hence the hypothesis is calculated for each training example too).

Step 3. Perform the gradient descent update: using the slope of J, computed over all training examples x(i), get the updated parameter values. The sign of the slope tells you which direction to move in.

Step 4. With these new parameter values, you get a new hypothesis, which you use to calculate the cost. Check whether the cost decreased. If it did, you’re moving in the right direction: a decreased cost means the hypothesis line already represents your data better than it did with the previous parameter values.

Step 5. If everything goes well (e.g. you’ve chosen a good learning rate), gradient descent will take smaller steps over time, because the slope of J decreases as you approach the minimum. Once the slope is 0, you’ve reached the global minimum.

Update until convergence

‘Updating until convergence’ may sound a little vague. It means you keep updating your theta 0 and theta 1 values with the rule below until they barely change anymore:

  • theta_j := theta_j - alpha * (slope of J with respect to theta_j), where the slope is the derivative term. Subscript ‘j’ refers to parameter j (one theta per feature).
  • When the learning rate alpha is too large, gradient descent will overshoot the minimum, but when it is too small, it will take too long to converge. That’s why it’s necessary to choose a correct learning rate.
Gradient descent step by step
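Here is how those steps and the update rule could look in code: a minimal batch gradient descent sketch for one variable (my own illustrative code; the learning rate and iteration count are arbitrary picks):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                      # start anywhere, e.g. at zero
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y         # steps 1-2: predictions minus actuals, ALL examples
        grad0 = errors.sum() / m                   # step 3: slope of J with respect to theta0
        grad1 = (errors * x).sum() / m             # step 3: slope of J with respect to theta1
        theta0, theta1 = (theta0 - alpha * grad0,  # step 4: simultaneous update
                          theta1 - alpha * grad1)  # (both slopes computed before either theta moves)
    return theta0, theta1

# Dollhouse data from earlier: converges towards theta0 = 0, theta1 = 1
print(gradient_descent(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 3.0])))
```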

Slopes 101

Let’s look at a little slope refresher.

  • If the slope is positive, theta will be updated by a negative number and will become smaller, because the update term (alpha * positive slope) is positive and theta := theta - positive number shrinks. Gradient descent moves towards the left. This makes sense, as it is moving towards the global minimum of the cost function.
  • If the slope is negative, theta will be updated by a positive number and will become bigger, because the update term (alpha * negative slope) is negative and theta := theta - negative number grows. Gradient descent moves towards the right. This makes sense, as it is again moving towards the global minimum.
  • As gradient descent approaches the global minimum, the update term (learning rate * derivative term) becomes smaller and smaller, because the slope gets flatter with each step. You know you’re reaching the global minimum when the steps become smaller and eventually get close to 0, because the global minimum has a slope of 0.
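A quick numeric check of those three bullets, with made-up numbers:

```python
alpha = 0.1
theta = 2.0

# Positive slope: the update term alpha * slope is positive, so theta shrinks (moves left)
print(theta - alpha * 3.0)    # 1.7

# Negative slope: the update term is negative, so theta grows (moves right)
print(theta - alpha * -3.0)   # 2.3

# Near the minimum the slope flattens, so the step is much smaller
print(theta - alpha * 0.1)    # 1.99
```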

The importance of choosing your learning rate carefully


The learning rate is an important part of gradient descent, and it’s important to choose it well: a rate that’s too small will take you forever to reach the minimum, while a rate that’s too large might prevent you from finding the minimum at all.
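You can watch both failure modes by reusing the gradient_descent sketch from the step-by-step section above, on the dollhouse data (the alpha values below are arbitrary picks for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

print(gradient_descent(x, y, alpha=0.45, iterations=100))   # too large: the thetas blow up
print(gradient_descent(x, y, alpha=0.001, iterations=100))  # too small: barely moved from (0, 0)
print(gradient_descent(x, y, alpha=0.1, iterations=1000))   # about right: close to (0, 1)
```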

Some things to keep in mind regarding linear regression

  • Choose your learning rate wisely: when the learning rate is too large, you will overshoot the minimum. When the learning rate is too small, it will take too long to converge.
  • For linear regression, there will be one global minimum, as J is a convex (bowl) quadratic function.
  • Gradient descent for linear regression uses batch gradient descent. This means that for each step of gradient descent, it takes into account all training data points. This makes sense, because in order to come up with a new fitted line, it needs to know the location of each data point before it can make a good estimation of what the next fitted line should be.
  • It’s important to update theta 0 and theta 1 simultaneously, because the line adjusts as a whole. If you don’t update them at once, you won’t be able to come up with a new line, which prevents you from coming up with an updated cost, which in turn prevents you from performing gradient descent.
  • Theta 0 indicates where to cross the Y-axis, whereas theta 1 determines the slope of the line.
  • The number of iterations refers to the number of times gradient descent starts over to update the thetas. The question is how many iterations are just enough to get to the minimum. There are different ways to solve this problem; for now, let’s say we use a fixed number of iterations and assume that by the last iteration the slope is close to zero and the updates to the thetas have become incredibly tiny, bringing us close enough to the global minimum and giving us our optimal parameter values to draw the line.
  • You can reach global minimum even with a fixed learning rate, because with a decreased slope at each step of gradient descent, the updates to theta become smaller, until they won’t update anymore once you’ve reached a slope of zero.

Linear regression in action

Let’s close off with a practical example from Stanford’s Machine Learning week 2 assignment.

Linear regression practical example Stanford Machine Learning week 2

Key takeaways linear regression with one variable

Let’s recap what we’ve learned:

  • The goal of linear regression is to come up with a straight line that matches your training set (historical data points) as closely as possible
  • Once you’ve come up with this line, you are able to make predictions for new input data
  • You get to this line by plotting different lines over and over and calculating the cost J for each one. The cost J is a measure of how well your line fits the historical data points: it is the sum of the squared errors. In simple terms, it checks how far each historical data point is from its predicted value, squares those differences, sums them, and multiplies the sum by 1/(2m).
  • Because it’s infeasible to calculate the cost J for huge data sets over and over (trying new combinations of theta values each time), gradient descent comes to the rescue.
  • Gradient descent will help us find the parameter values of theta 0 and theta 1 for which the cost is minimized. Once we have these parameter values, we can draw our line and start predicting with new input values.
  • Gradient descent searches for the global minimum of the cost J by calculating the cost for different lines and working out which direction to move in to get to this global minimum. For linear regression there will always be only one global minimum, because we are dealing with a bowl-shaped function.
  • To check whether you’re approaching the global minimum, the cost J should become smaller and smaller with each iteration. The derivative term of gradient descent should also shrink, because at the global minimum the slope of J is zero.
  • Make sure you choose an appropriate learning rate for gradient descent, neither too small nor too large, as you don’t want gradient descent to last forever or to overshoot the global minimum.
  • For gradient descent to run properly, it’s important to update your parameter values theta 0 and theta 1 simultaneously, because these two values together define your new line. If you don’t update them simultaneously, you won’t be executing gradient descent on a properly updated line.

Linear regression with multiple variables (multivariate linear regression)

Let’s say that I now want to take into account the number of bedrooms, the number of floors and the age of the house, besides its size, to calculate the house price. We can take all of these features into account by using multivariate regression.

How multivariate regression compares to univariate regression

  • The cost function is the same, except that the hypothesis has changed to include more parameters.
  • Rather than having only 2 thetas to update (theta 0 for where the line crosses the y-axis and theta 1 for its slope, in univariate regression), we now perform gradient descent for n features (still taking into account all training examples at each step of gradient descent). E.g., for theta 4 (number of bedrooms, j = 4), we take the number-of-bedrooms feature values of all training examples and multiply them by the errors (hypothesis minus actual y values) of all training examples, to calculate the slope for theta 4 and subsequently the update to theta 4.
  • It is still important to update all thetas simultaneously.
  • When performing multivariate regression, it’s important to apply feature scaling and mean normalization to speed up gradient descent: the theta values descend more quickly when all features live on similar, small ranges.
  • Mean normalization means you subtract the average value of an input variable over the training set, so that the feature has zero mean. For example, if the training examples have an average house size of 1000 square metres, you subtract 1000 from the house-size feature of each training example.
  • Feature scaling makes sure your features roughly range from -1 to 1. For each feature, divide (original feature value - average) by the range (max - min) or by the standard deviation.
  • For the vectorized approach to calculating the cost J and performing gradient descent, remember that the theta vector holds all parameters (vertically), and the design matrix X holds the data of all training examples: each row holds the data of one example, and the columns represent the features. Theta 0 is the intercept, where the line crosses the y-axis. Because the theta vector has one more element than a training example’s feature vector (the features start at x1, house size, x2, number of bedrooms, etc.), we create an extra feature x0 = 1 for every example. That way the theta vector and each x-vector are the same size, and matrix multiplication is possible (see the sketch after this list).
  • If you don’t go for the vectorized approach, your hypothesis per training example is h = theta transpose * x, where x is a vector that holds one training example’s feature data.
  • You transpose theta because to multiply two column vectors, one of them needs to be transposed: theta transpose * x gives their inner product, a single number.
  • Whereas the vectorized hypothesis X * theta results in an m x 1 vector of predictions, the non-vectorized approach uses a for-loop, calculating the hypothesis one training example at a time.
  • To remember about vectorization: in tabular format, each row holds the data of 1 training example; if you look at 1 training example in vector form, its data is represented vertically.
  • Besides gradient descent, there is another method to come up with the optimal parameter values: the normal equation, theta = (X^T X)^(-1) X^T y. It solves for the optimal values of theta in one step, without feature scaling or iterating, but gradient descent is easier to comprehend. Both approaches appear in the sketch after this list.
  • The design matrix holds all your m training examples in one matrix, where each row is a training example’s feature vector, transposed.
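To tie the vectorized bullets together, here’s an illustrative multivariate sketch on a tiny made-up training set (all the numbers, and the four features, are invented for the example):

```python
import numpy as np

# Made-up training data: columns are size (m2), bedrooms, floors, age (years)
X_raw = np.array([[120.0, 3, 2, 40],
                  [ 80.0, 2, 1, 15],
                  [200.0, 5, 3, 90],
                  [ 95.0, 2, 2, 25],
                  [150.0, 4, 2, 60]])
y = np.array([400_000.0, 300_000.0, 650_000.0, 330_000.0, 480_000.0])  # prices in euros

# Mean normalization + feature scaling: (value - average) / range
X_scaled = (X_raw - X_raw.mean(axis=0)) / (X_raw.max(axis=0) - X_raw.min(axis=0))

# Design matrix: prepend x0 = 1 so each row matches the (n+1)-element theta vector
m = len(y)
X = np.hstack([np.ones((m, 1)), X_scaled])

# Vectorized batch gradient descent: every theta is updated simultaneously
theta = np.zeros(X.shape[1])
alpha = 0.1
for _ in range(5000):
    errors = X @ theta - y               # hypothesis X * theta, minus the actual prices
    theta -= alpha * (X.T @ errors) / m  # one simultaneous update for all thetas

# Normal equation: optimal theta in one step, no scaling or iterating needed
theta_direct = np.linalg.pinv(X.T @ X) @ X.T @ y
```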

That was it for linear regression! Stay tuned for a recap of logistic regression.

Source input: Machine Learning, Stanford University, weeks 1 and 2
