Imagine logistic regression as a hungry snake predicting whether an object in its vicinity is edible

Mireille Bobbert
10 min read · Jun 8, 2020


Header image: logistic regression

After the linear regression recap, it’s time for logistic regression. Whereas with linear regression we’re predicting an output that can take on any real value, logistic regression enables us to classify an item and place it in a bucket. Let’s hit it!

Logistic regression is a form of supervised learning used for classification problems

Logistic regression is a form of supervised learning

Today we’re discussing logistic regression. The main source for this article is Stanford University’s Machine Learning course, week 3.

The goal of logistic regression is to predict the probability that a certain output is positive

Applying linear regression to classification problems isn’t the best idea, for several reasons. First, the straight line of linear regression cannot place data points in buckets. Furthermore, the predicted y of a linear regression hypothesis can be much bigger than 1 or smaller than 0, while logistic regression only outputs values between 0 and 1. Last, classification is not a linear function: the values we want to predict take on only a small number of discrete values, in the binary case just 0 and 1 (e.g. spam = 1 vs. no spam = 0).

The hypothesis for logistic regression becomes a sigmoid function

The goal of logistic regression is to predict whether an input falls into a specific bucket (value = 1) or not (value = 0). Therefore, we want to find the theta parameter values that define such a logistic function. The sigmoid function is a good way to model this. The hypothesis is hence now a sigmoid function, instead of the straight line we learned previously for linear regression. Imagine the S-shape of the sigmoid like a snake.

Predicted values of our new hypothesis sigmoid function range between 0 and 1. In the case of predicting spam email, if the predicted value is 0.5 or larger we classify it as spam (1). If the predicted value is smaller than 0.5, we classify it as no spam (0).
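
As a minimal sketch (not the course’s Octave code), the hypothesis can be written in Python roughly like this; theta and x are hypothetical NumPy arrays, with x[0] = 1 acting as the intercept feature:

    import numpy as np

    def sigmoid(z):
        # squashes any real number into the (0, 1) range
        return 1.0 / (1.0 + np.exp(-z))

    def hypothesis(theta, x):
        # h(x) = g(theta^T x): the predicted probability that y = 1
        return sigmoid(theta @ x)

    theta = np.array([-3.0, 1.0])   # hypothetical parameters
    x = np.array([1.0, 4.0])        # x[0] = 1 is the intercept feature
    print(hypothesis(theta, x))     # about 0.73, so we'd classify this as y = 1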

Use binary logistic regression when the variable y that you want to predict is binary valued (takes on one of two values)

We use logistic regression to predict a variable y that is binary valued (falls into one of two buckets). Some examples:

  • does someone have COVID-19 or not?
  • is a new incoming email spam or not?
  • is a tumor malignant or not?
  • is the animal in a picture a cat or not?
  • is a payment fraudulent or not?

The hypothesis gives us the probability that our output is 1

If the hypothesis is 0.7, there is a 70% chance that our output is 1. The probability that our prediction is 0 is the complement of the probability that it is 1 (30%).
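
In the notation of the course, this interpretation reads as follows (0.7 is just the example value from above):

    h_\theta(x) = P(y = 1 \mid x; \theta) = 0.7
    P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta) = 0.3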

The decision boundary is the line that separates the area where y = 0 and where y = 1

The decision boundary separates our classes

The decision boundary is created by our hypothesis (sigmoid) function. We predict that y = 1 when z (our input to the sigmoid, i.e. theta transpose * x) is equal to or larger than 0. If the input to the sigmoid function is equal to or greater than 0, then the sigmoid g(z) will be greater than or equal to 0.5.

We predict that y = 0 when z (the input to the sigmoid, theta transpose * x) is smaller than 0. If the input to the sigmoid is smaller than 0, then the hypothesis g(z) is smaller than 0.5.
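
A small self-contained sketch of that prediction rule (the parameter values are hypothetical):

    import numpy as np

    def predict(theta, x):
        # theta @ x >= 0 is equivalent to sigmoid(theta @ x) >= 0.5,
        # so we don't need to evaluate the sigmoid for the class label
        return 1 if theta @ x >= 0 else 0

    theta = np.array([-3.0, 1.0])                 # hypothetical parameters
    print(predict(theta, np.array([1.0, 4.0])))   # z = 1  -> predict y = 1
    print(predict(theta, np.array([1.0, 2.0])))   # z = -1 -> predict y = 0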

The sigmoid function enables us to predict y = 1 if z is equal to or larger than 0 (so that g(z) is equal to or larger than 0.5)

Let’s close off with a practical example on decision boundaries.

A practical example of decision boundaries

We are dealing with non-linear decision boundaries when the boundary does not follow the shape of a straight line.
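
For example, in the spirit of the circular-boundary illustration from the course, adding squared features gives a hypothesis like:

    h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)

With theta = [-1, 0, 0, 1, 1], we predict y = 1 whenever x_1^2 + x_2^2 >= 1, so the decision boundary is a circle of radius 1 rather than a straight line.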

The cost function consists of two parts: the case that y=1, and y=0

We need to make the cost function for logistic regression convex, so that we can find a global minimum

For linear regression, the cost J is half of the average squared error (the squared difference between our predicted y and the real y). However, we face a problem with this squared cost function for logistic regression, because we now have a non-linear sigmoid function sitting inside the quadratic cost. In this case, J of theta turns out to be a non-convex (non bowl-shaped) function if you define it as a squared cost function. Unlike in our linear regression lecture, where we always reach a global minimum because J of theta is convex, plotting this J of theta would result in a non-convex function with lots of local minima. Therefore, we want to figure out a cost function that is convex, so that we are able to find a global minimum.
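
For reference, the squared cost we’re rejecting here looks like this; once h of theta is the sigmoid, this J of theta is no longer convex:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2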

It turns out we can come up with a convex cost function for logistic regression by combining the case for which y = 1 and the case for which y = 0.

With logistic regression, our cost of J consists of 2 parts
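
Written out, the two parts and the combined cost from the course are:

    \mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1
    \mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0
    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]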

Here’s why it’s costly to predict a wrong outcome. Let’s imagine you have COVID-19 (y=1), and the doctors predict you’re healthy (hyp = 0). You’ll walk out of the doctor’s office thinking you’re fine while you’re not. Consequently, the cost will be super high: you’ll go outside, have walks in the park, and infect lots of other people. If you have the disease (y=1) and doctors confirm so (hyp = 1), you’ll stay inside for 2 weeks, suck it up, and you’re probably going to be OK. The cost will be zero. On the other hand, let’s imagine you’re healthy (y=0) but doctors predict you’re ill (hyp = 1). You come home depressed thinking you’re deprived of having walks in the park, enjoying the sun, saying hi to a friend, and going for a run in fresh air. The cost is immensely high. If you are healthy (y=0) and doctors confirm so (hyp = 0), everything’s good and the cost is zero.

Wrongly predicting an outcome has terrible consequences

Gradient descent for logistic regression

The gradient descent update rule is the same as the one for linear regression; only the hypothesis inside it is now the sigmoid. Here’s an overview of the non-vectorized and vectorized implementation.

Gradient descent for logistic regression
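
A minimal vectorized sketch in Python (hypothetical variable names; X is assumed to already contain a column of ones for the intercept):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, alpha=0.1, iterations=1000):
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            # same update form as linear regression, but the hypothesis
            # is now sigmoid(X @ theta) instead of X @ theta
            gradient = X.T @ (sigmoid(X @ theta) - y) / m
            theta -= alpha * gradient
        return theta

    # hypothetical tiny dataset: an intercept column plus one feature
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(gradient_descent(X, y))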

Multiclass classification (one vs all) uses binary classification for each class, predicting the probability that ‘y’ is a member of that category

For multiclass classification, we perform binary classification for each class that we’re predicting. This means that if we have k classes, we will train k different logistic regression classifiers, each predicting the probability that the example belongs to its class. We use the hypothesis that gives us the highest value across the different classes as our prediction. Let’s say we’re classifying email into ‘friends’, ‘family’, and ‘work’. A new email comes in, and the respective hypothesis outcomes are 0.7, 0.5, and 0.6; then we’ll predict the incoming email as ‘friends’, as the hypothesis has the highest value for this category.
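
A rough sketch of the prediction step (the classifiers and their parameters here are hypothetical):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_one_vs_all(thetas, x, labels):
        # one trained theta vector per class; pick the class whose
        # classifier outputs the highest probability
        probabilities = [sigmoid(theta @ x) for theta in thetas]
        return labels[int(np.argmax(probabilities))]

    # hypothetical parameters for 'friends', 'family' and 'work'
    thetas = [np.array([0.8, 0.1]), np.array([0.0, 0.0]), np.array([0.4, 0.0])]
    x = np.array([1.0, 0.5])
    print(predict_one_vs_all(thetas, x, ['friends', 'family', 'work']))  # 'friends'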

Potential problems that you might run into when forming a hypothesis are overfitting and underfitting

With underfitting, the model is too simple for the trend of the data

Underfitting happens when your hypothesis doesn’t represent your training data well enough; it’s too simple and has too few features. It’s a poor fit to the training data. It’s paired with high bias: the model thinks it knows the data very well while in fact it doesn’t. In a university setting, you can compare it to a cocky or lazy classmate who’s confident of knowing the course material but only flipped through a course summary minutes before D-day. He gets to the exam and realises he (or she) knows sh*t. He was under the impression of having it covered while that wasn’t the case.

The difference between underfitting and overfitting

With overfitting, the model is too complex and therefore poorly generalizes to predict on new examples

With overfitting, your hypothesis doesn’t generalize well to new examples outside of your training set. It clings too tightly to your training data. This problem occurs when you’re dealing with many features, and your hypothesis becomes a highly flexible line with too many unnecessary curves. In fact, the hypothesis has become too complicated. In a university setting, you can compare it to having studied your ass off for several days and knowing every tiny little detail for an exam but then failing anyway: rather than a simple multiple choice exam, the professor asks you to put the knowledge to practice via open-ended answers. You’re unable to, as you realize you didn’t really understand the subject. You can’t ‘go with the flow’ and apply the learned knowledge in different ways.

You can address overfitting by reducing the number of features, or by applying regularization. The problem with reducing the number of features is that we throw away information about the problem, which might be disadvantageous. Regularization is a better proposal: it means we keep all the features, but reduce the magnitude of the theta parameters. Applying regularization is an especially good idea when we really don’t want to get rid of features because they all have some predictive power.

Regularization is the key to success when we don’t want to get rid of features that are all slightly useful

The effect of regularization on the theta parameters

Regularization makes sure that some of our parameters have less influence on the hypothesis than they do in reality. The result is a simpler hypothesis that is less prone to overfitting.

We can reduce all theta parameters at once, by using a regularization parameter called lambda. Lambda helps us regulate to what extent our theta parameters are inflated, and controls the tradeoff between two different goals. On the one hand, we want to find a good fit for our training data (by keeping all parameters). On the other hand, we want to keep the theta parameters small.
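
Concretely, regularization adds a penalty term to the cost function, scaled by lambda (by convention theta_0 is not penalized):

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2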

Choose your regularization parameter wisely

The regularization parameter can have a large influence on your hypothesis. If lambda is very large, we will penalize the thetas a lot. We end up with those parameters being extremely small, and it’s as if we’re crossing out those parts of the hypothesis. This would result in fitting a straight horizontal line through the data, and we would end up with a model that is too simple (underfitting).

Therefore it is important to choose lambda well. Remember that lambda is used to fight against overfitting:

  • if lambda is too large, it will result in underfitting
  • if lambda is too small, it won’t be a remedy to the problem of overfitting, because you will practically end up with the old hypothesis

Practical example: let’s predict whether a student is admitted to Stanford based on 2 exams

Let’s look at a practical example: Stanford’s Machine Learning Octave assignment 2 of week 3. We are the administrator of a university and have to come up with a model that determines the chance of a student being admitted based on the scores of two exams. We have historical admission data.

Octave assignment 2, week 3, Stanford University Machine Learning
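
This is not the Octave assignment itself, but a rough Python sketch of the same idea using scikit-learn, with made-up exam scores and admission labels:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # made-up historical data: [exam 1 score, exam 2 score], admitted (1) / rejected (0)
    X = np.array([[34.0, 78.0], [60.0, 86.0], [79.0, 75.0], [45.0, 56.0],
                  [61.0, 96.0], [75.0, 46.0], [30.0, 43.0], [84.0, 43.0]])
    y = np.array([0, 1, 1, 0, 1, 1, 0, 1])

    model = LogisticRegression()
    model.fit(X, y)

    # probability of admission for a new student scoring 45 and 85
    print(model.predict_proba([[45.0, 85.0]])[0][1])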

Let’s recap what we’ve learned about logistic regression

  • Logistic regression is suited to classify items into buckets: is an email spam or not, does someone have COVID-19 or not?
  • Hypothesis values range between 0 and 1 and the hypothesis function becomes a sigmoid function. The hypothesis gives us the probability that our output is 1.
  • If the input to sigmoid (z) is equal or larger than 0, then the output is equal or larger than 0.5, and we predict y = 1.
  • If the input to sigmoid (z) is smaller than 0, then the output is smaller than 0.5, and we predict y = 0.
  • The decision boundary divides items into groups (2 groups for binary logistic regression, or more groups for multiclass classification)
  • The cost function for logistic regression consists of 2 parts: the case for which y = 1, and y = 0.
  • Cost goes to infinity when we confidently predict the wrong outcome, e.g. when we predict someone has COVID-19 while they don’t (because they will be deprived of going outside for no reason), or when we don’t predict that they have it while they do (because they will infect other people). The cost function captures this nicely.
  • Potential problems when forming a hypothesis are overfitting and underfitting. We can fight overfitting by adding the regularization parameter lambda to the cost function, which helps reduce the influence of each feature on the hypothesis. Underfitting can be addressed by adding more features.

That was it! Next up: neural networks!

PS. This is a recap from my notes from Stanford’s Machine Learning course, week 3.

Written by Mireille Bobbert

Don’t let knowledge go to waste.