The business learns NLP (part 1): Twitter sentiment analysis with logistic regression

Jun 28, 2020

Today we will discuss predicting the sentiment of a tweet with logistic regression. The main lesson: rather than using every word in your bag of words as a separate feature to predict a tweet’s sentiment, you can reduce each tweet to only 3 features (a bias unit, a sum of positive word frequencies, and a sum of negative word frequencies) and still make accurate predictions!

For a detailed explanation of logistic regression, check my earlier blogpost.

This is a recap from Deeplearning.ai’s NLP specialization Course 1.

This article is part of a series. Continue with The Business Learns NLP — Part 2: sentiment analysis with Naive Bayes.

Nitty gritty notes for those who want to get to the bottom of it:

Note that logistic regression is part of supervised learning, hence we will use the sentiment labels of our training set (real ys) to train our model.

Step 1. Prep your data as input for your model

Step A First things first, we will create a train set and a test set, so that we can evaluate how well our model generalizes to new data.
Step B Process your tweets (def process_tweet(tweet)) → all about cleaning up your tweets and making them nice and tidy to work with!
Input: a tweet
Clean up the dirt: for each tweet, remove hash signs (#), hyperlinks, retweet markers, and handles
Once the dirt is removed, cut each tweet up into separate words (this is called tokenization) and store them in a variable (tweet_tokens)
Create an empty list in which to store all the cleaned-up words (tweets_clean)
For every word in tweet_tokens, if the word isn’t a stopword or punctuation, then reduce the word to its stem (stemming), and append the stem to tweets_clean
Output: tweets_clean, list of words containing processed tweet
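To make Step B concrete, here is a minimal sketch of process_tweet using only the Python standard library. Note the assumptions: STOPWORDS is a tiny hand-picked list and simple_stem is a crude suffix stripper, both hypothetical stand-ins. A real pipeline would typically use a library tokenizer, stopword list, and stemmer (for example from NLTK), so the stems it produces will differ from the ones here.

```python
import re
import string

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"i", "am", "is", "are", "the", "a", "an", "and",
             "to", "of", "in", "it", "this", "that"}
PUNCTUATION = set(string.punctuation)


def simple_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_tweet(tweet):
    """Clean, tokenize, filter, and stem a single tweet."""
    tweet = re.sub(r"^RT\s+", "", tweet)        # drop the retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)  # drop hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)          # drop handles
    tweet = tweet.replace("#", "")              # drop the hash sign, keep the word

    # Tokenize: words (optionally with an apostrophe part) or single symbols.
    tweet_tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", tweet.lower())

    tweets_clean = []
    for word in tweet_tokens:
        if word not in STOPWORDS and word not in PUNCTUATION:
            tweets_clean.append(simple_stem(word))
    return tweets_clean
```

With these stand-ins, process_tweet("RT @user: I am loving this #happy day! https://t.co/abc") returns ['lov', 'happy', 'day'].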
Step C Build frequency table (def build_freqs(tweets, ys)) → get the pos & neg frequencies of each word!
We now want to create an overview of all the words in our corpus (training set), and count how often each word is used in a positive or negative setting. Note: rather than looking at a word and estimating whether it is considered positive or negative, we use our training set to tell us how often each word is mentioned in a positive or negative tweet. Hence, the word ‘happy’ will probably be mentioned a lot in our positive tweets, but it might also be used in a negative tweet (e.g. “I am not happy”).
Input: tweets and real ys (real sentiment labels 1 for positive tweet, 0 for negative tweet)
Go over each tweet and its y (sentiment label), and push each tweet through your process_tweet function to get a cleaned up, tokenized version of your tweet.
With each cleaned up word of your tweet, create a pair of (word, sentiment (y))
And add a count of 1 to your frequency dictionary each time that the word with this specific label is discovered in a new tweet
If it’s a new word that hasn’t been registered, then register the word and its sentiment and assign it a frequency of ‘1’, as it has been registered for the first time
Output: a frequency dictionary that maps each (word, sentiment) pair to the number of times it occurs in the training set
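Step C can be sketched in a few lines. One assumption to flag: the process_tweet argument below is a hypothetical hook that defaults to a plain lowercase split, so the sketch runs on its own. In practice you would pass in your real cleaning function from Step B.

```python
def build_freqs(tweets, ys, process_tweet=lambda t: t.lower().split()):
    """Count how often each (word, label) pair occurs in the training set.

    process_tweet is an assumption: any function that turns a tweet into a
    list of cleaned-up words will do.
    """
    freqs = {}
    for tweet, y in zip(tweets, ys):
        for word in process_tweet(tweet):
            pair = (word, int(y))
            # New pairs start at 0 and are bumped to 1 on first sight.
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


tweets = ["happy happy day", "sad day", "not happy"]
ys = [1, 0, 0]
freqs = build_freqs(tweets, ys)
```

Here freqs[("happy", 1)] is 2 and freqs[("happy", 0)] is 1, which shows the point made above: the same word can be counted under both labels.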
Step D Extract the features (def extract_features(tweet, freqs)) → Now we will finally build our encoded version of the tweets to insert into our logreg model!
We end up with a 3-column row for each tweet: a bias unit, a sum of all positive counts of the words in that tweet, and a sum of all negative counts of the words in that tweet
Input: a tweet, and the frequency table
Create a 1 x 3 vector for each tweet and initialize it with zeros so that our function has a template in which to store the features, then set the first entry (the bias unit) to 1
For each tokenized tweet consisting of a list of words, go over each word,
If we find that particular word with the positive label in our frequency table, add its frequency (the number of times that word occurred in all positive training tweets together) to the ‘positive sentiment’ column for that tweet. Do the same with the negative label for the ‘negative sentiment’ column. If we cannot find that particular word (because it never occurred in the training set with that label), then don’t do anything
Output: a 1 x 3 feature vector for the tweet: a bias unit, a sum of positive frequencies, and a sum of negative frequencies. Stacking these rows for all training examples gives the data matrix X
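A minimal sketch of Step D in plain Python (a real implementation would more likely use NumPy arrays). As before, the process_tweet argument is a hypothetical hook with a simple default so the example is self-contained.

```python
def extract_features(tweet, freqs, process_tweet=lambda t: t.lower().split()):
    """Encode one tweet as [bias, sum of positive counts, sum of negative counts]."""
    x = [0.0, 0.0, 0.0]  # template for the three features
    x[0] = 1.0           # bias unit
    for word in process_tweet(tweet):
        # Pairs that were never seen in training simply contribute 0.
        x[1] += freqs.get((word, 1), 0)
        x[2] += freqs.get((word, 0), 0)
    return x


freqs = {("happy", 1): 2, ("happy", 0): 1,
         ("day", 1): 1, ("day", 0): 1, ("sad", 0): 1}
row = extract_features("happy day", freqs)  # [1.0, 3.0, 2.0]
```

For "happy day" the positive sum is 2 + 1 = 3 and the negative sum is 1 + 1 = 2, so the whole tweet is represented by just three numbers.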

Step 2. Train your model (def gradient_descent(X, y, theta, alpha, num_iters)) → we are now ready to train our model, come up with our optimal theta values, and define our hypothesis function to determine the sentiment of new tweets

Input: data matrix X, labels Y, initial theta parameters, learning rate alpha, number of iterations
Initialize your data matrix X with zeros
Fill your data matrix X by using our extract_features() function
Set Y to train_y (real sentiments of the tweets!)
Apply gradient descent by starting with 3 zero thetas, then repeating the following for a fixed number of iterations: calculate the cost over the whole training set (comparing how far each predicted sentiment is from its real sentiment), compute the gradient, and update the thetas. Let’s say that after 1500 loops over the data we will have arrived at our optimal thetas.
Output: our optimal theta parameters for which the cost is minimized.
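The training loop above could look roughly like this. It is a sketch in plain Python rather than the vectorized NumPy version you would normally write, and the function and argument names are assumptions. It uses the standard logistic regression cost (binary cross-entropy) and its gradient; the small eps clamp is an added safeguard against log(0).

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent; returns (cost before the last update, final theta)."""
    m = len(X)
    theta = list(theta)
    cost = None
    eps = 1e-12  # keeps log() away from 0
    for _ in range(num_iters):
        # Predictions for every training example with the current thetas.
        h = [sigmoid(sum(t * x_j for t, x_j in zip(theta, row))) for row in X]
        # Average cross-entropy between predictions and real labels.
        cost = -sum(
            y_i * math.log(max(h_i, eps)) + (1 - y_i) * math.log(max(1 - h_i, eps))
            for y_i, h_i in zip(y, h)
        ) / m
        # Gradient of the cost with respect to each theta.
        grad = [
            sum((h_i - y_i) * row[j] for h_i, y_i, row in zip(h, y, X)) / m
            for j in range(len(theta))
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return cost, theta


# Toy data: rows are [bias, positive sum, negative sum].
X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]
y = [1, 1, 0, 0]
cost, theta = gradient_descent(X, y, [0.0, 0.0, 0.0], alpha=0.1, num_iters=1500)
```

On this toy data the theta for the positive sum ends up positive and the theta for the negative sum ends up negative, which is what you would hope for. With real frequency counts in the thousands, a much smaller learning rate is needed.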

Step 3. Predict the sentiment of a new tweet (def predict_tweet(tweet, freqs, theta)) → Once we have trained our model, we are eager to know whether we can correctly predict the sentiment of a new incoming tweet!

Input: a new tweet, the frequency table, and optimal theta parameters
Extract the features of the tweet via extract_features()
Predict by applying the hypothesis (sigmoid) function to the features with the optimal thetas
Output: y_pred: the probability that the tweet is positive, ranging from 0 to 1 (values below 0.5 point to a negative tweet).
E.g.:

I am happy -> 0.518580

I am bad -> 0.494339

this movie should have been great. -> 0.515331
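Put together, prediction is just feature extraction followed by the sigmoid. The sketch below is self-contained, so it repeats a compact extract_features that uses a lowercase split as a stand-in for process_tweet. The freqs and theta values are made up for illustration and are not the trained values behind the numbers above.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def extract_features(tweet, freqs):
    """[bias, positive sum, negative sum]; split() stands in for process_tweet."""
    x = [1.0, 0.0, 0.0]
    for word in tweet.lower().split():
        x[1] += freqs.get((word, 1), 0)
        x[2] += freqs.get((word, 0), 0)
    return x


def predict_tweet(tweet, freqs, theta):
    """Probability that the tweet is positive."""
    x = extract_features(tweet, freqs)
    z = sum(t * x_j for t, x_j in zip(theta, x))
    return sigmoid(z)


# Illustrative values only (assumptions, not trained parameters).
freqs = {("happy", 1): 10, ("sad", 0): 10}
theta = [0.0, 0.1, -0.1]
p_happy = predict_tweet("happy", freqs, theta)  # about 0.73
p_sad = predict_tweet("sad", freqs, theta)      # about 0.27
```

A tweet made up entirely of unseen words gets exactly 0.5 with these thetas, because only the bias term (here 0) contributes.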

Step 4. Check performance using the test set (def test_logistic_regression(test_x, test_y, freqs, theta)) → use our test set to check performance on unseen data

Input: test_x (list of test tweets), test_y (m*1 vector with the corresponding labels for the list of test tweets), our frequency dictionary, and optimal thetas
Create empty y_hat array for storing our predictions
Go over each test tweet and get the label prediction for the tweet, y_pred (probability from 0 to 1)
If the probability is equal to or higher than 0.5, append ‘1’ to our y_hat array for a positive prediction
If the probability is lower than 0.5, append ‘0’ to our y_hat array for a negative prediction
Calculate the model’s accuracy by comparing our predictions (y_hat) to our test_y (real y labels), counting the instances where they align, and dividing by all test samples
Output: accuracy of our model
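The accuracy check boils down to the loop below. To keep the sketch standalone, it is written as a hypothetical helper compute_accuracy that takes a predict function (standing in for predict_tweet with freqs and theta already filled in) rather than the full test_logistic_regression signature.

```python
def compute_accuracy(test_x, test_y, predict):
    """Share of test tweets whose thresholded prediction matches the real label.

    predict is an assumption: any function mapping a tweet to a probability
    between 0 and 1.
    """
    y_hat = []
    for tweet in test_x:
        y_pred = predict(tweet)
        # 0.5 or higher counts as a positive prediction.
        y_hat.append(1 if y_pred >= 0.5 else 0)
    correct = sum(1 for pred, real in zip(y_hat, test_y) if pred == int(real))
    return correct / len(test_x)


# Toy predictor for illustration only.
def toy_predict(tweet):
    return 0.9 if "good" in tweet else 0.2


test_x = ["good day", "bad day", "good grief", "so so"]
test_y = [1, 0, 0, 0]
accuracy = compute_accuracy(test_x, test_y, toy_predict)  # 0.75
```

Three of the four toy tweets are classified correctly; "good grief" is the one the toy predictor gets wrong.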

Step 5. Perform error analysis

Check misclassified tweets and identify why they are misclassified.
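One simple way to start the error analysis is to collect every tweet where the thresholded prediction disagrees with the real label, together with the predicted probability. The helper name find_misclassified and the predict argument are assumptions made for this sketch.

```python
def find_misclassified(tweets, ys, predict, threshold=0.5):
    """Return (tweet, real label, predicted probability) for every wrong prediction."""
    errors = []
    for tweet, y in zip(tweets, ys):
        prob = predict(tweet)
        label = 1 if prob >= threshold else 0
        if label != int(y):
            errors.append((tweet, int(y), prob))
    return errors


# Toy predictor for illustration only.
def toy_predict(tweet):
    return 0.9 if "good" in tweet else 0.2


tweets = ["good day", "bad day", "good grief", "so so"]
ys = [1, 0, 0, 0]
errors = find_misclassified(tweets, ys, toy_predict)
# [('good grief', 0, 0.9)]
```

Reading through such a list by hand often reveals patterns, for example words whose meaning depends on their context.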

That’s it! Thanks to the folks at deeplearning.ai for this awesome lesson. See you next week with NLP & Naive Bayes!