The business learns NLP (part 1): Twitter sentiment analysis with logistic regression
Today we will discuss predicting the sentiment of a tweet with logistic regression. The main lesson is that rather than using all the words in your bag of words as separate features to predict a tweet’s sentiment, you can downsize the number of features to only 3 features per tweet, and make highly accurate predictions!
For a detailed explanation of logistic regression, check my earlier blogpost.
This is a recap from Deeplearning.ai’s NLP specialization Course 1.
This article is part of a series. Continue with The Business Learns NLP — Part 2: sentiment analysis with Naive Bayes.
Nitty gritty notes for those who want to get to the bottom of it:
Note that logistic regression is part of supervised learning, hence we will use the sentiment labels of our training set (real ys) to train our model.
Step 1. Prep your data as input for your model
- Step A First things first, we will create a train set and a test set, so that we can evaluate how well our model generalizes to new data
- Step B Process your tweets (def process_tweet(tweet)) → all about cleaning up your tweets and making them nice and tidy to work with!
- Input: a tweet
- Clean up the dirt: Look at each tweet and remove hyperlinks, retweet markers (“RT”), handles, and the # sign of hashtags (the hashtag word itself is kept)
- Once the dirt is removed, cut each tweet up into separate words (this is called tokenization) and store them in a variable (tweet_tokens)
- Create an empty array in which to store all the cleaned up words (tweets_clean)
- For every word in tweet_tokens, if the word isn’t a stopword or punctuation, then bring the word back to its stem, and append the word to tweets_clean
- Output: tweets_clean, a list of the cleaned, stemmed words of the processed tweet
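To make Step B concrete, here is a minimal sketch of process_tweet using only the Python standard library. The tiny STOPWORDS set and the crude_stem helper are hypothetical stand-ins of this sketch: a real pipeline would typically use a full stopword list and a proper stemmer (e.g. the Porter stemmer). The sketch also assumes that only the # sign of a hashtag is dropped and the word itself is kept.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (assumption of this sketch).
STOPWORDS = {"i", "am", "is", "are", "the", "a", "an", "and", "to", "of", "this", "it"}
PUNCTUATION = set(string.punctuation)

def crude_stem(word):
    """Rough stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "ly", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def process_tweet(tweet):
    """Clean a tweet and return a list of stemmed, non-stopword tokens."""
    tweet = re.sub(r"^RT\s+", "", tweet)        # old-style retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)  # hyperlinks
    tweet = re.sub(r"@\w+", "", tweet)          # handles
    tweet = tweet.replace("#", "")              # drop the sign, keep the word

    # Tokenization: runs of letters/digits/apostrophes, or single symbols.
    tweet_tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", tweet.lower())

    tweets_clean = []
    for word in tweet_tokens:
        if word in STOPWORDS or word in PUNCTUATION:
            continue
        tweets_clean.append(crude_stem(word))
    return tweets_clean

print(process_tweet("RT @bob: I am loving this #happy day! https://t.co/xyz"))
```

With this crude stemmer, "loving" is cut down to "lov"; a real stemmer would produce slightly different stems, but the idea is the same: different forms of a word end up as one entry.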
- Step C Build frequency table (def build_freqs(tweets, ys)) → get the pos & neg frequencies of each word!
- We now want to create an overview of all the words in our corpus (training set), and count how often each word is used in a positive or negative setting. Note: rather than looking at a word and estimating whether it is considered positive or negative, we use our training set to tell us how often each word is mentioned in a positive or negative tweet. Hence, the word ‘happy’ will probably be mentioned a lot in our positive tweets, but it might also be used in a negative tweet (e.g. “I am not happy”).
- Input: tweets and real ys (real sentiment labels 1 for positive tweet, 0 for negative tweet)
- Go over each tweet and each y (sentiment label)
- and push each tweet through your process_tweet function to get a cleaned up, tokenized version of your tweet.
- With each cleaned up word of your tweet, create a pair of (word, sentiment (y))
- And add a count of 1 to your frequency dictionary each time that the word with this specific label is discovered in a new tweet
- If it’s a new word that hasn’t been registered, then register the word and its sentiment and assign it a frequency of ‘1’, as it has been registered for the first time
- Output: a frequency dictionary: an overview of our words, their sentiment, and a frequency
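The counting in Step C can be sketched as follows. To keep the block runnable on its own, it takes a tokenize argument that defaults to a simple lowercase split; that parameter is an assumption of this sketch, and in the real pipeline you would pass in process_tweet instead.

```python
def build_freqs(tweets, ys, tokenize=lambda t: t.lower().split()):
    """Count how often each (word, label) pair occurs in the training tweets."""
    freqs = {}
    for tweet, y in zip(tweets, ys):
        for word in tokenize(tweet):
            pair = (word, int(y))
            # A new pair starts at 0 and becomes 1; a known pair is incremented.
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

# Toy training set: 1 = positive tweet, 0 = negative tweet.
tweets = ["i am happy", "happy happy day", "i am not happy"]
ys = [1, 1, 0]
freqs = build_freqs(tweets, ys)
print(freqs[("happy", 1)], freqs[("happy", 0)])
```

Note how "happy" ends up with a count under both labels, which is exactly the “I am not happy” situation described above.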
- Step D Extract the features (def extract_features( tweet, freqs)) → Now we will finally build our encoded version of the tweets to insert into our logreg model!
- We end up with a 3-column row for each tweet: a bias unit, a sum of all positive counts of the words in that tweet, and a sum of all negative counts of the words in that tweet
- Input: a tweet, and the frequency table
- Create a 1 x 3 vector for each tweet and initialize it with zeros so that our function has a template in which to store the features, then set the first entry (the bias unit) to 1
- Tokenize the tweet with process_tweet and go over each word in the resulting list
- If we find the pair (word, positive label) in the frequency table, add its frequency (the number of times that word occurred in the positive tweets of our training set) to the ‘positive sentiment’ column. Do the same with the pair (word, negative label) for the ‘negative sentiment’ column. If we cannot find that particular word (because it might be a stopword, or it never appeared in the training set), then don’t do anything
- Output: a 1 x 3 feature vector for the tweet. Stacking these vectors for all training examples gives data matrix X, with only 3 column features for each tweet: a bias unit, a sum of positive frequencies, and a sum of negative frequencies
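A minimal sketch of Step D, using plain Python lists instead of a numerical library so that it has no dependencies. As before, the tokenize argument (defaulting to a lowercase split) is an assumption of this sketch that stands in for process_tweet, and the small frequency table is made up for illustration.

```python
def extract_features(tweet, freqs, tokenize=lambda t: t.lower().split()):
    """Encode a tweet as [bias, sum of positive counts, sum of negative counts]."""
    x = [0.0, 0.0, 0.0]
    x[0] = 1.0  # bias unit
    for word in tokenize(tweet):
        # Words missing from the table contribute nothing.
        x[1] += freqs.get((word, 1), 0)
        x[2] += freqs.get((word, 0), 0)
    return x

# Made-up frequency table, for illustration only.
freqs = {("happy", 1): 3, ("happy", 0): 1, ("sad", 0): 2}
print(extract_features("happy but sad", freqs))  # [1.0, 3.0, 3.0]
```

The word "but" is not in the table, so it is simply skipped; "happy" feeds both columns, and "sad" only the negative one.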
Step 2. Train your model (def gradient descent (X, y, thetas, alpha, num. iter)) → we are now ready to train our model, come up with our optimal theta values, and define our hypothesis function to determine the sentiment of new tweets
- Input: data matrix X, Y, initial theta parameters, learning rate alpha, number of iterations
- Initialize your data matrix X with zeros
- Fill your data matrix X by using our extract_features() function
- Set Y to train_y (real sentiments of the tweets!)
- Apply gradient descent by starting with 3 zero thetas, calculating the cost over the whole training set (comparing how far each predicted sentiment is from its real sentiment), computing the gradient, updating the thetas, and repeating this exercise for a fixed number of iterations (let’s say that after 1500 passes over the data we will have arrived at our optimal thetas).
- Output: our optimal theta parameters for which the cost is minimized.
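The training loop of Step 2 can be sketched in plain Python as below. This is a sketch under assumptions, not the course's implementation: the toy data matrix, the learning rate of 0.1, and the small clipping constant inside the logarithm (added to avoid log(0)) are all choices made for this example. In each iteration it computes the predictions h = sigmoid(X · theta), the average cross-entropy cost, and the gradient (1/m) · Xᵀ(h − y), and then moves theta against the gradient.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression; returns (cost, theta)."""
    m = len(X)
    theta = list(theta)
    cost = None
    eps = 1e-12  # guards against log(0); an assumption of this sketch
    for _ in range(num_iters):
        # Predicted probability of "positive" for every training example.
        h = [sigmoid(sum(t * v for t, v in zip(theta, row))) for row in X]
        # Average cross-entropy cost over the whole training set.
        cost = -sum(
            yi * math.log(max(hi, eps)) + (1 - yi) * math.log(max(1 - hi, eps))
            for yi, hi in zip(y, h)
        ) / m
        # Gradient of the cost with respect to each theta.
        grad = [
            sum((hi - yi) * row[j] for hi, yi, row in zip(h, y, X)) / m
            for j in range(len(theta))
        ]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return cost, theta

# Toy data: rows are [bias, positive count sum, negative count sum].
X = [[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]]
y = [1, 1, 0, 0]
cost, theta = gradient_descent(X, y, [0.0, 0.0, 0.0], alpha=0.1, num_iters=1500)
print(cost, theta)
```

On this toy data the theta for the positive column ends up above zero and the theta for the negative column below zero, which is what you would hope for. With real frequency sums in the thousands, a much smaller learning rate is usually needed.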
Step 3. Predict the sentiment of a new tweet (def predict tweet( tweet, freqs, theta)) → Once we have trained our model, we are eager to know whether we can correctly predict the sentiment of a new incoming tweet!
- Input: a new tweet, the frequency table, and optimal theta parameters
- Extract the features of the tweet via def extract_features()
- Predict with new hypothesis (sigmoid) function with optimal thetas
- Output: y_pred: the probability that the tweet is positive, ranging from 0 to 1 (0.5 or higher points to a positive tweet, lower than 0.5 to a negative one).
I am happy -> 0.518580
I am bad -> 0.494339
this movie should have been great. -> 0.515331
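Step 3 boils down to a few lines, sketched here. The feature step is repeated in compact form only so that the block runs on its own, and the frequency table and theta values are made up for illustration (they are not the trained values behind the example predictions above). The tokenize argument is again a stand-in for process_tweet.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def extract_features(tweet, freqs, tokenize=lambda t: t.lower().split()):
    """[bias, sum of positive counts, sum of negative counts] for one tweet."""
    words = tokenize(tweet)
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1.0, float(pos), float(neg)]

def predict_tweet(tweet, freqs, theta):
    """Probability that the tweet is positive."""
    x = extract_features(tweet, freqs)
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical frequency table and thetas, for illustration only.
freqs = {("happy", 1): 3, ("happy", 0): 1, ("sad", 0): 2}
theta = [0.0, 0.5, -0.5]
print(predict_tweet("happy", freqs, theta))  # above 0.5 -> positive
print(predict_tweet("sad", freqs, theta))    # below 0.5 -> negative
```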
Step 4. Check performance using the test set (def test_logistic_regression (test_x, test_y, freqs, theta)) → use our test set to check performance on unseen data
- Input: test_x (list of test tweets), test_y (m*1 vector with corresponding labels for the list of test tweets), our frequency dictionary, and optimal thetas
- Create empty y_hat array for storing our predictions
- For each test tweet, get the prediction y_pred (probability from 0 to 1)
- If the probability is equal to or higher than 0.5, append ‘1’ to our y_hat array for a positive prediction
- If the probability is lower than 0.5, append ‘0’ to our y_hat array for a negative prediction
- Calculate the model’s accuracy by comparing our predictions (y_hat) to our test_y (real y labels), counting the instances where they align, and dividing by all test samples
- Output: accuracy of our model
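The accuracy check of Step 4 can be sketched like this. Two things are assumptions of this sketch: the helper is called evaluate_accuracy here, and it receives the prediction function as an extra predict argument so that the block runs on its own. The fake_predict stub and the four test tweets are made up; in the real pipeline you would pass in predict_tweet.

```python
def evaluate_accuracy(test_x, test_y, freqs, theta, predict):
    """Share of test tweets whose thresholded prediction matches the label."""
    y_hat = []
    for tweet in test_x:
        y_pred = predict(tweet, freqs, theta)
        # 0.5 or higher counts as a positive prediction.
        y_hat.append(1 if y_pred >= 0.5 else 0)
    correct = sum(1 for p, y in zip(y_hat, test_y) if p == y)
    return correct / len(test_y)

# Hypothetical stub predictor, for illustration only.
def fake_predict(tweet, freqs, theta):
    return 0.9 if "happy" in tweet else 0.1

test_x = ["so happy", "very sad", "happy days", "not good"]
test_y = [1, 0, 1, 1]
print(evaluate_accuracy(test_x, test_y, {}, [], fake_predict))  # 0.75
```

Three of the four stub predictions match their labels, hence an accuracy of 0.75.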
Step 5. Perform error analysis
- Check misclassified tweets and identify why they are misclassified.
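One simple way to start the error analysis is to collect every tweet whose predicted label differs from its real label, together with the predicted probability. The helper name misclassified and the stub predictor below are hypothetical, chosen for this sketch only.

```python
def misclassified(tweets, ys, predict):
    """Return (tweet, real label, predicted probability) for every wrong prediction."""
    errors = []
    for tweet, y in zip(tweets, ys):
        p = predict(tweet)
        label = 1 if p >= 0.5 else 0
        if label != y:
            errors.append((tweet, y, p))
    return errors

# Hypothetical stub: anything containing "happy" is called positive.
stub = lambda tweet: 0.9 if "happy" in tweet else 0.1

tweets = ["so happy", "i am not happy", "great day"]
ys = [1, 0, 1]
for tweet, y, p in misclassified(tweets, ys, stub):
    print(f"label={y} predicted={p:.2f} tweet={tweet!r}")
```

Here “i am not happy” is flagged because a model that only sums word counts cannot see the negation, and “great day” is flagged because the stub knows nothing about the word “great”. Reading through such a list is usually the quickest way to spot patterns like these.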
That’s it! Thanks to the folks at deeplearning.ai for this awesome lesson. See you next week with NLP & Naive Bayes!