Francisca Dias
In this notebook I want to build a predictor that tells me the news category of a sentence just by typing it in.
For instance, say I want to know which category the sentence "stocks are on the rise" belongs to. If the model is built correctly, the answer should be Business.
The dataset has 422,937 news titles from various sources: Reuters, Businessweek, Huffington Post and so on.
Each news title has a category. It can be one of:
"b" for Business;
"e" for Entertainment;
"t" for Science and Technology;
"m" for Health.
I have applied two models to predict the category. Both Naive Bayes and Logistic Regression are suited to predicting a categorical variable, which in this case is one of the four news categories.
They both return pretty good scores:
Naive Bayes scores 92% accuracy
Logistic Regression scores 95% accuracy
I will then extract a few news titles from four websites, one per category, build a dataframe with the titles and the category they belong to, and apply a function based on the Naive Bayes model that predicts the category from the title.
%matplotlib inline
import pandas as pd
df = pd.read_csv('uci-news-aggregator.csv')
df.head(1)
df.CATEGORY.value_counts()
import re
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
df['TEXT'] = [normalize_text(s) for s in df['TITLE']]
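To see what the cleaning step actually does, here it is applied to a made-up headline (the function is repeated so the snippet runs on its own; the input string is just an illustration, not a row from the dataset):

```python
import re

def normalize_text(s):
    s = s.lower()
    # strip punctuation that sits next to whitespace, keeping word-internal marks
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse any double spaces we introduced
    s = re.sub(r'\s+', ' ', s)
    return s

print(normalize_text('Gold Bounces Off "Key" Support'))
# gold bounces off key support
```

Note that the quotes are stripped because they touch whitespace, while an apostrophe inside a word like "fed's" would survive.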
df.columns
df.CATEGORY.value_counts()
from sklearn.feature_extraction.text import CountVectorizer
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])
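As a quick illustration of what fit_transform produces, here is a toy corpus (two made-up sentences, not rows from the dataset): every unique word becomes a column, and each row counts that word's occurrences in one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks are on the rise", "stocks fell sharply"]
vec = CountVectorizer()
mat = vec.fit_transform(docs)

# the vocabulary maps each word to a column index
print(sorted(vec.vocabulary_))  # ['are', 'fell', 'on', 'rise', 'sharply', 'stocks', 'the']
print(mat.toarray())
```

On the full dataset this same call builds the 54,637-column sparse matrix used below.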
# LabelEncoder converts the category labels into integer codes
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['CATEGORY'])
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
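One caveat about the split above: it is random, so the exact accuracy scores will vary slightly between runs; passing a fixed random_state makes the split reproducible. A minimal illustration on toy data:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
# a fixed random_state gives the same 80/20 split on every run
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), len(test))  # 8 2
```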
So the x training vector contains 337,935 observations of 54,637 features -- this latter number is the number of unique words in the entire collection of headlines. The y training vector contains the 337,935 labels associated with each observation in the x training vector.
So we're ready to go. Let's make the classifier!
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test)
Nice! Over 92% accuracy, just by using words as independent features.
If you feel like exploring what words are characteristic of each category, you can pull out the coefficients of the Naive Bayes classifier:
A function to predict the category from a raw title:
def predict_cat(title):
    cat_names = {'b': 'Business', 't': 'Science and Technology', 'e': 'Entertainment', 'm': 'Health'}
    cod = nb.predict(vectorizer.transform([title]))
    return cat_names[encoder.inverse_transform(cod)[0]]
print(predict_cat("star was seen tv"))
print(predict_cat("stocks are on the rise"))
print(predict_cat("eggs and cholesterol"))
print(predict_cat("US equities just legged lower"))
print(predict_cat("potentially smaller corporate tax cut"))
print(predict_cat("over 23 million consumers who hold subprime"))
print(predict_cat("Roku is down 6% this morning"))
print(predict_cat("We had about nine open investigations of classified leaks"))
print(predict_cat("The last 48 hours has been quite a chaotic"))
print(predict_cat("shortly after the US open failed to spark "))
print(predict_cat("Well that escalated quickly as the USDJPY"))
print(predict_cat("the ailing foundation of the economy provides"))
It looks good!
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(x_train, y_train)
# Print the accuracy
clf.score(x_test, y_test)
I will take some titles from the following news websites:
Business: http://www.zerohedge.com/
Health : https://www.medicalnewstoday.com/
Science and Technology : https://www.bloomberg.com/technology
Entertainment : http://radaronline.com/
Let us create a pandas dataframe where I will have two columns: News Titles and Category, taken from the websites above.
raw_data = {
'news_titles': [
'US equities just legged lower',
'Gold Bounces Off Key Technical Support On Massive Volume',
'Ruble, Real Tumble As Oil Slumps On Weaker IEA Outlook',
'Do you have high blood pressure? You might, based on new guidelines',
'Five life hacks for healthy skin',
'Nuts may protect against heart disease',
'Apple Targets Rear-Facing 3-D Sensor for 2019 iPhone',
'Here’s What to Do If You Find an Embarrassing Video of Yourself on Instagram',
'Twitter Bets on New Data Business Product to Revive Revenue',
'Chrissy Teigen’s ‘First Born Baby’ Rushed To The Hospital',
'Khloe Kardashian Hides Growing Baby Bump Under Oversized Coat',
'Shannon Beador House Of Horrors Sold Amid Nasty Divorce From David'],
'website': ['Business', 'Business', 'Business',
'Health', 'Health', 'Health',
'Science and Technology', 'Science and Technology', 'Science and Technology',
'Entertainment', 'Entertainment', 'Entertainment']}
df_a = pd.DataFrame(raw_data, columns = ['news_titles', 'website'])
df_a
df_a["Values"] = df_a["news_titles"].apply(predict_cat)
df_a
We did very well!
Out of 12 titles, we only missed two.