Categorizing News from Articles

Francisca Dias

Table of Contents

In this notebook I want to build a predictor that tells me the news category of any sentence I type in.

For instance, say I want to know which category the sentence "stocks are on the rise" belongs to. If the model is built correctly, the answer should be Business.

The dataset has 422,937 news titles from various sources: Reuters, Businessweek, the Huffington Post, and so on.

Each news title has a category. It can be:

  • "b" for Business

  • "e" for Entertainment

  • "t" for Science and Technology

  • "m" for Health

I have applied two models to predict the category. Both Naive Bayes and Logistic Regression are suited to predicting categorical variables, which in this case means one of the four news categories.

They both return pretty good scores:

  • Naive Bayes scores 92% accuracy

  • Logistic Regression scores 95% accuracy

I will extract a few news titles from four websites, one for each category.

I will build a dataframe with the news titles and the category each belongs to. I will then apply a function, based on the Naive Bayes model, that predicts the category from the title.

Note:

This dataset was taken from the UCI Machine Learning Repository.

You can find this dataset here or on my github account.

In [1]:
%matplotlib inline
import pandas as pd 
In [2]:
df = pd.read_csv('uci-news-aggregator.csv')
df.head(1)
Out[2]:
ID TITLE URL PUBLISHER CATEGORY STORY HOSTNAME TIMESTAMP
0 1 Fed official says weak data caused by weather,... http://www.latimes.com/business/money/la-fi-mo... Los Angeles Times b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.latimes.com 1394470370698
In [3]:
df.CATEGORY.value_counts()
Out[3]:
e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64
In [4]:
import re
def normalize_text(s):
    s = s.lower()
    
    # remove punctuation that is not word-internal
    # (word-internal hyphens and apostrophes are kept)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    
    # collapse any double spaces we may have introduced
    s = re.sub(r'\s+', ' ', s)
    
    return s
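To see what the regexes actually do, here is a quick standalone sketch (re-declaring the function so it runs on its own): punctuation flanked by whitespace is dropped, while word-internal characters such as apostrophes survive.

```python
import re

def normalize_text(s):
    s = s.lower()
    # drop punctuation next to whitespace, keep word-internal punctuation
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse any double spaces introduced above
    s = re.sub(r'\s+', ' ', s)
    return s

print(normalize_text("Fed official says weak data caused by weather, not slowdown"))
# -> fed official says weak data caused by weather not slowdown
print(normalize_text("don't stop"))
# -> don't stop  (the apostrophe is word-internal, so it is kept)
```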
In [5]:
df['TEXT'] = [normalize_text(s) for s in df['TITLE']]
In [6]:
df.columns
Out[6]:
Index(['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME',
       'TIMESTAMP', 'TEXT'],
      dtype='object')
In [7]:
df.CATEGORY.value_counts()
Out[7]:
e    152469
b    115967
t    108344
m     45639
Name: CATEGORY, dtype: int64
In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])
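What `fit_transform` returns is a sparse document-term matrix: one row per headline, one column per unique word, sorted alphabetically. A tiny sketch on made-up headlines (illustrative data, not from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(['stocks rise', 'stocks fall on fed news'])

# 2 documents, 6 unique words
print(X.shape)                   # (2, 6)
print(sorted(vec.vocabulary_))   # ['fall', 'fed', 'news', 'on', 'rise', 'stocks']
# each row counts how often each vocabulary word appears in that document
print(X.toarray())
```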
In [9]:
# LabelEncoder allows us to assign ordinal levels to categorical data
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['CATEGORY'])
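LabelEncoder assigns integer codes by sorting the labels alphabetically, so for this dataset "b"→0, "e"→1, "m"→2, "t"→3. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# a handful of category codes, just to show the mapping
y = encoder.fit_transform(['b', 'e', 't', 'm', 'e'])

print(list(encoder.classes_))  # ['b', 'e', 'm', 't'] -- sorted alphabetically
print(list(y))                 # [0, 1, 3, 2, 1]
```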
In [10]:
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
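One caveat: with no `random_state`, the split above is different on every run, so the accuracy numbers below may vary slightly. A sketch of a reproducible, class-balanced split (toy data, not the real matrix):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = ['b'] * 6 + ['e'] * 4

# random_state fixes the shuffle; stratify keeps the class
# proportions the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_tr), len(X_te))  # 8 2
```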
In [11]:
# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(337935, 54637)
(337935,)
(84484, 54637)
(84484,)

So the x training matrix contains 337,935 observations of 54,637 features -- the latter number is the number of unique words in the entire collection of headlines. The y training vector contains the 337,935 labels associated with each observation in the x training matrix.

So we're ready to go. Let's make the classifier!

In [12]:
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(x_train, y_train)
Out[12]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [13]:
nb.score(x_test, y_test)
Out[13]:
0.92746555560816246

Nice! Over 92% accuracy, just by using word counts as independent features.

If you feel like exploring which words are characteristic of each category, you can inspect the coefficients of the Naive Bayes classifier.

A function to predict the category directly from a title:

In [14]:
def predict_cat(title):
    cat_names = {'b' : 'Business', 't' : 'Science and Technology', 'e' : 'Entertainment', 'm' : 'Health'}
    cod = nb.predict(vectorizer.transform([title]))
    return cat_names[encoder.inverse_transform(cod)[0]]
In [15]:
print(predict_cat("star was seen tv"))
print(predict_cat("stocks are on the rise"))
print(predict_cat("eggs and cholesterol"))
print(predict_cat("US equities just legged lower"))
print(predict_cat("potentially smaller corporate tax cut"))
print(predict_cat("over 23 million consumers who hold subprime"))
print(predict_cat("Roku is down 6% this morning"))
print(predict_cat("We had about nine open investigations of classified leaks"))
print(predict_cat("The last 48 hours has been quite a chaotic"))
print(predict_cat("shortly after the US open failed to spark "))
print(predict_cat("Well that escalated quickly as the USDJPY"))
print(predict_cat("the ailing foundation of the economy provides"))
Entertainment
Business
Health
Business
Business
Business
Science and Technology
Science and Technology
Entertainment
Business
Health
Business

It looks good!

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
In [17]:
# Instantiate the classifier: clf

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(x_train, y_train)
Out[17]:
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=1)
In [18]:
# Print the accuracy
clf.score(x_test, y_test)
Out[18]:
0.94976563609677567

I will take some titles from a few news websites, covering all four categories.

Let us create a pandas dataframe with two columns: the news titles and the category they came from.

In [19]:
raw_data = {
        'news_titles': [
            'US equities just legged lower', 
            'Gold Bounces Off Key Technical Support On Massive Volume', 
            'Ruble, Real Tumble As Oil Slumps On Weaker IEA Outlook', 
            'Do you have high blood pressure? You might, based on new guidelines', 
            'Five life hacks for healthy skin',
            'Nuts may protect against heart disease',
        'Apple Targets Rear-Facing 3-D Sensor for 2019 iPhone',
        'Here’s What to Do If You Find an Embarrassing Video of Yourself on Instagram',
        'Twitter Bets on New Data Business Product to Revive Revenue',
        'Chrissy Teigen’s ‘First Born Baby’ Rushed To The Hospital',
        'Khloe Kardashian Hides Growing Baby Bump Under Oversized Coat',
        'Shannon Beador House Of Horrors Sold Amid Nasty Divorce From David'],
        
    
    
    'website': ['Business', 'Business', 'Business', 
                'Health', 'Health', 'Health', 
                'Science and Technology', 'Science and Technology', 'Science and Technology',
               'Entertainment', 'Entertainment', 'Entertainment']}

df_a = pd.DataFrame(raw_data, columns = ['news_titles', 'website'])
df_a
Out[19]:
news_titles website
0 US equities just legged lower Business
1 Gold Bounces Off Key Technical Support On Mass... Business
2 Ruble, Real Tumble As Oil Slumps On Weaker IEA... Business
3 Do you have high blood pressure? You might, ba... Health
4 Five life hacks for healthy skin Health
5 Nuts may protect against heart disease Health
6 Apple Targets Rear-Facing 3-D Sensor for 2019 ... Science and Technology
7 Here’s What to Do If You Find an Embarrassing ... Science and Technology
8 Twitter Bets on New Data Business Product to R... Science and Technology
9 Chrissy Teigen’s ‘First Born Baby’ Rushed To T... Entertainment
10 Khloe Kardashian Hides Growing Baby Bump Under... Entertainment
11 Shannon Beador House Of Horrors Sold Amid Nast... Entertainment
In [20]:
df_a["Values"] = df_a["news_titles"].apply(predict_cat)
df_a
Out[20]:
news_titles website Values
0 US equities just legged lower Business Business
1 Gold Bounces Off Key Technical Support On Mass... Business Business
2 Ruble, Real Tumble As Oil Slumps On Weaker IEA... Business Business
3 Do you have high blood pressure? You might, ba... Health Health
4 Five life hacks for healthy skin Health Health
5 Nuts may protect against heart disease Health Health
6 Apple Targets Rear-Facing 3-D Sensor for 2019 ... Science and Technology Science and Technology
7 Here’s What to Do If You Find an Embarrassing ... Science and Technology Entertainment
8 Twitter Bets on New Data Business Product to R... Science and Technology Business
9 Chrissy Teigen’s ‘First Born Baby’ Rushed To T... Entertainment Entertainment
10 Khloe Kardashian Hides Growing Baby Bump Under... Entertainment Entertainment
11 Shannon Beador House Of Horrors Sold Amid Nast... Entertainment Entertainment

We did quite well!

Out of 12 titles, we only missed two.
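That hit rate can also be computed directly from the dataframe. A sketch reproducing the predictions from Out[20] above (rows 7 and 8 were the two misses):

```python
import pandas as pd

actual = (['Business'] * 3 + ['Health'] * 3 +
          ['Science and Technology'] * 3 + ['Entertainment'] * 3)
predicted = (['Business'] * 3 + ['Health'] * 3 +
             ['Science and Technology', 'Entertainment', 'Business'] +
             ['Entertainment'] * 3)

df_check = pd.DataFrame({'website': actual, 'Values': predicted})
# fraction of rows where the prediction matches the true category
accuracy = (df_check['website'] == df_check['Values']).mean()
print(f'{accuracy:.0%}')  # 83%
```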