Francisca Dias
In this notebook I want to build a predictor that tells me the news category of a sentence just by typing it in.
For instance, say I want to know which category the sentence "stocks are on the rise" belongs to. If the model is built correctly, the answer should be Business.
The dataset has 422,937 news titles from various sources: Reuters, Businessweek, Huffington Post and so on.
Each news title has a category. It can be one of:
"b" for Business;
"e" for Entertainment;
"t" for Science and Technology;
"m" for Health.
I have applied two models to predict the category. Both Naive Bayes and Logistic Regression are suited to predicting a categorical variable, which in this case is one of the four news categories.
They both return pretty good scores:
Naive Bayes scores 92% accuracy
Logistic Regression scores 95% accuracy
I will then extract a few news titles from four websites, one per category, build a dataframe with the titles and the category they belong to, and apply a function based on the Naive Bayes model that predicts the category from the title.
%matplotlib inline
import pandas as pd
df = pd.read_csv('uci-news-aggregator.csv')
df.head(1)
df.CATEGORY.value_counts()
import re
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
df['TEXT'] = [normalize_text(s) for s in df['TITLE']]
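To see what the cleaning step actually does, here it is applied to a made-up headline (the function is repeated so the snippet runs on its own; the input string is just an illustration, not a row from the dataset):

```python
import re

def normalize_text(s):
    s = s.lower()
    # strip punctuation that sits next to whitespace, keeping word-internal marks
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse any double spaces we introduced
    s = re.sub(r'\s+', ' ', s)
    return s

print(normalize_text('Gold Bounces Off "Key" Support'))
# gold bounces off key support
```

Note that the quotes are stripped because they touch whitespace, while an apostrophe inside a word like "fed's" would survive.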
df.columns
df.CATEGORY.value_counts()
from sklearn.feature_extraction.text import CountVectorizer
# pull the data into vectors
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(df['TEXT'])
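As a quick illustration of what fit_transform produces, here is a toy corpus (two made-up sentences, not rows from the dataset): every unique word becomes a column, and each row counts that word's occurrences in one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks are on the rise", "stocks fell sharply"]
vec = CountVectorizer()
mat = vec.fit_transform(docs)

# the vocabulary maps each word to a column index
print(sorted(vec.vocabulary_))  # ['are', 'fell', 'on', 'rise', 'sharply', 'stocks', 'the']
print(mat.toarray())
```

On the full dataset this same call builds the 54,637-column sparse matrix used below.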
# LabelEncoder converts the category labels into integer codes
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(df['CATEGORY'])
# function to split the data for cross-validation
from sklearn.model_selection import train_test_split
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# take a look at the shape of each of these
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
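One caveat about the split above: it is random, so the exact accuracy scores will vary slightly between runs; passing a fixed random_state makes the split reproducible. A minimal illustration on toy data:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
# a fixed random_state gives the same 80/20 split on every run
train, test = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), len(test))  # 8 2
```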
So the x training vector contains 337,935 observations of 54,637 features -- this latter number is the number of unique words in the entire collection of headlines. The y training vector contains the 337,935 labels associated with each observation in the x training vector.
So we're ready to go. Let's make the classifier!
# the Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test)
Nice! Over 92% accuracy, just by using words as independent features.
If you feel like exploring what words are characteristic of each category, you can pull out the coefficients of the Naive Bayes classifier:
A function to predict the category from a raw title:
def predict_cat(title):
    cat_names = {'b': 'Business', 't': 'Science and Technology', 'e': 'Entertainment', 'm': 'Health'}
    cod = nb.predict(vectorizer.transform([title]))
    return cat_names[encoder.inverse_transform(cod)[0]]
print(predict_cat("star was seen tv"))
print(predict_cat("stocks are on the rise"))
print(predict_cat("eggs and cholesterol"))
print(predict_cat("US equities just legged lower"))
print(predict_cat("potentially smaller corporate tax cut"))
print(predict_cat("over 23 million consumers who hold subprime"))
print(predict_cat("Roku is down 6% this morning"))
print(predict_cat("We had about nine open investigations of classified leaks"))
print(predict_cat("The last 48 hours has been quite a chaotic"))
print(predict_cat("shortly after the US open failed to spark "))
print(predict_cat("Well that escalated quickly as the USDJPY"))
print(predict_cat("the ailing foundation of the economy provides"))
It looks good!
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(x_train, y_train)
# Print the accuracy
clf.score(x_test, y_test)
I will take some titles from the following news websites:
Business: http://www.zerohedge.com/
Health : https://www.medicalnewstoday.com/
Science and Technology : https://www.bloomberg.com/technology
Entertainment : http://radaronline.com/
Let us create a pandas dataframe where I will have two columns: News Titles and Category, taken from the websites above.
raw_data = {
'news_titles': [
'US equities just legged lower',
'Gold Bounces Off Key Technical Support On Massive Volume',
'Ruble, Real Tumble As Oil Slumps On Weaker IEA Outlook',
'Do you have high blood pressure? You might, based on new guidelines',
'Five life hacks for healthy skin',
'Nuts may protect against heart disease',
'Apple Targets Rear-Facing 3-D Sensor for 2019 iPhone',
'Here’s What to Do If You Find an Embarrassing Video of Yourself on Instagram',
'Twitter Bets on New Data Business Product to Revive Revenue',
'Chrissy Teigen’s ‘First Born Baby’ Rushed To The Hospital',
'Khloe Kardashian Hides Growing Baby Bump Under Oversized Coat',
'Shannon Beador House Of Horrors Sold Amid Nasty Divorce From David'],
'website': ['Business', 'Business', 'Business',
'Health', 'Health', 'Health',
'Science and Technology', 'Science and Technology', 'Science and Technology',
'Entertainment', 'Entertainment', 'Entertainment']}
df_a = pd.DataFrame(raw_data, columns = ['news_titles', 'website'])
df_a
df_a["Values"] = df_a["news_titles"].apply(predict_cat)
df_a
We did very well!
Out of 12 titles, we only missed two.