Francisca Dias
This dataset contains news headlines published over a period of 14 years by the reputable Australian news source ABC (Australian Broadcasting Corp.).
The headlines run from 2003 to 2017.
The number of articles varies from year to year; in 2013 there are over 80,000 articles.
The most frequent words appear to be related to some sort of tragedy: words such as police, death and crime are very common.
I end this article with topic modeling.
This dataset can be found here.
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("abcnews-date-text.csv")
df.head()
df.dtypes
df["year"] = df["publish_date"].astype(str).str[:4].astype(np.int64)
df.head()
df["month"] = df["publish_date"].astype(str).str[4:6].astype(np.int64)
df.head()
df.year.unique()
df.month.unique()
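For reference, here is an equivalent way to derive the same columns using pandas' datetime parsing; this sketch assumes publish_date follows the YYYYMMDD layout that the string slicing above relies on.
# Alternative: parse publish_date as a datetime and read year/month from it.
# Assumes the YYYYMMDD layout implied by the string slicing above.
dates = pd.to_datetime(df["publish_date"].astype(str), format="%Y%m%d")
df["year"] = dates.dt.year
df["month"] = dates.dt.month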
df["word_count"] = df["headline_text"].str.len()
df.head()
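As a quick sanity check on the new column, we can summarize how many words a typical headline has:
# Summary statistics of words per headline
df["word_count"].describe()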
sns.set_style("white")
fig, ax = plt.subplots(figsize=(13,6))
ax = sns.countplot(y="year", data=df, palette="PuBuGn_d")
ax.set(xlabel='Number of Articles', ylabel='Year')
plt.title("Number of Articles per Year");
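The plot gives the overall shape; printing the counts makes the exact per-year numbers easy to read off:
# Exact number of headlines per year, in chronological order
df["year"].value_counts().sort_index()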
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words="english",
                             max_features=30)
news_array = vectorizer.fit_transform(df["headline_text"])
# Numpy arrays are easy to work with, so convert the result to an array
news_array = news_array.toarray()
# Let's take a look at the words in the vocabulary and print the counts of each word:
vocab = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn versions
# Sum up the counts of each vocabulary word
dist = np.sum(news_array, axis=0)
# For each, print the vocabulary word and the number of times it appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)
"Police" seems to be the most frequent word on the dataset.
It is interesting to see that the top words are pretty much related to tragedy: crash, hospital, murder.
Maybe ABC news consumers are drawn to this category of news.
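To make that ranking explicit, the same vocab and dist arrays computed above can be sorted by count in descending order:
# Sort the 30 most frequent words by count, largest first
for count, tag in sorted(zip(dist, vocab), reverse=True):
    print(count, tag)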
all_words = df.headline_text
vect = CountVectorizer(stop_words = "english")
vect.fit(all_words)
print(len(vect.vocabulary_))
bag_of_words = vect.transform(all_words)
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=25, random_state=0)
# n_components was called n_topics in very old scikit-learn versions
# We build the model and transform the data in one step.
# Computing the transform takes some time,
# and we can save time by doing both at once.
document_topics = lda.fit_transform(bag_of_words)
print("lda.components_.shape: {}".format(lda.components_.shape))
# for each topic (a row in the components_), sort the features (ascending).
# Invert rows with [:, ::-1] to make sorting descending
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# get the feature names from the vectorizer:
feature_names = np.array(vect.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
# Print out the 10 topics:
import mglearn
mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)
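To get a feel for what a topic actually contains, we can also look at the headlines that load most heavily on it. A minimal sketch, using topic 0 as an example (any topic index works the same way):
# Headlines with the highest weight on a chosen topic (here topic 0)
topic = 0
top_docs = np.argsort(document_topics[:, topic])[::-1][:10]
for i in top_docs:
    print(df["headline_text"].iloc[i])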