Francisca Dias
This dataset comes from the unhcr.org website and was obtained through Kaggle.com.
It includes all speeches made by the High Commissioners up until June 2014.
There are 10 High Commissioners represented in this dataset.
Can you guess who gave the lengthiest speech?
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords
stop = set(stopwords.words("english"))
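Both stopwords.words("english") above and nltk.word_tokenize used further down rely on NLTK data packages that may not be installed yet; if they are missing, a one-time download (standard NLTK resource names, assumed here) avoids a LookupError.
nltk.download("stopwords")   # word list behind stopwords.words("english")
nltk.download("punkt")       # tokenizer model behind nltk.word_tokenize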
df = pd.read_csv("/speeches.csv", encoding="UTF-8")
df["content"].head()
df.author.unique()
df.groupby(['author']).size()
sns.set_style("white")
fig, ax = plt.subplots(figsize=(13,6))
ax = sns.countplot(y="author", data=df, palette="PuBuGn_d")
ax.set(xlabel='Number of Speeches', ylabel="High Commissioner's Name")
plt.title("Number of Speeches given by each High Commissioner");
df["date"] = df['by'].str[-4:]
number_speeches_year = df.groupby(['date']).size()
sns.set_style("white")
fig, ax = plt.subplots(figsize=(13,6))
ax = sns.countplot(y="date", data=df, palette="PuBuGn_d")
ax.set(xlabel='Number of Speeches', ylabel='Year')
ax.figure.set_size_inches(20,12)
plt.title("Number of Speeches given each year");
import re
def cleaning(s):
    s = str(s)
    s = s.rstrip('\n')
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)      # drop a non-word character that follows whitespace
    s = re.sub(r'\W,\s', ' ', s)     # drop a non-word character followed by ", "
    s = re.sub(r'[^\w]', ' ', s)     # keep word characters only
    s = re.sub(r'\d+', '', s)        # remove digits
    s = re.sub(r'\s+', ' ', s)       # collapse repeated whitespace
    s = re.sub(r'[!@#$_]', '', s)    # strip any remaining special characters
    s = re.sub(r'\bco\b', ' ', s)    # drop stray "co" tokens (whole words only, so words like "commissioner" stay intact)
    s = s.replace("https", "")       # drop URL scheme residue
    s = s.replace(",", "")
    return s
df['content'] = [cleaning(s) for s in df['content']]
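As a quick sanity check, here is the cleaner applied to a made-up sentence (illustrative only, not taken from the dataset):
cleaning("In 2014, over 50 million people were forcibly displaced!")
# -> 'in over million people were forcibly displaced '  (lower-cased; digits and punctuation removed, a trailing space left behind)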
Tokenization and removal of stop words from the content
# Tokenization segments text into basic units (or tokens) such as words and punctuation.
df['content'] = df.apply(lambda row: nltk.word_tokenize(row['content']), axis=1)
# Remove stop words from the "content" column, filtering the token list of each row.
df['content'] = df['content'].apply(lambda x: [item for item in x if item not in stop])
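Applying the same two steps to a short made-up sentence shows their effect:
tokens = nltk.word_tokenize("refugees deserve the protection of the international community")
[t for t in tokens if t not in stop]
# -> ['refugees', 'deserve', 'protection', 'international', 'community']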
Can you guess which High Commissioner delivered the lengthiest speech?
df["Total_Words"] = df["content"].apply(lambda x : len(x))
df.loc[df['Total_Words'].idxmax()]
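To see the runners-up as well, the same word counts can be ranked (using only the columns created above):
df.nlargest(5, 'Total_Words')[['author', 'date', 'Total_Words']]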