Francisca Dias
This dataset consists of 555,957 consumer complaints about financial products and services from US financial institutions.
There are 3,605 financial institutions represented in this dataset.
The purpose of this analysis is to give a visual essay and a general overview of the information provided: the institutions, the types of consumer complaints, the responses given to consumers, and a map of complaints by state, drawn with the great Plotly library.
This dataset can be found here.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
df = pd.read_csv('consumer_complaints.csv', low_memory=False)
df.head(2)
df.columns
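The row and institution counts quoted in the introduction can be checked directly on the loaded frame. Below is a minimal sketch on a small made-up frame; the `toy` data is hypothetical, and only the `company` and `product` column names come from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for consumer_complaints.csv
toy = pd.DataFrame({
    "company": ["Bank A", "Bank B", "Bank A", "Bank C"],
    "product": ["Mortgage", "Debt collection", "Mortgage", "Credit card"],
})

n_complaints = len(toy)                    # one row per complaint
n_institutions = toy["company"].nunique()  # number of distinct institutions

print(n_complaints, n_institutions)
```

Applied to `df`, the same two expressions should reproduce the 555,957 and 3,605 figures, assuming the file is the same version used here.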
Since this is a time series, I first convert the date column to datetime and then set it as the index. I also sort by date, since the rows are not in chronological order.
df['date_received'] = pd.to_datetime(df['date_received'])
df['year'], df['month'] = df['date_received'].dt.year, df['date_received'].dt.month
df.set_index('date_received', inplace=True)
df = df.sort_index()
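One way to confirm that the sort was needed, and that it worked, is to check whether the index is monotonic before and after. This is only a sketch on hypothetical dates, not the real file.

```python
import pandas as pd

# Hypothetical rows in non-chronological order
toy = pd.DataFrame({
    "date_received": ["2013-08-30", "2012-01-15", "2013-03-02"],
    "product": ["Mortgage", "Credit card", "Debt collection"],
})
toy["date_received"] = pd.to_datetime(toy["date_received"])
toy = toy.set_index("date_received")

sorted_before = toy.index.is_monotonic_increasing  # False: dates are out of order
toy = toy.sort_index()
sorted_after = toy.index.is_monotonic_increasing   # True once sorted

print(sorted_before, sorted_after)
```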
Our time series starts in 2011 and runs through 2016.
df.head(3)
df.tail(3)
df_group_month = df.groupby(['date_received']).product.count().reset_index()
df_group_month.set_index('date_received', inplace=True)
df_group_month = df_group_month['product'].resample('MS').sum()
df_group_month.plot(figsize=(15, 8), linewidth=3)
plt.title("Number of Complaints over time")
plt.xlabel("Date")
plt.ylabel("Number of Complaints")
plt.show()
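The monthly aggregation above can also be written as a single resample on the datetime index. The sketch below uses a hypothetical four-row frame to show what `'MS'` (month start) bins look like; it is an alternative formulation, not the code used for the chart.

```python
import pandas as pd

# Hypothetical complaints indexed by the date they were received
toy = pd.DataFrame(
    {"product": ["Mortgage", "Credit card", "Mortgage", "Debt collection"]},
    index=pd.to_datetime(["2013-01-03", "2013-01-20", "2013-02-11", "2013-04-05"]),
)
toy.index.name = "date_received"

# Count non-null product entries per calendar month, labelled by month start
monthly = toy["product"].resample("MS").count()
print(monthly)
```

Months with no complaints (March in this toy data) appear with a count of 0 rather than being dropped, which keeps the line plot continuous.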
The number of complaints has been trending upward since 2012.
By the end of 2012 the number of monthly complaints was about 10,000, and one year later it had risen 30%, to about 13,000.
By mid-2015 it peaked at about 16,000 complaints.
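The 30% figure is a simple relative change between the two approximate readings. As a quick check, using the rounded values quoted in the text (they are read off the chart, not exact counts):

```python
# Rounded monthly complaint counts quoted in the text
end_2012 = 10_000
end_2013 = 13_000

# Relative change from the end of 2012 to the end of 2013
growth = (end_2013 - end_2012) / end_2012
print(f"{growth:.0%}")
```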
There are 11 complaint product types represented in this dataset.
In the pie chart, two products stand out by size: Mortgage and Debt Collection.
Together they account for about half of all complaints in this dataset.
cmap = sns.diverging_palette(220, 15, as_cmap=True)
product_size = df.groupby(['product']).size()
ax = product_size.plot.pie(y='product', figsize=(10, 10), colormap=cmap, autopct='%1.0f%%',pctdistance=0.5, labeldistance=1.2)
handles, labels = ax.get_legend_handles_labels()
lgd = ax.legend(handles, labels, bbox_to_anchor=(1.3, 0.8), loc=2, borderaxespad=0., fontsize=12)
plt.ylabel(' ')
plt.title('Complaints by Product Type', fontsize=20)
plt.show();
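The share that the two largest products hold can also be computed numerically rather than read off the pie chart. The following is a sketch on hypothetical data, so the resulting share is illustrative only.

```python
import pandas as pd

# Hypothetical product column; the real one has 11 categories
toy = pd.Series(
    ["Mortgage", "Debt collection", "Mortgage", "Credit card",
     "Debt collection", "Bank account or service", "Mortgage", "Credit reporting"],
    name="product",
)

# Fraction of complaints per product, largest first
shares = toy.value_counts(normalize=True)

# Combined share of the two largest products
top_two_share = shares.iloc[:2].sum()
print(shares.index[:2].tolist(), top_two_share)
```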
fig = plt.figure(figsize=(15,8))
company_size = df.company.value_counts(ascending=False)
company_size[:10].plot(kind='barh', colormap=cmap)
plt.title('Top 10 Financial Institutions with the Most Complaints', fontsize=20)
plt.xlabel('Number of complaints')
plt.ylabel('Financial Institutions')
plt.show();
fig = plt.figure(figsize=(15,8))
df_company_response = df.company_response_to_consumer.value_counts()
df_company_response.plot(kind='barh', colormap=cmap)
plt.title('Financial Institution Responses to Consumers by Type', fontsize=20)
plt.xlabel('Number of Responses according to Type')
plt.ylabel('Type of Response')
plt.show();
There are 6 different methods by which consumers can submit their complaints.
The most preferred method is via web. Surprisingly, email is the least used.
df.submitted_via.value_counts()
fig = plt.figure(figsize=(15,8))
sns.countplot(x="submitted_via", data=df, color="c")
plt.title('What is the most preferred method for submitting a complaint?')
plt.xlabel('Submission Methods')
plt.ylabel('Number of submissions for each method')
plt.show();
According to the dataset, 97% of financial institution responses were given on time.
timely_response_df = df.timely_response.value_counts()
timely_response_df
fig = plt.figure(figsize=(15,8))
sns.countplot(x="timely_response", data=df, color="c")
plt.title('Was the Response on Time?')
plt.xlabel('Yes or No')
plt.ylabel('Number of Complaints')
plt.show();
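The timely-response percentage can be derived as the mean of a boolean mask. The sketch below assumes the column holds the strings "Yes" and "No" (as the axis label suggests) and uses a hypothetical series constructed to match the 97% figure.

```python
import pandas as pd

# Hypothetical column: 97 timely responses and 3 late ones
toy = pd.Series(["Yes"] * 97 + ["No"] * 3, name="timely_response")

# The mean of a boolean mask is the fraction of True values
timely_share = (toy == "Yes").mean()
print(f"{timely_share:.0%}")
```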
Below are the responses given by the financial institutions to their consumers.
In 52,478 of the 85,124 complaints that have a public-response entry, the company chose not to provide a public response. That represents 62% of those entries.
company_public_response_df = df.company_public_response.value_counts()
The number of responses is:
company_public_response_df.sum()
Below are the 10 response categories in this dataset, with the count for each:
company_public_response_df
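The 62% quoted above follows from the two counts in the text. Since `value_counts()` skips missing values by default, the total of 85,124 covers only complaints with a non-null public-response entry. A quick arithmetic check:

```python
# Counts quoted in the text
no_public_response = 52_478       # "chooses not to provide a public response"
total_public_responses = 85_124   # all non-null public-response entries

share = no_public_response / total_public_responses
print(f"{share:.1%}")
```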
from plotly.offline import download_plotlyjs, plot, iplot
import plotly.offline as py
py.init_notebook_mode(connected=True)
new_df = df.groupby(["state"]).size().reset_index(name="Number_Complaints")
for col in new_df.columns:
    new_df[col] = new_df[col].astype(str)
scl = [[0.0, 'rgb(242,240,247)'], [0.2, 'rgb(218,218,235)'], [0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'], [0.8, 'rgb(117,107,177)'], [1.0, 'rgb(84,39,143)']]
new_df['text'] = new_df['state'] + '<br>' + 'Complaints ' + new_df['Number_Complaints']
data = [dict(
    type='choropleth',
    colorscale=scl,
    autocolorscale=False,
    locations=new_df['state'],
    locationmode='USA-states',
    z=new_df['Number_Complaints'].astype(float),
    text=new_df['text'],
    marker=dict(
        line=dict(
            color='rgb(255,255,255)',
            width=2
        )),
    colorbar=dict(
        title="Number of Complaints")
)]
layout = dict(
    title='Number of Complaints by State<br>',
    geo=dict(
        scope='usa',
        projection=dict(type='albers usa'),
        showlakes=False,
        lakecolor='rgb(255, 255, 255)'),
)
fig = dict(data=data, layout=layout)
iplot(fig, filename='d3-cloropleth-map')
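The map code converts every column to a string to build the hover text and then casts the counts back to float for `z`. A possible alternative, sketched here on hypothetical state data, is to keep the count numeric and cast it only where the text is built.

```python
import pandas as pd

# Hypothetical complaints with a state abbreviation each
toy = pd.DataFrame({"state": ["CA", "TX", "CA", "NY", "CA", "TX"]})

# Complaints per state; the count column stays an integer
by_state = toy.groupby("state").size().reset_index(name="Number_Complaints")

# Cast to str only for the hover label, leaving the numeric column intact
by_state["text"] = (
    by_state["state"] + "<br>Complaints " + by_state["Number_Complaints"].astype(str)
)
print(by_state)
```

With this layout, `by_state['Number_Complaints']` could be passed to `z` directly, without the round trip through strings.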