Francisca Dias
Note: This dataset was taken from www.kaggle.com.
In this dataset we have data on sales over time and by store for a large number of products sold by Corporacion Favorita in Ecuador.
There are also a number of other data files on transactions, store information, and holidays. There is an enormous amount of data to process and explore here.
The training data alone is almost 5 GB on disk.
For this reason I restrict the training data to the year 2016.
import pandas as pd
import numpy as np
holidays_events_df = pd.read_csv('/holidays_events.csv')
items_df = pd.read_csv('/items.csv')
oil_df = pd.read_csv('/oil.csv')
stores_df = pd.read_csv('/stores.csv')
transactions_df = pd.read_csv('/transactions.csv')
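Before diving in, a quick look at the shape of each table gives a sense of how much data each file holds (a small sanity check; the exact counts depend on the download):
for name, df in [('holidays_events', holidays_events_df), ('items', items_df),
                 ('oil', oil_df), ('stores', stores_df), ('transactions', transactions_df)]:
    print(name, df.shape)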
I will start by adding some columns to this data set: year, month, and day of the week.
There might be some monthly seasonality, so I want to see whether certain months have more transactions than others.
My feeling is also that people tend to shop more on weekends than on any other day of the week, so I will add a day-of-the-week column to confirm this hypothesis.
# Parse the date column once, then derive month, weekday, and year from it
transactions_df['date'] = pd.to_datetime(transactions_df['date'], errors='coerce')
transactions_df['month'] = transactions_df['date'].dt.month
transactions_df['day_of_week'] = transactions_df['date'].dt.day_name()  # weekday_name was removed in newer pandas
# Keep year as a string so it makes a clean column label in the pivots below
transactions_df['year'] = transactions_df['date'].dt.year.astype(str)
transactions_df.head()
Transactions by Month
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
x = transactions_df.groupby(['month', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="month", columns="year", values="transactions")
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(y);
December is the month with the most transactions. We can also see that the number of transactions increases as we move forward in time, since the map gets darker. Note that, since we don't have data for the second half of 2017, those squares are left blank.
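To back up the visual reading with a number, we can ask for the month with the largest total in each year, assuming y still holds the month-by-year pivot from the cell above:
# Month with the highest transaction total in each year (NaN months are skipped)
y.idxmax()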
Transactions by day of the week
x = transactions_df.groupby(['day_of_week', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="day_of_week", columns="year", values="transactions")
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(y);
Apparently my suspicions were correct: Saturday and Sunday are the most popular days to shop at Favorita.
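One caveat: the pivot sorts the weekday names alphabetically, so the rows of the heatmap run Friday, Monday, Saturday, and so on. Reindexing the rows puts them back in calendar order (a small tweak reusing the y from the cell above):
# Re-plot with the weekdays in calendar order
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(y.reindex(weekday_order));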
regions_data = {
    'region': ['Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra', 'Sierra',
               'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Costa', 'Oriente'],
    'state': ['Imbabura', 'Tungurahua', 'Pichincha', 'Azuay', 'Bolivar', 'Chimborazo',
              'Loja', 'Cotopaxi', 'Esmeraldas', 'Manabi', 'Santo Domingo de los Tsachilas',
              'Santa Elena', 'Guayas', 'El Oro', 'Los Rios', 'Pastaza']}
df_regions = pd.DataFrame(regions_data, columns=['region', 'state'])
df_regions_cities = pd.merge(df_regions, stores_df, on='state')
transactions_regions = pd.merge(transactions_df, df_regions_cities, on='store_nbr')
transactions_regions.head()
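One thing to watch: pd.merge does an inner join by default, so any store whose state is missing from my hand-built region table silently drops out of transactions_regions. A quick check that nothing was lost:
# Stores that did not get a region assigned by the merge
missing = set(stores_df['store_nbr']) - set(df_regions_cities['store_nbr'])
print('stores without a region mapping:', sorted(missing))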
Transactions by State
x = transactions_regions.groupby(['state', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="state", columns="year", values="transactions")
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(y);
Pichincha and Guayas are among the states where Favorita sells the most products.
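The same pivot confirms this numerically when we sum across the years:
# Total transactions per state across all years, largest first
y.sum(axis=1).sort_values(ascending=False).head()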
Transactions by Store Number
x = transactions_regions.groupby(['store_nbr', 'year'], as_index=False).agg({'transactions':'sum'})
y = x.pivot(index="store_nbr", columns="year", values="transactions")
fig, ax = plt.subplots(figsize=(12,9))
sns.heatmap(y);
Stores 44, 45, 46, 47, 48 and 50 are the winners in transactions.
It is also interesting to see that a few stores only opened after 2013, such as stores 20, 21 and 22, which became operational at the beginning of 2015.
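The opening dates can be read off directly by taking each store's first recorded transaction, which should match what the heatmap shows:
# First date each store appears in the transactions file (latest openers last)
transactions_df.groupby('store_nbr')['date'].min().sort_values().tail(10)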
items_df.head()
There is a variety of product families sold at Favorita.
I would like to know which families make up the largest share of the product catalogue.
For that purpose, I calculated the percentage of items belonging to each family over the total number of items.
I plot the results below.
sns.set_style("white")
items_df_family = items_df.groupby(['family']).size().to_frame(name='count').reset_index()
items_df_family['percentage'] = items_df_family['count'] / items_df_family['count'].sum() * 100
fig, ax = plt.subplots(figsize=(14,10))
ax = sns.barplot(x="percentage", y="family", data=items_df_family, palette="BuGn_r")
Grocery I accounts for more than 30% of all items in the catalogue.
It is interesting to see that Beverages comes right after, with approximately 12%.
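Sorting the table confirms the two leading families (reusing items_df_family from the cell above):
# Top 5 families by share of the item catalogue
items_df_family.sort_values('percentage', ascending=False).head(5)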
Since there are more than 125 million rows in the train data, I will filter it to show only what happened in store number 25 during 2016.
dtypes = {'store_nbr': np.dtype('int64'),
'item_nbr': np.dtype('int64'),
'unit_sales': np.dtype('float64'),
'onpromotion': np.dtype('O')}
train = pd.read_csv('/train.csv', index_col='id', parse_dates=['date'], dtype=dtypes)
date_mask = (train['date'] >= '2016-01-01') & (train['date'] <= '2016-12-31') & (train['store_nbr'] == 25)
train = train[date_mask]
train.head()
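Reading the full 5 GB file just to discard most of it can exhaust memory on a smaller machine. An alternative is to apply the same filter chunk by chunk while reading; a sketch using the same path and dtypes as above:
# Stream train.csv in chunks and keep only store 25 rows from 2016
chunks = []
for chunk in pd.read_csv('/train.csv', index_col='id', parse_dates=['date'],
                         dtype=dtypes, chunksize=1000000):
    mask = ((chunk['date'] >= '2016-01-01') & (chunk['date'] <= '2016-12-31')
            & (chunk['store_nbr'] == 25))
    chunks.append(chunk[mask])
train = pd.concat(chunks)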
# Merge with the item metadata and sort by date
df_train_item = pd.merge(train, items_df, on='item_nbr').sort_values(by='date')
df_train_item['year'] = df_train_item['date'].dt.year
df_train_item['month'] = df_train_item['date'].dt.month
df_train_item.head()
Transactions by family type in store 25
sns.set_style("white")
fig, ax = plt.subplots(figsize=(13, 9))
sns.countplot(x="family", hue="month", data=df_train_item, palette="Greens_d",
order=df_train_item.family.value_counts().iloc[:7].index);
For store 25, the top 7 product families are: Grocery I, Beverages, Cleaning, Produce, Dairy, Bread/Bakery, and Personal Care. Across 2016 there is a clear pattern of increased transactions in these 7 families in November and December.
Interesting fact: there is no data for September, and in October the store was open for only two days. That is why there is such a big drop in transactions.
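This is easy to verify by counting the distinct dates present in each month (reusing df_train_item from above):
# Number of distinct days with data in each month of 2016
df_train_item.groupby('month')['date'].nunique()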
What are the most bought items in store 25?
Too bad we only have the item number, and not the description of the product :)
df_train_item['item_nbr'].value_counts().nlargest(30)
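Even without descriptions, we can at least attach the family of each top item by looking it up in items_df:
# Families of the 30 most frequent items in store 25
top_items = df_train_item['item_nbr'].value_counts().nlargest(30)
items_df.set_index('item_nbr').loc[top_items.index, 'family']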