Which algorithms perform better on this problem?

Francisca Dias

Table of Contents

For this project I will investigate the dataset from an online product seller, in which the goal is to predict what prices to suggest.

Each record in the dataset describes a listing.

There are almost 1.5 million observations, that is, online listings.

In [1]:
import pandas as pd
In [2]:
data = pd.read_csv('train.tsv', sep='\t')
In [3]:
data.head()
Out[3]:
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
In [4]:
pd.set_option('display.float_format', lambda x: '%.3f' % x) # Surpress Scientific notation
data.describe()
Out[4]:
train_id item_condition_id price shipping
count 1482535.000 1482535.000 1482535.000 1482535.000
mean 741267.000 1.907 26.738 0.447
std 427971.135 0.903 38.586 0.497
min 0.000 1.000 0.000 0.000
25% 370633.500 1.000 10.000 0.000
50% 741267.000 2.000 17.000 0.000
75% 1111900.500 3.000 29.000 1.000
max 1482534.000 5.000 2009.000 1.000
In [6]:
import numpy as np
data['category_name'] = data['category_name'].replace(np.NaN, 'Unknown')
In [7]:
f = lambda x: (x["category_name"].split("/"))[0]
data["category"] = data.apply(f, axis=1)
In [8]:
data.category.value_counts()
Out[8]:
Women                     664385
Beauty                    207828
Kids                      171689
Electronics               122690
Men                        93680
Home                       67871
Vintage & Collectibles     46530
Other                      45351
Handmade                   30842
Sports & Outdoors          25342
Unknown                     6327
Name: category, dtype: int64
In [9]:
data['has_brand'] = data["brand_name"].apply(lambda x: 0 if type(x) == float else 1)
In [10]:
data.head()
Out[10]:
train_id name item_condition_id category_name brand_name price shipping item_description category has_brand
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.000 1 No description yet Men 0
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.000 0 This keyboard is in great condition and works ... Electronics 1
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.000 1 Adorable top with a hint of lace and a key hol... Women 1
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.000 1 New with tags. Leather horses. Retail for [rm]... Home 0
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.000 0 Complete with certificate of authenticity Women 0
In [11]:
# Mapping categories
categories_mapping = {"Beauty": 2, 
                      "Kids": 3, 
                      "Handmade": 1, 
                      "Other": 4, 
                      "Sports & Outdoors": 7,
                      "Unknown": 6,
                      "Home": 10,
                      "Women": 9, 
                      "Vintage & Collectibles": 8,
                      "Electronics": 11, 
                      "Men": 4,}
    
data['category'] = data['category'].map(categories_mapping)
In [12]:
data['item_description'] = data['item_description'].replace(np.NaN, 'no')
In [13]:
# Gives the length of the name and description
data['name_length'] = data['name'].apply(len)
data['descrip_lenght'] = data['item_description'].apply(len)
In [14]:
data.head()
Out[14]:
train_id name item_condition_id category_name brand_name price shipping item_description category has_brand name_length descrip_lenght
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.000 1 No description yet 4 0 35 18
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.000 0 This keyboard is in great condition and works ... 11 1 32 188
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.000 1 Adorable top with a hint of lace and a key hol... 9 1 14 124
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.000 1 New with tags. Leather horses. Retail for [rm]... 10 0 21 173
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.000 0 Complete with certificate of authenticity 9 0 20 41
In [21]:
columns_to_keep = ['item_condition_id', 'shipping', 'category', 'has_brand', 'name_length', 'name_length', 'price' ]
In [22]:
final_df = data[columns_to_keep]
In [24]:
final_df.head()
Out[24]:
item_condition_id shipping category has_brand name_length name_length price
0 3 1 4 0 35 35 10.000
1 3 0 11 1 32 32 52.000
2 1 1 9 1 14 14 10.000
3 1 1 10 0 21 21 35.000
4 1 0 9 0 20 20 44.000

This is a regression problem where all attributes are numeric.

I will use the Cross validation approach to split the data into train/test. I will split it into k = 10 parts.

To measure the algorithm performance I will use the mean squared error.

In [25]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet


array = final_df.values
X = array[:,0:6]
Y = array[:,6]


kfold = KFold(n_splits=10, random_state=7)
scoring = 'neg_mean_squared_error'



model_lr = LinearRegression()
model_ridge = Ridge()
model_lasso = Lasso()
model_elastic = ElasticNet()

results_lr = cross_val_score(model_lr, X, Y, cv=kfold, scoring=scoring)
results_ridge = cross_val_score(model_ridge, X, Y, cv=kfold, scoring=scoring)
results_lasso = cross_val_score(model_lasso, X, Y, cv=kfold, scoring=scoring)
results_elastic = cross_val_score(model_elastic, X, Y, cv=kfold, scoring=scoring)


print('Linear Regression Results:', results_lr.mean())
print('Ridge Results:', results_ridge.mean())
print('Lasso Results:', results_lasso.mean())
print('Lasso Results:', results_elastic.mean())
Linear Regression Results: -1441.67569915
Ridge Results: -1441.67446543
Lasso Results: -1451.17059846
Lasso Results: -1459.59991106

Best performance: Linear Regression and Ridge.

What will be done next?

  • I will run three nonlinear algorithms: k-Nearest Neighbors, Decision Tree Regressor and Support Vector Machines;
  • Perform more feature selection: Extract the most important features in this dataset;
  • Automate the workflows with Pipelines;
  • Improve performance with Ensemble and Algorithm Tuning.