Francisca Dias
For this project I will investigate a dataset from an online marketplace, where the goal is to predict what price to suggest for each listing.
Each record in the dataset describes a listing.
There are almost 1.5 million observations, that is, online listings.
import pandas as pd
# Load the tab-separated listings file
data = pd.read_csv('train.tsv', sep='\t')
data.head()
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # Suppress scientific notation
data.describe()
import numpy as np
# Fill in missing category names before extracting the top-level category
data['category_name'] = data['category_name'].replace(np.nan, 'Unknown')
# Keep only the top-level category (the text before the first '/')
f = lambda x: (x["category_name"].split("/"))[0]
data["category"] = data.apply(f, axis=1)
data.category.value_counts()
# Missing brand names are NaN (floats), so flag listings that have a brand with 1
data['has_brand'] = data["brand_name"].apply(lambda x: 0 if type(x) == float else 1)
data.head()
# Mapping categories to integer codes (each category gets a distinct code)
categories_mapping = {"Handmade": 1,
                      "Beauty": 2,
                      "Kids": 3,
                      "Other": 4,
                      "Men": 5,
                      "Unknown": 6,
                      "Sports & Outdoors": 7,
                      "Vintage & Collectibles": 8,
                      "Women": 9,
                      "Home": 10,
                      "Electronics": 11}
data['category'] = data['category'].map(categories_mapping)
# Fill in missing item descriptions
data['item_description'] = data['item_description'].replace(np.nan, 'no')
# Length (in characters) of the name and of the description
data['name_length'] = data['name'].apply(len)
data['descrip_length'] = data['item_description'].apply(len)
data.head()
columns_to_keep = ['item_condition_id', 'shipping', 'category', 'has_brand', 'name_length', 'descrip_length', 'price']
final_df = data[columns_to_keep]
final_df.head()
This is a regression problem where all attributes are numeric (a quick check of the final frame follows below).
I will use k-fold cross-validation to evaluate the models, splitting the data into k = 10 folds.
To measure algorithm performance I will use the mean squared error.
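As a sanity check on that claim, the column types and missing-value counts of the final frame can be inspected; a minimal sketch, assuming final_df from the step above is still in memory:
print(final_df.dtypes)
print(final_df.isnull().sum())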
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
array = final_df.values
X = array[:, 0:6]  # the six feature columns
Y = array[:, 6]    # the target: price
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scoring = 'neg_mean_squared_error'
model_lr = LinearRegression()
model_ridge = Ridge()
model_lasso = Lasso()
model_elastic = ElasticNet()
results_lr = cross_val_score(model_lr, X, Y, cv=kfold, scoring=scoring)
results_ridge = cross_val_score(model_ridge, X, Y, cv=kfold, scoring=scoring)
results_lasso = cross_val_score(model_lasso, X, Y, cv=kfold, scoring=scoring)
results_elastic = cross_val_score(model_elastic, X, Y, cv=kfold, scoring=scoring)
print('Linear Regression Results:', results_lr.mean())
print('Ridge Results:', results_ridge.mean())
print('Lasso Results:', results_lasso.mean())
print('ElasticNet Results:', results_elastic.mean())
Best performance: Linear Regression and Ridge, whose scores are closest to zero (the metric is the negative mean squared error, so larger is better).
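Since the scoring function returns the negative of the mean squared error, one way to read these results on the original price scale is to convert them to RMSE. A minimal sketch, assuming the results_* arrays from the cells above are still in memory:
# Flip the sign of the neg-MSE fold scores and take the square root to get RMSE per fold
for name, scores in [('Linear Regression', results_lr),
                     ('Ridge', results_ridge),
                     ('Lasso', results_lasso),
                     ('ElasticNet', results_elastic)]:
    rmse = np.sqrt(-scores)
    print('%s RMSE: %.3f (+/- %.3f)' % (name, rmse.mean(), rmse.std()))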
What will be done next?