Francisca Dias
For this project I will investigate a dataset from an online marketplace, where the goal is to predict what price to suggest for each listing.
Each record in the dataset describes a listing.
There are almost 1.5 million observations, that is, online listings.
import pandas as pd
# Load the tab-separated listings file
data = pd.read_csv('train.tsv', sep='\t')
data.head()
pd.set_option('display.float_format', lambda x: '%.3f' % x)  # Suppress scientific notation
data.describe()
import numpy as np
# Fill in missing category names before extracting the top-level category
data['category_name'] = data['category_name'].replace(np.nan, 'Unknown')
# Keep only the top-level category (the text before the first '/')
f = lambda x: (x["category_name"].split("/"))[0]
data["category"] = data.apply(f, axis=1)
data.category.value_counts()
# Missing brand names are NaN (floats), so flag listings that have a brand with 1
data['has_brand'] = data["brand_name"].apply(lambda x: 0 if type(x) == float else 1)
data.head()
# Mapping categories to integer codes (each category gets a distinct code)
categories_mapping = {"Handmade": 1,
                      "Beauty": 2,
                      "Kids": 3,
                      "Other": 4,
                      "Men": 5,
                      "Unknown": 6,
                      "Sports & Outdoors": 7,
                      "Vintage & Collectibles": 8,
                      "Women": 9,
                      "Home": 10,
                      "Electronics": 11}
data['category'] = data['category'].map(categories_mapping)
# Fill in missing item descriptions
data['item_description'] = data['item_description'].replace(np.nan, 'no')
# Length (in characters) of the name and of the description
data['name_length'] = data['name'].apply(len)
data['descrip_length'] = data['item_description'].apply(len)
data.head()
columns_to_keep = ['item_condition_id', 'shipping', 'category', 'has_brand', 'name_length', 'descrip_length', 'price']
final_df = data[columns_to_keep]
final_df.head()
This is a regression problem where all attributes are numeric (a quick check of the final frame follows below).
I will use k-fold cross-validation to evaluate the models, splitting the data into k = 10 folds.
To measure algorithm performance I will use the mean squared error.
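As a sanity check on that claim, the column types and missing-value counts of the final frame can be inspected; a minimal sketch, assuming final_df from the step above is still in memory:
print(final_df.dtypes)
print(final_df.isnull().sum())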
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
array = final_df.values
X = array[:, 0:6]  # the six feature columns
Y = array[:, 6]    # the target: price
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scoring = 'neg_mean_squared_error'
model_lr = LinearRegression()
model_ridge = Ridge()
model_lasso = Lasso()
model_elastic = ElasticNet()
results_lr = cross_val_score(model_lr, X, Y, cv=kfold, scoring=scoring)
results_ridge = cross_val_score(model_ridge, X, Y, cv=kfold, scoring=scoring)
results_lasso = cross_val_score(model_lasso, X, Y, cv=kfold, scoring=scoring)
results_elastic = cross_val_score(model_elastic, X, Y, cv=kfold, scoring=scoring)
print('Linear Regression Results:', results_lr.mean())
print('Ridge Results:', results_ridge.mean())
print('Lasso Results:', results_lasso.mean())
print('ElasticNet Results:', results_elastic.mean())
Best performance: Linear Regression and Ridge, whose scores are closest to zero (the metric is the negative mean squared error, so larger is better).
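Since the scoring function returns the negative of the mean squared error, one way to read these results on the original price scale is to convert them to RMSE. A minimal sketch, assuming the results_* arrays from the cells above are still in memory:
# Flip the sign of the neg-MSE fold scores and take the square root to get RMSE per fold
for name, scores in [('Linear Regression', results_lr),
                     ('Ridge', results_ridge),
                     ('Lasso', results_lasso),
                     ('ElasticNet', results_elastic)]:
    rmse = np.sqrt(-scores)
    print('%s RMSE: %.3f (+/- %.3f)' % (name, rmse.mean(), rmse.std()))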
What will be done next?