Francisca Dias
It can be hard to know how much something’s really worth. Small details can mean big differences in pricing.
For example, one of these sweaters costs 335 dollars and the other costs 9.99 dollars. Can you guess which is which?
from IPython.display import Image
Image("mercari_comparison.png")
Well, we cannot predict the exact price of each pullover (yet), but we can place each of them in a luxury-level category.
Spoiler alert: the Vince pullover is far more expensive than the St. John's one.
Therefore we want our model to say that the Vince pullover belongs to the luxury category, whereas St. John's does not.
In this post I will show an easy way to predict whether a product belongs to a luxury tier just by typing the item name.
This dataset comes from an online retail marketplace and contains more than 1.5 million products.
The dataset can be found here.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
Load the dataset
data = pd.read_csv('train.tsv', sep='\t')
The first 10 rows
data.head(10)
Missing values
total = data.isnull().sum().sort_values(ascending=False)
missing_data = total.to_frame('Total')
missing_data.head()
"brand_name" is missing 632,682 values. This probably means those items don't carry any specific brand.
"category_name" is missing 6,327 names.
"item_description" is missing 4 descriptions.
Since we want to predict whether a product is luxury from its name, we only need the "name" feature, plus "brand_name" and "price" to build the target label.
Therefore we can drop the remaining columns.
drop_elements = ['train_id', 'item_condition_id', 'shipping', 'item_description']
data = data.drop(drop_elements, axis = 1)
Now our dataset looks like this:
data.head()
Function to convert the "name" column to lower case
import re
def lower_case(s):
    s = s.lower()
    return s
data['name'] = [lower_case(s) for s in data['name']]
data.head()
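As an aside, pandas can lowercase a column without an explicit Python loop, via the vectorized `.str` accessor. A minimal sketch on a toy frame (the column name `name` matches the dataset; the sample values are invented):

```python
import pandas as pd

# Toy frame standing in for the real dataset; values are illustrative only
toy = pd.DataFrame({'name': ['Vince Turtleneck', 'ST.JOHN Bay Sweater']})

# .str.lower() lowercases every entry in one vectorized call
toy['name'] = toy['name'].str.lower()
print(toy['name'].tolist())
```

On 1.5 million rows the vectorized version is also noticeably faster than a list comprehension.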
Group the data by brand and compute each brand's mean price
mean_price = data.groupby('brand_name', as_index=False)['price'].mean() # without as_index=False, it returns a Series instead
mean_price.head(3)
Basic statistics of the average price per brand
mean_price.describe()
Here's how I will segment the brands into four luxury levels, using three price thresholds (13, 18, and 28 dollars):
mean_price['Luxury'] = np.where(mean_price.price < 13, 1,
                        np.where(mean_price.price < 18, 2,
                         np.where(mean_price.price < 28, 3, 4)))
mean_price.head(3)
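The nested `np.where` above is equivalent to binning the average price into four half-open intervals, which `pd.cut` can express more directly. A sketch with the same cut points (the sample prices are invented; `right=False` reproduces the strict `<` comparisons):

```python
import numpy as np
import pandas as pd

# Invented sample of brand-average prices
prices = pd.Series([9.99, 15.0, 25.0, 335.0])

# Same bins as the nested np.where: <13 -> 1, <18 -> 2, <28 -> 3, else 4
luxury = pd.cut(prices, bins=[-np.inf, 13, 18, 28, np.inf],
                labels=[1, 2, 3, 4], right=False)
print(luxury.tolist())
```

With more than four levels, the `pd.cut` form stays readable while the nested `np.where` quickly does not.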
Merge the two datasets: the main dataset and the groupby dataset
final_data = data.merge(mean_price, on='brand_name', how='left')
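Because this is a left join on `brand_name`, rows whose brand was missing in the original data find no match and get `NaN` in the merged `Luxury` column, which is why those rows are dropped later. A toy illustration (the frames and values are invented):

```python
import pandas as pd

# One branded and one unbranded item (invented examples)
items = pd.DataFrame({'name': ['sweater a', 'sweater b'],
                      'brand_name': ['Vince', None]})
brand_stats = pd.DataFrame({'brand_name': ['Vince'], 'Luxury': [4]})

merged = items.merge(brand_stats, on='brand_name', how='left')
# The unbranded row has no match, so its Luxury value is NaN
print(merged['Luxury'].isna().tolist())
```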
Rename the columns on the final dataset
final_data.rename(columns={'name': 'Name',
'category_name': 'Category',
'brand_name': 'Brand',
'price_x': 'Price',
'price_y': 'Average Price' }, inplace=True)
final_data.head(2)
final_data = final_data.drop(['Category', 'Brand', 'Price', 'Average Price'], axis=1)
We need to drop all rows containing NaN: items without a brand received no Luxury label in the merge, and the model cannot train on missing targets.
final_data = final_data.dropna()
final_data.head(3)
Change the Luxury column datatype from float to integer
final_data['Luxury'] = final_data['Luxury'].astype('int64')
final_data.head(3)
We will use CountVectorizer to convert text into a matrix of token counts
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(final_data['Name'])
# LabelEncoder maps the Luxury levels 1-4 to the consecutive integers 0-3
encoder = LabelEncoder()
y = encoder.fit_transform(final_data['Luxury'])
Train Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
The multinomial Naive Bayes classifier is suitable for classification with discrete features such as word counts
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test)
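A single accuracy score can hide how the model does on each of the four classes, especially if cheap items vastly outnumber luxury ones; `classification_report` shows per-class precision and recall. A self-contained sketch on invented toy data (not the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Invented toy corpus: label 4 = luxury, 1 = cheap
texts = ['vince cashmere sweater', 'vince silk blouse',
         'basic cotton tee', 'plain cotton socks']
labels = [4, 4, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
preds = clf.predict(X)

# Per-class precision and recall instead of a single accuracy number
print(classification_report(labels, preds))
```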
Create a function that predicts the category of each item name
def predict_cat(Name):
    cat_names = {1: 'Cheap',
                 2: 'Good Price',
                 3: 'Not Luxury',
                 4: 'Luxury'}
    cod = nb.predict(vectorizer.transform([Name]))
    return cat_names[encoder.inverse_transform(cod)[0]]
Let's see if we can predict the categories from the example given earlier.
If our model predicts correctly, the Vince sweater, being more expensive, should land in a higher category than the St. John's Bay sweater.
print('vince sweater belongs to category:',predict_cat("vince turtleneck sweater"))
print('st.john bay sweater belongs to category:',predict_cat("st.john bay turtleneck sweater"))