Natural Language on online products

Francisca Dias

It can be hard to know how much something’s really worth. Small details can mean big differences in pricing.

For example, one of these sweaters costs 335 dollars and the other costs 9.99 dollars. Can you guess which one’s which?

In [1]:
from IPython.display import Image
Image("mercari_comparison.png")
Out[1]:

We cannot (yet) predict the price of each pullover, but we can assign each of them to a luxury-level category.

Spoiler alert: the Vince pullover is far more expensive than the St. John's Bay one.

We therefore want our model to say that the Vince pullover belongs to the luxury category, whereas the St. John's Bay one doesn't.

In this notebook I show an easy way to guess whether a product belongs to a luxury brand just by typing in the item's name.

This dataset comes from an online retail marketplace and contains roughly 1.5 million products.

The dataset can be found here.

In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

Load the dataset

In [3]:
data = pd.read_csv('train.tsv', sep='\t')

The first 10 rows

In [4]:
data.head(10)
Out[4]:
train_id name item_condition_id category_name brand_name price shipping item_description
0 0 MLB Cincinnati Reds T Shirt Size XL 3 Men/Tops/T-shirts NaN 10.0 1 No description yet
1 1 Razer BlackWidow Chroma Keyboard 3 Electronics/Computers & Tablets/Components & P... Razer 52.0 0 This keyboard is in great condition and works ...
2 2 AVA-VIV Blouse 1 Women/Tops & Blouses/Blouse Target 10.0 1 Adorable top with a hint of lace and a key hol...
3 3 Leather Horse Statues 1 Home/Home Décor/Home Décor Accents NaN 35.0 1 New with tags. Leather horses. Retail for [rm]...
4 4 24K GOLD plated rose 1 Women/Jewelry/Necklaces NaN 44.0 0 Complete with certificate of authenticity
5 5 Bundled items requested for Ruie 3 Women/Other/Other NaN 59.0 0 Banana republic bottoms, Candies skirt with ma...
6 6 Acacia pacific tides santorini top 3 Women/Swimwear/Two-Piece Acacia Swimwear 64.0 0 Size small but straps slightly shortened to fi...
7 7 Girls cheer and tumbling bundle of 7 3 Sports & Outdoors/Apparel/Girls Soffe 6.0 1 You get three pairs of Sophie cheer shorts siz...
8 8 Girls Nike Pro shorts 3 Sports & Outdoors/Apparel/Girls Nike 19.0 0 Girls Size small Plus green. Three shorts total.
9 9 Porcelain clown doll checker pants VTG 3 Vintage & Collectibles/Collectibles/Doll NaN 8.0 0 I realized his pants are on backwards after th...

Missing values

In [5]:
total = data.isnull().sum().sort_values(ascending=False)
missing_data = pd.concat([total], axis=1, keys=['Total'])
missing_data.head()
Out[5]:
Total
brand_name 632682
category_name 6327
item_description 4
shipping 0
price 0
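
As a hedged sketch, the same summary can also report the share of rows that are missing in each column. The toy frame and its values are made up for illustration; only the isnull/concat pattern is taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "name": ["shirt", "keyboard", "blouse", "statue"],
    "brand_name": [np.nan, "Razer", "Target", np.nan],
    "price": [10.0, 52.0, 10.0, 35.0],
})

total = toy.isnull().sum()           # number of missing values per column
percent = toy.isnull().mean() * 100  # share of rows with a missing value

missing_summary = (
    pd.concat([total, percent], axis=1, keys=["Total", "Percent"])
    .sort_values("Total", ascending=False)
)
print(missing_summary)
```

Showing the percentage next to the raw count makes it easier to judge how much of the data a later dropna() call would discard.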

"brand_name" is missing 632682 brands. This probably means that these items don't have any specific brand.

"category_name" is missing 6327 names.

"item_description" is missing 4 descriptions.

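Since we want to guess whether a product is luxury or not from its name, "name" is the only input feature we need. We also keep "brand_name" and "price" for now, because they are used to build the luxury labels.

The remaining columns can be dropped ("category_name" is dropped in a later step).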

In [6]:
drop_elements = ['train_id', 'item_condition_id', 'shipping', 'item_description']
data = data.drop(drop_elements, axis = 1)

Now our dataset looks like this:

In [7]:
data.head()
Out[7]:
name category_name brand_name price
0 MLB Cincinnati Reds T Shirt Size XL Men/Tops/T-shirts NaN 10.0
1 Razer BlackWidow Chroma Keyboard Electronics/Computers & Tablets/Components & P... Razer 52.0
2 AVA-VIV Blouse Women/Tops & Blouses/Blouse Target 10.0
3 Leather Horse Statues Home/Home Décor/Home Décor Accents NaN 35.0
4 24K GOLD plated rose Women/Jewelry/Necklaces NaN 44.0

Function to convert the "name" to lower case

In [8]:
def lower_case(s):
    # Return the lower-cased version of the string s
    return s.lower()
In [9]:
data['name'] = [lower_case(s) for s in data['name']]
In [10]:
data.head()
Out[10]:
name category_name brand_name price
0 mlb cincinnati reds t shirt size xl Men/Tops/T-shirts NaN 10.0
1 razer blackwidow chroma keyboard Electronics/Computers & Tablets/Components & P... Razer 52.0
2 ava-viv blouse Women/Tops & Blouses/Blouse Target 10.0
3 leather horse statues Home/Home Décor/Home Décor Accents NaN 35.0
4 24k gold plated rose Women/Jewelry/Necklaces NaN 44.0

Compute the mean price of each brand

In [11]:
mean_price = data.groupby('brand_name', as_index=False)['price'].mean() # without as_index=False, it returns a Series instead
mean_price.head(3)
Out[11]:
brand_name price
0 !iT Jeans 16.000000
1 % Pure 16.344262
2 10.Deep 17.333333
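
The effect of as_index=False, and the fact that groupby skips rows whose key is NaN by default, can be checked on a small example. The frame below is a made-up stand-in, not the real data.

```python
import numpy as np
import pandas as pd

# Invented toy data: one item has no brand
toy = pd.DataFrame({
    "brand_name": ["Nike", "Nike", np.nan, "Razer"],
    "price": [19.0, 21.0, 10.0, 52.0],
})

# Default: the result is a Series indexed by brand_name
as_series = toy.groupby("brand_name")["price"].mean()

# as_index=False: the result is a DataFrame with brand_name as a regular column
as_frame = toy.groupby("brand_name", as_index=False)["price"].mean()

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame)
```

The unbranded row does not appear in either result, which is why items without a brand end up with no average price after the later merge.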

Basic Statistics from the average price per brand

In [12]:
mean_price.describe()
Out[12]:
price
count 4809.000000
mean 26.606004
std 28.142298
min 0.000000
25% 13.000000
50% 18.333333
75% 28.481481
max 429.000000

Here's how I will segment the brands into 4 categories. The cut-offs are the quartiles of the average price per brand, rounded down to 13, 18 and 28:

  • For brands whose average price is below the first quartile (25%, i.e. below 13) I will apply category 1, that is, the cheapest category.
  • For brands whose average price is at least 13 and below 18 (roughly the median) I will apply category 2.
  • For brands whose average price is at least 18 and below 28, roughly between the median (50%) and the third quartile (75%), I will apply category 3.
  • And for average prices of 28 or more I will apply category 4, the most expensive category.
In [13]:
mean_price['Luxury'] = np.where(mean_price.price<13,1,
                       np.where(mean_price.price<18,2,
                       np.where(mean_price.price<28,3,4)))
In [14]:
mean_price.head(3)
Out[14]:
brand_name price Luxury
0 !iT Jeans 16.000000 2
1 % Pure 16.344262 2
2 10.Deep 17.333333 2
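
The nested np.where call can also be written with pd.cut. The sketch below is an alternative formulation, not the code used in this notebook; the sample prices are chosen by hand to sit on and around the cut-offs.

```python
import numpy as np
import pandas as pd

# Hand-picked prices on and around the cut-offs 13, 18 and 28
prices = pd.Series([0.0, 12.99, 13.0, 17.33, 18.0, 27.9, 28.0, 429.0])

# Same rule as in the notebook
nested = np.where(prices < 13, 1,
         np.where(prices < 18, 2,
         np.where(prices < 28, 3, 4)))

# Equivalent binning: right=False makes each bin closed on the left,
# i.e. [-inf, 13), [13, 18), [18, 28), [28, inf)
binned = pd.cut(prices,
                bins=[-np.inf, 13, 18, 28, np.inf],
                labels=[1, 2, 3, 4],
                right=False).astype(int)

print(list(nested))
print(list(binned))
```

With right=False a price of exactly 13, 18 or 28 falls into the higher category, which matches the strict "<" comparisons in the np.where version.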

Merge the two datasets: the main dataset and the groupby dataset

In [15]:
final_data = data.merge(mean_price, on='brand_name', how='left')

Rename the columns on the final dataset

In [16]:
final_data.rename(columns={'name': 'Name',
                     'category_name': 'Category',
                     'brand_name': 'Brand',
                     'price_x': 'Price',
                     'price_y': 'Average Price' }, inplace=True)
final_data.head(2)
Out[16]:
Name Category Brand Price Average Price Luxury
0 mlb cincinnati reds t shirt size xl Men/Tops/T-shirts NaN 10.0 NaN NaN
1 razer blackwidow chroma keyboard Electronics/Computers & Tablets/Components & P... Razer 52.0 45.021277 4.0
In [17]:
final_data = final_data.drop(['Category', 'Brand', 'Price', 'Average Price'], axis=1)

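We need to delete all rows that contain NaN, that is, the items with no brand and therefore no Luxury label. If we don't, the next steps will fail, because NaN cannot be converted to an integer label.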

In [18]:
final_data = final_data.dropna()
final_data.head(3)
Out[18]:
Name Luxury
1 razer blackwidow chroma keyboard 4.0
2 ava-viv blouse 2.0
6 acacia pacific tides santorini top 4.0

Change the Luxury column datatype from float to integer

In [19]:
final_data['Luxury'] = final_data['Luxury'].astype('int64')
In [20]:
final_data.head(3)
Out[20]:
Name Luxury
1 razer blackwidow chroma keyboard 4
2 ava-viv blouse 2
6 acacia pacific tides santorini top 4
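
The merge, dropna and astype steps can be reproduced on a toy example. The two frames below are invented stand-ins for data and mean_price; they are meant to show why the merged frame has price_x and price_y columns and why Luxury is a float before the cast.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the item table
items = pd.DataFrame({
    "name": ["mlb t shirt", "razer keyboard"],
    "brand_name": [np.nan, "Razer"],
    "price": [10.0, 52.0],
})

# Invented stand-in for the per-brand table (no row for missing brands)
brands = pd.DataFrame({
    "brand_name": ["Razer"],
    "price": [45.0],
    "Luxury": [4],
})

# Both frames have a "price" column, so pandas adds the suffixes _x and _y
merged = items.merge(brands, on="brand_name", how="left")
print(merged.columns.tolist())

# The unbranded item has no match, so Luxury is NaN and the column becomes float
print(merged["Luxury"].dtype)

# Drop unlabeled rows first, then the cast to integer is safe
labeled = merged[["name", "Luxury"]].dropna()
labeled["Luxury"] = labeled["Luxury"].astype("int64")
print(labeled)
```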

We will use CountVectorizer to convert text into a matrix of token counts

In [21]:
vectorizer = CountVectorizer()
x = vectorizer.fit_transform(final_data['Name'])
In [22]:
# LabelEncoder maps the Luxury labels (1 to 4) to consecutive integers (0 to 3)
encoder = LabelEncoder()
y = encoder.fit_transform(final_data['Luxury'])
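
To make the two transformations concrete, here is a small sketch on three item names taken from the table above and a hand-written list of labels. The variable names are hypothetical and separate from the ones used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

names = [
    "razer blackwidow chroma keyboard",
    "ava-viv blouse",
    "mlb cincinnati reds t shirt size xl",
]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(names)  # sparse matrix: one row per name

# The default tokenizer keeps only tokens of two or more word characters,
# so "t" is dropped and "ava-viv" is split into "ava" and "viv"
print(sorted(toy_vectorizer.vocabulary_))
print(counts.shape)

# Hand-written labels for illustration
toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform([4, 2, 4, 1, 3])
print(list(encoded))
print(list(toy_encoder.inverse_transform([0, 3])))
```

inverse_transform is what lets predict_cat, defined later, turn the model's encoded output back into the original category number.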

Train Test Split

In [23]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

The multinomial Naive Bayes classifier is suitable for classification with discrete features

In [24]:
nb = MultinomialNB()
nb.fit(x_train, y_train)
Out[24]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
In [25]:
nb.score(x_test, y_test)
Out[25]:
0.8774437992363403
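
For a classifier, score returns the mean accuracy, i.e. the fraction of test items whose predicted category equals the true one. The sketch below illustrates this on hand-written labels (not results from this notebook) and adds a confusion matrix, which shows which categories are mixed up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written true and predicted categories, for illustration only
y_true = np.array([1, 2, 3, 4, 4, 2])
y_pred = np.array([1, 2, 4, 4, 4, 3])

# Accuracy is the share of exact matches
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true categories, columns are predicted categories
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)
```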

Create a function that predicts the category of each item name

In [26]:
def predict_cat(Name):
    cat_names = {1 : 'Cheap', 
                 2 : 'Good Price', 
                 3 : 'Not Luxury', 
                 4 : 'Luxury'}
    cod = nb.predict(vectorizer.transform([Name]))
    return cat_names[encoder.inverse_transform(cod)[0]]

This means that:

  • The cheapest products, the ones that belong to category 1, are labeled as "Cheap";
  • Next comes "Good Price", meaning the product is still cheap, but pricier than category 1;
  • Then we have "Not Luxury", the third category;
  • And finally the most expensive category, "Luxury".

Let's see if we can predict the category from the example given earlier.

Remember we had:

  • Vince long sleeve turtleneck sweater
  • St. John's Bay long sleeve turtleneck sweater

If our model predicts correctly, the "Vince" sweater, being the more expensive of the two, will be assigned a higher category than the St. John's Bay sweater.

In [27]:
print('vince sweater belongs to category:',predict_cat("vince turtleneck sweater"))
print('st.john bay sweater belongs to category:',predict_cat("st.john bay turtleneck sweater"))
vince sweater belongs to category: Luxury
st.john bay sweater belongs to category: Good Price
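
As a self-contained sketch of the same idea, the toy example below trains on six invented item names with two categories and also prints the class probabilities from predict_proba. The names, labels and variable names are made up, and the probabilities say nothing about the real model. Here the labels are passed to the classifier directly, without a LabelEncoder, because scikit-learn classifiers return predictions in the original label values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: category 4 = "Luxury", category 1 = "Cheap"
toy_names = [
    "vince cashmere sweater", "vince silk blouse", "vince wool coat",
    "basic cotton tee", "basic cotton socks", "basic plain tee",
]
toy_labels = [4, 4, 4, 1, 1, 1]

toy_vec = CountVectorizer()
toy_nb = MultinomialNB()
toy_nb.fit(toy_vec.fit_transform(toy_names), toy_labels)

query = toy_vec.transform(["vince turtleneck sweater"])
pred = toy_nb.predict(query)[0]
proba = toy_nb.predict_proba(query)[0]

# Columns of predict_proba follow the order of toy_nb.classes_
for label, p in zip(toy_nb.classes_, proba):
    print(label, round(p, 3))
print("predicted category:", pred)
```

The word "turtleneck" never appears in the training names, so the vectorizer ignores it; the prediction rests on "vince" and "sweater" only.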