An analysis of the Top 4000 most backed projects ever on Kickstarter¶

Francisca Dias

Table of Contents¶

Introduction

Overview

Libraries

Data Cleaning

Exploring Kickstarter Statistics

Natural Language Insights on description and title

Introduction ¶

In this analysis I explore the 4,000 most backed projects ever on Kickstarter, looking for interesting insights and trends.

Do these entrepreneurs start more than one successful project?

Which industry is the most successful on Kickstarter, and which is the least successful?

Which project has raised the most pledged money so far?

Are there words that appear more often than others in titles and descriptions?

I will answer these and many other questions in this analysis.

Overview ¶

Kickstarter is a platform for millions of creators to present their innovative ideas to the public.

This is a win-win situation: creators can raise initial funding, while the public gets access to cutting-edge prototype products that are not yet available on the market.

This dataset can be found here.

It consists of the following features:

amt.pledged: amount pledged
blurb: short project description
by: author
category: project category (string)
currency: currency type of amount pledged
goal: original pledge goal
location
num.backers: total number of backers
num.backers.tier: number of backers in each pledge tier
pledge.tier: pledge tiers in USD
title
url

Libraries ¶

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('fivethirtyeight')

from wordcloud import WordCloud, STOPWORDS

import warnings
warnings.filterwarnings("ignore")

import re

df = pd.read_csv('most_backed.csv')
df.head(2)

df.columns

Index(['Unnamed: 0', 'amt.pledged', 'blurb', 'by', 'category', 'currency',
       'goal', 'location', 'num.backers', 'num.backers.tier', 'pledge.tier',
       'title', 'url'],
      dtype='object')

Data Cleaning ¶

In this section I will:

rename the columns,

check the data types,

list all the currencies represented in this dataset,

build a data frame with the exchange rates between USD and the other currencies,

convert the pledged amount and the goal to a single currency, USD,

normalize the columns that contain text, that is, blurb and title,

select the columns that I will use from now on and discard the ones I do not need for my analysis.

df.rename(columns={'amt.pledged': 'amt_pledged',
                   'end.time': 'end_time',  # not present in this file; rename() ignores missing keys
                   'num.backers': 'num_backers',
                   'num.backers.tier': 'num_backers_tier',
                   'pledge.tier': 'pledge_tier'}, inplace=True)

df.dtypes

Unnamed: 0            int64
amt_pledged         float64
blurb                object
by                   object
category             object
currency             object
goal                float64
location             object
num_backers           int64
num_backers_tier     object
pledge_tier          object
title                object
url                  object
dtype: object

What is the most common currency on Kickstarter?

df['currency'].value_counts()

usd    3438
gbp     252
cad     128
eur      96
aud      52
sek      14
nzd      10
dkk       7
chf       3
Name: currency, dtype: int64

The exchange rates as of November 22, 2017

# conversion rates for each currency, converted amount in USD

currencies = ['usd', 'sek', 'nzd', 'gbp', 'eur', 'dkk', 'chf', 'cad', 'aud']
conversion = [1.00, 0.11, 0.68, 1.32, 1.17, 0.15, 1.01, 0.78, 0.75 ]

Convert the exchange rate between USD and other currencies to a dataframe

# pd.DataFrame.from_items was removed in pandas 1.0; a dict builds the same table
exchange_rate_table = pd.DataFrame({'currency': currencies,
                                    'conversion_rate': conversion})
exchange_rate_table

Merge the currency dataframe with the dataset

final_df = df.merge(exchange_rate_table, on='currency')

final_df.head(2)

Converting all amounts to one common currency

# Convert amt_pledged and goal

final_df['amount_pledged_usd'] = final_df.amt_pledged * final_df.conversion_rate

final_df['goal_usd'] = final_df.goal * final_df.conversion_rate

final_df.columns

Index(['Unnamed: 0', 'amt_pledged', 'blurb', 'by', 'category', 'currency',
       'goal', 'location', 'num_backers', 'num_backers_tier', 'pledge_tier',
       'title', 'url', 'conversion_rate', 'amount_pledged_usd', 'goal_usd'],
      dtype='object')
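As a cross-check on the merge-based conversion above, the same result can be sketched with a plain dictionary lookup. This is a minimal sketch on made-up rows, not the notebook's data: `toy` is a hypothetical frame and `rates` reuses three of the rates listed earlier. Unlike an inner merge, which silently drops rows whose currency is missing from the rate table, `map` leaves a NaN that can be checked explicitly.

```python
import pandas as pd

# Hypothetical sample rows; the amounts are illustrative only
toy = pd.DataFrame({
    'currency': ['usd', 'gbp', 'eur'],
    'amt_pledged': [100.0, 200.0, 300.0],
})

# Subset of the USD conversion rates used in this notebook
rates = {'usd': 1.00, 'gbp': 1.32, 'eur': 1.17}

# Look up each row's rate; an unknown currency would become NaN
toy['conversion_rate'] = toy['currency'].map(rates)

# Fail loudly if any currency has no rate
missing = toy.loc[toy['conversion_rate'].isna(), 'currency'].unique()
assert len(missing) == 0, f'No rate for: {missing}'

# Convert the pledged amount to USD
toy['amount_pledged_usd'] = toy['amt_pledged'] * toy['conversion_rate']
print(toy)
```

The explicit check is the main design point: with only nine currencies it is cheap, and it guards against losing projects without noticing.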

Function to process the text in our dataset

def normalize_text(s):
    s = s.lower()

    # remove punctuation that is not word-internal (hyphens and apostrophes inside words are kept)
    # raw strings avoid invalid-escape warnings in recent Python versions
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)

    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)

    return s
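To make the behaviour of the function concrete, here is a small self-contained sketch that repeats the function and applies it to one title and one blurb fragment taken from the first rows of the dataset. It shows that punctuation next to a space is dropped while the word-internal hyphen in "high-quality" survives.

```python
import re

def normalize_text(s):
    # Same steps as the function above, repeated so this block runs on its own
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)   # whitespace followed by a non-word character
    s = re.sub(r'\W\s', ' ', s)   # non-word character followed by whitespace
    s = re.sub(r'\s+', ' ', s)    # collapse runs of whitespace
    return s

title = normalize_text('Fidget Cube: A Vinyl Desk Toy')
blurb = normalize_text('An unusually addicting, high-quality desk toy')

print(title)  # fidget cube a vinyl desk toy
print(blurb)  # an unusually addicting high-quality desk toy
```

Note that punctuation at the very end of a string is not followed by whitespace, so it is left in place by these patterns.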

Clean Blurb

final_df['clean_blurb'] = [normalize_text(s) for s in final_df['blurb']]

Clean Title

final_df['clean_title'] = [normalize_text(s) for s in final_df['title']]

Define the columns that I need and discard the ones I will not need from now

cols_to_use = ['amount_pledged_usd', 'goal_usd', 'clean_blurb', 
            'clean_title', 'by', 'category', 'location', 'num_backers']

clean_df = final_df[cols_to_use]

Exploring Kickstarter Statistics ¶

This is a rich dataset, and many insights can be drawn from it.

I will narrow the analysis down to a few questions that I would like to have answered by the end.

Do entrepreneurs commonly start more than one successful project, or is that rare?

What types of products are sold on this platform, and which industry is the most successful on Kickstarter?

Are entrepreneurs successful after these crowdfunding campaigns?

Plus any interesting findings that come up during the analysis!

In this dataset we have projects from all around the world.

Fortunately we also have the cities where these projects come from, so below are the ten cities with the most projects among the most backed projects ever on Kickstarter.

plt.subplots(figsize=(11,6))
# recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='location', data=clean_df, palette='inferno', edgecolor=sns.color_palette('dark', 7), order=clean_df['location'].value_counts()[:10].index)
plt.xticks(rotation=90)
plt.title('Top Locations with more projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Location')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()
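The numbers behind a count plot like the one above can also be read directly from `value_counts`, which is the same call that supplies the `order` argument. The following is a minimal sketch on a hypothetical list of locations, not the real dataset.

```python
import pandas as pd

# Hypothetical locations; the real column is clean_df['location']
locations = pd.Series([
    'San Francisco, CA', 'San Francisco, CA', 'San Francisco, CA',
    'Los Angeles, CA', 'Los Angeles, CA',
    'London, UK',
])

# value_counts sorts by frequency, so head(n) gives the n most common values
top_locations = locations.value_counts().head(2)
print(top_locations)
```

Printing the table alongside the chart makes it easier to quote exact counts in the text.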

San Francisco leads the ranking, followed by Los Angeles and New York. Seattle and London come next.

The top four cities by number of projects are all in the US.

In fact, if London were not on the list, at least the top nine would all be in the United States.

Which industry has the most projects?

plt.subplots(figsize=(11,6))
# recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='category', data=clean_df, palette='inferno', edgecolor=sns.color_palette('dark', 7), order=clean_df['category'].value_counts()[:10].index)
plt.xticks(rotation=90)
plt.title('Top Industry with more backed projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Industry')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()

There are 115 types of industry represented here.

Among those 115, Product Design is number one, followed by Tabletop Games and Video Games. Hardware and Technology come next.

Which industries have the fewest projects?

Below are 12 of the least represented industries.

Please note the y axis: all of these industries have just one project each.

plt.subplots(figsize=(11,6))
# recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='category', data=clean_df, order=clean_df['category'].value_counts()[-12:].index, palette='RdBu')
plt.xticks(rotation=90)
plt.title('Top Industry with less backed projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Industry')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()

More than one project ratio

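There are 461 authors with more than one project.

That is 11.5% when measured against the 4,000 projects in the dataset. Because the calculation below divides by the number of projects rather than by the number of unique authors, the share of authors with more than one project is somewhat higher.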

author_more_one_project = len(clean_df['by'].value_counts()[clean_df['by'].value_counts()>1])

author_more_one_project

461

percent = (author_more_one_project/len(clean_df))*100
percent

11.525
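The ratio above uses the number of projects as its denominator. A minimal sketch of the alternative, dividing by the number of unique authors, is shown below on a hypothetical list of author names; the letters stand in for the `by` column.

```python
import pandas as pd

# Hypothetical authors: A has 2 projects, C has 3, B and D have 1 each
authors = pd.Series(['A', 'A', 'B', 'C', 'C', 'C', 'D'])

# Number of projects per author
projects_per_author = authors.value_counts()

# Authors with more than one project
repeat_authors = int((projects_per_author > 1).sum())

# Two possible denominators give two different percentages
share_of_projects = repeat_authors / len(authors) * 100
share_of_authors = repeat_authors / authors.nunique() * 100

print(repeat_authors)               # 2
print(round(share_of_projects, 1))  # 28.6
print(round(share_of_authors, 1))   # 50.0
```

Which denominator is appropriate depends on the question: "what share of authors come back?" calls for the number of unique authors.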

Top authors with more projects

pd.set_option('display.float_format', lambda x: '%.3f' % x)

plt.subplots(figsize=(11,6))
# recent seaborn versions require the column to be passed by keyword (x=...)
sns.countplot(x='by', data=clean_df, order=clean_df['by'].value_counts()[:10].index, palette='RdBu')
plt.xticks(rotation=90)
plt.title('Top Authors with more projects', fontsize=20)
# authors are on the x axis and project counts on the y axis
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Author')
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.show()

As the graph above shows, each of the top 10 entrepreneurs by number of projects has more than 8 projects.

CoolMiniOrNot leads the ranking with an impressive number of projects: 24.

According to the web, this company is the Internet's largest gallery of painted miniatures, with a large repository of how-to articles on miniature painting.

So far the analysis has been based on absolute counts: we have not yet looked at the amount pledged or the number of backers averaged by category.

Let's do this!

Average pledge by each category

average_pledge_each_category = pd.pivot_table(clean_df, index= 'category', values= "amount_pledged_usd")

average_pledge_each_category.sort_values('amount_pledged_usd', ascending=False)

Who would have guessed that Television is the category with the highest average pledge?

On closer inspection it makes sense: there is just one project in this category, and the amount pledged was 5,764,229 USD.

final_df[final_df['category'] == 'Television']

Therefore the winner is:

Joel Gordon Hodgson

According to Wikipedia, Joel Gordon Hodgson is an American writer, comedian and television actor.

He is best known for creating Mystery Science Theater 3000 and starring in it as the character Joel Robinson.

final_df[final_df['category'] == 'Music Videos']

The least "fortunate", in the category with the lowest average pledge, is:

Donnalou Stevens

According to Google, she is a singer with a couple of hits already.

Average backers by each category

What do backers like the most?

# pivot_table averages by default; the result is ordered alphabetically by category
average_backers_each_category = pd.pivot_table(clean_df, index='category', values='num_backers')

average_backers_each_category

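Note that this table is ordered alphabetically by category, not by number of backers, so its first and last rows are not the most and least backed categories.

3D Printing, the first row, attracts about 3,394 backers per project on average. Academic and Accessories, the next two rows, average 1,461 and about 3,098.

Among the rows displayed, Television (48,270), Action (17,713) and Web (about 9,830) have the highest averages. World Music, Young Adult and Zines, the last three rows, average about 6,519, 1,199 and 1,153.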

final_df[final_df['category'] == 'Young Adult']
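When ranking categories by an average, it helps to sort explicitly and to show how many projects each average is based on, since a category with a single project can dominate the ranking. The following is a minimal sketch on a hypothetical frame; the category names come from the dataset but the rows and the Video Games figures are illustrative only.

```python
import pandas as pd

# Hypothetical rows standing in for clean_df
toy = pd.DataFrame({
    'category': ['Television', 'Zines', 'Video Games', 'Video Games'],
    'num_backers': [48270, 1153, 6000, 5000],
})

# Mean backers and number of projects per category, highest mean first
summary = (toy.groupby('category')['num_backers']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

The `count` column makes single-project categories easy to spot before drawing conclusions from their averages.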

Natural Language Insights on description and title ¶

Since we previously cleaned the title and blurb with the normalize_text function, we are now ready to extract some insights from both features.

I will create a new dataframe containing only the text columns of the dataset.

I will count the words in both the title and the blurb.

As expected, the blurb (that is, the description) has more words than the title.

On average, a description has about 21 words. The most common are: game, new, world, first, card.

On average, a title has about 6 words. The most common words here are: game, card, smart, first, playing.

final_df.clean_blurb.head(2)

0     this is a card game for people who are into k...
1     an unusually addicting high-quality desk toy ...
Name: clean_blurb, dtype: object

final_df.clean_title.head(2)

0               exploding kittens
1    fidget cube a vinyl desk toy
Name: clean_title, dtype: object

Create a new dataframe : title_and_blurb_analysis dataframe

cols_to_analyse = ['category', 'clean_blurb', 'clean_title']

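# .copy() makes this an independent dataframe, so adding columns below does not trigger SettingWithCopyWarning
title_and_blurb_analysis = final_df[cols_to_analyse].copy()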

title_and_blurb_analysis.head()

Blurb analysis

# Count the number of characters in each blurb
title_and_blurb_analysis['char_count_blurb'] = title_and_blurb_analysis['clean_blurb'].str.len()

# Split each blurb into a list of words
title_and_blurb_analysis['words_blurb'] = title_and_blurb_analysis['clean_blurb'].str.split(' ')

# Count the number of words in each blurb
title_and_blurb_analysis['word_count_blurb'] = title_and_blurb_analysis['words_blurb'].str.len()

Average number of words on description

title_and_blurb_analysis.word_count_blurb.mean()

20.7705
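One detail worth checking in a word count like this: the cleaned blurbs shown earlier start with a space, and splitting on a single space turns a leading space into an empty first token. The sketch below, on two strings modelled on the first rows of the dataset, compares `str.split(' ')` with `str.split()`, which splits on any run of whitespace and discards empty tokens.

```python
import pandas as pd

# One text with a leading space (like the cleaned blurbs) and one without
texts = pd.Series([' this is a card game', 'exploding kittens'])

# Splitting on a single space keeps an empty string for the leading space
counts_single_space = texts.str.split(' ').str.len()

# Splitting on whitespace runs ignores leading and trailing spaces
counts_whitespace = texts.str.split().str.len()

print(counts_single_space.tolist())  # [6, 2]
print(counts_whitespace.tolist())    # [5, 2]
```

If the averages reported here were recomputed with `str.split()`, they could come out slightly lower for texts that begin or end with a space.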

Join all descriptions into a single string

# Join with a space so the last word of one blurb does not fuse with the first word of the next
string1 = ' '.join(title_and_blurb_analysis['clean_blurb'])

WordCloud on Blurb: the most common words in descriptions

wordcloud1 = WordCloud(stopwords=STOPWORDS,
                       background_color='#007599',
                        max_words=30
                        ).generate(string1)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()
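A word cloud shows relative frequency visually; to list the most common words with their counts, a plain counter is enough. This is a minimal sketch on three made-up blurbs, and `stopwords` here is a small hypothetical set rather than the `STOPWORDS` collection from the wordcloud package.

```python
from collections import Counter

# Hypothetical cleaned blurbs
blurbs = [
    'a card game for people',
    'a new card game',
    'the first smart game',
]

# Small illustrative stop-word set (an assumption, not wordcloud's STOPWORDS)
stopwords = {'a', 'for', 'the'}

# Count every word that is not a stop word
word_counts = Counter(
    word
    for blurb in blurbs
    for word in blurb.split()
    if word not in stopwords
)

print(word_counts.most_common(2))  # [('game', 3), ('card', 2)]
```

Exact counts make statements such as "the most common words are ..." easy to verify against the picture.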

Title analysis

# Count the number of characters in each title
title_and_blurb_analysis['char_count_title'] = title_and_blurb_analysis['clean_title'].str.len()

# Split each title into a list of words
title_and_blurb_analysis['words_title'] = title_and_blurb_analysis['clean_title'].str.split(' ')

# Count the number of words in each title
title_and_blurb_analysis['word_count_title'] = title_and_blurb_analysis['words_title'].str.len()

Average number of words on title

title_and_blurb_analysis.word_count_title.mean()

5.94625

Join all titles into a single string

# Join with a space so the last word of one title does not fuse with the first word of the next
string2 = ' '.join(title_and_blurb_analysis['clean_title'])

WordCloud on Title: the most common words in titles

wordcloud2 = WordCloud(stopwords=STOPWORDS,
                       background_color='#b30086',
                       max_words=30
                       ).generate(string2)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud2)
plt.axis('off')
plt.show()

	Unnamed: 0	amt.pledged	blurb	by	category	currency	goal	location	num.backers	num.backers.tier	pledge.tier	title	url
0	0	8782571.0	\nThis is a card game for people who are into ...	Elan Lee	Tabletop Games	usd	10000.0	Los Angeles, CA	219382	[15505, 202934, 200, 5]	[20.0, 35.0, 100.0, 500.0]	Exploding Kittens	/projects/elanlee/exploding-kittens
1	1	6465690.0	\nAn unusually addicting, high-quality desk to...	Matthew and Mark McLachlan	Product Design	usd	15000.0	Denver, CO	154926	[788, 250, 43073, 21796, 41727, 21627, 12215, ...	[1.0, 14.0, 19.0, 19.0, 35.0, 35.0, 79.0, 79.0...	Fidget Cube: A Vinyl Desk Toy	/projects/antsylabs/fidget-cube-a-vinyl-desk-toy

	amount_pledged_usd
category
Television	5764229.000
Gaming Hardware	2215906.667
World Music	785111.000
Sound	782088.588
3D Printing	748282.441
Typography	747961.000
Flight	698532.333
Narrative Film	632290.675
Action	630019.000
Sculpture	550389.410
Faith	538103.000
Architecture	517252.000
Space Exploration	496919.354
Web	491656.745
Wearables	446895.453
Mixed Media	445630.720
Drama	442313.000
Camera Equipment	423681.653
Technology	410683.455
Hardware	379238.890
Drinks	360616.327
Product Design	358966.076
Robots	348233.757
Food	308440.077
Science Fiction	303154.552
Photobooks	301918.350
Apparel	300264.534
Video Games	275573.504
Gadgets	273833.193
Tabletop Games	269771.701
...	...
Electronic Music	104204.304
Audio	103539.640
Fabrication Tools	102864.000
Software	101226.913
Shorts	100161.895
Nonfiction	99016.917
Anthologies	98088.618
Playing Cards	96679.680
Academic	91474.000
Comic Books	90432.828
Poetry	87370.060
Food Trucks	85470.000
Video	85176.000
Periodicals	84678.816
Apps	82937.071
Cookbooks	80451.360
Photo	70301.000
Journalism	66065.999
Conceptual Art	65955.280
Digital Art	65652.120
Jazz	60526.000
Zines	60431.000
Installations	58916.000
Calendars	58664.000
Fiction	58263.846
Literary Journals	53128.500
Vegan	52907.000
Stationery	47165.000
Young Adult	38257.830
Music Videos	33837.000

	num_backers
category
3D Printing	3394.130
Academic	1461.000
Accessories	3097.550
Action	17713.000
Animation	3650.837
Anthologies	2001.217
Apparel	2673.781
Apps	2589.353
Architecture	3575.000
Art	2786.600
Art Books	2110.971
Audio	1572.500
Calendars	1142.500
Camera Equipment	2300.115
Children's Books	3088.842
Childrenswear	2317.500
Chiptune	4187.333
Civic Design	2081.400
Classical Music	2324.500
Comedy	3628.533
Comic Books	1618.600
Comics	2674.810
Conceptual Art	1391.000
Cookbooks	2292.429
Country & Folk	1675.667
Crafts	3214.000
DIY Electronics	3326.452
Dance	4133.000
Design	2827.626
Digital Art	1383.000
...	...
Ready-to-wear	2052.333
Restaurants	1880.250
Robots	2363.722
Rock	2654.917
Romance	1334.000
Science Fiction	3717.400
Sculpture	1784.500
Shorts	1767.250
Small Batch	2817.750
Software	2539.037
Sound	4068.538
Space Exploration	6947.800
Spaces	1512.500
Stationery	1326.000
Tabletop Games	3316.513
Technology	3742.000
Television	48270.000
Theater	2093.600
Thrillers	2147.000
Typography	8609.000
Vegan	1302.000
Video	1183.000
Video Games	5725.193
Wearables	3175.172
Web	9830.077
Webcomics	2009.576
Webseries	2584.562
World Music	6518.500
Young Adult	1199.000
Zines	1153.000

	currency	conversion_rate
0	usd	1.00
1	sek	0.11
2	nzd	0.68
3	gbp	1.32
4	eur	1.17
5	dkk	0.15
6	chf	1.01
7	cad	0.78
8	aud	0.75

	category	clean_blurb	clean_title
0	Tabletop Games	this is a card game for people who are into k...	exploding kittens
1	Product Design	an unusually addicting high-quality desk toy ...	fidget cube a vinyl desk toy
2	Web	bring reading rainbow’s library of interactiv...	bring reading rainbow back for every child eve...
3	Narrative Film	updated this is it we're making a veronica ma...	the veronica mars movie project
4	Video Games	an adventure game from tim schafer double fin...	double fine adventure
