Francisca Dias
In this analysis I will explore the 4,000 most-backed projects ever on Kickstarter and look for interesting insights and trends.
Are these entrepreneurs starting more than one successful project?
Which industry is the most successful on Kickstarter, and which is the least successful?
Which project has attracted the most pledged money so far?
Are there words that appear more often than others in titles and descriptions?
I will answer these and many other questions in this analysis.
Kickstarter is a platform for millions of creators to present their innovative ideas to the public.
This is a win-win situation: creators can raise initial funding, while the public gets access to cutting-edge prototype products that are not yet available on the market.
This dataset can be found here.
It consists of the following features:
amt.pledged: amount pledged
by: author
category: project category (string)
currency: currency type of amount pledged
goal: original pledge goal
location
num.backers: total number of backers
num.backers.tier: number of backers per pledge tier
pledge.tier: pledge tiers in USD
title
url
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('fivethirtyeight')
from wordcloud import WordCloud, STOPWORDS
import warnings
warnings.filterwarnings("ignore")
import re
df = pd.read_csv('most_backed.csv')
df.head(2)
df.columns
df.rename(columns={'amt.pledged': 'amt_pledged',
'end.time': 'end_time',
'num.backers': 'num_backers',
'num.backers.tier': 'num_backers_tier',
'pledge.tier' : 'pledge_tier'}, inplace=True)
df.dtypes
What is the most common currency on Kickstarter?
df['currency'].value_counts()
The exchange rates as of November 22, 2017
# conversion rates for each currency, converted amount in USD
currencies = ['usd', 'sek', 'nzd', 'gbp', 'eur', 'dkk', 'chf', 'cad', 'aud']
conversion = [1.00, 0.11, 0.68, 1.32, 1.17, 0.15, 1.01, 0.78, 0.75 ]
Convert the exchange rate between USD and other currencies to a dataframe
# DataFrame.from_items was removed in pandas 1.0; a dict gives the same two columns
exchange_rate_table = pd.DataFrame({'currency': currencies,
                                    'conversion_rate': conversion})
exchange_rate_table
Merge the currency dataframe with the dataset
final_df = df.merge(exchange_rate_table, on='currency')
final_df.head(2)
Converting all amounts to one common currency
# Convert amt_pledged and goal
final_df['amount_pledged_usd'] = final_df.amt_pledged * final_df.conversion_rate
final_df['goal_usd'] = final_df.goal * final_df.conversion_rate
final_df.columns
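One thing worth checking at this step: `merge` performs an inner join by default, so a project whose currency is missing from the rate table would be dropped without any warning. The sketch below is only an illustration on a made-up three-row frame (the `toy` and `rates` names and all values are hypothetical, not taken from most_backed.csv). It uses `how='left'` with `indicator=True` to surface such rows before converting.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from most_backed.csv
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'currency': ['usd', 'gbp', 'jpy'],   # 'jpy' is deliberately absent from the rate table
    'amt_pledged': [100.0, 200.0, 300.0],
})
rates = pd.DataFrame({
    'currency': ['usd', 'gbp'],
    'conversion_rate': [1.00, 1.32],
})

# A left join keeps every project; the _merge column records whether a rate was found
merged = toy.merge(rates, on='currency', how='left', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'currency'].unique().tolist()

# Rows without a rate end up as NaN instead of disappearing
merged['amount_pledged_usd'] = merged['amt_pledged'] * merged['conversion_rate']
print(missing)
print(merged[['title', 'amount_pledged_usd']])
```

If `missing` is empty on the real data, the inner join used in the notebook loses nothing.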
Function to process the text in our dataset
def normalize_text(s):
    s = s.lower()
    # remove punctuation next to whitespace; word-internal hyphens and apostrophes are kept
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
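To see what this cleaning step does in practice, here is a small self-contained sketch that repeats the function and runs it on two made-up strings (the sample strings are assumptions, not rows from the dataset). It also shows a limitation: punctuation at the very end of a string has no whitespace after it, so it is left in place.

```python
import re

def normalize_text(s):
    s = s.lower()
    # drop punctuation that touches whitespace; word-internal characters survive
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # collapse any run of whitespace into a single space
    s = re.sub(r'\s+', ' ', s)
    return s

# Made-up sample strings for illustration
first = normalize_text("Smart Watch: an E-Paper display")
second = normalize_text("It's here - finally!")

print(first)   # smart watch an e-paper display
print(second)  # it's here finally!
```

The colon and the free-standing hyphen are removed, while the hyphen in "e-paper", the apostrophe in "it's" and the trailing "!" remain.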
Clean Blurb
final_df['clean_blurb'] = [normalize_text(s) for s in final_df['blurb']]
Clean Title
final_df['clean_title'] = [normalize_text(s) for s in final_df['title']]
Keep the columns that I need and discard the ones I will not use from now on
cols_to_use = ['amount_pledged_usd', 'goal_usd', 'clean_blurb',
'clean_title', 'by', 'category', 'location', 'num_backers']
clean_df = final_df[cols_to_use]
This is a rich dataset, and many insights can be drawn from it.
I will narrow the analysis down to a few questions that I would like to have answered by the end.
In this dataset we have projects from all around the world.
Fortunately, we also have the cities these projects come from, so below are the ten cities with the most projects among the most-backed projects ever on Kickstarter.
plt.subplots(figsize=(11,6))
# pass the column as x=: seaborn 0.12+ accepts only data positionally
sns.countplot(x='location', data=clean_df, palette='inferno', edgecolor=sns.color_palette('dark',7), order=clean_df['location'].value_counts()[:10].index)
plt.xticks(rotation=90, fontsize=20)
plt.title('Top 10 locations by number of projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Location')
plt.yticks(fontsize=20)
plt.show()
San Francisco leads the ranking, followed by LA and NY. Seattle and London come next.
The top four cities are all in the US.
In fact, if London were not in the list, at least the top nine would all be in the United States.
Which industry has the most projects?
plt.subplots(figsize=(11,6))
# pass the column as x=: seaborn 0.12+ accepts only data positionally
sns.countplot(x='category', data=clean_df, palette='inferno', edgecolor=sns.color_palette('dark',7), order=clean_df['category'].value_counts()[:10].index)
plt.xticks(rotation=90, fontsize=20)
plt.title('Top 10 industries by number of projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Industry')
plt.yticks(fontsize=20)
plt.show()
There are 115 industries represented here.
Among those 115, Product Design is number one, followed by Tabletop Games and Video Games. Hardware and Technology come next.
Which industries have the fewest projects?
Below are the 12 least-represented industries.
Please note the y axis: each of these industries has just one project.
plt.subplots(figsize=(11,6))
# pass the column as x=: seaborn 0.12+ accepts only data positionally
sns.countplot(x='category', data=clean_df, order=clean_df['category'].value_counts()[-12:].index, palette='RdBu')
plt.xticks(rotation=90, fontsize=20)
plt.title('Industries with the fewest projects', fontsize=20)
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Industry')
plt.yticks(fontsize=20)
plt.show()
More than one project ratio
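There are 461 authors with more than one project.
That is 11.5% when measured against the roughly 4,000 projects in the dataset. Note that the calculation divides by the number of projects, not by the number of distinct authors.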
author_more_one_project = len(clean_df['by'].value_counts()[clean_df['by'].value_counts()>1])
author_more_one_project
percent = (author_more_one_project/len(clean_df))*100
percent
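The percentage above divides by the number of rows, that is, by projects. If the share of distinct authors is wanted instead, the denominator would be the number of unique values in `by`. A minimal sketch of the difference on a made-up author column (the names and counts are hypothetical):

```python
import pandas as pd

# Toy author column, made up for illustration
by = pd.Series(['ann', 'ann', 'bob', 'cat', 'cat', 'cat', 'dan'])

projects_per_author = by.value_counts()
repeat_authors = int((projects_per_author > 1).sum())   # authors with 2+ projects

share_of_projects = repeat_authors / len(by) * 100      # denominator: rows (projects)
share_of_authors = repeat_authors / by.nunique() * 100  # denominator: distinct authors

print(repeat_authors, round(share_of_projects, 1), round(share_of_authors, 1))
```

The two figures can differ noticeably when a few authors own many projects, so it helps to state which denominator is meant.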
Authors with the most projects
pd.set_option('display.float_format', lambda x: '%.3f' % x)
plt.subplots(figsize=(11,6))
# pass the column as x=: seaborn 0.12+ accepts only data positionally
sns.countplot(x='by', data=clean_df, order=clean_df['by'].value_counts()[:10].index, palette='RdBu')
plt.xticks(rotation=90, fontsize=20)
plt.title('Top 10 authors by number of projects', fontsize=20)
# authors are on the x axis and counts on the y axis
plt.ylabel('Number of Kickstarter Projects')
plt.xlabel('Author')
plt.yticks(fontsize=20)
plt.show()
As the graph above shows, each of the top 10 entrepreneurs has more than 8 projects.
CoolMiniOrNot leads the ranking with an impressive 24 projects.
According to the web, this company is the Internet's largest gallery of painted miniatures, with a large repository of how-to articles on miniature painting.
So far the analysis has been based on absolute counts; that is, we have not yet looked at the amount pledged or the number of backers averaged by category.
Let's do this!
Average pledge by each category
average_pledge_each_category = pd.pivot_table(clean_df, index= 'category', values= "amount_pledged_usd")
average_pledge_each_category.sort_values('amount_pledged_usd', ascending=False)
Who would have guessed that Television has the highest average pledge?
On closer inspection, it makes sense: there is just one project in this category, and the amount pledged was 5,764,229.
final_df[final_df['category'] == 'Television']
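An average computed from a single project can dominate a ranking like this one, and showing the project count next to the mean makes that visible. The sketch below uses made-up numbers and `groupby(...).agg` rather than the `pivot_table` call used above; the `toy` and `summary` names are hypothetical.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'category': ['Television', 'Product Design', 'Product Design', 'Product Design'],
    'amount_pledged_usd': [5_000_000.0, 100_000.0, 200_000.0, 300_000.0],
})

# Mean pledge and number of projects per category, highest mean first
summary = (toy.groupby('category')['amount_pledged_usd']
              .agg(['mean', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```

A category with `count` equal to 1 at the top of the table is a hint that the average reflects one project rather than the category as a whole.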
Therefore the winner is:
Joel Gordon Hodgson
According to Wikipedia, Joel Gordon Hodgson is an American writer, comedian and television actor.
He is best known for creating Mystery Science Theater 3000 and starring in it as the character Joel Robinson.
final_df[final_df['category'] == 'Music Videos']
The least "fortunate" is:
Donnalou Stevens
According to Google, she is a singer with a couple of hits already.
Average backers by each category
What do backers like the most?
# named for what it holds: the average number of backers per category
average_backers_each_category = pd.pivot_table(clean_df, index='category', values='num_backers')
average_backers_each_category
Note that this table is in alphabetical order by category rather than ranked by number of backers.
The 3D Printing category comes first, attracting 3,394 backers on average.
Then come Academic and Accessories.
At the bottom of the table are World Music, Young Adult and Zines.
final_df[final_df['category'] == 'Young Adult']
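Because `pivot_table` returns the categories in alphabetical index order, ranking them by backers needs an explicit sort. A minimal sketch on made-up numbers, using `nlargest` and `nsmallest` on a per-category mean (the values and the `avg_backers` name are hypothetical):

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    'category': ['Zines', 'Academic', 'Video Games', 'Video Games', 'Academic'],
    'num_backers': [500, 1200, 9000, 7000, 800],
})

avg_backers = toy.groupby('category')['num_backers'].mean()
top = avg_backers.nlargest(2)      # categories with the highest average
bottom = avg_backers.nsmallest(1)  # category with the lowest average

print(top)
print(bottom)
```

On the real data the same two calls would answer which categories backers favour most and least on average.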
Since we previously cleaned the title and blurb with the normalize_text function, we are now ready to extract some insights from both features.
I will create a new dataframe containing only the text columns of the dataset.
I will count the words for both title and blurb.
As expected, the blurb (that is, the description) has more words than the title.
On average, a description has 20 words. The most common are: game, new, world, first, card.
On average, a title has no more than 5 words. The most common words here are: game, card, smart, first, playing.
final_df.clean_blurb.head(2)
final_df.clean_title.head(2)
Create a new dataframe : title_and_blurb_analysis dataframe
cols_to_analyse = ['category', 'clean_blurb', 'clean_title']
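# .copy() so that adding columns below modifies an independent frame, not a view of final_df
title_and_blurb_analysis = final_df[cols_to_analyse].copy()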
title_and_blurb_analysis.head()
Blurb analysis
# Count the number of characters
title_and_blurb_analysis['char_count_blurb'] = title_and_blurb_analysis['clean_blurb'].str.len()
# Split each blurb into a list of words
title_and_blurb_analysis['words_blurb'] = title_and_blurb_analysis['clean_blurb'].str.split(' ')
# Count the number of words
title_and_blurb_analysis['word_count_blurb'] = title_and_blurb_analysis['words_blurb'].str.len()
Average number of words in the description
title_and_blurb_analysis.word_count_blurb.mean()
Join all descriptions into a single string
class1 = title_and_blurb_analysis['clean_blurb'].tolist()
# join with a space so the last word of one blurb does not fuse with the first word of the next
string1 = ' '.join(class1)
WordCloud on Blurb: the most common words in descriptions
wordcloud1 = WordCloud(stopwords=STOPWORDS,
background_color='#007599',
max_words=30
).generate(string1)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud1)
plt.axis('off')
plt.show()
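A word cloud shows relative frequency only through font size. For exact counts, a plain frequency table is a useful complement. The sketch below is a minimal illustration with `collections.Counter` on made-up blurbs; the `blurbs` list and the tiny `stop` set are assumptions (the `STOPWORDS` set from wordcloud is much larger).

```python
from collections import Counter

# Toy blurbs, made up for illustration; in the notebook these would come from clean_blurb
blurbs = ['a new card game', 'the first smart game', 'a new world']

# Tiny hand-picked stop list for the toy data (an assumption)
stop = {'a', 'the'}

# Count every non-stop word across all blurbs
counts = Counter(word for blurb in blurbs
                 for word in blurb.split()
                 if word not in stop)
print(counts.most_common(3))
```

Calling `most_common(n)` on the real text would give the same kind of ranking the word cloud suggests, with the counts attached.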
Title analysis
# Count the number of characters
title_and_blurb_analysis['char_count_title'] = title_and_blurb_analysis['clean_title'].str.len()
# Split each title into a list of words
title_and_blurb_analysis['words_title'] = title_and_blurb_analysis['clean_title'].str.split(' ')
# Count the number of words
title_and_blurb_analysis['word_count_title'] = title_and_blurb_analysis['words_title'].str.len()
Average number of words in the title
title_and_blurb_analysis.word_count_title.mean()
Join all titles into a single string
class2 = title_and_blurb_analysis['clean_title'].tolist()
# join with a space so the last word of one title does not fuse with the first word of the next
string2 = ' '.join(class2)
WordCloud on Title: the most common words on Titles
wordcloud2 = WordCloud(stopwords=STOPWORDS,
background_color='#b30086',
max_words=30
).generate(string2)
plt.figure(figsize=(10,8))
plt.imshow(wordcloud2)
plt.axis('off')
plt.show()