Francisca Dias
In this report I will explore the Projects on Indiegogo's crowdfunding platform that were listed as of October 2017.
I will introduce some dynamics in visualization where you can hover the mouse over the graphics and get instant information about that particular item.
I would like to answer the following questions:
These and other questions will be answered in this report.
Indiegogo is an international crowdfunding platform that offers funding. It allows people to solicit funds for an idea, charity, or start-up business.
Its business model is based on charging a 5% fee on contributions.
This dataset was taken from https://webrobots.io/indiegogo-dataset/.
The data collected is from 2017-10-15.
Below are the features that come in this dataset:
title: title of project
tagline: project's description
cached_collected_pledges_count: number of backers
balance: the investment made by backers so far
currency_code: the currency where the project is listed
amt_time_left: the time left for the project to be closed to funding
category_name: project's category
collected_percentage: the percentage collected regarding the project's goal
partner_name: the name of a corporation that has backed the project. It assumes null when there's none
in_forever_funding: some projects are continuously being funded
import pandas as pd
indie1 = pd.read_csv('Indiegogo001.csv')
indie2 = pd.read_csv('Indiegogo002.csv')
indie3 = pd.read_csv('Indiegogo003.csv')
indie4 = pd.read_csv('Indiegogo004.csv')
indie5 = pd.read_csv('Indiegogo005.csv')
indie6 = pd.read_csv('Indiegogo006.csv')
indie7 = pd.read_csv('Indiegogo007.csv')
frames = [indie1, indie2, indie3, indie4, indie5, indie6, indie7]
result = pd.concat(frames)
There are 216,283 rows in this dataset. We could assume that each represents a project, but as you will see in a while, it is not the case.
result.head()
Below are the data types.
Please notice that balance is considered as an object. This is because the currency symbol comes attached to the amount value. Therefore I will have to clean this feature and convert to numeric.
result.dtypes
Much of the time spent analysing and working with a dataset comes from making sure the data is ready for analysis.
By ready I mean that there are no nulls, all column types are correct, no duplicates, selecting the columns I will be working with, and so on.
Below is a description of all steps I have taken to clean this dataset:
Concat all files
Strip Column Balance so to remove all non-numeric characters
Convert Balance to numeric type (before was object)
Get the exchange rates for the 5 currencies here represented
convert Balance amount to a single currency: USD
Select columns I will be using
Rename those Columns
Columns I will be working with
Extract symbols from column balance
result['balance'] = result['balance'].map(lambda x: x.lstrip('/€$£ACEhNU,.'))
result['balance'] = result['balance'].str.replace(',', '')
result['balance'] = result['balance'].str.replace('.', '')
import numpy as np
result['balance'].astype(np.int64)
Steps I have taken here:
Make a new dataframe with the exchange rates as of November 26, 2017
Concat this new dataframe with the original one, matching the currency
Change the Balance data type, from object to numeric
Convert Balance to US Dollars so we normalize the amount for all projects
Select the colums I will be using
And rename those columns for better understanding
# exchange rates for each currency, converted amount in USD, as of November 26, 2017
currencies = ['USD', 'GBP', 'EUR', 'CAD', 'AUD']
conversion = [1.00, 1.33, 1.19, 0.78, 0.76 ]
exchange_rate = [ ('currency_code', currencies),
('conversion_rate', conversion) ]
exchange_rate_table = pd.DataFrame.from_items(exchange_rate)
exchange_rate_table
new_result = result.merge(exchange_rate_table, on='currency_code')
new_result['balance'] = new_result['balance'].apply(pd.to_numeric, errors='coerce')
# Convert Balance to US Dollars
new_result['Amount Pledged USD'] = new_result.balance * new_result.conversion_rate
cols_to_use = ['title', 'tagline', 'cached_collected_pledges_count',
'currency_code', 'amt_time_left', 'category_name',
'collected_percentage', 'partner_name', 'in_forever_funding', 'Amount Pledged USD']
new_result = new_result[cols_to_use]
new_result.rename(columns={'title': 'Title',
'tagline': 'Description',
'cached_collected_pledges_count': 'Numb of Backers',
'currency_code': 'Currency Code',
'amt_time_left' : 'Any time left',
'category_name': 'Category Name',
'collected_percentage' : 'Collected Percentage',
'partner_name' : 'Partner Name',
'in_forever_funding' : 'Forever Funding'}, inplace=True)
new_result.head()
I decided to make a special case for this problem since most datasets don't come clean.
And since they don't come clean, any duplicates in this dataset can and will ruin the analysis.
Duplicates can jeopardize our estimates. We have to be very careful, otherwise We will do an analysis that is biased and wrong.
We should think that each ID is unique, that is, for each ID corresponds ONLY ONE project.
If we look at the Title though, we will see that there are titles that repeat, but the ID is different. But it´s the same project.
We could have assumed that we were dealing with more than 200.000 projects, but when we remove the duplicates, we are left with only 70.000 projects.
Let us see these duplicates.
If we sort the initial dataframe by the Number of Backers, we can see that the project "Super Troopers 2" repeats 4 times, and the information is the same for all these rows. Therefore we should consider this a duplicate.
check_duplicates = new_result.sort_values(['Numb of Backers'], ascending=[False])
check_duplicates.head()
Remove duplicates
new_result.drop_duplicates('Title', inplace=True)
# new_result.to_csv('new_result.csv')
agg_funcs = {'Amount Pledged USD':np.sum,
'Numb of Backers':np.sum }
grouped = new_result.groupby('Category Name', as_index=False).agg(agg_funcs)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Create a new column with Amount Pledged per Million USD
grouped['Amount Pledged in Million USD'] = grouped['Amount Pledged USD'] / 1000000
grouped['Amount Pledged in Million USD'] = grouped['Amount Pledged in Million USD'].round()
Create a new column with Number of Backers per 100.000
grouped['Numb of Backers per 100.000'] = grouped['Numb of Backers'] / 100000
grouped['Numb of Backers per 100.000'] = grouped['Numb of Backers per 100.000'].round()
grouped.head()
# grouped.to_csv('grouped.csv')
currency_distribution = new_result.groupby(['Currency Code'])[['Currency Code']].count()
# currency_distribution.to_csv('currency.csv')
Here I will answer the questions that I asked in the beginning.
I will also introduce some interactive visualizations, that will make easier for someone to interpret the results.
Below are the first 5 rows in our dataset:
new_result.head()
Please notice that I will use here the dataset that has been modified to accomodate both Number of backers and amount pledge, respectively per 100.000 and per 1 US Million Dollars.
I also rounded the numbers, so we are left with integers.
from plotly.graph_objs import *
import plotly.offline as py
py.init_notebook_mode(connected=True)
trace1 = {
"x": ["5.0", "16.0", "65.0", "27.0", "3.0", "2.0", "1.0", "1.0", "12.0", "44.0", "25.0", "9.0", "129.0", "133.0", "19.0", "99.0", "100.0", "2.0", "53.0", "35.0", "122.0", "3.0", "2.0", "24.0", "0.0", "7.0", "29.0", "48.0", "94.0", "20.0", "5.0", "1.0", "12.0"],
"y": ["1.0", "2.0", "2.0", "1.0", "0.0", "0.0", "0.0", "0.0", "2.0", "4.0", "1.0", "1.0", "4.0", "17.0", "1.0", "5.0", "4.0", "0.0", "5.0", "5.0", "9.0", "0.0", "0.0", "1.0", "0.0", "0.0", "1.0", "1.0", "3.0", "2.0", "1.0", "0.0", "2.0"],
"marker": {
"autocolorscale": False,
"cauto": True,
"cmax": 32,
"cmin": 0,
"color": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", "32"],
"size": ["5.0", "16.0", "65.0", "27.0", "3.0", "2.0", "1.0", "1.0", "12.0", "44.0", "25.0", "9.0", "129.0", "133.0", "19.0", "99.0", "100.0", "2.0", "53.0", "35.0", "122.0", "3.0", "2.0", "24.0", "0.0", "7.0", "29.0", "48.0", "94.0", "20.0", "5.0", "1.0", "12.0"],
"sizemode": "area",
"sizeref": 0.11875,
},
"mode": "markers",
"name": "Numb of Backers per 100.000",
"text": ["Animal Rights", "Art", "Audio", "Camera Gear", "Comics", "Community Projects", "Creative Works", "Culture", "Dance & Theater", "Education", "Energy & Green Tech", "Environment", "Fashion & Wearables", "Film", "Food & Beverages", "Health & Fitness", "Home", "Human Rights", "Local Businesses", "Music", "Phones & Accessories", "Photography", "Podcasts, Blogs & Vlogs", "Productivity", "Spirituality", "Tabletop Games", "Tech & Innovation", "Transportation", "Travel & Outdoors", "Video Games", "Web Series & TV Shows", "Wellness", "Writing & Publishing"],
"textsrc": "FranciscaDias:28:774b15",
"type": "scatter",
"uid": "8f8dce"
}
data = Data([trace1])
layout = {
"autosize": True,
"hovermode": "closest",
"title": "Distribution of Amount Pledged and # Backers, per Category",
"xaxis": {
"autorange": True,
"range": [-32.5795788669, 150],
"title": "Amount Pledged in Million USD",
"type": "linear"
},
"yaxis": {
"autorange": False,
"range": [-8.5, 30],
"title": "Numb of Backers per 100.000",
"type": "linear"
}
}
fig = Figure(data=data, layout=layout)
py.iplot(fig, filename='bubble')
If you hover over the bubles, you can see not only the category for which they belong, but also the number of Backers per 100.000 and the project balance amount in US Million dollars.
Now let us take a look at the currency representation.
It is a good assumption that the projects listed in Canadian Dollars are probably from Canandian entrepreneurs.
Therefore I want to see what countries are reprented in this dataset and their distribution in terms of currency representation.
trace1 = {
"labels": ["AUD", "CAD", "EUR", "GBP", "USD"],
"labelssrc": "FranciscaDias:25:f35bf0",
"marker": {"colors": ["rgb(255, 255, 204)", "rgb(161, 218, 180)", "rgb(65, 182, 196)", "rgb(44, 127, 184)", "rgb(8, 104, 172)", "rgb(37, 52, 148)"]},
"name": "count",
"textfont": {"size": 14},
"type": "pie",
"uid": "ac71bc",
"values": ["912", "3673", "3958", "5403", "56545"],
"valuessrc": "FranciscaDias:25:ca40ef"
}
data = Data([trace1])
layout = {
"title": "Currency Distribution per Projects",
"autosize": True,
"hovermode": "closest"
}
fig = Figure(data=data, layout=layout)
py.iplot(fig, filename='pie')
print('There are',new_result['Numb of Backers'].sum(),'Backers in this dataset')
print('Each project has on average',round(new_result['Numb of Backers'].sum()/len(new_result)),'Backers')
print('Each Backers gives on average',round(new_result['Amount Pledged USD'].sum()/new_result['Numb of Backers'].sum()),'US Dollars for each Project')
projects_more_backers = new_result.sort_values(['Numb of Backers'], ascending=[False])
projects_more_backers.head()
Super Troopers 2!
This project received more support in terms of backers than any other project.
54554 Backers rushed into this investment. According to Wikipedia, Super Troopers 2 is an upcoming American crime comedy mystery film.
What is also popular is this Protective Case for Iphone X: 53133 backers want this product!
import plotly.graph_objs as go
data = [go.Bar(
x = ["Super Troopers 2", "Protective iPhone X", "Solar Roadways", "Con Man", "Nimuno Loops"],
y = ["54554", "53133", "50162", "46992", "42368"]
)]
layout = go.Layout(
title='Projects with more Backers',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='basic-bar')
According to Indiegogo's webiste, “Partner” is a corporation or other legal entity that includes approved Campaigns on its dedicated page (“Partner Page”) using the Service.
Let's see how many of these projects are backed by a corporation or partner, and which partner has the most projects.
3 % of all projects have a partner:
print('The percentage of projects that has a partner is:'
,round((len(new_result[new_result['Partner Name']!='null'])/len(new_result))*100), '%')
Who are they?
new_result['Partner Name'].value_counts().nlargest(10)
Top 5 Partners Profile
Indiegogo has this feature/program that is called Forever Funding. It means that backers can continuously fund a project without a deadline.
The Forever Funding model is for companies who’ve already met their funding goal and want to keep raising money.
What is their representation?
new_result['Forever Funding'].value_counts()
10% of all projects have this feature!
I just did an analysis on all projects that were on the platform Indiegogo in October this year.
I collected some interesting fidings: