EDA using R

Bharat Kulkarni
Nov 19, 2018

Before any of you ask why R and not Python, the answer is simple: I’m doing the Data Analyst Nanodegree on Udacity, and this particular project requires me to use R.

Though Udacity provided some datasets, I also had the option to use my own. So I decided to find one on Kaggle and ended up selecting the Kickstarter dataset. Why? Because for the past few years I’ve been involved in startups, and now, as an analyst, I want to explore them.

PS — I won’t be explaining the code.

Here’s the GitHub link.

Let’s start with loading some packages that are required to plot and make the process much easier.

library(ggplot2)
library(dplyr)
library(ggthemes)
library(rworldmap)
library(knitr)
library(tidyr)
library(gridExtra)
library(grid)

Load the dataset, because you have to load to work with it, that’s how the world works.

ks <- read.csv('ks-projects-201801.csv')
# types of variables
glimpse(ks)
Variables / Columns

Print first few rows, just to see the data, nothing fancy.

head(ks)
First few rows of data

Let’s see if there are any Null or NA values in the dataset.

# check for NA values
sapply(ks, function(x) sum(is.na(x)))
I want to put some joke here but I just can’t.
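One thing is.na() won’t catch: read.csv keeps blank fields in text columns as empty strings rather than NA. Here is a minimal sketch of counting both. The `toy` data frame and the `count_missing` helper are hypothetical stand-ins, not part of the original analysis, and the values are made up.

```r
# Toy stand-in for ks; the real columns and values will differ.
toy <- data.frame(
  name    = c("A", "", "C", NA),
  backers = c(10, NA, 3, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count NA values and, for character columns,
# empty strings as well.
count_missing <- function(x) {
  if (is.character(x)) sum(is.na(x) | x == "") else sum(is.na(x))
}

missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

If the columns were read as factors (the default in older versions of R), the `is.character` branch would not trigger, so this assumes character columns.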

Let’s delete the columns we don’t need and rename two of the remaining ones.

# drop columns 5, 7, 9 and 13 (including usd.pledged)
ks <- ks[, -c(5,7,9,13)]
# rename usd_pledged_real to usd_pledged, and usd_goal_real to usd_goal
colnames(ks)[10] <- "usd_pledged"
colnames(ks)[11] <- "usd_goal"

You don’t get any output for doing this; I wish it said “Done!” with a dong!
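Dropping and renaming by position works, but it breaks quietly if the column order ever changes. A sketch of doing the same thing by name, on a toy data frame (the columns here are a made-up subset, and `drop_cols` is a hypothetical name):

```r
# Toy data frame; column names are assumed for illustration only.
toy <- data.frame(
  name = c("A", "B"),
  currency = c("USD", "GBP"),
  goal = c(1000, 500),
  usd_pledged_real = c(1200, 100),
  usd_goal_real = c(1000, 650),
  stringsAsFactors = FALSE
)

# Drop columns by name instead of by position.
drop_cols <- c("currency", "goal")
toy <- toy[, !(names(toy) %in% drop_cols)]

# Rename by matching the old name, so the index never matters.
names(toy)[names(toy) == "usd_pledged_real"] <- "usd_pledged"
names(toy)[names(toy) == "usd_goal_real"] <- "usd_goal"

print(names(toy))
```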

That’s all the data cleaning I did. Let’s move on and plot some univariate graphs.

Most popular Main Category

# filter data
cat.freq <- ks %>%
group_by(main_category) %>%
summarise(count = n()) %>%
arrange(desc(count))
# order data
cat.freq$main_category <- factor(cat.freq$main_category, levels=cat.freq$main_category)
# plot
ggplot(aes(main_category, count), data = cat.freq)+
geom_bar(stat = 'identity')+
ggtitle("Projects by category")+
xlab("Project Category")+
ylab("Frequency")+
theme_minimal()+
theme(plot.title=element_text(hjust=0.5),
axis.text.x=element_text(size=10, angle=90))+
geom_text(aes(label = count, vjust = -0.2))
Journalism is at the bottom, mainly because of disruptors like PEW News.
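The "order data" step is what makes the bars come out sorted: ggplot2 draws a factor in the order of its levels, not the order of the rows. A small sketch of just that trick, using made-up counts:

```r
# Made-up counts, only to show the ordering mechanics.
toy <- data.frame(
  main_category = c("Music", "Film & Video", "Dance"),
  count = c(30, 50, 10),
  stringsAsFactors = FALSE
)

# Sort rows by count, largest first.
toy <- toy[order(-toy$count), ]

# Fix the factor levels to the sorted row order, so a plot keeps that order.
toy$main_category <- factor(toy$main_category, levels = toy$main_category)

print(levels(toy$main_category))
```

Without the factor() call, the x axis would fall back to alphabetical order.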

Second plot — Number of projects by year

# filter data
ks.year <- ks %>%
mutate(year = format(as.Date(launched, "%Y-%m-%d %H:%M:%S"), "%Y"))%>%
filter(!year %in% c('1970')) %>%
group_by(year) %>%
summarise(count = n())
#plot
ggplot(aes(year, count, fill = count), data = ks.year)+
geom_bar(stat = "identity")+
ggtitle("Project by Year")+
xlab("Year")+
ylab("Frequency")+
theme_minimal()+
geom_text(aes(label = count, vjust = -0.3))+
theme(legend.position = "none")+
scale_fill_gradient_tableau(palette = "Orange Light" )
Ignore 2018, like I did for all the things that I wanted to do in 2018. (The data only runs through January 2018.)
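If the year step looks cryptic: as.Date parses the timestamp string, format pulls out the four-digit year as text, and the filter throws away rows whose launch date is a bogus 1970. A sketch on three made-up timestamps (the values are invented, not taken from the dataset):

```r
# Invented timestamps in the same layout the launched column is parsed with.
launched <- c("2015-08-11 12:12:28", "1970-01-01 01:00:00", "2017-09-02 04:43:57")

# Parse the string, then extract the year as a character value.
year <- format(as.Date(launched, "%Y-%m-%d %H:%M:%S"), "%Y")

# Keep everything except the placeholder 1970 dates.
keep <- !year %in% c("1970")
year_counts <- table(year[keep])

print(year_counts)
```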

Third Plot — Number of projects by country

ks.country <- ks %>%
filter(country!='N,0"')%>%
group_by(country)%>%
summarise(counts = n())
#plot number of projects by countries.
ggplot(aes(country, counts/sum(counts)*100), data = ks.country)+
geom_bar(stat = 'identity')+
scale_y_sqrt(breaks = seq(0,80,5))+
ylab("% of Projects")+
xlab("Country")+
ggtitle("Percentage of projects by country")
India where are you?

Let’s plot the same on a world map. Yes, you can do that.

#match countries with the abbreviation 
countries.match <- joinCountryData2Map(ks.country, joinCode = "ISO2", nameJoinColumn = "country")
#plot map
mapCountryData(countries.match, nameColumnToPlot = "counts", mapTitle = "Number of projects by country", catMethod = "logFixedWidth", colourPalette = "topo")
That color combo tho!

Enough of country, let’s see the state of all the projects.

#plot data
ggplot(aes(state), data = ks)+
geom_bar()
I’m lazy.

Success Rate

prop.table(table(ks$state))*100
Damn Undefined.
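Note that this table divides by every project, including live, cancelled and undefined ones. A rough sketch of the difference it makes to restrict the denominator to finished projects, on an invented vector of states (these are not the real proportions):

```r
# Invented states, only to show how the denominator changes the percentage.
state <- c("successful", "failed", "failed", "canceled",
           "successful", "undefined", "failed", "live")

# Share of every state across all projects.
all_share <- prop.table(table(state)) * 100

# Share among projects that actually finished as successful or failed.
finished <- state[state %in% c("successful", "failed")]
finished_share <- prop.table(table(finished)) * 100

print(round(all_share, 1))
print(round(finished_share, 1))
```

Which denominator is right depends on the question; the category and year plots in this post use the second one.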

Now let’s plot with 2 variables.

Total amount pledged by category

# filter data
pledged.cat <- ks %>%
group_by(main_category) %>%
summarise(total = sum(usd_pledged)) %>%
arrange(desc(total))
# arrange
pledged.cat$main_category <- factor(pledged.cat$main_category, levels = pledged.cat$main_category)
# plot it
ggplot(aes(main_category, total/1000000, fill=total), data = pledged.cat)+
geom_bar(stat = 'identity')+
ggtitle("Total amount pledged by category")+
xlab("Main Category")+
ylab("USD Pledged (millions)")+
geom_text(aes(label = paste0('$', round(total/1000000, 1))),vjust = -0.3, size = 3)+
theme_minimal()+
theme(plot.title=element_text(hjust=0.5, size = 10),
axis.text.x = element_text(size = 10, angle = 90), legend.position = "none")+
scale_fill_gradient(low = "#feefb6", high = "#fecc0c")
Damn PEW News hit hard.

Success rate of projects by main category

#filter data
state.cat <- ks %>%
filter(state %in% c("successful", "failed"))%>%
group_by(main_category, state)%>%
summarise(counts = n()) %>%
mutate(per = counts/sum(counts)) %>%
arrange(desc(state), per)
#factor data
state.cat$main_category <- factor(state.cat$main_category,
levels=state.cat$main_category[1:(nrow(state.cat)/2)])
#plot
ggplot(aes(main_category, per, fill = state), data = state.cat)+
geom_bar(stat ="identity")+
ggtitle("Success and Failure rate of Projects by category")+
xlab("Project Category")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent)+
scale_fill_discrete(name="Project Status", breaks=c("successful", "failed"),
labels=c("Success", "Failure")) +
geom_text(aes(label=paste0(round(per*100,1),"%")), position=position_stack(vjust=0.5),
colour="white", size=3)+
theme_classic()+
theme(plot.title=element_text(hjust=0.5), legend.position="bottom")+
coord_flip()
Tech has a high failure rate; I need to switch my path.
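The percentages in this plot are shares within each category, not shares of all projects. A base R sketch of the same idea, using a cross-table on made-up rows instead of the dplyr pipeline:

```r
# Made-up categories and outcomes, just to show the mechanics.
toy <- data.frame(
  main_category = c("Technology", "Technology", "Technology",
                    "Technology", "Dance", "Dance"),
  state = c("failed", "failed", "failed",
            "successful", "successful", "failed"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row by its own total (margin = 1).
counts <- table(toy$main_category, toy$state)
share <- prop.table(counts, margin = 1)

print(round(share * 100, 1))
```

Each row sums to 100%, which is why every bar in the stacked chart has the same length.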

Success rate of projects by year

# filter data
success.year <- ks %>%
mutate(year = format(as.Date(launched, "%Y-%m-%d %H:%M:%S"), "%Y")) %>%
filter(!year %in% c('1970')) %>%
group_by(year, state) %>%
summarise(counts = n()) %>%
mutate(per = counts/sum(counts)) %>%
filter(state %in% c("successful", "failed"))
# plot
ggplot(aes(year, per, fill = state), data = success.year)+
geom_bar(stat ="identity")+
ggtitle("Success and Failure rate of Projects by Year")+
xlab("Year")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent)+
scale_fill_discrete(name="Project Status", breaks=c("successful", "failed"),
labels=c("Success", "Failure")) +
geom_text(aes(label=paste0(round(per*100,1),"%")), position=position_stack(vjust=0.5),
colour="white", size=3)+
theme_minimal()+
theme(plot.title=element_text(hjust=0.5), legend.position="bottom")+
coord_flip()
The same color again?

Enough graphs. Let’s print some tables.

Top projects with highest set goals

kable(head(ks[order(-ks$usd_goal), c(2,3,7,11)], 20))
Flat is Earth and Potato Salad Fuck.

You have to give them credit for the confidence to set those goals.

Aside from the one project that was suspended and one that was cancelled, all other projects failed here. Their goals must have been set too high and seen as being too unreasonable for the idea they were selling.
Let’s look at the top 20 most ambitious projects that were successfully funded instead.

Top Successful projects

#filter data
proj.success <- ks[ks$state=="successful",]
#print table
kable(head(proj.success[order(-proj.success$usd_goal), c(2,3,7,11)], 20))
Pebble was crowdfunded; I didn’t know that. RIP Pebble, you were colorful, unlike my graphs.

Comparing the goals of these projects with the most ambitious ones, we can clearly see that there is a huge gap, and that Games, Design and Technology are the top categories to get funded.

That’s it, I’m not going to plot any more. If you’re interested, you can check my GitHub for the full report.

Conclusions and Findings

Many factors affect the success of projects on Kickstarter. Dance and Theater projects have a high success rate of about 60% because of the low risk involved. Games has a 43% success rate and has raised the most money on Kickstarter, upwards of USD 600 million. Design and Technology projects have low success rates because of their high risk; however, they still attract huge funding, second only to Games. Backers give more money to Games, Design and Technology projects because these are innovative and disruptive. However, the top 1000 projects under Games and the top 1000 under Design raised similar amounts of funding.

Although success rates for projects from non-US countries are comparable to those of US projects, the US still outperforms every other country: projects from the US have a higher success percentage than projects from outside it.
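The US versus non-US comparison isn’t plotted in this post, so here is a minimal sketch of one way such a comparison could be computed. The rows are invented and the `region` and `success` columns are hypothetical helpers, so the output says nothing about the real numbers.

```r
# Invented rows; not the real Kickstarter figures.
toy <- data.frame(
  country = c("US", "US", "US", "GB", "CA", "DE", "US", "GB"),
  state = c("successful", "failed", "successful", "failed",
            "successful", "failed", "successful", "failed"),
  stringsAsFactors = FALSE
)

# Hypothetical helper columns: a two-level region and a success flag.
toy$region <- ifelse(toy$country == "US", "US", "Non-US")
toy$success <- toy$state == "successful"

# The mean of a logical vector is the share of TRUE values.
rate_by_region <- tapply(toy$success, toy$region, mean) * 100

print(rate_by_region)
```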

Kickstarter started in 2009, and we can see a huge rise in the number of projects during 2014 and 2015. Since 2016 the number of projects has decreased, and the reason is that a lot of projects between 2014 and 2015 failed. Quite surprisingly, though, the success rate of projects has increased since 2016. This might be because the crowdfunding trend has cooled and only serious people get into it after looking at the failure rate.

Comparing the top successful ambitious projects with the projects that set the highest goals, we can clearly see that the projects with the highest goals either failed or were suspended or cancelled. Their goals must have been set too high and seen as too unreasonable for the idea they were selling. Comparing the amounts between these two groups, we can see there is a huge gap, and projects with unrealistic goals fail.
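The two tables only show the extremes. A rough sketch of putting a number on the gap is to compare the median goal by outcome. The figures below are invented purely to show the call; `usd_goal` matches the renamed column, while the rest is hypothetical.

```r
# Invented goals; only the mechanics matter here.
toy <- data.frame(
  state = c("failed", "successful", "failed",
            "successful", "failed", "successful"),
  usd_goal = c(100000, 2000, 5000, 10000, 50000, 4000),
  stringsAsFactors = FALSE
)

# Median goal per outcome; the median is less sensitive than the mean
# to a handful of absurdly large goals.
median_goal <- tapply(toy$usd_goal, toy$state, median)

print(median_goal)
```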

The one thing that surprised me the most was the huge success rate in categories like Dance and Theater. And even though Technology, Games and Design might not have a huge success rate, people still fund them.

For future work, I’d like to explore projects that fell short of their goal and projects that were well funded but got suspended, and to build a model that predicts the number of backers and how far above or below its goal a project will be funded.

These conclusions are not certain. There may be issues with this methodology, and we would need to address them to get a more definite result. Exchange rates change over time, and we don’t know when these amounts were converted to USD. We would need more data, more metrics and better statistical tests to confirm these findings, but the data is rich enough and the metric good enough to warrant further study.

Let me know your thoughts on this; any tips are always welcome.

Until then see you next time, Peace.
