Reproducible Research Using R Markdown
While doing research and analysis, an important aspect which normally goes unnoticed is documentation. Suffice to say however, if your research can't be reproduced - people will find it hard to give them any serious thought.
Though documentation is a boring (yet crucial) task in anyone's research pipeline, the ability to document as you code is indeed helpful. In R, one can make use of R Markdown and the knitr package to do this.
Below is a sample of my assignment (which deals with reproducible research) as part of John Hopkins Data Science Specialization course which I'm still currently going through. Had to pull an all-nighter last night since I've been procrastinating really bad as of late (which is rather evident in the kind of analysis that I did below) .
Weather Events And It’s Affect To Population
Hafidz Zulkifli
Synopsis
This report attempts to draw correlation between various weather events to consequences against the human population in the United States. Weather data used is based from NOAA Storm Database reading from 1950 to November 2011. The first goal of the report focuses upon are the affects of weather events against population health. The second goal of the study is to determine the effects of weather against the United States’ economy.
This report attempts to draw correlation between various weather events to consequences against the human population in the United States. Weather data used is based from NOAA Storm Database reading from 1950 to November 2011. The first goal of the report focuses upon are the affects of weather events against population health. The second goal of the study is to determine the effects of weather against the United States’ economy.
Data Processing
The data set for this study is taken from NOAA Storm Database. We will first load this into R.
data <- span=""> read.csv("D:/Self_Development/Coursera - JHU/5. Reproducible Research/2/repdata-data-StormData.csv.bz2" ) ->
Since we are looking at two aspects of the events, it is helpful to separate the data set into 2 halves, each to answer it’s respective question.
To prepare the data set to answer the first question, we will be using the event types, amount of reported injuries and also fatalities. This would give us an idea of which weather events are most harmful - or deadly - to the population.
Our basic strategy in processing the data would be to sum up all the fatalities and injuries per weather event, and visualize the total amount in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
weather_data <- span=""> data.frame(data$EVTYPE, data$FATALITIES, data$INJURIES)
colnames(weather_data) <- span=""> c('event', 'fatality','injury')
library(plyr)
by_event1 <- span=""> ddply(weather_data,~event,summarise,sum=sum(fatality))
fatality <- span=""> by_event1[by_event1$sum %in% sort(by_event1$sum, decreasing=TRUE)[1:10],]
by_event2 <- span=""> ddply(weather_data,~event,summarise,sum=sum(injury))
injury <- span=""> by_event2[by_event2$sum %in% sort(by_event2$sum, decreasing=TRUE)[1:10],] ->->->->->->
For the second dataset, we will be using the event types, the amount of property damage and the crop damage (which are available in dollars).
Similar to our previous work, the basic strategy in processing the data would be to sum up all the damages on property and crop. We will then visualize the result in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
economy_data <- span=""> data.frame(data$EVTYPE, data$PROPDMG, data$PROPDMGEXP, data$CROPDMG, data$CROPDMGEXP)
colnames(economy_data) <- span=""> c('event', 'propertydmg','propdmgexp','cropdmg','cropdmgexp')
summary(economy_data)->->
## event propertydmg propdmgexp
## HAIL :288661 Min. : 0.00 :465934
## TSTM WIND :219940 1st Qu.: 0.00 K :424665
## THUNDERSTORM WIND: 82563 Median : 0.00 M : 11330
## TORNADO : 60652 Mean : 12.06 0 : 216
## FLASH FLOOD : 54277 3rd Qu.: 0.50 B : 40
## FLOOD : 25326 Max. :5000.00 5 : 28
## (Other) :170878 (Other): 84
## cropdmg cropdmgexp
## Min. : 0.000 :618413
## 1st Qu.: 0.000 K :281832
## Median : 0.000 M : 1994
## Mean : 1.527 k : 21
## 3rd Qu.: 0.000 0 : 19
## Max. :990.000 B : 9
## (Other): 9
Based on the summary, it can be seen that not all the damage estimates have been given the magnitudes. As such, the resultant value without any assigned magnitude may not be relevant enough to be included in the calculation (value is too small to be significant).
Thus, any estimates which does not have any magnitude assigned other than those that have been documented by NWS (ie ‘K’, ‘M’ and ‘B’) will be omitted from this study.
#to simplify typing
ed <- span=""> economy_data
# get the property damages
edp <- span=""> data.frame(ed$event, ed$propertydmg, ed$propdmgexp)
ed_M <- span=""> edp[edp$ed.propdmgexp == 'M',]
ed_K <- span=""> edp[edp$ed.propdmgexp == 'K',]
ed_B <- span=""> edp[edp$ed.propdmgexp == 'B',]
#calculate real value
ed_M1 <- span=""> data.frame(ed_M$ed.event, ed_M$ed.propertydmg*1000000)
colnames(ed_M1) <- span=""> c('event','prop_damage')
ed_K1 <- span=""> data.frame(ed_K$ed.event, ed_K$ed.propertydmg*1000)
colnames(ed_K1) <- span=""> c('event','prop_damage')
ed_B1 <- span=""> data.frame(ed_B$ed.event, ed_B$ed.propertydmg*1000000000)
colnames(ed_B1) <- span=""> c('event','prop_damage')
#recombine
edp <- span=""> rbind(ed_K1, ed_M1, ed_B1)
library(plyr)
by_event3 <- span=""> ddply(edp,~event,summarise,sum=sum(prop_damage))
property_damage <- span=""> by_event3[by_event3$sum %in% sort(by_event3$sum, decreasing=TRUE)[1:10],]
# now let's get the crop damages
edc <- span=""> data.frame(ed$event, ed$cropdmg, ed$cropdmgexp)
edc_M <- span=""> edc[edc$ed.cropdmgexp == 'M',]
edc_K <- span=""> edc[edc$ed.cropdmgexp == 'K',]
edc_B <- span=""> edc[edc$ed.cropdmgexp == 'B',]
#calculate real value
edc_M1 <- span=""> data.frame(edc_M$ed.event, edc_M$ed.cropdmg*1000000)
colnames(edc_M1) <- span=""> c('event','crop_damage')
edc_K1 <- span=""> data.frame(edc_K$ed.event, edc_K$ed.cropdmg*1000)
colnames(edc_K1) <- span=""> c('event','crop_damage')
edc_B1 <- span=""> data.frame(edc_B$ed.event, edc_B$ed.cropdmg*1000000000)
colnames(edc_B1) <- span=""> c('event','crop_damage')
#recombine
edc <- span=""> rbind(edc_K1, edc_M1, edc_B1)
by_event4 <- span=""> ddply(edc,~event,summarise,sum=sum(crop_damage))
crop_damage <- span=""> by_event4[by_event4$sum %in% sort(by_event4$sum, decreasing=TRUE)[1:10],]->->->->->->->->->->->->->->->->->->->->->->->->->->->
Finally, we compute the cost incurred due to both property and crop damages.
#combine property and crop
prop_crop <- span=""> rbind(by_event3, by_event4)
prop_crop_damage <- span=""> prop_crop[prop_crop$sum %in% sort(prop_crop$sum, decreasing=TRUE)[1:10],]->->
The data set for this study is taken from NOAA Storm Database. We will first load this into R.
data <- span=""> read.csv("D:/Self_Development/Coursera - JHU/5. Reproducible Research/2/repdata-data-StormData.csv.bz2" ) ->
Since we are looking at two aspects of the events, it is helpful to separate the data set into 2 halves, each to answer it’s respective question.
To prepare the data set to answer the first question, we will be using the event types, amount of reported injuries and also fatalities. This would give us an idea of which weather events are most harmful - or deadly - to the population.
Our basic strategy in processing the data would be to sum up all the fatalities and injuries per weather event, and visualize the total amount in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
weather_data <- span=""> data.frame(data$EVTYPE, data$FATALITIES, data$INJURIES)
colnames(weather_data) <- span=""> c('event', 'fatality','injury')
library(plyr)
by_event1 <- span=""> ddply(weather_data,~event,summarise,sum=sum(fatality))
fatality <- span=""> by_event1[by_event1$sum %in% sort(by_event1$sum, decreasing=TRUE)[1:10],]
by_event2 <- span=""> ddply(weather_data,~event,summarise,sum=sum(injury))
injury <- span=""> by_event2[by_event2$sum %in% sort(by_event2$sum, decreasing=TRUE)[1:10],] ->->->->->->
For the second dataset, we will be using the event types, the amount of property damage and the crop damage (which are available in dollars).
Similar to our previous work, the basic strategy in processing the data would be to sum up all the damages on property and crop. We will then visualize the result in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
economy_data <- span=""> data.frame(data$EVTYPE, data$PROPDMG, data$PROPDMGEXP, data$CROPDMG, data$CROPDMGEXP)
colnames(economy_data) <- span=""> c('event', 'propertydmg','propdmgexp','cropdmg','cropdmgexp')
summary(economy_data)->->
## event propertydmg propdmgexp
## HAIL :288661 Min. : 0.00 :465934
## TSTM WIND :219940 1st Qu.: 0.00 K :424665
## THUNDERSTORM WIND: 82563 Median : 0.00 M : 11330
## TORNADO : 60652 Mean : 12.06 0 : 216
## FLASH FLOOD : 54277 3rd Qu.: 0.50 B : 40
## FLOOD : 25326 Max. :5000.00 5 : 28
## (Other) :170878 (Other): 84
## cropdmg cropdmgexp
## Min. : 0.000 :618413
## 1st Qu.: 0.000 K :281832
## Median : 0.000 M : 1994
## Mean : 1.527 k : 21
## 3rd Qu.: 0.000 0 : 19
## Max. :990.000 B : 9
## (Other): 9
Based on the summary, it can be seen that not all the damage estimates have been given the magnitudes. As such, the resultant value without any assigned magnitude may not be relevant enough to be included in the calculation (value is too small to be significant).
Thus, any estimates which does not have any magnitude assigned other than those that have been documented by NWS (ie ‘K’, ‘M’ and ‘B’) will be omitted from this study.
#to simplify typing
ed <- span=""> economy_data
# get the property damages
edp <- span=""> data.frame(ed$event, ed$propertydmg, ed$propdmgexp)
ed_M <- span=""> edp[edp$ed.propdmgexp == 'M',]
ed_K <- span=""> edp[edp$ed.propdmgexp == 'K',]
ed_B <- span=""> edp[edp$ed.propdmgexp == 'B',]
#calculate real value
ed_M1 <- span=""> data.frame(ed_M$ed.event, ed_M$ed.propertydmg*1000000)
colnames(ed_M1) <- span=""> c('event','prop_damage')
ed_K1 <- span=""> data.frame(ed_K$ed.event, ed_K$ed.propertydmg*1000)
colnames(ed_K1) <- span=""> c('event','prop_damage')
ed_B1 <- span=""> data.frame(ed_B$ed.event, ed_B$ed.propertydmg*1000000000)
colnames(ed_B1) <- span=""> c('event','prop_damage')
#recombine
edp <- span=""> rbind(ed_K1, ed_M1, ed_B1)
library(plyr)
by_event3 <- span=""> ddply(edp,~event,summarise,sum=sum(prop_damage))
property_damage <- span=""> by_event3[by_event3$sum %in% sort(by_event3$sum, decreasing=TRUE)[1:10],]
# now let's get the crop damages
edc <- span=""> data.frame(ed$event, ed$cropdmg, ed$cropdmgexp)
edc_M <- span=""> edc[edc$ed.cropdmgexp == 'M',]
edc_K <- span=""> edc[edc$ed.cropdmgexp == 'K',]
edc_B <- span=""> edc[edc$ed.cropdmgexp == 'B',]
#calculate real value
edc_M1 <- span=""> data.frame(edc_M$ed.event, edc_M$ed.cropdmg*1000000)
colnames(edc_M1) <- span=""> c('event','crop_damage')
edc_K1 <- span=""> data.frame(edc_K$ed.event, edc_K$ed.cropdmg*1000)
colnames(edc_K1) <- span=""> c('event','crop_damage')
edc_B1 <- span=""> data.frame(edc_B$ed.event, edc_B$ed.cropdmg*1000000000)
colnames(edc_B1) <- span=""> c('event','crop_damage')
#recombine
edc <- span=""> rbind(edc_K1, edc_M1, edc_B1)
by_event4 <- span=""> ddply(edc,~event,summarise,sum=sum(crop_damage))
crop_damage <- span=""> by_event4[by_event4$sum %in% sort(by_event4$sum, decreasing=TRUE)[1:10],]->->->->->->->->->->->->->->->->->->->->->->->->->->->
Finally, we compute the cost incurred due to both property and crop damages.
#combine property and crop
prop_crop <- span=""> rbind(by_event3, by_event4)
prop_crop_damage <- span=""> prop_crop[prop_crop$sum %in% sort(prop_crop$sum, decreasing=TRUE)[1:10],]->->
Result
Weather Events and Health
#plotting two panel graph
par(mai=c(3,1,1,1))
par(mfrow=c(1,2))
barplot(fatality$sum,names.arg=fatality$event,xlab=mtext("Event Types", side=3), ylab="Total Fatalities", las=2)
barplot(injury$sum,names.arg=injury$event,xlab=mtext("Event Types", side=3),ylab="Total Injuries", las=2 )
title(outer=TRUE,main="Top 10 Most Harmful Weather Events", line= -2)
Based on the figure above, we can conclude visually that tornado seems to be the most harmful weather event (both in terms of death and injuries caused) that affects the wide population.
However, do also note that there are occurences where there are typos and multiple classification for the same type of weather event. What this means is that more analysis is needed to combine similar types of events together to get the correct picture of things.
#plotting two panel graph
par(mai=c(3,1,1,1))
par(mfrow=c(1,2))
barplot(fatality$sum,names.arg=fatality$event,xlab=mtext("Event Types", side=3), ylab="Total Fatalities", las=2)
barplot(injury$sum,names.arg=injury$event,xlab=mtext("Event Types", side=3),ylab="Total Injuries", las=2 )
title(outer=TRUE,main="Top 10 Most Harmful Weather Events", line= -2)
Based on the figure above, we can conclude visually that tornado seems to be the most harmful weather event (both in terms of death and injuries caused) that affects the wide population.
However, do also note that there are occurences where there are typos and multiple classification for the same type of weather event. What this means is that more analysis is needed to combine similar types of events together to get the correct picture of things.
Weather Events and Economy
par(mai=c(3,1,1,1))
par(mfrow=c(1,3))
barplot(property_damage$sum,names.arg=property_damage$event,xlab=mtext("Property Damage vs Event Types", side=3), ylab="", las=2)
barplot(crop_damage$sum,names.arg=crop_damage$event,xlab=mtext("Crop Damage vs Event Types", side=3), ylab="", las=2)
barplot(prop_crop_damage$sum,names.arg=prop_crop_damage$event,xlab=mtext("Total Damage vs Event Types", side=3), ylab="", las=2)
title(outer=TRUE,main="Top 10 Most Costly Weather Events", line= -2)
Meanwhile on the economy, we see that flood is the major contributor of economic loss. Drilling further down into a subsegment of this, we see that drought is major contributor to economic loss when it comes to crops, but doesn’t affect the property segment at all.
par(mai=c(3,1,1,1))
par(mfrow=c(1,3))
barplot(property_damage$sum,names.arg=property_damage$event,xlab=mtext("Property Damage vs Event Types", side=3), ylab="", las=2)
barplot(crop_damage$sum,names.arg=crop_damage$event,xlab=mtext("Crop Damage vs Event Types", side=3), ylab="", las=2)
barplot(prop_crop_damage$sum,names.arg=prop_crop_damage$event,xlab=mtext("Total Damage vs Event Types", side=3), ylab="", las=2)
title(outer=TRUE,main="Top 10 Most Costly Weather Events", line= -2)
Meanwhile on the economy, we see that flood is the major contributor of economic loss. Drilling further down into a subsegment of this, we see that drought is major contributor to economic loss when it comes to crops, but doesn’t affect the property segment at all.
Comments