Chapter 2 Split data into training and test datasets

jdt2 = data.frame(cbo, bugs) inTrain <- createDataPartition(y=jdt2$bugs,p=.8,list=FALSE) jdtTrain <- jdt2[inTrain,] jdtTest <- jdt2[-inTrain,]


BLR models fault-proneness are as follows $fp(X) = \frac{e^{logit()}}{1 + e^{logit(X)}}$

where the simplest form for logit is $logit(X) = c_{0} + c_{1}X$


```r
# logit regression
# glmLogit <- train (bugs ~ ., data=jdt.train, method="glm", family=binomial(link = logit))       

glmLogit <- glm (bugs ~ ., data=jdtTrain, family=binomial(link = logit))
summary(glmLogit)
## 
## Call:
## glm(formula = bugs ~ ., family = binomial(link = logit), data = jdtTrain)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.654  -0.591  -0.515  -0.471   2.150  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.20649    0.13900  -15.87   <2e-16 ***
## cbo          0.06298    0.00765    8.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 807.98  on 797  degrees of freedom
## Residual deviance: 691.80  on 796  degrees of freedom
## AIC: 695.8
## 
## Number of Fisher Scoring iterations: 5

Predict a single point:

newData = data.frame(cbo = 3)
predict(glmLogit, newData, type = "response")
##     1 
## 0.117

Draw the results, modified from: http://www.shizukalab.com/toolkits/plotting-logistic-regression-in-r

results <- predict(glmLogit, jdtTest, type = "response")

range(jdtTrain$cbo)
## [1]   0 156
range(results)
## [1] 0.0992 0.9993
plot(jdt2$cbo,jdt2$bugs)
curve(predict(glmLogit, data.frame(cbo=x), type = "response"),add=TRUE)

# points(jdtTrain$cbo,fitted(glmLogit))

Another type of graph:

library(popbio)
## 
## Attaching package: 'popbio'
## The following object is masked from 'package:caret':
## 
##     sensitivity
logi.hist.plot(jdt2$cbo,jdt2$bugs,boxp=FALSE,type="hist",col="gray")

2.1 The caret package

There are hundreds of packages to perform classification task in R, but many of those can be used throught the ‘caret’ package which helps with many of the data mining process task as described next.

The caret packagehttp://topepo.github.io/caret/ provides a unified interface for modeling and prediction with around 150 different models with tools for:

  • data splitting

  • pre-processing

  • feature selection

  • model tuning using resampling

  • variable importance estimation, etc.

Website: http://caret.r-forge.r-project.org

JSS Paper: www.jstatsoft.org/v28/i05/paper

Book: Applied Predictive Modeling