Chapter 2 Split data into training and test datasets
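```r
library(caret)  # provides createDataPartition()

jdt2 <- data.frame(cbo, bugs)
# Row indices for a training set of about 80% of the data
inTrain <- createDataPartition(y = jdt2$bugs, p = .8, list = FALSE)
jdtTrain <- jdt2[inTrain, ]
jdtTest <- jdt2[-inTrain, ]
```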
BLR models fault-proneness as follows: $fp(X) = \frac{e^{logit(X)}}{1 + e^{logit(X)}}$
where the simplest form for logit is $logit(X) = c_{0} + c_{1}X$
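As a minimal sketch of the two formulas above, the function below computes fp(X) from an intercept and a slope. The helper name `fault_proneness` and the coefficient values are illustrative assumptions, not estimates from the data. Base R's `plogis()` computes the same inverse-logit transformation and serves as a cross-check.

```r
# Fault-proneness under the simplest logit form: logit(X) = c0 + c1 * X
# (fault_proneness is a hypothetical helper name used only for illustration)
fault_proneness <- function(x, c0, c1) {
  lg <- c0 + c1 * x            # linear predictor logit(X)
  exp(lg) / (1 + exp(lg))      # fp(X) = e^logit(X) / (1 + e^logit(X))
}

# Illustrative coefficients (assumed values, not fitted ones)
c0 <- -2
c1 <- 0.05
x  <- c(0, 40, 100)

fp <- fault_proneness(x, c0, c1)
fp

# At x = 40 the logit is 0, so the probability is 0.5
# The hand-written formula agrees with the built-in inverse logit
all.equal(fp, plogis(c0 + c1 * x))
```

A positive slope `c1` makes the predicted probability increase monotonically with X, and the curve crosses 0.5 where the logit equals zero.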
```r
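# Logistic (logit) regression of bugs on all other columns (here only cbo)
# Alternative through caret's train(), left commented out:
# glmLogit <- train(bugs ~ ., data = jdtTrain, method = "glm", family = binomial(link = logit))
glmLogit <- glm(bugs ~ ., data = jdtTrain, family = binomial(link = logit))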
summary(glmLogit)
##
## Call:
## glm(formula = bugs ~ ., family = binomial(link = logit), data = jdtTrain)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -3.654  -0.591  -0.515  -0.471   2.150
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.20649    0.13900  -15.87   <2e-16 ***
## cbo          0.06298    0.00765    8.23   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 807.98 on 797 degrees of freedom
## Residual deviance: 691.80 on 796 degrees of freedom
## AIC: 695.8
##
## Number of Fisher Scoring iterations: 5
```
Predict a single point:
```r
newData <- data.frame(cbo = 3)
predict(glmLogit, newData, type = "response")
##     1
## 0.117
```
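The prediction for cbo = 3 can be reproduced by hand from the fitted coefficients. The sketch below hard-codes the rounded estimates printed in the model summary, so it should agree with `predict()` only to about three decimals. It uses base R's `plogis()`, the inverse logit. The variable names are illustrative.

```r
# Rounded estimates copied from the summary(glmLogit) output
c0 <- -2.20649   # (Intercept)
c1 <-  0.06298   # cbo

# Probability of a bug for a class with cbo = 3
lg <- c0 + c1 * 3          # logit(X) = c0 + c1 * X
p_manual <- plogis(lg)     # same as exp(lg) / (1 + exp(lg))
round(p_manual, 3)

# Odds ratio for one additional unit of cbo
odds_ratio <- exp(c1)
round(odds_ratio, 3)
```

With these estimates the manual probability rounds to 0.117, matching the `predict()` output, and `exp(c1)` is about 1.065: each additional unit of cbo multiplies the odds of a bug by roughly 1.065.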
Draw the results, modified from: http://www.shizukalab.com/toolkits/plotting-logistic-regression-in-r
```r
results <- predict(glmLogit, jdtTest, type = "response")
range(jdtTrain$cbo)
## [1]   0 156
range(results)
## [1] 0.0992 0.9993
# Scatter plot of the observed data with the fitted probability curve on top
plot(jdt2$cbo, jdt2$bugs)
curve(predict(glmLogit, data.frame(cbo = x), type = "response"), add = TRUE)
# points(jdtTrain$cbo, fitted(glmLogit))
```
Another type of graph:
```r
library(popbio)
##
## Attaching package: 'popbio'
## The following object is masked from 'package:caret':
##
##     sensitivity
logi.hist.plot(jdt2$cbo, jdt2$bugs, boxp = FALSE, type = "hist", col = "gray")
```
2.1 The caret package
There are hundreds of packages for classification tasks in R, but many of them can be used through the ‘caret’ package, which helps with many of the tasks in the data mining process, as described next.
The caret package (http://topepo.github.io/caret/) provides a unified interface for modeling and prediction with around 150 different models, with tools for:
- data splitting
- pre-processing
- feature selection
- model tuning using resampling
- variable importance estimation, etc.
Website: http://caret.r-forge.r-project.org
JSS Paper: www.jstatsoft.org/v28/i05/paper
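To illustrate how some of the tools listed above fit together, here is a minimal sketch that splits data, sets up resampling, and fits a logistic regression through `train()`. It assumes the caret package is installed. The data are synthetic stand-ins generated on the spot, not the chapter's dataset, and names such as `demo`, `demoTrain` and `ctrl` are hypothetical.

```r
library(caret)

set.seed(42)  # make the synthetic data and the split reproducible

# Synthetic stand-in data (an assumption, not the chapter's real data):
# cbo is a count metric, bugs is a two-level factor so that caret treats
# the problem as classification.
n   <- 500
cbo <- rpois(n, lambda = 10)
p   <- plogis(-2.2 + 0.06 * cbo)
demo <- data.frame(
  cbo  = cbo,
  bugs = factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))
)

# Data splitting: about 80% of the rows for training, sampled within each class
inTrain   <- createDataPartition(y = demo$bugs, p = 0.8, list = FALSE)
demoTrain <- demo[inTrain, ]
demoTest  <- demo[-inTrain, ]

# Resampling scheme: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Model fitting through the unified train() interface
fit <- train(bugs ~ cbo, data = demoTrain, method = "glm",
             family = binomial, trControl = ctrl)

# Predicted classes on the held-out rows and their accuracy
pred <- predict(fit, newdata = demoTest)
accuracy <- mean(pred == demoTest$bugs)
accuracy
```

Swapping `method = "glm"` for another model name supported by caret keeps the rest of the workflow unchanged, which is the main appeal of the unified interface.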