Chapter 8 Feature Selection Example
Feature Selection in R and Caret
library(caret)
library(doParallel) # parallel processing
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(dplyr) # Used by caret
library(pROC) # plot the ROC curve
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(foreign)
### Use the KC1 defect prediction data set (read from an ARFF file)
# Load the data and construct indices to divide it into training and test data sets.
#set.seed(10)
kc1 <- read.arff("./datasets/defectPred/D1/KC1.arff")
inTrain <- createDataPartition(y = kc1$Defective,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
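The function createDataPartition performs a stratified partition: it samples within each level of the outcome, so the class proportions are approximately preserved in both the training and the test set.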
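To make the effect of stratification concrete, the following minimal sketch applies createDataPartition to a small synthetic data frame. The data frame `toy`, its column `x`, and the 20/80 class balance are assumptions made up for illustration; they are not the KC1 data.

```r
library(caret)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical imbalanced outcome: 200 "Y" and 800 "N"
toy <- data.frame(
  x = rnorm(1000),
  Defective = factor(rep(c("Y", "N"), times = c(200, 800)))
)

# Same call pattern as in the chapter: 75% of the rows go to training
idx <- createDataPartition(y = toy$Defective, p = 0.75, list = FALSE)
toyTrain <- toy[idx, ]
toyTest  <- toy[-idx, ]

# Class proportions in the full data and in each subset
prop.table(table(toy$Defective))
prop.table(table(toyTrain$Defective))
prop.table(table(toyTest$Defective))
```

The three tables should show nearly the same share of "Y" cases, which a purely random split does not guarantee for a rare class.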
training <- kc1[inTrain,]
nrow(training)
## [1] 1573
testing <- kc1[-inTrain, ]
nrow(testing)
## [1] 523
The train function can be used to

+ evaluate, using resampling, the effect of model tuning parameters on performance
+ choose the “optimal” model across these parameters
+ estimate model performance from a training set
fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)
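With number = 10 and repeats = 10, each candidate model is evaluated on 10 x 10 resamples. The sketch below uses caret's createMultiFolds only to illustrate what such a resampling scheme looks like. The outcome vector `y` is a made-up example, and the sketch is not meant to reproduce the exact folds that train would draw.

```r
library(caret)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical two-class outcome with 200 observations
y <- factor(rep(c("Y", "N"), times = c(40, 160)))

# Each list element holds the row indices used for *training* in one resample;
# the rows left out form the corresponding hold-out fold
folds <- createMultiFolds(y, k = 10, times = 10)

length(folds)            # number of resamples (k * times)
summary(lengths(folds))  # training-set size per resample
```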
gbmFit1 <- train(Defective ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 ## This last option is actually one
                 ## for gbm() that passes through
                 verbose = FALSE)
gbmFit1
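The three roles of train can be seen on a much lighter model. The following sketch is an illustration under stated assumptions: it swaps in k-nearest neighbours on the built-in iris data and 5-fold cross-validation instead of gbm on KC1, purely to keep the run short. The names `ctrl` and `knnFit` are introduced here for the example.

```r
library(caret)

set.seed(1)  # arbitrary seed for reproducibility

# Plain 5-fold CV keeps the sketch fast (the chapter uses repeated 10-fold CV)
ctrl <- trainControl(method = "cv", number = 5)

knnFit <- train(Species ~ ., data = iris,
                method = "knn",
                trControl = ctrl,
                tuneLength = 3)  # evaluate three candidate values of k

knnFit$results   # resampled performance for each candidate value
knnFit$bestTune  # the candidate selected as "optimal"
```

The `results` data frame corresponds to the first and third bullet (performance per tuning value, estimated by resampling), and `bestTune` to the second.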
plsFit <- train(Defective ~ .,
                data = training,
                method = "pls",
                ## Center and scale the predictors for the training
                ## set and all future samples.
                preProc = c("center", "scale")
)
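The comment on preProc says that centering and scaling apply to the training set and to all future samples. A minimal base R sketch of that idea follows: the means and standard deviations are estimated on the training predictors only and then reused unchanged for new data. The matrices `trainX` and `testX` and their column names are hypothetical, and the code illustrates the principle rather than caret's internal implementation.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical numeric predictors (not the KC1 data)
trainX <- matrix(rnorm(200, mean = 50, sd = 10), ncol = 2,
                 dimnames = list(NULL, c("loc", "complexity")))
testX  <- matrix(rnorm(40, mean = 50, sd = 10), ncol = 2,
                 dimnames = list(NULL, c("loc", "complexity")))

# Statistics are estimated on the training set only
mu    <- colMeans(trainX)
sigma <- apply(trainX, 2, sd)

# Training predictors: mean 0 and standard deviation 1 by construction
trainScaled <- scale(trainX, center = mu, scale = sigma)

# New samples are transformed with the *training* statistics
testScaled <- scale(testX, center = mu, scale = sigma)

colMeans(testScaled)  # close to, but generally not exactly, zero
```

Reusing the training statistics keeps information from the test set out of the model, which is why the test columns are only approximately standardized.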
# To fix
# testPred <- predict(plsFit, testing)
# postResample(testPred, testing$Defective)
# sensitivity(testPred, testing$Defective)
# confusionMatrix(testPred, testing$Defective)
When there are three or more classes, confusionMatrix shows the full confusion matrix together with a set of “one-versus-all” statistics, one row per class.
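As a small illustration of the multi-class case, the sketch below calls confusionMatrix on two hand-written factors. The class labels and the vectors `obs` and `pred` are invented for the example and have nothing to do with the KC1 outcome, which has two classes.

```r
library(caret)

# Hypothetical three-class reference labels and predictions
lv   <- c("low", "medium", "high")
obs  <- factor(c("low", "low", "low",
                 "medium", "medium", "medium",
                 "high", "high", "high"), levels = lv)
pred <- factor(c("low", "low", "medium",
                 "medium", "medium", "high",
                 "high", "high", "low"), levels = lv)

cm <- confusionMatrix(data = pred, reference = obs)

cm$table    # the 3 x 3 confusion matrix
cm$byClass  # one row per class: that class versus all the others
cm$overall["Accuracy"]
```

Each row of `byClass` treats one class as the positive class and merges the remaining classes into a single negative class.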