```r
library(tidymodels)
has_vip <- requireNamespace("vip", quietly = TRUE)

set.seed(10)
kc1 <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
kc1 <- kc1[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]

# Binary target: a file is Defective if it has at least one reported bug
kc1$Defective <- factor(ifelse(kc1$bug > 0, "Y", "N"))
kc1$bug <- NULL

# Stratified 75/25 split so both subsets keep the class balance
split <- initial_split(kc1, prop = 0.75, strata = Defective)
training <- training(split)
testing <- testing(split)
```

13 Feature Selection Example
This section presents a modern feature-selection workflow using tidymodels. The goal is to keep the process explicit and reproducible:
- split the data into training and test subsets
- define feature filters in a recipe
- fit a model through a workflow
- inspect variable importance
13.1 Feature filtering with recipes
recipes supports common feature-selection and cleaning steps:
- step_zv() removes zero-variance predictors
- step_nzv() removes near-zero-variance predictors
- step_corr() removes highly correlated numeric predictors
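Before applying these filters to the defect data, a minimal sketch on synthetic data shows what step_corr() actually does (the toy data frame and its x1/x2/x3 column names are invented for illustration):

```r
library(tidymodels)

# Toy data: x2 is a near-copy of x1, so the pair is highly correlated;
# x3 is independent noise and should survive the filter
set.seed(1)
toy <- data.frame(x1 = rnorm(50))
toy$x2 <- toy$x1 + rnorm(50, sd = 0.01)
toy$x3 <- rnorm(50)
toy$y  <- factor(sample(c("Y", "N"), 50, replace = TRUE))

toy_rec <- recipe(y ~ ., data = toy) |>
  step_corr(all_numeric_predictors(), threshold = 0.90) |>
  prep()

# One member of the correlated pair (x1/x2) is dropped; x3 and y remain
names(juice(toy_rec))
```

The filter keeps whichever member of the correlated pair has the lower mean absolute correlation with the other predictors, so exactly one of x1/x2 disappears.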
```r
rec <- recipe(Defective ~ ., data = training) |>
  step_zv(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.90)

rec_prep <- prep(rec)
juice(rec_prep) |> dplyr::glimpse()
```

13.2 Leakage pitfalls in feature selection
Feature selection must be fit on training data only. Common mistakes:
- selecting features on the full dataset before splitting
- ranking features using labels from future releases
- including attributes that are proxies of the target (post-release fields)
In this chapter, feature filters are defined in the recipe and learned from training only, then applied to testing through the fitted workflow.
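The contrast between leaky and leak-free filtering can be sketched on synthetic data (the df, a, b, and c names are hypothetical stand-ins for the defect attributes): the leaky variant learns the correlation filter from all rows, so held-out rows influence which predictors survive, while the leak-free variant learns the filter from the training rows only and merely applies it to the test rows.

```r
library(tidymodels)
set.seed(10)

# Synthetic stand-in for the defect data: c is nearly a copy of a
df <- data.frame(a = rnorm(200), b = rnorm(200))
df$c <- df$a + rnorm(200, sd = 0.05)
df$Defective <- factor(sample(c("Y", "N"), 200, replace = TRUE))

toy_split <- initial_split(df, prop = 0.75, strata = Defective)

# Leaky: prep() sees ALL rows, including the future test rows
leaky <- recipe(Defective ~ ., data = df) |>
  step_corr(all_numeric_predictors(), threshold = 0.90) |>
  prep()

# Leak-free: the filter is learned from the training rows only,
# then the SAME learned filter is applied to the test rows
safe <- recipe(Defective ~ ., data = training(toy_split)) |>
  step_corr(all_numeric_predictors(), threshold = 0.90) |>
  prep()
baked_test <- bake(safe, new_data = testing(toy_split))
```

With a workflow, as used in this chapter, the leak-free pattern happens automatically: fit() preps the recipe on the training data, and predict() bakes new data with the already-learned steps.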
13.3 Train a model after feature filtering
```r
rf_spec <- rand_forest(trees = 500, min_n = 5) |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec)

rf_fit <- fit(wf, data = training)
rf_fit
```

13.4 Evaluate on the test set
```r
pred_cls <- predict(rf_fit, testing, type = "class")
pred_prb <- predict(rf_fit, testing, type = "prob")
eval_tbl <- bind_cols(testing, pred_cls, pred_prb)

metrics(eval_tbl, truth = Defective, estimate = .pred_class)
conf_mat(eval_tbl, truth = Defective, estimate = .pred_class)
```

13.5 Variable importance
```r
if (has_vip) {
  rf_fit |>
    extract_fit_parsnip() |>
    vip::vip(num_features = 15)
} else {
  message("Package 'vip' is not installed; skipping variable-importance plot.")
}
```

13.6 Feature Selection Packages and Further Reading
For this chapter, the most useful package families are:
| Package | Typical Use in Feature Selection |
|---|---|
| recipes | Filtering and preprocessing (step_zv, step_nzv, step_corr) |
| Boruta | All-relevant feature selection using random forests |
| FSelectorRcpp | Information-gain and entropy-based ranking |
| vip | Variable-importance visualization for fitted models |
| ranger | Fast tree-based models with built-in importance scores |
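As a brief illustration of the Boruta entry above, the all-relevant search can be sketched as follows (a hypothetical toy example, assuming the Boruta package is installed; the x1/x2/y names are invented, with x1 carrying real signal and x2 being pure noise):

```r
if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(10)
  # Toy data: y depends on x1 (plus noise); x2 is irrelevant
  d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
  d$y <- factor(ifelse(d$x1 + rnorm(300, sd = 0.5) > 0, "Y", "N"))

  bor <- Boruta::Boruta(y ~ ., data = d, maxRuns = 30)
  # Attributes confirmed as relevant; x1 should be among them
  Boruta::getSelectedAttributes(bor)
}
```

On this chapter's data, the analogous call would be Boruta::Boruta(Defective ~ ., data = training), ranking each metric against shadow (permuted) copies of the predictors.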
To keep this chapter concise, we focus on the recipes + ranger workflow. See @sec-popular-packages for a broader package map and @sec-rpackages for package-management guidance.