11  Preprocessing

Following the data mining process, we describe what is meant by preprocessing, classical supervised models, unsupervised models, and evaluation in the context of software engineering, with examples.

This task is probably the hardest part of the data mining process and where most of the effort is spent. It is quite typical to transform the data by, for example, finding inconsistencies, normalising, imputing missing values, transforming input data, merging variables, etc.

Typically, preprocessing consists of the following tasks (subprocesses):

11.1 Data consistency

Consistent data are semantically correct based on real-world knowledge of the problem, i.e., no constraints are violated and the data can be used for inducing models and analysis. For example, LoC or effort is constrained to non-negative values. We can also require multiple attributes to be consistent with one another, and even across datasets (e.g., the same metrics collected by different tools).
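As a minimal illustration (with a made-up data frame, not one of the chapter's datasets), such constraints can be checked directly in R:

```r
# Hypothetical metrics table used only to illustrate consistency checks
df <- data.frame(LOC = c(120, 45, -3), CLOC = c(10, 50, 0))

# Single-attribute constraint: LOC must be non-negative
which(df$LOC < 0)

# Cross-attribute constraint: comment lines cannot exceed total lines
which(df$CLOC > df$LOC)
```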

11.2 Missing values

Missing values have a negative effect when analysing the data or learning models: the results can be biased when compared with models induced from the complete data, they can be harder to analyse, and records with missing values may need to be discarded depending on the algorithm. This can be an important problem with small datasets such as the effort estimation ones.

Missing data is typically classified into:

  • MCAR (Missing Completely at Random) and MAR (Missing at Random), where the missingness is not related to the missing values themselves and we can assume that the missing values follow the attribute’s distribution.

  • MNAR (Missing Not at Random), where there is a pattern behind the missing values; it may be advisable to check the data-gathering process to try to understand why such information is missing.

Imputation consists of replacing missing values with estimates. Many algorithms cannot handle missing values directly; therefore, imputation methods are needed. We can use simple approaches such as replacing missing values with the mean or mode of the attribute. More elaborate approaches include:

  • EM (Expectation-Maximisation)
  • Distance-based
    • \(k\)-NN (\(k\)-Nearest Neighbours)
    • Clustering

In R, a missing value is represented with NA and the analyst must decide what to do with missing data. The simplest approach is to leave out instances with missing data (ignore missing, IM). This functionality is supported by many base functions through the na.rm option.
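A minimal base-R illustration of both behaviours:

```r
x <- c(10, NA, 30)
mean(x)                # NA: most base functions propagate missing values
mean(x, na.rm = TRUE)  # 20: NA values are removed before computing

# Ignore-missing at the instance level: keep only complete rows
df <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
df[complete.cases(df), ]
```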

The mice package implements MICE (Multivariate Imputation by Chained Equations), which assumes that data are missing at random. Other packages include Amelia, missForest, Hmisc and mi.
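A hedged sketch of imputing a small, made-up data frame with mice (the call is skipped if the package is not installed; pmm, predictive mean matching, is mice's default method for numeric columns):

```r
df <- data.frame(a = c(1, 2, NA, 4, 5, 6), b = c(2, 4, 6, 8, NA, 12))

if (requireNamespace("mice", quietly = TRUE)) {
  imp <- mice::mice(df, m = 5, maxit = 5, printFlag = FALSE, seed = 1)
  completed <- mice::complete(imp, 1)  # first of the m imputed datasets
  anyNA(completed)                     # should now be FALSE
} else {
  message("Package 'mice' not installed; skipping imputation example.")
}
```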

11.3 Noise

Imperfections in real-world data can negatively influence induced machine learning models. Approaches to deal with noisy data include:

  • Robust learners capable of handling noisy data (e.g., C4.5 through its pruning strategies)

  • Data polishing methods, which aim to correct noisy instances prior to training

  • Noise filters, which identify and eliminate noisy instances from the training data

Types of noisy data:

  • Class noise (aka label noise):
    • Contradictory cases: all attributes have the same values except the class
    • Misclassifications: the class attribute is not labelled with the true label (ground truth)

  • Attribute noise: values of attributes that are noisy, missing or unknown
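Contradictory cases can be spotted by looking for rows that share all attribute values but not the class; a toy data frame for illustration:

```r
df <- data.frame(a = c(1, 1, 2), b = c(5, 5, 7), cls = c("Y", "N", "Y"))

# Row 2 repeats the attributes of row 1 but with a different class label:
# duplicated on the attributes, yet not duplicated as a whole row
conflict <- duplicated(df[, c("a", "b")]) & !duplicated(df)
which(conflict)
```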

11.4 Outliers

There is a large amount of literature related to outlier detection, and furthermore several definitions of outlier exist.

library(dbscan)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
kc1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
kc1$Defective <- factor(ifelse(kc1$bug > 0, "Y", "N"))
kc1$bug <- NULL

The LOF (Local Outlier Factor) algorithm, given a dataset, produces a vector with a local outlier factor for each case; it is implemented as lof() in the dbscan package.

kc1num <- kc1[, sapply(kc1, is.numeric)]
outlier.scores <- dbscan::lof(kc1num, minPts=5)
plot(density(na.omit(outlier.scores)))
outliers <- order(outlier.scores, decreasing = TRUE)[1:5]
print(outliers)

Another simple method for positive observations is that of Hidiroglou and Berthelot.
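A simplified sketch of the idea: the Hidiroglou-Berthelot transform measures multiplicative distance from the median (the full method also adds a size-weighting step, omitted here; the quartile-based threshold below is illustrative):

```r
# HB transform: multiplicative distance from the median, symmetric in sign
hb_scores <- function(x) {
  stopifnot(all(x > 0))  # the method assumes strictly positive observations
  m <- median(x)
  ifelse(x >= m, x / m - 1, 1 - m / x)
}

x <- c(10, 12, 11, 9, 13, 10, 200)   # 200 is a clear outlier
s <- hb_scores(x)
# Flag scores far outside the interquartile range of the transformed values
which(s < quantile(s, 0.25) - 3 * IQR(s) |
      s > quantile(s, 0.75) + 3 * IQR(s))   # flags index 7 (the value 200)
```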

11.5 Feature selection

Feature Selection (FS) aims at identifying the most relevant attributes from a dataset. It is important in different ways:

  • A reduced volume of data allows different data mining or searching techniques to be applied.

  • Irrelevant and redundant attributes can generate less accurate and more complex models. Furthermore, data mining algorithms can be executed faster.

  • It avoids the collection of data for those irrelevant and redundant attributes in the future.

The problem of FS has received a thorough treatment in pattern recognition and machine learning. Most FS algorithms tackle the task as a search problem, where each state in the search specifies a distinct subset of the possible attributes (Blum and Langley 1997). The search procedure is combined with a criterion to evaluate the merit of each candidate subset of attributes. There are many possible combinations of search procedures and attribute measures (Liu and Yu 2005).

There are two major approaches in FS from the method’s output point of view:

  • Feature subset selection (FSS), in which the output is a subset of the original attributes.

  • Feature ranking, in which attributes are ranked as a list ordered according to evaluation measures (a subset of features is often selected from the top of the ranking list).

FSS algorithms designed with different evaluation criteria broadly fall into two categories:

  • The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any data mining algorithm.

  • The wrapper model requires one predetermined mining algorithm and uses its performance as the evaluation criterion. It searches for features better suited to the mining algorithm, aiming to improve mining performance, but it also tends to be more computationally expensive than the filter model (Langley 1994).

Feature subset algorithms search through candidate feature subsets guided by a certain evaluation measure (Liu and Motoda 1998) which captures the goodness of each subset. An optimal (or near-optimal) subset is selected when the search stops.

Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure (Dash et al. 2000), the correlation measure (Hall 1999) and the estimated accuracy of a learning algorithm (Kohavi and John 1997).

  • Consistency measure: attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as two instances having the same feature values but different class labels.

  • Correlation measure evaluates the goodness of feature subsets based on the hypothesis that good feature subsets contain features highly correlated to the class, yet uncorrelated to each other.

  • Wrapper-based attribute selection uses the target learning algorithm to estimate the worth of attribute subsets. The feature subset selection algorithm conducts a search for a good subset using the induction algorithm itself as part of the evaluation function.

Langley (1994) notes that feature selection algorithms that search through the space of feature subsets must address four main issues: (i) the starting point of the search, (ii) the organization of the search, (iii) the evaluation of feature subsets and (iv) the criterion used to terminate the search. Different algorithms address these issues differently.

It is impractical to look at all possible feature subsets, even with a small number of attributes. Feature selection algorithms usually proceed greedily and can be classified into those that add features to an initially empty set (forward selection) and those that remove features from an initially complete set (backward elimination). Hybrid approaches both add and remove features as the algorithm progresses. Forward selection is much faster than backward elimination and therefore scales better to large datasets. A wide range of search strategies can be used: best-first, branch-and-bound, simulated annealing, genetic algorithms (see Kohavi and John (1997) for a review).
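To make the forward-selection idea concrete, here is a toy greedy sketch using the R² of a linear model as a naive subset-evaluation measure; real wrappers would use cross-validated performance of the target learner instead:

```r
forward_select <- function(X, y, k = 2) {
  chosen <- character(0)
  remaining <- names(X)
  for (step in seq_len(k)) {
    # Score each candidate by the fit of the subset (chosen + candidate)
    scores <- sapply(remaining, function(f) {
      fit <- lm(y ~ ., data = X[, c(chosen, f), drop = FALSE])
      summary(fit)$r.squared
    })
    best <- remaining[which.max(scores)]
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, best)
  }
  chosen
}

set.seed(1)
X <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
y <- 2 * X$x1 + X$x2 + rnorm(50, sd = 0.1)
forward_select(X, y, k = 2)   # x1 first (strongest signal), then x2
```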

11.5.1 FSelector package in R

The FSelector package in R implements several classic feature-selection algorithms inspired by Weka. For new projects, prefer actively maintained alternatives such as FSelectorRcpp, Boruta, or recipe-based selectors in tidymodels.

library(FSelector)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
cm1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
cm1$Defective <- factor(ifelse(cm1$bug > 0, "Y", "N"))
cm1$bug <- NULL

cm1RFWeights <- random.forest.importance(Defective ~ ., cm1)
cutoff.biggest.diff(cm1RFWeights)

Using the Gain Ratio measure for ranking:

cm1GRWeights <- gain.ratio(Defective ~ ., cm1)
cm1GRWeights
cutoff.biggest.diff(cm1GRWeights)

# After assigning weights, we can select the statistically significant ones
cm1X2Weights <- chi.squared(Defective ~ ., cm1)
cutoff.biggest.diff(cm1X2Weights)

Using CFS (Correlation-based Feature Selection) attribute selection:

library(FSelector)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
cm1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
cm1$Defective <- factor(ifelse(cm1$bug > 0, "Y", "N"))
cm1$bug <- NULL

result <- cfs(Defective ~ ., cm1)
f <- as.simple.formula(result, "Defective")
f

Other packages for feature selection in R include FSelectorRcpp, which re-implements FSelector methods without Weka dependencies.

Another popular package is Boruta, which performs feature selection based on Random Forest.
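A sketch with synthetic data (skipped when Boruta is not installed); here x1 and x2 carry the signal and noise does not, so Boruta should confirm the first two:

```r
if (requireNamespace("Boruta", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
  toy$cls <- factor(ifelse(toy$x1 + toy$x2 > 0, "Y", "N"))
  b <- Boruta::Boruta(cls ~ ., data = toy, maxRuns = 30)
  Boruta::getSelectedAttributes(b)  # expected: the informative attributes
} else {
  message("Package 'Boruta' not installed; skipping example.")
}
```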

11.6 Instance selection

Instance selection is the removal of samples (complementary to the removal of attributes) in order to scale down the dataset prior to learning a model, so that there is (almost) no performance loss.

There are two types of processes:

  • Prototype Selection (PS) (Garcia et al. 2012), when the subset is used with a distance-based method (e.g., kNN)

  • Training Set Selection (TSS) (Cano et al. 2007) in which an actual model is learned.

It is also a search problem as with feature selection. Garcia et al. (2012) provide a comprehensive overview of the topic.
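As a naive baseline (not a real PS/TSS algorithm), stratified random subsampling halves a toy dataset while keeping the class proportions intact:

```r
set.seed(42)
df <- data.frame(x = rnorm(100), cls = rep(c("Y", "N"), times = c(20, 80)))

# Sample half of the row indices within each class
idx <- unlist(lapply(split(seq_len(nrow(df)), df$cls),
                     function(i) sample(i, length(i) %/% 2)))
table(df$cls[idx])   # 40 N and 10 Y: the 1:4 class ratio is preserved
```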

11.7 Discretization

This process transforms continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data.
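In base R, cut() performs this transformation; a small sketch with made-up LOC values:

```r
loc <- c(5, 12, 30, 45, 80, 120, 250)

# Equal-width binning into three intervals
cut(loc, breaks = 3, labels = c("low", "medium", "high"))

# Equal-frequency binning using quantiles as cut points
cut(loc, breaks = quantile(loc, probs = seq(0, 1, length.out = 4)),
    include.lowest = TRUE, labels = c("low", "medium", "high"))
```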

11.8 Correlation Coefficient and Covariance for Numeric Data

Two random variables \(x\) and \(y\) are called independent if the probability distribution of one variable is not affected by the other. For categorical data, independence can be checked with the chi-squared statistic, where \(O_k\) and \(E_k\) are the observed and expected frequencies and \(d\) the degrees of freedom:

\(\tilde{\chi}^2=\frac{1}{d}\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\)

cor(kc1$CLOC, kc1$LOC, use = "complete.obs")
cor(kc1$McCC, kc1$LLOC, use = "complete.obs")
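Covariance, also mentioned in the section title, is the scale-dependent counterpart of correlation; a quick check of the relationship cor(x, y) = cov(x, y) / (sd(x) sd(y)) on made-up vectors:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

cov(x, y)   # 2: depends on the units of x and y
cor(x, y)   # scale-free, always in [-1, 1]
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))
```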

11.9 Normalization

11.9.1 Min-Max Normalization

\(z_i=\frac{x_i-\min(x)}{\max(x)-\min(x)}\)
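The formula translates directly to base R (assuming max(x) > min(x) so the denominator is non-zero):

```r
# Min-max normalization: rescale to the [0, 1] interval
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(10, 20, 30, 40))   # 0, 1/3, 2/3, 1
```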

library(recipes)

rec_norm <- recipe(Defective ~ ., data = kc1) |>
  step_range(all_numeric_predictors(), min = 0, max = 1)

prep(rec_norm) |> juice()

11.9.2 Z-score normalization

Z-score normalization rescales each value by subtracting the mean and dividing by the standard deviation. It is useful when models assume centered predictors with comparable scales.

kc1_num <- dplyr::select(kc1, where(is.numeric))
kc1_scaled <- scale(kc1_num)
summary(kc1_scaled[, 1:min(5, ncol(kc1_scaled))])

11.10 Transformations

11.10.1 Linear and Quadratic Transformations

Transformations can reduce skewness, stabilize variance, or expose nonlinear patterns for models that rely on linear assumptions.

loc_linear <- kc1$LOC
loc_quadratic <- kc1$LOC^2
summary(data.frame(loc_linear, loc_quadratic))

11.10.2 Box-Cox transformation

Box-Cox is often used for strictly positive variables. It can improve normality-like behavior and model fit for some regression techniques.

loc_pos <- kc1$LOC + 1   # shift: Box-Cox requires strictly positive values
bc <- MASS::boxcox(loc_pos ~ 1, lambda = seq(-2, 2, by = 0.25), plotit = FALSE)
bc$x[which.max(bc$y)]    # lambda with the highest profile log-likelihood

11.10.3 Nominal to Binary Transformations

Categorical predictors are commonly converted into binary indicators before training some models.

if ("Defective" %in% names(kc1)) {
  dummy_tbl <- model.matrix(~ Defective - 1, data = kc1)
  head(dummy_tbl)
}

11.11 Preprocessing in R

11.11.1 The dplyr package

The dplyr package was created by Hadley Wickham; some of its functions are similar to SQL syntax. Key functions in dplyr include:

  • select: select columns from a dataframe
  • filter: select rows from a dataframe
  • summarize: allows us to do summary stats based upon the grouped variable
  • group_by: group by a factor variable
  • arrange: order the dataset
  • joins: as in SQL (e.g., left_join)

Tutorial: https://github.com/justmarkham/dplyr-tutorial

Examples

library(dplyr)

Describe the dataframe:

str(kc1)

as_tibble() creates a local tibble for better printing and modern dplyr workflows.

kc1_tbl <- tibble::as_tibble(kc1)

Filter:

# Filter rows: use comma or & to represent AND condition
filter(kc1_tbl, Defective == "Y" & CLOC != 0)

Another useful operator inside filter is %in%.

Select:

select(kc1_tbl, contains("LOC"), Defective)


Filter and Select together:

# nesting method
filter(select(kc1_tbl, contains("LOC"), Defective), Defective == "Y")

It is easier using the chaining method:

# chaining method
kc1_tbl %>%
    select(contains("LOC"), Defective) %>%
  filter(Defective == "Y")

Arrange ascending:

kc1_tbl %>%
  select(LOC, Defective) %>%
  arrange(LOC)

Arrange descending:

kc1_tbl %>%
  select(LOC, Defective) %>%
  arrange(desc(LOC))

Mutate:

kc1_tbl %>%
    filter(Defective == "Y") %>%
  select(CLOC, LLOC, LOC, Defective) %>%
  mutate(loc_density = LOC / pmax(CLOC, 1))

summarise: reduce variables to values

# Create a table grouped by Defective, and then summarise each group by taking the mean of loc
kc1_tbl %>%
    group_by(Defective) %>%
  summarise(avg_loc = mean(LOC, na.rm=TRUE))
# Group by Defective and summarise several statistics of McCC and LOC
kc1_tbl %>%
    group_by(Defective) %>%
    summarise(
      mccc_mean = mean(McCC, na.rm = TRUE),
      mccc_min = min(McCC, na.rm = TRUE),
      mccc_max = max(McCC, na.rm = TRUE),
      loc_mean = mean(LOC, na.rm = TRUE),
      loc_min = min(LOC, na.rm = TRUE),
      loc_max = max(LOC, na.rm = TRUE)
    )

It seems that the numbers of Defective and Non-Defective modules differ considerably. We can count them with:

# n() or tally
kc1_tbl %>%
    group_by(Defective) %>%
    tally()

It seems that it’s an imbalanced dataset…

# randomly sample a fixed number of rows, without replacement
kc1_tbl %>% sample_n(2)

# randomly sample a fraction of rows, with replacement
kc1_tbl %>% sample_frac(0.05, replace=TRUE)

# Better formatting adapted to the screen width
glimpse(kc1_tbl)

11.12 Missing but critical in preprocessing: leakage-safe workflows

A common mistake is to normalize, impute, or select features on the full dataset before splitting into train/test. This leaks information from test to train and usually inflates performance.

Recommended order:

  1. Split data first
  2. Fit preprocessing only on training data
  3. Apply the learned preprocessing to validation/test data
  4. Train and evaluate model
library(tidymodels)

set.seed(10)
split <- initial_split(kc1_tbl, prop = 0.75, strata = Defective)
train_tbl <- training(split)
test_tbl <- testing(split)

rec_safe <- recipe(Defective ~ ., data = train_tbl) |>
  step_zv(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

prep_rec <- prep(rec_safe)
train_processed <- bake(prep_rec, new_data = train_tbl)
test_processed <- bake(prep_rec, new_data = test_tbl)

dim(train_processed)
dim(test_processed)

11.12.1 Group-aware and time-aware splitting

Random splits can still leak information when related rows belong to the same project, release, or time period. Prefer:

  • group-aware splits: keep all rows of a project/release in one side
  • time-aware splits: train on older observations, test on newer observations
# group-aware split idea (project_id must exist)
# rsample::group_initial_split(data_tbl, group = project_id)

# time-aware split idea (timestamp must exist and be sorted)
# data_tbl <- data_tbl |> dplyr::arrange(timestamp)
# cutoff <- floor(0.8 * nrow(data_tbl))
# train_tbl <- data_tbl[1:cutoff, ]
# test_tbl  <- data_tbl[(cutoff + 1):nrow(data_tbl), ]

11.12.2 Concrete leakage examples to avoid

Typical leakage patterns in SE datasets:

  1. Using post-release defect counts as predictors of the same release.
  2. Computing normalization/imputation on full data before split.
  3. Duplicated modules across train and test due to merges or snapshots.
  4. Features derived from future information (for example, final churn).

11.13 Class imbalance handling

Defect datasets are often imbalanced, so accuracy alone can be misleading.

Useful practices:

  • report precision, recall, F1, and PR-AUC (not only accuracy)
  • use stratified sampling in train/test splits
  • try rebalancing methods on training data only
if (requireNamespace("themis", quietly = TRUE)) {
  rec_bal <- recipe(Defective ~ ., data = train_tbl) |>
    step_zv(all_predictors()) |>
    themis::step_smote(Defective)

  rec_bal |> prep() |> juice() |>
    dplyr::count(Defective)
} else {
  message("Package 'themis' not installed; skipping SMOTE example.")
}

11.14 Preprocessing checklist

Before modeling, verify:

  1. No leakage from test folds into preprocessing
  2. Missing-value strategy documented and reproducible
  3. Outlier policy documented (keep/cap/remove)
  4. Categorical encoding strategy defined
  5. Class imbalance strategy defined (if classification)
  6. Temporal split considered when timestamps are available
  7. Pipeline saved so the same transformations can be reused in deployment

11.15 End-to-End Workflow (Preprocess to Deploy)

The following flow summarizes a leakage-safe and reproducible workflow for software engineering prediction tasks.

flowchart LR
  A[Raw SE Data<br/>commits issues metrics tests] --> B[Data Quality Checks<br/>missing duplicates impossible values]
  B --> C[Train/Test Split<br/>stratified or temporal]
  C --> D[Fit Preprocessing on Train Only<br/>impute encode normalize select]
  D --> E[Transform Train]
  D --> F[Transform Test]
  E --> G[Train Model]
  G --> H[Cross-Validation on Train]
  H --> I[Final Evaluation on Test<br/>precision recall F1 PR-AUC]
  I --> J[Persist Pipeline + Model]
  J --> K[Deployment]
  K --> L[Monitoring<br/>drift performance alerts]

Key principle: any operation that learns parameters from data (imputation, normalization, feature selection, balancing) must be fit using training data only.

11.16 Other libraries and tricks

The lubridate package contains a number of functions facilitating the conversion of text to POSIX dates, which is useful, for example, with time series. As an example, consider the following code.

See also the introduction to data cleaning with R: https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

library(lubridate)
dates <- c("15/02/2013", "15 Feb 13", "It happened on 15 02 '13")
dmy(dates)