21  Evaluation of Unsupervised Models

21.1 Learning Objectives and Evaluation Lens

  • Objective: evaluate whether unsupervised structure is both statistically coherent and practically useful.
  • Primary metrics: silhouette, Calinski-Harabasz (CH) index, purity, adjusted Rand index (ARI).
  • Interpretation rule: internal quality and external usefulness should both be considered.
  • Common pitfalls: selecting \(k\) once without sensitivity analysis and over-interpreting unstable clusters.

In unsupervised learning we do not train against target labels, so evaluation must focus on cluster structure quality and business usefulness.

In software engineering, a common use case is grouping modules/components by their static metrics (size, complexity, coupling) to identify risk profiles for testing and quality assurance.

21.2 Example: Clustering Software Modules (UnifiedBugDataset)

We use Unified-file.csv from UnifiedBugDataset 1.2 (2019), which merges multiple defect datasets (PROMISE, Eclipse, Bug Prediction Dataset, Bugcatchers, GitHub bug datasets). Compared to using only NASA subsets (e.g., KC1/JM1), this is a broader and more up-to-date benchmark.

Even though clustering is unsupervised, the dataset includes a bug field that we use after clustering as an external validation signal.

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)

# External validation label (binary): defective if bug count > 0.
truth <- factor(ifelse(unified$bug > 0, "buggy", "clean"), levels = c("clean", "buggy"))

# Use numeric software metrics only, excluding identifiers and bug label.
drop_cols <- c("ID", "Name", "LongName", "Parent", "bug")
X <- unified[, setdiff(names(unified), drop_cols)]
X <- X[, sapply(X, is.numeric), drop = FALSE]

# Remove incomplete rows and keep labels aligned.
ok <- complete.cases(X)
X <- X[ok, , drop = FALSE]
truth <- truth[ok]

# Remove degenerate columns that break scaling/kmeans.
var_ok <- vapply(X, function(col) {
    v <- var(col, na.rm = TRUE)
    is.finite(v) && v > 0
}, logical(1))
X <- X[, var_ok, drop = FALSE]

# Silhouette requires pairwise distances (O(n^2)); cap sample size for speed.
max_n <- 2500
if (nrow(X) > max_n) {
    set.seed(1)  # fixed seed so the subsample is reproducible
    idx <- sample(seq_len(nrow(X)), size = max_n)
    X <- X[idx, , drop = FALSE]
    truth <- truth[idx]
}

# Scale metrics to avoid size-dominated clusters.
X_scaled <- scale(X)
X_scaled <- X_scaled[, colSums(is.finite(X_scaled)) == nrow(X_scaled), drop = FALSE]

dim(X_scaled)

21.3 Internal Evaluation Metrics

Internal metrics evaluate cluster compactness and separation without labels.

  • Total within-cluster sum of squares (tot.withinss): lower is better
  • Average silhouette width: higher is better
  • Calinski-Harabasz index: higher is better

calinski_harabasz <- function(x, clusters) {
    x <- as.matrix(x)
    clusters <- as.factor(clusters)
    n <- nrow(x)
    k <- nlevels(clusters)

    overall_center <- colMeans(x)

    # Within-cluster sum of squares
    wss <- 0
    # Between-cluster sum of squares
    bss <- 0

    for (cl in levels(clusters)) {
        idx <- which(clusters == cl)
        xk <- x[idx, , drop = FALSE]
        nk <- nrow(xk)
        ck <- colMeans(xk)

        wss <- wss + sum((xk - matrix(ck, nrow = nk, ncol = ncol(xk), byrow = TRUE))^2)
        bss <- bss + nk * sum((ck - overall_center)^2)
    }

    (bss / (k - 1)) / (wss / (n - k))
}

avg_silhouette <- function(x, clusters) {
    # Silhouette needs the cluster package; return NA if it is unavailable.
    if (!requireNamespace("cluster", quietly = TRUE)) return(NA_real_)
    sil <- cluster::silhouette(as.integer(as.factor(clusters)), dist(x))
    mean(sil[, "sil_width"])
}

ks <- 2:6

internal_tbl <- lapply(ks, function(k) {
    km <- kmeans(X_scaled, centers = k, nstart = 25)
    data.frame(
        k = k,
        total_withinss = km$tot.withinss,
        avg_silhouette = avg_silhouette(X_scaled, km$cluster),
        calinski_harabasz = calinski_harabasz(X_scaled, km$cluster)
    )
})

internal_tbl <- do.call(rbind, internal_tbl)
internal_tbl
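
Beyond reading the table, plotting total within-cluster sum of squares against \(k\) makes the usual elbow diagnostic visible. A minimal base-R sketch, assuming internal_tbl from above:

```r
# Elbow plot: look for the k where the WSS curve stops dropping sharply.
plot(internal_tbl$k, internal_tbl$total_withinss,
     type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster SS",
     main = "Elbow diagnostic for k-means")
```

The elbow, silhouette, and CH criteria need not agree; when they diverge, prefer the \(k\) whose clusters remain interpretable in engineering terms.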

21.4 External Evaluation (Using Defect Labels Only for Assessment)

External metrics compare clusters with known classes. This does not make the training supervised; it only checks whether clusters align with practical categories of interest.

  • Purity: proportion of modules assigned to the majority class in each cluster
  • Adjusted Rand Index (ARI): agreement corrected for chance

purity_score <- function(truth, pred) {
    tab <- table(truth, pred)
    sum(apply(tab, 2, max)) / sum(tab)
}

adjusted_rand_index <- function(truth, pred) {
    tab <- table(truth, pred)
    n <- sum(tab)

    a <- rowSums(tab)
    b <- colSums(tab)

    sum_choose <- sum(choose(tab, 2))
    expected <- (sum(choose(a, 2)) * sum(choose(b, 2))) / choose(n, 2)
    max_index <- 0.5 * (sum(choose(a, 2)) + sum(choose(b, 2)))

    denom <- max_index - expected
    if (denom == 0) return(0)

    (sum_choose - expected) / denom
}
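
Before applying these functions to real output, it is worth exercising them on a tiny hand-built labeling; this is an illustrative toy example, not dataset output:

```r
# Perfect agreement between classes and clusters: purity = 1, ARI = 1.
toy_truth <- c("clean", "clean", "buggy", "buggy", "buggy", "clean")
toy_pred  <- c(1, 1, 2, 2, 2, 1)
purity_score(toy_truth, toy_pred)
adjusted_rand_index(toy_truth, toy_pred)

# Chance-level agreement gives ARI near zero (possibly slightly negative),
# while purity stays deceptively high on imbalanced toy data.
purity_score(toy_truth, c(1, 2, 1, 2, 1, 2))
adjusted_rand_index(toy_truth, c(1, 2, 1, 2, 1, 2))
```

This contrast is the reason ARI is preferred over raw purity when cluster sizes or class frequencies are skewed.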

# Choose k by best silhouette when available, otherwise k=2 as baseline.
if (all(is.na(internal_tbl$avg_silhouette))) {
    k_best <- 2
} else {
    k_best <- internal_tbl$k[which.max(internal_tbl$avg_silhouette)]
}

km_best <- kmeans(X_scaled, centers = k_best, nstart = 25)

external_tbl <- data.frame(
    k = k_best,
    purity = purity_score(truth, km_best$cluster),
    adjusted_rand_index = adjusted_rand_index(truth, km_best$cluster)
)

external_tbl
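
Purity is only meaningful relative to the trivial baseline of labeling every module with the majority class. A quick comparison, assuming truth and external_tbl from above:

```r
# Baseline purity: a single cluster containing all modules.
baseline_purity <- max(table(truth)) / length(truth)
baseline_purity

# The clustering adds signal only if it clearly beats this baseline.
external_tbl$purity > baseline_purity
```

On defect datasets, where clean modules typically dominate, this baseline is often already high, which is another reason to report ARI alongside purity.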

21.5 SE Interpretation of the Result

In this context, useful unsupervised evaluation means:

  • Clusters are internally coherent (higher silhouette / CH, lower WSS)
  • Clusters have interpretable engineering meaning
    • for example, one cluster might group modules with high complexity and size metrics, indicating candidates for stronger testing effort
  • If external validation is available, non-trivial purity/ARI suggests that structural metric groups are related to defect proneness

This evaluation approach supports practical decisions such as risk-based test prioritization, review allocation, and focused refactoring.
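
One concrete way to check engineering interpretability is to profile each cluster by its mean metric values on the original (unscaled) scale. A minimal sketch, assuming X, km_best, and k_best from the previous sections; which columns stand out (size, complexity, coupling) depends on the dataset:

```r
# Per-cluster means of the original metrics, plus cluster sizes.
profile <- aggregate(X, by = list(cluster = km_best$cluster), FUN = mean)
profile$size <- as.vector(table(km_best$cluster))
profile
```

A cluster whose mean size and complexity metrics are several times the overall average is a natural candidate for the "high-risk" label used in test prioritization.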

21.6 Important Topics Often Missing

  • Cluster stability under resampling (same data, different seeds)
  • Sensitivity to feature scaling choices
  • Sensitivity to parameter choices (\(k\), eps, MinPts)
  • Actionable interpretation quality (can teams act on cluster descriptions?)
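
The first point can be probed directly: re-run k-means with different seeds and measure pairwise agreement between the resulting assignments with ARI. A minimal sketch, assuming X_scaled, k_best, and adjusted_rand_index from earlier; the seed values are arbitrary:

```r
# Cluster with several seeds and collect the assignments.
seeds <- c(1, 2, 3)
runs <- lapply(seeds, function(s) {
    set.seed(s)
    kmeans(X_scaled, centers = k_best, nstart = 10)$cluster
})

# Pairwise ARI between runs; values near 1 indicate stable clusters,
# values well below 1 suggest the structure should not be over-interpreted.
pairs <- combn(length(runs), 2)
stability <- apply(pairs, 2, function(p) {
    adjusted_rand_index(runs[[p[1]]], runs[[p[2]]])
})
stability
```

A fuller treatment would also resample the rows (bootstrap) rather than only varying seeds, so that stability reflects the data and not just the optimizer.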