8  Data Sources in Software Engineering

We classify these data sources into the following categories:

Metadata (source: Israel Herraiz)

Types of information stored in the repositories:

9 Repositories

The repository ecosystem has changed substantially in recent years. In practice, it is useful to distinguish between large-scale open infrastructures, curated benchmark collections, and restricted datasets.

9.1 Large-scale Open Infrastructures (Current)

9.2 Curated and Reproducible Research Collections

9.3 Restricted / Controlled-Access Datasets

  • ISBSG (industry benchmark data, mainly effort/cost) http://www.isbsg.org/
  • World of Code can also fall into this category, depending on the access level requested and the intended usage mode.

9.4 Legacy Resources (Still Cited, Use with Caution)

Several repositories frequently cited in older literature are partially inactive, changed, or difficult to reproduce exactly in their original form (for example, FLOSSMole, FLOSSMetrics, some SourceForge archives, and project-specific mirrors). They remain useful for historical comparison, but new studies should prefer modern, actively maintained infrastructures.

For PROMISE/NASA-style defect datasets, note the well-documented quality issues and the limited availability of original source context.

10 Open Tools and Dashboards to Extract Data

Process to extract data:

Within the open source community, several toolkits allow us to extract data that can be used to explore projects:

MetricsGrimoire http://metricsgrimoire.github.io/

Retrieves data from source code management systems, issue trackers, and mailing lists into queryable databases. Its successor, GrimoireLab, is actively maintained under the CHAOSS project.

SonarQube http://www.sonarqube.org/

A continuous code-quality inspection platform that reports metrics, bugs, code smells, and technical debt for many languages.

CKJM (OO Metrics tool) http://gromit.iiar.pwr.wroc.pl/p_inf/ckjm/

Computes the Chidamber and Kemerer suite of object-oriented metrics (WMC, DIT, NOC, CBO, RFC, LCOM) from compiled Java class files.

10.1 Issues

A well-documented problem is that different tools report different values for the same metric on the same code (Lincke et al. 2008).
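As a concrete illustration, even something as basic as "lines of code" diverges depending on the counting rules. The snippet and the two simplified definitions below are hypothetical, chosen only to mirror the kind of disagreement Lincke et al. report:

```python
# Two simplified LOC definitions applied to the same (hypothetical) snippet.
SNIPPET = """\
# compute factorial
def fact(n):

    if n <= 1:
        return 1
    return n * fact(n - 1)
"""

def physical_loc(source: str) -> int:
    """Count every line, including blanks and comments."""
    return len(source.splitlines())

def logical_loc(source: str) -> int:
    """Count only non-blank, non-comment lines."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

# physical_loc(SNIPPET) -> 6, logical_loc(SNIPPET) -> 4
```

Two defensible definitions, two different values; real tools differ on far subtler points (logical statements, generated code, headers).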

The NASA Metrics Data Program (MDP) datasets in particular have well-documented problems:

  • (Gray et al. 2011) The misuse of the NASA metrics data program data sets for automated software defect prediction

  • (Shepperd et al. 2013) Data Quality: Some Comments on the NASA Software Defect Datasets

10.2 Critical Data Quality Threats in SE Mining

When mining software repositories, several systematic threats can corrupt datasets before any analysis begins. These are distinct from the general data quality issues above and are specific to the sociotechnical nature of software repositories.

Bot activity. A substantial fraction of commits, issues, pull requests, and comments in modern repositories originate from automated agents — dependency updaters (Dependabot, Renovate), CI bots (GitHub Actions), or project management bots. Including bot activity inflates author counts, commit frequencies, and issue-close rates in ways that distort any analysis based on those signals. Detecting bots requires heuristics or classifiers trained on names, email patterns, and behavioral regularity (Recupito et al. 2021).
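A minimal sketch of such a heuristic filter, assuming commit records carry an author name and email. The patterns and rules below are illustrative, not taken from any cited study; real pipelines add behavioural features and trained classifiers:

```python
import re

# Hypothetical name/email heuristics for bot detection.
BOT_NAME_PATTERNS = [
    r"\[bot\]$",      # GitHub convention, e.g. "dependabot[bot]"
    r"^dependabot",
    r"^renovate",
]

def looks_like_bot(author_name: str, author_email: str) -> bool:
    name, email = author_name.lower(), author_email.lower()
    if "noreply" in email and "bot" in name:
        return True
    return any(re.search(p, name) for p in BOT_NAME_PATTERNS)

commits = [
    ("dependabot[bot]", "dependabot[bot]@users.noreply.github.com"),
    ("Alice Example", "alice@example.org"),
]
human_commits = [c for c in commits if not looks_like_bot(*c)]
```

Name patterns alone miss bots with human-looking accounts, which is why classifier-based approaches also look at commit timing regularity and message templates.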

Identity and alias merging. The same developer may appear under multiple email addresses, usernames, or display names across systems. Without alias resolution, contributor statistics, author-level effort models, and social network analyses are unreliable. Tools such as git-fame, SortingHat, or custom string-similarity heuristics are commonly used for this step (Kalliamvakou et al. 2014).
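A toy sketch of alias resolution using only standard-library string similarity. The threshold and grouping signals are arbitrary illustrations; production work typically relies on tools such as SortingHat:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Crude name similarity; the 0.8 threshold is arbitrary."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def merge_identities(identities):
    """Group (name, email) records via union-find on two weak signals:
    identical email local parts, or highly similar display names."""
    parent = list(range(len(identities)))

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    for i, j in combinations(range(len(identities)), 2):
        (name_i, email_i), (name_j, email_j) = identities[i], identities[j]
        if email_i.split("@")[0] == email_j.split("@")[0] or similar(name_i, name_j):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(identities)):
        groups.setdefault(find(i), []).append(identities[i])
    return list(groups.values())

records = [
    ("Jane Q. Developer", "jane@corp.example"),
    ("jane developer", "jane@mail.example"),
    ("Bob", "bob@corp.example"),
]
# merge_identities(records) groups the two "Jane" records together.
```

Note the failure modes: common names over-merge, and developers who deliberately separate identities under-merge; both should be reported as threats to validity.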

Tangled commits. A single commit may fix a bug, refactor existing code, and add a new feature simultaneously. SZZ-based defect datasets assume that bug-fixing commits cleanly identify the bug-introducing change — a fragile assumption when commits are tangled. Untangling methods exist but are rarely applied in published datasets.

The SZZ algorithm and its known flaws. Most defect prediction datasets label bug-introducing changes by tracing bug-fixing commits back to the lines they modified, a process codified in the SZZ algorithm. SZZ is known to be noisy: it misattributes whitespace changes, comment edits, and moved code as bug introductions. Multiple variants (B-SZZ, AG-SZZ, MA-SZZ, RA-SZZ) have been proposed to reduce these errors; Borg et al. provide a systematic comparison (Borg et al. 2019).
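The core idea can be sketched on a toy in-memory model. The blame table and commit names below are invented; real SZZ runs `git blame` at the parent of the fix commit, and the later variants add filtering of cosmetic changes:

```python
# blame_at_fix_parent[path][line_no] -> commit that last touched the line,
# as seen just before the bug-fixing commit was applied.
blame_at_fix_parent = {
    "core.py": {10: "c1", 11: "c3", 12: "c2"},
}

# Lines the bug-fixing commit deleted or modified (the presumed buggy lines).
fix_touched_lines = {"core.py": [10, 11]}

def szz_candidates(blame, touched):
    """Naive SZZ: blame each line the fix touched, collect those commits."""
    introducers = set()
    for path, lines in touched.items():
        for line_no in lines:
            commit = blame.get(path, {}).get(line_no)
            if commit is not None:
                introducers.add(commit)
    return introducers

# szz_candidates(...) -> {"c1", "c3"}; if "c3" merely reformatted line 11,
# naive SZZ still (wrongly) flags it as bug-introducing.
```

The comment on the last lines is exactly the misattribution problem the SZZ variants try to reduce.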

Near-duplicate and forked repositories. GitHub hosts millions of repository forks. Mining without fork-filtering floods a dataset with near-identical observations, exaggerating the apparent sample size and introducing strong correlation between observations. World of Code and GHTorrent both document the extent of this problem.
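A first-pass filter can be sketched against GitHub-style repository metadata. The `fork` and `full_name` field names follow the GitHub REST API's repository object; the content fingerprint for catching unflagged near-duplicates is a deliberately simplistic assumption:

```python
import hashlib

repos = [
    {"full_name": "upstream/project", "fork": False},
    {"full_name": "someone/project", "fork": True},   # explicit fork
    {"full_name": "other/tool", "fork": False},
]

# Step 1: drop repositories the platform itself marks as forks.
non_forks = [r for r in repos if not r["fork"]]

# Step 2: catch unflagged near-duplicates with an order-insensitive
# fingerprint over per-file content hashes (toy approach; real studies
# compare file-level hash sets or use deduplicated infrastructures).
def fingerprint(file_hashes):
    return hashlib.sha256("".join(sorted(file_hashes)).encode()).hexdigest()
```

Step 1 misses repositories copied by re-upload rather than the fork button, which is why a content-based step 2 (or a pre-deduplicated source such as World of Code) is still needed.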

Survivorship bias. Analyses on “active” or “popular” repositories systematically exclude projects that died, stalled, or were kept private. Kalliamvakou et al. identified this as one of the central perils of mining GitHub; findings from surviving projects may not generalise to the full population of software development (Kalliamvakou et al. 2014).

10.3 What Is Often Missing in SE Data Sources

When selecting a dataset, it is common to focus on size and number of attributes, but several critical aspects are often under-reported:

  • Provenance: exact extraction query/script, extraction date, and tool version.
  • Versioning: whether the data corresponds to one release, multiple releases, or moving snapshots.
  • Unit of analysis: file-level, class-level, module-level, commit-level, or issue-level.
  • Ground truth definition: how labels were assigned (for example, how a file is marked as defective).
  • Missing data policy: whether missing values were removed, imputed, or left as-is.
  • Data quality checks: duplicate records, inconsistent identifiers, impossible values.
  • Licensing and legal constraints: redistribution rights and terms of use.
  • Privacy/security: anonymization of developer identities and removal of sensitive fields.

These elements are essential to make replication and fair comparison possible.
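Some of the checks in the list above (duplicate records, impossible values) are cheap to automate. A minimal sketch, with hypothetical column names for a file-level defect dataset:

```python
rows = [
    {"file": "a.py", "loc": 120, "bugs": 2},
    {"file": "a.py", "loc": 120, "bugs": 2},  # exact duplicate
    {"file": "b.py", "loc": -5, "bugs": 0},   # impossible value
]

def quality_report(rows):
    """Flag exact duplicate rows and physically impossible metric values."""
    seen, duplicates, impossible = set(), 0, 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
        if row["loc"] < 0 or row["bugs"] < 0:
            impossible += 1
    return {"duplicates": duplicates, "impossible_values": impossible}

# quality_report(rows) -> {"duplicates": 1, "impossible_values": 1}
```

Running and reporting such checks before modelling is part of the provenance record, not an optional extra.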

10.4 Minimum Dataset Card for Reproducibility

For each dataset used in this course/project, document at least:

  1. Source repository and URL
  2. Extraction period and timezone
  3. Granularity and keys (e.g., file, commit, issue)
  4. Label definition and class distribution
  5. Number of records before/after cleaning
  6. Features removed and rationale
  7. Known threats to validity
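The checklist above can also be kept machine-readable so that incomplete cards fail loudly. The field names and example values below are invented for illustration:

```python
# Hypothetical machine-readable dataset card mirroring the checklist.
REQUIRED_FIELDS = [
    "source_url", "extraction_period", "granularity",
    "label_definition", "records_before_cleaning",
    "records_after_cleaning", "features_removed", "known_threats",
]

card = {
    "source_url": "https://example.org/defect-data",
    "extraction_period": "2023-01-01 to 2023-06-30 (UTC)",
    "granularity": "file-level, keyed by (release, path)",
    "label_definition": "defective if linked to >= 1 post-release bug fix",
    "records_before_cleaning": 12450,
    "records_after_cleaning": 11980,
    "features_removed": ["author_email (privacy)"],
    "known_threats": ["SZZ labelling noise", "single-ecosystem sample"],
}

missing = [f for f in REQUIRED_FIELDS if f not in card]
assert not missing, f"incomplete dataset card: {missing}"
```

Storing the card next to the data (and versioning both) makes the checklist enforceable rather than aspirational.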

This small checklist dramatically improves transparency and repeatability.