- Data mining/ML for Soft Eng
- Soft Eng/Project Management
- Software Analytics
- Information Sources
- Software Engineering Repositories
- Tools
- And vice-versa? (Soft Eng for DM/ML)
- DevOps vs MLOps
- CRISP-DM
- Conferences and journals
July 29th, 2022 @ COMPDES 2022, UNAH, Honduras
quantifiable approach to the development, operation, and maintenance of software; that is, the application of engineering to software.
In God we trust, all others must bring data (W. Edwards Deming)
You can’t control what you can’t measure (Lord Kelvin)
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services (Zhang et al., MALETS’11)
(Source: MS Research)
As with KDD (Knowledge Discovery in Databases), the idea is also to find useful information that was previously unknown!
- Version Control Systems or Software Configuration Management, e.g., GitHub, Bitbucket
- Issue/Bug tracking systems: Bugzilla, Jira
- IDEs: Software Testing, AI assisted development
- Mailing lists (even question & answer websites)
- Google Play and AppStore, e.g., reviews, stars (quality), updates…
- Simulation can be used to generate data! e.g., System Dynamics, "What if?" questions

(Source: I. Herraiz)
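
As a minimal sketch of how a version control system becomes a data source, the snippet below sums added and deleted lines per file (code churn) from the text printed by `git log --numstat --format=@%H`. The function name `churn_per_file` and the sample log are illustrative assumptions, not part of any tool mentioned in these slides.

```python
from collections import Counter


def churn_per_file(numstat_log: str) -> Counter:
    """Sum added + deleted lines per path from `git log --numstat` text."""
    churn = Counter()
    for line in numstat_log.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit header or blank line, not a numstat row
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # git reports binary files as "-"
        churn[path] += int(added) + int(deleted)
    return churn


# Hypothetical log with two commits (hashes and paths are made up).
SAMPLE_LOG = (
    "@a1b2c3\n"
    "\n"
    "10\t2\tsrc/core.py\n"
    "3\t0\tREADME.md\n"
    "@d4e5f6\n"
    "\n"
    "5\t5\tsrc/core.py\n"
    "-\t-\tdocs/logo.png\n"
)

if __name__ == "__main__":
    for path, lines in churn_per_file(SAMPLE_LOG).most_common():
        print(path, lines)
```

On a real repository the same function could be fed the captured output of the `git log` command; churn per file is one of the simplest inputs for later analysis.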
(Source: Bitergia)
(Source: Husson Univ)
- Requirements
- Analysis and design
- Coding: All types of metrics can be collected from code to predict artifact defects, allocate testing effort, detect code clones, etc.
- AI Assisted Development: helping with APIs, suggesting code, etc.
- Genetic Improvement: how to plug in modules or third-party components automatically (e.g., video codecs)?
- Testing
- Maintenance
- Phase out
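
To illustrate collecting metrics from code, here is a rough sketch that uses Python's standard `ast` module to compute, for each function, its length in lines and an approximate cyclomatic complexity (one plus the number of branching constructs). The choice of branching nodes and the name `function_metrics` are assumptions made for this example; dedicated metric tools use more careful definitions.

```python
import ast

# Constructs counted as decision points (a simplification).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)


def function_metrics(source: str) -> dict:
    """Map function name -> (lines of code, approximate cyclomatic complexity)."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            loc = node.end_lineno - node.lineno + 1
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            metrics[node.name] = (loc, branches + 1)
    return metrics


# Made-up code to analyse.
SAMPLE = '''
def risky(x):
    if x > 0 and x < 10:
        for i in range(x):
            print(i)
    return x

def simple():
    return 1
'''

if __name__ == "__main__":
    for name, (loc, complexity) in function_metrics(SAMPLE).items():
        print(f"{name}: loc={loc}, complexity={complexity}")
```

Vectors like these, one per function or file, are the typical inputs to a defect prediction model.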
Project Management
(Source: Wikipedia)
project managers
with:
| Learning type | Technique | Software engineering application |
|---|---|---|
| Supervised learning | Classification | Defect prediction |
| | (Semi-supervised) | App reviews, text mining of docs |
| | Regression | Effort and cost estimation |
| Non-supervised learning | Clustering | Clustering of users, projects, etc. |
| | Recommender Systems | How to use an API, function calls that tend to be together |
| | Time series | Evolution of projects (clustering/classification of time series) |
| | Text/Web mining | Reviews, bug reports, requirements from textual descriptions, code comments, documentation, function and variable names |
| | Social Net Analysis | Mailing lists, GitHub |
| | Process Mining | Logs collected automatically in Web servers: how are actual processes run, are they followed correctly, can processes be optimised? |
| | Sequence patterns | Run-time traces |
| | Graph Mining | Dynamic call graphs |
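
As a small illustration of the regression row (effort estimation), the sketch below fits a power law, effort = a * size^b, by ordinary least squares on log-transformed values. The data are synthetic and noise-free, generated from a = 3.0 and b = 1.1 purely for demonstration; the function names are assumptions of this example.

```python
import math


def fit_power_law(sizes, efforts):
    """Least-squares fit of effort = a * size**b on a log-log scale."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope in log-log space
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back to the original scale
    return a, b


def predict_effort(size, a, b):
    """Estimated effort for a project of the given size."""
    return a * size ** b


# Synthetic projects: size in KLOC, effort in person-months (not real data).
SIZES = [10, 25, 50, 100, 200]
EFFORTS = [3.0 * s ** 1.1 for s in SIZES]

if __name__ == "__main__":
    a, b = fit_power_law(SIZES, EFFORTS)
    print(f"a={a:.3f}, b={b:.3f}")
    print(f"estimate for 75 KLOC: {predict_effort(75, a, b):.1f} person-months")
```

With real project data the fit would be noisy, and the model should be validated on projects that were not used for fitting.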
Also, all datasets suffer from many of the ML issues.
Pre-processing is a major step having to deal with:
Open Source provides tons of data to be analysed. The most popular (Big Data!) source is GitHub
Diomidis Spinellis maintains a list of resources for mining software engineering research: https://github.com/dspinellis/awesome-msr
For project management, there is the ISBSG (International Software Benchmarking Standards Group) database but it is not open: http://isbsg.org/
Small data, but difficult to collect though!
Open repositories such as Zenodo: https://zenodo.org/
Ultimate Debian Database: https://udd.debian.org/
(Linus Torvalds, source: http://codequoter.myshopify.com/)
Create an empty database
$ mysql -u root -p -e 'create database wekaDB;'
Run CVSAnaly2 to populate DB tables
$ cvsanaly2 --db-user=root --db-password=***** --db-database=wekaDB
Parsing log for /tmp/weka (svn)
Executing extensions
Data are ready to be used!
$ mysql -u root -p -e 'select * from wekaDB.actions limit 5;'
Enter password:
+----+------+---------+-----------+-----------+
| id | type | file_id | commit_id | branch_id |
+----+------+---------+-----------+-----------+
|  1 | A    |       2 |         1 |         1 |
|  2 | V    |       3 |         2 |         1 |
|  3 | V    |       4 |         3 |         1 |
|  4 | V    |       5 |         4 |         1 |
|  5 | M    |       5 |         5 |         1 |
+----+------+---------+-----------+-----------+
$
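
Once the tables are populated, simple aggregations already give useful features. The sketch below counts actions per type with SQL. To stay self-contained it uses an in-memory SQLite database as a stand-in for the MySQL `wekaDB`, loaded with the five rows shown in the query output above; the function name `actions_by_type` is an assumption of this example.

```python
import sqlite3

# The five rows from the `actions` table shown above.
ROWS = [
    (1, "A", 2, 1, 1),
    (2, "V", 3, 2, 1),
    (3, "V", 4, 3, 1),
    (4, "V", 5, 4, 1),
    (5, "M", 5, 5, 1),
]


def actions_by_type(rows):
    """Return (type, count) pairs, most frequent action type first."""
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute(
        "CREATE TABLE actions ("
        "id INTEGER PRIMARY KEY, type TEXT, "
        "file_id INTEGER, commit_id INTEGER, branch_id INTEGER)"
    )
    conn.executemany("INSERT INTO actions VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT type, COUNT(*) FROM actions "
        "GROUP BY type ORDER BY COUNT(*) DESC, type"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for action_type, count in actions_by_type(ROWS):
        print(action_type, count)
```

The same `SELECT ... GROUP BY` statement could be run against the real database with a MySQL client library.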
And visualised:
vg-github.py --user root --passwd ***** --dir /tmp/temp --removedb --ghuser USER --ghpasswd PW --vgdir /home/drg/git/vizGrimoire --isuser MetricsGrimoire
DevOps is a set of practices that combines software development (Dev) and IT operations (Ops).
… to MLOps
Machine Learning Operations (MLOps) defines language-, framework-, platform-, and infrastructure-agnostic practices to design, develop, and maintain machine learning applications. https://ml-ops.org/
MLOps streamlines the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and IT. (Databricks)
Machine Learning is part of many software systems and it needs to be managed
MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry
MLflow components:
Predictive Model Markup Language (PMML) is a language used to represent predictive analytic models. It allows for predictive solutions to be easily shared between PMML compliant applications (Source: Wikipedia).
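
To make the idea of a shareable model concrete, the sketch below embeds a small hand-written PMML document (a linear regression with one predictor) and scores it using only Python's standard XML parser. The model, its field names (`kloc`, `effort`) and its coefficients are hypothetical, and the scorer handles only this simple case (for instance, it ignores the optional `exponent` attribute); a real consumer would rely on a PMML-compliant library.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: effort = 2.0 + 3.5 * kloc
PMML_DOC = """<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="hypothetical effort model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="kloc" optype="continuous" dataType="double"/>
    <DataField name="effort" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="kloc"/>
      <MiningField name="effort" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="2.0">
      <NumericPredictor name="kloc" coefficient="3.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_4"}


def score(pmml_text: str, inputs: dict) -> float:
    """Evaluate the first RegressionTable: intercept + sum(coefficient * input)."""
    root = ET.fromstring(pmml_text)
    table = root.find(".//p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        total += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return total


if __name__ == "__main__":
    print(score(PMML_DOC, {"kloc": 10}))
```

Because the model is plain XML, any application that understands the same elements can reproduce the prediction without access to the training code.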
ASUM-DM (Analytics Solutions Unified Method for Data Mining/Predictive Analytics) is a refined and extended CRISP-DM (Source: IBM)
There is a large number of related conferences (in fact, nowadays most of them seem to be ML conferences!):
Including the most important SE journals: