2  What is Data Mining

We will deal with extracting information from data for purposes such as estimation, defect prediction and planning.

We will provide an overview of data analysis using different techniques.

A definition we would like to highlight is that of Knowledge Discovery in Databases (KDD): the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al. 1996).

KDD Process

The Cross-Industry Standard Process for Data Mining (CRISP-DM) also provides a common, well-established framework for delivering data mining projects. It identifies six steps (Shearer 2000):

  1. Business (Problem) Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

CRISP-DM (Wikipedia)

2.1 The Aim of Data Analysis and Statistical Learning

  • The aim of any data analysis is to understand the data,
  • to build models for making predictions and estimating future events based on past data,
  • and to make statistical inferences from our data.
  • We may want to test different hypotheses on the data.
  • We want to draw conclusions about the population from which our sample data comes.
  • Most probably, we are interested in building a model for quality, time, defect or effort prediction.

  • We want to find a function \(f\) that, given \(X_1, X_2, \ldots, X_n\), computes \(Y = f(X_1, X_2, \ldots, X_n)\)
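
As a minimal sketch of what finding \(f\) can look like in practice, the Python snippet below fits a straight line \(Y = a + bX\) to past data by ordinary least squares and then uses it to estimate a new case. Everything in it is an assumption made for illustration: there is a single predictor (project size), the data values are made up and noise-free, and the names `fit_line` and `make_predictor` are hypothetical helpers. A linear form is only one of many possible choices for \(f\).

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def make_predictor(a, b):
    """Return the fitted function f(x) = a + b*x."""
    return lambda x: a + b * x


# Hypothetical past projects (made-up, noise-free values):
# size in KLOC and effort in person-months
size = [10, 20, 30, 40, 50]
effort = [25, 45, 65, 85, 105]

a, b = fit_line(size, effort)
f = make_predictor(a, b)

# Estimate the effort of a new, unseen project of 60 KLOC
estimate = f(60)
print(f"f(x) = {a:.1f} + {b:.1f} * x, estimate for 60 KLOC: {estimate:.1f}")
```

With real project data the points will not lie exactly on a line, so the fitted \(f\) should be evaluated on data that was not used for fitting before it is trusted for prediction.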

2.2 Data Science

Data science (DS) is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.

We may say that the term DS brings together data analysis activities that previously fell under different disciplines.

Wikipedia Data Science

2.3 Further Information

Generic books about statistics:

2.4 Data Mining and Data Science with R

2.5 Data Mining with Weka

Weka is another popular framework, written in Java, that can be used from and extended with other languages and frameworks. The authors of Weka have also written a popular book:

  • Ian Witten, Eibe Frank, Mark Hall, Christopher J. Pal, Data Mining: Practical Machine Learning Tools and Techniques (4th ed.), Morgan Kaufmann, 2016, ISBN: 978-0128042915