Context-Aware Standard Process for Data Mining

CASP-DM (Context-Aware Standard Process for Data Mining) is an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) that addresses specific challenges of machine learning and data mining, focusing in particular on the handling of context and model reuse.

A major assumption in many machine learning and data mining algorithms is that the training and deployment data share the same context, that is, the same feature space, data distribution and misclassification costs. In many real-world applications, however, this assumption does not hold. There may be several different training contexts, and there may also be many potential deployment contexts that differ from the training context(s) in one or more ways.
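As an illustration only (this sketch is not part of CASP-DM), the following Python snippet shows one simple kind of context change: a change in misclassification costs between training and deployment. It assumes a model that outputs a probability for the positive class, and it assumes that correct decisions cost nothing. The function names and the example scores are hypothetical. Under those assumptions, the cost-minimising decision threshold is cost_fp / (cost_fp + cost_fn), so the same model can be reused in a new cost context by moving the threshold rather than retraining.

```python
from typing import List


def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Cost-minimising threshold on P(positive).

    Assumes correct decisions cost nothing. Predicting positive is cheaper
    in expectation when (1 - p) * cost_fp <= p * cost_fn, which rearranges
    to p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp < 0 or cost_fn < 0 or cost_fp + cost_fn == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_fp / (cost_fp + cost_fn)


def classify(probabilities: List[float], cost_fp: float, cost_fn: float) -> List[bool]:
    """Reuse the same probability estimates under a given cost context."""
    threshold = optimal_threshold(cost_fp, cost_fn)
    return [p >= threshold for p in probabilities]


# Hypothetical scores produced by a single, already trained model.
scores = [0.10, 0.35, 0.60, 0.90]

# Training context: both error types cost the same (threshold 0.5).
training_context = classify(scores, cost_fp=1.0, cost_fn=1.0)

# Deployment context: a false negative is four times as costly (threshold 0.2).
deployment_context = classify(scores, cost_fp=1.0, cost_fn=4.0)

print(training_context)
print(deployment_context)
```

In this sketch the second example changes from negative to positive once false negatives become more expensive, although the underlying model is untouched. Other context changes, such as a shift in the data distribution or in the feature space, generally need different adaptation techniques.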

Anticipating potential changes in context is a critically important part of data mining projects. Context changes can lead to substantial additional costs or require running a new project from scratch.

Many recent machine learning approaches, in areas such as data shift and transfer learning among others, have addressed the need to cope with context changes and to reuse learnt knowledge. Context anticipation requires dedicated activities in all phases of the data mining process. However, these activities are not included in any of the existing standard process methodologies (KDD, CRISP-DM, SEMMA).


As an extension of CRISP-DM, CASP-DM follows the original CRISP-DM reference model, proposing new tasks and outputs as well as enhancements to the original reference model. These additions allow practitioners to be aware of, and to anticipate, the main types of context and context change, including changes in costs, data distribution and others. With this in mind, practitioners can choose the appropriate modelling techniques and visualization tools for the construction, selection, adaptation and understanding of versatile, context-aware models.

CRISP-DM is the most complete data mining methodology in terms of meeting the needs of industrial projects, and it has become the most widely used process for data mining projects. Although CRISP-DM does not seem to be maintained or adapted to the new challenges in data mining, its six phases and their subphases are still a good guide for the knowledge discovery process. In fact, interest in CRISP-DM continues to be high compared to other models.

CASP-DM has been evolving as a new standard with the goal of integrating context-awareness and context changes in the knowledge discovery process.