Examples

In this section we illustrate the utility of the CASP-DM methodology through several examples where contexts play an important role.

The task in this problem consists of automatically discarding emails that represent spam, i.e. unsolicited messages, especially advertising. This task is a clear example of a cost-sensitive problem, since the cost of classifying a legitimate email as spam is significantly higher than that of accepting a spam email. The context can be referred to as the skew or cost ratio derived from the cost matrix.

For this task, we can summarise the steps of the CASP-DM methodology in the following way:

  • Business Understanding: We want to infer precise, end-user-adaptable spam filtering rules using DM approaches.
  • Data Understanding: In this step, the data needed to build models and define contexts are collected. For this problem, a substantial amount of labelled emails should be collected.
  • Data Preparation: Emails contain a set of features that must be processed for the modelling phase. Many of these features are text and are usually transformed into fixed-size tables by methods such as bag-of-words (see the first sketch after this list).
  • Modelling: In this phase, machine learning models able to deal with a large number of features are usually employed to build binary classifiers. In this example, we need versatile models that minimize the expected cost over a wide range of contexts (cost ratio/skew), as illustrated in the second sketch after this list. To assess the quality of the models in the CASP-DM methodology we need metrics that measure performance taking operating contexts into account, such as the AUC (Area Under the ROC Curve) or the Area Under the Cost Curve.
  • Evaluation: The versatile classifier obtained is cost-sensitive and thus able to adapt to the requirements of different users (different operating contexts).
  • Deployment: As a result of the previous phases, we obtain a versatile classifier that can be adapted to different operating contexts. We may need to monitor the performance of the model in order to modify it if its performance becomes insufficient.
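The first sketch below illustrates the Data Preparation step: a minimal example, assuming the scikit-learn library, of how variable-length email texts can be turned into a fixed-size bag-of-words table. The example emails and labels are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical labelled emails (1 = spam, 0 = legitimate).
emails = [
    "Win a free prize now",
    "Meeting rescheduled to Monday",
    "Cheap meds, limited time offer",
]
labels = [1, 0, 1]

# Transform variable-length texts into a fixed-size term-frequency table.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

print(X.shape)                              # (3, number of distinct terms)
print(vectorizer.get_feature_names_out())   # the columns of the table
```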
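The second sketch illustrates the Modelling step on synthetic data, under the assumption of a probabilistic scikit-learn classifier: the AUC summarises the quality of a single versatile model over all thresholds, and for each operating context (cost ratio) only the decision threshold is re-tuned to minimize the empirical expected cost. The data, model and cost values are illustrative, not taken from the original study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the bag-of-words features and spam labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)  # 1 = spam

# One versatile probabilistic classifier, trained only once.
model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]

# Context-independent quality: AUC summarises performance over all thresholds.
print("AUC:", roc_auc_score(y, scores))

def expected_cost(threshold, c_fp, c_fn):
    """Empirical cost when emails with score >= threshold are discarded.
    c_fp: cost of discarding a legitimate email; c_fn: cost of keeping spam."""
    pred = scores >= threshold
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    return (c_fp * fp + c_fn * fn) / len(y)

# Each operating context (cost ratio) only changes the decision threshold.
thresholds = np.linspace(0, 1, 101)
for c_fp, c_fn in [(10, 1), (1, 1), (1, 10)]:
    best = min(thresholds, key=lambda t: expected_cost(t, c_fp, c_fn))
    print(f"cost ratio {c_fp}:{c_fn} -> threshold {best:.2f}")
```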
An estate agent has a database of possible customers, collects information about each customer and learns a regression model to predict the maximum mortgage that the customer can get from a bank. On an everyday basis, several new properties enter the estate agent's portfolio. Each of them has a different price. Obviously, the estate agent only offers a property to those customers who can afford it. Each property represents a genuine cut-off of customers, separating those who can afford the property from those who cannot.

In the following we summarise the different steps:

  • Business Understanding: We need to determine which customers are above or below the price of a property, in order to offer each property to the appropriate subset of customers. We do not expect to repeat the modelling work for each new property.
  • Data Understanding: We need to collect customer data that is helpful for estimating the maximum mortgage loan that each customer can obtain from a bank. The context is given by the price of each new property that enters the estate agent's portfolio.
  • Data Preparation: Some syntactic modifications are made to the data that do not change its meaning but might be required by the modelling tool. The context is represented by a single number (the property price).
  • Modelling: Regression techniques are used to build a model for predicting the maximum mortgage loan for each customer. The operating context (the price of each property) then works as a threshold used to discard the customers whose estimated value given by the model is below that price (see the sketch after this list).
  • Evaluation: In this phase we check that the distribution of contexts is similar to the distribution assumed in the previous phase.
  • Deployment: As a result of the previous phases, we obtain a single model that can be adapted to different operating contexts by selecting the customers according to the threshold.
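The sketch below, assuming scikit-learn and hypothetical customer features, mortgage amounts and property prices, illustrates how one regression model can be reused for every property: the property price only acts as a threshold on the model's estimates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical customer data: income and savings explain the maximum mortgage.
rng = np.random.default_rng(1)
income = rng.uniform(20_000, 120_000, size=500)
savings = rng.uniform(0, 200_000, size=500)
X = np.column_stack([income, savings])
max_mortgage = 4 * income + 0.5 * savings + rng.normal(0, 10_000, size=500)

# A single regression model estimating the maximum mortgage loan per customer.
model = LinearRegression().fit(X, max_mortgage)
estimates = model.predict(X)

def affordable_customers(property_price):
    """The property price (operating context) is just a cut-off on the estimates."""
    return np.where(estimates >= property_price)[0]

# Each new property is a new context; the model itself never changes.
for price in (150_000, 300_000, 450_000):
    print(price, "->", len(affordable_customers(price)), "candidate customers")
```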
In this example we work with a dataset constructed from sales information. It contains sales data for coffee and tea products sold in stores across the United States, and the task is thus to predict sales (volume) by location and product. The data has a multidimensional nature, where each fact describes the sales of products in dollars according to two dimensions: product (levels: one specific product or all products) and store (levels: store, city, state, district and region).

In the following we summarise the different steps:

  • Business Understanding: The task is to predict sales (volume) by location and product and, although a different model could be trained for each possible combination, we want a single versatile model that can be used in different multidimensional contexts.
  • Data Understanding: Multidimensional data is a rich and complex scenario where the same task can change significantly depending on the level of aggregation over some of the dimensions. The context is given and is very clear: the multidimensional operating context, or resolution, of the data.
  • Data Preparation: A multidimensional operating context or resolution is represented as a tuple with one level per dimension, which determines the level of aggregation for every dimension of the dataset.
  • Modelling: For this example we explore several approaches for multidimensional data when predictions have to be made at different levels (or contexts) of aggregation: one method trains at the same resolution as the operating context, another aggregates predictions bottom-up from the lowest level, and a third disaggregates predictions top-down.
  • Evaluation: The model is a good compromise between performance and effective cost (one model vs. thousands of them).
  • Deployment: We train a single predictive model for the lowest level in the data mart. Once a new multidimensional context appears, we apply the model to the deployment data and aggregate the predictions according to the given context, as illustrated in the sketch below. With this approach, only one model is used for every possible context.
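The following sketch illustrates this deployment strategy under the assumption of pandas and scikit-learn, with hypothetical column names and toy sales data: a single model is trained at the lowest level (store and product) and its predictions are aggregated bottom-up to whatever multidimensional context is requested.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical lowest-level facts: sales per store and product.
train = pd.DataFrame({
    "store":   [1, 1, 2, 2, 3, 3],
    "state":   ["CA", "CA", "CA", "CA", "NY", "NY"],
    "product": ["coffee", "tea", "coffee", "tea", "coffee", "tea"],
    "sales":   [120.0, 80.0, 95.0, 60.0, 150.0, 70.0],
})

# One model trained at the finest resolution (store x product).
X = pd.get_dummies(train[["store", "state", "product"]].astype(str))
model = RandomForestRegressor(random_state=0).fit(X, train["sales"])
train["pred"] = model.predict(X)

def predict_for_context(levels):
    """Aggregate lowest-level predictions bottom-up for an operating context,
    e.g. levels = ["state", "product"] or ["state"] (all products)."""
    return train.groupby(levels)["pred"].sum()

print(predict_for_context(["state", "product"]))
print(predict_for_context(["state"]))
```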