Phase 4: Modelling

In this phase, various modelling techniques are selected and applied, and their parameters are calibrated to optimal values.

Adapting this phase to handle context changes and model reuse involves:

  • A new optional branch of reframing-based subtasks and deliverables has been added for selecting the modelling technique. We differentiate between classical modelling techniques and other techniques that reuse or adapt an existing model.
  • Enhanced procedures for testing the versatile model’s quality and validity have been added.
  • In case reframing is planned or performed over existing models, specific reframing activities are needed to build the versatile model.
  • A new general task “Revise Model” is included to handle model revision in incremental or lifelong learning data mining tasks.
  • A new general task “Reframe Setting” has been added to this phase in order to decide which type of reframing should be used, depending on which aspects of the model are reusable in other contexts.

Tasks:

  • Task: As the first step in modelling, select the actual modelling technique that is to be used. Although a tool may already have been selected during the “Business Understanding” phase, this task refers to the specific modelling technique, e.g., decision-tree building with C5.0, or neural network generation with back propagation. Determining the most appropriate technique will typically be based on the data types and the data mining goals (scores, patterns, clusters, versatile models, etc.).
  • Outputs:
    • Modelling technique: Document the actual modelling technique that is to be used. If context matters, select the model and reframing pair, e.g., a scorer with score-driven reframing, or linear regression with continuous output reframing (a sketch of such a pair is given below).
    • Modelling assumptions: Many modelling techniques make specific assumptions about the data, e.g., all attributes have uniform distributions, no missing values are allowed, the class attribute must be symbolic, etc. Record any such assumptions made.
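A minimal sketch of such a model/reframing pair, using scikit-learn and synthetic data (both assumptions, not from the source): a linear regression versatile model is trained in a source context, and continuous output reframing is then fitted as an affine correction on a handful of labelled deployment examples, instead of retraining.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Source context: y = 2x + 1 plus noise.
    X_src = rng.uniform(0, 10, size=(200, 1))
    y_src = 2 * X_src[:, 0] + 1 + rng.normal(0, 0.5, 200)

    model = LinearRegression().fit(X_src, y_src)  # the versatile model

    # Deployment context: same relationship, but outputs are rescaled and shifted.
    X_dep = rng.uniform(0, 10, size=(20, 1))
    y_dep = 1.5 * (2 * X_dep[:, 0] + 1) + 3 + rng.normal(0, 0.5, 20)

    # Continuous output reframing: fit y' = a * f(x) + b on the deployment sample.
    a, b = np.polyfit(model.predict(X_dep), y_dep, deg=1)

    def reframed_predict(X):
        return a * model.predict(X) + b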
  • Task: Before we actually build a model, we should consider how the model's results will be tested. We therefore need to generate a procedure or mechanism to test the model's quality and validity, describing the criteria for goodness of a model (e.g., error rate) and defining the data on which these criteria will be tested.
  • Outputs:
    • Test design: Describe the intended plan (i.e., how to divide the available dataset) for training, testing and evaluating the models.
    • Context plot and performance metrics: Decide how context changes can be evaluated (e.g., by using artificial data) and identify suitable metrics for evaluating reframing efficiency; a sketch of such a cross-context test design is given below.
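A minimal sketch of such a test design, assuming scikit-learn and artificial data generated for two contexts: each context is held out in turn, so the gap between in-context and cross-context error (the quantity reframing should reduce) can later be compared against retraining on the same splits.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)

    def make_context(slope, intercept, n=100):
        X = rng.uniform(0, 10, size=(n, 1))
        y = slope * X[:, 0] + intercept + rng.normal(0, 0.5, n)
        return X, y

    # Two artificial contexts that differ only in the intercept.
    contexts = {"A": make_context(2.0, 1.0), "B": make_context(2.0, 4.0)}

    # Train on one context, test on the other, and record the chosen metric.
    for train_name, test_name in [("A", "B"), ("B", "A")]:
        X_tr, y_tr = contexts[train_name]
        X_te, y_te = contexts[test_name]
        model = LinearRegression().fit(X_tr, y_tr)
        mse = mean_squared_error(y_te, model.predict(X_te))
        print(f"train {train_name} -> test {test_name}: MSE = {mse:.3f}")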
  • Task: Run the modelling tool on the prepared dataset to create one or more models.
  • Outputs:
    • Parameter settings: Most modelling techniques have a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings (see the sketch below).
    • Models: These are the actual models produced by the modeling tool, not a report.
    • Model description: Describe the resultant model. Report on the results of the model and any meaningful conclusions; document any difficulties or inconsistencies encountered, together with their implications.
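A minimal sketch of the Build Model outputs, using scikit-learn and an illustrative dataset (both assumptions): the parameter settings are listed with their rationale, the modelling tool is run, and a short model description is kept alongside the fitted model.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    parameter_settings = {
        "max_depth": 4,          # rationale: keep the tree readable for the report
        "min_samples_leaf": 5,   # rationale: damp overfitting on tiny leaves
        "criterion": "gini",     # rationale: library default, no reason to deviate
    }

    model = DecisionTreeClassifier(**parameter_settings).fit(X, y)

    model_description = {
        "technique": "decision tree (CART, scikit-learn)",
        "parameters": parameter_settings,
        "depth": model.get_depth(),
        "n_leaves": model.get_n_leaves(),
    }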
  • Task: Once we have built a model, and as a result of incremental or lifelong learning, the model may need to be revised (patched or extended) when some novelty or inconsistency in the new data is detected with respect to the existing model. This can be extended to context changes, provided we can determine when the context has changed significantly enough to deserve a revision process. A sketch of such a revision trigger is given below.
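A minimal sketch of such a revision trigger, assuming scikit-learn's incremental SGDClassifier and an illustrative error threshold (both assumptions, not prescribed by the source): the model is only patched when a new batch is inconsistent enough with the existing model.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(loss="log_loss", random_state=0)
    classes = np.array([0, 1])
    REVISION_THRESHOLD = 0.30  # assumed: maximum tolerated batch error rate

    def maybe_revise(model, X_batch, y_batch, first_batch=False):
        if first_batch:
            model.partial_fit(X_batch, y_batch, classes=classes)
            return True
        error = 1.0 - model.score(X_batch, y_batch)
        if error > REVISION_THRESHOLD:           # novelty/inconsistency detected
            model.partial_fit(X_batch, y_batch)  # patch the existing model
            return True
        return False                             # model left unchanged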
  • Task: For each model under consideration, we have to interpret it and make a methodical assessment according to the data mining success criteria and the desired test design. Judge and discuss the success of the application of the modelling and discovery techniques technically. Rank the models used (a ranking sketch is given after the outputs below).
  • Outputs:
    • Model assessment: Summarize the results of this task using evaluation charts, analysis nodes, cross-validation charts, etc.; list the qualities of the generated models (e.g., in terms of accuracy) and rank their quality in relation to each other. In context-aware tasks, compare with different scenarios, in particular retraining.
    • Revised parameter settings: According to the model assessment, revise the parameter settings and tune them for the next run of the Build Model task. Iterate model building and assessment until you strongly believe that you have found the best model(s). Document all such revisions and assessments.
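A minimal sketch of the assessment and ranking step, using scikit-learn and an illustrative dataset (both assumptions): each candidate model is scored by cross-validation and the models are ranked against each other for the assessment report.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    candidates = {
        "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
        "logistic regression": LogisticRegression(max_iter=1000),
    }

    # Mean cross-validated accuracy per candidate model.
    assessment = {
        name: cross_val_score(est, X, y, cv=5).mean()
        for name, est in candidates.items()
    }

    # Rank the models by quality, highest first.
    for name, acc in sorted(assessment.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}: mean CV accuracy = {acc:.3f}")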
  • Task: Decide which type of reframing technique should be used, depending on which aspects of the model are reusable in other contexts. Taking into account the particular deployment context (if known), we distinguish three different kinds of reframing (which can be combined): output, input and structural reframing. Thus, where a conventional, non-versatile model captures only the information necessary to deal with test instances from the same context, a versatile model captures additional information that, in combination with reframing, allows it to deal with test instances from a larger range of contexts.
  • Outputs:
    • Kind of reframing: Describe the kind of reframing (output, input or structural) to be applied to the versatile model (see the input-reframing sketch below).
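To contrast with the output-reframing sketch above, here is a minimal sketch of input reframing under an assumed shift-only context change: the deployment inputs are mapped back into the region the versatile model was trained on, and the model itself is left untouched.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(2)

    # Training context.
    X_src = rng.uniform(0, 10, size=(200, 1))
    y_src = 2 * X_src[:, 0] + 1 + rng.normal(0, 0.5, 200)
    model = LinearRegression().fit(X_src, y_src)

    # Deployment context: the same feature arrives with a constant offset.
    X_dep_raw = X_src[:20] + 5.0

    # Input reframing: estimate and undo the shift before calling the model.
    estimated_offset = X_dep_raw.mean() - X_src.mean()
    preds = model.predict(X_dep_raw - estimated_offset)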
