Phase 3: Data preparation

This phase covers all activities needed to construct the final dataset from the initial raw data. Data Preparation tasks are likely to be performed multiple times and not in any prescribed order.

Adapting this phase to handle context changes and model reuse involves:

  • “Select Data” should be enhanced with feature extraction, resolution change and dimensionality reduction techniques to define possible attribute sets for modelling activities.
  • Contexts and context changes relevant to the data mining goals should be selected, together with data that cover the selected contexts and changes.
  • Enhanced constructive data preparation operations have been added to derive context-specific and context-independent attributes.
  • Integration of data from multiple tables or records to create new records or values should also incorporate data from different contexts.
  • Data formatting for specific data mining techniques needs to include the context representation.
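
As an illustration of the second and third points, the following is a minimal sketch rather than a prescribed implementation. It assumes each record carries a hypothetical "context" label (here a sales region), and the field names, the sample values and the choice of "price relative to the context mean" as a context-independent attribute are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records: each row carries a "context" label (a sales region).
records = [
    {"context": "north", "price": 10.0, "units": 5},
    {"context": "north", "price": 14.0, "units": 3},
    {"context": "south", "price": 20.0, "units": 8},
    {"context": "south", "price": 30.0, "units": 2},
    {"context": "east", "price": 99.0, "units": 1},
]


def select_by_context(rows, contexts):
    """Keep only the rows whose context is relevant to the data mining goals."""
    return [row for row in rows if row["context"] in contexts]


def add_derived_attributes(rows):
    """Add one context-independent and one context-specific attribute per row."""
    prices_by_context = defaultdict(list)
    for row in rows:
        prices_by_context[row["context"]].append(row["price"])
    context_mean = {ctx: mean(p) for ctx, p in prices_by_context.items()}

    enriched = []
    for row in rows:
        new_row = dict(row)
        # Context-independent (assumed reading): price relative to the mean
        # of its own context, so values are comparable across contexts.
        new_row["relative_price"] = row["price"] / context_mean[row["context"]]
        # Context-specific: the context mean itself, which is only
        # meaningful within that context.
        new_row["context_mean_price"] = context_mean[row["context"]]
        enriched.append(new_row)
    return enriched


# Select the contexts of interest and ignore the others ("east" is dropped).
selected = select_by_context(records, {"north", "south"})
prepared = add_derived_attributes(selected)
```

Which attributes count as context-independent depends on the domain; the normalisation used here is only one possible choice.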

Tasks:

  • Task: Based upon the initial data collection conducted in the previous CRISP-DM phase, you are ready to decide on the data to be used for analysis. Note that data selection covers selection of records (rows) as well as attributes (columns) in a table.
  • Outputs:
    • Rationale for inclusion/exclusion: List the data and context to be included/excluded and the reasons for these decisions.
    • Selected contexts and changes: Select the contexts and context changes relevant to the data mining goals and ignore the others. Select data to cover the selected contexts and changes.
  • Task: Clean and resolve problems in the data chosen for the analysis. This task aims at raising the data quality to the level required by the selected analysis techniques.
  • Outputs:
    • Data Cleaning Report: Report data-cleaning efforts (missing data, data errors, coding inconsistencies and bad metadata) to track alterations to the data and to benefit future data mining projects.
  • Task: This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.
  • Outputs:
    • Derived attributes: Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. Derive context-specific and context-independent attributes.
    • Generated records: Describe the creation of completely new records. Generate new data to force context-invariance (e.g., rotated images in deep learning).
  • Task: These are methods whereby information is combined from multiple sources. There are two basic methods of integrating data: merging two data sets with similar records but different attributes or appending two or more data sets with similar attributes but different records.
  • Outputs:
    • Merged data: This includes merging tables together into a new table, aggregating data (summarising information) from multiple records and/or tables, and integrating data from relevant contexts.
  • Task: This task involves checking whether certain techniques require a particular format or order for the data. If so, syntactic modifications have to be made to the data (without changing its meaning).
  • Outputs:
    • Reformatted data: Syntactic changes made to satisfy the requirements of the specific modelling tool. Examples: changing the order of the attributes and/or records, adding identifiers, removing commas from within text fields, trimming values, etc.
    • Context representation: Select the context representation, i.e. how the contexts are going to be represented in the data (parametrisation; as-feature vs as-dataset).
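
The two integration methods and the as-feature vs as-dataset choice described in the tasks above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the per-context datasets, the "id" key, the price lookup and all function names are hypothetical and not part of the methodology itself.

```python
from collections import defaultdict

# Hypothetical per-context datasets: similar attributes, different records.
north = [{"id": 1, "units": 5}, {"id": 2, "units": 3}]
south = [{"id": 3, "units": 8}]

# Hypothetical second source: the same records (by id), a different attribute.
prices = {1: 10.0, 2: 14.0, 3: 20.0}


def append_with_context(datasets):
    """Append datasets and keep each row's context as an extra feature."""
    combined = []
    for context, rows in datasets.items():
        for row in rows:
            combined.append({**row, "context": context})
    return combined


def merge_attribute(rows, key, name, lookup):
    """Merge one new attribute into each row by matching on a key."""
    return [{**row, name: lookup.get(row[key])} for row in rows]


def split_by_context(rows):
    """As-dataset representation: one dataset per context, no context column."""
    datasets = defaultdict(list)
    for row in rows:
        rest = {k: v for k, v in row.items() if k != "context"}
        datasets[row["context"]].append(rest)
    return dict(datasets)


# As-feature: a single table in which the context is an ordinary column.
as_feature = merge_attribute(
    append_with_context({"north": north, "south": south}), "id", "price", prices
)

# As-dataset: the same data, partitioned into one table per context.
as_dataset = split_by_context(as_feature)
```

The as-feature form suits a single model that receives the context as input, while the as-dataset form suits training or adapting one model per context; which is preferable depends on the modelling technique chosen in the next phase.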

Legend of the different representations of original and new/enhanced tasks and outputs: