Phase 2: Data Understanding

The second phase starts with an initial data collection and proceeds with activities that help you become familiar with the data.

Adapting this phase to handle context changes and model reuse involves:

  • Enhancing the initial data collection task so that the collected data can represent the different relevant contexts.
  • Contributing to or refining the data description, the quality reports and the information about context representation, and feeding the results into the transformation and other data preparation steps.

Tasks

  • Task: Acquire the data (or access to the data) listed in the project resources. This initial collection includes data integration if the data is acquired from multiple sources. Describe the attributes (promising, irrelevant, ...), quantity and quality of the data. Collect sufficiently rich raw data to represent possibly different relevant contexts.
  • Outputs:
    • Initial Data Collection Report: Describe the data collected, including the attributes (promising, irrelevant, ...), quantity and quality of the data, and identify the relevant contexts.
  • Task: Describe the properties of the acquired data and report on the results. This includes the amount of data (consider sampling), value types, records, fields, coding schemes, etc.
  • Outputs:
    • Data Description Report: Write a description report to share the findings about the data.
  • Task: Address the data mining and context-aware goals by applying querying, visualization, and reporting techniques to the data, and determine how the results may contribute to or refine the initial (business or DM) goals and the data transformation and preparation steps. Among others, this analysis includes the distribution of key attributes, a search for errors in the data, relationships between pairs or small numbers of attributes, results of simple aggregations, properties of significant sub-populations, and simple statistical analyses.
  • Outputs:
    • Data Exploration Report: Describe the results of this task, possibly using graphs and plots, including first findings, initial hypotheses, explorations of contexts, particular subsets of relevant data and attributes, and their impact on the remainder of the project.
  • Task: Examine the quality of the data: coding or data errors, missing values, bad metadata, measurement errors and other types of inconsistencies that make analysis difficult.
  • Outputs:
    • Data Quality Report: List and describe the results of the data quality verification (Is the data correct? Does it contain errors? Are there missing values, and how common are they?) and list possible solutions.
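
The description, exploration and quality tasks above can be illustrated with a minimal sketch. It is not part of the methodology itself: the toy records, the "store" attribute used as the context, the set of missing-value markers and all function names are assumptions made only for illustration, and a real project would normally rely on a data analysis library and its own coding schemes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; "store" plays the role of the context attribute.
RECORDS = [
    {"store": "north", "units": "12", "price": "3.5"},
    {"store": "north", "units": "", "price": "3.5"},
    {"store": "south", "units": "7", "price": "4.0"},
    {"store": "south", "units": "9", "price": "n/a"},
    {"store": "south", "units": "-1", "price": "4.0"},
]

# Assumed markers for missing values; adjust to the coding scheme of the data.
MISSING = {"", "n/a", "na", "null"}


def is_missing(value):
    """True if the raw value is absent or one of the assumed missing markers."""
    return value is None or value.strip().lower() in MISSING


def to_float(value):
    """Parse a raw value as a number, returning None when that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def describe(records):
    """Data description: record count and, per field, missing count,
    number of distinct values and a crude inferred type."""
    fields = sorted({name for row in records for name in row})
    summary = {}
    for name in fields:
        values = [row.get(name) for row in records]
        present = [v for v in values if not is_missing(v)]
        numeric = all(to_float(v) is not None for v in present)
        summary[name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "type": "numeric" if present and numeric else "text",
        }
    return {"records": len(records), "fields": summary}


def explore_by_context(records, context, measure):
    """Data exploration: mean of a numeric measure within each context value."""
    groups = defaultdict(list)
    for row in records:
        number = to_float(row.get(measure))
        if number is not None:
            groups[row[context]].append(number)
    return {key: mean(values) for key, values in groups.items()}


def quality_issues(records, measure, minimum=0.0):
    """Data quality: (record index, reason) for values that are missing,
    not numeric, or below an assumed plausible minimum."""
    issues = []
    for index, row in enumerate(records):
        value = row.get(measure)
        if is_missing(value):
            issues.append((index, "missing"))
            continue
        number = to_float(value)
        if number is None:
            issues.append((index, "not numeric"))
        elif number < minimum:
            issues.append((index, "below minimum"))
    return issues


if __name__ == "__main__":
    print(describe(RECORDS))
    print(explore_by_context(RECORDS, "store", "units"))
    print(quality_issues(RECORDS, "units"))
```

Grouping the exploration by the context attribute is the point of the sketch: a measure that looks unremarkable over the whole data set may behave differently in each context, and that difference is what the reports should record.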


Legend for the different representations of original and new/enhanced tasks and outputs: