Mining
— Data Preprocessing —
Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., ratings formerly coded “1, 2, 3” are now coded “A, B, C”
• e.g., discrepancy between duplicate records
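The three kinds of dirtiness above can be checked mechanically. The following is a minimal sketch, not a general-purpose validator: the function name `find_issues`, the record layout, and the reference date are hypothetical, and the birthday “03/07/1997” is assumed to mean March 7, 1997.

```python
from datetime import date


def find_issues(record, today):
    """Return a list of data-quality issues found in one record (illustrative only)."""
    issues = []

    # Incomplete: a required attribute is missing or blank.
    if not (record.get("occupation") or "").strip():
        issues.append("incomplete: occupation is missing")

    # Noisy: a value outside its plausible range.
    salary = record.get("salary")
    if salary is not None and salary < 0:
        issues.append("noisy: salary is negative")

    # Inconsistent: the stated age disagrees with the age derived from the birthday.
    age = record.get("age")
    birthday = record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        derived_age = today.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != age:
            issues.append("inconsistent: age does not match birthday")

    return issues


# Hypothetical record combining the slide's examples; the reference date is an assumption.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
issues = find_issues(record, today=date(2010, 1, 1))
for issue in issues:
    print(issue)
```

Each check is deliberately independent, so one record can be flagged for several problems at once.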
Why Is Data Dirty?
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations at the time the data was collected and at the time it is analyzed
– Human/hardware/software problems
• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission
• Inconsistent data may come from
– Different data sources
– Functional dependency violations (e.g., modifying some linked data but not the rest)
• Duplicate records also need data cleaning
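Functional dependency violations and duplicate records can both be found with a single pass over the data. The sketch below assumes a hypothetical employee table in which `dept_id` should determine `dept_name`; the helper names `fd_violations` and `duplicate_rows` are illustrative, not from any library.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value (violates lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}


def duplicate_rows(rows):
    """Return rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent, hashable form of the row
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates


# Hypothetical data: department 10 was renamed in one record but not the other.
rows = [
    {"emp": "ann", "dept_id": 10, "dept_name": "Sales"},
    {"emp": "bob", "dept_id": 10, "dept_name": "Marketing"},
    {"emp": "cat", "dept_id": 20, "dept_name": "HR"},
    {"emp": "cat", "dept_id": 20, "dept_name": "HR"},
]

violations = fd_violations(rows, "dept_id", "dept_name")
duplicates = duplicate_rows(rows)
print(violations)
print(duplicates)
```

Exact matching only catches verbatim duplicates; near-duplicates (different spellings of the same entity) would need a similarity measure, which this sketch does not attempt.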
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.
– Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
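A small worked example shows how duplicate and missing data can distort a statistic. The salary figures and names below are invented for illustration, and treating a missing salary as 0 is just one assumed way a naive pipeline might mishandle it.

```python
from statistics import mean

# Hypothetical records: "bob" appears twice and "cat" has no recorded salary.
salaries = [("ann", 30000), ("bob", 50000), ("bob", 50000), ("cat", None)]

# Naive: keep the duplicate and treat the missing value as 0.
naive_mean = mean(s if s is not None else 0 for _, s in salaries)

# Cleaned: one record per person (assumes the name identifies the person),
# and missing values are excluded rather than zero-filled.
unique = dict(salaries)
cleaned_mean = mean(s for s in unique.values() if s is not None)

print(naive_mean)
print(cleaned_mean)
```

Here the naive mean is 32500 while the cleaned mean is 40000, so the same table supports two quite different conclusions depending on whether it was cleaned first.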
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
• Broad categories:
–