
Data Preprocessing

IT433 Data Warehousing and Data Mining


Data Preprocessing
• Why preprocess the data?
• Descriptive data summarization
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary

Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., occupation=“ ”
– noisy: containing errors or outliers
• e.g., Salary=“-10”
– inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., rating was “1, 2, 3”, is now “A, B, C”
• e.g., discrepancy between duplicate records
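
The three kinds of dirtiness above can each be expressed as a simple record-level check. The following is a minimal sketch, not a complete cleaning routine: the field names, the sample records, the fixed reference date, and the reading of “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records mirroring the slide's examples (all values are illustrative).
records = [
    {"id": 1, "occupation": "", "salary": 52000, "age": 42, "birthday": date(1997, 7, 3)},
    {"id": 2, "occupation": "nurse", "salary": -10, "age": 31, "birthday": date(1994, 1, 15)},
    {"id": 3, "occupation": "clerk", "salary": 38000, "age": 28, "birthday": date(1997, 2, 1)},
]

def age_on(birthday, today):
    """Age in whole years on the given reference date."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not str(record["occupation"]).strip():
        problems.append("incomplete: occupation is missing")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if record["age"] != age_on(record["birthday"], today):
        problems.append("inconsistent: age does not match birthday")
    return problems

# A fixed reference date keeps the example reproducible (assumed, not from the slides).
REFERENCE_DATE = date(2025, 6, 1)
report = {r["id"]: find_problems(r, REFERENCE_DATE) for r in records}
for record_id, problems in report.items():
    print(record_id, problems or "clean")
```

Record 1 is flagged as both incomplete and inconsistent, record 2 as noisy, and record 3 passes all three checks.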

Why Is Data Dirty?
• Incomplete data may come from
– “Not applicable” data value when collected
– Different considerations at the time the data was collected and at the time it is analyzed
– Human/hardware/software problems

• Noisy data (incorrect values) may come from
– Faulty data collection instruments
– Human or computer error at data entry
– Errors in data transmission

• Inconsistent data may come from
– Different data sources
– Functional dependency violation (e.g., modifying some linked data without updating the values that depend on it)

• Duplicate records also need data cleaning
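
Two of the causes above, functional dependency violations and duplicate records, lend themselves to mechanical detection. The sketch below is illustrative only: the customer rows, the assumed dependency zip → city, and the helper names `fd_violations` and `duplicate_groups` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical customer rows; the dependency zip -> city is assumed to hold.
rows = [
    {"name": "Ann Lee", "zip": "10001", "city": "New York"},
    {"name": "ann lee ", "zip": "10001", "city": "New York"},
    {"name": "Bob Ray", "zip": "10001", "city": "Newark"},
    {"name": "Cy Dunn", "zip": "60601", "city": "Chicago"},
]

def fd_violations(rows, lhs, rhs):
    """Return each lhs value that maps to more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

def duplicate_groups(rows, key_fields):
    """Group row indexes whose key fields match after trimming and lowercasing."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]

violations = fd_violations(rows, "zip", "city")
duplicates = duplicate_groups(rows, ["name", "zip"])
print(violations)  # zip codes associated with conflicting cities
print(duplicates)  # indexes of rows that look like the same customer
```

Normalizing case and whitespace before comparing is a deliberately simple matching rule; real duplicate detection usually needs fuzzier comparison.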

Why Is Data Preprocessing Important?
• No quality data, no quality mining results!

– Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even misleading statistics.

– Data warehouse needs consistent integration of quality data

• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
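
To make the point about misleading statistics concrete, here is a small sketch with invented salary figures. It compares the mean of a clean table with the mean computed naively over a copy that contains one duplicate row and one missing value; treating the missing value as 0 is an assumption chosen to show the worst case.

```python
from statistics import mean

# Invented (employee_id, salary) pairs; None marks a missing salary.
clean = [("a", 40000), ("b", 50000), ("c", 60000)]
dirty = [("a", 40000), ("b", 50000), ("c", 60000), ("c", 60000), ("d", None)]

clean_mean = mean(salary for _, salary in clean)

# Naive handling: keep the duplicate row and count the missing salary as 0.
naive_mean = mean(salary or 0 for _, salary in dirty)

# Basic cleaning: keep one row per employee_id and drop missing salaries.
unique = dict(dirty)
repaired_mean = mean(s for s in unique.values() if s is not None)

print(clean_mean, naive_mean, repaired_mean)
```

The naive mean drops well below the true figure, while removing the duplicate and excluding the missing value recovers it.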


Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– Value added
– Interpretability
– Accessibility
• Broad categories:



