Chapter 4 – Visualization and Summary Statistics
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010
Data Visualization
• “A picture is worth a thousand words”
• Data visualization and summary statistics help condense data
• Effective presentation
• Supports data cleaning (identify missing values, outliers, incorrect values, duplicates) and exploring (combine some groups)
• Helps identify suitable variables
• Mandatory initial step for most data mining applications
Graphs for Data Exploration
Basic Plots
• Line Graphs
• Bar Charts
• Scatterplots
Distribution Plots
• Boxplots
• Histograms
Two Examples
Amtrak Ridership:
• Amtrak routinely collects data on ridership
• Goal: To predict future ridership using the series of monthly ridership data from January 1991 to March 2004
Boston Housing Data:
• Census tracts in Boston
• Several variables (14) – crime rate, location, etc.
• Goal 1: Predict median value of a home in the tract
• Goal 2: Cluster census tracts
Line Graph for Time Series
Shows how ridership patterns of Amtrak trains change over time
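A minimal sketch of such a line graph, assuming Python with matplotlib installed for the plotting step. The ridership values are synthetic placeholders generated for illustration, not the Amtrak data, and the function names are hypothetical.

```python
import math
from datetime import date

def month_range(start_year, start_month, end_year, end_month):
    """Return the first day of each month from start to end, inclusive."""
    months = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        months.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

def synthetic_ridership(months):
    """Made-up series: slow upward trend plus a 12-month cycle (NOT real Amtrak data)."""
    return [1800 + 1.5 * i + 150 * math.sin(2 * math.pi * d.month / 12)
            for i, d in enumerate(months)]

def plot_line_graph(months, values):
    """Draw the time series as a line graph (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.plot(months, values)          # time on the x-axis, measurement on the y-axis
    ax.set_xlabel("Month")
    ax.set_ylabel("Ridership (synthetic units)")
    ax.set_title("Monthly ridership over time")
    return fig

# Same span as the series described above: January 1991 through March 2004.
months = month_range(1991, 1, 2004, 3)
ridership = synthetic_ridership(months)
```

Calling `plot_line_graph(months, ridership)` returns a figure in which both the trend and the repeating yearly pattern are visible at a glance.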
Bar Chart for Categorical Variable
Determine differences between subgroups
Example: 95% of tracts do not border the Charles River
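The subgroup comparison behind a bar chart can be sketched as follows, assuming Python with matplotlib for the plotting step. The 20 indicator values are toy data chosen to mirror the 95% figure above, not the Boston Housing records, and the function names are hypothetical.

```python
from collections import Counter

# Toy indicator for 20 hypothetical tracts: 1 = borders the river, 0 = does not.
# These are illustrative values, NOT the actual Boston Housing data.
borders_river = [0] * 19 + [1]

def category_percentages(values):
    """Return {category: percentage of records} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: 100.0 * n / total for category, n in sorted(counts.items())}

def plot_bar_chart(percentages):
    """Draw one bar per category (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    labels = [str(category) for category in percentages]
    ax.bar(labels, list(percentages.values()))
    ax.set_xlabel("Borders river (0 = no, 1 = yes)")
    ax.set_ylabel("% of tracts")
    return fig

percentages = category_percentages(borders_river)
```

The bar heights are just the category percentages, so the imbalance between the two subgroups shows up directly as the difference in bar height.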
Scatterplot
Displays relationship between two numerical variables
– median value decreases as the percentage of low-status population increases
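A minimal sketch of a scatterplot for two numerical variables, assuming Python with matplotlib for the plotting step. The points are randomly generated with a built-in downward relationship, so they only imitate the pattern described above and are not the Boston Housing data. The correlation helper is an added numeric check, not something the slides call for.

```python
import random

def synthetic_tracts(n=100, seed=0):
    """Generate made-up (low-status %, median value) pairs with a negative relationship."""
    rng = random.Random(seed)
    low_status_pct = [rng.uniform(2, 35) for _ in range(n)]
    median_value = [max(5.0, 35 - 0.8 * x + rng.gauss(0, 3)) for x in low_status_pct]
    return low_status_pct, median_value

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)

def plot_scatter(xs, ys):
    """Draw one point per record (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)
    ax.set_xlabel("% low-status population (synthetic)")
    ax.set_ylabel("Median home value (synthetic)")
    return fig

low_status_pct, median_value = synthetic_tracts()
r = pearson_r(low_status_pct, median_value)  # negative for this synthetic data
```

A downward-sloping cloud of points in `plot_scatter(low_status_pct, median_value)` corresponds to the negative value of `r`.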
Graphs
Three most effective plots:
• Bar charts – usually for categorical variables
• Line graphs – time series data
• Scatterplots – relationship between 2 variables
Used widely in the business world
Domain knowledge and the nature of the task are used to select the appropriate chart for the data at hand
Distribution Plots
Display entire distribution of a numerical variable
Display “how many” of each value occur in a data set or, for continuous data or data with many possible values, “how many” values are in each of a series of ranges or “bins”
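The "how many values per bin" idea can be sketched as below, assuming Python with matplotlib for the plotting step. The ten values, the bin width, and the function names are all hypothetical choices made for illustration.

```python
def bin_counts(values, bin_width, start):
    """Count how many values fall in each half-open bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)   # which bin the value lands in
        left = start + index * bin_width        # left edge of that bin
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))         # empty bins are simply absent

def plot_histogram(values, edges):
    """Draw the same counts as a histogram (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.hist(values, bins=edges)
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")
    return fig

# Made-up values for illustration only.
values = [12, 15, 21, 22, 24, 27, 31, 33, 35, 48]
counts = bin_counts(values, bin_width=10, start=10)
# counts == {10: 2, 20: 4, 30: 3, 40: 1}
```

Calling `plot_histogram(values, list(range(10, 60, 10)))` draws bars with these heights. Changing `bin_width` changes the shape of the picture, which is why the bin choice deserves attention.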
Generally useful for prediction tasks
(supervised