Objectives:
1. Recognize which summaries are used for numeric data or for qualitative data.
2. Construct a frequency table, bar graph and pie chart for qualitative data.
3. Convert raw data into a data array.
4. Construct a frequency table, relative and cumulative frequency tables, a histogram, and an ogive for quantitative data.
5. Construct a stem-and-leaf display to represent quantitative data.
A. Summarizing Qualitative Data (2.1)
1. Introduction: Data are usually collected, entered, and saved into some form of database. In this form, trends and characteristics are not easily detectable, as there can sometimes be millions of pieces of data. We want to summarize/reduce the data to a form which is more easily interpreted and which will aid in decision-making. Many summaries are found in newspapers, magazines, the internet, annual reports, and research studies; therefore, it is important for you to understand how these summaries are constructed.
2. Frequency Table - a tabular summary of data showing the frequency (or percent) of items in each of the distinct categories.
Example: Summary of academic majors:
MAJOR
ACCT
ISDS
PBADM
ACCT
ISDS
PBADM
ISDS
PBADM
ISDS
PBADM
PBADM
.
.
.
MKT
Becomes:
MAJOR    FREQ   RELATIVE FREQ (%freq)
ISDS      24    0.253 (25.3)
FIN        9    0.095 (9.5)
MKT       15    0.158 (15.8)
ACCT       7    0.074 (7.4)
PBADM     40    0.421 (42.1)
TOTAL     95    1.001* (100.1)
* The total differs from 1 (100%) because of rounding.
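The relative frequencies in the table can be reproduced with a short calculation. The following is a minimal Python sketch (Python is not part of these notes; the variable names are illustrative) that divides each frequency by the total and rounds to three decimals, which is also why the rounded values add up to 1.001 rather than exactly 1.

```python
# Frequencies copied from the summary table of academic majors.
counts = {"ISDS": 24, "FIN": 9, "MKT": 15, "ACCT": 7, "PBADM": 40}

total = sum(counts.values())  # total number of students

# Relative frequency = category frequency / total number of observations,
# rounded to three decimals as in the table.
relative = {major: round(freq / total, 3) for major, freq in counts.items()}

for major, freq in counts.items():
    print(f"{major:<6}{freq:>4}{relative[major]:>8.3f}  ({relative[major] * 100:.1f})")

# Summing the ROUNDED values gives slightly more than 1.
rounded_total = sum(relative.values())
print(f"{'TOTAL':<6}{total:>4}{rounded_total:>8.3f}")
```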
3. Visualizing Qualitative Data
a. Bar Graph – graphical display of data where each category is depicted by a bar representing the frequency or proportion of observations in that category. (Note: bars do not touch)
(Example from Course Survey) n=1618
b. Pie Chart – a graphical display of data where the size of each slice of the pie, measured in degrees, is proportional to the frequency or proportion of observations in that category.
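To make the "slices in degrees" idea concrete, here is a small Python sketch that uses the academic-major frequencies from the frequency table earlier in this section. It assumes the usual convention that a full pie is 360 degrees, so each slice angle is the category's proportion times 360.

```python
# Frequencies from the academic-major frequency table.
counts = {"ISDS": 24, "FIN": 9, "MKT": 15, "ACCT": 7, "PBADM": 40}
total = sum(counts.values())

# Slice angle (degrees) = proportion of observations in the category * 360.
angles = {major: 360 * freq / total for major, freq in counts.items()}

for major, angle in angles.items():
    print(f"{major:<6}{angle:6.1f} degrees")
```

The angles add up to 360 degrees, and the largest category (PBADM) gets the largest slice.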
4. Table 2.2 Seattle Weather, February 2010
(Page 18)
a. Data
Sunday
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
1 Rainy
2 Rainy
3 Rainy
4 Rainy
5 Rainy
6 Rainy
7 Rainy
8 Rainy
9 Cloudy
10 Rainy
11 Rainy
12 Rainy
13 Rainy
14 Rainy
15 Rainy
16 Rainy
17 Sunny
18 Sunny
19 Sunny
20 Sunny
21 Sunny
22 Sunny
23 Rainy
24 Rainy
25 Rainy
26 Rainy
27 Rainy
28 Sunny
b. Frequency Table
c. How to get Output in JMP
(1) Save data in an Excel file
… …
(2) Open the Excel file in JMP, using File → Open.
(3) Select Graph → Chart
● Drag Variable (Column 1) to Categories, X, Levels
● Click Down Arrow to select ‘Pie Chart’
● Click OK.
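The frequency table for the Seattle weather data can also be tallied by hand or outside JMP. The sketch below is one way to do it in Python (not a tool used in these notes); the list is built from the 28 daily entries in the data above, in day order.

```python
from collections import Counter

# Weather for days 1-28 of February 2010, in day order, as listed in the data:
# days 1-8 Rainy, day 9 Cloudy, days 10-16 Rainy, days 17-22 Sunny,
# days 23-27 Rainy, day 28 Sunny.
weather = (
    ["Rainy"] * 8 + ["Cloudy"] + ["Rainy"] * 7
    + ["Sunny"] * 6 + ["Rainy"] * 5 + ["Sunny"]
)

n = len(weather)        # number of days observed
freq = Counter(weather) # frequency of each category

# Print category, frequency, and relative frequency, largest category first.
for category, count in freq.most_common():
    print(f"{category:<8}{count:>3}{count / n:>8.3f}")
```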
B. Summarizing Numerical Data (Sect 2.2)
1. Mission Viejo Home Prices Data: (Page 17)
a. Data
b. Frequency Distribution & Histogram
c. Summary:
(1) Range is $300K up to $800K.
(2) Most homes sold in the $500-600K range.
(3) Only 4 houses sold in the lowest range, 2 houses sold in the highest range.
2. Note that qualitative data are automatically categorized. With numeric data, YOU need to determine the numerically-ordered categories/classes.
3. Ordered Array (not in Text)
An Ordered Array is a sequence of raw data in rank order from the smallest to the largest observation.
330 350 370 399 412 … 670 702 735
(Note: Range = 735 – 330 = $405K)
4. Guidelines for Constructing a Frequency Distribution:
a. Select Number of Classes – usually 5 to 20 classes. (Larger data sets require more classes, smaller data sets require fewer classes. This is a very subjective decision; try to avoid the pancake (wide/flat) and skyscraper (tall/thin) effects.)
(In this example, let’s use 5 classes for summarizing)
b. Determine the Width of Class
c. Determine the Class Limits – the boundaries for each class. These are very subjective, but they must be defined so that all observations are included. (Note: we must include the smallest value and largest value)
So what do you think about using:
(Note: each category has the same width)
Little clarity for interpretation!
So what do you do?
d. Modify class limits to gain clarity.
One suggestion – set width to $100K and set minimum to $300K to get:
e. Using the ordered array, COUNT and record the number (Frequency) of observations that fall in each class. (Note the Cumulative Frequency)
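Steps a through e can be sketched in Python as follows. This is only an illustration under stated assumptions: the list holds just the values visible in the ordered array above (the elided middle values are not known and are left out), so the resulting counts are not the actual Mission Viejo frequencies. It also assumes each class runs from its lower limit up to, but not including, the next lower limit.

```python
from itertools import accumulate

# ONLY the values visible in the ordered array; the elided middle values
# are omitted, so these counts are illustrative, not the real distribution.
prices = [330, 350, 370, 399, 412, 670, 702, 735]  # in $1,000s

num_classes = 5

# Step b: a first guess at the class width is range / number of classes.
data_range = max(prices) - min(prices)  # 735 - 330 = 405
raw_width = data_range / num_classes    # 81, which gives awkward class limits

# Step d: modified, easier-to-read limits: minimum 300, width 100.
minimum, width = 300, 100
lower_limits = [minimum + i * width for i in range(num_classes)]

# Step e: count the observations in each class.
# Assumption: a class holds values with lower <= x < lower + width.
frequency = [0] * num_classes
for price in prices:
    frequency[(price - minimum) // width] += 1

# Cumulative frequency is the running total of the class frequencies.
cumulative = list(accumulate(frequency))

for lo, f, c in zip(lower_limits, frequency, cumulative):
    print(f"{lo} up to {lo + width}: frequency {f}, cumulative {c}")
```

Even with this partial list, the lowest class holds 4 values and the highest holds 2, which agrees with the summary given for the full data set.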
5. Relative Frequency and Cumulative Relative Frequency (reports frequencies as proportions)
(Proportions are useful when comparing data sets of different sizes)
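As a small illustration of how the proportions are computed, the sketch below uses hypothetical class frequencies (made up for this example, not the Mission Viejo counts). Each relative frequency is the class frequency divided by the number of observations, and each cumulative relative frequency is the cumulative frequency divided by the number of observations.

```python
from itertools import accumulate

# HYPOTHETICAL class frequencies for five classes (not from the text).
frequency = [2, 5, 8, 4, 1]
n = sum(frequency)  # total number of observations

# Relative frequency = class frequency / n.
relative = [f / n for f in frequency]

# Cumulative relative frequency = cumulative frequency / n.
cumulative_relative = [c / n for c in accumulate(frequency)]

for f, r, cr in zip(frequency, relative, cumulative_relative):
    print(f"{f:>3}{r:>8.2f}{cr:>8.2f}")
```

The last cumulative relative frequency is always 1, because every observation falls at or below the top class.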
6. Class Midpoint – the halfway point between the class boundaries. (Not in Text)
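For the suggested classes (minimum $300K, width $100K, five classes), the midpoints can be computed as the average of each class's lower and upper boundary. A minimal Python sketch, with illustrative variable names:

```python
# Class limits from the suggestion above: minimum 300, width 100 (in $1,000s).
lower_limits = [300, 400, 500, 600, 700]
width = 100

# Midpoint = (lower boundary + upper boundary) / 2.
midpoints = [(lo + (lo + width)) / 2 for lo in lower_limits]

print(midpoints)
```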
7. Note: The original observations are lost in the grouping process, but you gain the power of interpretation that you don’t have with a list of numbers.
8. Visualizing Quantitative Data
a. Histogram – a visual representation of quantitative data where the Horizontal Axis represents the values of the variable of interest (in this case, the price of houses) and the Vertical Axis represents the frequencies or relative frequencies. The heights of the bars represent the frequencies in each of the classes.
(Note: this histogram illustrates skewed data)
9. Frequency Polygon: Formed by plotting each class's frequency above its class midpoint and connecting the points with line segments.
10. Ogive – a graphical representation of cumulative frequencies or cumulative relative frequencies where the X-coordinate is the upper class limit (UCL) and the Y-coordinate is the cumulative value.
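The coordinates of an ogive can be listed before plotting. The sketch below uses the $100K-wide classes starting at $300K and hypothetical frequencies (made up for this example). Starting the curve at zero at the lower limit of the first class is a common convention and is an assumption here, not something stated in these notes.

```python
from itertools import accumulate

lower_limits = [300, 400, 500, 600, 700]  # in $1,000s
width = 100
frequency = [2, 5, 8, 4, 1]               # HYPOTHETICAL class frequencies

# X-coordinate: upper class limit (UCL); Y-coordinate: cumulative frequency.
upper_limits = [lo + width for lo in lower_limits]
cumulative = list(accumulate(frequency))

# Assumed convention: begin at (lowest class limit, 0).
ogive_points = [(lower_limits[0], 0)] + list(zip(upper_limits, cumulative))

for x, y in ogive_points:
    print(x, y)
```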
11. Using JMP to analyze Numeric Data
a. Open the Excel File in JMP, using File → Open.
b. Select Analyze → Distribution
c. Drag Variable (Column 1) to Y, Columns
d. Click OK.
e. Click the red triangle next to the variable (Column 1) and select Histogram Options → Show Counts.
f. To change Class Limits, double-click on the X-axis.
g. Change Min to 300, Max to 800, and Increment to 100, then click OK.
C. Stem-and-Leaf Diagram (2.3)
1. A stem-and-leaf diagram separates data into stems (leading digits) and leaves (or trailing digits).
2. The right-most digit of each value is its leaf; the remaining leading digits form its stem.
3. Example: AGE of The 25 Wealthiest People (www.forbes.com/lists/2010)
3 6
4
5 2234459
6 01225668
7 0449
8 1237
9 0
Interpretation: The youngest is 36 and the oldest is 90; most are in their 60s (followed by the 50s); 16 years separate the youngest from the next youngest; more than half are 65 years or older; etc.
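Because a stem-and-leaf diagram retains every observation, the 25 ages can be read back off the display, and the display can be rebuilt from them. The following Python sketch (the function name is illustrative) splits each two-digit age into a stem (tens digit) and a leaf (units digit) and prints one row per stem, including stems with no leaves.

```python
from collections import defaultdict

# The 25 ages read back from the stem-and-leaf display above.
ages = [36, 52, 52, 53, 54, 54, 55, 59, 60, 61, 62, 62, 65, 66, 66, 68,
        70, 74, 74, 79, 81, 82, 83, 87, 90]

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf display for two-digit integers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # stem = leading digit, leaf = last digit
        leaves[stem].append(str(leaf))
    lowest, highest = min(leaves), max(leaves)
    rows = []
    for stem in range(lowest, highest + 1):
        # Stems with no observations still get a row with no leaves.
        rows.append(f"{stem} {''.join(leaves.get(stem, []))}".rstrip())
    return rows

for row in stem_and_leaf(ages):
    print(row)

# Checks that support the interpretation above.
at_least_65 = sum(1 for age in ages if age >= 65)
print(min(ages), max(ages), at_least_65)
```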
4. Characteristics of Stem-and-Leaf
a. Most effective for relatively small data sets
b. Can be used to determine the minimum, maximum, range, and mode
c. Gives an idea of how the individual values are distributed across the range of the data
d. Retains all data - each observation remains distinctly identifiable