Regression
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
© Galit Shmueli and Peter Bruce 2010
Topics
Explanatory vs. predictive modeling with regression
Example: prices of Toyota Corollas
Fitting a predictive model
Assessing predictive accuracy
Selecting a subset of predictors (variable selection)
Explanatory Modeling
Goal: Explain the relationship between predictors (explanatory variables) and the target
Familiar use of regression in data analysis
Multiple linear regression: a linear relationship between a dependent variable Y (response) and a set of predictors X1,…,Xp
Model goal: Fit the data well and understand the contribution of each explanatory variable to the model
Model performance is assessed by residual analysis
Model is fitted to the entire dataset
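As a minimal sketch of the explanatory workflow, the following fits a one-predictor least-squares line to an entire (made-up) dataset and computes the residuals that a residual analysis would inspect. The numbers and the names `ages`, `prices`, `fit_simple_ols`, and `residuals` are illustrative assumptions, not taken from the Toyota Corolla data.

```python
# Illustrative, made-up data (NOT from ToyotaCorolla.xls).
ages = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical predictor
prices = [20.0, 18.5, 16.0, 15.5, 13.0]   # hypothetical target


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus fitted value for every record."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Explanatory use: the model is fitted to ALL records, with no holdout set.
intercept, slope = fit_simple_ols(ages, prices)
res = residuals(ages, prices, intercept, slope)

print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("residuals:", [round(r, 3) for r in res])
```

With an intercept in the model, least-squares residuals sum to zero, so a residual analysis looks at their pattern (for example, against fitted values) instead of their total.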
Predictive Modeling
Goal: Predict target values in new data where we have predictor values, but not target values
Classic data mining context
Model goal: Optimize predictive accuracy, i.e., how accurately the fitted model can predict new cases
Model is trained on training data, and performance is assessed on validation or test data
Explaining the role of predictors is not the primary purpose (although it is useful)
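The training/validation split described above can be sketched as a random partition of the records. The 60/40 ratio, the seed, and the function name `partition` are assumptions chosen for illustration; the slides do not specify them.

```python
import random


def partition(records, train_fraction=0.6, seed=1):
    """Randomly split records into (training, validation) lists.

    train_fraction and seed are illustrative defaults, not values
    prescribed by the source material.
    """
    rng = random.Random(seed)      # local generator keeps the split reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical record IDs stand in for rows of a dataset.
train, valid = partition(range(10), train_fraction=0.6, seed=1)
print(len(train), "training records;", len(valid), "validation records")
```

The model would then be fitted on `train` only, with `valid` held back to estimate how well it predicts cases it has not seen.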
Regression Method
Predict the value of the dependent variable Y based on predictors X1,…,Xp
Regression coefficients β0, β1,…, βp in the equation:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Coefficients are estimated via the ordinary least squares (OLS) method
Estimated using the training sample
Predictive capacity is assessed by prediction results on the validation set (average squared error)
Assumptions: normality, independence, linearity
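As a rough sketch of this method, the code below estimates the coefficients by solving the OLS normal equations on a training sample and then computes the average squared error on a validation sample. It assumes the model includes an intercept term (the first coefficient), uses made-up data generated from a known linear rule, and all function names are hypothetical. In practice a statistics package would do the fitting.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] solving the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]          # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


def average_squared_error(coefs, rows, y):
    errors = [yi - predict(coefs, r) for r, yi in zip(rows, y)]
    return sum(e * e for e in errors) / len(errors)


# Made-up data following y = 2 + 3*x1 - x2 exactly (no noise).
train_x = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
train_y = [2 + 3 * a - b for a, b in train_x]
valid_x = [(6.0, 1.0), (0.0, 4.0)]
valid_y = [2 + 3 * a - b for a, b in valid_x]

coefs = fit_ols(train_x, train_y)                     # fit on training data only
valid_mse = average_squared_error(coefs, valid_x, valid_y)
print("coefficients:", [round(c, 6) for c in coefs])
print("validation average squared error:", round(valid_mse, 6))
```

Because the toy data are noise-free, the validation error is essentially zero; with real data the validation error is typically larger than the training error, which is why it is the figure used to judge predictive accuracy.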
Example: Prices of Toyota Corolla
ToyotaCorolla.xls
Goal: Predict sale prices of used Toyota Corollas based on their specifications
Data: Prices of 1442 used Toyota Corollas, with their specification information (age, mileage, fuel type, engine size)
Data Sample (showing only the variables to be used in analysis)
Variables Used
Price in