15.062 Data Mining
Problem 1 (25 points)
For the following questions, please give a True or False answer with one or two sentences of justification.

1.1 A linear regression model will be developed using a training data set. Adding variables to the model will always reduce the sum of squared residuals measured on the validation set.
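The contrast behind question 1.1 can be seen numerically. The sketch below (hypothetical data, not from the exam) fits ordinary least squares with and without 20 pure-noise predictors: training SSE can only decrease as variables are added, while validation SSE is free to move either way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one informative predictor plus 20 pure-noise columns.
n_train, n_valid = 30, 30
X = rng.normal(size=(n_train + n_valid, 21))
y = 2.0 * X[:, 0] + rng.normal(size=n_train + n_valid)
Xtr, Xva = X[:n_train], X[n_train:]
ytr, yva = y[:n_train], y[n_train:]

def fit_sse(p):
    """OLS with intercept on the first p columns; returns (training SSE, validation SSE)."""
    A = np.column_stack([np.ones(n_train), Xtr[:, :p]])
    beta, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    Av = np.column_stack([np.ones(n_valid), Xva[:, :p]])
    tr = ytr - A @ beta
    va = yva - Av @ beta
    return float(tr @ tr), float(va @ va)

tr1, va1 = fit_sse(1)     # informative predictor only
tr21, va21 = fit_sse(21)  # plus the 20 noise variables
# Training SSE never increases when variables are added;
# validation SSE makes no such guarantee when the extras are noise.
print(tr1, va1)
print(tr21, va21)
```

The guaranteed monotonicity holds only for the data the model was fit on; that is the distinction the question probes.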
1.2 Although forward selection and backward elimination are fast methods for subset selection in linear regression, only step-wise selection is guaranteed to find the best subset.
1.3 An analyst computes classification functions using discriminant analysis for a data set with three classes C1, C2 and C3. She assumes that all three classes are equally likely to arise in the application. She later learns that the probability of C1 is twice that of C2 and C3. The probabilities for C2 and C3 are equal. If she re-computes the classification functions using this information, the value of the classification function for C1 will increase for every data point.
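In linear discriminant analysis, the classification function for class k carries an additive ln(prior) term, c_k(x) = x'Σ⁻¹μ_k − ½μ_k'Σ⁻¹μ_k + ln p_k, so revising the priors shifts each function by the same constant at every data point. A minimal sketch of that shift for the priors in 1.3, assuming this standard ln-prior form:

```python
import math

# Original assumption: all three classes equally likely.
old_priors = {"C1": 1/3, "C2": 1/3, "C3": 1/3}
# Revised: P(C1) is twice P(C2) = P(C3), and the priors sum to 1.
new_priors = {"C1": 1/2, "C2": 1/4, "C3": 1/4}

# Each classification function changes by ln(new prior) - ln(old prior),
# a constant that does not depend on the data point.
shift = {c: math.log(new_priors[c]) - math.log(old_priors[c]) for c in old_priors}
print(shift)  # C1 shifts by ln(3/2) > 0; C2 and C3 shift by ln(3/4) < 0
```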
1.4 A classification model's misclassification rate on the validation set is a better measure of the model's predictive ability on new data than its misclassification rate on the training set.
1.5 A neural net classifier for two classes constructs a separating boundary between the classes that is linear in weighted sums of the input values.
Problem 2 (10 points)
A dataset of 1000 cases was partitioned into a training set of 600 cases and a validation set of 400 cases. A k-Nearest Neighbors model with k=1 had a misclassification error rate of 8% on the validation data. It was subsequently found that the partitioning had been done incorrectly and that
100 cases from the training data set had been accidentally duplicated and had overwritten 100 cases in the validation dataset. What is the misclassification error rate for the 300 cases that were truly part of the validation data?
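The fact Problem 2 turns on is that a k=1 nearest-neighbor classifier makes no errors on exact copies of its own training cases: each duplicate's nearest neighbor is itself, at distance zero. A sketch on hypothetical continuous data (the exam's actual dataset is not given, and distinct points are assumed not to coincide):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set: 600 two-dimensional points with binary labels.
Xtr = rng.normal(size=(600, 2))
ytr = (Xtr[:, 0] + Xtr[:, 1] > 0).astype(int)

def knn1_predict(x):
    """Label of the single nearest training point (k = 1, Euclidean distance)."""
    d = np.sum((Xtr - x) ** 2, axis=1)
    return ytr[np.argmin(d)]

# Score exact duplicates of 100 training cases, as in the flawed partition.
dups = Xtr[:100]
errors = sum(knn1_predict(x) != ytr[i] for i, x in enumerate(dups))
print(errors)  # 0: each duplicate's nearest neighbor is itself
```

So every one of the 400 reported validation errors must have come from cases that were genuinely held out, which is the observation the question asks you to exploit.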
Problem 3 (10 points)
A Naïve Bayes classifier has been constructed with