Tzu Han Hung (Vivian) CASE 2
1. Estimated profit by random selection
Expected spending per catalog mailed = 0.053 * $103 = $5.46
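The arithmetic in this section (expected spending per catalog and the resulting gross profit) can be checked with a short script. This is a minimal sketch: the variable and function names are illustrative, and the $2 figure is assumed to be the mailing cost per catalog, as the profit formula in this section implies.

```python
# Figures taken from the report; all names are illustrative.
RESPONSE_RATE = 0.053       # purchase rate under random selection
AVG_SPENDING = 103.0        # average spending per purchaser, in dollars
MAILING_COST = 2.0          # assumed: cost per catalog mailed, in dollars
CATALOGS_MAILED = 180_000   # number of catalogs mailed


def expected_spending_per_catalog(rate, avg_spending):
    """Expected spending per catalog = purchase rate * average spending."""
    return rate * avg_spending


def expected_gross_profit(spend_per_catalog, cost, n_catalogs):
    """Gross profit = (expected spending - mailing cost) * catalogs mailed."""
    return (spend_per_catalog - cost) * n_catalogs


# The report rounds the per-catalog spending to cents before computing profit.
spend = round(expected_spending_per_catalog(RESPONSE_RATE, AVG_SPENDING), 2)
profit = expected_gross_profit(spend, MAILING_COST, CATALOGS_MAILED)

print(f"Expected spending per catalog: ${spend:.2f}")
print(f"Expected gross profit: ${profit:,.0f}")
```

Because the per-catalog figure is rounded to cents first, the profit differs slightly from what the unrounded product would give.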
Expected gross profit by random selection = ($5.46 - $2) * 180,000 = $622,800
2. a) We applied partitioning to the “All_data” sheet; the partition output is shown in “Data_Partition1”.
b) The logistic regression output can be seen in “LR_Output1”. The target variable is “purchase”. We selected every variable except sequence_number (a meaningless identifier), source_w (dropped from the set of “source” variables because it is redundant), and spending (not meaningful for the target, purchase probability).
We chose the subset with 7 coefficients because its Cp value of 7.4 is close to 7 and its probability is greater than 10%. We applied the regression model to the test and validation datasets (output is in “LR_Output2”, “LR_Testscore2”, and “LR_ValidLiftChart2”). The test score sheet shows the probability we generated for each row of the test data. The regression model and scoring summary are shown below.
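The Cp criterion used above can be sketched as follows. This is an illustration only: it uses the standard linear-regression definition of Mallows' Cp, which may not match exactly how the tool behind “LR_Output1” computes it for logistic regression. The function names are hypothetical, and apart from the 7-coefficient subset with Cp = 7.4, the candidate values are made-up placeholders.

```python
def mallows_cp(sse_subset, mse_full, n_obs, n_coef):
    """Standard Mallows' Cp: SSE_p / MSE_full - (n - 2p).

    n_coef (p) counts the coefficients in the subset model, intercept included.
    """
    return sse_subset / mse_full - (n_obs - 2 * n_coef)


def closest_to_p(candidates):
    """Return the (n_coef, cp) pair whose Cp is closest to its coefficient count."""
    return min(candidates, key=lambda pair: abs(pair[1] - pair[0]))


# Only (7, 7.4) comes from the report; the other pairs are hypothetical placeholders.
candidates = [(5, 12.9), (6, 9.8), (7, 7.4), (8, 9.1)]
best = closest_to_p(candidates)
print(f"Chosen subset: {best[0]} coefficients, Cp = {best[1]}")
```

A Cp close to the number of coefficients suggests the subset model has little bias relative to the full model, which is why closeness to p, rather than the smallest Cp alone, is used as the selection rule here.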
3. a) The purchaser-only data is in the “Purchasers_only” sheet.
b) The partition is shown in the “Data_Partition2” sheet.
c) The multiple linear regression output can be seen in “MLR_Output1”. The target variable is “spending”. We selected every variable except sequence_number (a meaningless identifier), source_w (dropped from the set of “source” variables because it is redundant), and purchase (always 1 in this dataset).
d) To select the best subset, the first criterion we considered was adjusted R-squared: we looked for the point where its value stops improving, which is at around 8 coefficients. Our second criterion was Cp. Because Cp does not approach the number of coefficients until the model has more than 20 coefficients, we chose the 8-coefficient subset as our regression model, which keeps the model simple and avoids over-fitting. We applied the regression model to the test and validation datasets (output is