Start with the genes-leukemia.csv dataset used in assignment 2. As the target field (the field to be predicted), use TREATMENT_RESPONSE, which has values Success, Failure, or "?" (missing).
A. Examine the records where TREATMENT_RESPONSE is non-missing.
Q1: How many such records are there? Answer: 15 (7 success, 8 failure)
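The count in Q1 can be reproduced with a few lines of standard-library Python. The following is only a minimal sketch: the inline rows are a made-up stand-in for genes-leukemia.csv, the helper name count_responses is hypothetical, and the exact spelling of the column header in the real file is an assumption.

```python
import csv
import io
from collections import Counter

MISSING = "?"  # the dataset marks a missing response with "?"

def count_responses(rows, field="TREATMENT_RESPONSE"):
    """Count each non-missing value of `field` across the rows."""
    return Counter(r[field] for r in rows if r[field] != MISSING)

# Toy stand-in for genes-leukemia.csv (made-up rows, not the real data).
toy_csv = """SNUM,TREATMENT_RESPONSE
1,Success
2,?
3,Failure
4,Failure
5,?
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
counts = count_responses(rows)

# Total number of non-missing records, then the per-value breakdown.
print(sum(counts.values()), dict(counts))
```

To run it against the real file, replace io.StringIO(toy_csv) with open("genes-leukemia.csv", newline="") and adjust the field name if the header is spelled differently.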
Q2: Can you describe these records using other sample fields (e.g. Year from XXXX to YYYY, or Gender = X, etc.)? Answer: CLASS = AML and Source = CALGB
Q3: Why is it not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing?
Answer: because the records with missing response may correspond to either success or failure - missing in this dataset is not a separate value.
B. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (sample number) but remove sample fields that are all the same or missing. Call the reduced dataset genes-reduced.csv
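The selection step in B can be sketched in standard-library Python as follows. This is an illustration under stated assumptions, not the procedure used in the course: the helper reduce_rows is hypothetical, the inline rows and their values are made up, and the rule only drops fields that are exactly constant or entirely missing (it does not catch fields that are "almost the same").

```python
import csv
import io

MISSING = "?"  # marker the dataset uses for a missing value

def reduce_rows(rows, target="TREATMENT_RESPONSE", keep=("SNUM",)):
    """Keep rows with a known target, then drop uninformative columns."""
    selected = [r for r in rows if r[target] != MISSING]
    kept_fields = []
    for field in rows[0].keys():
        # Distinct known values of this field among the selected rows.
        known = {r[field] for r in selected if r[field] != MISSING}
        # Keep the ID, the target, and any field with at least two values.
        if field in keep or field == target or len(known) > 1:
            kept_fields.append(field)
    reduced = [{f: r[f] for f in kept_fields} for r in selected]
    return kept_fields, reduced

# Toy stand-in for genes-leukemia.csv (made-up rows, not the real data).
toy_csv = """SNUM,CLASS,Source,PS,TREATMENT_RESPONSE
1,AML,CALGB,0.7,Success
2,ALL,OTHER,?,?
3,AML,CALGB,0.9,Failure
4,AML,CALGB,0.8,Failure
"""
rows = list(csv.DictReader(io.StringIO(toy_csv)))
fields, reduced = reduce_rows(rows)

# Write the reduced table; an in-memory buffer stands in for genes-reduced.csv.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields)
writer.writeheader()
writer.writerows(reduced)
print(out.getvalue())
```

In the toy data, CLASS and Source are constant across the selected rows and are dropped, while SNUM, PS and the target survive. To produce the actual file, write to open("genes-reduced.csv", "w", newline="") instead of the in-memory buffer.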
Q4: Which sample fields should you keep? Answer:
FAB_if_AML
pct_Blasts
Treatment_Response
PS
The remaining fields either have all missing values or have the same (or almost the same) values for the 15 selected cases.
C. Build a CART model using leave-one-out cross-validation.
Q5: What tree do you get, and what is the expected error rate? Answer:
U82759 < 813.5
Error rate: 27%
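Leave-one-out cross-validation holds out each record in turn, fits the model on the rest, and counts how often the held-out record is misclassified. The sketch below shows the procedure in plain Python. It is not the CART program used for the answer above: as a stand-in it fits a one-split "stump" on a single numeric feature, mirroring the shape of a tree with one test such as U82759 < 813.5. The function names and the toy values are made up for illustration.

```python
def fit_stump(xs, ys):
    """Fit a one-split classifier on one numeric feature.

    Returns (threshold, label_below, label_at_or_above), chosen to
    minimise the number of training errors.
    """
    labels = sorted(set(ys))
    values = sorted(set(xs))
    best = None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum((below if x < t else above) != y
                             for x, y in zip(xs, ys))
                if best is None or errors < best[0]:
                    best = (errors, t, below, above)
    if best is None:
        # No split is possible: always predict the majority class.
        majority = max(labels, key=ys.count)
        return (float("inf"), majority, majority)
    return best[1:]

def loo_error(xs, ys):
    """Leave-one-out error rate of the stump on (xs, ys)."""
    wrong = 0
    for i in range(len(xs)):
        # Train on every record except record i.
        t, below, above = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predicted = below if xs[i] < t else above
        wrong += predicted != ys[i]
    return wrong / len(xs)

# Toy expression values and outcomes (made up, not from the dataset).
toy_x = [100, 200, 300, 900, 1000, 1100]
toy_y = ["Failure"] * 3 + ["Success"] * 3
separable_error = loo_error(toy_x, toy_y)

# A feature with no variation carries no information about the outcome.
constant_error = loo_error([5, 5, 5, 5], ["A", "A", "B", "B"])
print(separable_error, constant_error)
```

If the reported rate is a plain misclassification fraction, 27% on 15 records would be consistent with 4 of the 15 held-out records being misclassified (4/15 ≈ 26.7%); that reading is an inference, not something the CART output above states.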
Q6: What are the important variables and their relative importance, according to CART? Answer:
U82759       100.00
U12471_CDS1   31.59
M91432        29.67
AF012024_S    29.67
L13278        29.67
M81933        27.75
Q7: Remove the top predictor, U82759, and re-run CART. What do you get? Answer: A tree with no predictor and an error of 1, meaning that the remaining fields show no valid correlation with the outcome.
D. Extra credit (10%): Use Google to search the web