Poulomi Pal 2012PGP078
Sukshit Kapur 2012PGP097
Mohit Dhami 2012PGP070
Ujjwal Shankar 2012PGP103
Vineet Jain 2012PGP061
Assumptions for the Assignment:
We have combined the fatalities and non-injuries in MAX_SEV_IR into a single category, 0, because our class of interest is injury.
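The recoding described above can be sketched as follows. This is a minimal illustration in Python, not the tool used for the assignment. It assumes MAX_SEV_IR is coded 0 = no injury, 1 = injury, 2 = fatality (the report does not state the coding), and the name recode_severity is hypothetical.

```python
# Assumed coding of MAX_SEV_IR (not stated in the report):
# 0 = no injury, 1 = injury, 2 = fatality.
INJURY_CODE = 1


def recode_severity(max_sev_ir):
    """Return 1 for the class of interest (injury), 0 for everything else."""
    return 1 if max_sev_ir == INJURY_CODE else 0


# Small made-up sample, only to show the mapping.
sample = [0, 1, 2, 1, 0]
recoded = [recode_severity(v) for v in sample]
print(recoded)
```

With this mapping, fatalities and non-injuries both end up in category 0, so every model is trained on a two-class target.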
We have included every predictor when running the different models, except in the case of the tree, where we ran a random forest first and then ran the tree. In doing so we zeroed in on the predictors of most interest.
Our class of interest is injury, so we have assumed a false negative to be substantially more costly than a false positive.
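One way to make this assumption concrete is to weight the cells of the error matrix by their cost. The sketch below is only an illustration: the 5:1 cost ratio, the two sets of predictions, and the function names are all made up, and none of these numbers come from the assignment.

```python
def error_matrix(actual, predicted):
    """Count outcomes with 1 = injury (positive class) and 0 = other."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        elif a == 1 and p == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def total_cost(counts, cost_fn, cost_fp):
    """Misclassification cost: only the two error cells contribute."""
    return counts["FN"] * cost_fn + counts["FP"] * cost_fp


# Hypothetical costs: a missed injury is assumed to be 5 times as
# costly as a false alarm.
COST_FN = 5.0
COST_FP = 1.0

actual = [1, 1, 1, 0, 0, 0]
model_a = [1, 0, 0, 0, 0, 0]  # misses two injuries (2 FN, 0 FP)
model_b = [1, 1, 1, 1, 1, 0]  # raises two false alarms (0 FN, 2 FP)

cost_a = total_cost(error_matrix(actual, model_a), COST_FN, COST_FP)
cost_b = total_cost(error_matrix(actual, model_b), COST_FN, COST_FP)
print(cost_a, cost_b)
```

Both sets of predictions have the same overall error rate (2 of 6), but under the assumed costs the one that misses injuries is the more expensive, which is why the overall error rate alone is not enough to compare the models.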
Output for Forest:
Here we use the mean decrease in accuracy to choose the predictors used for classifying the data. The larger the mean decrease in accuracy, the more important the predictor is for classification. Based on the output, we select the important predictors to be used for running the tree. We have chosen the 15 predictors with the highest mean decrease in accuracy.
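The idea behind mean decrease in accuracy can be sketched by permuting one predictor at a time and measuring how much the accuracy drops. The code below is a simplified illustration under stated assumptions, not the procedure of the tool used for the assignment: random forest implementations typically compute this measure on out-of-bag data for each tree, whereas this sketch uses one fixed toy model and one made-up data set. The function names, the toy data, and the predictor names "A" and "B" are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows the model classifies correctly."""
    correct = sum(1 for r, y in zip(rows, labels) if model(r) == y)
    return correct / len(labels)


def mean_decrease_accuracy(model, rows, labels, n_features, repeats=20, seed=0):
    """Average accuracy drop when each predictor's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importance = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between predictor j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importance.append(sum(drops) / repeats)
    return importance


def top_k(names, importance, k):
    """Names of the k predictors with the highest importance."""
    ranked = sorted(zip(names, importance), key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy data: the label equals predictor "A"; predictor "B" is ignored by the model.
rows = [[i % 2, (i // 2) % 2] for i in range(20)]
labels = [r[0] for r in rows]


def model(row):
    return row[0]


scores = mean_decrease_accuracy(model, rows, labels, n_features=2)
selected = top_k(["A", "B"], scores, k=1)
print(scores, selected)
```

Shuffling a predictor the model relies on lowers the accuracy, while shuffling one it ignores leaves the accuracy unchanged. Ranking by this drop and keeping the top k mirrors the selection of the 15 predictors, with k = 15.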
We now look at the variable importance table to decide on the important predictors.
Using the data from the above table, we find that the following predictors are of interest:
INJURY_CRASH FATALITIES NO_INJ_I PRPTYDMG_CRASH PED_ACC_R SPD_LIM VEH_INVL
REL_RWY_R STRATUM_R MANCOL_I_R RELJCT_I_R TRAF_CON_R WEATHER_R SUR_COUND
NON_INVL
The output of the forest is attached below in XML format.
Using the above reduced data set, we run the tree.
Running the tree on the reduced data set obtained from the forest gives the output below:
Checking the rules of the tree gives the output below:
The tree is drawn below:
Looking at the tree, we can infer that the two most important predictors for classifying the data are INJURY_CRASH and NO_INJ_I.
Running the neural net gives the output below:
The below gives the error matrix of the