We briefly describe here the classification algorithms applied, the BBC forum dataset [112] used, and the performance evaluation measures used to analyze the results.
5.5.1 Classification Algorithms
For the classification task in this module, we used four classification algorithms provided in ODM [107]: Support Vector Machine, Decision Tree, Naïve Bayes and Logistic Regression. As discussed earlier, ODM has been used for data mining tasks in a number of existing research works [108-110]. The Minimum Description Length (MDL) algorithm has been applied for attribute importance, and all the proposed features show positive results.
5.5.2 Dataset
The choice of a proper dataset is significant, as it should cover diverse topics from …
In addition, the Receiver Operating Characteristic (ROC), Area Under the Curve (AUC), Lift and Cost measures have also been used to evaluate classification performance. These measures are briefly described as follows:
5.5.4.1 ROC
Receiver Operating Characteristic (ROC) is a metric for comparing actual and predicted values in a classification model. It is applied to the analysis of binary classification to obtain in-depth insight into the decision-making ability of the classification model. ROC is plotted as a curve with the false positive rate on the X axis and the true positive rate on the Y axis. The top left corner is the optimal location on an ROC graph, indicating a high true positive rate and a low false positive rate [113, 114].
An ROC graph is defined parametrically as: x = FPrate(t), y = TPrate(t). (21)
Where t represents the probability threshold value, which by default is …
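As an illustration of the parametric definition in Eq. (21), the following minimal sketch sweeps the threshold t over the predicted probabilities and records one (FPrate(t), TPrate(t)) point per threshold. It is not the Oracle Data Miner implementation; the function names roc_points and auc, the rule that an example is predicted positive when its score is at least t, and the trapezoidal approximation of the area under the curve are assumptions made for this example.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (FPrate(t), TPrate(t)) points, one per distinct threshold t.

    scores: predicted probability of the positive class (hypothetical input).
    labels: actual classes, 1 for positive and 0 for negative.
    """
    P = sum(labels)            # number of actual positives
    N = len(labels) - P        # number of actual negatives
    if P == 0 or N == 0:
        raise ValueError("need at least one positive and one negative example")

    points = [(0.0, 0.0)]      # threshold above every score: nothing predicted positive
    # Assumption: an example is predicted positive when score >= t.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / N, tp / P))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Approximate the area under the ROC points with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A curve that rises quickly towards the top left corner yields an area close to 1, whereas a classifier that ranks examples at random yields an area of about 0.5.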
Lift is the ratio of the percentage of correct positive classifications to the percentage of actual positive classifications in the test data. Lift is computed using the parametric definition [113]: x = Yrate(t) = (TP(t) + FP(t))/(P + N), y = TP(t). (23)
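The parametric definition in Eq. (23) can be sketched in the same way: for each threshold t, the X coordinate is the fraction of all test examples predicted positive and the Y coordinate is the number of true positives among them. This is an illustrative sketch rather than the Oracle Data Miner implementation; the function name lift_chart_points and the rule that an example is predicted positive when its score is at least t are assumptions.

```python
from typing import List, Tuple


def lift_chart_points(scores: List[float], labels: List[int]) -> List[Tuple[float, int]]:
    """Return (Yrate(t), TP(t)) points of Eq. (23), one per distinct threshold t.

    scores: predicted probability of the positive class (hypothetical input).
    labels: actual classes, 1 for positive and 0 for negative.
    """
    total = len(labels)        # P + N, the size of the test data
    if total == 0:
        raise ValueError("test data must not be empty")

    points = [(0.0, 0)]        # threshold above every score: nothing predicted positive
    # Assumption: an example is predicted positive when score >= t.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append(((tp + fp) / total, tp))
    return points
```

A good model reaches most of the P actual positives while Yrate(t) is still small, so its lift chart climbs steeply at the left of the plot.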
5.5.7 Cost
Cost is an additional measure introduced by Oracle Data Miner. It indicates the damage done by incorrect predictions and is useful for comparing classification models. A lower cost indicates greater confidence in the predictive ability of the classification model.
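The exact cost computation used by Oracle Data Miner is not reproduced here. As a generic, cost-sensitive sketch of the idea, each false positive and each false negative can be charged a penalty and the penalties summed. The function name total_cost and the penalty parameters cost_fp and cost_fn are hypothetical and chosen only for illustration.

```python
from typing import List


def total_cost(actual: List[int], predicted: List[int],
               cost_fp: float = 1.0, cost_fn: float = 1.0) -> float:
    """Sum the penalties of all incorrect predictions.

    cost_fp and cost_fn are hypothetical penalties for a false positive
    and a false negative; correct predictions are assumed to cost nothing.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn
```

Under this sketch, two models evaluated on the same test data with the same penalties can be compared directly, and the model with the lower total cost is preferred.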
5.6 Results and Discussion
The post and thread classification results of the four classification algorithms are compared using the evaluation measures of Accuracy, Precision, Recall and F-measure. In addition, the performance measures of ROC, AUC, Lift and Cost are used for in-depth analysis.
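For reference, the four evaluation measures can be computed from the counts of true and false positives and negatives. The sketch below uses the standard definitions for a binary task, with F-measure taken as the harmonic mean of Precision and Recall. The function name binary_metrics and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
from typing import Dict, List


def binary_metrics(actual: List[int], predicted: List[int]) -> Dict[str, float]:
    """Compute Accuracy, Precision, Recall and F-measure for labels in {0, 1}."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

    accuracy = (tp + tn) / len(actual)
    # Assumption: a measure with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```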
5.6.1 Post