A. Become familiar with the use of the WEKA workbench to invoke several different machine learning schemes.
Use the latest stable version, and try both the graphical interface (Explorer) and the command-line interface (CLI).
See the Weka home page for documentation.
B. Use the following learning schemes with their default settings to analyze the weather data (in weather.arff). For test options, first choose "Use training set", then choose "Percentage split" with the default 66% split. Report each model's percent error rate.
ZeroR (majority class)
OneR
Naive Bayes Simple
J48 (J4.8)
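As a rough illustration of what the two test options measure, here is a minimal Python sketch of a majority-class predictor (the idea behind ZeroR) evaluated both ways. It is not Weka code. The label list is hypothetical and is not taken from weather.arff, and the split keeps file order, whereas Weka may randomize the data before a percentage split.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the majority class of the training labels (the ZeroR idea)."""
    return Counter(train_labels).most_common(1)[0][0]

def error_rate(prediction, test_labels):
    """Percent of test labels that differ from the single predicted class."""
    wrong = sum(1 for y in test_labels if y != prediction)
    return 100.0 * wrong / len(test_labels)

# Hypothetical class labels in file order (NOT the contents of weather.arff).
labels = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "no"]

# "Use training set": train and test on the same instances.
train_error = error_rate(zero_r(labels), labels)

# Percentage split: the first 66% trains the model, the rest tests it.
# Assumption: no shuffling is done here, unlike Weka's default behaviour.
cut = round(len(labels) * 0.66)
split_error = error_rate(zero_r(labels[:cut]), labels[cut:])

print(f"training-set error: {train_error:.1f}%")
print(f"percentage-split error: {split_error:.1f}%")
```

With so few instances the held-out part is tiny, so the percentage-split figure can swing widely depending on which instances land in it.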
C. Which of these classifiers are you more likely to trust when determining whether to play? Why?
D. How does the accuracy measured on the training set compare with the accuracy measured using a percentage split?
Assignment 2: Preparing the data and mining it
A. Take the file genes-leukemia.csv (here is the description of the data) and convert it to a Weka ARFF file, genes-a.arff.
You can convert the file either with a text editor such as emacs (the brute-force way) or by finding a Weka command that converts a .csv file to .arff (the smart way).
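To see what the conversion involves, here is a minimal Python sketch that builds an ARFF header from a CSV header row and copies the data rows underneath. It is not the Weka converter the task asks you to find. The column names and values in the sample are hypothetical, and the sketch assumes no missing values, no quoted fields, and attribute names without spaces.

```python
import csv
import io

def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = column names) to ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            # Non-numeric columns become nominal, listing each distinct value.
            values = sorted(set(column))
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"

# Hypothetical sample, not the real genes-leukemia.csv.
sample = "ID,G1,CLASS\n1,0.5,ALL\n2,-1.2,AML\n"
arff = csv_to_arff(sample, "genes")
print(arff)
```

A column that contains a missing-value marker would be treated as nominal by this sketch, which is one reason to prefer a proper converter on real data.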
B. The target field is CLASS. Run J48 on genes-leukemia with the "Use training set" test option.
C. Use genes-leukemia.arff to create two subsets:
genes-leukemia-train.arff, with the first 38 samples (s1 ... s38) of the data
genes-leukemia-test.arff, with the remaining 34 samples (s39 ... s72)
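One way to make the two subsets is to copy the ARFF header into both files and divide the data rows by position. The Python sketch below is a minimal version of that idea, assuming the samples appear in the file in order s1 ... s72 and that each instance occupies one line. The function name and the tiny sample are hypothetical.

```python
def split_arff(arff_text, n_train):
    """Split ARFF text into (train, test), both sharing the same header."""
    lines = arff_text.splitlines()
    # Locate the @data marker (ARFF keywords are case-insensitive).
    data_at = next(i for i, line in enumerate(lines)
                   if line.strip().lower() == "@data")
    header = lines[:data_at + 1]
    # Keep only instance rows: skip blank lines and % comments.
    rows = [line for line in lines[data_at + 1:]
            if line.strip() and not line.lstrip().startswith("%")]
    train = "\n".join(header + rows[:n_train]) + "\n"
    test = "\n".join(header + rows[n_train:]) + "\n"
    return train, test

# Hypothetical five-instance file; the assignment would use n_train=38.
sample = (
    "@relation demo\n"
    "@attribute G1 numeric\n"
    "@attribute CLASS {ALL,AML}\n"
    "@data\n"
    "0.1,ALL\n0.2,ALL\n0.3,AML\n0.4,AML\n0.5,ALL\n"
)
train_text, test_text = split_arff(sample, 3)
print(train_text)
print(test_text)
```

Both outputs keep identical attribute declarations, which matters because a test set generally has to declare the same attributes as the training set.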
D. Train J48 on genes-leukemia-train.arff and specify "Use training set" as the test option.
What decision tree do you get? What is its accuracy?
E. Now specify genes-leukemia-test.arff as the test set.
What decision tree do you get, and how does its accuracy compare to the one in the previous question?
F. Now remove the field "Source" from the classifier's input (uncheck the box next to Source and click Apply Filter in the top menu), then repeat steps D and E.
What do you observe? Does the accuracy on the test set improve and, if so, why do you think it does?
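Removing a field amounts to dropping one attribute declaration and the matching column from every instance. The Python sketch below shows that effect on ARFF text. It is an illustration only, not the Weka filter, and it assumes simple comma-separated rows with no quoted commas. The function name and sample values are hypothetical.

```python
def remove_attribute(arff_text, name):
    """Return ARFF text with the named attribute and its column removed."""
    out = []
    index = None      # column position of the attribute to drop
    position = 0      # running count of @attribute lines seen
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if in_data and stripped and not stripped.startswith("%"):
            values = stripped.split(",")
            del values[index]
            out.append(",".join(values))
        elif stripped.lower().startswith("@attribute"):
            if stripped.split()[1] == name:
                index = position      # remember the column, drop the line
            else:
                out.append(line)
            position += 1
        else:
            if stripped.lower() == "@data":
                if index is None:
                    raise ValueError("attribute not found: " + name)
                in_data = True
            out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical sample; only the field names Source and CLASS come from the task.
sample = (
    "@relation genes\n"
    "@attribute Source {A,B}\n"
    "@attribute G1 numeric\n"
    "@attribute CLASS {ALL,AML}\n"
    "@data\n"
    "A,0.5,ALL\n"
    "B,-1.2,AML\n"
)
result = remove_attribute(sample, "Source")
print(result)
```

If the train and test files are filtered separately, the same attribute has to be removed from both so that their headers still match.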