1. Dataset

For this tutorial, we will work on some unlabeled data from the US Census Bureau. The following introduction to the dataset will help you understand its attributes and interpret the results.

The attributes of the raw data have been discretized so that each has fewer distinct values; this discretized version is the data we are working with. Descriptions of the raw attributes are at: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990raw.attributes.txt

Some attributes are carried over unchanged from the raw dataset; their names in the current dataset have an “i” added in front. Attributes that were discretized have a “d” added in front of their original names. For example, in the current dataset the attribute “dAge” is discretized from the raw dataset, and its description is found under “AAGE” in the raw data description (Age); “iAvail” means the attribute values are unchanged from the raw values, and the corresponding attribute is “AVAIL” in the raw data description (Available for work). For more information, the mapping functions from raw attributes to current attributes can be found here: http://archive.ics.uci.edu/ml/databases/census1990/USCensus1990.mapping.sql

The file used in this tutorial is an abbreviated version of the dataset, containing the first 10,000 of the 2,458,285 instances.

[Note: If your computer does not have much memory, you will notice that the clustering process below runs very slowly. In that case you may use the file UScensus_3000.xlsx for this Lab. This file has only 3,000 instances; although its results may be less interesting than those from the larger file, it should take much less memory than the set with 10,000 instances.]

Start RapidMiner, use Read Excel to load UScensus_10000.xlsx, and set the role of “case ID” to id. Then store the dataset in your repository (please recall tutorial 2 on importing and storing data). Please note the dataset is a little bigger than those we have worked on,