R and Data Mining: Examples and Case Studies
Yanchang Zhao yanchang@rdatamining.com http://www.RDataMining.com
April 26, 2013
©2012-2013 Yanchang Zhao. Published by Elsevier in December 2012. All rights reserved.
Messages from the Author
Case studies: The case studies are not included in this online version. They are reserved exclusively for the book version.
Latest version: The latest online version is available at http://www.rdatamining.com. The website also provides an R Reference Card for Data Mining.
R code, data and FAQs: R code, data and FAQs are provided at http://www.rdatamining.com/books/rdm.
Chapters/sections to add: topic modelling and stream graph; spatial data analysis. Please let me know if any topics interest you but are not yet covered by this document/book.
Questions and feedback: If you have any questions or comments, or come across any problems with this document or its book version, please feel free to post them to the RDataMining group below or email them to me. Thanks.
Discussion forum: Please join our discussions on R and data mining at the RDataMining group <http://group.rdatamining.com>.
Twitter: Follow @RDataMining on Twitter.
A sister book: See our upcoming book titled Data Mining Application with R at http://www.rdatamining.com/books/dmar.
Contents
List of Figures
List of Abbreviations
1 Introduction
1.1 Data Mining
1.2 R
1.3 Datasets
1.3.1 The Iris Dataset
1.3.2 The Bodyfat Dataset
2 Data Import and Export
2.1 Save and Load R Data
2.2 Import from and Export to .CSV