Ashok Chirla
Computer Science Engineering, V.R. Siddhartha Engineering College, Kanuru, Vijayawada, A.P., India
ashok.chirla@gmail.com

Abstract— Document clustering is an important tool in an era of rapidly growing information. It is the process of grouping text documents into categories and has found applications in domains such as information retrieval and web information systems. Ontology-based computing is emerging as a natural evolution of existing technologies to deal with the information onslaught. In this dissertation work, background knowledge derived from WordNet, used as an ontology, is applied during the preprocessing of documents for document clustering. Document vectors constructed from WordNet synsets are used as input for clustering. A comparative analysis is carried out between clustering using k-means and clustering using bisecting k-means. A document categorization tool is developed that summarizes the hierarchy of concepts obtained from WordNet during the clustering phase. The GUI tool shows the association between WordNet concepts and the documents belonging to each concept.

Keywords: Document clustering, Ontology, BOW, POS Tagging, Stemming, Labeling, bisecting k-means algorithm.
I. INTRODUCTION
With the abundance of text documents available through the Web and corporate document management systems, the partitioning of document sets into previously unseen categories ranks high on the priority list for many applications like business intelligence systems. Nowadays the problem is often not to access text information but to select the relevant documents [2].
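To make the partitioning task concrete, the following is a minimal sketch of bisecting k-means, one of the two algorithms compared in this work. It is an illustration under stated assumptions, not the implementation used in the dissertation: the document vectors are small hand-made numeric lists rather than vectors built from WordNet synsets, the cluster chosen for splitting is the one with the largest sum of squared errors (other selection rules exist), and the names `two_means`, `bisecting_kmeans`, `trials` and `seed` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sse(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, rng, iters=20):
    """Plain k-means with k = 2; returns (group0, group1) or None if no valid split."""
    centers = rng.sample(points, 2)
    split = None
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[idx].append(p)
        if not groups[0] or not groups[1]:
            break  # keep the last split that had two non-empty sides
        split = groups
        centers = [centroid(g) for g in groups]
    return split


def bisecting_kmeans(points, k, trials=5, seed=0):
    """Repeatedly bisect one cluster until k clusters exist (assumed rule: largest SSE)."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=sse)
        if len(target) < 2:
            break  # nothing left that can be split
        best = None
        for _ in range(trials):
            cand = two_means(target, rng)
            if cand is None:
                continue
            if best is None or sse(cand[0]) + sse(cand[1]) < sse(best[0]) + sse(best[1]):
                best = cand
        if best is None:
            break  # e.g. every point in the target is identical
        clusters.remove(target)
        clusters.extend(best)
    return clusters


# Toy "document vectors" (hypothetical values, one row per document).
docs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],
]
clusters = bisecting_kmeans(docs, 3)
for i, members in enumerate(clusters):
    print(i, members)
```

Each bisection is tried several times with different starting centers and the split with the lowest total error is kept, which reduces the sensitivity to initialization that a single k-means run has.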
The steady development of computer hardware technology in the last few years has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. These technologies provide good support to the database and information industry and make a huge number of databases and information repositories
References:
[1] A. Hotho, S. Staab and A. Maedche (2001), “Ontology-based Text Clustering”, In Proceedings of the IJCAI-2001 Workshop Text Learning Beyond Supervision.
[3] Michael Steinbach, George Karypis and Vipin Kumar (2001), “A Comparison of Document Clustering Techniques”, Department of Computer Science and Engineering, University of Minnesota, Technical Report 00-034.
[4] Christiane Fellbaum (2005), “WordNet and wordnets”, In Keith Brown et al. (eds.), Encyclopedia of Language and Linguistics, Second Edition, Oxford: Elsevier, 665–670.
[9] S. C. Punitha, K. Mugunthadevi and M. Punithavalli (2011), “Impact of Ontology based Approach on Document Clustering”, International Journal of Computer Applications 22(2):22–26, May 2011. Published by Foundation of Computer Science.
[10] Sam Scott and Stan Matwin (1997), “Text Classification Using WordNet Hypernyms”, Computer Science Dept., University of Ottawa, Ottawa, Canada.