Samer Hassan, Rada Mihalcea, and Carmen Banea
Department of Computer Science, University of North Texas
samer@unt.edu, rada@cs.unt.edu, carmenb@unt.edu
Abstract
This paper describes a new approach for estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random-walk model is applied to a graph encoding words and co-occurrence dependencies, resulting in scores that quantify how much a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random-walk based approach outperforms the traditional term frequency approach to feature weighting.
1 Introduction
Term frequency has long been used as a major factor for estimating the probabilistic distribution of features in a document, and it has been employed in a broad spectrum of tasks, including language modeling [18], feature selection [29, 24], and term weighting [13, 20]. The main drawback of the term frequency method is that it relies on a bag-of-words approach: it assumes feature independence and disregards any dependencies that may exist between words in the text. In other words, it defines a "random choice," where the weight of a term is proportional to the probability of choosing that term at random from the set of terms that constitute the text. Such an approach may be effective at capturing the relevance of a term in a local context, but it fails to account for the global effect that the term's presence exerts on the entire text segment. We argue that the bag-of-words model may not be the best technique for capturing term importance. Instead, given that relations in the text can be preserved by maintaining a structural representation of the text, a method that takes these relations into account could provide a better estimate of term importance.
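To make this contrast concrete, below is a minimal Python sketch of the idea, as our illustration rather than the authors' implementation: it builds a word co-occurrence graph over a sliding window and scores terms with a PageRank-style [2] random walk. The window size, damping factor, iteration count, and sample text are illustrative assumptions, not the paper's exact settings.

from collections import Counter, defaultdict

def random_walk_weights(tokens, window=2, damping=0.85, iters=50, tol=1e-6):
    """Score terms with a PageRank-style random walk over the word
    co-occurrence graph of a text (illustrative sketch)."""
    # Link any two distinct terms that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)
    nodes = list(neighbors)
    n = len(nodes)
    score = dict.fromkeys(nodes, 1.0 / n)
    # Power iteration: a term is important if it co-occurs with
    # other important terms, not merely if it is frequent.
    for _ in range(iters):
        new = {
            w: (1 - damping) / n
               + damping * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in nodes
        }
        if max(abs(new[w] - score[w]) for w in nodes) < tol:
            return new
        score = new
    return score

if __name__ == "__main__":
    text = "london based sugar operators report the market in sugar is dull".split()
    tf = Counter(text)                 # bag-of-words baseline
    rw = random_walk_weights(text)     # graph-based alternative
    for w in sorted(rw, key=rw.get, reverse=True):
        print(f"{w:10s} tf={tf[w]}  rw={rw[w]:.4f}")

Under such a scheme, a term's weight depends on the weights of the terms it co-occurs with, so a word that appears once in a well-connected neighborhood of the graph can outrank an equally frequent word at the periphery, which is precisely the global effect that term frequency alone cannot capture.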
References
[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of Naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 1998.
[3] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Text REtrieval Conference, 1994.
[4] C. Liao, S. Alpha, and P. Dixon. Feature preparation in text categorization. Oracle Corporation, 2002.
[5] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001.
[6] P. Dai, U. Iurgel, and G. Rigoll. A novel feature combination approach for spoken document classification with support vector machines, 2003.
[7] F. Debole and F. Sebastiani. Supervised term weighting for automated text categorization. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 784–788, New York, NY, USA, 2003. ACM Press.
[8] B. Dom, I. Eiron, A. Cozzi, and Y. Shang. Graph-based ranking algorithms for e-mail expertise analysis. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, California, 2003.
[9] G. Erkan and D. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, December 2004.
[10] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 1989.
[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.
[12] A. Klautau. Speech recognition based on discriminative classifiers. In Proceedings of the Simposio Brasileiro de Telecomunicacoes (SBT), Rio de Janeiro, Brazil, 2003.
[13] M. Lan, C. Tan, H. Low, and S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proceedings of the 14th International Conference on World Wide Web, pages 1032–1033, 2005.
[14] E. Leopold and J. Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46:423–444, 2002. Kluwer Academic Publishers.
[15] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, 2004.
[16] A. Moschitti. A study on optimal parameter tuning for the Rocchio text classifier. In Proceedings of the European Conference on Information Retrieval, Pisa, Italy, 2003.
[17] K. Papineni. Why inverse document frequency? In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
[18] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, pages 275–281, 1998.
[19] M. Radovanovic and M. Ivanovic. Document representations for classification of short web-page descriptions. In DaWaK, pages 544–553, 2006.
[20] S. Robertson and K. Sparck Jones. Simple, proven approaches to text retrieval. Technical report, 1997.
[21] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[22] M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 335–338, 1996.
[23] K. Schneider. A new feature selection score for multinomial Naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[24] H. Schutze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, 1995.
[25] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[26] S. Tan, X. Cheng, M. M. Ghanem, B. Wang, and H. Xu. A novel refinement approach for text categorization. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 469–476, Bremen, Germany, 2005.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[28] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[29] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.