Random-Walk Term Weighting for Improved Text Classification
Samer Hassan, Rada Mihalcea, and Carmen Banea
Department of Computer Science, University of North Texas
samer@unt.edu, rada@cs.unt.edu, carmenb@unt.edu

Abstract
This paper describes a new approach to estimating term weights in a document, and shows how the new weighting scheme can be used to improve the accuracy of a text classifier. The method uses term co-occurrence as a measure of dependency between word features. A random-walk model is applied to a graph encoding words and their co-occurrence dependencies, resulting in scores that quantify how much a particular word feature contributes to a given context. Experiments performed on three standard classification datasets show that the new random-walk-based approach outperforms the traditional term-frequency approach to feature weighting.
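For concreteness, random-walk scores of the kind referred to above can be computed with the PageRank-style recurrence of [2], applied to text graphs by [15]. This is a sketch of that recurrence rather than a formula quoted from this section; the damping factor d (commonly set to 0.85) is an assumed value, and In(V_i) and Out(V_i) denote the in- and out-neighbors of vertex V_i:

    S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}

On an undirected co-occurrence graph, In(V_i) = Out(V_i), so |Out(V_j)| reduces to the degree of V_j.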

1 Introduction

Term frequency has long been used as a major factor in estimating the probabilistic distribution of features in a document, and it has been employed in a broad spectrum of tasks, including language modeling [18], feature selection [29, 24], and term weighting [13, 20]. The main drawback of the term-frequency method is that it relies on a bag-of-words approach: it assumes feature independence and disregards any dependencies that may exist between words in the text. In other words, it defines a "random choice," where the weight of a term is proportional to the probability of choosing that term at random from the set of terms that constitute the text. Such an approach may be effective for capturing the relevance of a term in a local context, but it fails to account for the global effect that the term's presence exerts on the entire text segment. We argue that the bag-of-words model may not be the best technique for capturing term importance. Instead, given that relations in the text can be preserved by maintaining the structural representation of the text, a method that accounts for these dependencies should yield a better estimate of term importance.
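To illustrate the two weighting schemes being contrasted, the following Python sketch computes both the bag-of-words term-frequency weight and a random-walk weight obtained by iterating the recurrence above over a word co-occurrence graph. This is a minimal sketch, not the authors' implementation: the window size of 2, the damping factor of 0.85, the fixed iteration count, and the sample sentence (adapted from a Reuters-style excerpt) are all assumptions made here for illustration.

    from collections import Counter, defaultdict

    def tf_weights(tokens):
        """Bag-of-words weight: the probability of picking the term at random."""
        counts = Counter(tokens)
        total = len(tokens)
        return {term: count / total for term, count in counts.items()}

    def random_walk_weights(tokens, window=2, d=0.85, iterations=30):
        """Random-walk weight over an undirected word co-occurrence graph."""
        # Link any two distinct terms that appear within `window` positions.
        neighbors = defaultdict(set)
        for i, term in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[j] != term:
                    neighbors[term].add(tokens[j])
                    neighbors[tokens[j]].add(term)
        # PageRank-style iteration: S(Vi) = (1 - d) + d * sum_j S(Vj) / deg(Vj).
        scores = {term: 1.0 for term in neighbors}
        for _ in range(iterations):
            scores = {term: (1 - d) + d * sum(scores[n] / len(neighbors[n])
                                              for n in neighbors[term])
                      for term in neighbors}
        return scores

    if __name__ == "__main__":
        tokens = ("london based sugar operator kaines ltd confirmed it sold "
                  "two cargoes of white sugar").split()
        print(sorted(tf_weights(tokens).items(), key=lambda kv: -kv[1])[:5])
        print(sorted(random_walk_weights(tokens).items(), key=lambda kv: -kv[1])[:5])

In this sketch, a term that co-occurs with many distinct terms accumulates a higher score than its raw frequency alone would suggest, which is the intuition behind replacing term frequency with random-walk scores.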



References

[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An evaluation of Naive Bayesian anti-spam filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 1998.
[3] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proceedings of the Text REtrieval Conference, 1994.
[4] C. Liao, S. Alpha, and P. Dixon. Feature preparation in text categorization. Oracle Corporation, 2002.
[5] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001.
[6] P. Dai, U. Iurgel, and G. Rigoll. A novel feature combination approach for spoken document classification with support vector machines, 2003.
[7] F. Debole and F. Sebastiani. Supervised term weighting for automated text categorization. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 784–788, New York, NY, USA, 2003. ACM Press.
[8] B. Dom, I. Eiron, A. Cozzi, and Y. Shang. Graph-based ranking algorithms for e-mail expertise analysis. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, California, 2003.
[9] G. Erkan and D. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, December 2004.
[10] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 1989.
[11] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.
[12] A. Klautau. Speech recognition based on discriminative classifiers. In Proceedings of the Simpósio Brasileiro de Telecomunicações (SBT), Rio de Janeiro, Brazil, 2003.
[13] M. Lan, C. Tan, H. Low, and S. Sung. A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In Proceedings of the 14th International Conference on World Wide Web, pages 1032–1033, 2005.
[14] E. Leopold and J. Kindermann. Text categorization with support vector machines: How to represent texts in input space? Machine Learning, 46:423–444, 2002. Kluwer Academic Publishers, Hingham, MA, USA.
[15] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, 2004.
[16] A. Moschitti. A study on optimal parameter tuning for Rocchio text classifier. In Proceedings of the European Conference on Information Retrieval, Pisa, Italy, 2003.
[17] K. Papineni. Why inverse document frequency? In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
[18] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Research and Development in Information Retrieval, pages 275–281, 1998.
[19] M. Radovanovic and M. Ivanovic. Document representations for classification of short web-page descriptions. In DaWaK, pages 544–553, 2006.
[20] S. Robertson and K. Sparck Jones. Simple, proven approaches to text retrieval. Technical report, 1997.
[21] S. Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503–520, 2004.
[22] M. Sahami. Learning limited dependence Bayesian classifiers. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 335–338, 1996.
[23] K. Schneider. A new feature selection score for multinomial Naive Bayes text classification based on KL-divergence. In The Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[24] H. Schütze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, 1995.
[25] K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[26] S. Tan, X. Cheng, M. M. Ghanem, B. Wang, and H. Xu. A novel refinement approach for text categorization. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 469–476, Bremen, Germany, 2005.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[28] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
[29] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, Nashville, US, 1997.
