Dennis Ramdass & Shreyes Seshasai
6.863 Final Project
Spring 2009
May 18, 2009
1 Introduction
In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying a large number of unclassified archival documents such as newspaper articles, legal records, and academic papers. For example, newspaper articles can be classified as 'features', 'sports', or 'news'. Other scenarios involve classifying documents as they are created. Examples include classifying movie review articles as 'positive' or 'negative' reviews, or labeling blog entries with a fixed set of labels.
Natural language processing offers powerful techniques for automatically classifying documents. These techniques are predicated on the hypothesis that documents in different categories distinguish themselves by features of the natural language contained in each document. Salient features for document classification may include word structure, word frequency, and natural language structure in each document.
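To make the word-frequency features concrete, the following is a minimal sketch in Python. The regular-expression tokenizer and the Counter-based bag-of-words representation are illustrative assumptions, not the exact feature extraction used in this project.

    import re
    from collections import Counter

    def word_frequency_features(text):
        # Lowercase and split on runs of letters -- a naive tokenizer,
        # assumed here purely for illustration.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Represent the document as a bag-of-words frequency vector.
        return Counter(tokens)

    # Example: the three most frequent tokens in a toy sports document.
    doc = "The Tech beat Harvard in the final game. Fans of The Tech cheered."
    print(word_frequency_features(doc).most_common(3))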
Our project looks specifically at the task of automatically classifying newspaper articles from the MIT newspaper The Tech. The Tech has archives of a large number of articles which require classification into specific sections (News, Opinion, Sports, etc.). Our project investigates and implements techniques for performing this automatic article classification.
At our disposal is a large archive of already-classified documents, so we are able to make use of supervised classification techniques. We randomly split this archive of classified documents into training and testing groups for our classification systems (hereafter referred to simply as classifiers). This project experiments with different natural language feature sets as well as different statistical techniques using these feature sets.
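The sketch below shows one way such a random split might be implemented; the 80/20 ratio and the fixed random seed are illustrative assumptions rather than the exact parameters of our experiments.

    import random

    def split_corpus(labeled_docs, train_fraction=0.8, seed=42):
        # labeled_docs: list of (document_text, section_label) pairs.
        docs = list(labeled_docs)
        # Shuffle with a fixed seed so the split is reproducible;
        # the seed and the 80/20 ratio are assumptions for illustration.
        random.Random(seed).shuffle(docs)
        cutoff = int(train_fraction * len(docs))
        # First group trains each classifier; the held-out second
        # group measures its accuracy.
        return docs[:cutoff], docs[cutoff:]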