Newspaper Article Classifier

Document Classification for Newspaper Articles
Dennis Ramdass & Shreyes Seshasai
6.863 Final Project
Spring 2009
May 18, 2009

1 Introduction

In many real-world scenarios, the ability to automatically classify documents into a fixed set of categories is highly desirable. Common scenarios include classifying large collections of unclassified archival documents such as newspaper articles, legal records, and academic papers. For example, newspaper articles can be classified as 'features', 'sports', or 'news'. Other scenarios involve classifying documents as they are created. Examples include classifying movie reviews as 'positive' or 'negative', or classifying online blog entries using a fixed set of labels.
Natural language processing offers powerful techniques for automatically classifying documents. These techniques are predicated on the hypothesis that documents in different categories distinguish themselves by features of the natural language contained in each document. Salient features for document classification may include word structure, word frequency, and natural language structure in each document.
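As a rough illustration of the word-frequency idea, the sketch below maps a document to the relative frequency of each word it contains. It is only a minimal example under our own assumptions: the function name `word_frequency_features` and the simple regular-expression tokenizer are hypothetical choices for illustration, not the feature extractor used in the project.

```python
import re
from collections import Counter


def word_frequency_features(text):
    """Map a document to the relative frequency of each lowercase word token.

    Hypothetical helper for illustration; the tokenizer is a deliberately
    simple regular expression, not a full natural language tokenizer.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty document has no features
    return {word: n / total for word, n in counts.items()}


features = word_frequency_features("The team won. The team celebrated.")
```

A classifier could then compare such frequency maps across categories, on the hypothesis that, for example, sports articles use words like "team" more often than opinion pieces do.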
Our project looks specifically at the task of automatically classifying newspaper articles from the MIT newspaper The Tech. The Tech maintains an archive of a large number of articles that require classification into specific sections (News, Opinion, Sports, etc.). Our project aims to investigate and implement techniques that can perform automatic article classification for this purpose.
At our disposal is a large archive of already classified documents, so we are able to make use of supervised classification techniques. We randomly split this archive of classified documents into training and testing groups for our classification systems (hereafter referred to simply as classifiers). This project experiments with different natural language feature sets as well as different statistical techniques using these feature sets and…
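The random split of an archive into training and testing groups, as described in the introduction, might look like the following sketch. The function name `split_train_test`, the 20% test fraction, the fixed seed, and the toy archive are all assumptions made for illustration; the text does not state the proportions or procedure actually used.

```python
import random


def split_train_test(labeled_docs, test_fraction=0.2, seed=0):
    """Randomly split (text, label) pairs into training and testing groups.

    test_fraction and seed are assumed values, not taken from the project.
    """
    shuffled = list(labeled_docs)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for an archive of already classified articles.
archive = [("article text %d" % i, "News" if i % 2 else "Sports")
           for i in range(10)]
train, test = split_train_test(archive)
```

Fixing the seed keeps the split reproducible, so that different feature sets and classifiers can be compared on the same testing group.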



