Natalie Glance nglance@intelliseek.com
Matthew Hurst mhurst@intelliseek.com
Kamal Nigam knigam@intelliseek.com
Matthew Siegler msiegler@intelliseek.com
Robert Stockton rstockton@intelliseek.com
Takashi Tomokiyo ttomokiyo@intelliseek.com

Intelliseek Applied Research Center
Pittsburgh, PA 15217

ABSTRACT
Weblogs and message boards provide online forums for discussion that record the voice of the public. Woven into this mass of discussion is a wide range of opinion and commentary about consumer products. This presents an opportunity for companies to understand and respond to the consumer by analyzing this unsolicited feedback. Given the volume, format and content of the data, the appropriate approach to understand this data is to use large-scale web and text data mining technologies. This paper argues that applications for mining large volumes of textual data for marketing intelligence should provide two key elements: a suite of powerful mining and visualization technologies and an interactive analysis environment which allows for rapid generation and testing of hypotheses. This paper presents such a system that gathers and annotates online discussion relating to consumer products using a wide variety of state-of-the-art techniques, including crawling, wrapping, search, text classification and computational linguistics. Marketing intelligence is derived through an interactive analysis framework uniquely configured to leverage the connectivity and content of annotated online discussion.

Categories and Subject Descriptors: H.3.3: Information Search and Retrieval

General Terms: Algorithms, Experimentation

Keywords: text mining, content systems, computational linguistics, machine learning, information retrieval
from online public communications. For example, there are message boards devoted to a specific gaming platform, newsgroups centered around a particular make and model of motorcycle, and