Data Science and Prediction
Vasant Dhar
Professor, Stern School of Business
Director, Center for Digital Economy Research
March 29, 2012
Abstract

The use of the term “Data Science” is becoming increasingly common, along with “Big Data.” What does Data Science mean? Is there something unique about it? What skills should a “data scientist” possess to be productive in the emerging digital age characterized by a deluge of data? What are the implications for business and for scientific inquiry? In this brief monograph I address these questions from a predictive modeling perspective.
Electronic copy available at: http://ssrn.com/abstract=2086734
1. Introduction

The use of the term “Data Science” is becoming increasingly common, along with “Big Data.” What does Data Science mean? Is there something unique about it? What skills should a “data scientist” possess to be productive in the emerging digital age characterized by a deluge of data? What are the implications for scientific inquiry?

The term “Science” implies knowledge gained by systematic study. According to one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.1 Data Science might therefore imply a focus on data and, by extension, on Statistics, which is the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in such inference.

Why then do we need a new term, when Statistics has been around for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term. The short answer is that Data Science differs from Statistics in several ways. First, the raw material, the “data” part of Data Science, is increasingly heterogeneous and unstructured – text, images, and video, often emanating from networks with complex relationships