06 Dec 2015

50 years of Data Science

David Donoho

learning from data, or ‘data analysis’.

On Tuesday September 8, 2015, as I was preparing these remarks, the University of Michigan announced a $100 Million “Data Science Initiative” (DSI), ultimately hiring 35 new faculty.

“This coupling of scientific discovery and practice involves the collection, management, processing, analysis, visualization, and interpretation of vast amounts of heterogeneous data associated with a diverse array of scientific, translational, and inter-disciplinary applications.”

This announcement is not taking place in a vacuum.

2 Data Science ‘versus’ Statistics

applied statisticians

Statisticians see administrators touting, as new, activities that statisticians have already been pursuing daily, for their entire careers; and which were considered standard already when those statisticians were back in graduate school.

The identified leaders of this initiative are faculty from the Electrical Engineering and Computer Science Department (Al Hero) and the School of Medicine (Brian Athey).

    “Data Scientist” means a professional who uses scientific methods to liberate and create meaning from raw data.

To a statistician, this sounds an awful lot like what applied statisticians do: use methodology to make inferences from data. Continuing:

    “Statistics” means the practice or science of collecting and analyzing numerical data in large quantities.

A grand debate: is data science just a ‘rebranding’ of statistics?

2.1 The ‘Big Data’ Meme

History
The very term ‘statistics’ was coined at the beginning of modern efforts to compile census data, i.e. comprehensive data about all inhabitants of a country, for example France or the United States. Census data are roughly the scale of today’s big data, but they have been around for more than 200 years!

2.2 The ‘Skills’ Meme

What are those skills? Many would cite mastery of Hadoop, an open-source implementation of the MapReduce paradigm for processing datasets distributed across a cluster of computers. Consult the standard reference Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition, by Tom White.
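The MapReduce pattern underlying Hadoop can be illustrated without any cluster at all. The following is a minimal, self-contained sketch of the canonical word-count example in plain Python (the corpus and function names are hypothetical; Hadoop itself shards the data across machines and runs the map and reduce steps in parallel):

```python
from collections import defaultdict
from itertools import chain

# A toy in-memory corpus; a real Hadoop job would read sharded files from HDFS.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

def map_phase(doc):
    """The 'map' step: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Group values by key -- the work the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """The 'reduce' step: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in documents)))
# counts["the"] == 3, counts["fox"] == 2, counts["dog"] == 1
```

The appeal of the framework is that each phase is embarrassingly parallel: map runs independently per document, and reduce runs independently per key.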

5 Breiman’s ‘Two Cultures’, 2001

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables … There are two goals in analyzing the data:

The ‘Generative Modeling’ culture seeks to develop stochastic models which fit the data, and then make inferences about the data-generating mechanism based on the structure of those models.

The ‘Predictive Modeling’ culture prioritizes prediction; Breiman estimated it to encompass only 2% of academic statisticians (himself included), but also many computer scientists and, as the discussion of his article shows, important industrial statisticians.
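The contrast between the two cultures can be made concrete on a single dataset. In the sketch below (all data synthetic, the setup hypothetical), both cultures fit the same simple linear rule, but they ask different questions of it: the generative culture interprets the estimated coefficient as a statement about the data-generating mechanism, while the predictive culture cares only about error on held-out data:

```python
import random

random.seed(0)

# Synthetic black box: inputs x produce responses y = 2x + Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in [i / 10 for i in range(100)]]
train, test = data[::2], data[1::2]  # split into training and held-out halves

# Generative culture: posit the stochastic model y = beta * x + noise,
# estimate beta by least squares, and interpret beta as the mechanism's slope.
beta = sum(x * y for x, y in train) / sum(x * x for x, y in train)

def rule(x):
    """The fitted prediction rule implied by the model."""
    return beta * x

# Predictive culture: judge the rule purely by its held-out mean squared error.
mse = sum((rule(x) - y) ** 2 for x, y in test) / len(test)
```

With this data, `beta` lands near the true slope of 2 and the held-out MSE near the noise variance; the two cultures differ not in the arithmetic but in which of these two numbers they treat as the point of the exercise.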

6 The Predictive Culture’s Secret Sauce

6.1 The Common Task Framework

To my mind, the crucial but unappreciated methodology driving predictive modeling’s success is what computational linguist Mark Liberman (Liberman, 2009) has called the Common Task Framework (CTF). An instance of the CTF has these ingredients:

All the competitors share the common task of training a prediction rule which will receive a good score; hence the phrase common task framework.
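The mechanics of a CTF instance fit in a few lines. Here is a minimal sketch (all names, data, and rules hypothetical): a referee withholds the test labels, competitors submit prediction rules trained however they like on the shared data, and everyone is ranked by the same objective score:

```python
# Shared training data, public to all competitors: (feature, label) pairs.
train = [(0, 0), (1, 1), (2, 1), (3, 1)]

# Test data held privately by the referee; competitors never see these labels.
hidden_test = [(0, 0), (1, 1), (2, 1), (3, 1)]

def referee_score(rule):
    """Objective scoring: fraction of hidden test labels predicted correctly."""
    correct = sum(rule(x) == y for x, y in hidden_test)
    return correct / len(hidden_test)

# Two competing prediction rules, trained independently on the shared data.
def always_one(x):
    return 1                      # a trivial baseline

def threshold(x):
    return int(x >= 1)            # a rule fit to the training data

# The referee publishes a leaderboard ranked by the common score.
leaderboard = sorted(
    [("always_one", referee_score(always_one)),
     ("threshold", referee_score(threshold))],
    key=lambda entry: entry[1], reverse=True)
```

The essential point is the separation of roles: because the score is computed by the referee on data no competitor controls, incremental improvements on the leaderboard are credible, which is what drives the steady progress the CTF is known for.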

6.3 The Secret Sauce

It is no exaggeration to say that the combination of the Predictive Modeling culture with the CTF is the ‘secret sauce’ of machine learning.