# The Big Data Blog, Part II: Daniela Witten

Leading up to the AAAS Center for Science, Technology, and Security Policy (CSTSP) and Federal Bureau of Investigation’s public event on April 1, “Big Data, Life Sciences, and National Security,” CSTSP is bringing you a series of interviews with leading experts on Big Data. In our previous post, we talked with Subha Madhavan, Director of the Innovation Center for Biomedical Informatics at the Georgetown University Medical Center, about Big Data in the biomedical field.

Today’s expert is Daniela Witten, Assistant Professor of Biostatistics at the University of Washington. Daniela, who was featured three times in Forbes’ 30 Under 30, works on statistical machine learning to find ways of using Big Data to solve problems in genomics and biomedical sciences.

**CSTSP:** How is Big Data defined?

**Daniela:** Big Data is a catchphrase for the *massive amounts of data* being collected in a variety of fields, like astrophysics, genomics, sociology, and marketing. Big Data has gotten a lot of attention recently as people have become increasingly aware of this data explosion, and of the potential to perform statistical analyses on these data in order to answer really interesting and previously unanswerable questions.

**CSTSP:** What, in your opinion, are the most innovative applications of Big Data?

**Daniela:** I’m particularly interested in Big Data in biomedical research. For instance, how can we use Big Data to determine the best way to treat a cancer patient, or to identify brain regions that are affected in certain diseases? In that context, the data are not only big, but also “high-dimensional”: they are characterized by having many more features or measurements (e.g. brain regions) than observations (e.g. patients with a particular disease). This high dimensionality leads to a lot of *statistical challenges*.

**CSTSP:** You mentioned data that is “high-dimensional.” Can you expand a little more on what that means, what it does, and how it is applied?

**Daniela:** When we talk about "*high-dimensional*" data, we mean data in which there are *more measured variables*, or in the language of statisticians, "features," *than observations on which those variables are measured*, such as patients, rats, or whatever else we are taking measurements on.

For instance, suppose we are interested in predicting a kid's height on the basis of his/her shoe size, and his/her father's height. We need to collect some data in order to build this predictive model. We might collect data from 200 kids; for each kid we measure the kid's shoe size, father's height, kid's height. Then we have three *measured variables* and 200 *observations*. This is "*low-dimensional*" data, to which classical statistical methods, like linear regression, are applicable.

Now instead, suppose we want to predict a kid's height using his/her DNA sequence: that is, *3 billion variables*, because a person's DNA sequence is 3 billion basepairs long. If we collect data from 200 kids, and measure height and DNA sequence for each kid, then we now have *billions of measured variables* and 200 observations. This is "*high-dimensional*" data -- and it leads to a lot of statistical challenges! Classical statistical methods like linear regression cannot be applied, and there is a great danger of *overfitting the data*.
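One way to see why linear regression breaks down here is the following short sketch in standard textbook notation. The symbols $n$ (observations), $p$ (features), $X$, $y$, and $\beta$ are not from the interview and are introduced only for illustration.

```latex
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert_2^2,
\qquad X \in \mathbb{R}^{n \times p},\; y \in \mathbb{R}^n .

% Low-dimensional case (p \le n, X of full column rank): unique solution
\hat{\beta} = (X^\top X)^{-1} X^\top y .

% High-dimensional case (p > n):
\operatorname{rank}(X^\top X) = \operatorname{rank}(X) \le n < p
\;\Longrightarrow\; X^\top X \text{ is singular, so } (X^\top X)^{-1} \text{ does not exist.}
```

In the height example, $n = 200$ and $p$ is in the billions, so the least-squares problem has no unique solution. When $X$ has rank $n$, infinitely many coefficient vectors fit the 200 training heights exactly, which is the setting in which overfitting becomes a danger.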

**CSTSP:** What does “overfitting the data” refer to?

**Daniela:** "Overfitting" refers to a model that performs well on the data for which it was developed but performs poorly on new data not used in model development. In the example of predicting kids' heights, if you develop a model that does GREAT on the 200 kids used to fit the model, but then does poorly at predicting the heights of the next 200 kids who walk into your office, then you have overfit the data!
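The effect described above can be reproduced with a small simulation. This is a minimal sketch, not anything from the interview: the data are synthetic random noise, the sizes (200 observations, 1,000 features) are arbitrary assumptions, and the function name `train_and_test_error` is hypothetical. Because the "heights" are unrelated to the features, any apparent fit on the first 200 kids is pure overfitting.

```python
import numpy as np

def train_and_test_error(n_obs=200, n_features=1000, seed=0):
    """Fit least squares to pure noise and return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    # Features and responses are independent noise: there is no signal to find.
    X_train = rng.normal(size=(n_obs, n_features))
    y_train = rng.normal(size=n_obs)
    # A second batch of 200 "kids" that was not used to fit the model.
    X_test = rng.normal(size=(n_obs, n_features))
    y_test = rng.normal(size=n_obs)

    # With more features than observations, lstsq returns the minimum-norm
    # solution, which reproduces the training responses almost exactly.
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    train_mse = float(np.mean((X_train @ beta - y_train) ** 2))
    test_mse = float(np.mean((X_test @ beta - y_test) ** 2))
    return train_mse, test_mse

train_mse, test_mse = train_and_test_error()
print(f"train MSE: {train_mse:.3g}   test MSE: {test_mse:.3g}")
```

The training error should be essentially zero while the test error stays large, which is the gap between "does GREAT on the 200 kids used to fit the model" and "does poorly on the next 200."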

**CSTSP:** In what ways do you think the Big Data field will change or grow in the near future?

**Daniela:** For the most part, the insights that are being drawn from Big Data right now are just the *low-hanging fruit*. This is because the statistical techniques needed to delve more deeply into this type of data are lagging a step behind.

There are two reasons that *we need new statistical techniques* to make sense of Big Data: (1) even simple scientific questions cannot be answered on the basis of Big Data using existing statistical techniques; and (2) many of the scientific questions that we wish to answer using Big Data are not simple.

Classical statistical techniques are intended for the setting in which we have a large number of observations, for instance, patients enrolled in a study, and a small number of features or measurements, for instance, height, weight, and blood pressure, per observation. But in the context of Big Data, we are often faced with “*high-dimensional*” situations where there are many more features than observations.

For instance, if we perform brain imaging on cancer patients, we might have hundreds of thousands, or even millions, of features (brain regions for which measurements are taken), but only hundreds of observations (cancer patients). So many classical statistical techniques, such as linear regression, cannot be applied, and there is a need for new statistical methods that are well-suited for this “high-dimensional” scenario. This has been an active area of statistical research in recent years, but *a lot more work is needed* in order for valid statistical conclusions to be reliably drawn based on this type of Big Data.

Furthermore, in gleaning insights from Big Data, we face an additional problem: many of the scientific questions that we wish to answer on the basis of Big Data have not yet been well-studied from a *statistical perspective*, because those questions simply didn’t exist for old-fashioned "Small Data"!

*Collaborative filtering* is one example of a statistical method that has been newly developed in the context of Big Data, in order to answer a question that didn't arise with Small Data. Collaborative filtering systems are used by companies like Amazon to suggest to a customer items that he or she might want to purchase, based on his or her past purchase history as well as purchases made by other customers.

These systems have potential *applications in medicine*. For instance, based on a patient's electronic medical record (EMR) as well as the EMRs of many other patients, can we identify diseases or conditions for which that patient is at risk? It is important to further develop and refine collaborative filtering techniques before we apply them in a clinical setting where they will play a role in patient care.
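To make the idea concrete, here is a deliberately tiny sketch of user-based collaborative filtering. It is an illustration under stated assumptions, not how Amazon or any clinical system works: the customers and items are made up, the function name `recommend` is hypothetical, and similarity is simply the number of items two customers share. Replacing customers with patients and items with recorded conditions gives the flavor of the medical application, with all the caveats mentioned above.

```python
from collections import Counter

def recommend(target, purchases, top_n=2):
    """Suggest items for `target` based on customers with overlapping purchases."""
    owned = purchases[target]
    scores = Counter()
    for customer, items in purchases.items():
        if customer == target:
            continue
        overlap = len(owned & items)  # similarity = number of shared items
        if overlap == 0:
            continue  # customers with nothing in common contribute nothing
        for item in items - owned:
            scores[item] += overlap  # items from similar customers score higher
    # Highest score first; ties broken alphabetically for reproducibility.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

# Made-up purchase histories, for illustration only.
purchases = {
    "ana": {"tent", "stove", "lantern"},
    "ben": {"tent", "stove", "sleeping bag"},
    "chris": {"tent", "lantern", "sleeping bag", "map"},
    "dee": {"novel", "bookmark"},
}

print(recommend("ana", purchases))  # ['sleeping bag', 'map']
```

Real systems work with millions of customers and use more careful similarity measures or matrix factorization; the sketch only shows the core idea of borrowing information from other people's histories.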

**CSTSP:** With regard to the new statistical techniques and challenges, you mentioned “collaborative filtering.” What are some of the other innovative statistical techniques that are being developed to better utilize Big Data?

**Daniela:** I am very interested in developing methods for "*graphical modeling*" on the basis of high-dimensional data. A "graphical model" is a representation of the dependence relationships among a set of variables. Graphical models have applications in many fields, but I am particularly interested in applications in biology. For instance, this is a picture of a graphical model. In the context of biology, you can think of each black dot in that picture as a gene, and each line (red or blue) as representing two genes that have some sort of relationship between them (for instance, maybe one gene acts upon another).

Why is this interesting? There are more than 20,000 genes in humans, and it is believed that genes work together in complex ways in order to perform important biological functions. It is hoped that understanding the relationships among genes will lead to important insights about biology. People would like to discover, for instance, that Gene X affects Genes Y, Z, and W, and that Gene W affects Genes A, B, C, D, E.

Understanding these relationships among genes could also *lead to insights about human diseases*. For instance, maybe in normal tissue, Genes A, B, C, and D interact, but in some particular disease, Gene A doesn't properly interact with Genes B, C, and D. This could lead to a *better understanding of the disease*, as well as possible therapeutic targets.

But drawing these sorts of conclusions based on the available data is really challenging, and *better statistical methods are needed*. Part of my research involves developing such methods. The available data are very high-dimensional: they typically consist of around 20,000 gene measurements, on a few hundred patients.
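As a rough illustration of what estimating a graphical model can look like in the easy, low-dimensional case, the sketch below uses one common formulation (a Gaussian graphical model), in which two variables are joined by an edge when their partial correlation is non-zero. Everything here is an assumption made for illustration: the three simulated "genes," the chain structure A → B → C, the threshold of 0.2, and the function names. It is not the method developed in the research described above.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix computed from the inverse sample covariance."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def estimate_edges(data, names, threshold=0.2):
    """Return pairs of variables whose partial correlation exceeds the threshold."""
    pcor = partial_correlations(data)
    p = len(names)
    return [
        (names[i], names[j])
        for i in range(p)
        for j in range(i + 1, p)
        if abs(pcor[i, j]) > threshold
    ]

# Simulated expression levels: Gene A drives Gene B, and Gene B drives Gene C.
rng = np.random.default_rng(0)
n_samples = 2000
gene_a = rng.normal(size=n_samples)
gene_b = gene_a + rng.normal(size=n_samples)
gene_c = gene_b + rng.normal(size=n_samples)
data = np.column_stack([gene_a, gene_b, gene_c])

edges = estimate_edges(data, ["Gene A", "Gene B", "Gene C"])
print(edges)
```

Genes A and C are correlated in this simulation, but only through Gene B, so the sketch should report edges A–B and B–C and no direct A–C edge. With around 20,000 genes and a few hundred patients, the sample covariance matrix is singular and cannot be inverted this way, which is one reason specialized high-dimensional methods (penalized estimators such as the graphical lasso are one known family) are needed.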

**CSTSP:** What are some of the biggest challenges or risks facing Big Data in the near future? How can groups working with Big Data avoid overlooking these challenges and risks?

**Daniela:** One of the biggest challenges about Big Data is that it is *easy to confuse noise with signal* or, as a statistician would describe it, to *overfit the data*. Unfortunately, it is much easier to accidentally overfit with Big Data than with Small Data; in fact, overfitting is virtually guaranteed unless one vigilantly guards against it! For instance, if overfitting has occurred, then an algorithm that seems to perfectly predict patient response to chemotherapy may give terrible results on new patients not used in developing the algorithm.

Some approaches, like cross-validation and the use of a separate test set, can and should always be used to *reduce the risk of overfitting*. However, these approaches have their limitations -- for instance, they result in a substantial reduction in the sample size, which can be a big problem in the context of biomedical data for which observations (e.g. patients) are quite limited.

To overcome this problem, the statistical community has been working hard in recent years to develop better ways to *measure uncertainty and perform inference* for models applied to Big Data.
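For readers unfamiliar with cross-validation, the following is a minimal sketch of the k-fold version using plain least squares on synthetic noise. The sizes (100 observations, 150 features, 5 folds) and the function name `k_fold_mse` are assumptions chosen for illustration. The data contain no signal, so an honest error estimate should be large even though the model fits its own training data perfectly.

```python
import numpy as np

def k_fold_mse(X, y, n_folds=5, seed=0):
    """Estimate out-of-sample MSE of least squares by k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the observation indices and split them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit only on the training folds ...
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        # ... and score only on the held-out fold.
        errors.append(np.mean((X[test_idx] @ beta - y[test_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 150))  # more features than observations
y = rng.normal(size=100)         # response is unrelated to the features

# Error measured on the same data used for fitting looks perfect.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
in_sample_mse = float(np.mean((X @ beta - y) ** 2))

# Cross-validation scores each observation with a model that never saw it.
cv_mse = k_fold_mse(X, y)
print(f"in-sample MSE: {in_sample_mse:.3g}   cross-validated MSE: {cv_mse:.3g}")
```

Note that each fold's model is trained on only 80 of the 100 observations, which illustrates the sample-size cost mentioned above: every held-out observation is one fewer available for fitting.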