With Better Sharing, Neuroscientists Could Wring More Insights From Raw Data

Many neuroscience data sets have far more information than any laboratory can fully analyze, and researchers must do more to ensure their data are accessible to others, experts said at AAAS.

As researchers strive to unlock the secrets of the brain, they must do more to ensure that the huge amounts of data they generate are not locked up and inaccessible to other scientists who might glean additional insights from the data, speakers said at a 21 March symposium at AAAS.

Broader sharing of neuroscience data faces some technological hurdles, they said, but one of the major challenges is simply convincing researchers of the value of sharing the raw data with competitors in a timely fashion.

"The first barrier is sociological," said Yuan Liu, chief of the Office of International Activities at the National Institute of Neurological Disorders and Stroke. A researcher may spend many years obtaining data, she said, and will ask, "Why should I share it with others?" There also are concerns the data may be misused or misinterpreted, she said.

But many neuroscience data sets have far more information than any one researcher or laboratory can fully analyze, Liu said. Other researchers may explore questions that were not anticipated in the original research.

Yuan Liu | AAAS/Earl Lane

Liu spoke at a day-long symposium on "Neuroscience and Data Sharing" sponsored by the AAAS Science & Technology Policy Fellowships Program and the Potomac Institute for Policy Studies. The session was the first of the Institute's Neuroscience Policy Series.

Data sharing has become increasingly urgent in neuroscience, with its diversity of research tools and the vast amounts of information they generate, both on the whole brain and on the neural circuits connecting brain regions. In studying the functional behavior of the brain, from control of muscles to the formation of memories, scientists are using tools such as electron microscopy, recordings of electrical signals from individual brain cells, and imaging of brain structures and processes using functional magnetic resonance imaging (fMRI), positron emission tomography (PET), and high-resolution optical imaging.

There have been a number of national and international initiatives to collect, store, and share large neuroscience data sets, but the challenges remain significant. "It always sounds easy," said Alan I. Leshner, AAAS chief executive officer and executive publisher of Science. "It seems very straightforward to me: It ought to be that any data, once published, ought to be made freely available" to scientific colleagues. But besides finding better ways to generate and manage large data sets, scientists seeking to share neuroscience data more widely also must contend with "people's phenomenal desire to make sure that they own their data exclusively long enough to milk it" for publishable findings, Leshner said. That factor makes sharing of data "a very tough and complex issue," he said.

Still, there have been success stories, and symposium participants said there is a general recognition that broader sharing of neuroscience data makes sense, both to ensure the reproducibility of published results and to encourage new lines of inquiry.

Marcia McNutt | AAAS/Earl Lane

"Availability of data is a cornerstone for reproducibility" of findings, said Marcia McNutt, editor in chief of Science. The journal, published by AAAS, requires that all of the data necessary to understand, assess and extend the conclusions of a published manuscript be made available. That does not mean, she noted, that Science wants to establish a repository for raw data in neuroscience and other fields.

"We don't want to impoverish the public repositories" that already are available, she said. Moreover, McNutt said, Science and other journals do not have the resources to maintain and manage huge data bases of information. She cited an example from her own field of geophysics, where a recently published paper on forest loss looked at 40 years of data from the Landsat remote sensing satellites. It would be "ludicrous," McNutt said, for a journal to seek archival control of millions of Landsat images.

While Science is a willing partner in any effort to increase sharing of neuroscience data, McNutt said, it will be up to the neuroscience community to decide what goals are feasible and how best to pursue them.  

"The most obvious challenge, of course, is the sheer volume of data that could be created and the difficulties in managing, organizing and providing access to that data," said Jerry Sheehan, assistant director for policy development at the National Library of Medicine, an arm of the National Institutes of Health. He also noted the great variety of data and the challenge of piecing it together across disciplinary boundaries.

Kristin Branson, a computer scientist and group leader at the Howard Hughes Medical Institute's Janelia Farm Research Campus in Ashburn, Va., provided an example of how much data can be generated in a basic study of behaviors in fruit flies. She and her colleagues have been using computers to automatically track the behavior of male and female fruit flies in a bowl as different subsets of their neurons are activated. The raw data consists of 20,000 videos representing 500 terabytes of data. (By one estimate, just 10 terabytes could hold the printed collection of the Library of Congress.)

"We don't really want to share that because it's not something that anybody else could make any sense of," Branson said regarding the raw videos. Her team has been compressing the data and focusing on about 20 important statistics on variables such as the speed, position and behavior of the flies under study. Branson said she is happy to share her data with others after first publishing her own results, and her team has made its machine learning software (the Janelia Automatic Animal Behavior Annotator) freely available for download. Branson said there remain "limited benefits for one's career" in sharing lots of data, a point that other speakers made as well.

Jerry Sheehan | AAAS/Earl Lane

Still, Sheehan said neuroscience already is one of the leaders in data sharing and management, with such resources as the NIH-funded National Database for Autism Research; an NIH-Defense Department sponsored data base on traumatic brain injury; the NIH-funded Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC), which helps researchers to develop, share and collaborate on software tools for doing functional and structural imaging studies of the brain; and the Neuroscience Information Framework, an NIH initiative that makes neuroscience resources - data, materials, and tools - accessible via any computer connected to the Internet. There also is the Stockholm-based International Neuroinformatics Coordinating Facility, a nonprofit organization with 17 member countries, that develops standards for neuroscience data sharing, analysis, modelling and simulation.

Michael Huerta, associate director of the National Library of Medicine, said it is important that data sets from diverse sources be made broadly available and their use facilitated by data-related standards. NIH has undertaken a Big Data to Knowledge (BD2K) initiative to make it easier for biomedical scientists to access, manage, analyze and integrate data sets from many sources, Huerta said. That includes making sure that information about research data - such as where the data reside and how to access them - is discoverable, citable and linked to the scientific literature.

NIH now requires a recipient of a grant in excess of $500,000 in any year to develop a data sharing plan that is reviewed by the study section that assesses the grant proposal, Liu said. The data plan is not considered, however, when the panel assigns a score to the proposal. NIH program staffers monitor the data sharing practices of successful grant recipients, but Liu said there is a lack of consistency in such monitoring. While the current NIH policy on data sharing is relatively weak, she said, the agency is in the process of developing stronger policies. Liu suggested that funding agencies include a researcher's track record for data sharing as one of the criteria for merit awards.

From left, Jennifer Buss of the Potomac Institute for Policy Studies, Rita Colwell, Michael Huerta, Justin Sanchez, Kristin Branson | AAAS/Earl Lane

The Defense Advanced Research Projects Agency (DARPA), another agency that sponsors cutting-edge neuroscience research, also is taking steps to bolster data sharing. Justin Sanchez, a DARPA program manager who oversees neuroscience grants, said the agency's announcements "have specific criteria for policies on data sharing."

"The collaborative spirit that data sharing can engender offers a huge opportunity to really take the reins and do something different," Sanchez said. He added, "We're moving into an era of science where the challenges have expanded beyond and outside the scope of the individual investigator."

Rita Colwell, a former director of the National Science Foundation who is now a Distinguished University Professor at both the University of Maryland and the Johns Hopkins University Bloomberg School of Public Health, took the long view on data sharing. "I, too, can remember as a very young investigator being terribly concerned" about being "scooped" by sharing data, Colwell said. "But I have learned that when you collaborate, you get a whole lot more out of the data and the publications that come out of it."