Working with big data sets is challenging, but the potential benefits for national security are significant, scientists and national security experts agreed at a joint event organized by AAAS and the FBI.
More than 300 people attended "Big Data, Life Sciences and National Security" on 1 April at the Renaissance D.C. Downtown Hotel. The daylong event was hosted by the AAAS Center for Science, Technology and Security Policy (CSTSP) and the Biological Countermeasures Unit of the FBI Weapons of Mass Destruction Directorate (WMDD).
The event was designed to encourage discussion from different perspectives about the applications and risks of big data in the life sciences, said Kavita Berger, AAAS CSTSP associate director. AAAS and the FBI have organized similar events together since 2009.
"Although the term 'big data' in the life sciences is not well-defined, it refers to the process of gaining useful, actionable information from multiple large datasets that are not uniform, static and easily analyzed using conventional statistical and computational methods," she said.
"This event and overall project is very timely as scientists push the envelope on collection, analysis, and application of big data," Berger said. "Several U.S. government agencies have big data initiatives, and efforts to make data collected by such agencies openly available are already underway."
In some ways, the challenges related to big data are not new to scientists, said Alan Leshner, AAAS chief executive officer and executive publisher of the journal Science. "Virtually all of science has been accustomed to large data sets, but they seem to be getting larger," he said. "It's more urgent that we figure out the most effective ways not only to analyze them, but to share them and share them widely."
Scientists working with big data sometimes face government agencies and nongovernmental organizations that are unwilling to share their data, said Peter Speyer, chief data and technology officer at the Institute for Health Metrics and Evaluation at the University of Washington. Even when the data sources cooperate, their websites change frequently and the data is often posted in PDF files, making it difficult for scientists to use. "It takes a massive amount of effort just to find, access and make those data usable for research," he said.
Despite the challenges, the potential benefits of working with big data are significant. Scientists need to turn to technology for advanced data analysis because of the growing number of research papers published annually, said Scott Spangler, principal data scientist at the IBM Almaden Research Center. "There are literally millions of articles going unread today in basic science," he said. "This is going to become a problem for society in general as we collect more information in science."
For example, a graduate student cannot read all of the 60,000 published research papers on p53, a protein with major implications for cancer research, but a computer system like IBM's Watson can sort through that research and identify the most relevant information, according to Spangler.
Drugs do not easily affect p53, but they do affect other proteins called kinases, Spangler said, which has prompted researchers to look for kinases that interact with p53. Previously, scientists identified about one such kinase every year, but over a few weeks Watson's analysis identified five possible interacting kinases that could be potential drug targets. Researchers at Baylor College of Medicine are currently running experiments to validate Watson's findings. Drug companies could potentially use the same tools to identify new targets for treating many diseases and conditions.
Big data is also useful for identifying trends. Information about a potential outbreak typically travels from a physician to a county-level health office, then to a state office and ultimately to the Centers for Disease Control and Prevention (CDC). HealthMap.org strives to discover outbreaks more quickly by drawing information from a variety of sources.
HealthMap researchers draw information from Twitter, Facebook and news reports, decreasing the amount of time that passes from the start of an outbreak to its detection, according to David McIver, Harvard Medical School and Boston Children's Hospital research fellow in informatics. Recently, by finding an article in a local French-language newspaper in Guinea, HealthMap identified an outbreak of Ebola nine days before the World Health Organization (WHO) did.
"Unless people are actively looking at these in newspapers and different languages, these things will not be picked up," McIver said.
While HealthMap identifies early-stage outbreaks, the Open Source Indicators (OSI) program at the Intelligence Advanced Research Projects Activity (IARPA), a center of the Office of the Director of National Intelligence, uses publicly available data to predict significant social events, including outbreaks. OSI looks at social media and at what people are looking for on search engines and Wikipedia, as well as dinner reservation cancellations and crowding in hospital parking lots.
"The goal is to take a bunch of very noisy, weak signals and combine them in a way that gives a stronger signal," said Jason Matheny, OSI program manager. When OSI handily predicted outbreaks of the flu more quickly than Google Flu Trends, researchers developed a corrected model of Google Flu Trends, which OSI beat by two weeks.
Another IARPA program, Foresight and Understanding from Scientific Exposition (FUSE), examines indicators in the scientific and patent literature to predict technological breakthroughs. "This is an important problem for those who are trying to track the security implications of technology or the economic implications of technology," Matheny said. "Hundreds of thousands of documents [are] being added each month, so keeping up with this really does require a machine learning approach."
Forecasting Science and Technology (ForeST) takes FUSE a step further and enables individuals to predict when specific technological milestones will occur, such as when the cost of whole genome sequencing will drop to $100. The project includes an S&T forecasting tournament called SciCast, which is led by George Mason University and IARPA with assistance from AAAS CSTSP.
"We have hundreds of questions posted and thousands of participants," Matheny said. "For those in the audience who have a competitive streak or are interested in seeing what the science and technology crowd thinks about a range of questions, I encourage you to participate."
AAAS CSTSP and FBI WMDD have established a working group of individuals from government, academia and the private sector with expertise in the life sciences, computer sciences and health who will continue to research key topics identified at the event, Berger said.
"The partnership between the FBI and AAAS has been a very successful one in tackling biosecurity," said Edward You, Supervisory Special Agent, FBI WMDD. "This is an opportunity to basically widen that lens and try also to look at some of the really rapidly advancing and growing capabilities and capacities."
"I really appreciate the partnership that we have with AAAS because they have helped us reach the research community," said Gabe Sampoll-Ramirez, chief scientist of the FBI WMDD. "We care about this relationship because we see [the Association] as a partner in preventing bioterrorism."