As a follow-up to the Big Data public event on April 1, hosted by the American Association for the Advancement of Science (AAAS) Center for Science, Technology and Security Policy (CSTSP) and the Biological Countermeasures Unit of the WMD Directorate of the Federal Bureau of Investigation (FBI), CSTSP is conducting detailed interviews with speakers to elaborate on scientific and technological developments in the field of Big Data and the potential implications of Big Data and analytics for national and international biological security.
The first interview is with Dr. Ivo Dinov, Associate Professor at the University of Michigan School of Nursing, Director of the Statistics Online Computational Resource, and a core member of the University of Michigan Comprehensive Cancer Center.
CSTSP: How would you define “Big Data,” and how is that different from other definitions of the term that you’ve encountered? What is your perception of the general public’s understanding and acceptance of “Big Data” studies?
Ivo: The research efforts of my group are centered around data-aggregation, compressive-analytics, scientific inference, interactive services for data interrogation, including high-dimensional data visualization, and integration of research and education. Specific applications include neuroimaging-genetics studies of Alzheimer’s disease, predictive modeling of cancer treatment outcomes, and high-throughput data analytics using graphical pipeline workflows.
In our research scope, “Big Data” is defined by its multi-dimensional characteristics – data size, incompleteness, incongruency, complex representation, multiscale nature, and heterogeneity of its sources. These “Big Data” characteristics not only make it distinct from more conventional data, but also different from the industry-driven “4 V’s” definition of Big Data based on “volume”, “velocity”, “variety” and “veracity”. The relation between “Big Data” and “conventional data” is analogous to the dichotomy between “organizational” and “government” level budgets, policies, law or analytics. In general, Big Data is effectively a messy collage of “conventional data” representing different and alternative views of the same complex natural process inspected through a multispectral prism.
Big Data R&D has great potential to democratize the data sciences in general. Advanced degrees (e.g., PhD) are no longer necessary, or sufficient, for young, motivated and dedicated citizen scientists to acquire, develop, study, investigate, produce and share data, tools, services, information and knowledge. One interesting property of Big Data is the inverse relation between its size, complexity and importance, which grow exponentially, and its value, which decays exponentially over time from the point at which the data become static (see Figure 1).
Figure 1, courtesy of Ivo Dinov
CSTSP: What are the most common data inputs/sources that contribute to “Big Data”? What characteristics are associated with these inputs/sources that impact who uses the data and how the data is used? How do you foresee these inputs/sources evolving over the next decade?
Ivo: According to the International Data Corporation (IDC), the volume of data added to the “digital universe,” defined as “all the information created, replicated, and consumed in a single year,” has grown from 130 Billion Gigabytes (GB) in 2005 to 2.8 Trillion GB in 2012 and is projected to reach 40 Trillion GB by 2020. Annually, general consumers have accounted for around three-quarters of that total. Currently, most of the new data (>90%) created by consumers annually represent digital images and motion videos. It’s also true that the “digital footprint” of humanity increases exponentially over time. As an example, in each 2-day period in 2014 we create as much information as the human race generated from the dawn of civilization up until 2003 (cf. Eric Schmidt, Google), about 5 Exabytes of data (5x10^18 Bytes). Even if only 0.01% of these data are interesting and useful, it would amount to 4 Petabytes (4x10^15 Bytes) of data generated weekly.
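For illustration, a quick back-of-the-envelope Python sketch (illustrative arithmetic only, not from the presentation) of the growth rate implied by the IDC figures quoted above:

```python
import math

# Back-of-the-envelope check of the IDC projection quoted above:
# growth from 2.8 trillion GB (2012) to a projected 40 trillion GB (2020).
volume_2012 = 2.8e12   # gigabytes
volume_2020 = 40.0e12  # gigabytes (projected)
years = 2020 - 2012

annual_growth = (volume_2020 / volume_2012) ** (1 / years) - 1
doubling_years = math.log(2) / math.log(1 + annual_growth)

print(f"implied annual growth: {annual_growth:.0%}")        # roughly 39% per year
print(f"implied doubling time: {doubling_years:.1f} years") # roughly every 2.1 years
```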
Obviously, no individual or organization could possibly manage or interpret this data tsunami. Most scientists and researchers have a limited scope and interest. However, these interest-scope restrictions do not really make Big Data analytics trivial. A version of Gödel’s incompleteness theorem provides a fundamental reason why Big Data will always remain challenging - there is no way to collect enormous amounts of data that are both internally consistent and at the same time accurately represent the native process. At one extreme, attempting to collect, manage and understand infinite amounts of information is bound to either generate biased inference (e.g., fitting a narrow spectrum of phenomena) or provide unstable inference that is difficult to replicate. In the computational sciences, people have encountered that principle as “data overfitting,” even in smaller-scale studies. This is not a serious reason to abandon Big Data explorations, however. We are very far from getting close to this extreme point of singularity.
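To make the “data overfitting” point concrete, here is a minimal illustrative sketch (assuming NumPy; the data are synthetic): a degree-9 polynomial reproduces a small noisy training sample nearly exactly, yet predicts held-out data from the same underlying process far worse than a simpler degree-3 fit.

```python
import numpy as np

# Synthetic example of overfitting: noisy samples from a sine curve.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)

x_test = np.linspace(0, 1, 100)          # held-out points from the same process
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)              # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```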
CSTSP: What tools are available now to analyze and glean important information? What are the limitations of these tools, and what tools are still needed?
Ivo: Although there may be variations due to scope, most basic Big Data analytic protocols include the following steps: data aggregation, data scrubbing, data fusion (e.g., semantic mapping), exploratory and quantitative data modeling, data analytics, summarization, information administration, knowledge management, and decision and action (see Figure 2).
Figure 2, courtesy of Ivo Dinov. In the context of clinical neuroscience, the schematic illustrates the Big Data analytics protocol describing the transition of raw Big Data to Information, Information to Knowledge, and Knowledge to (clinical) Action.
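As one way to picture the protocol in Figure 2 in code, here is a schematic Python skeleton (a sketch, not Dr. Dinov's pipeline software); the stage names mirror the protocol, and the function bodies are placeholders:

```python
# Schematic skeleton of the Big Data analytics protocol in Figure 2.
# Stage names mirror the protocol; the bodies are placeholders, not a real toolkit.
from typing import Any, Iterable

def aggregate(sources: Iterable[Iterable[Any]]) -> list:   # data aggregation
    return [record for source in sources for record in source]

def scrub(records: list) -> list:                          # data scrubbing
    return [r for r in records if r is not None]

def fuse(records: list) -> list:                           # data fusion / semantic mapping
    return records                                         # placeholder: map to common data elements

def model(records: list) -> dict:                          # exploratory / quantitative modeling
    return {"n_records": len(records)}                     # placeholder summary "model"

def run_protocol(sources: Iterable[Iterable[Any]]) -> dict:
    """Raw data -> information -> knowledge -> (clinical) action, as one composed pipeline."""
    state: Any = sources
    for stage in (aggregate, scrub, fuse, model):
        state = stage(state)
    return state

print(run_protocol([[1, None, 2], [3, 4]]))                # {'n_records': 4}
```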
There are some tools, web-services and infrastructure to address a few of these Big Data analytic steps; however, there are many scalability, resilience and reproducibility concerns. Open-citizenry engagement and significant public and private investment in open Big Data research, development, implementation and training (RDIT) may do for Big Data what crowd-sourcing did for Wikipedia.
The theoretical foundations of Big Data Science remain a wide-open field. My research group is investing significant time and resources in a new Big Data processing theory of high-throughput analytics and model-free inference. We are exploring the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data. Compressive Big Data analytics (CBDA) represents one idea that is gaining some traction. In CBDA, one iteratively generates random (sub)samples from the Big Data collection, uses classical techniques to develop model-based or non-parametric inference, repeats the sampling and inference steps many times, and finally uses bootstrapping techniques to quantify probabilities, estimate likelihoods, approximate parameters, or assess the accuracy of findings. Although untested as yet, the CBDA approach may provide a scalable solution that avoids some of the Big Data management and analytics challenges.
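A minimal Python sketch of the general iterated subsample-and-bootstrap idea (illustrative only, not Dr. Dinov's CBDA implementation; the data, subsample size and estimator are stand-ins):

```python
import numpy as np

# Schematic of the compressive, iterative-sampling idea: repeatedly draw small random
# subsamples from a large dataset, run a classical estimator on each, then use the
# bootstrap distribution of those estimates to quantify uncertainty.
rng = np.random.default_rng(42)
big_data = rng.lognormal(mean=1.0, sigma=0.8, size=10_000_000)  # stand-in for "Big Data"

def cbda_style_estimate(data, n_iter=500, subsample_size=2_000, estimator=np.mean):
    estimates = []
    for _ in range(n_iter):
        idx = rng.integers(0, data.size, size=subsample_size)   # random (sub)sample
        estimates.append(estimator(data[idx]))                  # classical inference step
    estimates = np.asarray(estimates)
    low, high = np.percentile(estimates, [2.5, 97.5])           # bootstrap-style interval
    return estimates.mean(), (low, high)

point, interval = cbda_style_estimate(big_data)
print(f"estimate: {point:.4f}, 95% interval: ({interval[0]:.4f}, {interval[1]:.4f})")
print(f"full-data mean (for comparison): {big_data.mean():.4f}")
```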
CSTSP: What are some of the opportunities for the general public, including amateur scientists, to contribute to “Big Data”?
Ivo: By 2020 we will know for sure whether Big Data is truly the “Fourth Scientific Paradigm,” expanding the three pillars of experimental, theoretical and computational sciences (cf. Microsoft Research), or whether it is merely a bubble that will either burst spectacularly or slowly fade away like a viral hoax. As of Spring 2014, there is evidence to suggest that Big Data driven infocracy, the rapid and robust management of a substantial flow of information, is here to stay and that it provides a paradigm shift in our scientific, cultural and social perceptions of the world within and around us. Whether we like it or not, big social, biomedical, environmental, and financial data will be observed, stored and integrated on multiple levels.
This rate of data acquisition far exceeds the rate of increase of our ability to handle the information overflow, a tension illustrated by Moore’s vs. Kryder’s laws. Moore’s law reflects the expectation that our computational capabilities, specifically the number of transistors on integrated circuits, double approximately every 18-24 months. Kryder’s law, on the other hand, stipulates that the volume of data, in terms of disk storage capacity, doubles every 14-18 months. Although both laws yield exponential growth, data volume is increasing at a faster pace (Figure 3). As a result, there are clear interests and needs for significant private, public and government engagement in managing, processing, interrogating and interpreting the information content of Big Data.
Figure 3, courtesy of Ivo Dinov. Kryder’s Law: Data volume increases faster than computational power.
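A tiny numerical illustration of the gap that Figure 3 depicts (my arithmetic, assuming the midpoints of the doubling periods quoted above):

```python
# Compare growth under the two doubling periods quoted above (midpoints assumed:
# compute doubles every ~21 months, stored data every ~16 months).
compute_doubling_months = 21
data_doubling_months = 16

def growth_factor(doubling_months: float, years: float) -> float:
    """How many times the quantity multiplies after `years` of exponential doubling."""
    return 2 ** (12 * years / doubling_months)

for years in (5, 10):
    compute = growth_factor(compute_doubling_months, years)
    data = growth_factor(data_doubling_months, years)
    print(f"{years:2d} years: compute x{compute:,.0f}, data x{data:,.0f}, "
          f"data/compute ratio x{data / compute:.1f}")
```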
As our internal (among us) and external (with the outside environment) interconnectivity increases, there will be amazing opportunities for amateur scientists and data-savvy citizens to explore, collaborate and discover patterns, relations and associations across a wide gamut of natural processes. Some of these discoveries may lead to new healthcare inventions, unexpected bio-social findings, transformative energy policies, or expeditious environmental decisions. The fundamental question that will determine the long-term impact of Big Data is directly related to the information value-complexity duality: What is the balance between the new information content and the pragmatic challenges associated with Big Data? I suspect that in the Big Data space, the value added and the complexity of managing it will both be enormous. Although we should expect headwinds and setbacks along the way, the value-complexity balance will probably tilt towards benefits significantly outweighing the challenges.
CSTSP: Is there a gap in the availability of and access to “Big Data” and tools between research groups (for- and non-profit) and the general public? If so, how can data scientists and other “Big Data” stakeholders address this gap?
Ivo: There are huge problems with Big Data access and availability. First, industry seems to be leading the Big Data wave (see Figure 4). As a result, there are a number of asynchronous grass-roots developments and proprietary data procurement and management mechanisms with potential downstream limitations to Big Data fusion and joint analytics. There may be “data-integrity” and vulnerability issues here, but the more acute challenge is the lack of standards and agreement on common data elements and approaches. At this early stage, I view this as “creative chaos” that will be a short-term annoyance but ensure the long-term sustainability of Big Data analytics.
Figure 4, adapted from http://www.bigdata-startups.com/open-source-tools/, courtesy of Ivo Dinov
Second, in the US, there appears to be a noticeable public and government underinvestment in Big Data sciences. For example, NIH plans to invest $24 million annually to support 7-8 National Big Data to Knowledge Centers of Excellence. Although there will be complementary Big Data federal funding opportunities, collectively these investments fall short of the European Brain initiative ($1.3 billion) and the Chinese commitment of $300 billion for biotechnology research over the next five years – both initiatives are related to Big Data, but have much narrower scopes.
Third, Big Data access is significantly constrained by several complementary factors: bandwidth limitations (e.g., whole-genome ADNI data requires transferring hard drives in suitcases); HIPAA regulations (bio-health data, including unstructured information, is practically inaccessible outside the data owners, often even within the same institution); and technological challenges (there is a substantial lack of theoretical understanding of the complex processes generating Big Data, and of the subsequent analytics, modeling algorithms, technological tools and web-services for mechanistically handling this information). Any considerable effort to deal with Big Data will quickly showcase the major barriers associated with data access, exploration, validation and publication of findings.
What is needed are policies that balance personal information protection, societal benefits and a long-term vision for handling Big Data; regulations that facilitate Big Data analytics; and incentives that encourage creative Big Data applications and promote innovative technology development.
CSTSP: What are the benefits and challenges for providing the general public with access to “Big Data” collection and analysis tools?
Ivo: There are a number of benefits and challenges associated with public access to Big Data. Examples of challenges include:
- Pragmatics of data servicing (management, transfer, processing) – there are many technological and infrastructure problems.
- Avoiding inappropriate use of the data. “Big Data Risks” include considerations of privacy, security, misuse, spoofing (feeding in false data), re-identification attempts (from aggregated data), and personal, organizational, governmental or national threats (for example, ethnic targeting, corporate espionage, large-scale cultural, health, or environmental attacks, exploitation of vulnerabilities, intentional character attacks, and software, service or algorithm hacking).
- Overreliance on unsupervised Big Data analytics may lead to wide societal or cultural swings or biased decision making.
- Heisenberg’s principle for Big Data: feeding predictions back into Big Data analytic models may impact the actual future outcomes or prospective observations. There may be a natural limit to Big Data utilization caused by the very Big Data acquisition and processing protocols, which are expected to have a direct effect on the behavior of datum quanta, the basic building blocks of Big Data.
Examples of the benefits include:
- Development of powerful tools for explanatory, anticipatory, predictive or forecasting models of a wide range of natural phenomena, from genetic mutations to global climate change, social network dynamics, healthcare improvements, etc.
- Big Data has significant promise in diagnostic decision support, quantifying cost-benefit analysis, development of opportunity-risk assessment models, quantifying normal or pathological intrinsic process variability and extrinsic technological noise.
- Engagement – we all recall what the introduction of personal motor vehicles, public education, and democratization of politics did for modern society. There is little doubt that providing public access, support and infrastructure for Big Data is likely to have a similar impact on human culture, well-being and social interactions.
Another subtle benefit of democratizing Big Data may be the transition from the expected rare, yet significant, revolutionary changes to persistent, yet minor, alterations representing more evolutionary adaptations to unavoidable genetic, social, environmental or technological fluctuations.
CSTSP: Can you expand on the risks and challenges to national security that you foresee in the collection, storage and analysis of data in the near future? In what ways are those risks and challenges unique to “Big Data” specifically as opposed to large databases and datasets in the past?
Ivo: Ill-intended entities could potentially gather sufficient personal, organizational or government data from disparate sources, aggregate the data, and develop forecasting, predictive or retrospective models that may allow them to proactively anticipate individual, institutional or federal actions, engage in fraudulent activities assuming pseudo-identities, deduce prior (non-public) information, or secretly mislead people, advocates, representatives or decision makers. These deceptive pursuits could have immediate or delayed detrimental effects (e.g., on health, food supply, finance, social norms, and the environment).
The reason why these potential risks are significantly outweighed by the benefits of Big Data is that open Big Data science harnesses broad community ingenuity by engaging trans-disciplinary talent in the research, development, training and security studies associated with Big Data. If nuclear science is any indication, Big Data science that is openly shared, effectively supported and gently regulated could become a significant environmental, social, health and economic driver.
CSTSP: Inversely, how may “Big Data” be used to address national and international biological security issues?
Ivo: Beyond the collection of enormous and heterogeneous information, some of the Big Data analytics capabilities, web-services and technologies that will be designed and implemented to address specific scientific challenges may also be applicable to information provenance and predictive modeling. For instance, the same data, theory, software and infrastructure developed to address broad health, social or environmental challenges in the Big Data domain can also be employed to monitor, predict and anticipate disease outbreaks (e.g., Google Flu Trends), provide targeted surveillance (cf. targeted drug safety monitoring), and perform semi-supervised stochastic thread monitoring (cf. NLP and Latent Dirichlet Allocation using word, phrase or topic distributions in unstructured text data).
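As a toy-scale illustration of the Latent Dirichlet Allocation idea mentioned above (assuming scikit-learn is installed; the documents and topic count are invented for the example):

```python
# Minimal topic-modeling sketch: LDA over a handful of made-up text snippets,
# of the kind that could be screened for emerging health-related signals.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "unusual flu symptoms reported at regional clinics",
    "clinics report spike in respiratory illness this week",
    "new restaurant opens downtown to large crowds",
    "city council debates downtown parking rules",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(documents)            # word counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)                  # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_terms}")
print(doc_topics.round(2))                                # topic weights per document
```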
CSTSP: How can scientists, policymakers and other stakeholders properly weigh the risks and challenges against the opportunities and benefits of “Big Data”? What unique factors will have to be taken into consideration in the “Big Data” risk-benefit analysis?
Ivo: A harmonious approach is necessary to balance the societal demands for enabling and empowering creative Big Data open-science developments with the needs for protection of personal information and reducing public, social and cultural risks associated with potential misuse of such data, technology or know-how.
A good guide for how to approach this balance may be developed based on principles previously laid out and validated by large initiatives like Net Neutrality (stipulating open communications, with Internet service providers and governments treating all Internet data/traffic equally, without discrimination based on user, platform or spatial-temporal traits) and the open-source community development model (which promotes universal access, re-distribution and collaborative expansion of product designs, blueprints, software, tools or services). The special characteristics that make regulation, policy and guidance of Big Data, and its analytic processing, challenging include its dynamic nature, its sporadic, if not stochastic, appearance, its rapid evolution, its potency to enable significant knowledge gain or transfer, and its capacity to empower action.
*The slides included in this interview were from Dr. Dinov’s presentation at the April 1 meeting. The full presentation is available here.