New Technique Can Encrypt Genetic Data, Protecting Privacy While Assisting Biomedical Research

With inexpensive flow cells like this one, researchers can sequence a human genome in a matter of days, said Yaniv Erlich. | AAAS/Janel Kiley

CHICAGO -- Hackers are devising ever-smarter methods for probing genetic data to extract sensitive information about individuals or their families. But a new encryption technique may ensure that such data can be truly anonymized while remaining workable for medical researchers.

Sharing genetic data is critical to advancing biomedical discoveries; after all, the medical community still has much to learn about how our genomes correlate with health and well-being. But some people have shied away from sharing because, as several studies and real-world incidents have shown, their privacy cannot be guaranteed. This reduces the amount of data flowing into genetic databases, creating an obstacle to collecting the large amounts of genetic information needed to understand how genes predispose people to disease.

At a 16 February symposium at the AAAS Annual Meeting, a panel of experts explored the ongoing tension between the individual's need for genetic privacy and the medical community's desire to use genetic data to improve clinical care.

Yaniv Erlich, a fellow at the Whitehead Institute for Biomedical Research, underscored the privacy problem. He explained the various routes by which genetic data can be breached, noting that researchers have recently witnessed a rapid growth in these techniques.

Broadly speaking, there are three such techniques: identity tracing, in which a hacker doesn't know who the owner of a DNA sample is but figures it out from small identifying markers in the DNA; attribute disclosure, in which the hacker does know the donor's identity but uncovers additional sensitive information about him or her using data submitted to public DNA databases; and finally, completion of sensitive DNA information, in which a hacker knows a donor's identity but works to uncover genomic areas that were masked.

The most famous example of DNA completion involved James Watson, the co-discoverer of the DNA double helix. In 2007, Watson had his whole DNA sequence made publicly available with the exception of APOE, a gene linked to the risk of developing Alzheimer's disease.

"Watson permitted his entire genome to be published online after masking his APOE gene," Erlich explained, "but ancient mutations in our genome are found in a certain non-random structure, and so, by looking at other nearby mutations that were not masked, several geneticists reported being able to reveal Watson's APOE status."

Erlich provided an analogy. "I can mask most of the following sentence, B_r_ _k Ob_ _a is t_ _ Pr_ _ _ _ _nt, but I bet you can still read it because letters are not totally random. If you know some of the letters, you can easily infer the others," he explained. "The same process works with genomic data."
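
The inference Erlich describes can be sketched in a few lines of Python. This is only a toy illustration, not the method the geneticists used on Watson's genome: the reference panel, the five-site haplotype strings, and the function name `infer_masked_site` are all made up for the example. It guesses a masked site by finding reference haplotypes that agree with the target at every unmasked neighboring site and taking a majority vote.

```python
from collections import Counter

# Hypothetical reference panel: each string is one haplotype, one letter
# per neighboring variant site. The values are invented for illustration.
PANEL = ["AGTCA", "AGTCA", "AGTCA", "CGACT", "CGACT", "AGACA"]


def infer_masked_site(target, panel):
    """Guess the allele hidden behind '?' in `target`.

    Reference haplotypes that match the target at every unmasked site
    vote with the allele they carry at the masked position.
    Returns None if no reference haplotype matches.
    """
    masked = target.index("?")
    votes = Counter()
    for hap in panel:
        flanks_match = all(
            t == h
            for i, (t, h) in enumerate(zip(target, hap))
            if i != masked
        )
        if flanks_match:
            votes[hap[masked]] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# The donor masked the third site but left its neighbors visible.
print(infer_masked_site("AG?CA", PANEL))  # T
```

Three of the four matching reference haplotypes carry a T at the hidden site, so the sketch guesses T. Real methods work with far larger panels and statistical models, but the principle is the same: correlated neighbors leak the masked value.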

Fortunately, the discussion at the meeting provided a glimpse into an advance that may help. Kristin Lauter, head of cryptography research at Microsoft, discussed a method for anonymizing genetic data. This sophisticated approach encrypts the genetic information so that it can be accessed only with a key, which the donor alone holds.

Critically, it also encodes the data in such a way that scientists don't lose the flexibility to perform medically useful genetic tests on it. Most such tests (which use algorithms to try to correlate the presence of a gene to a disease or a trait) can be approximated by simple addition and multiplication. With Lauter's tool, computers perform addition and multiplication on data that is never unmasked from its encrypted state; scientists extract insightful correlations around genetic predisposition from it but never know the source.
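
To make the idea of computing on encrypted data concrete, here is a minimal sketch in Python. It is not Lauter's scheme, whose details the talk summary does not give. It swaps in the textbook Paillier cryptosystem, which supports only addition of ciphertexts and multiplication by a public constant, not multiplication of two encrypted values. That is still enough for a simple weighted risk score. The tiny primes, the genotype values, the weights, and all function names are assumptions chosen for illustration and offer no real security.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Toy Paillier key pair. The primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer 0 <= m < n under the public key."""
    n, g = pub
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """Recover the plaintext with the private key."""
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def encrypted_weighted_sum(pub, ciphertexts, weights):
    """Compute Enc(sum(w_i * m_i)) without ever decrypting the m_i.

    Multiplying ciphertexts adds the plaintexts; raising a ciphertext
    to a public weight multiplies its plaintext by that weight.
    """
    n, _ = pub
    n2 = n * n
    total = encrypt(pub, 0)
    for c, w in zip(ciphertexts, weights):
        total = (total * pow(c, w, n2)) % n2
    return total


# Donor side: encrypt made-up genotype counts (0, 1 or 2 risk alleles).
pub, priv = keygen()
genotypes = [0, 1, 2, 1]
encrypted = [encrypt(pub, x) for x in genotypes]

# Server side: sees only ciphertexts and public, made-up weights.
weights = [3, 5, 2, 7]
encrypted_score = encrypted_weighted_sum(pub, encrypted, weights)

# Donor side: only the key holder can read the result.
print(decrypt(pub, priv, encrypted_score))  # 16
```

The server in this sketch never handles a decrypted genotype, yet the donor recovers the same score (0·3 + 1·5 + 2·2 + 1·7 = 16) that a computation on the raw data would give.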

Lauter demonstrated how easy it is to use this encryption technique, which many have thought too cumbersome for wide application. In a matter of seconds on her personal laptop, she encrypted her genetic data, sent it to a secure cloud for analysis, received the result, and decrypted it to obtain important medical information (in this case, her risk of a heart attack).

This new security approach is coming at a very good time as stakeholders outside the world of science unite to weigh issues of genetic privacy and biomedical advancement.

John Wilbanks, Chief Commons Officer at Sage Bionetworks, a Seattle-based research institute that encourages making scientific research results widely accessible, talked about the strategies for setting up reasonable restrictions on genetic data while still making it sufficiently robust for medical use.

"It's very hard to put enough rich data about a person online without compromising at least some percentage of his or her anonymity," he said. "It all comes down to finding a middle point."

He explained who is getting involved in identifying that middle point. "I'd say fifteen groups, maybe labs, are working on it. There aren't too many yet because it's a very new space at the edge of bioethics, re-identification technology, and genetic sequencing, and these groups have never come together like this before. Each of us only has part of the answer. We're all trying to figure out what the right overall balance should be."

Wilbanks said the road ahead, while promising thanks to advances like Lauter's, is not without its challenges, the mysteries of the genome being one.

"For me, the biggest obstacle is the overall uncertainty around what our genomes mean today versus what they're going to mean tomorrow," he said. "Today, a section of genetic code may be considered junk; tomorrow, it may have meanings we don't want to know, or don't want the world to."