The identities of some volunteers who donate personal genome sequence data for research can be revealed using only publicly available information, researchers report in the 18 January issue of Science.
The detection method relies on Y-chromosome information linked to specific surnames in public genetic genealogy databases, and can only be used to directly identify male DNA donors, although those donors’ female relatives may also be identified. Currently, there are at least eight such databases, including Ysearch and Sorenson Molecular Genealogy Foundation, that collectively contain the surnames and Y-chromosome sequences of hundreds of thousands of men.
“This is an important result that points out the potential for breaches of privacy in genomics studies,” said Yaniv Erlich of the Whitehead Institute for Biomedical Research, who led the research team.
Erlich’s team investigated whether these databases could be used to determine the identity of male volunteers who had donated personal genomic data anonymously to research efforts such as the 1000 Genomes Project.
They used Y-chromosome sequence data from a set of these volunteers and pulled up a number of associated surnames from the genealogy databases, some of which contain free, built-in search engines. They then pinned down the possible individuals for each surname by seeking matches with the volunteers’ ages and states of residence, which were also public information.
With this triangulation approach, they were able to determine the identities of about 50 of the volunteers. Using another set of experiments, the authors estimate that surnames of U.S. Caucasian male DNA donors could be recovered about 12% of the time.
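The triangulation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record types, the function names, the marker values, the fictional people, and the matching rules (exact marker match, exact age match) are hypothetical and are not taken from the study, whose actual matching procedure is not described here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GenealogyRecord:
    """Hypothetical entry in a genetic genealogy database."""
    surname: str
    y_markers: tuple[int, ...]  # made-up Y-chromosome marker values


@dataclass(frozen=True)
class PublicRecord:
    """Hypothetical entry in a public people-search source."""
    name: str
    surname: str
    age: int
    state: str


def candidate_surnames(donor_markers, genealogy_db, max_mismatches=0):
    """Return surnames whose marker profile is close to the donor's.

    Assumption: closeness is the count of differing marker positions.
    """
    surnames = set()
    for record in genealogy_db:
        if len(record.y_markers) != len(donor_markers):
            continue  # profiles of different lengths are not compared
        mismatches = sum(a != b for a, b in zip(donor_markers, record.y_markers))
        if mismatches <= max_mismatches:
            surnames.add(record.surname)
    return surnames


def triangulate(donor_markers, donor_age, donor_state, genealogy_db, public_records):
    """Narrow candidate surnames to people matching the donor's age and state."""
    surnames = candidate_surnames(donor_markers, genealogy_db)
    return [
        person
        for person in public_records
        if person.surname in surnames
        and person.age == donor_age
        and person.state == donor_state
    ]


# Entirely fictional toy data.
GENEALOGY_DB = [
    GenealogyRecord("Alder", (13, 24, 14)),
    GenealogyRecord("Birch", (13, 24, 15)),
]
PUBLIC_RECORDS = [
    PublicRecord("Pat Alder", "Alder", 60, "UT"),
    PublicRecord("Sam Alder", "Alder", 35, "UT"),
    PublicRecord("Lee Birch", "Birch", 60, "UT"),
]

matches = triangulate((13, 24, 14), 60, "UT", GENEALOGY_DB, PUBLIC_RECORDS)
print([person.name for person in matches])
```

In this toy example the surname alone leaves two candidates; it is the combination with age and state that reduces the list to one person, which is why those demographic fields matter for privacy.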
Melissa Gymrek and colleagues, the authors of the paper, do not reveal the identities of the volunteers, who were told during the informed consent process that there was a chance that their identities could be determined even though the sequences were de-identified.
The appropriate response to these challenges is not for the public to stop donating samples or for data sharing to stop, which would hamper scientific progress, the authors say. Instead, they suggest establishing clear policies for data sharing, educating participants about the risks and benefits of genetic studies, and developing legislation regarding the proper use of genetic information.
Barbara Jasny, deputy editor at Science, said that the journal “deliberated the potential benefits and risks of publishing this paper very carefully. Ultimately, we felt that publication would help draw attention to this important issue and promote discussion regarding how best to balance the need to protect research subjects with the need for large-scale data sharing, especially in research that could affect human health.”
In an accompanying Policy Forum article, experts including the directors of the National Human Genome Research Institute and the National Institute of General Medical Sciences recommend further examination of how to balance research participants’ privacy rights with the societal benefits to be realized from the sharing of biomedical research data.
In response to the findings of Erlich’s team, NIGMS and NHGRI have moved certain demographic information out of the publicly accessible portion of the NIGMS Human Genetics Cell Repository to help reduce the risk of future privacy breaches.
Read the abstract, “Identifying Personal Genomes by Surname Inference,” by Melissa Gymrek and colleagues.
Listen to a Science Podcast interview with author Melissa Gymrek.