Anonymous genetic data posted on online resources may not be as secure as previously thought. A team of researchers in the USA has shown it is possible to link whole genome sequence data to a specific person, using only publicly available information.
The technique, which can currently identify only males, takes genetic markers from whole genome sequences held anonymously in certain genetic research databases and matches them to information held in genealogy databases, which can be stored by surname.
This gives a list of possible surnames, which may then be narrowed down using demographic information made available in the genetic database. Cross-referencing the result with other public information can potentially identify the original anonymous contributor to the genetic research. Those with common surnames are less likely to be successfully identified.
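The narrowing step described above can be illustrated with a minimal sketch. It is not the researchers' method, which has not been published in full; the record fields (surname, birth year, state) and all the example people are hypothetical, chosen only to show how a list of surnames plus a few demographic details shrinks a pool of public records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    # Hypothetical fields; real public records vary widely.
    name: str
    surname: str
    birth_year: int
    state: str


def narrow_candidates(records, surnames, birth_year, state):
    """Keep records whose surname is on the candidate list and whose
    demographic details match those attached to the anonymous genome."""
    wanted = {s.lower() for s in surnames}
    return [
        r for r in records
        if r.surname.lower() in wanted
        and r.birth_year == birth_year
        and r.state == state
    ]


# Invented example data, for illustration only.
records = [
    PublicRecord("Adam Rowe", "Rowe", 1950, "Utah"),
    PublicRecord("Brian Rowe", "Rowe", 1972, "Utah"),
    PublicRecord("Carl Hale", "Hale", 1950, "Utah"),
    PublicRecord("Dan Rowe", "Rowe", 1950, "Ohio"),
]

matches = narrow_candidates(records, ["Rowe", "Pike"], birth_year=1950, state="Utah")
print([r.name for r in matches])
```

A rare surname leaves few records after filtering, while a common one leaves many, which is why contributors with common surnames are harder to single out.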
Using this technique, the group from the Whitehead Institute for Biomedical Research in Massachusetts was able to identify genomics entrepreneur Dr Craig Venter as well as several donors to genetic research databases, including the 1000 Genomes Project.
Only males can be identified because the process works by analysing genetic markers known as Y-STRs (Y chromosome short tandem repeats), which are found on the male sex chromosome. As is common with surnames, DNA on the Y chromosome is passed from father to son. Genealogy databases can make use of this correlation and may openly store Y chromosome information by surname.
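The surname lookup this correlation makes possible can be sketched roughly as follows. This is an illustrative assumption about how such matching might work, not the study's algorithm: a Y-STR profile is treated as a mapping from marker name to repeat count, and a genealogy entry counts as a hit when enough shared markers agree. The repeat counts, surnames and thresholds are invented.

```python
def compare_profiles(query, entry):
    """Return (number of shared markers, number of shared markers that differ)."""
    shared = query.keys() & entry.keys()
    mismatches = sum(1 for marker in shared if query[marker] != entry[marker])
    return len(shared), mismatches


def candidate_surnames(query, database, min_shared=3, max_mismatches=0):
    """List surnames whose stored Y-STR profile agrees with the query.

    `database` is a list of (surname, profile) pairs; the thresholds are
    hypothetical tuning parameters.
    """
    hits = set()
    for surname, profile in database:
        n_shared, n_mismatch = compare_profiles(query, profile)
        if n_shared >= min_shared and n_mismatch <= max_mismatches:
            hits.add(surname)
    return sorted(hits)


# Invented repeat counts for illustration only.
query = {"DYS19": 14, "DYS390": 24, "DYS391": 10}
database = [
    ("Rowe", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS392": 13}),
    ("Hale", {"DYS19": 15, "DYS390": 24, "DYS391": 10}),
    ("Pike", {"DYS19": 14, "DYS390": 24}),
]

print(candidate_surnames(query, database))
```

Allowing a small number of mismatches would tolerate mutations between generations, at the cost of a longer and less specific surname list.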
In some cases, family members could also be identified. A person who submits their genetic information for research may reveal family genetic traits, which can be traced through genealogy databases to identify other members of that family.
While the accuracy rate is only around 12 percent, the discovery raises important questions about the security of genetic research. Dr Yaniv Erlich, who led the study, told the BBC: 'This is an important result that points out the potential for breaches of privacy in genomics studies'.
The authors have published neither the names discovered nor the full details of the method used. The findings were shared before publication with the US National Human Genome Research Institute, which is involved in the 1000 Genomes Project and has since removed age information from its genome database.
Speaking on the potential impact of the findings, Dr Erlich said: 'We hope that this study will eventually result in better security algorithms, better policy guidelines, and better legislation to help mitigate some of the risks described'.
The study was published in the journal Science.