Better archiving of genetic data

Deborah Leigh is a geneticist in WSL’s Ecological Genetics research group and a member of the "Standardizing, Aggregating, Analyzing and Disseminating Global Wildlife Genetic and Genomic Data for Improved Management and Advancement of Community Best Practices" Working Group, supported by the John Wesley Powell Center for Analysis and Synthesis and funded by the U.S. Geological Survey (Photo: courtesy)

Every year, researchers upload vast amounts of genetic information to publicly accessible databases. An international team of researchers led by the Swiss Federal Institute for Forest, Snow and Landscape Research WSL is calling, in the scientific journal "Nature Ecology & Evolution", for this to be done in a standardized form in order to enable comprehensive reuse of the data.

Deborah Leigh, there are various large databases in which genetic information is publicly accessible - from completely sequenced genomes of various organisms to individual gene sequences. You and your colleagues want to change how this data is archived. Why?

Let’s take the International Nucleotide Sequence Database Collaboration (INSDC), which is an umbrella for the European, American and Japanese genetic databases, as an example. It is very well established and has been around since 1987; it holds a huge volume of data and is an excellent resource, for example for identifying new species or developing new methods. But up until last year it lacked mandatory minimum standards for metadata, descriptors like the date and location of sampling. Not having this information made it very difficult to fully utilize the corresponding genetic data. But to fulfill our obligation to the public to use research funds as effectively as possible, we have to be able to do that.

And that’s not possible right now?

It is, but it is very tough. Firstly, only a very small proportion of the data published in papers can actually be found in its raw form in public databases. That’s an issue, because if you can’t find the data in its raw form you can’t fully utilize it or maximize its impact. Secondly, within each database you have a constellation of different file types and different refinement or ’cleaning’ steps that have been applied to the data. The type of data that is uploaded is not standardized, and that makes it hard to reuse. Thirdly, the lack of metadata standards means, for example, that you cannot simply search for all data from a specific area or derived from a single method. Searching across different databases is even more complicated.

What do you propose to make genetic information in archives more accessible?

We suggest standardized formats for different types of genetic and genomic data. That might seem a small step, as these formats are already widely used, but standardizing them would make genetic data much easier to access. It would, for example, make it possible for non-specialist researchers and practitioners to share data with a clear processing history with new partners. It would also help remove technological barriers to reuse, such as the need for a computer cluster for genomic data processing, which would help ensure greater equity globally.
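
To make this concrete, the sketch below illustrates schematically what "data with a clear processing history" could look like: a deposited file in a widely used format together with a record of the refinement steps applied to it. The class names, fields and example values are hypothetical and do not reflect the INSDC schema or the paper’s proposed standard; they only illustrate the idea.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one way a deposited dataset could carry both a
# standardized file format and its processing history, so a new user
# knows exactly which "cleaning" steps have already been applied.
# All names and values here are illustrative assumptions.


@dataclass
class ProcessingStep:
    tool: str          # name of the software used (illustrative)
    version: str
    description: str   # what the step did to the data


@dataclass
class DepositedDataset:
    accession: str                  # placeholder identifier
    file_format: str                # e.g. a widely used format such as FASTQ or VCF
    processing_history: list[ProcessingStep] = field(default_factory=list)

    def is_raw(self) -> bool:
        # Raw data means no refinement steps have been applied yet.
        return not self.processing_history


dataset = DepositedDataset(
    accession="EXAMPLE-0001",
    file_format="FASTQ",
    processing_history=[
        ProcessingStep("read_trimmer", "1.0", "adapter and quality trimming"),
    ],
)
print(dataset.is_raw())  # False: one cleaning step is recorded
```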

And what about the metadata you mentioned?

We ask for the mandatory inclusion of as much metadata as possible, provided it is safe for the species to publish. For some protected species, for example, it might be safer not to specify locations. That type of data is important for different reasons. Many reanalyses using methods from population and landscape genetics can’t be done without location information or the sampling year. It also keeps the data available for future innovation. We may not think now of reuses that other researchers will come up with in the future, and we need to provide them with as much extra information as possible to enable that. In our paper we also explicitly ask researchers to retroactively archive older data, or to supplement it so that it adheres to these new standards, and thereby fix past mistakes. What we’re aiming for is that every data set that is being produced now or has been produced in the past is accessible and can be utilized in every possible way, to ensure maximum gains from research funding. Essentially, so that the public get the ’most for their money’.
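
As an illustration of the kind of minimum metadata described here, the following sketch records a sampling date, location and method for one sample, and shows one simple way a precise location could be coarsened before publication for a protected species. The field names, example values and the 0.5-degree grid are assumptions made for this example, not part of the proposed standard.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of minimum sample metadata (sampling date, location,
# method), with a simple coordinate-coarsening step for protected species.
# Field names and the grid size are illustrative assumptions only.


@dataclass
class SampleMetadata:
    species: str
    sampling_date: date
    latitude: float
    longitude: float
    sampling_method: str
    protected: bool = False

    def publishable_location(self, grid_deg: float = 0.5) -> tuple[float, float]:
        """Coordinates safe to archive: exact for unprotected species,
        rounded to a coarse grid cell for protected ones."""
        if not self.protected:
            return (self.latitude, self.longitude)

        def coarsen(x: float) -> float:
            return round(x / grid_deg) * grid_deg

        return (coarsen(self.latitude), coarsen(self.longitude))


record = SampleMetadata(
    species="Example species",
    sampling_date=date(2001, 6, 15),
    latitude=46.8132,
    longitude=9.8455,
    sampling_method="non-invasive hair sample",
    protected=True,
)
print(record.publishable_location())  # (47.0, 10.0) with the assumed 0.5-degree grid
```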

Why is it so important to process old data in particular?

Data from the 1990s and the early 2000s in particular is often not very accessible. But it is really valuable, as it provides an otherwise missing baseline in the genetic diversity record. Such a baseline is important for spotting recent declines or changes in genetic diversity, which could help us stop losses before they become harmful. And as climate change intensifies, these kinds of baselines will likely become important for assessing the impacts of climatic extremes on genetic diversity and the ability of species to recover in our rapidly changing world.

Is data archiving in genetics a new discussion?

No, genetics has a very long history of open data that the field is proud of; we are contributing to the ongoing discussion by proposing standardized formats and minimum metadata to archive. In the last year, the INSDC has already increased its metadata requirements to include the time and location of sampling. The WSL project GenDiB, supported by the Federal Office for the Environment FOEN, is working to establish a national database of genetic diversity data for wild Swiss populations. Other databases are taking part in the discussion, too.