EMBL-EBI is among the world’s leading providers of open biological data resources at scale. For researchers who want their data to remain Findable, Accessible, Interoperable and Reusable (FAIR), and for funders who need trustworthy repositories to support Open Science and promote reproducible research practice among their awardees, the archives under EMBL-EBI’s stewardship create immense value. A 2021 independent report estimated this value as equivalent to a financial return of up to £11 billion every year.
Delivery of our mission, to help scientists realise the potential of big data in biology, depends on the long-term resilience of our data resources, ensured by the policies and procedures described below.
Data resource life cycle management and retirement
EMBL-EBI data resources typically operate over the course of decades. From time to time, our resource life cycle management process requires that a resource be retired. This tends to happen when the scientific needs of the community have shifted or when there are more efficient means of managing the data. On retirement, datasets are typically migrated to a new or existing resource that supersedes the retiring one, with original data identifiers always maintained. Alternatively, the resource may remain accessible to users without receiving further updates.
Resource, staffing and infrastructure continuity
A mixed core and grant funding model ensures that, while individual EMBL-EBI resources rely partly on external funding sources, institutional funds can offer continuity of key staff and of the data contained in resources. Technical and administrative infrastructure are centralised, institutionally funded functions at EMBL-EBI. These central teams ensure the resilience of the data resources, covering both the user-facing services (monitoring service status, maintaining service uptime, keeping software up to date) and the hardware backend (redundant servers, backups). Together these measures mitigate the risks of external funding being discontinued and ensure data preservation.
Data resource backup and recovery
EMBL-EBI infrastructure is distributed across three discrete data centres in different geographical locations to guarantee data protection. Many resources make use of geo-dispersed file storage, which has automatic failover and recovery. Operating multiple instances also allows load balancing between the data centres, ensuring that services remain available to the public if a single service instance is interrupted. Geo-dispersed backup via public cloud is also an increasingly popular option.
Distributed delivery through international consortia
EMBL-EBI resources are often delivered as part of international collaborations, which typically carry a commitment to federation, data exchange and preservation. For example, all data in the European Nucleotide Archive are mirrored by the other partners in the International Nucleotide Sequence Database Collaboration, namely the US National Center for Biotechnology Information and the DNA Data Bank of Japan. EMBL-EBI is also a part of the UniProt, wwPDB, IMEx and ProteomeXchange consortia. Ensuring data are available in more than one location strengthens data preservation, provides more robust global access and mitigates risks relating to loss of funding or changes in operating conditions in individual host countries and institutions.