The Naturalis Hybrid Cloud infrastructure

Combining High Performance Computing and massively scalable storage in the datacenter with Public Cloud Hyperscalers at the speed of light using SURFnet fiber optics.

Just as in many other research domains, ICT is playing an increasingly important role in biodiversity research. Expertise from different scientific domains is combined to analyse large datasets using specialized tooling.

Needs and developments within research and ICT

Not only are the datasets produced getting bigger, but a meaningful analysis of them also requires increasing computing power. Naturalis researchers make use of data produced by the DNA lab with next-generation sequencing software, investigate the distribution of species using GIS software, or work with 3D scans and 3D prints. The Naturalis Cloud gives them the right tools for the job.

What does the solution chosen by Naturalis cover?

Traditional HPC systems are very specialized and almost exclusively aimed at the research community. In recent years, much of the knowledge built up within the HPC community about performing calculation tasks in clusters has increasingly been applied in a more standardized manner to generic cloud platforms. 

One of the fastest-growing cloud projects is OpenStack. This open source project, initiated in 2010 by Rackspace and NASA, is a modular cloud operating system that an organization can use to build its own cloud. Based on the Infrastructure as a Service (IaaS) offered with OpenStack, users can independently set up virtual servers, storage, and networks, and configure these according to their own needs.

Given the long-term development towards automated and standardized cloud environments, OpenStack was found to be a perfect solution for the various needs within Naturalis.

The Naturalis private cloud, based on OpenStack, offers researchers the possibility of carrying out calculation tasks on virtual Linux or Windows machines. This gives researchers complete freedom and control over the software, while the installation of a number of frequently used applications is automated.

Strategy

Since the start of the project in 2010, OpenStack has grown into a vast project supported by a huge community of large and small IT companies. The platform is increasingly being used in research as well. For example, CERN joined the project early on and is making large-scale use of the software.
In the near future, researchers will benefit from the rapid developments within OpenStack. For example, increasingly tailored services and applications will be provided in response to the needs of the research community, so that researchers not only have complete control over what they do but can also use their time more efficiently. Many interesting possibilities are also on the horizon for the provision of data processing clusters and the support of 3D applications.

By making a conscious choice for OpenStack at a relatively early stage, Naturalis has ensured that it has the right knowledge and experience to be able to rapidly respond to new developments, and to enable researchers to benefit from these. Furthermore, the platform offers the basis for developing a range of new services to make collection-related and other biodiversity data digitally available.
These functional advantages are also associated with a more efficient use of hardware resources and more effective management. Furthermore, Naturalis has developed a strategy in which open source software on generic hardware is preferred over closed-source vendor solutions.

Data storage

Scientific research never stands still, and research into biodiversity is no exception. The most recent developments are having an enormous impact on the requirements placed on current and future data storage. For example, to improve the reproducibility and transparency of research, increasingly strict requirements are being placed on the storage of research data. At the same time the research field is changing and new techniques are being introduced. From next-generation sequencing to 3D modeling, all of the techniques increasingly used within Naturalis require ever-larger amounts of reliable and rapidly available storage.

Open Data

Besides the size of the data storage, new requirements are also being placed on making the data available to fellow researchers, research funding bodies, and the wider public. Data is therefore being accessed from beyond the Naturalis computer network, and increasingly strict requirements are being placed on the availability of the data storage facilities. Policies regarding research data management have been developed and are being put into place. Fast short-term availability and accessibility are at one end of the storage demand spectrum; cheap long-term archiving of the original research data is at the other.

Scalable tiered storage in a hybrid cloud

The Naturalis HPC cloud on OpenStack needed a massively scalable storage back end. As with OpenStack, the Naturalis engineers and ICT architects chose an open source platform on standard hardware, which they found in the form of Ceph.
Ceph is a relatively new storage solution based on a distributed object store that allows large quantities of data to be stored very reliably. The open source software can be installed on standard, ‘commodity’ servers, where it uses a smart algorithm to distribute pieces of data (objects) across the available servers. Furthermore, Ceph is a solution that can meet the needs for three important types of storage: block, object and file system.
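The idea of distributing objects across servers by algorithm, rather than by looking them up in a central table, can be illustrated with a small sketch. Ceph's own placement algorithm is called CRUSH; the code below is not CRUSH but a much simpler, related technique (rendezvous hashing), and the function name, host names and object names are hypothetical. It only aims to show the principle: every client can compute independently which servers hold the copies of an object.

```python
import hashlib


def place_object(object_name: str, servers: list[str], replicas: int = 3) -> list[str]:
    """Pick the servers that should hold copies of an object.

    Simplified illustration using rendezvous hashing, NOT Ceph's CRUSH.
    """

    def score(server: str) -> int:
        # Hash the (server, object) pair; the result looks random but is
        # identical for every client that computes it.
        digest = hashlib.sha256(f"{server}:{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The servers with the highest scores store the replicas.
    ranked = sorted(servers, key=score, reverse=True)
    return ranked[:replicas]


# Hypothetical host and object names, for illustration only.
servers = [f"storage-node-{i}" for i in range(1, 7)]
for name in ("scan-0001.tiff", "reads-0001.fastq"):
    print(name, "->", place_object(name, servers))
```

With this scheme, adding a server only moves the objects for which the new server ranks among the top scorers; all other objects stay where they are. That property is what makes scaling out on commodity hardware practical.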
By using Ceph as a bulk storage solution, Naturalis meets the needs of several important use cases:

  • Block storage in the form of images and volumes that, in combination with rapid local storage, can be used for analysis of biodiversity data in the cloud environment;
  • Object storage exposed through an S3 API for data preservation and public dissemination;
  • Storage for the backup of primary data.
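For the public dissemination use case, an object in an S3-compatible store is addressed by a plain HTTP URL built from the endpoint, the bucket and the object key. The sketch below shows the two common addressing styles of the S3 API (path-style and virtual-hosted-style). The endpoint, bucket and key are hypothetical, and anonymous download only works if the object has been made publicly readable.

```python
from urllib.parse import quote


def public_object_url(endpoint: str, bucket: str, key: str, path_style: bool = True) -> str:
    """Build the URL of an object in an S3-compatible object store."""
    scheme, _, host = endpoint.partition("://")
    # Percent-encode the key but keep "/" so that folder-like prefixes stay readable.
    encoded_key = quote(key, safe="/")
    if path_style:
        # Path-style: the bucket is the first path segment.
        return f"{scheme}://{host}/{bucket}/{encoded_key}"
    # Virtual-hosted-style: the bucket is part of the host name.
    return f"{scheme}://{bucket}.{host}/{encoded_key}"


# Hypothetical endpoint, bucket and key, for illustration only.
print(public_object_url("https://objects.example.org", "collection-scans", "birds/specimen 42.tiff"))
print(public_object_url("https://objects.example.org", "collection-scans", "birds/specimen 42.tiff", path_style=False))
```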

Futureproof

As a result of the fundamental choice for an extremely scalable open source solution, Naturalis is now in a better position to respond to future demands. An increasing need can be met by scaling up the size of the storage without having to implement expensive changes in the architecture. Changing demands can also be responded to more easily by making use of open source software.
Furthermore, Ceph is a highly cost-effective solution. Instead of paying a lot of money for software licenses and then being bound to specific hardware, cheaper commodity hardware can be used and investments can be made in internal knowledge and external support.

Hybrid Cloud

Another principle of the Naturalis Cloud design is that it needs to be cost-efficient and adaptive to market pricing. Once the GEANT (SURFnet) European tender for storage had been completed, the pricing for archival data on the private cloud was no longer competitive. Thanks to extensive storage automation, the static data could be transferred to the public cloud at the flick of a switch.
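The pricing-driven placement described above can be sketched as a simple rule: static data goes to whichever tier is cheapest, while active data stays on a tier that is fast enough for analysis. This is a minimal sketch and not the actual Naturalis automation; the tier names, prices and the `suits_active_data` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    price_per_tb_month: float  # hypothetical price, arbitrary currency
    suits_active_data: bool    # fast enough for data that is still being analysed


def choose_tier(tiers: list[Tier], is_static: bool) -> Tier:
    """Return the cheapest tier that is acceptable for the given kind of data."""
    # Static (archival) data may go anywhere; active data needs a fast tier.
    candidates = [t for t in tiers if is_static or t.suits_active_data]
    if not candidates:
        raise ValueError("no suitable storage tier available")
    return min(candidates, key=lambda t: t.price_per_tb_month)


# Hypothetical tiers and prices, for illustration only.
tiers = [
    Tier("private-ceph", price_per_tb_month=20.0, suits_active_data=True),
    Tier("public-cloud-archive", price_per_tb_month=5.0, suits_active_data=False),
]
print(choose_tier(tiers, is_static=True).name)   # archival data
print(choose_tier(tiers, is_static=False).name)  # data under analysis
```

When market prices change, only the price figures need updating. The same rule then yields a new placement, which automation can act on.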