Data-driven science requires not only fast storage systems but also strategies to manage data efficiently within and across data centers. Big data tools can satisfy the need to search data based on user-specific metadata. However, there is a zoo of tools available, and no single tool can meet all the requirements of an HPC system in a data center. Data lakes, for example, are a reasonable approach, but there are alternative concepts and tools that also need to be considered. A uniform and consistent view of the millions of scientific data files on HPC systems, together with their efficient processing, is required to maximize exploitability and to prevent segmented data silos between users or projects.

Project Goals

The aim of the project is to critically investigate the state of practice of data management concepts at NHR centers and to bring forward joint developments and training in data management. We expand previous activities on using data lakes for HPC systems with a broad data-centric view, which should ultimately fuel the data exchange between centers. Over a period of one year, the project will:

  a) Investigate and develop methods for efficient data handling at NHR centers. In particular, the suitability and performance of existing (general and domain-specific) research data management solutions for HPC systems are explored.
  b) Develop a concept for data exchange between centers. This involves performance aspects of the data transfer, with a focus on network tests between centers, including the testing of tools and optimizations, and organizational aspects, e.g., user identity management and data permissions for the transfers.
  c) Investigate the performance of storage systems and compare it across centers. The goal is to expand the previously conducted tests involving HPC file systems and object storage systems and to exchange experience and performance results within the NHR.
  d) Form communities and create training material for typical use cases. For the aforementioned efforts, we organize workshops and create training materials for the NHR centers.

GWDG’s Role in the Project

GWDG organizes this project and performs all tasks in close collaboration with the involved partners.

Project Partners

  • Zuse Institute Berlin
  • Technische Universität Dresden (TUD)
  • RWTH Aachen

Open Monthly Meetings

We hold a monthly Jour Fixe on the third Tuesday of every month at 3pm in BBB: https://meet.gwdg.de/b/hen-ogm-ktx-b7l Everyone is welcome! We are looking forward to meeting you!

Acknowledgements

We gratefully acknowledge funding by “Nationales Hochleistungsrechnen” for the project “Large Scale Data Management”.

Deliverables 2024

Storage Report 2024

Data-Intensive Projects User Cheat-Sheet

Deliverables 2023

Data Management Systems Report 2023

Data Transfer Report 2023

Storage Report 2023

Deliverables 2022

Storage Report 2022

Contact

Hendrik Nolte

Duration

01.01.2023 - 30.06.2024

Funded by

NHR Zukunftsprojekte