Core Concepts
Automated, reliable, and efficient replication of 7.3 petabytes of climate simulation data from Lawrence Livermore National Laboratory to Argonne and Oak Ridge National Laboratories, leveraging high-speed networks and data transfer infrastructure.
Abstract
The authors describe a large-scale replication task: copying 7.3 petabytes of climate simulation data from Lawrence Livermore National Laboratory (LLNL) to Argonne National Laboratory (ANL) and Oak Ridge National Laboratory (ORNL).
The key highlights are:
- The replication was necessary to establish new Earth System Grid Federation (ESGF) nodes at ANL and ORNL, and to increase the reliability and accessibility of the climate data.
- The data consisted of 29 million files across 17 million directories, posing significant challenges in terms of time, reliability, and automation.
- The replication leveraged high-speed networks (ESnet), data transfer nodes, and the Globus platform to enable automated, reliable, and efficient data transfers.
- The replication process was largely automated using a custom script that managed the transfers, monitored progress, and handled failures.
- The replication was completed in 77 days, close to the theoretical minimum set by the performance of the LLNL file system.
- The authors discuss lessons learned, including the importance of reliable fault recovery, handling of maintenance periods at the sites, and optimizing for asymmetric network performance.
- The successful replication demonstrates the benefits of a well-designed data replication infrastructure for large-scale climate data management.
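The automation described above (submit transfers, monitor them, recover from faults) can be illustrated with a minimal sketch. This is not the authors' script: the `replicate` function, the `submit` callable, and the `max_active` and `max_attempts` parameters are hypothetical names chosen for illustration, and the transfer service is replaced by a plain Python callable so the example runs without any external dependency.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    path: str               # directory to replicate (hypothetical unit of work)
    attempts: int = 0       # number of times this path has been submitted
    status: str = "PENDING"


def replicate(paths, submit, max_active=2, max_attempts=3):
    """Drive every path to completion, resubmitting after faults.

    `submit(path, attempt)` is a hypothetical stand-in for the transfer
    service: it returns True on success and False on a fault.
    """
    pending = deque(Task(p) for p in paths)
    done, failed = [], []
    while pending:
        # Handle at most `max_active` tasks per round to bound source load.
        batch = [pending.popleft() for _ in range(min(max_active, len(pending)))]
        for task in batch:
            task.attempts += 1
            if submit(task.path, task.attempts):
                task.status = "SUCCEEDED"
                done.append(task)
            elif task.attempts < max_attempts:
                task.status = "RETRY"
                pending.append(task)    # requeue for a later round
            else:
                task.status = "FAILED"
                failed.append(task)     # give up and leave for inspection
    return done, failed


def flaky_submit(path, attempt):
    # Simulated backend: paths containing "flaky" fail on their first attempt.
    return not ("flaky" in path and attempt == 1)


done, failed = replicate(["/data/a", "/data/flaky", "/data/b"], flaky_submit)
```

In a real deployment the `submit` callable would wrap the transfer service and the loop would poll asynchronously rather than block. The point of the sketch is only that bounded concurrency plus automatic requeueing lets a long-running replication proceed without an operator handling each fault.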
Stats
The replication task involved 8,182,644,448,359,330 bytes (7.3 PB, counting 1 PB as 2^50 bytes) of data in 17,347,671 directories and 28,907,532 files.
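The byte count can be checked against the rounded petabyte figure with a few lines of arithmetic. The sketch below uses only the three totals reported here; the variable names are illustrative.

```python
TOTAL_BYTES = 8_182_644_448_359_330
N_DIRS = 17_347_671
N_FILES = 28_907_532

pib = TOTAL_BYTES / 2**50                     # binary petabytes (PiB)
pb = TOTAL_BYTES / 10**15                     # decimal petabytes (PB)
mean_file_mb = TOTAL_BYTES / N_FILES / 10**6  # mean file size in decimal MB
files_per_dir = N_FILES / N_DIRS              # mean files per directory

print(f"{pib:.2f} PiB, {pb:.2f} PB")
print(f"{mean_file_mb:.0f} MB per file, {files_per_dir:.2f} files per directory")
```

The total comes to about 7.27 PiB, or 8.18 PB in decimal units, so the rounded 7.3 figure matches the binary convention. The mean file is roughly 283 MB, and there are fewer than two files per directory on average, which suggests that directory traversal is a substantial part of the workload.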
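The average transfer rates between sites, where ALCF and OLCF are the leadership computing facilities at ANL and ORNL respectively, were: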
LLNL→ALCF: 0.648 GB/s
LLNL→OLCF: 0.662 GB/s
ALCF→OLCF: 1.706 GB/s
OLCF→ALCF: 2.352 GB/s
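As a rough consistency check, the reported averages can be converted into transfer times. The sketch below rests on two assumptions that are not stated in this summary: that GB/s means 10^9 bytes per second, and that the two LLNL outbound routes ran concurrently with each byte leaving LLNL only once (the remainder being exchanged between ALCF and OLCF). It is not the authors' derivation of the theoretical minimum.

```python
TOTAL_BYTES = 8_182_644_448_359_330
GB = 10**9            # assumption: reported rates use decimal gigabytes
SECONDS_PER_DAY = 86_400

# Average rates out of LLNL, as reported in the list above.
rates_gb_s = {"LLNL->ALCF": 0.648, "LLNL->OLCF": 0.662}


def days_at(rate_gb_s, total_bytes=TOTAL_BYTES):
    """Days needed to move `total_bytes` at a sustained rate in GB/s."""
    return total_bytes / (rate_gb_s * GB) / SECONDS_PER_DAY


# Everything over a single LLNL route.
single_route_days = days_at(rates_gb_s["LLNL->ALCF"])

# Assumption: both routes run concurrently and split the data between them.
combined_days = days_at(sum(rates_gb_s.values()))

print(f"single route: {single_route_days:.0f} days")
print(f"both routes:  {combined_days:.0f} days")
```

Under these assumptions a single route would need roughly 146 days, while the two routes together would need about 72 days, which is in line with the 77 days reported for the whole replication.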
A total of 4,086 faults were encountered, with an average of 1.05 faults per transfer.
Quotes
"This success demonstrates the considerable benefits that can accrue from the adoption of performant data replication infrastructure."
"Especially given the much larger data volumes expected for CMIP7, it would seem advantageous to deploy such data replication also at other major ESGF sites."