
Improving Numerical Weather Prediction Workflows at Scale Using the Distributed Asynchronous Object Store (DAOS)


Core Concepts
The European Centre for Medium-Range Weather Forecasts (ECMWF) has developed a new backend for their storage library (FDB) that leverages the Distributed Asynchronous Object Store (DAOS) to outperform traditional POSIX-compliant file systems, especially under high load and contention.
Abstract
The paper presents ECMWF's work on developing DAOS Catalogue and Store backends for their FDB storage library, which is used in Numerical Weather Prediction (NWP) workflows. The key highlights are:

- NWP workflows are highly data-intensive, with data volumes expected to continue increasing substantially in the coming years. This poses a significant storage capacity and performance challenge for current and future NWP data centers.
- POSIX-compliant file systems, while widely used, have limitations in highly parallel data-intensive workloads due to their API semantics and metadata management overheads.
- ECMWF developed a DAOS-based backend for their FDB library, which provides a domain-specific API for storing and indexing weather data (illustrated conceptually in the sketch below). The DAOS backend is designed to leverage DAOS features such as server-side contention resolution, reduced metadata requirements, and fine-grained I/O to improve performance under high load and contention.
- Benchmarks were conducted on the NEXTGenIO prototype system, comparing the DAOS-based FDB backends against the traditional POSIX-based backends on Lustre. The results show that the DAOS backends can outperform the highly optimized POSIX backends, especially under high load and contention, which is typical in NWP workflows.
- The paper provides insights into the design and implementation of the DAOS Catalogue and Store backends, as well as the methodology used for performance assessment and optimization.
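The following Python sketch only illustrates the general idea of a domain-specific, metadata-keyed store such as FDB: fields are archived and retrieved by a dictionary of meteorological keywords rather than by file paths. The keyword names, the in-memory dictionary backend, and the method names are illustrative assumptions, not FDB's actual interface.

```python
# Illustrative sketch of a metadata-keyed field store (NOT the real FDB API).
# Fields are addressed by a canonicalized set of keyword=value pairs, mirroring
# how NWP data is requested, instead of by POSIX file paths.

class FieldStore:
    def __init__(self):
        self._index = {}  # canonical key -> encoded field bytes

    @staticmethod
    def _canonical(key):
        # Sort keywords so equivalent requests map to the same entry.
        return tuple(sorted((k, str(v)) for k, v in key.items()))

    def archive(self, key, data):
        self._index[self._canonical(key)] = data

    def retrieve(self, key):
        return self._index[self._canonical(key)]


store = FieldStore()
# Hypothetical MARS-style keywords; values are made up for illustration.
key = {"class": "od", "stream": "oper", "date": "20240101",
       "time": "0000", "param": "t", "levtype": "pl", "step": "24"}
store.archive(key, b"...GRIB-encoded field bytes...")
assert store.retrieve(key) == b"...GRIB-encoded field bytes..."
```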
Stats
The ECMWF operational forecasting system runs 4 times a day in 1-hour time-critical windows, with an ensemble of 52 perturbed model instances across 2500 compute nodes. Approximately 70 TiB of data are produced and stored during an operational run, comprising 25 million fields.
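As a quick, illustrative back-of-the-envelope check (not stated in the summary above), these figures imply an average field size of roughly 3 MiB, i.e. a workload dominated by many relatively small objects:

```python
# Back-of-the-envelope: average field size implied by the figures quoted above.
total_bytes = 70 * 2**40   # ~70 TiB produced per operational run
n_fields = 25_000_000      # ~25 million fields per run
avg_mib = total_bytes / n_fields / 2**20
print(f"average field size ~= {avg_mib:.1f} MiB")  # prints ~2.9 MiB
```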
Quotes
"POSIX prescribes lots of metadata to be maintained by the operating system for each file and directory; the consistency guarantees are sometimes excessive; it relies on the operating system's block device interface, which enforces retrieval of entire blocks even if only a few bytes in a file are requested; and it mandates file semantics that are over-constrained and non-optimal for high write and read contention workloads on distributed systems, such that distributed locking mechanisms need to be put in place by the distributed file system implementations, causing large lock communication overheads on the client nodes."

Deeper Inquiries

How can the DAOS-based FDB backends be further optimized to reduce the one-off overheads observed in the profiling results, making them more suitable for long-running operational NWP workflows?

To optimize the DAOS-based FDB backends and reduce the one-off overheads observed in the profiling results, several strategies can be implemented:

- Connection pooling: implement a connection pooling mechanism to reuse established connections to DAOS pools and containers (a minimal handle-cache sketch follows this list). This can significantly reduce the overhead of establishing new connections for each operation.
- Batch processing: group pool operations and other one-off FDB overheads together, so that the overhead of individual calls is minimized and efficiency improves.
- Asynchronous operations: use asynchronous operations where possible to overlap communication and computation, reducing idle time and maximizing resource utilization.
- Caching: cache frequently accessed data or metadata locally, minimizing repeated retrievals from the DAOS backend and reducing latency and overhead.
- Optimized data structures: review and optimize the data structures used in the DAOS backend to ensure efficient storage and retrieval of information, streamlining operations and reducing processing time.
- Parallel processing: distribute backend tasks across multiple threads or processes to reduce overall processing time and lower overheads.
- Resource management: allocate and deallocate resources carefully to ensure efficient utilization of system resources and prevent unnecessary overhead.

By incorporating these optimization strategies, the DAOS-based FDB backends can be fine-tuned to operate more efficiently and effectively, making them better suited for long-running operational NWP workflows.
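A minimal sketch of the connection-pooling idea, assuming a generic object-store client: `connect_pool` and `open_container` below are hypothetical stand-ins for whatever connection calls the backend uses (they are not the DAOS or FDB APIs). The point is simply that handles are established once, memoized, and reused across operations rather than paid for on every archive or retrieval.

```python
import threading

# Hypothetical stand-ins for the backend's connect/open calls; NOT real APIs.
def connect_pool(pool_label):
    return ("pool-handle", pool_label)

def open_container(pool_handle, cont_label):
    return ("cont-handle", pool_handle, cont_label)

class HandleCache:
    """Memoize pool/container handles so each is established once and reused."""
    def __init__(self):
        self._lock = threading.RLock()
        self._pools = {}
        self._conts = {}

    def pool(self, pool_label):
        with self._lock:
            if pool_label not in self._pools:
                # One-off connection cost is paid only on first use.
                self._pools[pool_label] = connect_pool(pool_label)
            return self._pools[pool_label]

    def container(self, pool_label, cont_label):
        with self._lock:
            key = (pool_label, cont_label)
            if key not in self._conts:
                self._conts[key] = open_container(self.pool(pool_label), cont_label)
            return self._conts[key]

cache = HandleCache()
for _ in range(1000):
    # Repeated operations reuse cached handles instead of reconnecting each time.
    handle = cache.container("default_pool", "fdb_catalogue")
```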

What other data-intensive scientific workflows, beyond NWP, could potentially benefit from the use of DAOS and similar object storage technologies, and what are the key considerations in adapting them?

Various data-intensive scientific workflows beyond NWP can benefit from the use of DAOS and similar object storage technologies. Some potential domains include:

- Genomics and bioinformatics: large-scale genomic data analysis, DNA sequencing, and bioinformatics workflows can leverage object storage for efficient data management and analysis.
- Astrophysics and astronomy: processing and analyzing vast amounts of astronomical data, including images, spectra, and simulations, can benefit from the scalability and performance of object storage systems.
- Particle physics: high-energy physics experiments generate massive datasets that require robust storage solutions; object storage can provide the necessary scalability and reliability for storing and analyzing them.
- Climate science: climate modeling, simulation, and analysis involve handling extensive datasets; object storage technologies can enhance data accessibility and management in these workflows.
- Financial analytics: financial institutions dealing with large volumes of transactional data and market information can utilize object storage for secure and efficient data storage and retrieval.

Key considerations in adapting these workflows to object storage technologies include data access patterns, scalability requirements, security and compliance needs, integration with existing systems, and performance optimization for specific data processing tasks.

Given the increasing adoption of machine learning in forecast processes, how can the FDB and its DAOS-based backends be extended to efficiently handle the storage and indexing requirements of large ML model artifacts and associated training/inference data?

To handle the storage and indexing requirements of large ML model artifacts and associated data in forecast processes, the FDB and its DAOS-based backends can be extended in the following ways:

- Support for large files: efficiently handle large ML model artifacts by optimizing data transfer mechanisms and storage allocation for big files.
- Metadata management: store and retrieve information about ML models, training data, and inference results effectively, including indexing metadata for quick access and search capabilities (a minimal indexing sketch follows this list).
- Versioning and snapshotting: track changes in ML models over time and facilitate rollback to previous versions if needed.
- Data pipelines: integrate data pipelines for seamless movement of training data, model artifacts, and inference results within the storage system, ensuring data consistency and reliability throughout the ML workflow.
- Scalability and performance: optimize the backend for scalability to accommodate the growing volume of ML data and ensure high performance for training and inference tasks.
- Security and compliance: protect sensitive ML data and ensure compliance with data privacy regulations, including encryption, access control, and audit trails for data access.

By incorporating these extensions and enhancements, the FDB with DAOS backends can effectively support the storage and indexing requirements of large ML model artifacts and associated data in forecast processes, enabling efficient and reliable machine learning workflows.
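As a rough illustration of the metadata-indexing and versioning points above (not the FDB API; the keyword names and the in-memory backend are assumptions), a model artifact could be addressed by a metadata key plus a monotonically increasing version, with the latest version resolved by default:

```python
from collections import defaultdict

class ArtifactIndex:
    """Toy metadata-keyed, versioned index for ML model artifacts (illustration only)."""
    def __init__(self):
        self._versions = defaultdict(dict)  # canonical key -> {version: blob}

    @staticmethod
    def _canonical(key):
        return tuple(sorted((k, str(v)) for k, v in key.items()))

    def archive(self, key, blob):
        versions = self._versions[self._canonical(key)]
        version = max(versions, default=0) + 1  # append-only version numbers
        versions[version] = blob
        return version

    def retrieve(self, key, version=None):
        versions = self._versions[self._canonical(key)]
        return versions[version if version is not None else max(versions)]

index = ArtifactIndex()
# Hypothetical keywords describing a trained forecast model; made up for illustration.
key = {"model": "ai-forecast", "resolution": "0.25deg", "training_window": "2020-2023"}
v1 = index.archive(key, b"weights v1")
index.archive(key, b"weights v2")
assert index.retrieve(key) == b"weights v2"              # latest version by default
assert index.retrieve(key, version=v1) == b"weights v1"  # explicit earlier version
```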