Open Access. Powered by Scholars. Published by Universities.®

Computer Engineering Commons

Full-Text Articles in Computer Engineering

Toward Reliable And Efficient Message Passing Software For HPC Systems: Fault Tolerance And Vector Extension, Dong Zhong Aug 2021

Doctoral Dissertations

As the scale of high-performance computing (HPC) systems continues to grow, researchers are working to achieve the best possible performance for long-running computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software.

First, as systems become larger, their mean time to failure (MTTF) tends to decrease, and handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy that targets large-scale, dynamic systems and can detect both node and process failures. Using multiple overlapping …