Open Access. Powered by Scholars. Published by Universities.®
Physical Sciences and Mathematics Commons™
Full-Text Articles in Physical Sciences and Mathematics
Near-Optimal Scheduling And Decision-Making Models For Reactive And Proactive Fault Tolerance Mechanisms, Nichamon Naksinehaboon
Doctoral Dissertations
As High Performance Computing (HPC) systems grow in size to meet the demand for computational power, the likelihood of failures increases dramatically, potentially resulting in large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failures on running applications. However, the overhead of FT mechanisms increases in proportion to the size of the HPC system. The challenge, therefore, is to contain the costly overhead of FT mechanisms while minimizing the computing time lost to failures.
In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint …