Open Access. Powered by Scholars. Published by Universities.®

Computer Engineering Commons

University of Texas at El Paso

Computer Sciences

Clustering

Full-Text Articles in Computer Engineering

A Case Study Towards Verification of the Utility of Analytical Models in Selecting Checkpoint Intervals, Michael Joseph Harney, Jan 2013

Open Access Theses & Dissertations

As high-performance computing (HPC) systems grow larger and contain more components, failures become more common. Codes that run on large numbers of nodes for long periods must take such failures into account and adopt fault tolerance mechanisms to avoid losing computation and, with it, system utilization. One such mechanism is checkpoint/restart. Although analytical models exist to guide users in selecting an appropriate checkpoint interval, these models rest on assumptions that may not always hold. This thesis examines some of these assumptions, in particular the consistency of parameters like Mean Time …