Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Louisiana Tech University

Doctoral Dissertations

Reliability-aware

Full-Text Articles in Physical Sciences and Mathematics

Failure Analysis And Reliability-Aware Resource Allocation Of Parallel Applications In High Performance Computing Systems, Narasimha Raju Gottumukkala, Apr 2008

Doctoral Dissertations

The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications becomes a major challenge because a single node failure interrupts the entire application, and the probability that a job completes without interruption decreases as the number of nodes increases. Accurate knowledge of an HPC system's reliability enables runtime systems, such as resource management, and applications to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users.

This …
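The abstract's point that job reliability falls as the node count grows can be illustrated with a minimal sketch. It assumes, purely for illustration, that nodes fail independently with exponentially distributed times to failure and a common mean time between failures (MTBF). Neither these assumptions nor the name `job_reliability` come from the dissertation, whose own model is not shown in this listing.

```python
import math


def job_reliability(node_mtbf_hours: float, num_nodes: int, job_hours: float) -> float:
    """Probability that none of the nodes fails during the job.

    Illustrative assumptions (not taken from the dissertation):
    - node failures are independent,
    - each node's time to failure is exponential with mean node_mtbf_hours,
    - a single node failure interrupts the whole job.
    """
    node_failure_rate = 1.0 / node_mtbf_hours
    # For independent exponential nodes, the failure rates add up.
    system_failure_rate = num_nodes * node_failure_rate
    return math.exp(-system_failure_rate * job_hours)


if __name__ == "__main__":
    # Hypothetical numbers: a 24-hour job on nodes with a 50,000-hour MTBF.
    for nodes in (1, 100, 1_000, 10_000):
        print(nodes, job_reliability(50_000.0, nodes, 24.0))
```

Under these assumptions the reliability decays exponentially in the number of nodes, which is why a per-node MTBF that looks comfortable for one machine can still leave a job spanning thousands of nodes likely to be interrupted.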


Reliability-Aware Optimal Checkpoint/Restart Model In High Performance Computing, Yudan Liu, Apr 2007

Doctoral Dissertations

Demand for computational power to solve large, challenging problems has increasingly driven up the physical size of High Performance Computing (HPC) systems. As a system grows, it requires more and more components (processors, memory, disks, switches, power supplies, and so on). Thus, challenges arise in handling the reliability of such large-scale systems. To minimize the performance loss due to unexpected failures, fault-tolerance mechanisms are vital to sustaining computational power in such environments. Checkpoint/restart is a common fault-tolerance technique that has been widely applied on single-computer systems. However, checkpointing in a large-scale HPC environment is much more challenging …