Physical Sciences and Mathematics | Open Access Articles

Failure Prediction For High-Performance Computing Systems, Narate Taerat Apr 2012

Doctoral Dissertations

The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and the performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigate the impact of failures on parallel applications. However, such mechanisms incur additional overhead, and overusing them imposes unnecessarily large overhead on parallel applications. Knowing when and where failures will occur can greatly reduce this excess overhead. As such, failure prediction is critical to using FT mechanisms effectively. In addition, it also helps …

Near-Optimal Scheduling And Decision-Making Models For Reactive And Proactive Fault Tolerance Mechanisms, Nichamon Naksinehaboon Apr 2012

Doctoral Dissertations

As High Performance Computing (HPC) systems grow in size to meet the demand for computational power, the chance of failure increases dramatically, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failures on running applications. However, the overhead of FT mechanisms increases in proportion to the size of the HPC system. The challenge, therefore, is to handle the expensive overhead of FT mechanisms while minimizing the computing time lost to failures.

In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint …

Failure Analysis And Reliability-Aware Resource Allocation Of Parallel Applications In High Performance Computing Systems, Narasimha Raju Gottumukkala Apr 2008

Doctoral Dissertations

The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications naturally becomes a major challenge because a single node failure interrupts the entire application, and the reliability of job completion decreases as the number of nodes increases. Accurate reliability knowledge of an HPC system enables runtime systems, such as resource managers, and applications to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users.

This …

Reliability-Aware Optimal Checkpoint/Restart Model In High Performance Computing, Yudan Liu Apr 2007

Doctoral Dissertations

The demand for computational power to solve large, challenging problems has increasingly driven the physical size of High Performance Computing (HPC) systems. As a system gets larger, it requires more and more components (processors, memory, disks, switches, power supplies, and so on). Thus, challenges arise in handling the reliability of such large-scale systems. To minimize the performance loss due to unexpected failures, fault tolerance mechanisms are vital to sustaining computational power in such an environment. Checkpoint/restart is a common fault tolerance technique that has been widely applied in single-computer systems. However, checkpointing in a large-scale HPC environment is much more challenging …
