Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 4 of 4

Full-Text Articles in Physical Sciences and Mathematics

Reliability Models For HPC Applications And A Cloud Economic Model, Thanadech Thanakornworakij Jul 2012

Doctoral Dissertations

With the enormous number of computing resources in HPC and Cloud systems, failures become a major concern. Therefore, failure behaviors such as reliability, failure rate, and mean time to failure need to be understood to manage such a large system efficiently.

This dissertation makes three major contributions to HPC and Cloud studies. First, a reliability model with correlated failures in a k-node system for HPC applications is studied. This model is extended to improve accuracy by accounting for failure correlation. The Marshall-Olkin multivariate Weibull distribution is refined using the excess-life (conditional Weibull) distribution to better estimate system reliability. Also, the univariate …


Failure Prediction For High-Performance Computing Systems, Narate Taerat Apr 2012

Doctoral Dissertations

The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigate the impact of failures on parallel applications. However, using such mechanisms incurs additional overhead, and overusing them makes that overhead unnecessarily large. Knowing when and where failures will occur can greatly reduce the excess overhead. As such, failure prediction is critical to using FT mechanisms effectively. In addition, it also helps …


A Failure Index For High Performance Computing Applications, Clayton F. Chandler Apr 2012

Doctoral Dissertations

This dissertation introduces a new metric in the area of High Performance Computing (HPC) application reliability and performance modeling. Derived via the time-dependent implementation of an existing inequality measure, the Failure Index (FI) generates a coefficient representing the level of volatility of the failures incurred by an application running on a given HPC system in a given time interval. This coefficient provides a normalized, cross-system representation of the failure volatility of applications running on failure-rich HPC platforms. Further, the origin and ramifications of application failures are investigated, from which certain mathematical conclusions yield greater insight into the behavior of these …


Near-Optimal Scheduling And Decision-Making Models For Reactive And Proactive Fault Tolerance Mechanisms, Nichamon Naksinehaboon Apr 2012

Doctoral Dissertations

As High Performance Computing (HPC) systems increase in size to meet the demand for computational power, the chance of failures increases dramatically, resulting in potentially large amounts of lost computing time. Fault Tolerance (FT) mechanisms aim to mitigate the impact of failures on running applications. However, the overhead of FT mechanisms increases in proportion to the size of the HPC system. Therefore, challenges arise in handling the expensive overhead of FT mechanisms while minimizing the computing time lost to failures.

In this dissertation, a near-optimal scheduling model is built to determine when to invoke a hybrid checkpoint …