Open Access. Powered by Scholars. Published by Universities.®

Computer Engineering Commons

Databases and Information Systems

Open Access Theses

Theses/Dissertations

2014

Full-Text Articles in Computer Engineering

Reliability Guided Resource Allocation For Large-Scale Supercomputing Systems, Shruti Umamaheshwaran Apr 2014

In high-performance computing systems, parallel applications request a large number of resources for long periods. If a resource fails during application runtime, every application using that resource fails. The probability of application failure is therefore tied to the inherent reliability of the resources the application uses. Our investigation of high-performance computing systems operating in the field has revealed a significant difference in the measured operational reliability of individual computing nodes. By adding awareness of the individual system nodes' reliability to the scheduler along with the predicted reliability needs of parallel …
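
The link between node reliability and application failure described in the abstract can be illustrated with a minimal sketch. It is not the thesis's scheduler, which the truncated abstract does not describe. It assumes node failures are independent, so a job succeeds only if every allocated node survives, and it uses hypothetical node names, reliability values, and function names (`job_success_probability`, `pick_most_reliable`).

```python
import math


def job_success_probability(node_reliabilities):
    """Probability that every allocated node survives the job.

    Assumes independent node failures (an assumption of this sketch,
    not a claim from the thesis).
    """
    return math.prod(node_reliabilities)


def pick_most_reliable(reliability_by_node, count):
    """Return the names of the `count` nodes with the highest reliability."""
    ranked = sorted(reliability_by_node, key=reliability_by_node.get, reverse=True)
    return ranked[:count]


# Hypothetical per-node survival probabilities over one job's runtime.
nodes = {"n1": 0.99, "n2": 0.90, "n3": 0.999, "n4": 0.95}

# Reliability-aware choice of two nodes versus a less reliable pair.
chosen = pick_most_reliable(nodes, 2)
best = job_success_probability(nodes[name] for name in chosen)
worst = job_success_probability([nodes["n2"], nodes["n4"]])

print(chosen, best, worst)
```

Under these assumptions, the job's success probability depends on which two nodes it receives, which is why per-node reliability is useful information for a scheduler.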