Full-Text Articles in Computer Engineering
Reliability Guided Resource Allocation For Large-Scale Supercomputing Systems, Shruti Umamaheshwaran
Open Access Theses
In high performance computing systems, parallel applications request a large number of resources for long time periods. In this scenario, if a resource fails during the application runtime, it would cause all applications using this resource to fail. The probability of application failure is tied to the inherent reliability of resources used by the application. Our investigation of high performance computing systems operating in the field has revealed a significant difference in the measured operational reliability of individual computing nodes. By adding awareness of the individual system nodes' reliability to the scheduler along with the predicted reliability needs of parallel …
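As a rough illustration of why per-node reliability matters to a scheduler, the sketch below assumes node failures are independent, so the probability that an application finishes is the product of the reliabilities of the nodes it uses. The node names, the reliability values, and both function names are hypothetical; this is not the allocation method of the thesis, only a minimal example of the idea the abstract describes.

```python
from math import prod


def app_survival_probability(node_reliabilities):
    """Probability that every node used by the application survives the run.

    Assumes independent node failures (an assumption of this sketch):
    the application fails if any one of its nodes fails.
    """
    return prod(node_reliabilities)


def pick_most_reliable(nodes, count):
    """Return the names of the `count` nodes with the highest reliability."""
    ranked = sorted(nodes.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:count]]


# Hypothetical per-node reliabilities over the application's runtime.
nodes = {"n1": 0.99, "n2": 0.90, "n3": 0.999, "n4": 0.95}

# Reliability-unaware choice: simply take the first two nodes.
naive = ["n1", "n2"]
naive_p = app_survival_probability([nodes[n] for n in naive])

# Reliability-aware choice: take the two most reliable nodes.
chosen = pick_most_reliable(nodes, 2)
chosen_p = app_survival_probability([nodes[n] for n in chosen])
```

Under these made-up numbers the reliability-aware choice gives a noticeably higher completion probability than the naive one, and the gap widens as an application requests more nodes, because every additional factor below 1 lowers the product.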