Full-Text Articles in Engineering
Algorithms For Fault Tolerance In Distributed Systems And Routing In Ad Hoc Networks, Qiangfeng Jiang
Theses and Dissertations--Computer Science
Checkpointing and rollback recovery are well-known techniques for coping with failures in distributed systems. Future-generation supercomputers will be message-passing distributed systems consisting of millions of processors. As the number of processors grows, the failure rate also grows. Designing efficient checkpointing and recovery algorithms is therefore important if such large systems are to be fully utilized. We present a novel communication-induced checkpointing algorithm that helps reduce contention for access to stable storage when storing checkpoints. Under our algorithm, a process involved in a distributed computation can independently initiate consistent global checkpointing by saving its …