Open Access. Powered by Scholars. Published by Universities.®

Computer Engineering Commons

Digital Communications and Networking

Routing

2013

University of Kentucky

Full-Text Articles in Computer Engineering

Algorithms For Fault Tolerance In Distributed Systems And Routing In Ad Hoc Networks, Qiangfeng Jiang Jan 2013

Theses and Dissertations--Computer Science

Checkpointing and rollback recovery are well-known techniques for coping with failures in distributed systems. Future-generation supercomputers will be message-passing distributed systems consisting of millions of processors. As the number of processors grows, the failure rate also grows. Designing efficient checkpointing and recovery algorithms is therefore important if such large systems are to be fully utilized. We present a novel communication-induced checkpointing algorithm that helps reduce contention for access to the stable storage where checkpoints are saved. Under our algorithm, a process involved in a distributed computation can independently initiate consistent global checkpointing by saving its …
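
For readers unfamiliar with the general technique, the following is a minimal sketch of a generic, textbook-style index-based communication-induced checkpointing scheme. It is not the algorithm from the thesis, whose details are truncated in the abstract above. The class `Process`, its methods, the dictionary used as stable storage, and the integer application state are all hypothetical names and simplifications chosen for illustration. Each process piggybacks its checkpoint sequence number on outgoing messages, and a receiver that sees a higher number takes a forced checkpoint before processing the message.

```python
import copy


class Process:
    """Toy process for illustrating checkpointing (hypothetical, not from the thesis)."""

    def __init__(self, pid):
        self.pid = pid
        self.state = 0        # application state; an integer total in this sketch
        self.sn = 0           # sequence number of the latest checkpoint
        self.stable = {0: 0}  # simulated stable storage: sequence number -> saved state

    def take_checkpoint(self, sn=None):
        # A basic checkpoint increments the sequence number;
        # a forced checkpoint adopts the number carried by a message.
        self.sn = self.sn + 1 if sn is None else sn
        self.stable[self.sn] = copy.deepcopy(self.state)

    def send(self, value):
        # Piggyback the current sequence number on the outgoing message.
        return {"sn": self.sn, "value": value}

    def receive(self, msg):
        # Forced checkpoint: save state BEFORE processing a message that
        # was sent after a checkpoint this process has not yet matched.
        if msg["sn"] > self.sn:
            self.take_checkpoint(msg["sn"])
        self.state += msg["value"]

    def rollback(self, sn):
        # Restore the state saved under the given sequence number.
        self.state = copy.deepcopy(self.stable[sn])
        self.sn = sn


p, q = Process("p"), Process("q")
p.state, q.state = 5, 10

p.take_checkpoint()      # p independently initiates checkpoint 1
msg = p.send(3)          # message carries sn = 1
q.receive(msg)           # q takes forced checkpoint 1, then processes the message
state_after_receive = q.state

# Simulated failure: both processes roll back to checkpoint 1.
p.rollback(1)
q.rollback(1)
```

In this sketch the two checkpoints numbered 1 do not record the receipt of a message whose send is unrecorded, which is the property a consistent global checkpoint needs. A real system would also have to handle in-transit messages, multiple senders, and actual stable storage, all of which are omitted here.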