Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

Adaptive Checkpointing For Master-Worker Style Parallelism (Extended Abstract), Gene D. Cooperman, Jason Ansel, Xiaoqin Ma Dec 2010

Gene D. Cooperman

No abstract provided.


DMTCP: Transparent Checkpointing For Cluster Computations And The Desktop, Jason Ansel, Kapil Arya, Gene D. Cooperman Dec 2010

Gene D. Cooperman

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart are demonstrated for a wide range of over 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. The runCMS application, used for the CMS experiment at the Large Hadron Collider at CERN, runs as a 680 MB image in memory that includes 540 dynamic libraries. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time …