Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Resilience

Computer Science Faculty Research & Creative Works

Full-Text Articles in Physical Sciences and Mathematics

Improving Performance Of Iterative Methods By Lossy Checkpointing, Dingwen Tao, Sheng Di, Xin Liang, Zizhong Chen, Franck Cappello Jun 2018

Computer Science Faculty Research & Creative Works

Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When large-scale iterative methods run in parallel across many ranks, they must checkpoint their dynamic variables periodically to guard against unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging …
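As a rough illustration of the idea in this abstract, the following Python sketch restarts a Jacobi solver from a lossy checkpoint. Everything in it is an assumption made for illustration: the abstract is truncated before it describes the actual scheme, so the uniform quantizer (`lossy_checkpoint`, `restore`), the 3x3 test system, and the error bound are hypothetical stand-ins, not the authors' method or compressor.

```python
# Hypothetical demo system: strictly diagonally dominant, so Jacobi converges.
A_DEMO = [[4.0, 1.0, 0.0],
          [1.0, 4.0, 1.0],
          [0.0, 1.0, 4.0]]
B_DEMO = [1.0, 2.0, 3.0]


def jacobi_step(A, b, x):
    """Perform one Jacobi sweep for the system A x = b."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n)))
               for i in range(n))


def lossy_checkpoint(x, error_bound):
    """Assumed stand-in for a lossy compressor: store integer bin indices.

    Bins have width 2 * error_bound, so the restored value differs from
    the original by at most error_bound (up to floating-point roundoff).
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in x]


def restore(bins, error_bound):
    """Rebuild an approximate vector from the stored bin indices."""
    step = 2.0 * error_bound
    return [k * step for k in bins]


def run_with_lossy_restart(A, b, error_bound=1e-3, tol=1e-10, max_iters=1000):
    x = [0.0] * len(b)
    for _ in range(10):                      # iterate, then checkpoint
        x = jacobi_step(A, b, x)
    exact_at_checkpoint = list(x)
    saved = lossy_checkpoint(x, error_bound)

    for _ in range(5):                       # work lost to the failure
        x = jacobi_step(A, b, x)

    x = restore(saved, error_bound)          # simulated fail-stop restart
    restore_error = max(abs(u - v) for u, v in zip(x, exact_at_checkpoint))

    iters = 0
    while residual_norm(A, b, x) > tol and iters < max_iters:
        x = jacobi_step(A, b, x)
        iters += 1
    return x, restore_error, iters


if __name__ == "__main__":
    x, restore_error, iters = run_with_lossy_restart(A_DEMO, B_DEMO)
    print("restore error:", restore_error)
    print("sweeps after restart:", iters)
    print("final residual:", residual_norm(A_DEMO, B_DEMO, x))
```

In this sketch the restored vector differs from the checkpointed iterate by at most `error_bound` per entry, so a convergent iteration such as Jacobi resumes from a slightly perturbed iterate and may need some extra sweeps to reach the same tolerance.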


New-Sum: A Novel Online ABFT Scheme For General Iterative Methods, Dingwen Tao, Shuaiwen Leon Song, Sriram Krishnamoorthy, Panruo Wu, Xin Liang, Eddy Z. Zhang, For Full List Of Authors, See Publisher's Website. May 2016

Computer Science Faculty Research & Creative Works

Emerging high-performance computing platforms, with large component counts and lower power margins, are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems. We present an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover from soft errors for general iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation, and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover …
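To make the notion of a checksum for matrix-vector multiplication concrete, here is a minimal Python sketch of the classical column-sum check: if c holds the column sums of A, then the sum of the entries of y = A x should equal the dot product of c and x. This is assumed here as a generic stand-in and is not the New-Sum encoding, which the truncated abstract does not spell out. The function names and the tolerance are hypothetical.

```python
def column_checksum(A):
    """Precompute c[j] = sum over i of A[i][j] (the row vector 1^T A)."""
    return [sum(row[j] for row in A) for j in range(len(A[0]))]


def matvec(A, x):
    """Plain dense matrix-vector product y = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def checksum_ok(checksum, x, y, rel_tol=1e-9):
    """Compare sum(y) against checksum . x within a relative tolerance.

    The tolerance absorbs floating-point roundoff; a mismatch beyond it
    suggests that y (or the data used to compute it) was corrupted.
    """
    expected = sum(c * v for c, v in zip(checksum, x))
    actual = sum(y)
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= rel_tol * scale


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0],
         [1.0, 4.0, 1.0],
         [0.0, 1.0, 4.0]]
    x = [1.0, 2.0, 3.0]
    c = column_checksum(A)

    y = matvec(A, x)
    print("clean result passes:", checksum_ok(c, x, y))

    y[1] += 0.5  # simulated soft error in one output entry
    print("corrupted result passes:", checksum_ok(c, x, y))
```

A single checksum of this kind can only flag that something went wrong. Locating or correcting the faulty entry needs additional information, such as a second weighted checksum or a recomputation.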