Open Access. Powered by Scholars. Published by Universities.®

Systems Architecture Commons

Full-Text Articles in Systems Architecture

Optimizing Collective Communication For Scalable Scientific Computing And Deep Learning, Jiali Li Aug 2023

Doctoral Dissertations

In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency.

Within the context of this …


AdaFT: A Resource-Efficient Framework For Adaptive Fault-Tolerance In Cyber-Physical Systems, Ye Xu Nov 2017

Doctoral Dissertations

Cyber-physical systems frequently have to use massive redundancy to meet application requirements for high reliability. While such redundancy is required, it can be activated adaptively, based on the current state of the controlled plant. Most of the time the physical plant is in a state that allows for a lower level of fault-tolerance. Avoiding the continuous deployment of massive fault-tolerance will greatly reduce the workload of CPSs. In this dissertation, we demonstrate a software simulation framework (AdaFT) that can automatically generate the sub-spaces within which our adaptive fault-tolerance can be applied. We also show the theoretical benefits of AdaFT, and …