Open Access. Powered by Scholars. Published by Universities.®

Systems Architecture Commons

Numerical Analysis and Scientific Computing

Theses/Dissertations

GPGPU

Articles 1 - 2 of 2

Full-Text Articles in Systems Architecture

Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs, Taabish Jeshani, Apr 2023

Electronic Thesis and Dissertation Repository

In this thesis, we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a freely available tool that dynamically determines the values of the launch parameters of a CUDA kernel. We describe a technique for building a helper program, at the compile time of a CUDA program, that is used at run time to determine near-optimal launch parameters for the kernels of that program. The technique combines the MWP-CWP performance prediction model with runtime data parameters and runtime hardware parameters to choose the launch parameters for each kernel invocation. It is implemented within the KLARAPTOR tool, utilizing the LLVM Pass …
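
To make the idea of choosing launch parameters at run time concrete, the following is a minimal sketch, not KLARAPTOR's implementation. It enumerates two-dimensional thread block shapes that respect two well-known CUDA constraints (at most 1024 threads per block, and a thread count that is a multiple of the 32-thread warp) and returns the shape that minimizes a predicted running time. The names `candidate_block_dims`, `pick_block_dims`, and `toy_predict_time` are hypothetical, and the toy cost function is an invented stand-in for a real performance model such as MWP-CWP.

```python
from itertools import product
from typing import Callable, Iterator, Tuple

MAX_THREADS_PER_BLOCK = 1024  # per-block thread limit on current CUDA GPUs
WARP_SIZE = 32                # threads per warp


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def candidate_block_dims() -> Iterator[Tuple[int, int]]:
    """Yield (bx, by) block shapes with power-of-two sides whose thread
    count is a multiple of the warp size and within the per-block limit."""
    sides = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
    for bx, by in product(sides, sides):
        threads = bx * by
        if threads <= MAX_THREADS_PER_BLOCK and threads % WARP_SIZE == 0:
            yield bx, by


def pick_block_dims(n: int,
                    predict_time: Callable[[int, int, int], float]
                    ) -> Tuple[int, int]:
    """Return the candidate shape with the smallest predicted time for an
    n-by-n problem. `predict_time` is supplied by the caller."""
    return min(candidate_block_dims(),
               key=lambda dims: predict_time(n, dims[0], dims[1]))


def toy_predict_time(n: int, bx: int, by: int) -> float:
    """Hypothetical cost model (an assumption, NOT MWP-CWP): counts the
    threads launched after padding n up to a multiple of the block sides,
    and penalizes blocks narrower than a warp in x."""
    padded_threads = ceil_div(n, bx) * bx * ceil_div(n, by) * by
    narrow_penalty = WARP_SIZE / min(bx, WARP_SIZE)
    return padded_threads * narrow_penalty


if __name__ == "__main__":
    bx, by = pick_block_dims(1000, toy_predict_time)
    print(f"chosen block shape for n=1000: ({bx}, {by})")
```

In a tool of the kind the abstract describes, the cost function would be derived from the kernel and the hardware rather than written by hand; the search over candidate shapes is the only part this sketch is meant to illustrate.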


GPGPU Microbenchmarking for Irregular Application Optimization, Dalton R. Winans-Pruitt, Aug 2022

Theses and Dissertations

Irregular applications, such as unstructured mesh operations, do not map easily onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for good performance; an uncoalesced access pattern can achieve high bandwidth, even over 80% of the theoretical global memory …
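
As a worked illustration of what a figure such as "80% of theoretical bandwidth" means, the sketch below applies the standard formulas for theoretical and effective memory bandwidth. It is not taken from the thesis. The function names are hypothetical, the hardware numbers (877 MHz memory clock, 4096-bit bus, double data rate) are those of an NVIDIA Tesla V100, and the measured transfer sizes and time are invented for the example.

```python
def theoretical_bandwidth_gbs(memory_clock_hz: float,
                              bus_width_bits: int,
                              transfers_per_clock: int = 2) -> float:
    """Peak bandwidth in GB/s: clock rate x bus width in bytes x transfers
    per clock (2 for double-data-rate memory)."""
    return memory_clock_hz * (bus_width_bits / 8) * transfers_per_clock / 1e9


def effective_bandwidth_gbs(bytes_read: float,
                            bytes_written: float,
                            seconds: float) -> float:
    """Achieved bandwidth in GB/s for a kernel that moved the given number
    of bytes in the given time."""
    return (bytes_read + bytes_written) / 1e9 / seconds


if __name__ == "__main__":
    peak = theoretical_bandwidth_gbs(877e6, 4096)  # Tesla V100 figures
    # Hypothetical measurement: 1 GB read and 1 GB written in 2.5 ms.
    achieved = effective_bandwidth_gbs(1e9, 1e9, 0.0025)
    print(f"peak     = {peak:.1f} GB/s")
    print(f"achieved = {achieved:.1f} GB/s")
    print(f"fraction = {achieved / peak:.1%}")
```

A microbenchmark reports the second quantity; dividing it by the first gives the fraction of peak that an access pattern attains.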