Open Access. Powered by Scholars. Published by Universities.®

Engineering Commons

Articles 1 - 20 of 20

Full-Text Articles in Engineering

Accelerating Graphics Rendering On RISC-V GPUs, Joshua Simpson Jun 2022

Master's Theses

Graphics Processing Units (GPUs) are commonly used to accelerate massively parallel workloads across a wide range of applications, from machine learning to cryptocurrency mining. The original application for GPUs, however, was to accelerate graphics rendering, which remains popular today through video gaming and video rendering. While GPUs began as fixed-function hardware with minimal programmability, modern GPUs have adopted a design with many programmable cores and supporting fixed-function hardware for rasterization, texture sampling, and render output tasks. This balance enables GPUs to be used for general-purpose computing and still remain adept at graphics rendering. Previous work at the …


Hardware Acceleration In Image Stitching: GPU Vs FPGA, Joshua David Edgcombe Jul 2021

Masters Theses

Image stitching is a process where two or more images with an overlapping field of view are combined. This process is commonly used to increase the field of view or image quality of a system. While this process is not particularly difficult for modern personal computers, hardware acceleration is often required to achieve real-time performance in low-power image stitching solutions. In this thesis, two separate hardware accelerated image stitching solutions are developed and compared. One solution is accelerated using a Xilinx Zynq UltraScale+ ZU3EG FPGA and the other solution is accelerated using an Nvidia RTX 2070 Super GPU. The image …


Towards Practical Homomorphic Encryption And Efficient Implementation, Gyana R. Sahu Aug 2020

Dissertations

Cloud computing has gained significant traction over the past few years, and its application continues to soar, as is evident from its rapid adoption in various industries. One of the major challenges involved in cloud computing services is the security of sensitive information, as cloud servers have often been found to be vulnerable to snooping by malicious adversaries. Such data privacy concerns can be addressed to a great extent by enforcing cryptographic measures. Fully homomorphic encryption (FHE), a special form of public key encryption, has emerged as a primary tool in deploying such cryptographic security assurances without sacrificing many of the …


Algorithms And Framework For Computing 2-Body Statistics On Graphics Processing Units, Napath Pitaksirianan Feb 2020

USF Tampa Graduate Theses and Dissertations

Various types of two-body statistics (2-BS) are regarded as essential components of low-level data analysis in scientific database systems. In relational algebraic terms, a 2-BS is essentially a Cartesian product between two datasets (or two instances of the same dataset) followed by a user-defined aggregate. The quadratic complexity of these computations hinders the timely processing of data. Thus, using modern parallel hardware has become an obvious solution to meet such challenges. This dissertation presents our recent work in designing and optimizing parallel algorithms for 2-BS computation on Graphics Processing Units (GPUs). The unique architecture, however, provides abundant opportunities for optimizing …
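
The relational definition in this abstract (a Cartesian product followed by a user-defined aggregate) can be sketched serially in Python. This is only an illustration of the definition, not the dissertation's GPU algorithms; the names `two_body_statistic`, `squared_distance`, and `histogram`, and the choice of a distance histogram as the aggregate, are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def two_body_statistic(a, b, pair_fn, aggregate):
    """Apply pair_fn to every pair in the Cartesian product a x b, then aggregate.

    The work grows as len(a) * len(b), which is the quadratic cost the abstract mentions.
    """
    return aggregate(pair_fn(p, q) for p, q in product(a, b))


def squared_distance(p, q):
    """Hypothetical pairwise function: squared Euclidean distance (integer-exact)."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def histogram(bucket_width):
    """Hypothetical user-defined aggregate: count pair values per bucket."""
    def aggregate(values):
        return Counter(v // bucket_width for v in values)
    return aggregate


points = [(0, 0), (3, 4), (6, 8)]
hist = two_body_statistic(points, points, squared_distance, histogram(25))
```

Here the same dataset is used for both inputs, matching the "two instances of the same dataset" case, so the nine pairs include each point paired with itself.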


Shortest Path Calculation Using Contraction Hierarchy Graph Algorithms On Nvidia GPUs, Roozbeh Karimi Nov 2019

LSU Doctoral Dissertations

PHAST is to date one of the fastest algorithms for performing single source shortest path (SSSP) queries on road-network graphs. PHAST operates on graphs produced in part using Geisberger's contraction hierarchy (CH) algorithm. Producing these graphs is time consuming, limiting PHAST's usefulness when graphs are not available in advance. CH iteratively assigns scores to nodes, contracts (removes) the highest-scoring node, and adds shortcut edges to preserve distances. Iteration stops when only one node remains. Scoring and contraction rely on a witness path search (WPS) of nearby nodes. Little work has been reported on parallel and especially GPU CH algorithms. This …


Split Latency Allocator: Process Variation-Aware Register Access Latency Boost In A Near-Threshold Graphics Processing Unit, Asmita Pal Aug 2018

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Over the last decade, Graphics Processing Units (GPUs) have been used extensively in gaming consoles, mobile phones, workstations, and data centers, as they have exhibited immense performance improvement over CPUs in graphics-intensive applications. Due to their highly parallel architecture, general purpose GPUs (GPGPUs) have come to the foreground in applications where large data blocks can be processed in parallel. However, the performance improvement is constrained by large power consumption. Meanwhile, Near Threshold Computing (NTC) has emerged as an energy-efficient design paradigm. Hence, operating GPUs at NTC seems like a plausible solution to counteract the high energy consumption. This work …


Tackling Choke Point Induced Performance Bottlenecks In A Near-Threshold GPGPU, Tahmoures Shabanian Aug 2018

All Graduate Theses and Dissertations, Spring 1920 to Summer 2023

Over the last decade, General Purpose Graphics Processing Units (GPGPUs) have garnered substantial attention in the research community due to their extensive thread-level parallelism. GPGPUs provide a remarkable performance improvement over Central Processing Units (CPUs) for highly parallel applications. However, GPGPUs typically achieve this extensive thread-level parallelism at the cost of large power consumption. Consequently, Near-Threshold Computing (NTC) provides a promising opportunity for designing energy-efficient GPGPUs (NTC-GPUs). However, NTC-GPUs suffer from a crucial Process Variation (PV)-inflicted performance bottleneck, which is called a Choke Point. A Choke Point is defined as one gate or a small group of gates that is affected by …


Efficiently And Transparently Maintaining High SIMD Occupancy In The Presence Of Wavefront Irregularity, Stephen V. Cole Aug 2017

McKelvey School of Engineering Theses & Dissertations

Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone …


High Performance Multiview Video Coding, Caoyang Jiang Jan 2017

Dissertations, Master's Theses and Master's Reports

Following the standardization of the latest video coding standard, High Efficiency Video Coding (HEVC), in 2013, the multiview extension of HEVC (MV-HEVC) was published in 2014 and brought significantly better compression performance, of around 50%, for multiview and 3D videos compared to multiple independent single-view HEVC coding. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters to significantly enhance the MVC encoder performance. The aforementioned computing tools …


Analysis Of 3D Cone-Beam CT Image Reconstruction Performance On A FPGA, Devin Held Dec 2016

Electronic Thesis and Dissertation Repository

Efficient and accurate tomographic image reconstruction has been an intensively researched topic due to its increasing everyday use in areas such as radiology, biology, and materials science. Computed tomography (CT) scans are used to analyze internal structures through capture of x-ray images. Cone-beam CT scans project a cone-shaped x-ray to capture 2D image data from a single focal point, rotating around the object. CT scans are prone to multiple artifacts, including motion blur, streaks, and pixel irregularities, and therefore must be run through image reconstruction software to reduce visual artifacts. The most common algorithm used is the Feldkamp, Davis, and …


A Reused Distance Based Analysis And Optimization For GPU Cache, Dongwei Wang Jan 2016

Theses and Dissertations

As a throughput-oriented device, the Graphics Processing Unit (GPU) already integrates a cache, similar to CPU cores. However, applications in GPGPU computing exhibit distinct memory access patterns. Normally, the cache in GPU cores suffers from thread contention and resource over-utilization, yet few detailed works investigate the root of this phenomenon. In this work, we analyze the memory accesses of twenty benchmarks based on reuse distance theory and quantify their patterns. Additionally, we discuss optimization suggestions and implement a Bypassing Aware (BA) Cache, which can intelligently bypass the thrashing-prone candidates.

BA cache is a cost efficient cache design …
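
For readers unfamiliar with reuse distance, a minimal sketch follows. It assumes the common definition (the number of distinct other addresses touched between two consecutive accesses to the same address); the thesis may use a variant, and the function name `reuse_distances` and the toy trace are illustrative only.

```python
def reuse_distances(trace):
    """Return one entry per access in the trace.

    Each entry is the number of distinct addresses accessed since the previous
    access to the same address, or None for the first access to an address.
    This naive version is quadratic in the trace length.
    """
    last_seen = {}  # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


trace = ["a", "b", "c", "a", "b", "b"]
result = reuse_distances(trace)
```

Under this definition, an access whose reuse distance exceeds the number of lines a fully associative LRU cache can hold would miss, which is why long-distance accesses are natural bypass candidates.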


Novel Software Defined Radio Architecture With Graphics Processor Acceleration, Lalith Narasimhan Dec 2015

Dissertations

Wireless has become one of the most pervasive core technologies in the modern world. Demand for faster data rates, improved spectrum efficiency, higher system access capacity, seamless protocol integration, improved security, and robustness under varying channel environments has led to the resurgence of programmable software defined radio (SDR) as an alternative to traditional ASIC-based radios. Future SDR implementations will need support for multiple standards on platforms with multi-Gb/s connectivity, parallel processing, and spectrum sensing capabilities. This dissertation implemented key technologies of importance in addressing these issues, namely the development of cost-effective multi-mode reconfigurable SDR and providing a framework to …


Optimizing Lempel-Ziv Factorization For The GPU Architecture, Bryan Ching Jun 2014

Master's Theses

Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel nature of GPUs for computations other than their original purpose of rendering graphics. Our work targets the use of GPUs for general lossless data compression. Specifically, we developed and ported an algorithm that constructs the Lempel-Ziv factorization directly on the GPU. Our implementation bypasses the …
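
To make the term concrete, here is a naive serial sketch of a Lempel-Ziv (LZ77-style) factorization. It is not the thesis's GPU algorithm: each factor is taken as either a fresh character or the longest match starting at an earlier position (overlap with the current position allowed), and the names `lz_factorize` and `lz_decode` are assumptions for the example.

```python
def lz_factorize(s):
    """Greedy LZ factorization by brute-force search over earlier start positions.

    Returns a list whose items are either a single new character or a
    (position, length) pair that copies from earlier in the string.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # candidate earlier start positions
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append(s[i])  # character not seen before
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz_decode(factors):
    """Rebuild the string; copying one character at a time handles overlapping matches."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            pos, length = f
            for k in range(length):
                out.append(out[pos + k])
    return "".join(out)


factors = lz_factorize("abababb")
```

The brute-force search is what practical implementations replace with suffix-array or similar index structures; it is kept here only because it makes the definition easy to check by hand.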


Paris: A Parallel RSA-Prime Inspection Tool, Joseph R. White Jun 2013

Master's Theses

Modern-day computer security relies heavily on cryptography as a means to protect the data that we have become increasingly reliant on. As the Internet becomes more ubiquitous, methods of security must be better than ever. Validation tools can be leveraged to help increase our confidence and accountability for methods we employ to secure our systems.

Security validation, however, can be difficult and time-consuming. As our computational ability increases, calculations that were once considered “hard” due to length of computation can now be done in minutes. We are constantly increasing the size of our keys and attempting to make computations harder …


Efficient, Scalable, Parallel, Matrix-Matrix Multiplication, Enrique Portillo Jan 2013

Open Access Theses & Dissertations

Over the past decade, power/energy consumption has become a limiting factor for large-scale and embedded High Performance Computing (HPC) systems. This is especially true for systems that include accelerators, e.g., high-end computing devices, such as Graphics Processing Units (GPUs), with terascale computing capabilities and high power draws that greatly surpass those of multi-core CPUs. Accordingly, improving the node-level power/energy efficiency of an application can have a direct and positive impact on both classes of HPC systems.

The research reported in this thesis explores the use of software techniques to enhance the execution-time and power-consumption performance of applications executed on a …


Exploring Computational Chemistry On Emerging Architectures, David Dewayne Jenkins Dec 2012

Doctoral Dissertations

Emerging architectures, such as next generation microprocessors, graphics processing units, and Intel MIC cards, are being used with increased popularity in high performance computing. Each of these architectures has advantages over previous generations of architectures including performance, programmability, and power efficiency. With the ever-increasing performance of these architectures, scientific computing applications are able to attack larger, more complicated problems. However, since applications perform differently on each of the architectures, it is difficult to determine the best tool for the job. This dissertation makes the following contributions to computer engineering and computational science. First, this work implements the computational chemistry variational …


Parallel For Loops On Heterogeneous Resources, Frederick Edward Weber Dec 2012

Doctoral Dissertations

In recent years, Graphics Processing Units (GPUs) have piqued the interest of researchers in scientific computing. Their immense floating point throughput and massive parallelism make them ideal for not just graphical applications, but many general algorithms as well. Load balancing applications and taking advantage of all computational resources in a machine is a difficult challenge, especially when the resources are heterogeneous. This dissertation presents the clUtil library, which vastly simplifies developing OpenCL applications for heterogeneous systems. The core focus of this dissertation lies in clUtil's ParallelFor construct and our novel PINA scheduler which can efficiently load balance work onto multiple …


GPU Implementation Of A Novel Approach To Cramer’s Algorithm For Solving Large Scale Linear Systems, Rosanne Lane West May 2010

Masters Theses

Scientific computing often requires solving systems of linear equations. Most software packages for solving large-scale linear systems use Gaussian elimination methods such as LU-decomposition. An alternative method, recently introduced by K. Habgood and I. Arel, involves an application of Cramer’s Rule and Chio’s condensation to achieve a better performing system for solving linear systems on parallel computing platforms. This thesis describes an implementation of this algorithm on an nVidia graphics processor card using the CUDA language. Increased performance, relative to the serial implementation, is demonstrated, paving the way for future parallel realizations of the scheme.
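
The two textbook ingredients named in this abstract can be sketched serially as follows. This is not the Habgood and Arel algorithm or the thesis's CUDA implementation, whose details the abstract does not give; it only shows Chio's condensation used to evaluate determinants and plain Cramer's Rule on top of it. The function names are assumptions, and exact `Fraction` arithmetic is used so the example is free of rounding error.

```python
from fractions import Fraction


def chio_determinant(matrix):
    """Determinant via Chio's condensation.

    Each step replaces an n x n matrix A (pivot p = A[0][0] != 0) with the
    (n-1) x (n-1) matrix B[i][j] = p*A[i+1][j+1] - A[i+1][0]*A[0][j+1],
    where det(A) = det(B) / p**(n-2).
    """
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    scale = Fraction(1)  # accumulated product of p**(n-2) factors
    sign = 1             # flips on each row swap
    while n > 2:
        if a[0][0] == 0:
            # Swap in a row with a nonzero leading entry, if one exists.
            for r in range(1, n):
                if a[r][0] != 0:
                    a[0], a[r] = a[r], a[0]
                    sign = -sign
                    break
            else:
                return Fraction(0)  # first column is all zeros
        p = a[0][0]
        a = [[p * a[i][j] - a[i][0] * a[0][j] for j in range(1, n)]
             for i in range(1, n)]
        scale *= p ** (n - 2)
        n -= 1
    if n == 1:
        small = a[0][0]
    else:
        small = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return sign * small / scale


def cramer_solve(a, b):
    """Solve a x = b with Cramer's Rule: x_i = det(A_i) / det(A)."""
    d = chio_determinant(a)
    if d == 0:
        raise ValueError("matrix is singular")
    n = len(a)
    return [
        chio_determinant([list(row[:i]) + [b[r]] + list(row[i + 1:])
                          for r, row in enumerate(a)]) / d
        for i in range(n)
    ]


det3 = chio_determinant([[2, 1, 3], [0, 4, 1], [5, 2, 6]])
solution = cramer_solve([[2, 1], [1, 3]], [3, 5])
```

Every entry of the condensed matrix is an independent 2 x 2 determinant, which is presumably the source of the parallelism the abstract refers to.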


Performance Evaluation Of Memory And Computationally Bound Chemistry Applications On Streaming GPGPUs And Multi-Core x86 CPUs, Frederick E. Weber III May 2010

Masters Theses

In recent years, multi-core processors have come to dominate the field in desktop and high performance computing. Graphics processors, traditionally used in CAD, video games, and other 3-D applications, have become more programmable and are now suitable for general purpose computing. This thesis explores multi-core processor and GPU performance and limitations in two computational chemistry applications: a memory bound component of ab-initio modeling and a computationally bound Monte Carlo simulation. For the applications presented in this thesis, exploiting multiple processors is done using a variety of tools and languages including OpenMP and MKL. Brook+ and the Compute Abstraction Layer streaming …


An Approach For Computing Intervisibility Using Graphical Processing U, Judd Tracy Jan 2004

Electronic Theses and Dissertations

In large-scale entity-level military force-on-force simulations, it is essential to know when one entity can see another entity. This visibility determination plays an important role in the simulation and can affect the outcome of the simulation. When virtual Computer Generated Forces (CGF) are introduced into the simulation, these intervisibilities must be calculated by the virtual entities on the battlefield. But as the simulation size increases, so does the complexity of calculating visibility between entities. This thesis presents an algorithm for performing these visibility calculations using Graphical Processing Units (GPU) instead of the Central Processing Units (CPU) that …