Articles 1 - 14 of 14
Full-Text Articles in Computer Engineering
Large Genomes Assembly Using Mapreduce Framework, Yuehua Zhang
All Dissertations
Knowing the genome sequence of an organism is the essential step toward understanding its genomic and genetic characteristics. Currently, whole genome shotgun (WGS) sequencing is the most widely used genome sequencing technique to determine the entire DNA sequence of an organism. Recent advances in next-generation sequencing (NGS) techniques have enabled biologists to generate large DNA sequences in a high-throughput and low-cost way. However, the assembly of NGS reads faces significant challenges due to short reads and an enormously high volume of data. Despite recent progress in genome assembly, current NGS assemblers cannot generate high-quality results or efficiently handle large genomes …
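The map/reduce decomposition behind such assemblers can be illustrated with the classic first stage of de Bruijn graph construction: counting k-mers across all reads. The sketch below is a toy single-process illustration of the pattern, not the dissertation's actual pipeline; all names are hypothetical.

```python
from collections import defaultdict

def map_kmers(read, k=4):
    """Map: emit (k-mer, 1) pairs from one sequencing read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reduce_counts(pairs):
    """Reduce: sum the counts emitted for each k-mer key."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["ACGTACGT", "CGTACGTA"]
pairs = (p for read in reads for p in map_kmers(read))
kmer_counts = reduce_counts(pairs)
```

In a real MapReduce job the shuffle phase groups the emitted pairs by k-mer across machines, so each reducer sees all counts for its keys; the toy above collapses that into one dictionary.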
Design Development And Performance Analysis Of Distributed Least Square Twin Support Vector Machine For Binary Classification, Bakshi Rohit Prasad, Sonali Agarwal
Turkish Journal of Electrical Engineering and Computer Sciences
Machine learning (ML) on Big Data has gone beyond the capacity of traditional machines and technologies. ML for large-scale datasets is the current focus of researchers. Most ML algorithms primarily suffer from memory constraints, complex computation, and scalability issues. The least square twin support vector machine (LSTSVM) technique is an extended version of the support vector machine (SVM). It is much faster than the SVM and is widely used for classification tasks. However, when applied to large-scale datasets having millions or billions of samples and/or a large number of classes, it causes computational and storage bottlenecks. This paper …
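The speed advantage of the LSTSVM comes from replacing the SVM's quadratic program with two small systems of linear equations, one per class hyperplane; each point is then assigned to the class whose hyperplane lies closer. A minimal single-machine sketch, not the authors' distributed implementation (the ridge term `reg` is added here only for numerical stability):

```python
import numpy as np

def lstsvm_train(A, B, c1=1.0, c2=1.0, reg=1e-6):
    """Least-squares twin SVM: fit one hyperplane per class by
    solving two small linear systems (no quadratic program)."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    E = np.hstack([A, e1])   # augmented class-(+1) data
    F = np.hstack([B, e2])   # augmented class-(-1) data
    I = np.eye(E.shape[1]) * reg
    # Plane 1 passes near class +1, unit distance from class -1.
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * E.T @ E + I, F.T @ e2)
    # Plane 2 passes near class -1, unit distance from class +1.
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * F.T @ F + I, E.T @ e1)
    return z1.ravel(), z2.ravel()

def lstsvm_predict(X, z1, z2):
    """Assign each point to the class whose hyperplane is closer."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    d1 = np.abs(Xa @ z1) / np.linalg.norm(z1[:-1])
    d2 = np.abs(Xa @ z2) / np.linalg.norm(z2[:-1])
    return np.where(d1 <= d2, 1, -1)

A = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])        # class +1
B = np.array([[-1.0, -1.0], [-1.1, -0.8], [-0.9, -1.2]])  # class -1
z1, z2 = lstsvm_train(A, B)
preds = lstsvm_predict(np.vstack([A, B]), z1, z2)
```

Because each system is only (d+1)×(d+1) in the feature dimension d, the per-class solve stays cheap even when the number of samples is large; it is the data matrices E and F that grow, which is what motivates a distributed formulation.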
Scale-Invariant Histogram Of Oriented Gradients: Novel Approach For Pedestrian Detection In Multiresolution Image Dataset, Sweta Panigrahi, Surya Narayana Raju Undi
Turkish Journal of Electrical Engineering and Computer Sciences
This paper proposes a scale-invariant histogram of oriented gradients (SI-HOG) for pedestrian detection. Most of the algorithms for pedestrian detection use the HOG as the basic feature and combine other features with the HOG to form the feature set, which is usually applied with a support vector machine (SVM). Hence, the HOG feature is the most efficient and fundamental feature for pedestrian detection. However, the HOG feature produces feature vectors of different lengths for different image resolutions; thus, the feature vectors are incomparable for the SVM. The proposed method forms a scale-space pyramid wherein the histogram bin is calculated. Thus, …
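The core idea of a resolution-independent descriptor can be illustrated by binning gradient orientations into a fixed number of bins, so the vector length no longer depends on image size. This is a simplified sketch of that principle, not the paper's full SI-HOG pipeline:

```python
import numpy as np

def orientation_histogram(img, n_bins=9):
    """Resolution-independent orientation histogram: gradient
    orientations are accumulated into a fixed number of bins,
    weighted by gradient magnitude, so the descriptor length
    does not depend on the image resolution."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.degrees(np.arctan2(gy, gx)), 180.0)  # unsigned orientation
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    s = hist.sum()
    return hist / s if s > 0 else hist   # L1-normalise

rng = np.random.default_rng(0)
small = rng.random((32, 16))    # low-resolution window
large = rng.random((128, 64))   # high-resolution window
h_small = orientation_histogram(small)
h_large = orientation_histogram(large)
```

Both windows yield a 9-dimensional vector, so descriptors from different pyramid levels become directly comparable for an SVM; the standard HOG instead concatenates per-cell histograms, and the number of cells changes with resolution.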
A Counter Based Approach For Reducer Placement With Augmented Hadoop Rack Awareness, Mir Wajahat Hussain, K Hemant Reddy, Diptendu Sinha Roy
Turkish Journal of Electrical Engineering and Computer Sciences
As the data-driven paradigm for intelligent systems design gains prominence, performance requirements have become very stringent, leading to numerous fine-tuned versions of Hadoop and its MapReduce programming model. However, very few researchers have investigated the effect of intelligent reducer placement on Hadoop's performance. This paper delves into this much-ignored reducer placement phase to improve Hadoop's performance and proposes to spawn the reduce phase of Hadoop tasks in an asynchronous fashion across nodes in a Hadoop cluster. The main contributions of this paper are: (i) to track when the map phase of tasks is completed, (ii) to count the number of …
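Contributions (i) and (ii) can be sketched as a per-node counter of completed map tasks that informs where reducers are launched, so reducers land near the bulk of the map output and cross-rack shuffle traffic shrinks. This is a hypothetical illustration of the counter-based idea, not the authors' Hadoop modification; node names are invented:

```python
from collections import Counter

class MapCompletionTracker:
    """Hypothetical counter-based placement policy: track completed
    map tasks per node, then launch reducers on the nodes that
    produced the most map output."""
    def __init__(self):
        self.completed = Counter()

    def on_map_complete(self, node):
        """Called whenever a map task finishes on `node`."""
        self.completed[node] += 1

    def pick_reducer_nodes(self, n):
        """Choose the n nodes holding the most map output."""
        return [node for node, _ in self.completed.most_common(n)]

tracker = MapCompletionTracker()
for node in ["rack1-n1", "rack1-n1", "rack1-n2",
             "rack2-n1", "rack1-n1", "rack1-n2"]:
    tracker.on_map_complete(node)
placement = tracker.pick_reducer_nodes(2)
```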
Scalable Profiling And Visualization For Characterizing Microbiomes, Camilo Valdes
FIU Electronic Theses and Dissertations
Metagenomics is the study of the combined genetic material found in microbiome samples, and it serves as an instrument for studying microbial communities, their biodiversities, and the relationships to their host environments. Creating, interpreting, and understanding microbial community profiles produced from microbiome samples is a challenging task as it requires large computational resources along with innovative techniques to process and analyze datasets that can contain terabytes of information.
The community profiles are critical because they provide information about what microorganisms are present in the sample, and in what proportions. This is particularly important as many human diseases and environmental disasters …
The Scheduling Algorithm Of Cloud Job Based On Hopfield Neural Network, Yudong Guo, Jinping Zuo
Journal of System Simulation
Abstract: Focusing on the low efficiency of cloud job scheduling and the insufficient utilization of resources, a job scheduling algorithm based on the Hopfield neural network is proposed. To improve the resource scheduling ability of the system, the resource characteristics that influence cloud job scheduling are identified. A mathematical model of resource constraints is established, and the Hopfield energy function is designed and optimized. The average utilization rate of 9 nodes is analyzed using standard test cases, and the performance and resource utilization of the proposed strategy are compared with three typical algorithms. …
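A Hopfield formulation typically encodes the scheduling constraints as penalty terms in an energy function that the network then minimizes. The sketch below is one plausible energy for job-to-node assignment, assumed here purely for illustration (the paper's actual energy function may differ):

```python
import numpy as np

def scheduling_energy(V, load, capacity, a=1.0, b=1.0):
    """Hypothetical Hopfield-style energy for job-to-node assignment.
    V[i, j] = 1 if job i runs on node j.  The first penalty forces
    each job onto exactly one node; the second penalises nodes
    whose summed job load exceeds their capacity."""
    one_node = ((V.sum(axis=1) - 1.0) ** 2).sum()
    node_load = V.T @ load
    overload = (np.maximum(node_load - capacity, 0.0) ** 2).sum()
    return a * one_node + b * overload

load = np.array([2.0, 3.0, 1.0])     # per-job resource demand
capacity = np.array([4.0, 4.0])      # per-node capacity
valid = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
invalid = np.array([[1, 1], [0, 1], [0, 0]], dtype=float)
e_valid = scheduling_energy(valid, load, capacity)
e_invalid = scheduling_energy(invalid, load, capacity)
```

A feasible assignment sits at zero energy, so the network's gradient dynamics are pulled toward constraint-satisfying schedules; the weights a and b trade off constraint violations against each other.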
Cloud Job Scheduling Model Based On Improved Plant Growth Algorithm, Li Qiang, Xiaofeng Liu
Journal of System Simulation
Abstract: The performance of the cloud job scheduling algorithm is of great importance to the whole cloud system. The key factors that affect cloud job scheduling are identified, and a resource constraint model is established. The existing simulated plant growth algorithm is improved based on the Logistic model of the plant growth law, so that the plant's growth pattern changes according to its energy level. Four different plant models are compared and their distinct features analyzed. Compared with 6 typical cloud job scheduling algorithms, it is concluded that the improved simulated plant growth …
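The Logistic model referred to here describes growth that starts slowly, accelerates, and then saturates at a carrying capacity. A minimal sketch of the curve (parameter values are illustrative only, not taken from the paper):

```python
import math

def logistic_growth(t, carrying=1.0, rate=1.5, x0=0.05):
    """Logistic law of plant growth: slow start, rapid middle
    phase, saturation at the carrying capacity."""
    return carrying / (1.0 + ((carrying - x0) / x0) * math.exp(-rate * t))

# Growth trajectory sampled at integer time steps.
energies = [logistic_growth(t) for t in range(0, 8)]
```

In a plant-growth search heuristic, a curve like this can modulate how much "energy" (search effort) each growth point receives as the run progresses.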
Scheduling In Mapreduce Clusters, Chen He
Department of Computer Science and Engineering: Dissertations, Theses, and Student Research
MapReduce is a framework proposed by Google for processing huge amounts of data in a distributed environment. The simplicity of the programming model and the fault-tolerance feature of the framework make it very popular in Big Data processing.
As MapReduce clusters get popular, their scheduling becomes increasingly important. On one hand, many MapReduce applications have high performance requirements, for example, on response time and/or throughput. On the other hand, with the increasing size of MapReduce clusters, the energy-efficient scheduling of MapReduce clusters becomes inevitable. These scheduling challenges, however, have not been systematically studied.
The objective of this dissertation is to …
Hadoop Framework Implementation And Performance Analysis On A Cloud, Göksu Zeki̇ye Özen, Mehmet Tekerek, Rayi̇mbek Sultanov
Turkish Journal of Electrical Engineering and Computer Sciences
The Hadoop framework uses the MapReduce programming paradigm to process big data by distributing data across a cluster and aggregating the results. MapReduce is one of the methods used to process big data hosted on large clusters. In this method, jobs are processed by dividing them into small pieces and distributing them over nodes. Parameters such as the method of distribution over nodes, the number of jobs run in parallel, and the number of nodes in the cluster affect the execution time of jobs. The aim of this paper is to determine how the numbers of nodes, maps, and reduces affect the performance of …
A Mapreduce-Based Distributed Svm Algorithm For Binary Classification, Ferhat Özgür Çatak, Mehmet Erdal Balaban
Turkish Journal of Electrical Engineering and Computer Sciences
Although the support vector machine (SVM) algorithm has high generalization performance for classifying unseen examples after the training phase and a small loss value, the algorithm is not suitable for real-life classification and regression problems: SVMs cannot scale to training datasets containing hundreds of thousands of examples. In previous studies on distributed machine-learning algorithms, the SVM was trained in a costly and preconfigured computer environment. In this research, we present a MapReduce-based distributed parallel SVM training algorithm for binary classification problems. This work shows how to distribute optimization problems over cloud computing systems with the MapReduce technique. In the second …
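The cascade pattern such algorithms build on can be sketched as: each map task trains an SVM on its data split and emits only its (approximate) support vectors, and the reduce step retrains on their union. The toy below substitutes a Pegasos-style subgradient solver for a real SVM library and runs in a single process; it illustrates the data flow, not the paper's implementation:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Tiny Pegasos-style linear SVM (a stand-in for a real solver)."""
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)
            if yi * (w @ xi) < 1.0:            # hinge-loss violation
                w = (1 - eta * lam) * w + eta * yi * xi
            else:
                w = (1 - eta * lam) * w
    return w

def map_step(split, lam=0.01):
    """Map: train on a local split, emit its margin points
    (an approximation of the split's support vectors)."""
    X, y = split
    w = train_linear_svm(X, y, lam)
    sv = y * (X @ w) <= 1.2                    # loose margin threshold
    if not sv.any():
        sv[np.argmin(y * (X @ w))] = True      # keep at least one point
    return X[sv], y[sv]

def reduce_step(sv_sets, lam=0.01):
    """Reduce: retrain on the union of all emitted support vectors."""
    X = np.vstack([s[0] for s in sv_sets])
    y = np.concatenate([s[1] for s in sv_sets])
    return train_linear_svm(X, y, lam)

rng = np.random.default_rng(1)
Xp = rng.normal([2, 2], 0.3, (20, 2))          # class +1 cluster
Xn = rng.normal([-2, -2], 0.3, (20, 2))        # class -1 cluster
X = np.vstack([Xp, Xn])
y = np.concatenate([np.ones(20), -np.ones(20)])
splits = [(X[::2], y[::2]), (X[1::2], y[1::2])]
w = reduce_step([map_step(s) for s in splits])
acc = np.mean(np.sign(X @ w) == y)
```

The payoff is that only a small fraction of each split (the margin points) crosses the shuffle boundary, which is what makes the approach practical over MapReduce.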
Scalable Sentiment Analytics, Aslan Baki̇rov, Kevser Nur Çoğalmiş, Ahmet Bulut
Turkish Journal of Electrical Engineering and Computer Sciences
Spark has become a widely popular analytics framework that provides an implementation of the equally popular MapReduce programming model. Hadoop is an Apache Foundation framework that can be used for processing large datasets on a cluster of computers using the MapReduce programming model. Mahout is an Apache Foundation project developed for building scalable machine learning libraries, which includes built-in machine learning classifiers. In this paper, we show how to build a simple text classifier on Spark, Apache Hadoop, and Apache Mahout for extracting sentiments from a text collection containing millions of text documents. Using a collection of 7 million …
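The MapReduce shape of such a pipeline can be shown with a deliberately tiny lexicon-based scorer standing in for a trained classifier (the paper trains real classifiers with Mahout on Spark and Hadoop; the lexicon and document set here are hypothetical):

```python
from collections import Counter

POSITIVE = {"good", "great", "love"}   # toy lexicon, illustrative only
NEGATIVE = {"bad", "poor", "hate"}

def map_sentiment(doc):
    """Map: score one document and emit a (label, 1) pair."""
    words = doc.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "pos" if score > 0 else "neg" if score < 0 else "neutral"
    return label, 1

def reduce_sentiment(pairs):
    """Reduce: aggregate label counts across the whole collection."""
    counts = Counter()
    for label, n in pairs:
        counts[label] += n
    return counts

docs = ["I love this phone, it is great",
        "poor battery, bad screen",
        "arrived on time"]
totals = reduce_sentiment(map_sentiment(d) for d in docs)
```

On Spark the same shape is a `map` over the document RDD followed by a `reduceByKey`, with the classifier evaluated independently per document, which is what lets the job scale to millions of texts.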
Distributed Formal Concept Analysis Algorithms Based On An Iterative Mapreduce Framework, Ruairí De Fréin, Biao Xu, Eric Robson, Mícheál Ó Fóghlú
Conference papers
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration, we introduce a distributed approach for performing formal concept mining. Our method is novel in that we use a lightweight MapReduce runtime called Twister, which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm …
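A formal concept is a pair (extent, intent) in which the objects' shared attributes and those attributes' common objects close on each other. The brute-force sketch below enumerates the concepts of a toy context by closing every object subset; Ganter's NextClosure algorithm, which the paper distributes, instead enumerates intents in lectic order and avoids this exponential sweep:

```python
from itertools import combinations

def derive_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a small
    binary context {object: set_of_attributes}.  Brute force,
    for illustration only."""
    objects = sorted(context)
    attributes = set().union(*context.values())

    def up(objs):     # attributes common to every object in objs
        return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

    def down(attrs):  # objects having every attribute in attrs
        return {o for o in objects if attrs <= context[o]}

    concepts = set()
    for r in range(len(objects) + 1):
        for objs in combinations(objects, r):
            extent = down(up(set(objs)))        # close the object set
            concepts.add((frozenset(extent), frozenset(up(extent))))
    return concepts

# Toy context: which small numbers have which properties.
context = {
    1: {"odd"},
    2: {"even", "prime"},
    3: {"odd", "prime"},
    4: {"even", "square"},
}
concepts = derive_concepts(context)
```

For this context the closure sweep yields eight concepts, from the bottom concept (no objects, all attributes) up to the top concept (all objects, no shared attribute); pairs such as ({2, 3}, {prime}) sit between them.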
Managing Large Data Sets Using Support Vector Machines, Ranjini Srinivas
Department of Computer Science and Engineering: Dissertations, Theses, and Student Research
Hundreds of Terabytes of CMS (Compact Muon Solenoid) data are being accumulated for storage day by day at the University of Nebraska-Lincoln, which is one of the eight US CMS Tier-2 sites. Managing this data includes retaining useful CMS data sets and clearing storage space for newly arriving data by deleting less useful data sets. This is an important task that is currently being done manually and it requires a large amount of time. The overall objective of this study was to develop a methodology to help identify the data sets to be deleted when there is a requirement for …
Improving Performance And Programmer Productivity For I/O-Intensive High Performance Computing Applications, Saba Sehrish
Electronic Theses and Dissertations
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amounts of data, ranging from terabytes (TB) to petabytes (PB). This growth in data for HPC applications has posed challenges as to what constitutes an appropriate parallel programming framework for efficiently processing large data sets. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications, ranging from simulations to analytics. We identify several performance and programmer productivity related …