Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Articles 1 - 4 of 4

Full-Text Articles in Physical Sciences and Mathematics

Application Of Cosine Similarity In Bioinformatics, Srikanth Maturu May 2018

Application Of Cosine Similarity In Bioinformatics, Srikanth Maturu

Department of Computer Science and Engineering: Dissertations, Theses, and Student Research

Finding similar sequences to an input query sequence (DNA or proteins) from a sequence data set is an important problem in bioinformatics. It provides researchers an intuition of what could be related or how the search space can be reduced for further tasks. An exact brute-force nearest-neighbor algorithm used for this task has complexity O(m * n) where n is the database size and m is the query size. Such an algorithm faces time-complexity issues as the database and query sizes increase. Furthermore, the use of alignment-based similarity measures such as minimum edit distance adds an additional complexity to the …


Evaluating Reproducibility In Computational Biology Research, Morgan Oneka Apr 2018

Evaluating Reproducibility In Computational Biology Research, Morgan Oneka

Honors Projects

For my Honors Senior Project, I read five research papers in the field of computational biology and attempted to reproduce the results. However, for the most part, this proved a challenge, as many details vital to utilizing relevant software and data had been excluded. Using Geir Kjetil Sandve's paper "Ten Simple Rules for Reproducible Computational Research" as a guide, I discuss how authors of these five papers did and did not obey these rules of reproducibility and how this affected my ability to reproduce their results.


Fast And Space-Efficient Location Of Heavy Or Dense Segments In Run-Length Encoded Sequences, Ronald I. Greenberg Jan 2018

Fast And Space-Efficient Location Of Heavy Or Dense Segments In Run-Length Encoded Sequences, Ronald I. Greenberg

Ronald Greenberg

This paper considers several variations of an optimization problem with potential applications in such areas as biomolecular sequence analysis and image processing. Given a sequence of items, each with a weight and a length, the goal is to find a subsequence of consecutive items of optimal value, where value is either total weight or total weight divided by total length. There may also be a specified lower and/or upper bound on the acceptable length of subsequences. This paper shows that all the variations of the problem are solvable in linear time and space even with non-uniform item lengths and divisible …


Scalable Feature Selection And Extraction With Applications In Kinase Polypharmacology, Derek Jones Jan 2018

Scalable Feature Selection And Extraction With Applications In Kinase Polypharmacology, Derek Jones

Theses and Dissertations--Computer Science

In order to reduce the time associated with and the costs of drug discovery, machine learning is being used to automate much of the work in this process. However the size and complex nature of molecular data makes the application of machine learning especially challenging. Much work must go into the process of engineering features that are then used to train machine learning models, costing considerable amounts of time and requiring the knowledge of domain experts to be most effective. The purpose of this work is to demonstrate data driven approaches to perform the feature selection and extraction steps in …