Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Statistics and Probability

2020

SMU Data Science Review

Computational Biology

Articles 1 - 1 of 1

Full-Text Articles in Physical Sciences and Mathematics

Compressed Dna Representation For Efficient Amr Classification, John Partee, Robert Hazell, Anjli Solsi, John Santerre Aug 2020

Compressed Dna Representation For Efficient Amr Classification, John Partee, Robert Hazell, Anjli Solsi, John Santerre

SMU Data Science Review

In this paper, we explore a representation methodology for the compression of DNA isolates. Using lossless string compression via tokenization of frequently repeated segments of DNA, we reduce the length of the isolates to be counted as k-mers for classification. With this new representation, we apply a previously established feature sampling method to dramatically reduce the feature space. In understanding the genetic diversity, we also look at conserving biological function across these spaces. Using a random forest model we were able to predict the resistance or susceptibility of bacteria with 85-90\% accuracy, with a 30-50\% reduction in overall isolate length, …