Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 21 of 21

Full-Text Articles in Physical Sciences and Mathematics

A Data Science Approach To Defining A Data Scientist, Andy Ho, An Nguyen, Jodi L. Pafford, Robert Slater Dec 2019

SMU Data Science Review

In this paper, we present a common definition and list of skills for a Data Scientist using online job postings. The overlap and ambiguity of various roles such as data scientist, data engineer, data analyst, software engineer, database administrator, and statistician motivate the problem. To arrive at a single Data Scientist definition, we collect over 8,000 job postings from Indeed.com for the six job titles. Each corpus contains text on job qualifications, skills, responsibilities, educational preferences, and requirements. Our data science methodology and analysis rendered the single definition of a data scientist: A data scientist codes, collaborates, and communicates – …


Spatiotemporal Mode Analysis Of Urban Dockless Shared Bikes Based On Point Of Interests Clustering, Zhang Fang, Bin Chen, Yanghua Tang, Dong Jian, Chuan Ai, Xiaogang Qiu Dec 2019

Journal of System Simulation

Abstract: Dockless shared bikes have developed rapidly in cities, and their convenience, economy, and efficiency have been widely welcomed. The digital footprint they generate reveals the movement of people in time and space within the city, which makes it possible to quantify the activities of people in the city using shared bikes. In this paper, based on collected shared-bike data from Beijing, a clustering method based on points of interest is proposed to divide the urban space, so as to construct a mobile network of urban shared bikes and analyze the spatiotemporal mode of bike …


Salience-Aware Adaptive Resonance Theory For Large-Scale Sparse Data Clustering, Lei Meng, Ah-Hwee Tan, Chunyan Miao Dec 2019

Research Collection School Of Computing and Information Systems

Sparse data is known to pose challenges to cluster analysis, as the similarity between data tends to be ill-posed in the high-dimensional Hilbert space. Solutions in the literature typically extend either k-means or spectral clustering with additional steps on representation learning and/or feature weighting. However, adding these usually introduces new parameters and increases computational cost, thus inevitably lowering the robustness of these algorithms when handling massive ill-represented data. To alleviate these issues, this paper presents a class of self-organizing neural networks, called the salience-aware adaptive resonance theory (SA-ART) model. SA-ART extends Fuzzy ART with measures for cluster-wise salient feature modeling. …


Topicsummary: A Tool For Analyzing Class Discussion Forums Using Topic Based Summarizations, Swapna Gottipati, Venky Shankararaman, Renjini Ramesh Oct 2019

Research Collection School Of Computing and Information Systems

This Innovative Practice full paper describes the application of text mining techniques for extracting insights from a course-based online discussion forum through the generation of topic-based summaries. Discussions, whether in the classroom or online, provide an opportunity for collaborative learning through an exchange of ideas that leads to enhanced learning through active participation. Online discussions offer a number of benefits, namely providing additional time to reflect and synthesize information before writing, providing a natural platform for students to voice their ideas without any one student dominating the conversation, and providing a record of the student’s thoughts. An online discussion forum provides a …


High Performance Computing Techniques To Better Understand Protein Conformational Space, Arpita Joshi Aug 2019

Graduate Doctoral Dissertations

This thesis presents an amalgamation of high performance computing techniques to gain better insight into protein molecular dynamics. Key aspects of protein function and dynamics can be learned from their conformational space. Datasets that represent the complex nuances of a protein molecule are high dimensional. Efficient dimensionality reduction becomes indispensable for the analysis of such large datasets. Dimensionality reduction forms a substantial portion of this work, and its application has been explored for other datasets as well. It begins with the parallelization of a known nonlinear feature reduction algorithm called Isomap. The code for the algorithm was re-written in C …


Redpc: A Residual Error-Based Density Peak Clustering Algorithm, Milan Parmar, Di Wang, Xiaofeng Zhang, Ah-Hwee Tan, Chunyan Miao, You Zhou Jul 2019

Research Collection School Of Computing and Information Systems

The density peak clustering (DPC) algorithm was designed to identify arbitrary-shaped clusters by finding density peaks in the underlying dataset. Owing to its relatively low computational complexity and small number of control parameters, DPC soon became widely adopted. However, because DPC takes the entire data space into consideration when computing local density, which is then used to generate a decision graph for identifying cluster centroids, DPC may have difficulty differentiating overlapping clusters and dealing with low-density data points. In this paper, we propose a residual error-based density peak clustering …
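The two quantities behind a DPC decision graph can be sketched as follows. This is a minimal illustration of baseline DPC as it is commonly described (local density from a cutoff radius, and distance to the nearest denser point); it is not the residual error-based variant the paper proposes. The function name `density_peaks`, the `cutoff` parameter, and the sample points are assumptions made for illustration.

```python
import math

def density_peaks(points, cutoff):
    """Return (rho, delta) for each point, the two axes of a DPC decision graph.

    rho[i]   : number of other points closer than `cutoff` (cutoff-kernel density)
    delta[i] : distance to the nearest point with strictly higher density;
               for the densest point(s), the largest distance to any point
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical data: two small groups, each with one central point.
points = [(0, 0), (0, 1), (1, 0), (0.4, 0.4),
          (10, 10), (10, 11), (11, 10), (10.4, 10.4)]
rho, delta = density_peaks(points, cutoff=1.0)

# Points with both high rho and high delta are candidate cluster centers.
for p, r, d in zip(points, rho, delta):
    print(p, r, round(d, 2))
```

Because every density is computed against the whole dataset with a single cutoff, sparse or overlapping regions can be hard to separate, which is the limitation the abstract points to.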


Cure: Flexible Categorical Data Representation By Hierarchical Coupling Learning, Songlei Jian, Guansong Pang, Longbing Cao, Kai Lu, Hang Gao May 2019

Research Collection School Of Computing and Information Systems

The representation of categorical data with hierarchical value coupling relationships (i.e., various value-to-value cluster interactions) is critical yet challenging for capturing complex data characteristics in learning tasks. This paper proposes a novel and flexible coupled unsupervised categorical data representation (CURE) framework, which not only captures the hierarchical couplings but is also flexible enough to be instantiated for contrastive learning tasks. CURE first learns the value clusters of different granularities based on multiple value coupling functions and then learns the value representation from the couplings between the obtained value clusters. With two complementary value coupling functions, CURE is instantiated into …


Clustering Of Multiple Instance Data., Andrew D. Karem May 2019

Electronic Theses and Dissertations

An emergent area of research in machine learning that aims to develop tools to analyze data where objects have multiple representations is Multiple Instance Learning (MIL). In MIL, each object is represented by a bag that includes a collection of feature vectors called instances. A bag is positive if it contains at least one positive instance, and negative if no instances are positive. One of the main objectives in MIL is to identify a region in the instance feature space with high correlation to instances from positive bags and low correlation to instances from negative bags -- this region is …
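The bag-labeling rule stated in this abstract can be written down directly. The sketch below assumes instance labels are available as booleans, which is rarely the case in practice since MIL usually observes only bag labels; the function name `bag_label` and the sample bags are hypothetical.

```python
from typing import Sequence

def bag_label(instance_labels: Sequence[bool]) -> bool:
    """A bag is positive if at least one of its instances is positive."""
    return any(instance_labels)

# Hypothetical bags, each a collection of per-instance labels.
bags = {
    "bag_a": [False, False, True],   # one positive instance -> positive bag
    "bag_b": [False, False, False],  # no positive instances -> negative bag
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
print(labels)
```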


Evolutionary Trends In The Collaborative Review Process Of A Large Software System, Subhajit Datta, Poulami Sarkar Feb 2019

Research Collection School Of Computing and Information Systems

In this paper, we study the evolutionary trends in the collaborative review process of a large open source software system. As expected, the number of reviews, the number of reviews commented on, as well as the number of reviewers, and the interactions between them show increasing trends over time. But unexpectedly, levels of clustering between developers in their interaction networks show a decreasing trend, even as connections between them increase. In the context of our study, clustering is an indicator of developer collaboration, whereas connection points to how intensely developers work together. Thus the trends we observe can inform how …


Transfer Learning For Detecting Unknown Network Attacks, Juan Zhao, Sachin Shetty, Jan Wei Pan, Charles Kamhoua, Kevin Kwiat Jan 2019

VMASC Publications

Network attacks are serious concerns in today’s increasingly interconnected society. Recent studies have applied conventional machine learning to network attack detection by learning the patterns of the network behaviors and training a classification model. These models usually require large labeled datasets; however, the rapid pace and unpredictability of cyber attacks make this labeling impossible in real time. To address these problems, we proposed utilizing transfer learning for detecting new and unseen attacks by transferring the knowledge of the known attacks. In our previous work, we have proposed a transfer learning-enabled framework and approach, called HeTL, which can find the common …


A Hybrid (Active-Passive) Vanet Clustering Technique, Garrett Lee Moore Jan 2019

CCE Theses and Dissertations

Clustering serves a vital role in the operation of Vehicular Ad hoc Networks (VANETs) by continually grouping highly mobile vehicles into logical hierarchical structures. These moving clusters support Intelligent Transport Systems (ITS) applications and message routing by establishing a more stable global topology. Clustering increases the scalability of the VANET by eliminating broadcast storms caused by packet flooding and by facilitating multi-channel operation. The research literature divides clustering techniques into two categories: active and passive. Active techniques rely on periodic beacon messages from all vehicles containing location, velocity, and direction information. However, in areas of high vehicle density, congestion may occur on …


Exploring Bigram Character Features For Arabic Text Clustering, Dia Eddin Abuzeina Jan 2019

Turkish Journal of Electrical Engineering and Computer Sciences

The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without infixes, prefixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this challenge. Recent literature shows that a more basic unit can also serve as a textual feature: the two-neighboring-character form, which we call a microword. To evaluate this feature type, …
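A character-bigram feature of the kind described here can be sketched in a few lines. This is an illustration under assumptions, not the paper's extraction procedure: bigrams are taken within whitespace-separated words only, and the function names `char_bigrams` and `bigram_features` are hypothetical. A Latin-script word is used for readability; Python strings are indexed by code point, so the same slicing applies to Arabic text.

```python
from collections import Counter

def char_bigrams(word):
    """Return the overlapping two-character substrings of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def bigram_features(text):
    """Count character bigrams over all whitespace-separated words."""
    counts = Counter()
    for word in text.split():
        counts.update(char_bigrams(word))
    return counts

print(char_bigrams("kitab"))
print(bigram_features("banana"))
```

With an alphabet of fixed size, the number of distinct bigrams is bounded by the square of the alphabet size, which is what keeps this feature space smaller than a vocabulary of full word forms.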


A New Model To Determine The Hierarchical Structure Of The Wireless Sensor Networks, Resmiye Nasiboğlu, Zülküf Tekin Erten Jan 2019

Turkish Journal of Electrical Engineering and Computer Sciences

Wireless sensor networks are one of the rising areas of scientific research. The common purpose of these investigations is usually to construct an optimal network structure that prolongs the network's lifetime. In this study, a new model is proposed to construct a hierarchical structure for wireless sensor networks. The methods used in the model to determine clusters and appropriate cluster heads are k-means clustering and a fuzzy inference system (FIS), respectively. The weighted averaging based on levels (WABL) defuzzification method is used to calculate crisp outputs of the FIS. A new theorem for calculation of WABL values has been proved in order to …


Evaluating The Attributes Of Remote Sensing Image Pixels For Fast K-Means Clustering, Ali Sağlam, Nurdan Baykan Jan 2019

Turkish Journal of Electrical Engineering and Computer Sciences

The clustering process is an important stage in many data mining applications. In this process, data elements are grouped according to their similarities. One of the best-known clustering algorithms is the k-means algorithm, which requires the number of clusters as an initial parameter and runs iteratively. Like many image processing applications, remote sensing image processing applications usually need a clustering stage. With the development of multispectral sensor and laser technologies, remote sensing images provide more information about the environment. In the dataset used in this paper, the infrared (IR) and the digital surface maps (DSM) are also …
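The iterative behavior of k-means described in this abstract can be sketched as below. This is a plain textbook-style iteration (assign each point to its nearest centroid, then recompute centroids), not the fast variant the paper evaluates; the function name `kmeans`, its parameters, and the sample points standing in for pixel attribute vectors are assumptions.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid nearest to points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k must be given up front
    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments

# Hypothetical 2-D attribute vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

Each iteration computes a distance from every point to every centroid, so the cost grows with the number of pixels and the number of attributes per pixel.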


Exploring The Impact Of (Not) Changing Default Settings In Algorithmic Crime Mapping - A Case Study Of Milwaukee, Wisconsin, Md Romael Haque, Katy Weathington, Shion Guha Jan 2019

Computer Science Faculty Research and Publications

Policing decisions, allocations and outcomes are determined by mapping historical crime data geo-spatially using popular algorithms. In this extended abstract, we present early results from a mixed-methods study of the practices, policies, and perceptions of algorithmic crime mapping in the city of Milwaukee, Wisconsin. We investigate this differential by visualizing potential demographic biases from publicly available crime data over 12 years (2005-2016) and conducting semi-structured interviews with 19 city stakeholders, and we provide future research directions from this study.


Learning From Heterogeneous Data, Lu Wang Jan 2019

Wayne State University Dissertations

Data with both heterogeneity and homogeneity is now ubiquitous due to the development of multitudinous data collection techniques. To encode the data heterogeneity and homogeneity, we focus on unsupervised and supervised learning approaches. In unsupervised learning, to consider both data heterogeneity and homogeneity, we develop three clustering frameworks to maximize the heterogeneity among data sub-groups and the homogeneity within each data sub-group for over-dispersed data of three different types, i.e., alphabetic, network, and mixed-feature data. In supervised learning, the traditional approaches, however, either build a global model for a whole group including all sub-groups, which fails to consider …


Efficient Hierarchical Temporal Segmentation Method For Facial Expression Sequences, Jiali Bian, Xue Mei, Yu Xue, Liang Wu, Yao Ding Jan 2019

Turkish Journal of Electrical Engineering and Computer Sciences

Temporal segmentation of facial expression sequences is important to understand and analyze human facial expressions. It is, however, challenging to deal with the complexity of facial muscle movements by finding a suitable metric to distinguish among different expressions and to deal with the uncontrolled environmental factors in the real world. This paper presents a two-step unsupervised segmentation method composed of rough segmentation and fine segmentation stages to compute the optimal segmentation positions in video sequences to facilitate the segmentation of different facial expressions. The proposed method performs localization of facial expression patches to aid in recognition and extraction of specific …


Intelligent Intrusion Detection Using Radial Basis Function Neural Network, Alia Abughazleh, Muder Almiani, Basel Magableh, Abdul Razaque Jan 2019

Conference papers

Recently, we have witnessed booming and ubiquitous growth of internet connectivity all over the world, leading to a dramatic amount of network activity and large amounts of data and information transfer. Massive data transfer provides fertile ground for hackers and intruders to launch cyber-attacks and various types of penetrations. As a consequence, researchers around the globe have devoted considerable effort to research that can handle different types of attacks efficiently by building various types of intrusion detection systems capable of handling different types of attacks, known and unknown (novel) ones, as well as having the capability to deal with …


Data Patterns Discovery Using Unsupervised Learning, Rachel A. Lewis Jan 2019

Electronic Theses and Dissertations

Self-care activities classification poses significant challenges in identifying children’s unique functional abilities and needs within the exceptional children healthcare system. The accuracy of diagnosing a child’s self-care problem, such as toileting or dressing, is highly influenced by an occupational therapist’s experience and time constraints. Thus, there is a need for objective means to detect and predict in advance the self-care problems of children with physical and motor disabilities. We use clustering to discover interesting information from self-care problems, perform automatic classification of binary data, and discover outliers. The advantages are twofold: the advancement of knowledge on identifying self-care problems in …


Scalable Clustering For Immune Repertoire Sequence Analysis, Prem Bhusal Jan 2019

Browse all Theses and Dissertations

The development of next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how the immune system of a patient evolves over different stages of disease development. A recent study has shown that the hierarchical clustering (HC) algorithm gives the best results for B-cell clone analysis, an important type of immune repertoire sequencing (IR-Seq) analysis. However, due to its inherent complexity, the classical hierarchical clustering algorithm does not scale well to large sequence datasets. Surprisingly, no algorithms …


Towards An Efficient Data Fragmentation, Allocation, And Clustering Approach In A Distributed Environment, Hassan Abdalla, Abdel Monim Artoli Jan 2019

All Works

© 2019 by the authors. Data fragmentation and allocation have long proven to be an efficient technique for improving the performance of distributed database systems (DDBSs). A crucial feature of any successful DDBS design is an emphasis on minimizing transmission costs (TC). This work, therefore, focuses on improving distribution performance based on transmission cost minimization. To do so, data fragmentation and allocation techniques are utilized in this work, along with an investigation of several data replication scenarios. Moreover, site clustering is leveraged with the aim of producing the minimum possible number of highly balanced clusters. By doing so, …