Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Full-Text Articles in Physical Sciences and Mathematics

SEWordSim: Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall Jun 2014

David LO

Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships between many pairs of words. Recently, a number of techniques have been proposed to address software engineering issues such as code search and fault localization that require understanding natural language documents, and a measure of word similarity could improve their results. However, WordNet only contains information about word senses in general-purpose conversation, which often differ from word senses in …


An Empirical Study Of Bugs In Software Build Systems, Xin Xia, Xiaozhen Zhou, David Lo, Xiaoqiong Zhao Jun 2014

A build system converts source code, libraries and other data into executable programs by orchestrating the execution of compilers and other tools. The whole building process is managed by a software build system, such as Make, Ant, CMake, Maven, Scons, and QMake. The reliability of software build systems affects the reliability of the build process. In this paper, we perform an empirical study on bugs in software build systems. We analyze four typical and widely used software build systems, Ant, Maven, CMake and QMake, which can be used to build Java, C, and C++ systems. We …


Understanding The Genetic Makeup Of Linux Device Drivers, Peter Senna Tschudin, Laurent Réveillère, Lingxiao Jiang, David Lo, Julia Lawall Jun 2014

No abstract provided.


Leveraging Machine Learning And Information Retrieval Techniques In Software Evolution Tasks: Summary Of The First MALIR-SE Workshop, At ASE 2013, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin McMillan Jun 2014

The first International Workshop on MAchine Learning and Information Retrieval for Software Evolution (MALIR-SE) was held on the 11th of November 2013. The workshop was held in conjunction with the 28th IEEE/ACM International Conference on Automated Software Engineering (ASE) in Silicon Valley, California, USA. The workshop brought together researchers and practitioners who were interested in leveraging machine learning and information retrieval techniques to automate various software evolution tasks. During the workshop, papers on the application of machine learning and information retrieval techniques to bug fix time prediction and anti-pattern detection were presented. There were also discussions on the presented papers and …


Hierarchical Parallel Algorithm For Modularity-Based Community Detection Using GPUs, Chun Yew Cheong, Huynh Phung Huynh, David Lo, Rick Siow Mong Goh Jun 2014

This paper describes the design of a hierarchical parallel algorithm for accelerating community detection, which involves partitioning a network into communities of densely connected nodes. The algorithm is based on the Louvain method developed at the Université Catholique de Louvain, which uses modularity to measure community quality and has been successfully applied to many different types of networks. The proposed hierarchical parallel algorithm targets three levels of parallelism in the Louvain method and it has been implemented on single-GPU and multi-GPU architectures. Benchmarking results on several large web-based networks and popular social networks show that on top of offering speedups …
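
As an illustration of the modularity measure mentioned in the abstract above, here is a minimal sequential sketch of Newman modularity for an undirected, unweighted graph. It is not the paper's GPU algorithm; the function name `modularity`, the toy graph `EDGES`, and the two partitions are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in one community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    total_degree = defaultdict(int)  # sum of node degrees per community
    for node, d in degree.items():
        total_degree[community[node]] += d
    # Q = sum over communities of (internal edge share - expected share)
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
SPLIT = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
MERGED = {n: "a" for n in range(6)}

if __name__ == "__main__":
    print(modularity(EDGES, SPLIT))   # one community per triangle
    print(modularity(EDGES, MERGED))  # everything in one community
```

Splitting the toy graph into its two triangles scores higher than placing all nodes in one community, which is the kind of comparison a modularity-based method relies on when deciding whether to move a node.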


Software Internationalization And Localization: An Industrial Experience, Xin Xia, David Lo, Feng Zhu, Xinyu Wang, Bo Zhou Jun 2014

Software internationalization and localization are important steps in distributing and deploying software to different regions of the world. Internationalization refers to the process of reengineering a system such that it could support various languages and regions without further modification. Localization refers to the process of adapting internationalized software for a specific language or region. Due to various reasons, many large legacy systems did not consider internationalization and localization at the early stage of development. In this paper, we present our experience with, and propose a process along with tool support for, software internationalization and localization. We reengineer a large …


Orion: A Software Project Search Engine With Integrated Diverse Software Artifacts, Tegawende F. Bissyande, Ferdian Thung, David Lo, Lingxiao Jiang, Laurent Réveillère Jun 2014

Software projects produce a wealth of data that is leveraged in different tasks and for different purposes: researchers collect project data for building experimental datasets; software programmers reuse code from projects; developers often explore the opportunities for getting involved in the development of a project to gain or offer expertise. Finding relevant projects that suit one's needs is, however, currently challenging with the capabilities of existing search systems. We propose Orion, an integrated search engine architecture that combines information from different types of software repositories from multiple sources to facilitate the construction and execution of advanced search queries. Orion provides …


Leveraging Web 2.0 For Software Evolution, Yuan Tian, David Lo Jun 2014

In this era of Web 2.0, much information is available on the Internet. Software forums, mailing lists, and question-and-answer sites contain lots of technical information. Blogs contain developers’ opinions, ideas, and descriptions of their day-to-day activities. Microblogs contain recent and popular software news. Software forges contain records of socio-technical interactions of developers. All these resources could potentially be leveraged to help developers in performing software evolution activities. In this chapter, we first present information that is available from these Web 2.0 resources. We then introduce empirical studies that investigate how developers contribute information to and use these resources. Next, we …


Predicting Response In Mobile Advertising With Hierarchical Importance-Aware Factorization Machine, Richard Jayadi Oentaryo, Ee Peng Lim, Jia Wei Low, David Lo, Michael Finegold Jun 2014

Mobile advertising has recently seen dramatic growth, fueled by the global proliferation of mobile phones and devices. The task of predicting ad response is thus crucial for maximizing business revenue. However, ad response data change dynamically over time, and are subject to cold-start situations in which limited history hinders reliable prediction. There is also a need for a robust regression estimation for high prediction accuracy, and good ranking to distinguish the impacts of different ads. To this end, we develop a Hierarchical Importance-aware Factorization Machine (HIFM), which provides an effective generic latent factor framework that incorporates importance weights and hierarchical …


Got Issues? Who Cares About It? A Large Scale Investigation Of Issue Trackers From GitHub, Tegawende F. Bissyande, David Lo, Lingxiao Jiang, Laurent Réveillère, Jacques Klein, Yves Le Traon Jun 2014

Feedback from software users constitutes a vital part in the evolution of software projects. By filing issue reports, users help identify and fix bugs, document software code, and enhance the software via feature requests. Many studies have explored issue reports, proposed approaches to enable the submission of higher-quality reports, and presented techniques to sort, categorize and leverage issues for software engineering needs. Who, however, cares about filing issues? What kind of issues are reported in issue trackers? What kind of correlation exists between issue reporting and the success of software projects? In this study, we address the need for answering …


Clustering Of Search Trajectory And Its Application To Parameter Tuning, Linda Lindawati, Hoong Chuin Lau, David Lo Jun 2014

This paper is concerned with automated classification of Combinatorial Optimization Problem instances for the purpose of instance-specific parameter tuning. We propose the CluPaTra Framework, a generic approach to CLUster instances based on similar PAtterns according to search TRAjectories and apply it to parameter tuning. The key idea is to use the search trajectory as a generic feature for clustering problem instances. The advantage of using search trajectory is that it can be obtained from any local-search based algorithm with small additional computation time. We explore and compare two different search trajectory representations, two sequence alignment techniques (to calculate similarities) as well as …


Automatic Recommendation Of API Methods From Feature Requests, Ferdian Thung, Shaowei Wang, David Lo, Julia Lawall Jun 2014

Developers often receive many feature requests. To implement these features, developers can leverage various methods from third party libraries. In this work, we propose an automated approach that takes as input a textual description of a feature request. It then recommends methods in library APIs that developers can use to implement the feature. Our recommendation approach learns from records of other changes made to software systems, and compares the textual description of the requested feature with the textual descriptions of various API methods. We have evaluated our approach on more than 500 feature requests of Axis2/Java, CXF, Hadoop Common, HBase, …


An Empirical Study Of Bugs In Build Process, Xiaoqiong Zhao, Xin Xia, Pavneet Singh Kochhar, David Lo, Shanping Li Jun 2014

The software build process translates source code into executable programs, packages the programs, generates documents, and distributes products. In this paper, we perform an empirical study to characterize build process bugs. We analyze bugs in the build process of 5 open-source systems under Apache, namely CXF, Camel, Felix, Struts, and Tuscany. We compare build process bugs and other bugs across 3 different dimensions, i.e., bug severity, bug fix time, and the number of files modified to fix a bug. Our results show that the fraction of build process bugs which are above major severity level is lower than that of other bugs. …


Build System Analysis With Link Prediction, Xin Xia, David Lo, Xinyu Wang, Bo Zhou Jun 2014

Compilation is an important step in building a working software system. To compile large systems, typically build systems, such as make, are used. In this paper, we investigate a new research problem for build configuration file (e.g., Makefile) analysis: how to predict missed dependencies in a build configuration file. We refer to this problem as dependency mining. Based on a Makefile, we build a dependency graph capturing various relationships defined in the Makefile. By representing a Makefile as a dependency graph, we map the dependency mining problem to a link prediction problem, and leverage 9 state-of-the-art link prediction algorithms to solve …


Will Fault Localization Work For These Failures? An Automated Approach To Predict Effectiveness Of Fault Localization Tools, Tien-Duy B. Le, David Lo Jun 2014

Debugging is a crucial yet expensive activity to improve the reliability of software systems. To reduce debugging cost, various fault localization tools have been proposed. A spectrum-based fault localization tool often outputs an ordered list of program elements sorted based on their likelihood to be the root cause of a set of failures (i.e., their suspiciousness scores). Unfortunately, despite the many studies on fault localization, the root causes of many bugs are often low in the ordered list. This potentially causes developers to distrust fault localization tools. Recently, Parnin and Orso highlight in their user study that many debuggers …
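
To make the idea of a suspiciousness-ranked list concrete, the sketch below ranks program elements with the Ochiai formula, one commonly used spectrum-based score. The abstract does not say which formula the studied tools use, so Ochiai is an assumption here, and the function name `ochiai_ranking` and the coverage data are hypothetical.

```python
import math


def ochiai_ranking(coverage, outcomes):
    """Rank program elements by Ochiai suspiciousness, highest first.

    coverage: dict mapping test name -> set of elements the test executed.
    outcomes: dict mapping test name -> True if it passed, False if it failed.
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values())
    scores = {}
    for element in elements:
        # ef / ep: failing / passing tests that executed this element
        ef = sum(1 for t, cov in coverage.items()
                 if element in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items()
                 if element in cov and outcomes[t])
        denom = math.sqrt(total_failed * (ef + ep))
        scores[element] = ef / denom if denom else 0.0
    # Sort by descending score, breaking ties by element name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical spectrum: three tests, two of which fail.
COVERAGE = {
    "t1": {"line1", "line2"},
    "t2": {"line1", "line3"},
    "t3": {"line1", "line2", "line3"},
}
OUTCOMES = {"t1": True, "t2": False, "t3": False}

if __name__ == "__main__":
    for element, score in ochiai_ranking(COVERAGE, OUTCOMES):
        print(f"{element}: {score:.3f}")
```

In this toy spectrum `line3` is executed only by failing tests and so ranks first. The abstract's concern is the opposite situation, where the true root cause ends up far down such a list.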


Drone: Predicting Priority Of Reported Bugs By Multi-Factor Analysis, Yuan Tian, David Lo, Chengnian Sun Jun 2014

Bugs are prevalent. To improve software quality, developers often allow users to report bugs that they found using a bug tracking system such as Bugzilla. Users would specify, among other things, a description of the bug, the component that is affected by the bug, and the severity of the bug. Based on this information, bug triagers would then assign a priority level to the reported bug. As resources are limited, bug reports would be investigated based on their priority levels. This priority assignment process, however, is a manual one. Could we do better? In this paper, we propose an automated …


Collaboration Patterns In Software Developer Network, Didi Surian, David Lo, Ee Peng Lim Jun 2014

No abstract provided.


An Empirical Study Of Bug Report Field Reassignment, Xin Xia, David Lo, Ming Wen, Emad Shihab, Bo Zhou Jun 2014

A bug report contains many fields, such as product, component, severity, priority, fixer, operating system (OS), platform, etc., which provide important information for the bug triaging and fixing process. It is important to make sure that bug information is correct since previous studies showed that the wrong assignment of bug report fields could increase the bug fixing time, and even delay the delivery of the software. In this paper, we perform an empirical study on bug report field reassignments in open-source software projects. To better understand why bug report fields are reassigned, we manually collect 99 recent bug reports that …


Predicting Best Answerers For New Questions: An Approach Leveraging Topic Modeling And Collaborative Voting, Yuan Tian, Pavneet Singh Kochhar, Ee Peng Lim, Feida Zhu, David Lo Jun 2014

Community Question Answering (CQA) sites are becoming an increasingly important source of information where users can share knowledge on various topics. Although these platforms bring new opportunities for users to seek help or provide solutions, they also pose many challenges with the ever-growing size of the community. The sheer number of questions posted every day motivates the problem of routing questions to the appropriate users who can answer them. In this paper, we propose an approach to predict the best answerer for a new question on a CQA site. Our approach considers both user interest and user expertise relevant to the topics …


On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Palakorn Achananuparp, David Lo Jun 2014

Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (abbreviated as TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of irreversible outcome property which can be combined with pattern mining so as to automatically extract TPRs from any given game dataset. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine …


Mining Branching-Time Scenarios, Dirk Fahland, David Lo, Shahar Maoz Jun 2014

Specification mining extracts candidate specifications from existing systems, to be used for downstream tasks such as testing and verification. Specifically, we are interested in the extraction of behavior models from execution traces. In this paper we introduce mining of branching-time scenarios in the form of existential, conditional Live Sequence Charts, using a statistical data-mining algorithm. We show the power of branching scenarios to reveal alternative scenario-based behaviors, which could not be mined by previous approaches. The work contrasts and complements previous works on mining linear-time scenarios. An implementation and evaluation over execution trace sets recorded from several real-world applications shows …


Automated Construction Of A Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall Jun 2014

Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure similarities between two documents by analyzing natural language contents. Often different words are used to express the same meaning and thus measuring similarities using exact matching of words is insufficient. To solve this problem, past studies have shown the need to measure the similarities between pairs of words. To meet this need, the natural language processing community has built WordNet which is a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words …


Popularity, Interoperability, And Impact Of Programming Languages In 100,000 Open Source Projects, Tegawende F. Bissyande, Ferdian Thung, David Lo, Lingxiao Jiang, Laurent Réveillère Jun 2014

Programming languages have been proposed even before the era of the modern computer. As the years have gone by, computer resources have increased and application domains have expanded, leading to the proliferation of hundreds of programming languages, each attempting to improve over others or to address new programming paradigms. These languages range from procedural languages like C and object-oriented languages like Java to functional languages such as ML and Haskell. Unfortunately, there is a lack of large-scale and comprehensive studies that examine the “popularity”, “interoperability”, and “impact” of various programming languages. To fill this gap, this study investigates a hundred thousand …


Towards More Accurate Multi-Label Software Behavior Learning, Xin Xia, Feng Yang, David Lo, Zhenyu Chen, Xinyu Wang Jun 2014

In a modern software system, when a program fails, a crash report which contains an execution trace would be sent to the software vendor for diagnosis. A crash report which corresponds to a failure could be caused by multiple types of faults simultaneously. Many large companies such as Baidu organize a team to analyze these failures, and classify them into multiple labels (i.e., multiple types of faults). However, it would be time-consuming and difficult for developers to manually analyze these failures and come up with appropriate fault labels. In this paper, we automatically classify a failure into multiple types of …


Multi-Abstraction Concern Localization, Tien-Duy B. Le, Shaowei Wang, David Lo Jun 2014

Concern localization refers to the process of locating code units that match a particular textual description. It takes as input textual documents such as bug reports and feature requests and outputs a list of candidate code units that need to be changed to address the bug reports or feature requests. Many information retrieval (IR) based concern localization techniques have been proposed in the literature. These techniques typically represent code units and textual descriptions as a bag of tokens at one level of abstraction, e.g., each token is a word, or each token is a topic. In this work, we propose …


An Empirical Study Of Adoption Of Software Testing In Open Source Projects, Pavneet Singh Kochhar, Tegawende F. Bissyande, David Lo, Lingxiao Jiang Jun 2014

In software engineering, testing is a crucial activity that is designed to ensure the quality of program code. For this activity, software teams spend substantial resources constructing test cases to thoroughly assess the correctness of software functionality. What is the proportion of open source projects that include test cases? What is the effect of the number of developers on the number of test cases? In this study, we explore open source projects and investigate the correlation between the presence of test cases and various project development characteristics, including the number of lines of code, the size of development teams and the …


Proceedings Of The 1st International Workshop On Machine Learning And Information Retrieval For Software Evolution, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin McMillan Jun 2014

No abstract provided.


Automatic Recovery Of Root Causes From Bug-Fixing Changes, Ferdian Thung, David Lo, Lingxiao Jiang Jun 2014

What is the root cause of this failure? This question is often among the first few asked by software debuggers when they try to address issues raised by a bug report. The root cause consists of the erroneous lines of code that cause a chain of erroneous program states eventually leading to the failure. Bug tracking and source control systems only record the symptoms (e.g., bug reports) and treatments of a bug (e.g., committed changes that fix the bug), but not its root cause. Many treatments contain non-essential changes, which are intermingled with root causes. Reverse engineering the root cause of a …


Proceedings Of The 2nd International Workshop On Software Mining, Ming Li, Hongyu Zhang, David Lo Jun 2014

No abstract provided.


Tag Recommendation In Software Information Sites, Xin Xia, David Lo, Xinyu Wang, Bo Zhou Jun 2014

Nowadays, software engineers use a variety of online media to search for and become informed of new and interesting technologies, and to learn from and help one another. We refer to these kinds of online media which help software engineers improve their performance in software development, maintenance and test processes as software information sites. It is common to see tags in software information sites and many sites allow users to tag various objects with their own words. Users increasingly use tags to describe the most important features of their posted contents or projects. In this paper, we propose TagCombine, an automatic …