Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Open Access. Powered by Scholars. Published by Universities.®

Articles 1 - 30 of 62

Full-Text Articles in Physical Sciences and Mathematics

Sewordsim: Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall Jun 2014

Sewordsim: Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall

David LO

Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships between many pairs of words. Recently, a number of techniques have been proposed to address software engineering issues such as code search and fault localization that require understanding natural language documents, and a measure of word similarity could improve their results. However, WordNet only contains information about words senses in general-purpose conversation, which often differ from word senses in …


Leveraging Machine Learning And Information Retrieval Techniques In Software Evolution Tasks: Summary Of The First Malir-Se Workshop, At Ase 2013, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin Mcmillan Jun 2014

Leveraging Machine Learning And Information Retrieval Techniques In Software Evolution Tasks: Summary Of The First Malir-Se Workshop, At Ase 2013, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin Mcmillan

David LO

The first International Workshop on MAchine Learning and Information Retrieval for Software Evolution (MALIR-SE) was held on the 11th of November 2013. The workshop was held in conjunction with the 28th IEEE/ACM International Conference on Automated Software Engineering (ASE) in Silicon Valley, California, USA. The workshop brought researchers and practitioners that were interested in leveraging machine learning and information retrieval techniques to automate various software evolution tasks. During the workshop, papers on the application of machine learning and information retrieval techniques to bug fix time prediction and anti-pattern detection were presented. There were also discussions on the presented papers and …


Hierarchical Parallel Algorithm For Modularity-Based Community Detection Using Gpus, Chun Yew Cheong, Huynh Phung Huynh, David Lo, Rick Siow Mong Goh Jun 2014

Hierarchical Parallel Algorithm For Modularity-Based Community Detection Using Gpus, Chun Yew Cheong, Huynh Phung Huynh, David Lo, Rick Siow Mong Goh

David LO

This paper describes the design of a hierarchical parallel algorithm for accelerating community detection which involves partitioning a network into communities of densely connected nodes. The algorithm is based on the Louvain method developed at the Université Catholique de Louvain, which uses modularity to measure community quality and has been successfully applied on many different types of networks. The proposed hierarchical parallel algorithm targets three levels of parallelism in the Louvain method and it has been implemented on single-GPU and multi-GPU architectures. Benchmarking results on several large web-based networks and popular social networks show that on top of offering speedups …


Software Internationalization And Localization: An Industrial Experience, Xin Xia, David Lo, Feng Zhu, Xinyu Wang, Bo Zhou Jun 2014

Software Internationalization And Localization: An Industrial Experience, Xin Xia, David Lo, Feng Zhu, Xinyu Wang, Bo Zhou

David LO

Software internationalization and localization are important steps in distributing and deploying software to different regions of the world. Internationalization refers to the process of reengineering a system such that it could support various languages and regions without further modification. Localization refers to the process of adapting an internationalized software for a specific language or region. Due to various reasons, many large legacy systems did not consider internationalization and localization at the early stage of development. In this paper, we present our experience on, and propose a process along with tool supports for software internationalization and localization. We reengineer a large …


Leveraging Web 2.0 For Software Evolution, Yuan Tian, David Lo Jun 2014

Leveraging Web 2.0 For Software Evolution, Yuan Tian, David Lo

David LO

In this era of Web 2.0, much information is available on the Internet. Software forums, mailing lists, and question-and-answer sites contain lots of technical information. Blogs contain developers’ opinions, ideas, and descriptions of their day-to-day activities. Microblogs contain recent and popular software news. Software forges contain records of socio-technical interactions of developers. All these resources could potentially be leveraged to help developers in performing software evolution activities. In this chapter, we first present information that is available from these Web 2.0 resources. We then introduce empirical studies that investigate how developers contribute information to and use these resources. Next, we …


An Empirical Study Of Bugs In Build Process, Xiaoqiong Zhao, Xin Xia, Pavneet Singh Kochhar, David Lo, Shanping Li Jun 2014

An Empirical Study Of Bugs In Build Process, Xiaoqiong Zhao, Xin Xia, Pavneet Singh Kochhar, David Lo, Shanping Li

David LO

Software build process translates source codes into executable programs, packages the programs, generates documents, and distributes products. In this paper, we perform an empirical study to characterize build process bugs. We analyze bugs in build process in 5 open-source systems under Apache namely CXF, Camel, Felix, Struts, and Tuscany. We compare build process bugs and other bugs across 3 different dimensions, i.e., bug severity, bug fix time, and the number of files modified to fix a bug. Our results show that the fraction of build process bugs which are above major severity level is lower than that of other bugs. …


Build System Analysis With Link Prediction, Xin Xia, David Lo, Xinyu Wang, Bo Zhou Jun 2014

Build System Analysis With Link Prediction, Xin Xia, David Lo, Xinyu Wang, Bo Zhou

David LO

Compilation is an important step in building working software system. To compile large systems, typically build systems, such as make, are used. In this paper, we investigate a new research problem for build configuration file (e.g., Makefile) analysis: how to predict missed dependencies in a build configuration file. We refer to this problem as dependency mining. Based on a Makefile, we build a dependency graph capturing various relationships defined in the Makefile. By representing a Makefile as a dependency graph, we map the dependency mining problem to a link prediction problem, and leverage 9 state-of-the-art link prediction algorithms to solve …


Collaboration Patterns In Software Developer Network, Didi Surian, David Lo, Ee Peng Lim Jun 2014

Collaboration Patterns In Software Developer Network, Didi Surian, David Lo, Ee Peng Lim

David LO

No abstract provided.


An Empirical Study Of Bug Report Field Reassignment, Xin Xia, David Lo, Ming Wen, Shihab Emad, Bo Zhou Jun 2014

An Empirical Study Of Bug Report Field Reassignment, Xin Xia, David Lo, Ming Wen, Shihab Emad, Bo Zhou

David LO

A bug report contains many fields, such as product, component, severity, priority, fixer, operating system (OS), platform, etc., which provide important information for the bug triaging and fixing process. It is important to make sure that bug information is correct since previous studies showed that the wrong assignment of bug report fields could increase the bug fixing time, and even delay the delivery of the software. In this paper, we perform an empirical study on bug report field reassignments in open-source software projects. To better understand why bug report fields are reassigned, we manually collect 99 recent bug reports that …


Predicting Best Answerers For New Questions: An Approach Leveraging Topic Modeling And Collaborative Voting, Yuan Tian, Pavneet Singh Kochhar, Ee Peng Lim, Feida Zhu, David Lo Jun 2014

Predicting Best Answerers For New Questions: An Approach Leveraging Topic Modeling And Collaborative Voting, Yuan Tian, Pavneet Singh Kochhar, Ee Peng Lim, Feida Zhu, David Lo

David LO

Community Question Answering (CQA) sites are becoming increasingly important source of information where users can share knowledge on various topics. Although these platforms bring new opportunities for users to seek help or provide solutions, they also pose many challenges with the ever growing size of the community. The sheer number of questions posted everyday motivates the problem of routing questions to the appropriate users who can answer them. In this paper, we propose an approach to predict the best answerer for a new question on CQA site. Our approach considers both user interest and user expertise relevant to the topics …


On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Achananuparp Palakorn, David Lo Jun 2014

On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Achananuparp Palakorn, David Lo

David LO

Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (shorten as TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of irreversible outcome property which can be combined with pattern mining so as to automatically extract TPRs from any given game datasets. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine …


Automated Construction Of A Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall Jun 2014

Automated Construction Of A Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall

David LO

Many automated software engineering approaches, including code search, bug report categorization, and duplicate bug report detection, measure similarities between two documents by analyzing natural language contents. Often different words are used to express the same meaning and thus measuring similarities using exact matching of words is insufficient. To solve this problem, past studies have shown the need to measure the similarities between pairs of words. To meet this need, the natural language processing community has built WordNet which is a manually constructed lexical database that records semantic relations among words and can be used to measure how similar two words …


Towards More Accurate Multi-Label Software Behavior Learning, Xin Xia, Feng Yang, David Lo, Zhenyu Chen, Xinyu Wang Jun 2014

Towards More Accurate Multi-Label Software Behavior Learning, Xin Xia, Feng Yang, David Lo, Zhenyu Chen, Xinyu Wang

David LO

In a modern software system, when a program fails, a crash report which contains an execution trace would be sent to the software vendor for diagnosis. A crash report which corresponds to a failure could be caused by multiple types of faults simultaneously. Many large companies such as Baidu organize a team to analyze these failures, and classify them into multiple labels (i.e., multiple types of faults). However, it would be time-consuming and difficult for developers to manually analyze these failures and come out with appropriate fault labels. In this paper, we automatically classify a failure into multiple types of …


Proceedings Of The 1st International Workshop On Machine Learning And Information Retrieval For Software Evolution, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin Mcmillan Jun 2014

Proceedings Of The 1st International Workshop On Machine Learning And Information Retrieval For Software Evolution, - Lucia, David Lo, Giuseppe Scanniello, Alessandro Marchetto, Nasir Ali, Collin Mcmillan

David LO

No abstract provided.


Proceedings Of The 2nd International Workshop On Software Mining, Ming Li, Hongyu Zhang, David Lo Jun 2014

Proceedings Of The 2nd International Workshop On Software Mining, Ming Li, Hongyu Zhang, David Lo

David LO

No abstract provided.


Boat: An Experimental Platform For Researchers To Comparatively And Reproducibly Evaluate Bug Localization Techniques, Xinyu Wang, David Lo, Xin Xia, Xingen Wang, Pavneet Singh Kochhar, Yuan Tian, Xiaohu Yang, Shanping Li, Jianling Sun, Bo Zhou Jun 2014

Boat: An Experimental Platform For Researchers To Comparatively And Reproducibly Evaluate Bug Localization Techniques, Xinyu Wang, David Lo, Xin Xia, Xingen Wang, Pavneet Singh Kochhar, Yuan Tian, Xiaohu Yang, Shanping Li, Jianling Sun, Bo Zhou

David LO

Bug localization refers to the process of identifying source code files that contain defects from descriptions of these defects which are typically contained in bug reports. There have been many bug localization techniques proposed in the literature. However, often it is hard to compare these techniques since different evaluation datasets are used. At times the datasets are not made publicly available and thus it is difficult to reproduce reported results. Furthermore, some techniques are only evaluated on small datasets and thus it is not clear whether the results are generalizable. Thus, there is a need for a platform that allows …


F-Trail: Finding Patterns In Taxi Trajectories, Yasuko Matsubara, Evangelos Papalexakis, Lei Li, David Lo, Yasushi Sakurai, Christos Faloutsos Apr 2013

F-Trail: Finding Patterns In Taxi Trajectories, Yasuko Matsubara, Evangelos Papalexakis, Lei Li, David Lo, Yasushi Sakurai, Christos Faloutsos

David LO

Given a large number of taxi trajectories, we would like to find interesting and unexpected patterns from the data. How can we summarize the major trends, and how can we spot anomalies? The analysis of trajectories has been an issue of considerable interest with many applications such as tracking trails of migrating animals and predicting the path of hurricanes. Several recent works propose methods on clustering and indexing trajectories data. However, these approaches are not especially well suited to pattern discovery with respect to the dynamics of social and economic behavior. To further analyze a huge collection of taxi trajectories, …


A Comparative Study Of Supervised Learning Algorithms For Re-Opened Bug Prediction, Xin Xia, David Lo, Xinyu Wang, Xiaohu Yang, Shanping Li, Jianling Sun Apr 2013

A Comparative Study Of Supervised Learning Algorithms For Re-Opened Bug Prediction, Xin Xia, David Lo, Xinyu Wang, Xiaohu Yang, Shanping Li, Jianling Sun

David LO

Bug fixing is a time-consuming and costly job which is performed in the whole life cycle of software development and maintenance. For many systems, bugs are managed in bug management systems such as Bugzilla. Generally, the status of a typical bug report in Bugzilla changes from new to assigned, verified and closed. However, some bugs have to be reopened. Reopened bugs increase the software development and maintenance cost, increase the workload of bug fixers, and might even delay the future delivery of a software. Only a few studies investigate the phenomenon of reopened bug reports. In this paper, we evaluate …


An Empirical Study On Developer Interactions In Stackoverflow, Shaowei Wang, David Lo, Lingxiao Jiang Apr 2013

An Empirical Study On Developer Interactions In Stackoverflow, Shaowei Wang, David Lo, Lingxiao Jiang

David LO

No abstract provided.


Semantically Related Software Terms And Their Taxonomy By Leveraging Collaborative Tagging, Shaowei Wang, David Lo, Lingxiao Jiang Dec 2012

Semantically Related Software Terms And Their Taxonomy By Leveraging Collaborative Tagging, Shaowei Wang, David Lo, Lingxiao Jiang

David LO

Millions of people, including those in the software engineering communities have turned to microblogging services, such as Twitter, as a means to quickly disseminate information. A number of past studies by Treude et al., Storey, and Yuan et al. have shown that a wealth of interesting information is stored in these microblogs. However, microblogs also contain a large amount of noisy content that are less relevant to software developers in engineering software systems. In this work, we perform a preliminary study to investigate the feasibility of automatic classification of microblogs into two categories: relevant and irrelevant to engineering software systems. …


Mining Indirect Antagonistic Communities From Social Interactions, Kuan Zhang, David Lo, Ee Peng Lim, Philips Kokoh Prasetyo Dec 2012

Mining Indirect Antagonistic Communities From Social Interactions, Kuan Zhang, David Lo, Ee Peng Lim, Philips Kokoh Prasetyo

David LO

Antagonistic communities refer to groups of people with opposite tastes, opinions, and factions within a community. Given a set of interactions among people in a community, we develop a novel pattern mining approach to mine a set of antagonistic communities. In particular, based on a set of user-specified thresholds, we extract a set of pairs of communities that behave in opposite ways with one another. We focus on extracting a compact lossless representation based on the concept of closed patterns to prevent exploding the number of mined antagonistic communities. We also present a variation of the algorithm using a divide …


Searching Connected Api Subgraph Via Text Phrases, Wing-Kwan Chan, Hong Cheng, David Lo Dec 2012

Searching Connected Api Subgraph Via Text Phrases, Wing-Kwan Chan, Hong Cheng, David Lo

David LO

Reusing APIs of existing libraries is a common practice during software development, but searching suitable APIs and their usages can be time-consuming [6]. In this paper, we study a new and more practical approach to help users find usages of APIs given only simple text phrases, when users have limited knowledge about an API library. We model API invocations as an API graph and aim to find an optimum connected subgraph that meets users' search needs. The problem is challenging since the search space in an API graph is very huge. We start with a greedy subgraph search algorithm which …


Learning Extended Fsa From Software: An Empirical Assessment, David Lo, Leonardo Mariani, Mauro Santoro Dec 2012

Learning Extended Fsa From Software: An Empirical Assessment, David Lo, Leonardo Mariani, Mauro Santoro

David LO

A number of techniques that infer finite state automata from execution traces have been used to support test and analysis activities. Some of these techniques can produce automata that integrate information about the data-flow, that is, they also represent how data values affect the operations executed by programs. The integration of information about operation sequences and data values into a unique model is indeed conceptually useful to accurately represent the behavior of a program. However, it is still unclear whether handling heterogeneous types of information, such as operation sequences and data values, necessarily produces higher quality models or not. In …


Scenario-Based And Value-Based Specification Mining: Better Together, David Lo, Shahar Maoz Dec 2012

Scenario-Based And Value-Based Specification Mining: Better Together, David Lo, Shahar Maoz

David LO

Specification mining takes execution traces as input and extracts likely program invariants, which can be used for comprehension, verification, and evolution related tasks. In this work we integrate scenario-based specification mining, which uses a data-mining algorithm to suggest ordering constraints in the form of live sequence charts, an inter-object, visual, modal, scenario-based specification language, with mining of value-based invariants, which detects likely invariants holding at specific program points. The key to the integration is a technique we call scenario-based slicing, running on top of the mining algorithms to distinguish the scenario-specific invariants from the general ones. The resulting suggested specifications …


Observatory Of Trends In Software Related Microblogs, Achananuparp Palakorn, Nelman Lubis Ibrahim, Yuan Tian, David Lo, Ee Peng Lim Dec 2012

Observatory Of Trends In Software Related Microblogs, Achananuparp Palakorn, Nelman Lubis Ibrahim, Yuan Tian, David Lo, Ee Peng Lim

David LO

Microblogging has recently become a popular means to disseminate information among millions of people. Interestingly, software developers also use microblog to communicate with one another. Different from traditional media, microblog users tend to focus on recency and informality of content. Many tweet contents are relatively more personal and Opinionated, compared to that of traditional news report. Thus, by analyzing microblogs, one could get the up-to-date information about what people are interested in or feel toward a particular topic. In this paper, we describe our microblog observatory that aggregates more than 70,000 Twitter feeds, captures software-related tweets, and computes trends from …


Kbe-Anonymity: Test Data Anonymization For Evolving Programs, - Lucia, David Lo, Lingxiao Jiang, Aditya Budi Dec 2012

Kbe-Anonymity: Test Data Anonymization For Evolving Programs, - Lucia, David Lo, Lingxiao Jiang, Aditya Budi

David LO

High-quality test data that is useful for effective testing is often available on users’ site. However, sharing data owned by users with software vendors may raise privacy concerns. Techniques are needed to enable data sharing among data owners and the vendors without leaking data privacy. Evolving programs bring additional challenges because data may be shared multiple times for every version of a program. When multiple versions of the data are cross-referenced, private information could be inferred. Although there are studies addressing the privacy issue of data sharing for testing and debugging, little work has explicitly addressed the challenges when programs …


To What Extent Could We Detect Field Defects? —An Empirical Study Of False Negatives In Static Bug Finding Tools, Ferdian Thung, - Lucia, David Lo, Lingxiao Jiang, Premkumar Devanbu, Foyzur Rahman Dec 2012

To What Extent Could We Detect Field Defects? —An Empirical Study Of False Negatives In Static Bug Finding Tools, Ferdian Thung, - Lucia, David Lo, Lingxiao Jiang, Premkumar Devanbu, Foyzur Rahman

David LO

Software defects can cause much loss. Static bug-finding tools are believed to help detect and remove defects. These tools are designed to find programming errors; but, do they in fact help prevent actual defects that occur in the field and reported by users? If these tools had been used, would they have detected these field defects, and generated warnings that would direct programmers to fix them? To answer these questions, we perform an empirical study that investigates the effectiveness of state-of-the-art static bug finding tools on hundreds of reported and fixed defects extracted from three open source programs: Lucene, Rhino, …


Information Retrieval Based Nearest Neighbor Classification For Fine-Grained Bug Severity Prediction, Yuan Tian, David Lo, Chengnian Sun Dec 2012

Information Retrieval Based Nearest Neighbor Classification For Fine-Grained Bug Severity Prediction, Yuan Tian, David Lo, Chengnian Sun

David LO

Bugs are prevalent in software systems. Some bugs are critical and need to be fixed right away, whereas others are minor and their fixes could be postponed until resources are available. In this work, we propose a new approach leveraging information retrieval, in particular BM25-based document similarity function, to automatically predict the severity of bug reports. Our approach automatically analyzes bug reports reported in the past along with their assigned severity labels, and recommends severity labels to newly reported bug reports. Duplicate bug reports are utilized to determine what bug report features, be it textual, ordinal, or categorical, are important. …


Semantic Patch Inference, Jesper Abdersen, Anh Cuong Nguyen, David Lo, Julia Lawall, Siau-Cheng Khoo Dec 2012

Semantic Patch Inference, Jesper Abdersen, Anh Cuong Nguyen, David Lo, Julia Lawall, Siau-Cheng Khoo

David LO

We propose a tool for inferring transformation specifications from a few examples of original and updated code. These transformation specifications may contain multiple code fragments from within a single function, all of which must be present for the transformation to apply. This makes the inferred transformations context sensitive. Our algorithm is based on depth-first search, with pruning. Because it is applied locally to a collection of functions that contain related changes, it is efficient in practice. We illustrate the approach on an example drawn from recent changes to the Linux kernel.


Duplicate Bug Report Detection With A Combination Of Information Retrieval And Topic Modeling, Anh Tuan Nguyen, Tung Nguyen, Tien Nguyen, David Lo, Chengnian Sun Dec 2012

Duplicate Bug Report Detection With A Combination Of Information Retrieval And Topic Modeling, Anh Tuan Nguyen, Tung Nguyen, Tien Nguyen, David Lo, Chengnian Sun

David LO

Detecting duplicate bug reports helps reduce triaging efforts and save time for developers in fixing the same issues. Among several automated detection approaches, text-based information retrieval (IR) approaches have been shown to outperform others in term of both accuracy and time efficiency. However, those IR-based approaches do not detect well the duplicate reports on the same technical issues written in different descriptive terms. This paper introduces DBTM, a duplicate bug report detection approach that takes advantage of both IR-based features and topic-based features. DBTM models a bug report as a textual document describing certain technical issue(s), and models duplicate bug …