Open Access. Powered by Scholars. Published by Universities.®

Databases and Information Systems Commons

Articles 1 - 28 of 28

Full-Text Articles in Databases and Information Systems

Sewordsim: Software-Specific Word Similarity Database, Yuan Tian, David Lo, Julia Lawall Jun 2014

David LO

Measuring the similarity of words is important in accurately representing and comparing documents, and thus improves the results of many natural language processing (NLP) tasks. The NLP community has proposed various measurements based on WordNet, a lexical database that contains relationships between many pairs of words. Recently, a number of techniques have been proposed to address software engineering issues such as code search and fault localization that require understanding natural language documents, and a measure of word similarity could improve their results. However, WordNet only contains information about word senses in general-purpose conversation, which often differ from word senses in …


Predicting Response In Mobile Advertising With Hierarchical Importance-Aware Factorization Machine, Richard Jayadi Oentaryo, Ee Peng Lim, Jia Wei Low, David Lo, Michael Finegold Jun 2014

David LO

Mobile advertising has recently seen dramatic growth, fueled by the global proliferation of mobile phones and devices. The task of predicting ad response is thus crucial for maximizing business revenue. However, ad response data change dynamically over time, and are subject to cold-start situations in which limited history hinders reliable prediction. There is also a need for a robust regression estimation for high prediction accuracy, and good ranking to distinguish the impacts of different ads. To this end, we develop a Hierarchical Importance-aware Factorization Machine (HIFM), which provides an effective generic latent factor framework that incorporates importance weights and hierarchical …


Predicting Best Answerers For New Questions: An Approach Leveraging Topic Modeling And Collaborative Voting, Yuan Tian, Pavneet Singh Kochhar, Ee Peng Lim, Feida Zhu, David Lo Jun 2014

David LO

Community Question Answering (CQA) sites are becoming an increasingly important source of information where users can share knowledge on various topics. Although these platforms bring new opportunities for users to seek help or provide solutions, they also pose many challenges with the ever-growing size of the community. The sheer number of questions posted every day motivates the problem of routing questions to the appropriate users who can answer them. In this paper, we propose an approach to predict the best answerer for a new question on a CQA site. Our approach considers both user interest and user expertise relevant to the topics …


On Finding The Point Where There Is No Return: Turning Point Mining On Game Data, Wei Gong, Ee Peng Lim, Feida Zhu, Achananuparp Palakorn, David Lo Jun 2014

David LO

Gaming expertise is usually accumulated through playing or watching many game instances, and identifying critical moments in these game instances called turning points. Turning point rules (shortened to TPRs) are game patterns that almost always lead to some irreversible outcomes. In this paper, we formulate the notion of an irreversible outcome property, which can be combined with pattern mining so as to automatically extract TPRs from any given game dataset. We specifically extend the well-known PrefixSpan sequence mining algorithm by incorporating the irreversible outcome property. To show the usefulness of TPRs, we apply them to Tetris, a popular game. We mine …


R-Energy For Evaluating Robustness Of Dynamic Networks, Ming Gao, Ee Peng Lim, David Lo Jun 2014

David LO

The robustness of a network is determined by how well its vertices are connected to one another so as to keep the network strong and sustainable. As the network evolves, its robustness changes and may reveal events as well as periodic trend patterns that affect the interactions among users in the network. In this paper, we develop R-energy as a new measure of network robustness based on the spectral analysis of the normalized Laplacian matrix. R-energy can cope with disconnected networks, and is efficient to compute with a time complexity of O(|V| + |E|) where V and E are …


Extended Comprehensive Study Of Association Measures For Fault Localization, Lucia Lucia, David Lo, Lingxiao Jiang, Ferdian Thung, Aditya Budi Jun 2014

David LO

Spectrum-based fault localization is a promising approach to automatically locate root causes of failures quickly. Two well-known spectrum-based fault localization techniques, Tarantula and Ochiai, measure how likely a program element is to be a root cause of failures based on profiles of correct and failed program executions. These techniques are conceptually similar to association measures that have been proposed in statistics and data mining, and have been utilized to quantify the relationship strength between two variables of interest (e.g., the use of a medicine and the cure rate of a disease). In this paper, we view fault localization as a measurement of the …
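As a rough illustration of the two techniques named in this abstract, the sketch below computes Tarantula and Ochiai suspiciousness scores from execution counts using their commonly cited textbook formulas. The function names, parameter names, and the choice to return 0.0 when a denominator is zero are assumptions made for this example; they are not taken from the paper.

```python
import math


def tarantula(failed_e, passed_e, total_failed, total_passed):
    """Tarantula suspiciousness of one program element.

    failed_e / passed_e: number of failed / passed runs that executed the element.
    total_failed / total_passed: number of failed / passed runs overall.
    """
    fail_ratio = failed_e / total_failed if total_failed else 0.0
    pass_ratio = passed_e / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    # Assumption: an element never executed gets score 0.0.
    return fail_ratio / denom if denom else 0.0


def ochiai(failed_e, passed_e, total_failed):
    """Ochiai suspiciousness of one program element."""
    denom = math.sqrt(total_failed * (failed_e + passed_e))
    # Assumption: score 0.0 when the element was never executed.
    return failed_e / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 2 failed and 4 passed runs overall;
    # the element was executed by both failed runs and one passed run.
    print(tarantula(2, 1, 2, 4))
    print(ochiai(2, 1, 2))
```

Both scores rise when an element is executed mostly by failing runs, which is what lets them be read as measures of association between execution and failure.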


F-Trail: Finding Patterns In Taxi Trajectories, Yasuko Matsubara, Evangelos Papalexakis, Lei Li, David Lo, Yasushi Sakurai, Christos Faloutsos Apr 2013

David LO

Given a large number of taxi trajectories, we would like to find interesting and unexpected patterns from the data. How can we summarize the major trends, and how can we spot anomalies? The analysis of trajectories has been an issue of considerable interest with many applications such as tracking trails of migrating animals and predicting the path of hurricanes. Several recent works propose methods for clustering and indexing trajectory data. However, these approaches are not especially well suited to pattern discovery with respect to the dynamics of social and economic behavior. To further analyze a huge collection of taxi trajectories, …


Finding Relevant Answers In Software Forums, Swapna Gottopati, David Lo, Jing Jiang Dec 2011

David LO

Online software forums provide a huge amount of valuable content. Developers and users often ask questions and receive answers from such forums. The availability of a vast amount of thread discussions in forums provides ample opportunities for knowledge acquisition and summarization. For a given search query, current search engines use a traditional information retrieval approach to extract webpages containing relevant keywords. However, in software forums, there are often many threads containing similar keywords, and each thread could contain many posts, as many as 1,000 or more. Manually finding relevant answers from these long threads is a painstaking task to …


Mining Top-K Large Structural Patterns In A Massive Network, Feida Zhu, Qiang Qu, David Lo, Xifeng Yan, Jiawei Han, Philip S. Yu Dec 2011

David LO

With the ever-growing popularity of social networks, web and bio-networks, mining large frequent patterns from a single huge network has become increasingly important. Yet the existing pattern mining methods cannot offer the efficiency desirable for large pattern discovery. We propose SpiderMine, a novel algorithm to efficiently mine top-K largest frequent patterns from a single massive network with any user-specified probability of 1 − ϵ. Deviating from the existing edge-by-edge (i.e., incremental) pattern-growth framework, SpiderMine achieves its efficiency by unleashing the power of small patterns of a bounded diameter, which we call “spiders”. With the spider structure, our approach adopts a …


Towards Succinctness In Mining Scenario-Based Specifications, David Lo, Shahar Maoz Dec 2011

David LO

Specification mining methods are used to extract candidate specifications from system execution traces. A major challenge for specification mining is succinctness. That is, in addition to the soundness, completeness, and scalable performance of the specification mining method, one is interested in producing a succinct result, which conveys a lot of information about the system under investigation but uses a short, machine and human-readable representation. In this paper we address the succinctness challenge in the context of scenario-based specification mining, whose target formalism is live sequence charts (LSC), an expressive extension of classical sequence diagrams. We do this by adapting three …


Automated Detection Of Likely Design Flaws In Layered Architectures, Aditya Budi, - Lucia, David Lo, Lingxiao Jiang, Shaowei Wang Dec 2011

David LO

Layered architecture prescribes a good principle for separating concerns to make systems more maintainable. One example of such layered architectures is the separation of classes into three groups: Boundary, Control, and Entity, which are referred to as the three analysis class stereotypes in UML. Classes of different stereotypes interact with one another; when properly designed, the overall interaction would be maintainable, flexible, and robust. On the other hand, poor design would result in a less maintainable system that is prone to errors. In many software projects, the stereotypes of classes are often missing, thus detection of design flaws becomes non-trivial. …


Efficient Mining Of Iterative Patterns For Software Specification Discovery, David Lo, Siau-Cheng Khoo, Chao Liu Nov 2011

David LO

Studies have shown that program comprehension takes up to 45% of software development costs. Such high costs are caused by the lack of documented specifications and further aggravated by the phenomenon of software evolution. There is a need for automated tools to extract specifications to aid program comprehension. In this paper, a novel technique to efficiently mine common software temporal patterns from traces is proposed. These patterns shed light on program behaviors, and are termed iterative patterns. They capture unique characteristics of software traces, typically not found in arbitrary sequences. Specifically, due to loops, interesting iterative patterns can occur multiple times …


Smartic: Specification Mining Architecture With Trace Filtering And Clustering, David Lo, Siau-Cheng Khoo Nov 2011

David LO

Improper management of software evolution, compounded by imprecise and changing requirements, along with the "short time to market" requirement, commonly leads to a lack of up-to-date specifications. This can result in software that is characterized by bugs, anomalies and even security threats. Software specification mining is a new technique to address this concern by inferring specifications automatically. In this paper, we propose a novel API specification mining architecture called SMArTIC (Specification Mining Architecture with Trace fIltering and Clustering) to improve the accuracy, robustness and scalability of specification miners. This architecture is constructed based on two hypotheses: (1) Erroneous traces should …


Mining Software Specifications, David Lo, Siau-Cheng Khoo Nov 2011

David LO

No abstract provided.


Matching Dependence-Related Queries In The System Dependence Graph, Xiaoyin Wang, David Lo, Jiefeng Cheng, Lu Zhang, Hong Mei, Jeffrey Xu Yu Nov 2011

David LO

In software maintenance and evolution, it is common that developers want to apply a change to a number of similar places. Due to the size and complexity of the code base, it is challenging for developers to locate all the places that need the change. A main challenge in locating these places is that they share certain common dependence conditions, but existing code searching techniques can hardly handle dependence relations satisfactorily. In this paper, we propose a technique that enables developers to make queries involving dependence conditions and textual conditions on the system dependence graph …


Mining Iterative Generators And Representative Rules For Software Specification Discovery, David Lo, Jinyan Li, Limsoon Wong, Siau-Cheng Khoo Nov 2011

David LO

Billions of dollars are spent annually on software-related costs. It is estimated that up to 45 percent of software cost is due to the difficulty in understanding existing systems when performing maintenance tasks (e.g., adding features or removing bugs). One of the root causes is that software products often come with poor, incomplete, or even no documented specifications. In an effort to improve program understanding, Lo et al. have proposed iterative pattern mining which outputs patterns that are repeated frequently within a program trace, or across multiple traces, or both. Frequent iterative patterns reflect frequent program behaviors that likely …


Mining Past-Time Temporal Rules: A Dynamic Analysis Approach, David Lo, Siau-Cheng Khoo, Chao Liu Nov 2011

David LO

No abstract provided.


Mining Antagonistic Communities From Social Networks, Kuan Zhang, David Lo, Ee Peng Lim Nov 2011

David LO

During social interactions in a community, there are often sub-communities that behave in opposite ways. These antagonistic sub-communities could represent groups of people with opposite tastes, factions within a community distrusting one another, etc. Taking as input a set of interactions within a community, we develop a novel pattern mining approach that extracts a set of antagonistic sub-communities. In particular, based on a set of user-specified thresholds, we extract a set of pairs of sub-communities that behave in opposite ways to one another. To prevent a blow-up in this set of pairs, we focus on extracting a …


Efficient Mining Of Recurrent Rules From A Sequence Database, David Lo, Siau-Cheng Khoo, Chao Liu Nov 2011

David LO

We study a novel problem of mining significant recurrent rules from a sequence database. Recurrent rules have the form "whenever a series of precedent events occurs, eventually a series of consequent events occurs". Recurrent rules are intuitive and characterize behaviors in many domains. An example is in the domain of software specifications, in which the rules capture a family of program properties beneficial to program verification and bug detection. Recurrent rules generalize existing work on sequential and episode rules by considering repeated occurrences of premise and consequent events within a sequence and across multiple sequences, and by removing the "window" …


Efficient Topological Olap On Information Networks, Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip Yu, Hongyan Li Nov 2011

David LO

We propose a framework for efficient OLAP on information networks with a focus on the most interesting kind, the topological OLAP (called “T-OLAP”), which incurs topological changes in the underlying networks. T-OLAP operations generate new networks from the original ones by rolling up a subset of nodes chosen by certain constraint criteria. The key challenge is to efficiently compute measures for the newly generated networks and handle user queries with varied constraints. Two effective computational techniques, T-Distributiveness and T-Monotonicity, are proposed to achieve efficient query processing and cube materialization. We also provide a T-OLAP query processing framework into which these …


Comprehensive Evaluation Of Association Measures For Fault Localization, Lucia Lucia, David Lo, Lingxiao Jiang, Aditya Budi Nov 2011

David LO

In the statistics and data mining communities, many measures have been proposed to gauge the strength of association between two variables of interest, such as odds ratio, confidence, Yule-Y, Yule-Q, Kappa, and Gini index. These association measures have been used in various domains, for example, to evaluate whether a particular medical practice is positively associated with a cure of a disease or whether a particular marketing strategy is positively associated with an increase in revenue. This paper models the problem of locating faults as an association between the execution or non-execution of particular program elements and failures. There have been …
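To make the idea of an association measure concrete, the sketch below computes three of the measures listed in this abstract (odds ratio, Yule-Q, and Yule-Y) from a 2x2 contingency table using their standard textbook definitions. Mapping the table cells to "element executed or not" versus "run failed or passed" is an assumption made here for illustration, as are the function names and the zero-denominator conventions; the paper may use different variants or smoothing.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for a 2x2 table.

    Assumed cell layout for this example:
      a: executed and failed       b: executed and passed
      c: not executed and failed   d: not executed and passed
    """
    if b * c == 0:
        # Assumption: infinite when only the denominator is zero, NaN when both are.
        return math.inf if a * d > 0 else math.nan
    return (a * d) / (b * c)


def yule_q(a, b, c, d):
    """Yule's Q: (ad - bc) / (ad + bc), ranging from -1 to 1."""
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0


def yule_y(a, b, c, d):
    """Yule's Y: (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc)), ranging from -1 to 1."""
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    denom = root_ad + root_bc
    return (root_ad - root_bc) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical table: the element is executed in 3 of 4 failing runs
    # and in 1 of 4 passing runs.
    table = (3, 1, 1, 3)
    print(odds_ratio(*table), yule_q(*table), yule_y(*table))
```

Under this mapping, a program element whose execution is strongly tied to failing runs receives a high score on each measure, so ranking elements by score yields a list of fault candidates.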


Towards Better Quality Specification Miners, David Lo, Siau-Cheng Khoo Nov 2011

David LO

Software is often built without a specification. Tools to automatically extract specifications from software are needed, and many techniques have been proposed. One type of these specifications, the temporal API specification, is often specified in the form of an automaton (i.e., FSA/PFSA). There has been much work on mining software temporal specifications using dynamic analysis techniques; i.e., analysis of software program traces. Unfortunately, the issues of scalability, robustness and accuracy of these techniques have not been comprehensively addressed. In this paper, we describe a framework that enables assessments of the performance of a specification miner in generating temporal specification of software …


Mining Interesting Link Formation Rules In Social Networks, Cane Wing-Ki Leung, Ee Peng Lim, David Lo, Jianshu Weng Nov 2011

David LO

Link structures are important patterns one looks out for when modeling and analyzing social networks. In this paper, we propose the task of mining interesting Link Formation rules (LF-rules) containing link structures known as Link Formation patterns (LF-patterns). LF-patterns capture various dyadic and/or triadic structures among groups of nodes, while LF-rules capture the formation of a new link from a focal node to another node as a postcondition of existing connections between the two nodes. We devise a novel LF-rule mining algorithm, known as LFR-Miner, based on frequent subgraph mining for our task. In addition to using a support-confidence framework …


Mining Closed Discriminative Dyadic Sequential Patterns, David Lo, Hong Cheng, - Lucia Nov 2011

David LO

A lot of data are in sequential formats. In this study, we are interested in sequential data that come in pairs. There are many interesting datasets in this format coming from various domains including parallel textual corpora, duplicate bug reports, and other pairs of related sequences of events. Our goal is to mine a set of closed discriminative dyadic sequential patterns from a database of sequence pairs, each belonging to one of the two classes +ve and -ve. These dyadic sequential patterns characterize the discriminating facets contrasting the two classes. They are potentially good features to be used for the …


Mining Patterns And Rules For Software Specification Discovery, David Lo, Siau-Cheng Khoo Nov 2011

David LO

Software specifications are often lacking, incomplete and outdated in the industry. Missing and incomplete specifications cause various software engineering problems. Studies have shown that program comprehension takes up to 45% of software development costs. One of the root causes of the high cost is the lack of documented specifications. Also, outdated and incomplete specifications might potentially cause bugs and compatibility issues. In this paper, we describe novel data mining techniques to mine or reverse engineer these specifications from the pool of software engineering data. A large amount of software data is available for analysis. One form of software data is program …


Mining Specifications In Diversified Formats From Execution Traces, David Lo Nov 2011

David LO

Software evolves; this phenomenon causes an increase in maintenance effort, problems in comprehending the ever-changing code base, and difficulty in verifying software correctness. As software changes, the documented specification is often not updated. An outdated specification adds to the challenge of understanding the code base during maintenance tasks. Also, software changes might induce bugs, anomalies and even security threats. To address the above issues, we propose an array of specification mining techniques to mine software specifications in diversified formats from program execution traces. Case studies on various systems show that the extracted specifications shed light on the behaviors of systems under analysis. …


Data Mining For Software Engineering, Tao Xie, Suresh Thummalapenta, David Lo, Chao Liu Nov 2011

David LO

To improve software productivity and quality, software engineers are increasingly applying data mining algorithms to various software engineering tasks. However, mining SE data poses several challenges. The authors present various algorithms to effectively mine sequences, graphs, and text from such data.


Specification Mining: A Concise Introduction, David Lo, Siau-Cheng Khoo, Chao Liu, Jiawei Han Nov 2011

David LO

No abstract provided.