Open Access. Powered by Scholars. Published by Universities.®

Databases and Information Systems Commons

Open Access. Powered by Scholars. Published by Universities.®

Articles 1 - 30 of 34

Full-Text Articles in Databases and Information Systems

Diversification And Fairness In Top-K Ranking Algorithms, Mahsa Asadi Aug 2023

Diversification And Fairness In Top-K Ranking Algorithms, Mahsa Asadi

Dissertations

Given a user query, the typical user interfaces, such as search engines and recommender systems, only allow a small number of results to be returned to the user. Hence, figuring out what would be the top-k results is an important task in information retrieval, as it helps to ensure that the most relevant results are presented to the user. There exists an extensive body of research that studies how to score the records and return top-k to the user. Moreover, there exists an extensive set of criteria that researchers identify to present the user with top-k results, and result diversification …


A Survey Of Typical Attributed Graph Queries, Yanhao Wang, Yuchen Li, Ju Fan, Chang Ye, Mingke Chai Nov 2020

A Survey Of Typical Attributed Graph Queries, Yanhao Wang, Yuchen Li, Ju Fan, Chang Ye, Mingke Chai

Research Collection School Of Computing and Information Systems

Graphs are commonly used for representing complex structures such as social relationships, biological interactions, and knowledge bases. In many scenarios, graphs not only represent topological relationships but also store the attributes that denote the semantics associated with their vertices and edges, known as attributed graphs. Attributed graphs can meet demands for a wide range of applications, and thus a variety of queries on attributed graphs have been proposed. However, these diverse types of attributed graph queries have not been systematically investigated yet. In this paper, we provide an extensive survey of several typical types of attributed graph queries. We propose …


Efficient Algorithms For Solving Aggregate Keyword Routing Problems, Qize Jiang, Weiwei Sun, Baihua Zheng, Kunjie Chen Apr 2019

Efficient Algorithms For Solving Aggregate Keyword Routing Problems, Qize Jiang, Weiwei Sun, Baihua Zheng, Kunjie Chen

Research Collection School Of Computing and Information Systems

With the emergence of smart phones and the popularity of GPS, the number of point of interest (POIs) is growing rapidly and spatial keyword search based on POIs has attracted significant attention. In this paper, we study a more sophistic type of spatial keyword searches that considers multiple query points and multiple query keywords, namely Aggregate Keyword Routing (AKR). AKR looks for an aggregate point m together with routes from each query point to m. The aggregate point has to satisfy the aggregate keywords, the routes from query points to the aggregate point have to pass POIs in order to …


Semantic And Influence Aware K-Representative Queries Over Social Streams, Yanhao Wang, Yuchen Li, Kianlee Tan Mar 2019

Semantic And Influence Aware K-Representative Queries Over Social Streams, Yanhao Wang, Yuchen Li, Kianlee Tan

Research Collection School Of Computing and Information Systems

Massive volumes of data continuously generated on social platforms have become an important information source for users. A primary method to obtain fresh and valuable information from social streams is social search. Although there have been extensive studies on social search, existing methods only focus on the relevance of query results but ignore the representativeness. In this paper, we propose a novel Semantic and Influence aware k-Representative (k-SIR) query for social streams based on topic modeling. Specifically, we consider that both user queries and elements are represented as vectors in the topic space. A k-SIR query retrieves a set of …


Distributed K-Nearest Neighbor Queries In Metric Spaces, Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, Baihua Zheng Jul 2018

Distributed K-Nearest Neighbor Queries In Metric Spaces, Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, Baihua Zheng

Research Collection School Of Computing and Information Systems

Metric k nearest neighbor (MkNN) queries have applications in many areas such as multimedia retrieval, computational biology, and location-based services. With the growing volumes of data, a distributed method is required. In this paper, we propose an Asynchronous Metric Distributed System (AMDS), which uniformly partitions the data with the pivot-mapping technique to ensure the load balancing, and employs publish/subscribe communication model to asynchronously process large scale of queries. The employment of asynchronous processing model also improves robustness and efficiency of AMDS. In addition, we develop an efficient estimation based MkNN method using AMDS to improve the query efficiency. Extensive experiments …


A Novel Representation And Compression For Queries On Trajectories In Road Networks, Xiaochun Yang, Bin Wang, Kai Yang, Chengfei Liu, Baihua Zheng Apr 2018

A Novel Representation And Compression For Queries On Trajectories In Road Networks, Xiaochun Yang, Bin Wang, Kai Yang, Chengfei Liu, Baihua Zheng

Research Collection School Of Computing and Information Systems

Recording and querying time-stamped trajectories incurs high cost of data storage and computing. In this paper, we explore several characteristics of the trajectories in road mbox{networks}, which have motivated the idea of coding trajectories by associating timestamps with relative spatial path and locations. Such a representation contains large number of duplicate information to achieve a lower entropy compared with the existing representations, thereby drastically cutting the storage cost. We propose several techniques to compress spatial path and locations separately, which can support fast positioning and achieve better compression ratio. For locations, we propose two novel encoding schemes such that the …


Geometric Aspects And Auxiliary Features To Top-K Processing [Advanced Seminar], Kyriakos Mouratidis Jun 2016

Geometric Aspects And Auxiliary Features To Top-K Processing [Advanced Seminar], Kyriakos Mouratidis

Research Collection School Of Computing and Information Systems

Top-k processing is a well-studied problem with numerous applications that is becoming increasingly relevant with the growing availability of recommendation systems and decision making software on PCs, PDAs and smart-phones. The objective of this seminar is twofold. First, we will delve into the geometric aspects of top-k processing. Second, we will cover complementary features to top-k queries that have a strong geometric nature. The seminar will close with insights in the effect of dimensionality on the meaningfulness of top-k queries, and interesting similarities to nearest neighbor search.


When Peculiarity Makes A Difference: Object Characterisation In Heterogeneous Information Networks, Wei Chen, Feida Zhu, Lei Zhao, Xiaofang Zhou Apr 2016

When Peculiarity Makes A Difference: Object Characterisation In Heterogeneous Information Networks, Wei Chen, Feida Zhu, Lei Zhao, Xiaofang Zhou

Research Collection School Of Computing and Information Systems

A central task in heterogeneous information networks (HIN) is how to characterise an entity, which underlies a wide range of applications such as similarity search, entity profiling and linkage. Most existing work focus on using the main features common to all. While this approach makes sense in settings where commonality is of primary interest, there are many scenarios as important where uncommon and discriminative features are more useful. To address the problem, a novel model COHIN (Characterize Objects in Heterogeneous Information Networks) is proposed, where each object is characterized as a set of feature paths that contain both main and …


Top-K Dominating Queries On Incomplete Data, Xiaoye Miao, Yunjun Gao, Baihua Zheng, Gang Chen, Huiyong Cui Jan 2016

Top-K Dominating Queries On Incomplete Data, Xiaoye Miao, Yunjun Gao, Baihua Zheng, Gang Chen, Huiyong Cui

Research Collection School Of Computing and Information Systems

The top-k dominating (TKD) query returns the k objects that dominate the maximum number of objects in a given dataset. It combines the advantages of skyline and top-k queries, and plays an important role in many decision support applications. Incomplete data exists in a wide spectrum of real datasets, due to device failure, privacy preservation, data loss, and so on. In this paper, for the first time, we carry out a systematic study of TKD queries on incomplete data, which involves the data having some missing dimensional value(s). We formalize this problem, and propose a suite of efficient algorithms for …


Fast Optimal Aggregate Point Search For A Merged Set On Road Networks, Weiwei Sun, Chong Chen, Baihua Zheng, Chunan Chen, Liang Zhu, Weimo Liu, Yan Huang Jul 2015

Fast Optimal Aggregate Point Search For A Merged Set On Road Networks, Weiwei Sun, Chong Chen, Baihua Zheng, Chunan Chen, Liang Zhu, Weimo Liu, Yan Huang

Research Collection School Of Computing and Information Systems

Aggregate nearest neighbor query, which returns an optimal target point that minimizes the aggregate distance for a given query point set, is one of the most important operations in spatial databases and their application domains. This paper addresses the problem of finding the aggregate nearest neighbor for a merged set that consists of the given query point set and multiple points needed to be selected from a candidate set, which we name as merged aggregate nearest neighbor(MANN) query. This paper proposes two algorithms to process MANN query on road networks when aggregate function is max. Then, we extend the algorithms …


On Efficient K-Optimal-Location-Selection Query Processing In Metric Spaces, Yunjun Gao, Shuyao Qi, Lu Chen, Baihua Zheng, Xinhan Li Mar 2015

On Efficient K-Optimal-Location-Selection Query Processing In Metric Spaces, Yunjun Gao, Shuyao Qi, Lu Chen, Baihua Zheng, Xinhan Li

Research Collection School Of Computing and Information Systems

This paper studies the problem of k-optimal-location-selection (kOLS) retrieval in metric spaces. Given a set DA of customers, a set DB of locations, a constrained region R , and a critical distance dc, a metric kOLS (MkOLS) query retrieves k locations in DB that are outside R but have the maximal optimality scores. Here, the optimality score of a location l∈DB located outside R is defined as the number of the customers in DA that are inside R and meanwhile have their distances to l bounded by …


On Processing Reverse K-Skyband And Ranked Reverse Skyline Queries, Yunjun Gao, Qing Liu, Baihua Zheng, Mou Li, Gang Chen, Qing Li Feb 2015

On Processing Reverse K-Skyband And Ranked Reverse Skyline Queries, Yunjun Gao, Qing Liu, Baihua Zheng, Mou Li, Gang Chen, Qing Li

Research Collection School Of Computing and Information Systems

In this paper, for the first time, we identify and solve the problem of efficient reverse k-skyband (RkSB) query processing. Given a set P of multi-dimensional points and a query point q, an RkSB query returns all the points in P whose dynamic k-skyband contains q. We formalize RkSB retrieval, and then propose five algorithms for computing the RkSB of an arbitrary query point efficiently. Our methods utilize a conventional data-partitioning index (e.g., R-tree) on the dataset, and employ pre-computation, reuse and pruning techniques to boost the query efficiency. In addition, we extend our solutions to tackle an interesting variant …


On Efficient Reverse Skyline Query Processing, Yunjun Gao, Qing Liu, Baihua Zheng, Gang Chen Jun 2014

On Efficient Reverse Skyline Query Processing, Yunjun Gao, Qing Liu, Baihua Zheng, Gang Chen

Research Collection School Of Computing and Information Systems

Given a D-dimensional data set P and a query point q, a reverse skyline query (RSQ) returns all the data objects in P whose dynamic skyline contains q. It is important for many real life applications such as business planning and environmental monitoring. Currently, the state-of-the-art algorithm for answering the RSQ is the reverse skyline using skyline approximations (RSSA) algorithm, which is based on the precomputed approximations of the skylines. Although RSSA has some desirable features, e.g., applicability to arbitrary data distributions and dimensions, it needs for multiple accesses of the same nodes, incurring redundant I/O and CPU costs. In …


Continuous Visible Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Xiaofa Guo Jun 2011

Continuous Visible Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Xiaofa Guo

Research Collection School Of Computing and Information Systems

In this paper, we identify and solve a new type of spatial queries, called continuous visible nearest neighbor (CVNN) search. Given a data set P, an obstacle set O, and a query line segment q in a two-dimensional space, a CVNN query returns a set of $${\langle p, R\rangle}$$ tuples such that $${p \in P}$$ is the nearest neighbor to every point r along the interval $${R \subseteq q}$$ as well as pis visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R due to the obstruction of …


Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou Dec 2010

Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou

Kyriakos MOURATIDIS

Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location based services and sensor networks. Since the k-medoid problem is NP hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order …


Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu Dec 2010

Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu

Kyriakos MOURATIDIS

The increasing availability of location-aware mobile devices has given rise to a flurry of location-based services (LBSs). Due to the nature of spatial queries, an LBS needs the user position in order to process her requests. On the other hand, revealing exact user locations to a (potentially untrusted) LBS may pinpoint their identities and breach their privacy. To address this issue, spatial anonymity techniques obfuscate user locations, forwarding to the LBS a sufficiently large region instead. Existing methods explicitly target processing in the euclidean space and do not apply when proximity to the users is defined according to network distance …


Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu Dec 2010

Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu

Kyriakos MOURATIDIS

Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbours, e-distance joins, closest pairs and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed …


A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao Dec 2010

A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao

Kyriakos MOURATIDIS

Assume a set of moving objects and a central server that monitors their positions over time, while processing continuous nearest neighbor queries from geographically distributed clients. In order to always report up-to-date results, the server could constantly obtain the most recent position of all objects. However, this naïve solution requires the transmission of a large number of rapid data streams corresponding to location updates. Intuitively, current information is necessary only for objects that may influence some query result (i.e., they may be included in the nearest neighbor set of some client). Motivated by this observation, we present a threshold-based algorithm …


Efficient Mutual Nearest Neighbor Query Processing For Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen, Gang Chen Jun 2010

Efficient Mutual Nearest Neighbor Query Processing For Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen, Gang Chen

Research Collection School Of Computing and Information Systems

Given a set D of trajectories, a query object q, and a query time extent Γ, a mutual (i.e., symmetric) nearest neighbor (MNN) query over trajectories finds from D, the set of trajectories that are among the k1 nearest neighbors (NNs) of q within Γ, and meanwhile, have q as one of their k2 NNs. This type of queries is useful in many applications such as decision making, data mining, and pattern recognition, as it considers both the proximity of the trajectories to q and the proximity of q to the trajectories. In this paper, we first formalize MNN search …


Algorithms For Constrained K-Nearest Neighbor Queries Over Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen Apr 2010

Algorithms For Constrained K-Nearest Neighbor Queries Over Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen

Research Collection School Of Computing and Information Systems

An important query for spatio-temporal databases is to find nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D within T, the k (≥ 1) trajectories that lie closest to q and …


Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu Jan 2010

Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu

Research Collection School Of Computing and Information Systems

The increasing availability of location-aware mobile devices has given rise to a flurry of location-based services (LBSs). Due to the nature of spatial queries, an LBS needs the user position in order to process her requests. On the other hand, revealing exact user locations to a (potentially untrusted) LBS may pinpoint their identities and breach their privacy. To address this issue, spatial anonymity techniques obfuscate user locations, forwarding to the LBS a sufficiently large region instead. Existing methods explicitly target processing in the euclidean space and do not apply when proximity to the users is defined according to network distance …


Visible Reverse K-Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Wang-Chien Lee, Ken C. K. Lee, Qing Li Sep 2009

Visible Reverse K-Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Wang-Chien Lee, Ken C. K. Lee, Qing Li

Research Collection School Of Computing and Information Systems

Reverse nearest neighbor (RNN) queries have a broad application base such as decision support, profile-based marketing, resource allocation, etc. Previous work on RNN search does not take obstacles into consideration. In the real world, however, there are many physical obstacles (e.g., buildings) and their presence may affect the visibility between objects. In this paper, we introduce a novel variant of RNN queries, namely, visible reverse nearest neighbor (VRNN) search, which considers the impact of obstacles on the visibility of objects. Given a data set P, an obstacle set O, and a query point q in a 2D space, a VRNN …


On Efficient Mutual Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li Aug 2009

On Efficient Mutual Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li

Research Collection School Of Computing and Information Systems

This paper studies a new form of nearest neighbor queries in spatial databases, namely, mutual nearest neighbour (MNN) search. Given a set D of objects and a query object q, an MNN query returns from D, the set of objects that are among the k1 (≥ 1) nearest neighbors (NNs) of q; meanwhile, have q as one of their k2(≥ 1) NNs. Although MNN queries are useful in many applications involving decision making, data mining, and pattern recognition, it cannot be efficiently handled by existing spatial query processing approaches. In this paper, we present …


Optimal-Location-Selection Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li Aug 2009

Optimal-Location-Selection Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li

Research Collection School Of Computing and Information Systems

This paper introduces and solves a novel type of spatial queries, namely, Optimal-Location-Selection (OLS) search, which has many applications in real life. Given a data object set D_A, a target object set D_B, a spatial region R, and a critical distance d_c in a multidimensional space, an OLS query retrieves those target objects in D_B that are outside R but have maximal optimality. Here, the optimality of a target object b \in D_B located outside R is defined as the number of the data objects from D_A that are inside R and meanwhile have their distances to b not exceeding …


Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou Jul 2008

Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou

Research Collection School Of Computing and Information Systems

Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location based services and sensor networks. Since the k-medoid problem is NP hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order …


Processing Transitive Nearest-Neighbor Queries In Multi-Channel Access Environments, Xiao Zhang, Wang-Chien Lee, Prasnjit Mitra, Baihua Zheng Mar 2008

Processing Transitive Nearest-Neighbor Queries In Multi-Channel Access Environments, Xiao Zhang, Wang-Chien Lee, Prasnjit Mitra, Baihua Zheng

Research Collection School Of Computing and Information Systems

Wireless broadcast is an efficient way for information dissemination due to its good scalability [10]. Existing works typically assume mobile devices, such as cell phones and PDAs, can access only one channel at a time. In this paper, we consider a scenario of near future where a mobile device has the ability to process queries using information simultaneously received from multiple channels. We focus on the query processing of the transitive nearest neighbor (TNN) search [19]. Two TNN algorithms developed for a single broadcast channel environment are adapted to our new broadcast enviroment. Based on the obtained insights, we propose …


Authenticating Multi-Dimensional Query Results In Data Publishing, Weiwei Cheng, Hwee Hwa Pang, Kian-Lee Tan Jul 2006

Authenticating Multi-Dimensional Query Results In Data Publishing, Weiwei Cheng, Hwee Hwa Pang, Kian-Lee Tan

Research Collection School Of Computing and Information Systems

In data publishing, the owner delegates the role of satisfying user queries to a third-party publisher. As the publisher may be untrusted or susceptible to attacks, it could produce incorrect query results. This paper introduces a mechanism for users to verify that their query answers on a multi-dimensional dataset are correct, in the sense of being complete (i.e., no qualifying data points are omitted) and authentic (i.e., all the result values originated from the owner). Our approach is to add authentication information into a spatial data structure, by constructing certified chains on the points within each partition, as well as …


A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao Nov 2005

A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao

Research Collection School Of Computing and Information Systems

Assume a set of moving objects and a central server that monitors their positions over time, while processing continuous nearest neighbor queries from geographically distributed clients. In order to always report up-to-date results, the server could constantly obtain the most recent position of all objects. However, this naïve solution requires the transmission of a large number of rapid data streams corresponding to location updates. Intuitively, current information is necessary only for objects that may influence some query result (i.e., they may be included in the nearest neighbor set of some client). Motivated by this observation, we present a threshold-based algorithm …


Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu Nov 2005

Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu

Research Collection School Of Computing and Information Systems

Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbours, e-distance joins, closest pairs and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed …


Fast Filter-And-Refine Algorithms For Subsequence Selection, Beng-Chin Ooi, Hwee Hwa Pang, Hao Wang, Limsoon Wong, Cui Yu Jul 2002

Fast Filter-And-Refine Algorithms For Subsequence Selection, Beng-Chin Ooi, Hwee Hwa Pang, Hao Wang, Limsoon Wong, Cui Yu

Research Collection School Of Computing and Information Systems

Large sequence databases, such as protein, DNA and gene sequences in biology, are becoming increasingly common. An important operation on a sequence database is approximate subsequence matching, where all subsequences that are within some distance from a given query string are retrieved. This paper proposes a filter-and-refine algorithm that enables efficient approximate subsequence matching in large DNA sequence databases. It employs a bitmap indexing structure to condense and encode each data sequence into a shorter index sequence. During query processing, the bitmap index is used to filter out most of the irrelevant subsequences, and false positives are removed in the …