Open Access. Powered by Scholars. Published by Universities.®
Databases and Information Systems Commons™
Open Access. Powered by Scholars. Published by Universities.®
- Discipline
-
- Numerical Analysis and Scientific Computing (23)
- Theory and Algorithms (5)
- Electrical and Computer Engineering (2)
- Engineering (2)
- Computer Engineering (1)
-
- Digital Communications and Networking (1)
- Graphics and Human Computer Interfaces (1)
- Information Security (1)
- Other Computer Sciences (1)
- Public Affairs, Public Policy and Public Administration (1)
- Signal Processing (1)
- Social and Behavioral Sciences (1)
- Systems and Communications (1)
- Transportation (1)
- Institution
- Publication Year
- Publication
- Publication Type
Articles 1 - 30 of 34
Full-Text Articles in Databases and Information Systems
Diversification And Fairness In Top-K Ranking Algorithms, Mahsa Asadi
Diversification And Fairness In Top-K Ranking Algorithms, Mahsa Asadi
Dissertations
Given a user query, the typical user interfaces, such as search engines and recommender systems, only allow a small number of results to be returned to the user. Hence, figuring out what would be the top-k results is an important task in information retrieval, as it helps to ensure that the most relevant results are presented to the user. There exists an extensive body of research that studies how to score the records and return top-k to the user. Moreover, there exists an extensive set of criteria that researchers identify to present the user with top-k results, and result diversification …
A Survey Of Typical Attributed Graph Queries, Yanhao Wang, Yuchen Li, Ju Fan, Chang Ye, Mingke Chai
A Survey Of Typical Attributed Graph Queries, Yanhao Wang, Yuchen Li, Ju Fan, Chang Ye, Mingke Chai
Research Collection School Of Computing and Information Systems
Graphs are commonly used for representing complex structures such as social relationships, biological interactions, and knowledge bases. In many scenarios, graphs not only represent topological relationships but also store the attributes that denote the semantics associated with their vertices and edges, known as attributed graphs. Attributed graphs can meet demands for a wide range of applications, and thus a variety of queries on attributed graphs have been proposed. However, these diverse types of attributed graph queries have not been systematically investigated yet. In this paper, we provide an extensive survey of several typical types of attributed graph queries. We propose …
Efficient Algorithms For Solving Aggregate Keyword Routing Problems, Qize Jiang, Weiwei Sun, Baihua Zheng, Kunjie Chen
Efficient Algorithms For Solving Aggregate Keyword Routing Problems, Qize Jiang, Weiwei Sun, Baihua Zheng, Kunjie Chen
Research Collection School Of Computing and Information Systems
With the emergence of smart phones and the popularity of GPS, the number of point of interest (POIs) is growing rapidly and spatial keyword search based on POIs has attracted significant attention. In this paper, we study a more sophistic type of spatial keyword searches that considers multiple query points and multiple query keywords, namely Aggregate Keyword Routing (AKR). AKR looks for an aggregate point m together with routes from each query point to m. The aggregate point has to satisfy the aggregate keywords, the routes from query points to the aggregate point have to pass POIs in order to …
Semantic And Influence Aware K-Representative Queries Over Social Streams, Yanhao Wang, Yuchen Li, Kianlee Tan
Semantic And Influence Aware K-Representative Queries Over Social Streams, Yanhao Wang, Yuchen Li, Kianlee Tan
Research Collection School Of Computing and Information Systems
Massive volumes of data continuously generated on social platforms have become an important information source for users. A primary method to obtain fresh and valuable information from social streams is social search. Although there have been extensive studies on social search, existing methods only focus on the relevance of query results but ignore the representativeness. In this paper, we propose a novel Semantic and Influence aware k-Representative (k-SIR) query for social streams based on topic modeling. Specifically, we consider that both user queries and elements are represented as vectors in the topic space. A k-SIR query retrieves a set of …
Distributed K-Nearest Neighbor Queries In Metric Spaces, Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, Baihua Zheng
Distributed K-Nearest Neighbor Queries In Metric Spaces, Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, Baihua Zheng
Research Collection School Of Computing and Information Systems
Metric k nearest neighbor (MkNN) queries have applications in many areas such as multimedia retrieval, computational biology, and location-based services. With the growing volumes of data, a distributed method is required. In this paper, we propose an Asynchronous Metric Distributed System (AMDS), which uniformly partitions the data with the pivot-mapping technique to ensure the load balancing, and employs publish/subscribe communication model to asynchronously process large scale of queries. The employment of asynchronous processing model also improves robustness and efficiency of AMDS. In addition, we develop an efficient estimation based MkNN method using AMDS to improve the query efficiency. Extensive experiments …
A Novel Representation And Compression For Queries On Trajectories In Road Networks, Xiaochun Yang, Bin Wang, Kai Yang, Chengfei Liu, Baihua Zheng
A Novel Representation And Compression For Queries On Trajectories In Road Networks, Xiaochun Yang, Bin Wang, Kai Yang, Chengfei Liu, Baihua Zheng
Research Collection School Of Computing and Information Systems
Recording and querying time-stamped trajectories incurs high cost of data storage and computing. In this paper, we explore several characteristics of the trajectories in road mbox{networks}, which have motivated the idea of coding trajectories by associating timestamps with relative spatial path and locations. Such a representation contains large number of duplicate information to achieve a lower entropy compared with the existing representations, thereby drastically cutting the storage cost. We propose several techniques to compress spatial path and locations separately, which can support fast positioning and achieve better compression ratio. For locations, we propose two novel encoding schemes such that the …
Geometric Aspects And Auxiliary Features To Top-K Processing [Advanced Seminar], Kyriakos Mouratidis
Geometric Aspects And Auxiliary Features To Top-K Processing [Advanced Seminar], Kyriakos Mouratidis
Research Collection School Of Computing and Information Systems
Top-k processing is a well-studied problem with numerous applications that is becoming increasingly relevant with the growing availability of recommendation systems and decision making software on PCs, PDAs and smart-phones. The objective of this seminar is twofold. First, we will delve into the geometric aspects of top-k processing. Second, we will cover complementary features to top-k queries that have a strong geometric nature. The seminar will close with insights in the effect of dimensionality on the meaningfulness of top-k queries, and interesting similarities to nearest neighbor search.
When Peculiarity Makes A Difference: Object Characterisation In Heterogeneous Information Networks, Wei Chen, Feida Zhu, Lei Zhao, Xiaofang Zhou
When Peculiarity Makes A Difference: Object Characterisation In Heterogeneous Information Networks, Wei Chen, Feida Zhu, Lei Zhao, Xiaofang Zhou
Research Collection School Of Computing and Information Systems
A central task in heterogeneous information networks (HIN) is how to characterise an entity, which underlies a wide range of applications such as similarity search, entity profiling and linkage. Most existing work focus on using the main features common to all. While this approach makes sense in settings where commonality is of primary interest, there are many scenarios as important where uncommon and discriminative features are more useful. To address the problem, a novel model COHIN (Characterize Objects in Heterogeneous Information Networks) is proposed, where each object is characterized as a set of feature paths that contain both main and …
Top-K Dominating Queries On Incomplete Data, Xiaoye Miao, Yunjun Gao, Baihua Zheng, Gang Chen, Huiyong Cui
Top-K Dominating Queries On Incomplete Data, Xiaoye Miao, Yunjun Gao, Baihua Zheng, Gang Chen, Huiyong Cui
Research Collection School Of Computing and Information Systems
The top-k dominating (TKD) query returns the k objects that dominate the maximum number of objects in a given dataset. It combines the advantages of skyline and top-k queries, and plays an important role in many decision support applications. Incomplete data exists in a wide spectrum of real datasets, due to device failure, privacy preservation, data loss, and so on. In this paper, for the first time, we carry out a systematic study of TKD queries on incomplete data, which involves the data having some missing dimensional value(s). We formalize this problem, and propose a suite of efficient algorithms for …
Fast Optimal Aggregate Point Search For A Merged Set On Road Networks, Weiwei Sun, Chong Chen, Baihua Zheng, Chunan Chen, Liang Zhu, Weimo Liu, Yan Huang
Fast Optimal Aggregate Point Search For A Merged Set On Road Networks, Weiwei Sun, Chong Chen, Baihua Zheng, Chunan Chen, Liang Zhu, Weimo Liu, Yan Huang
Research Collection School Of Computing and Information Systems
Aggregate nearest neighbor query, which returns an optimal target point that minimizes the aggregate distance for a given query point set, is one of the most important operations in spatial databases and their application domains. This paper addresses the problem of finding the aggregate nearest neighbor for a merged set that consists of the given query point set and multiple points needed to be selected from a candidate set, which we name as merged aggregate nearest neighbor(MANN) query. This paper proposes two algorithms to process MANN query on road networks when aggregate function is max. Then, we extend the algorithms …
On Efficient K-Optimal-Location-Selection Query Processing In Metric Spaces, Yunjun Gao, Shuyao Qi, Lu Chen, Baihua Zheng, Xinhan Li
On Efficient K-Optimal-Location-Selection Query Processing In Metric Spaces, Yunjun Gao, Shuyao Qi, Lu Chen, Baihua Zheng, Xinhan Li
Research Collection School Of Computing and Information Systems
This paper studies the problem of k-optimal-location-selection (kOLS) retrieval in metric spaces. Given a set DA of customers, a set DB of locations, a constrained region R , and a critical distance dc, a metric kOLS (MkOLS) query retrieves k locations in DB that are outside R but have the maximal optimality scores. Here, the optimality score of a location l∈DB located outside R is defined as the number of the customers in DA that are inside R and meanwhile have their distances to l bounded by …
On Processing Reverse K-Skyband And Ranked Reverse Skyline Queries, Yunjun Gao, Qing Liu, Baihua Zheng, Mou Li, Gang Chen, Qing Li
On Processing Reverse K-Skyband And Ranked Reverse Skyline Queries, Yunjun Gao, Qing Liu, Baihua Zheng, Mou Li, Gang Chen, Qing Li
Research Collection School Of Computing and Information Systems
In this paper, for the first time, we identify and solve the problem of efficient reverse k-skyband (RkSB) query processing. Given a set P of multi-dimensional points and a query point q, an RkSB query returns all the points in P whose dynamic k-skyband contains q. We formalize RkSB retrieval, and then propose five algorithms for computing the RkSB of an arbitrary query point efficiently. Our methods utilize a conventional data-partitioning index (e.g., R-tree) on the dataset, and employ pre-computation, reuse and pruning techniques to boost the query efficiency. In addition, we extend our solutions to tackle an interesting variant …
On Efficient Reverse Skyline Query Processing, Yunjun Gao, Qing Liu, Baihua Zheng, Gang Chen
On Efficient Reverse Skyline Query Processing, Yunjun Gao, Qing Liu, Baihua Zheng, Gang Chen
Research Collection School Of Computing and Information Systems
Given a D-dimensional data set P and a query point q, a reverse skyline query (RSQ) returns all the data objects in P whose dynamic skyline contains q. It is important for many real life applications such as business planning and environmental monitoring. Currently, the state-of-the-art algorithm for answering the RSQ is the reverse skyline using skyline approximations (RSSA) algorithm, which is based on the precomputed approximations of the skylines. Although RSSA has some desirable features, e.g., applicability to arbitrary data distributions and dimensions, it needs for multiple accesses of the same nodes, incurring redundant I/O and CPU costs. In …
Continuous Visible Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Xiaofa Guo
Continuous Visible Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Xiaofa Guo
Research Collection School Of Computing and Information Systems
In this paper, we identify and solve a new type of spatial queries, called continuous visible nearest neighbor (CVNN) search. Given a data set P, an obstacle set O, and a query line segment q in a two-dimensional space, a CVNN query returns a set of $${\langle p, R\rangle}$$ tuples such that $${p \in P}$$ is the nearest neighbor to every point r along the interval $${R \subseteq q}$$ as well as pis visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R due to the obstruction of …
Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou
Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou
Kyriakos MOURATIDIS
Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location based services and sensor networks. Since the k-medoid problem is NP hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order …
Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu
Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu
Kyriakos MOURATIDIS
The increasing availability of location-aware mobile devices has given rise to a flurry of location-based services (LBSs). Due to the nature of spatial queries, an LBS needs the user position in order to process her requests. On the other hand, revealing exact user locations to a (potentially untrusted) LBS may pinpoint their identities and breach their privacy. To address this issue, spatial anonymity techniques obfuscate user locations, forwarding to the LBS a sufficiently large region instead. Existing methods explicitly target processing in the euclidean space and do not apply when proximity to the users is defined according to network distance …
Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu
Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu
Kyriakos MOURATIDIS
Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbours, e-distance joins, closest pairs and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed …
A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao
A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao
Kyriakos MOURATIDIS
Assume a set of moving objects and a central server that monitors their positions over time, while processing continuous nearest neighbor queries from geographically distributed clients. In order to always report up-to-date results, the server could constantly obtain the most recent position of all objects. However, this naïve solution requires the transmission of a large number of rapid data streams corresponding to location updates. Intuitively, current information is necessary only for objects that may influence some query result (i.e., they may be included in the nearest neighbor set of some client). Motivated by this observation, we present a threshold-based algorithm …
Efficient Mutual Nearest Neighbor Query Processing For Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen, Gang Chen
Efficient Mutual Nearest Neighbor Query Processing For Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen, Gang Chen
Research Collection School Of Computing and Information Systems
Given a set D of trajectories, a query object q, and a query time extent Γ, a mutual (i.e., symmetric) nearest neighbor (MNN) query over trajectories finds from D, the set of trajectories that are among the k1 nearest neighbors (NNs) of q within Γ, and meanwhile, have q as one of their k2 NNs. This type of queries is useful in many applications such as decision making, data mining, and pattern recognition, as it considers both the proximity of the trajectories to q and the proximity of q to the trajectories. In this paper, we first formalize MNN search …
Algorithms For Constrained K-Nearest Neighbor Queries Over Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen
Algorithms For Constrained K-Nearest Neighbor Queries Over Moving Object Trajectories, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li, Chun Chen
Research Collection School Of Computing and Information Systems
An important query for spatio-temporal databases is to find nearest trajectories of moving objects. Existing work on this topic focuses on the closest trajectories in the whole data space. In this paper, we introduce and solve constrained k-nearest neighbor (CkNN) queries and historical continuous CkNN (HCCkNN) queries on R-tree-like structures storing historical information about moving object trajectories. Given a trajectory set D, a query object (point or trajectory) q, a temporal extent T, and a constrained region CR, (i) a CkNN query over trajectories retrieves from D within T, the k (≥ 1) trajectories that lie closest to q and …
Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu
Anonymous Query Processing In Road Networks, Kyriakos Mouratidis, Man Lung Yiu
Research Collection School Of Computing and Information Systems
The increasing availability of location-aware mobile devices has given rise to a flurry of location-based services (LBSs). Due to the nature of spatial queries, an LBS needs the user position in order to process her requests. On the other hand, revealing exact user locations to a (potentially untrusted) LBS may pinpoint their identities and breach their privacy. To address this issue, spatial anonymity techniques obfuscate user locations, forwarding to the LBS a sufficiently large region instead. Existing methods explicitly target processing in the euclidean space and do not apply when proximity to the users is defined according to network distance …
Visible Reverse K-Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Wang-Chien Lee, Ken C. K. Lee, Qing Li
Visible Reverse K-Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Wang-Chien Lee, Ken C. K. Lee, Qing Li
Research Collection School Of Computing and Information Systems
Reverse nearest neighbor (RNN) queries have a broad application base such as decision support, profile-based marketing, resource allocation, etc. Previous work on RNN search does not take obstacles into consideration. In the real world, however, there are many physical obstacles (e.g., buildings) and their presence may affect the visibility between objects. In this paper, we introduce a novel variant of RNN queries, namely, visible reverse nearest neighbor (VRNN) search, which considers the impact of obstacles on the visibility of objects. Given a data set P, an obstacle set O, and a query point q in a 2D space, a VRNN …
On Efficient Mutual Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li
On Efficient Mutual Nearest Neighbor Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li
Research Collection School Of Computing and Information Systems
This paper studies a new form of nearest neighbor queries in spatial databases, namely, mutual nearest neighbour (MNN) search. Given a set D of objects and a query object q, an MNN query returns from D, the set of objects that are among the k1 (≥ 1) nearest neighbors (NNs) of q; meanwhile, have q as one of their k2(≥ 1) NNs. Although MNN queries are useful in many applications involving decision making, data mining, and pattern recognition, it cannot be efficiently handled by existing spatial query processing approaches. In this paper, we present …
Optimal-Location-Selection Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li
Optimal-Location-Selection Query Processing In Spatial Databases, Yunjun Gao, Baihua Zheng, Gencai Chen, Qing Li
Research Collection School Of Computing and Information Systems
This paper introduces and solves a novel type of spatial queries, namely, Optimal-Location-Selection (OLS) search, which has many applications in real life. Given a data object set D_A, a target object set D_B, a spatial region R, and a critical distance d_c in a multidimensional space, an OLS query retrieves those target objects in D_B that are outside R but have maximal optimality. Here, the optimality of a target object b \in D_B located outside R is defined as the number of the data objects from D_A that are inside R and meanwhile have their distances to b not exceeding …
Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou
Tree-Based Partition Querying: A Methodology For Computing Medoids In Large Spatial Datasets, Kyriakos Mouratidis, Dimitris Papadias, Spiros Papadimitriou
Research Collection School Of Computing and Information Systems
Besides traditional domains (e.g., resource allocation, data mining applications), algorithms for medoid computation and related problems will play an important role in numerous emerging fields, such as location based services and sensor networks. Since the k-medoid problem is NP hard, all existing work deals with approximate solutions on relatively small datasets. This paper aims at efficient methods for very large spatial databases, motivated by: (i) the high and ever increasing availability of spatial data, and (ii) the need for novel query types and improved services. The proposed solutions exploit the intrinsic grouping properties of a data partition index in order …
Processing Transitive Nearest-Neighbor Queries In Multi-Channel Access Environments, Xiao Zhang, Wang-Chien Lee, Prasnjit Mitra, Baihua Zheng
Processing Transitive Nearest-Neighbor Queries In Multi-Channel Access Environments, Xiao Zhang, Wang-Chien Lee, Prasnjit Mitra, Baihua Zheng
Research Collection School Of Computing and Information Systems
Wireless broadcast is an efficient way for information dissemination due to its good scalability [10]. Existing works typically assume mobile devices, such as cell phones and PDAs, can access only one channel at a time. In this paper, we consider a scenario of near future where a mobile device has the ability to process queries using information simultaneously received from multiple channels. We focus on the query processing of the transitive nearest neighbor (TNN) search [19]. Two TNN algorithms developed for a single broadcast channel environment are adapted to our new broadcast enviroment. Based on the obtained insights, we propose …
Authenticating Multi-Dimensional Query Results In Data Publishing, Weiwei Cheng, Hwee Hwa Pang, Kian-Lee Tan
Authenticating Multi-Dimensional Query Results In Data Publishing, Weiwei Cheng, Hwee Hwa Pang, Kian-Lee Tan
Research Collection School Of Computing and Information Systems
In data publishing, the owner delegates the role of satisfying user queries to a third-party publisher. As the publisher may be untrusted or susceptible to attacks, it could produce incorrect query results. This paper introduces a mechanism for users to verify that their query answers on a multi-dimensional dataset are correct, in the sense of being complete (i.e., no qualifying data points are omitted) and authentic (i.e., all the result values originated from the owner). Our approach is to add authentication information into a spatial data structure, by constructing certified chains on the points within each partition, as well as …
A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao
A Threshold-Based Algorithm For Continuous Monitoring Of K Nearest Neighbors, Kyriakos Mouratidis, Dimitris Papadias, Spiridon Bakiras, Yufei Tao
Research Collection School Of Computing and Information Systems
Assume a set of moving objects and a central server that monitors their positions over time, while processing continuous nearest neighbor queries from geographically distributed clients. In order to always report up-to-date results, the server could constantly obtain the most recent position of all objects. However, this naïve solution requires the transmission of a large number of rapid data streams corresponding to location updates. Intuitively, current information is necessary only for objects that may influence some query result (i.e., they may be included in the nearest neighbor set of some client). Motivated by this observation, we present a threshold-based algorithm …
Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu
Query Processing In Spatial Databases Containing Obstacles, Jun Zhang, Dimitris Papadias, Kyriakos Mouratidis, Manli Zhu
Research Collection School Of Computing and Information Systems
Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbours, e-distance joins, closest pairs and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed …
Fast Filter-And-Refine Algorithms For Subsequence Selection, Beng-Chin Ooi, Hwee Hwa Pang, Hao Wang, Limsoon Wong, Cui Yu
Fast Filter-And-Refine Algorithms For Subsequence Selection, Beng-Chin Ooi, Hwee Hwa Pang, Hao Wang, Limsoon Wong, Cui Yu
Research Collection School Of Computing and Information Systems
Large sequence databases, such as protein, DNA and gene sequences in biology, are becoming increasingly common. An important operation on a sequence database is approximate subsequence matching, where all subsequences that are within some distance from a given query string are retrieved. This paper proposes a filter-and-refine algorithm that enables efficient approximate subsequence matching in large DNA sequence databases. It employs a bitmap indexing structure to condense and encode each data sequence into a shorter index sequence. During query processing, the bitmap index is used to filter out most of the irrelevant subsequences, and false positives are removed in the …