Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Computer Sciences

Information retrieval

Articles 1 - 30 of 152

Full-Text Articles in Physical Sciences and Mathematics

Dynamic Storytelling Algorithms Using Contextual Aspects Of A Large Language Model, Alireza Pasha Nouri May 2024

Open Access Theses & Dissertations

Storytelling is a set of algorithms used to create narratives by connecting documents in a sequence that accurately reflects the evolution of events and entities within a particular topic or theme. Early storytelling algorithms face challenges in encoding the progression and interconnections of information between consecutive texts, given that the conventional approaches rely primarily on connecting document pairs based on content overlap. They often neglect critical linguistic features, such as word contexts, semantics, the roles words play across different documents, and attention to the historical contexts of the underlying documents. Many existing storytelling models frequently produce story chains that, while connected …


Non-Monotonic Generation Of Knowledge Paths For Context Understanding, Pei-Chi Lo, Ee-Peng Lim Mar 2024

Research Collection School Of Computing and Information Systems

Knowledge graphs can be used to enhance text search and access by augmenting textual content with relevant background knowledge. While many large knowledge graphs are available, using them to make semantic connections between entities mentioned in the textual content remains a difficult task. In this work, we therefore introduce contextual path generation (CPG), the task of generating knowledge paths, called contextual paths, that explain the semantic connections between entities mentioned in textual documents, given a knowledge graph. To perform the CPG task well, one has to address its three challenges, namely path relevance, incomplete knowledge graph, and …


Active Discovering New Slots For Task-Oriented Conversation, Yuxia Wu, Tianhao Dai, Zhedong Zheng, Lizi Liao Jan 2024

Research Collection School Of Computing and Information Systems

Existing task-oriented conversational systems rely heavily on domain ontologies with pre-defined slots and candidate values. In practical settings, these prerequisites are hard to meet because of emerging user requirements and ever-changing scenarios. To mitigate these issues for better interaction performance, there are efforts working towards detecting out-of-vocabulary values or discovering new slots under unsupervised or semi-supervised learning paradigms. However, overemphasizing conversation data patterns alone leads these methods to yield noisy and arbitrary slot results. To facilitate pragmatic utility, real-world systems tend to provide a strictly limited human labeling quota, which offers an authoritative way …


Boosting The Item-Based Collaborative Filtering Model With Novel Similarity Measures, Hassan I. Abdalla, Ali A. Amer, Yasmeen A. Amer, Loc Nguyen, Basheer Al-Maqaleh Dec 2023

All Works

Collaborative filtering (CF), one of the most widely employed methodologies for recommender systems, has drawn considerable attention due to its effectiveness and simplicity. Nevertheless, fewer papers have been published on the item-based CF model using similarity measures than on the user-based model, owing to the model's complexity and the time required to build it. Additionally, the substantial shortcomings of the user-based measures when the item-based model is taken into account motivated us to create stronger models in this work. Moreover, the trickiest common challenge is dealing with the cold-start problem, in which users' history of item-buying behavior …


Multi-Representation Variational Autoencoder Via Iterative Latent Attention And Implicit Differentiation, Nhu Thuat Tran, Hady Wirawan Lauw Oct 2023

Research Collection School Of Computing and Information Systems

Variational Autoencoder (VAE) offers non-linear probabilistic modeling of users' preferences. While it has achieved remarkable performance at collaborative filtering, it typically samples a single vector to represent a user's preferences, which may be insufficient to capture the user's diverse interests. Existing solutions extend VAE to model multiple interests of users by resorting to a variant of the self-attentive method, i.e., employing prototypes to group items into clusters, each capturing one topic of the user's interests. Despite showing improvements, the current design could be more effective since prototypes are randomly initialized and shared across users, resulting in uninformative and non-personalized clusters. To fill the gap, …


ArduinoProg: Towards Automating Arduino Programming, Imam Nur Bani Yusuf, Diyanah Binte Abdul Jamal, Lingxiao Jiang Sep 2023

Research Collection School Of Computing and Information Systems

Writing code for Arduino poses unique challenges. A developer needs to 1) have hardware-specific knowledge about the interface configuration between the Arduino controller and the I/O hardware, 2) identify a suitable driver library for the I/O hardware, and 3) follow certain usage patterns of the driver library in order to use it properly. In this work, based on a study of real-world user queries posted in the Arduino forum, we propose ArduinoProg to address such challenges. ArduinoProg consists of three components, i.e., Library Retriever, Configuration Classifier, and Pattern Generator. Given a query, Library Retriever retrieves library names relevant to the I/O hardware identified …


Automating Arduino Programming: From Hardware Setups To Sample Source Code Generation, Imam Nur Bani Yusuf, Diyanah Binte Abdul Jamal, Lingxiao Jiang May 2023

Research Collection School Of Computing and Information Systems

An embedded system is a system consisting of software code, controller hardware, and I/O (Input/Output) hardware that performs a specific task. Developing an embedded system presents several challenges. First, the development often involves configuring hardware that requires domain-specific knowledge. Second, the library for the hardware may have API usage patterns that must be followed. To overcome such challenges, we propose a framework called ArduinoProg towards the automatic generation of Arduino applications. ArduinoProg takes a natural language query as input and outputs the configuration and API usage pattern for the hardware described in the query. Motivated by our findings on the …


Does Deep Learning Improve The Performance Of Duplicate Bug Report Detection? An Empirical Study, Yuan Jiang, Xiaohong Su, Christoph Treude, Chao Shang, Tiantian Wang Apr 2023

Research Collection School Of Computing and Information Systems

Do Deep Learning (DL) techniques actually help to improve the performance of duplicate bug report detection? Prior studies suggest that they do, if the duplicate bug report detection task is treated as a binary classification problem. However, in realistic scenarios, the task is often viewed as a ranking problem, which predicts potential duplicate bug reports by ranking based on similarities with existing historical bug reports. There is little empirical evidence to support that DL can be effectively applied to detect duplicate bug reports in the ranking scenario. Therefore, in this paper, we investigate whether well-known DL-based methods outperform classic information …


Contextual Path Retrieval: A Contextual Entity Relation Embedding-Based Approach, Pei-Chi Lo, Ee-Peng Lim Jan 2023

Research Collection School Of Computing and Information Systems

Contextual path retrieval (CPR) refers to the task of finding contextual path(s) between a pair of entities in a knowledge graph that explains the connection between them in a given context. For this novel retrieval task, we propose the Embedding-based Contextual Path Retrieval (ECPR) framework. ECPR is based on a three-component structure that includes a context encoder and path encoder that encode query context and path, respectively, and a path ranker that assigns a ranking score to each candidate path to determine the one that should be the contextual path. For context encoding, we propose two novel context encoding methods, …


Evaluation Of Geo-Spebh Algorithm Based On Bandwidth For Big Data Retrieval In Cloud Computing, Abubakar Usman Othman, Moses Timothy, Aisha Yahaya Umar, Abdullahi Salihu Audu, Boukari Souley, Abdulsalam Ya’U Gital Sep 2022

Al-Bahir Journal for Engineering and Pure Sciences

The rapid increase in the volume and speed of information created by mobile devices, along with the availability of web-based applications, has contributed considerably to the massive collection of data. Approximate Nearest Neighbor (ANN) search is essential in large databases for comparison search, offering the nearest neighbor of a given query in the fields of computer vision and pattern recognition. Many hashing algorithms have been developed to improve data management and retrieval accuracy in huge databases. However, none of these algorithms takes bandwidth into consideration, which is a significant aspect of information retrieval and pattern recognition. As a result, our …


Legion: Massively Composing Rankers For Improved Bug Localization At Adobe, Darryl Jarman, Jeffrey Berry, Riley Smith, Ferdian Thung, David Lo Aug 2022

Research Collection School Of Computing and Information Systems

Studies have estimated that, in industrial settings, developers spend between 30 and 90 percent of their time fixing bugs. As such, tools that assist in identifying the location of bugs provide value by reducing debugging costs. One such tool is BugLocator. This study initially aimed to determine if developers working on the Adobe Analytics product could use BugLocator. The initial results show that BugLocator achieves a similar accuracy on five of seven Adobe Analytics repositories and on open-source projects. However, these results do not meet the minimum applicability requirement deemed necessary by Adobe Analytics developers prior to possible adoption. Thus, …


Fairness In Information Access Systems, Michael D. Ekstrand, Anubrata Das, Robin Burke, Fernando Diaz Jul 2022

Computer Science Faculty Publications and Presentations

Recommendation, information retrieval, and other information access systems pose unique challenges for investigating and applying the fairness and non-discrimination concepts that have been developed for studying other machine learning systems. While fair information access shares many commonalities with fair classification, there are important differences: the multistakeholder nature of information access applications, the rank-based problem setting, the centrality of personalization in many cases, and the role of user response all complicate the problem of identifying precisely what types and operationalizations of fairness may be relevant.

In this monograph, we present a taxonomy of the various dimensions of fair information access and …


Digbug: Pre/Post-Processing Operator Selection For Accurate Bug Localization, Kisub Kim, Sankalp Ghatpande, Kui Liu, Anil Koyuncu, Dongsun Kim, Tegawendé F. Bissyande, Jacques Klein, Yves Le Traon Jul 2022

Research Collection School Of Computing and Information Systems

Bug localization is a recurrent maintenance task in software development. It aims at identifying relevant code locations (e.g., code files) that must be inspected to fix bugs. When such bugs are reported by users, the localization process often becomes overwhelming, as it is mostly a manual task due to the incomplete and informal information (written in natural languages) available in bug reports. The research community has therefore invested in automated approaches, notably using Information Retrieval techniques. Unfortunately, reported performance in the literature is still limited for practical usage. Our key observation, after empirically investigating a large dataset of bug reports as …


Learning Term Weights By Overfitting Pairwise Ranking Loss, Ömer Şahi̇n, İlyas Çi̇çekli̇, Gönenç Ercan Jul 2022

Turkish Journal of Electrical Engineering and Computer Sciences

A search engine strikes a balance between effectiveness and efficiency to retrieve the best documents in a scalable way. Recent deep learning-based ranker methods are proving to be effective and improving the state-of-the-art in relevancy metrics. However, as opposed to index-based retrieval methods, neural rankers like bidirectional encoder representations from transformers (BERT) do not scale to large datasets. In this article, we propose a query term weighting method that can be used with a standard inverted index without modifying it. Query term weights are learned using relevant and irrelevant document pairs for each query, using a pairwise ranking loss. The …


Automatic Keyword Assignment System For Medical Research Articles Using Nearest-Neighbor Searches, Fati̇h Di̇lmaç, Adi̇l Alpkoçak Jul 2022

Turkish Journal of Electrical Engineering and Computer Sciences

Assigning accurate keywords to research articles is an increasingly important concern. Keywords should be selected meticulously to describe the article well, since they play an important role in matching readers with research articles in order to reach a bigger audience. Improper selection of keywords may therefore attract fewer readers and shrink the article's audience. Hence, we designed and developed an automatic keyword assignment system (AKAS) for research articles based on k-nearest neighbor (k-NN) and threshold-nearest neighbor (t-NN) searches combined with information retrieval systems (IRS). It is a corpus-based method that utilizes IRS with the Medline dataset in …


Structure-Aware Visualization Retrieval, Haotian Li, Yong Wang, Aoyu Wu, Huan Wei, Huamin Qu May 2022

Research Collection School Of Computing and Information Systems

With the wide usage of data visualizations, a huge number of Scalable Vector Graphics (SVG)-based visualizations have been created and shared online. Accordingly, there has been increasing interest in exploring how to retrieve perceptually similar visualizations from a large corpus, since it can benefit various downstream applications such as visualization recommendation. Existing methods mainly focus on the visual appearance of visualizations by regarding them as bitmap images. However, the structural information intrinsic to SVG-based visualizations is ignored. Such structural information can delineate the spatial and hierarchical relationships among visual elements, and characterize visualizations thoroughly from a new perspective. …


CodeMatcher: Searching Code Based On Sequential Semantics Of Important Query Words, Chao Liu, Xin Xia, David Lo, Zhiwei Liu, Ahmed E. Hassan, Shanping Li Jan 2022

Research Collection School Of Computing and Information Systems

To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but they fail to bridge the semantic gap between query and code. An early successful deep learning (DL)-based model, DeepCS, solved this issue by learning the relationship between pairs of code methods and corresponding natural language descriptions. Two major advantages of DeepCS are the capability of understanding irrelevant/noisy keywords and capturing sequential relationships between words in query and code. In this article, we propose an IR-based model, CodeMatcher, that …


Training Wheels For Web Search: Multi-Perspective Learning To Rank To Support Children's Information Seeking In The Classroom, Garrett Allen Dec 2021

Boise State University Theses and Dissertations

Bicycle design has not changed for a long time, as bicycles are well-crafted for those who possess the skills to ride, i.e., adults. Those learning to ride, however, often need additional support in the form of training wheels. Searching for information on the Web is much like riding a bicycle, where modern search engines (the bicycle) are optimized for general use and adult users, but lack the functionality to support non-traditional audiences and environments. In this thesis, we introduce a set of training wheels in the form of a learning to rank model as augmentation for standard search engines to …


Exploratory Search With Archetype-Based Language Models, Brent D. Davis Aug 2021

Electronic Thesis and Dissertation Repository

This dissertation explores how machine learning, natural language processing and information retrieval may assist the exploratory search task. Exploratory search is a search where the ideal outcome of the search is unknown, and thus the ideal language to use in a retrieval query to match it is unavailable. Three algorithms represent the contribution of this work. Archetype-based Modeling and Search provides a way to use previously identified archetypal documents relevant to an archetype to form a notion of similarity and find related documents that match the defined archetype. This is beneficial for exploratory search as it can generalize beyond standard …


Why Don't You Act Your Age?: Recognizing The Stereotypical 8-12 Year Old Searcher By Their Search Behavior, Michael Green Aug 2021

Boise State University Theses and Dissertations

Online search engines for children are known to filter retrieved resources based on page complexity, and offer specialized functionality meant to address gaps in search literacy according to a user's age or grade. However, not every searcher grouped by these identifiers displays the same level of text comprehension, or requires the same aid with search. Furthermore, these search engines typically rely on direct feedback to ascertain these identifiers. This reliance on self-identification may cause users to accidentally misrepresent themselves. We therefore seek to recognize users from skill-based signals rather than utilizing age or grade identifiers, as skill dictates …


Into The Unknown: Exploration Of Search Engines' Responses To Users With Depression And Anxiety, Ashlee Milton Aug 2021

Boise State University Theses and Dissertations

Mental health disorders (MHD) are a rising, yet stigmatized, topic. With statistics reporting that one in five adults in the United States will be afflicted by an MHD in their lifetime, researchers have begun exploring the behavioral nuances that emerge from interactions of these individuals with persuasive technologies, mainly social media. Yet, there is a gap in the analysis pertaining to a persuasive technology that is part of their everyday lives: search engines (SE). Each day, users with MHD embark on information seeking journeys using SE. Every step of the search process, for better or worse, has the potential to …


Self-Supervised Contrastive Learning For Code Retrieval And Summarization Via Semantic-Preserving Transformations, Duy Quoc Nghi Bui, Yijun Yu, Lingxiao Jiang Jul 2021

Research Collection School Of Computing and Information Systems

We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representations of code, which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar …


Bridging The Simulation-To-Reality Gap: Adapting Simulation Environment For Object Recognition, Hardik Yogesh Sonetta Jul 2021

Electronic Theses and Dissertations

Rapid advancements in object recognition have created a huge demand for labeled datasets for the task of training, testing, and validation of different techniques. Due to the wide range of applications, object models in the datasets need to cover both variations in geometric features and diverse conditions in which sensory inputs are obtained. Also, the need to manually label the object models is cumbersome. As a result, it becomes difficult for researchers to gain access to adequate datasets for the development of new methods or algorithms. In comparison, computer simulation has been considered a cost-effective solution to generate simulated data …


5th KidRec Workshop: Search And Recommendation Technology Through The Lens Of A Teacher, Monica Landoni, Theo Huibers, Maria Soledad Pera, Jerry Alan Fails Jun 2021

Computer Science Faculty Publications and Presentations

In this past year, the role of technology to support education has been more prominent than ever. This has prompted us to focus the 5th Edition of the International and Interdisciplinary Perspectives on Children & Recommender and Information Retrieval Systems (KidRec) around a major stakeholder when it comes to technology adoption for the classroom: the teacher. Much like in the previous editions of the workshop, our priority remains understanding what is good when it comes to information retrieval systems for children, this time from the perspectives of teachers. In order to control the scope of our discussion and …


Neural Methods For Answer Passage Retrieval Over Sparse Collections, Daniel Cohen Apr 2021

Doctoral Dissertations

Recent advances in machine learning have allowed information retrieval (IR) techniques to advance beyond the stage of handcrafting domain specific features. Specifically, deep neural models incorporate varying levels of features to learn whether a document answers the information need of a query. However, these neural models rely on a large number of parameters to successfully learn a relation between a query and a relevant document.

This reliance on a large number of parameters, combined with current optimization methods that rely on small updates, necessitates numerous samples to allow the neural model to converge on an effective relevance function. This …


Estimation Of Fair Ranking Metrics With Incomplete Judgments, Ömer Kırnap, Fernando Diaz, Asia Biega, Michael Ekstrand, Ben Carterette, Emine Yilmaz Apr 2021

Computer Science Faculty Publications and Presentations

There is increasing attention to evaluating the fairness of search system ranking decisions. Fair ranking metrics often consider the membership of items in particular groups, often identified using protected attributes such as gender or ethnicity. To date, these metrics typically assume the availability and completeness of protected attribute labels of items. However, the protected attributes of individuals are rarely present, limiting the application of fair ranking metrics in large scale systems. In order to address this problem, we propose a sampling strategy and estimation technique for four fair ranking metrics. We formulate a robust and unbiased estimator which can operate even …


Building And Using Digital Libraries For ETDs, Edward A. Fox Mar 2021

The Journal of Electronic Theses and Dissertations

Despite the high value of electronic theses and dissertations (ETDs), the global collection has seen limited use. To extend such use, a new approach to building digital libraries (DLs) is needed. Fortunately, recent decades have seen that a vast amount of “gray literature” has become available through a diverse set of institutional repositories as well as regional and national libraries and archives. Most of the works in those collections include ETDs and are often freely available in keeping with the open-access movement, but such access is limited by the services of supporting information systems. As explained through a set of …


Information Retrieval-Based Bug Localization Approach With Adaptive Attribute Weighting, Mustafa Erşahi̇n, Semi̇h Utku, Deni̇z Kilinç, Buket Erşahi̇n Jan 2021

Turkish Journal of Electrical Engineering and Computer Sciences

Software quality assurance is one of the crucial factors for the success of software projects. Bug fixing has an essential role in software quality assurance, and bug localization (BL) is the first step of this process. BL is difficult and time-consuming since the developers should understand the flow, coding structure, and the logic of the program. Information retrieval-based bug localization (IRBL) uses the information of bug reports and source code to locate the section of code in which the bug occurs. It is difficult to apply other tools because of the diversity of software development languages, design patterns, and development …


Neural Representations Of Concepts And Texts For Biomedical Information Retrieval, Jiho Noh Jan 2021

Theses and Dissertations--Computer Science

Information retrieval (IR) methods are an indispensable tool in the current landscape of exponentially increasing textual data, especially on the Web. A typical IR task involves fetching and ranking a set of documents (from a large corpus) in terms of relevance to a user's query, which is often expressed as a short phrase. IR methods are the backbone of modern search engines where additional system-level aspects including fault tolerance, scale, user interfaces, and session maintenance are also addressed. In addition to fetching documents, modern search systems may also identify snippets within the documents that are potentially most relevant to the …


A Set Theory Based Similarity Measure For Text Clustering And Classification, Ali A. Amer, Hassan I. Abdalla Dec 2020

All Works

Similarity measures have long been utilized in information retrieval and machine learning domains for multiple purposes, including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, no single measure has been recorded as both highly effective and efficient. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a …