Open Access. Powered by Scholars. Published by Universities.®
Databases and Information Systems Commons™
Articles 31 - 60 of 83
Full-Text Articles in Databases and Information Systems
Predictive Analysis For Cloud Infrastructure Metrics, Paridhi Agrawal
Master's Projects
In a cloud computing environment, enterprises have the flexibility to request resources according to their application demands. This elastic feature of cloud computing makes it an attractive option for enterprises to host their applications on the cloud. Cloud providers usually exploit this elasticity by auto-scaling the application resources for quality assurance. However, there is a setup-time delay that may take minutes between the demand for a new resource and it being prepared for utilization. This causes the static resource provisioning techniques, which request allocation of a new resource only when the application breaches a specific threshold, to be slow and …
Gradubique: An Academic Transcript Database Using Blockchain Architecture, Thinh Nguyen
Master's Projects
Blockchain has been widely adopted in the last few years even though it is in its infancy. The first well-known application built on blockchain technology was Bitcoin, which is a decentralized and distributed ledger to record crypto-currency transactions. All of the transactions in Bitcoin are anonymously transferred and validated by participants in the network. The Bitcoin protocol and its operations are so reliable that technologists have been inspired to enhance blockchain technologies and deploy them outside of the crypto-currency world. The demand for private and non-crypto-currency solutions has surged among consortiums because of the security and fault-tolerant features of blockchain. …
Recommender Systems For Large-Scale Social Networks: A Review Of Challenges And Solutions, Magdalini Eirinaki, Jerry Gao, Iraklis Varlamis, Konstantinos Tserpes
Faculty Publications
Social networks have become very important for networking, communications, and content sharing. Social networking applications generate a huge amount of data on a daily basis and social networks constitute a growing field of research, because of the heterogeneity of data and structures formed in them, and their size and dynamics. When this wealth of data is leveraged by recommender systems, the resulting coupling can help address interesting problems related to social engagement, member recruitment, and friend recommendations. In this work we review the various facets of large-scale social recommender systems, summarizing the challenges and interesting problems and discussing some of the …
Support Vector Machines For Image Spam Analysis, Aneri Chavda, Katerina Potika, Fabio Di Troia, Mark Stamp
Faculty Publications, Computer Science
Email is one of the most common forms of digital communication. Spam is unsolicited bulk email, while image spam consists of spam text embedded inside an image. Image spam is used as a means to evade text-based spam filters, and hence image spam poses a threat to email-based communication. In this research, we analyze image spam detection using support vector machines (SVMs), which we train on a wide variety of image features. We use a linear SVM to quantify the relative importance of the features under consideration. We also develop and analyze a realistic “challenge” dataset that illustrates the limitations …
Question Type Recognition Using Natural Language Input, Aishwarya Soni
Master's Projects
Recently, numerous specialists have concentrated on applying Natural Language Processing (NLP) systems in various domains, such as data extraction and content mining. One of the difficulties with these technologies is building a precise Question Answering (QA) system. Question type recognition is the most significant task in a QA system such as a chatbot. Organizations such as the National Institute of Standards and Technology (NIST) host the Text REtrieval Conference (TREC) series, which holds a competition every year to encourage and improve techniques for information retrieval from a large corpus of text. When a user …
Improving Text Classification With Word Embedding, Lihao Ge
Master's Projects
One challenge in text classification is that it is hard to perform feature reduction based on the meaning of the features. An improper feature reduction may even worsen the classification accuracy. Word2Vec, a word embedding method, has recently been gaining popularity due to its high precision in analyzing the semantic similarity between words at relatively low computational cost. However, only a limited number of researchers have focused on feature reduction using Word2Vec. In this project, we developed a Word2Vec-based method to reduce the feature size while increasing the classification accuracy. The feature reduction is achieved by loosely …
An Open Source Discussion Group Recommendation System, Sarika Padmashali
Master's Projects
A recommendation system analyzes user behavior on a website to make suggestions about what a user should do in the future on the website. It basically tries to predict the “rating” or “preference” a user would have for an action. Yioop is an open source search engine, wiki system, and user discussion group system managed by Dr. Christopher Pollett at SJSU. In this project, we have developed a recommendation system for Yioop where users are given suggestions about the threads and groups they could join based on their user history. We have used collaborative filtering techniques to make recommendations and …
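The abstract names collaborative filtering but is cut off before any details, so the following is only a minimal sketch of one common user-based variant, not the project's implementation. It assumes group membership is the only signal and uses Jaccard similarity between users; the names `memberships` and `recommend_groups` and the sample data are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_groups(target, memberships, top_n=3):
    """Suggest groups the target user has not joined yet.

    Each candidate group is scored by the summed similarity of the
    other users who already belong to it (user-based filtering).
    """
    own = memberships[target]
    scores = defaultdict(float)
    for user, groups in memberships.items():
        if user == target:
            continue
        sim = jaccard(own, groups)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for group in groups - own:
            scores[group] += sim
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [group for group, _ in ranked[:top_n]]


# Hypothetical membership data: user -> set of groups joined.
memberships = {
    "alice": {"search", "php"},
    "bob": {"search", "php", "crawling"},
    "carol": {"search", "wiki"},
    "dave": {"music"},
}

print(recommend_groups("alice", memberships))  # ['crawling', 'wiki']
```

Here "crawling" ranks above "wiki" because bob shares two of alice's groups while carol shares only one.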
Adding Differential Privacy In An Open Board Discussion Board System, Pragya Rana
Master's Projects
This project implements a privacy system for statistics generated by the Yioop search and discussion board system. Statistical data for such a system consists of various counts, sums, and averages that might be displayed for groups, threads, etc. When statistical data is made publicly available, there is no guarantee of preserving the privacy of an individual. Ideally, any data extracted should not reveal any sensitive information about an individual. In order to help achieve this, we implemented a Differential Privacy mechanism for Yioop. Differential privacy preserves privacy up to some controllable parameters of the number of items or individuals being …
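The abstract does not say which differential privacy mechanism was implemented. As an illustration only, the sketch below assumes the standard Laplace mechanism applied to a count query, where noise with scale `sensitivity / epsilon` is added to the true value. The function names and the choice of `epsilon` are assumptions, not details from the project.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return a noisy count under the (assumed) Laplace mechanism.

    A count changes by at most 1 when one individual is added or
    removed, so its sensitivity defaults to 1. Smaller epsilon means
    more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # seeded only to make the demo repeatable
noisy_posts = private_count(true_count=100, epsilon=0.5, rng=rng)
print(round(noisy_posts, 2))
```

Individual answers are perturbed, but the noise has zero mean, so aggregate statistics over many queries stay close to the true values.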
Document Classification Using Machine Learning, Ankit Basarkar
Master's Projects
To perform document classification algorithmically, documents need to be represented such that they are understandable to the machine learning classifier. The report discusses the different types of feature vectors through which documents can be represented and later classified. The project aims at comparing the Binary, Count and TfIdf feature vectors and their impact on document classification. To test how well each of the three mentioned feature vectors performs, we used the 20-newsgroup dataset and converted the documents to all three feature vectors. For each feature vector representation, we trained the Naïve Bayes classifier and then tested the generated classifier …
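To make the three representations concrete, here is a small sketch that builds Binary, Count and TF-IDF vectors for two toy documents. It is not the project's code: the tokenization (whitespace split), the toy documents, and the unsmoothed `log(N / df)` form of IDF are assumptions, and libraries often use smoothed variants.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of every distinct token across tokenized documents."""
    return sorted({token for doc in docs for token in doc})


def binary_vector(doc, vocab):
    """1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


def count_vector(doc, vocab):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf_vectors(docs, vocab):
    """Term frequency times inverse document frequency, log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero on empty docs
        vectors.append([counts[t] / length * idf[t] for t in vocab])
    return vectors


docs = [
    "the game was a great game".split(),
    "the chip runs hot".split(),
]
vocab = build_vocabulary(docs)
print(vocab)
print(binary_vector(docs[0], vocab))  # [1, 0, 1, 1, 0, 0, 1, 1]
print(count_vector(docs[0], vocab))   # [1, 0, 2, 1, 0, 0, 1, 1]
print(tfidf_vectors(docs, vocab)[0])
```

Note how "the", which appears in both documents, gets a TF-IDF weight of zero, while the Binary and Count vectors treat it like any other term.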
Headline Generation Using Deep Neural Networks, Dhruven Vora
Master's Projects
News headline generation is one of the important text summarization tasks. Human-generated news headlines are generally intended to catch the eye rather than provide useful information. There have been many approaches to generating meaningful headlines, using either neural networks or linguistic features. In this report, we propose a novel approach that integrates Hedge Trimmer, a grammar-based extractive summarization system, with a deep neural network abstractive summarization system to generate meaningful headlines. We analyze the results against a current recurrent-neural-network-based headline generation system.
Reducing Query Latency For Information Retrieval, Swapnil Satish Kamble
Master's Projects
As the world is moving towards Big Data, NoSQL (Not only SQL) databases are gaining much more popularity. A key advantage of NoSQL databases is that they facilitate faster retrieval of huge volumes of data than traditional relational databases. This project deals with one such popular NoSQL database, Apache HBase. It performs quite efficiently when retrieving information using the rowkey (similar to a primary key in a SQL database). But in cases where one needs to get information based on non-rowkey columns, the response latency is higher than what we …
A Chatbot Framework For Yioop, Harika Nukala
Master's Projects
Over the past few years, messaging applications have become more popular than social networking sites. Instead of using a specific application or website to access some service, chatbots are created on messaging platforms to allow users to interact with companies’ products and also give assistance as needed. In this project, we designed and implemented a chatbot framework for Yioop. The goal of the Chatbot Framework for Yioop project is to provide a platform for developers in Yioop to build and deploy chatbot applications. A chatbot is a web service that can converse with users using artificial intelligence in messaging platforms. …
Named Entity Recognition And Classification For Natural Language Inputs At Scale, Shreeraj Dabholkar
Master's Projects
Natural language processing (NLP) is a technique by which computers can analyze, understand, and derive meaning from human language. Phrases in a body of natural text that represent names, such as those of persons, organizations or locations, are referred to as named entities. Identifying and categorizing these named entities is still a challenging task, on which research has been carried out for many years. In this project, we build a supervised-learning-based classifier which can perform named entity recognition and classification (NERC) on input text and implement it as part of a chatbot application. The implementation is then scaled …
Intelligent Web Crawler For Semantic Search Engine, Shujia Zhang
Master's Projects
A Semantic Search Engine (SSE) is a program that produces semantic-oriented concepts from the Internet. A web crawler is the front end of our SSE; its primary goal is to supply important and necessary information to the data analysis component of SSE. The main function of the analysis component is to produce the concepts (moderately frequent finite sequences of keywords) from the input; it uses some variants of TF-IDF as a primary tool to remove stop words. However, filtering out stop words with TF-IDF is very expensive. The goal of this project is …
Deep Data Analysis On The Web, Xuanyu Liu
Master's Projects
Search engines are well known to people all over the world. People prefer to use keyword searching to open websites or retrieve information rather than type typical URLs. Therefore, collecting finite sequences of keywords that represent important concepts within a set of authors is important; in other words, we need knowledge mining. We use a simplicial concept method to speed up concept mining. A previous CS 298 project studied this approach under Dr. Lin. This method is very fast: for example, to mine the concepts, FP-growth takes 876 seconds on a database with 1257 columns and 65k rows; simplicial complex only …
Handling Relationships In A Wiki System, Yashi Kamboj
Master's Projects
Wiki software enables users to manage content on the web, and create or edit web pages freely. Most wiki systems support the creation of hyperlinks on pages and have a simple text syntax for page formatting. A common, more advanced feature is to allow pages to be grouped together as categories. Currently, wiki systems support categorization of pages in a very traditional way by specifying whether a wiki page belongs to a category or not. Categorization represents a unary relationship and is not sufficient to represent n-ary relationships, those involving links between multiple wiki pages.
In this project, we extend Yioop, …
Predicting User's Future Requests Using Frequent Patterns, Marc Nipuna Dominic Savio
Master's Projects
In this research, we predict a user's future requests using a data mining algorithm. Usage of the World Wide Web has resulted in a huge amount of data, and handling this data is getting harder day by day. All this data is stored as web logs, and each web log is stored in a different format with different field names, such as search string, URL with its corresponding timestamp, user IDs that help with session identification, status code, etc. Whenever a user requests a URL, there is a delay in getting the requested page, and sometimes the request is denied. Our …
Web-Based Integrated Development Environment, Hien T. Vu
Master's Projects
As tablets become more powerful and more economical, students are attracted to them and are moving away from desktops and laptops. Their compact size and easy to use Graphical User Interface (GUI) reduce the learning and adoption barriers for new users. This also changes the environment in which undergraduate Computer Science students learn how to program. Popular Integrated Development Environments (IDE) such as Eclipse and NetBeans require disk space for local installations as well as an external compiler. These requirements cannot be met by current tablets and thus drive the need for a web-based IDE. There are also many other …
Analyzing Clustered Web Concepts With Homology, Eric Nam
Master's Projects
As data is being mined more and more from the Internet today, Data Science has become an important field of computing to make that data useful. Data Science allows people to turn all of that data into structured knowledge that is easily utilized, validated, and understandable. There are many known theories to analyze data, but this project will focus on a recently introduced method: analyzing text data with homology from mathematics to understand relationships between keyword-sets.
Using structures of algebraic topology as a starting point, keyword-sets in the text are represented by simplexes based on what they are and what …
Analyze Large Multidimensional Datasets Using Algebraic Topology, David Le
Master's Projects
This paper presents an efficient algorithm to extract knowledge from high-dimensionality, high-complexity datasets using algebraic topology, namely simplicial complexes. Based on the concept of isomorphism of relations, our method turns a relational table into a geometric object (a simplicial complex is a polyhedron). So, conceptually, association rule searching is turned into a geometric traversal problem. By leveraging the core concepts behind simplicial complexes, we use a technique (new in computer science) that improves performance over existing methods and uses far less memory. It was designed and developed with a strong emphasis on scalability, reliability, and extensibility. This paper …
Efficient Pair-Wise Similarity Computation Using Apache Spark, Parineetha Gandhi Tirumali
Master's Projects
Entity matching is the process of identifying different manifestations of the same real-world entity. These entities can be referred to as objects (strings) or data instances. These entities are in turn split over several databases or clusters based on the signatures of the entities. When entity matching algorithms are performed on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once. The number of comparisons for any two entities depends on the number of common signatures or keys they possess. This affects the performance of any entity matching algorithm. This paper …
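The redundancy described above can be shown with a toy example. The abstract is cut off before it states the paper's remedy, so this sketch only illustrates the problem and one straightforward fix (remembering pairs already seen); it is not the authors' Spark-based approach. The block names and entity IDs are hypothetical.

```python
from itertools import combinations


def candidate_pairs_naive(blocks):
    """Pair up entities inside every block; shared signatures repeat pairs."""
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


def candidate_pairs_deduplicated(blocks):
    """Same pairing, but each unordered pair is emitted only once."""
    seen = set()
    for members in blocks.values():
        for pair in combinations(sorted(members), 2):
            seen.add(pair)
    return sorted(seen)


# Hypothetical blocks: signature -> entity IDs that carry that signature.
blocks = {
    "sig_hunt": {"e1", "e2", "e3"},
    "sig_helen": {"e1", "e2"},
}

naive = candidate_pairs_naive(blocks)
unique = candidate_pairs_deduplicated(blocks)
print(len(naive), len(unique))  # 4 3
```

Entities `e1` and `e2` share two signatures, so the naive approach compares them twice. On a single machine a set is enough to avoid this; in a distributed setting the seen-set would itself have to be partitioned, which is where the real difficulty lies.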
Hybrid Similarity Function For Big Data Entity Matching With R-Swoosh, Vimal Chandra Gorijala
Master's Projects
Entity Matching (EM) is the problem of determining if two entities in a data set refer to the same real-world object. For example, it decides if two given mentions in the data, such as “Helen Hunt” and “H. M. Hunt”, refer to the same real-world entity by using different similarity functions. This problem plays a key role in information integration, natural language understanding, information processing on the World-Wide Web, and on the emerging Semantic Web. This project deals with the similarity functions and thresholds utilized in them to determine the similarity of the entities. The work contains two major parts: …
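The abstract mentions similarity functions and thresholds without naming them, so the following is a sketch under assumptions rather than the project's hybrid function. It blends a character-level score from Python's `difflib.SequenceMatcher` with a token-level Jaccard score; the equal weighting and the 0.8 threshold are arbitrary illustrative choices.

```python
from difflib import SequenceMatcher


def token_jaccard(a, b):
    """Jaccard overlap of the lowercase whitespace tokens of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def hybrid_similarity(a, b, char_weight=0.5):
    """Weighted blend of character-level and token-level similarity."""
    char_score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return char_weight * char_score + (1 - char_weight) * token_jaccard(a, b)


def same_entity(a, b, threshold=0.8):
    """Declare a match when the blended score reaches the threshold."""
    return hybrid_similarity(a, b) >= threshold


print(hybrid_similarity("Helen Hunt", "H. M. Hunt"))
print(hybrid_similarity("Helen Hunt", "Zzz Qqq"))
print(same_entity("Helen Hunt", "Helen Hunt"))  # True
```

The two mentions from the abstract score well above an unrelated name but may still fall below a strict threshold, which shows why the choice of function and threshold matters.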
Library Writers Reward Project, Saravana Kumar Gajendran
Master's Projects
Open-source library development exploits the distributed intelligence of participants in Internet communities. Nowadays, contribution to the open-source community is fading [16] (Stackalytics, 2016) as there is not much recognition for library writers. They can start exploring ways to generate revenue as they actively contribute to the open-source community.
This project helps library writers to generate revenue in the form of bitcoins for their contribution. Our solution to generate revenue for library writers is to integrate bitcoin mining with existing JavaScript libraries, such as jQuery. More use of the library leads to more revenue for the library writers. It uses the …
Processing Posting Lists Using OpenCL, Radha Kotipalli
Master's Projects
One of the main requirements of internet search engines is the ability to retrieve relevant results with faster response times. Yioop is an open source search engine designed and developed in PHP by Dr. Chris Pollett. The goal of this project is to explore the possibilities of enhancing the performance of Yioop by substituting resource-intensive existing PHP functions with C based native PHP extensions and the parallel data processing technology OpenCL. OpenCL leverages the Graphical Processing Unit (GPU) of a computer system for performance improvements.
Some of the critical functions in search engines are resource-intensive in terms of processing power, …
Concept Based Search Engine: Concept Creation, Aishwarya Rastogi
Master's Projects
Data on the internet is increasing exponentially every single second. There are billions and billions of documents on the World Wide Web (The Internet). Each document on the internet contains multiple concepts (an abstract or general idea inferred from specific instances).
In this paper, we show how we created and implemented an algorithm for extracting concepts from a set of documents. These concepts can be used by a search engine for generating search results to cater to the needs of the user. The search result will then be more targeted than the usual keyword search.
The main problem was to extract …
Mining Concept In Big Data, Jingjing Yang
Master's Projects
To use big data fruitfully, data mining is necessary. There are two well-known methods: one is based on the apriori principle, and the other is based on the FP-tree. In this project we explore a new approach based on the simplicial complex, which is a combinatorial form of polyhedron used in algebraic topology. Our approach, similar to FP-tree, is top-down; at the same time, it is based on the apriori principle in geometric form, called the closed condition in a simplicial complex. Our method is almost 300 times faster than FP-growth on a real-world database using an SJSU laptop. The database …
A Scalable Search Engine Aggregator, Pooja Mishra
Master's Projects
The ability to display different media sources in an appropriate way is an integral part of search engines such as Google, Yahoo, and Bing, as well as social networking sites like Facebook, etc. This project explores and implements various media-updating features of the open source search engine Yioop [1]. These include news aggregation, video conversion and email distribution. An older, preexisting news update feature of Yioop was modified and scaled so that it can work on many machines. We redesigned and modified the user interface associated with a distributed news updater feature in Yioop. This project also introduced a video …
An Open Source Advertisement Server, Pushkar Umaranikar
Master's Projects
This report describes a new online advertisement system and its implementation for the Yioop open source search engine. This system was implemented for my CS298 project. It supports both selling advertisements and displaying them within search results. The selling of advertisements is done using a novel auction system, which we describe in this paper. With this auction system, it is possible to create an advertisement, attach keywords to it, and add it to the advertisement inventory. An advertisement is displayed on a search results page if the search keyword matches the keywords attached to the advertisement. Display of advertisements is …
Index Strategies For Efficient And Effective Entity Search, Huy T. Vu
Master's Projects
The volume of structured data has grown rapidly in recent years, as the data entity emerged as an abstraction that captures almost every piece of data. As a result, searching for a desired piece of information on the web can be a challenge in terms of time and relevancy because the number of matching entities can be very large for a given query. This project is concerned with the efficiency and effectiveness of such entity queries. The work contains two major parts: implementing inverted indexing strategies so that queries can be searched in minimal time, and ranking results based on features that are independent …
Context-Based Autosuggest On Graph Data, Hai Nguyen
Master's Projects
Autosuggest is an important feature in any search application. Currently, most applications only suggest a single term based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the …