Open Access. Powered by Scholars. Published by Universities.®

Physical Sciences and Mathematics Commons

Articles 1 - 30 of 52

Full-Text Articles in Physical Sciences and Mathematics

Brooks' Versus Linus' Law: An Empirical Test Of Open Source Projects, Charles M. Schweik, Robert English Oct 2007

National Center for Digital Government

Free/Libre and Open Source Software (FOSS) projects are Internet-based collaborations consisting of volunteers and paid professionals who come together to create computer software...


Reflections Of An Online Geographic Information Systems Course Based On Open Source Software, Charles M. Schweik, Maria Fernandez, Michael P. Hamel, Prakash Kashwan, Quentin Lewis, Alexander Stepanov Oct 2007

National Center for Digital Government

This SSCORE report summarizes our experience offering an online introductory course on Geographic Information Systems (GIS) that uses free/libre and open source software (FOSS). The two primary objectives were to (1) reach students in developing countries, and (2) help advance the development of an “open content” GIS curriculum as part of the “Open Source Geospatial Foundation” (OSGeo.org) educational effort. Course design, key software (QGIS, GRASS, PostgreSQL/PostGIS), and online delivery methods are described. Results and the factors leading to a low course completion rate are discussed. Contributing factors include: (1) a for-credit versus no-credit decision; and (2) technical issues. Recommendations for …


Better Public Services For Growth And Jobs, Jane E. Fountain Oct 2007

National Center for Digital Government

No abstract provided.


Increasing Social Capital For Disaster Response Through Social Networking Services (SNS) In Japanese Local Governments, Alexander Schellong Aug 2007

National Center for Digital Government

Researchers have argued that social networks within a community have positive effects on people’s behavior in the four stages of disaster. The Japanese government is testing Social Networking Services (SNS) at the municipal level with the intention of improving community building, democratic processes, and disaster management. This paper presents results from two case studies of local SNS in Yatsushiro city, Kumamoto prefecture, and Nagaoka city, Niigata prefecture. While Yatsushiro’s solution seems to be sustainable, Nagaoka’s SNS is in decline. Both have to compete with popular SNS such as Mixi and lack critical mass. Based on the reviewed literature I discuss …


Open-Source Collaboration In The Public Sector: The Need For Leadership And Value, Michael P. Hamel Jun 2007

National Center for Digital Government

From the executive summary: The “open-source” movement in information technology is largely based on innovative licensing schemes that encourage collaboration and sharing and promise reduced cost of ownership, customizable software, and the ability to extract data in a usable format. Government organizations are becoming increasingly intolerant of the forced migrations (upgrades) and the closed or incompatible data standards that typically come with proprietary software. To combat the problems of interoperability and cost, governments around the globe are beginning to consider, and in some cases even require, the use of open-source software (Hahn, 2002; Wong, 2004).


Exterminator: Automatically Correcting Memory Errors With High Probability, Gene Novark Jun 2007

Computer Science Department Faculty Publication Series

Programs written in C and C++ are susceptible to memory errors, including buffer overflows and dangling pointers. These errors, which can lead to crashes, erroneous execution, and security vulnerabilities, are notoriously costly to repair. Tracking down their location in the source code is difficult, even when the full memory state of the program is available. Once the errors are finally found, fixing them remains challenging: even for critical security-sensitive bugs, the average time between initial reports and the issuance of a patch is nearly one month. We present Exterminator, a system that automatically corrects heap-based memory errors without programmer intervention. …


Tragedy Of The FOSS Commons? Investigating The Institutional Designs Of Free/Libre And Open Source Software Projects, Charles M. Schweik, Robert English Feb 2007

National Center for Digital Government

Free/Libre and Open Source Software (FOSS) projects are a form of Internet-based commons. Since 1968, when Garrett Hardin published his famous article “Tragedy of the Commons” in the journal Science, there has been significant interest in understanding how to manage commons appropriately, particularly in environmental fields. An important distinction between natural resource commons and FOSS commons is that the “tragedy” to be avoided in natural resources is overharvesting and the potential destruction of the resource. In FOSS commons the “tragedy” to be avoided is project abandonment and a “dead” project. Institutions – defined as informal norms, more formalized rules, and …


Identifying Success And Tragedy Of FLOSS Commons: A Preliminary Classification Of SourceForge.net Projects, Robert English, Charles M. Schweik Feb 2007

National Center for Digital Government

Free/Libre and Open Source Software (FLOSS) projects are a form of commons where individuals work collectively to produce software that is a public, rather than a private, good. The famous phrase “Tragedy of the Commons” describes a situation where a natural resource commons, such as a pasture or a water supply, is depleted through overuse. The tragedy in FLOSS commons is distinctly different: it occurs when collective action ceases before a software product is produced or reaches its full potential. This paper builds on previous work about defining success in FLOSS projects by taking a collective action perspective. …


The Digital Divide Metaphor: Understanding Paths To IT Literacy, Enrico Ferro, Natalie C. Helbig, J. Ramon Gil-Garci Jan 2007

National Center for Digital Government

Not having access, or having only disadvantaged access, to information in an information-based society may be considered a handicap (Compaine, 2001). In the last two decades, scholars have gradually refined the conceptualization of the digital divide, moving from a dichotomous model based mainly on access to a multidimensional model that accounts for differences in usage levels and perspectives. While these models have become more complex, research has continued to focus mainly on deepening the understanding of demographic and socioeconomic differences between adopters and non-adopters. In doing so, it has largely overlooked the process of basic IT skills acquisition. This paper presents a metaphorical interpretation …


Transfer Function And Impulse Response Synthesis Using Classical Techniques, Sonal S. Khilari Jan 2007

Masters Theses 1911 - February 2014

This thesis project presents a MATLAB-based application designed to synthesize any arbitrary stable transfer function. Our application is based on the Cauer synthesis procedure. It has an interactive front end that accepts input either as residues and poles of a transfer function, as coefficients of the numerator and denominator of the transfer impedance, or as samples of an impulse response. The program synthesizes either a singly or doubly resistively terminated LC ladder network. Our application displays a chart showing the variation of stability of an impulse response with the …
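
For orientation, the classical Cauer-I procedure the thesis builds on expands an LC driving-point impedance as a continued fraction, peeling off one reactive element per division step. The sketch below is an illustrative Python version of that textbook expansion, not code from the thesis; the function and variable names are invented, and the resistive terminations mentioned in the abstract are left out.

    import numpy as np

    def cauer1_lc_ladder(num, den, tol=1e-9):
        """Continued-fraction (Cauer-I) expansion of an LC driving-point
        impedance Z(s) = num(s)/den(s).  Coefficients are listed highest
        degree first, and the two degrees are assumed to differ by one.
        Returned values alternate between series inductances and shunt
        capacitances (starting with a series L when deg(num) = deg(den) + 1)."""
        N = np.trim_zeros(np.asarray(num, float), "f")
        D = np.trim_zeros(np.asarray(den, float), "f")
        if len(N) < len(D):              # expand 1/Z instead; L and C roles swap
            N, D = D, N
        elements = []
        while len(N) > len(D) > 0:
            a = N[0] / D[0]              # leading-term division gives the a*s term
            elements.append(a)
            term = np.concatenate([a * D, [0.0]])     # a*s*D(s)
            R = np.polysub(N, term)                   # remainder polynomial
            R = np.trim_zeros(np.where(np.abs(R) < tol, 0.0, R), "f")
            if R.size == 0:              # expansion terminated exactly
                break
            N, D = D, R                  # Z = a*s + 1/(D(s)/R(s)); continue on D/R
        return elements

    # Example: Z(s) = (s^3 + 2s)/(s^2 + 1) expands to L = 1 H, C = 1 F, L = 1 H.
    print(cauer1_lc_ladder([1, 0, 2, 0], [1, 0, 1]))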


Mining A Digital Library For Influential Authors, David Mimno, Andrew McCallum Jan 2007

Andrew McCallum

When browsing a digital library of research papers, it is natural to ask which authors are most influential in a particular topic. We present a probabilistic model that ranks authors based on their influence in particular areas of scientific research. This model combines several sources of information: citation information between documents as represented by PageRank scores, authorship data gathered through automatic information extraction, and the words in paper abstracts. We propose a topic model on the words, and compare performance versus a smoothed language model by assessing the number of major award winners in the resulting ranked list of researchers.
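
The abstract combines three signals: citation PageRank, authorship, and abstract words. As a rough, hypothetical illustration of that recipe, rather than the paper's actual probabilistic model, one could weight each paper's PageRank by its topic proportion and aggregate over authors:

    import networkx as nx
    from collections import defaultdict

    def rank_authors_by_topic(citations, paper_authors, paper_topic_weight):
        """Toy illustration (not the paper's model): score each author for one
        topic by summing the PageRank of their papers, weighted by how much
        each paper is about that topic.
        citations          -- list of (citing_id, cited_id) pairs
        paper_authors      -- paper_id -> list of author names
        paper_topic_weight -- paper_id -> P(topic | paper) from any topic model"""
        graph = nx.DiGraph(citations)
        pagerank = nx.pagerank(graph)          # citation-based importance per paper
        scores = defaultdict(float)
        for paper, authors in paper_authors.items():
            w = pagerank.get(paper, 0.0) * paper_topic_weight.get(paper, 0.0)
            for author in authors:             # split credit among co-authors
                scores[author] += w / max(len(authors), 1)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)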


Resource-Bounded Information Gathering For Correlation Clustering, Pallika Kanani, Andrew McCallum Jan 2007

Andrew McCallum

We present a new class of problems, called Resource-bounded Information Gathering for Correlation Clustering. Our goal is to perform correlation clustering on a graph with incomplete information. The missing information can be obtained by querying an external source under constrained resources. The problem is to develop the most effective strategy for querying to achieve optimal clustering. We describe the problem using entity resolution as an example task.


Expertise Modeling For Matching Papers With Reviewers, David Mimno, Andrew McCallum Jan 2007

Andrew McCallum

An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based approaches with a novel topic model, Author-Persona-Topic (APT). In this model, each author can write under one or more ``personas,'' which are represented as independent distributions over hidden topics. Examples of previous papers written by prospective reviewers are gathered from the Rexa database, which extracts …


Sparse Message Passing Algorithms For Weighted Maximum Satisfiability, Aron Culotta, Andrew McCallum, Bart Selman, Ashish Sabharwal Jan 2007

Andrew McCallum

Weighted maximum satisfiability is a well-studied problem that has important applicability to artificial intelligence (for instance, MPE inference in Bayesian networks). General-purpose stochastic search algorithms have proven to be accurate and efficient for large problem instances; however, these algorithms largely ignore structural properties of the input. For example, many problems are highly clustered, in that they contain a collection of loosely coupled subproblems (e.g. pipelines of NLP tasks). In this paper, we propose a message passing algorithm to solve weighted maximum satisfiability problems that exhibit this clustering property. Our algorithm fuses local solutions to each subproblem into a global solution …


Semi-Supervised Classification With Hybrid Generative/Discriminative Methods, Gregory Druck, Chris Pal, Xiaoping Zhu, Andrew McCallum Jan 2007

Andrew McCallum

In this paper, we study semi-supervised learning using hybrid generative/discriminative methods. Specifically, we compare two recently proposed frameworks for combining generative and discriminative classifiers and apply them to semi-supervised classification. In both cases we explore the tradeoff between maximizing a discriminative likelihood of labeled data and a generative likelihood of unlabeled data. While prominent semi-supervised learning methods assume low density regions between classes or are subject to generative modeling assumptions, hybrid generative/discriminative methods allow semi-supervised learning in the presence of strongly overlapping classes and reduce the risk of modeling structure in the unlabeled data that is irrelevant for the specific …
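
Written out generically (an illustrative form, not the specific multi-conditional objectives compared in the paper), the labeled/unlabeled tradeoff described above is

    \max_{\theta}\; \sum_{(x,y)\in \mathcal{D}_L} \log p_\theta(y \mid x)
        \;+\; \lambda \sum_{x\in \mathcal{D}_U} \log p_\theta(x),
    \qquad \lambda \ge 0,

where the first term is the discriminative likelihood of the labeled set, the second is the generative likelihood of the unlabeled set, and lambda controls the tradeoff.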


Improving Author Coreference By Resource-Bounded Information Gathering From The Web, Pallika Kanani, Andrew McCallum, Chris Pal Jan 2007

Andrew McCallum

Accurate entity resolution is sometimes impossible simply due to insufficient information. For example, in research paper author name resolution, even clever use of venue, title, and co-authorship relations is often not enough to make a confident coreference decision. This paper presents several methods for increasing accuracy by gathering and integrating additional evidence from the web. We formulate the coreference problem as one of graph partitioning with discriminatively-trained edge weights, and then incorporate web information either as additional features or as additional nodes in the graph. Since the web is too large to incorporate all its data, we need an efficient …
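
The graph-partitioning formulation mentioned here can be written compactly (illustrative notation; the feature function f and weight vector Lambda are stand-ins, not the paper's exact symbols):

    w_{ij} = \Lambda^{\top} f(x_i, x_j), \qquad
    \hat{C} = \arg\max_{C} \sum_{i, j \,:\, C(i) = C(j)} w_{ij}

Edge weights may be negative, so grouping dissimilar author mentions is penalized; web evidence enters either as extra components of f or as extra nodes x_k in the graph.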


Topical N-Grams: Phrase And Topic Discovery, With An Application To Information Retrieval, Xuerui Wang, Andrew McCallum, Xing Wei Jan 2007

Andrew McCallum

Most topic models, such as latent Dirichlet allocation, rely on the bag of words assumption. However, word order and phrases are often critical to capturing the meaning of text. This paper presents Topical N-grams, a topic model that discovers topics as well as the individual words and phrases that define their meaning. The probabilistic model generates words in their textual order by, for each word, first sampling a topic, then sampling its status as a unigram or bigram, then sampling the word from a topic-specific unigram or bigram distribution. Thus our model can represent that the phrase ``white house'' has …
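
The generative story in the abstract translates almost line for line into code. The following toy sketch is illustrative only: parameters are placeholders, and it uses a constant phrase probability gamma where the full model conditions the bigram indicator on the previous word and topic.

    import numpy as np

    def generate_tng_document(length, phi_uni, phi_bi, theta, gamma, rng=None):
        """Toy sketch of the Topical N-grams generative process: for each
        position, sample a topic z from the document mixture theta, sample a
        bigram indicator (does this word continue a phrase?), then sample the
        word from the topic's unigram distribution phi_uni[z] or from the
        bigram distribution phi_bi[z][prev] conditioned on the previous word."""
        rng = rng or np.random.default_rng()
        words, prev = [], None
        for _ in range(length):
            z = rng.choice(len(theta), p=theta)                   # topic
            bigram = prev is not None and rng.random() < gamma    # phrase status
            if bigram:
                w = int(rng.choice(len(phi_bi[z][prev]), p=phi_bi[z][prev]))
            else:
                w = int(rng.choice(len(phi_uni[z]), p=phi_uni[z]))
            words.append(w)
            prev = w
        return words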


Lightly-Supervised Attribute Extraction, Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, Mark Dredze Jan 2007

Andrew McCallum

Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries.


Generalized Component Analysis For Text With Heterogeneous Attributes, Xuerui Wang, Chris Pal, Andrew McCallum Jan 2007

Andrew McCallum

We present a class of richly structured, undirected hidden variable models suitable for simultaneously modeling text along with other attributes encoded in different modalities. Our model generalizes techniques such as Principal Component Analysis to heterogeneous data types. In contrast to other approaches, this framework allows modalities such as words, authors and timestamps to be captured in their natural, probabilistic encodings. We demonstrate the effectiveness of our framework on the task of author prediction from 13 years of the NIPS conference proceedings and for a recipient prediction task using a 10-month academic email archive of a researcher. Our approach should be …


Penn/UMass/CHOP BioCreative II Systems, Kuzman Ganchev, Koby Crammer, Fernando Pereira, Gideon Mann, Kedar Bellare, Andrew McCallum, Steve Carroll, Yang Jin, Peter White Jan 2007

Andrew McCallum

Our team participated in the entity tagging and normalization tasks of Biocreative II. For the entity tagging task, we used a k-best MIRA learning algorithm with lexicons and automatically derived word clusters. MIRA accommodates different training loss functions, which allowed us to exploit gene alternatives in training. We also performed a greedy search over feature templates and the development data, achieving a final F-measure of 86.28%. For the normalization task, we proposed a new specialized on-line learning algorithm and applied it for filtering out false positives from a high recall list of candidates. For normalization we received an F-measure of …
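
For reference, the standard k-best MIRA update that the entity-tagging system relies on (stated here from the general literature, not quoted from the paper) solves, at each training example (x, y),

    \mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \lVert \mathbf{w} - \mathbf{w}_t \rVert^2
    \quad \text{s.t.}\quad
    \mathbf{w}^{\top} f(x, y) - \mathbf{w}^{\top} f(x, \hat{y}) \ge L(y, \hat{y})
    \quad \text{for each } \hat{y} \in \text{best}_k(x; \mathbf{w}_t),

where L(y, y-hat) is a task-specific loss; the freedom to choose L is what allows gene-name alternatives to be exploited during training, as the abstract notes.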


Efficient Strategies For Improving Partitioning-Based Author Coreference By Incorporating Web Pages As Graph Nodes, Pallika Kanani, Andrew McCallum Jan 2007

Andrew McCallum

Entity resolution in the research paper domain is an important but difficult problem. It suffers from insufficient contextual information; hence, using information from the web significantly improves performance. We formulate the author coreference problem as one of graph partitioning with discriminatively-trained edge weights. Building on our previous work, we present improved and more comprehensive results for the method in which we incorporate web documents as additional nodes in the graph. We also propose efficient strategies to select a subset of nodes to add to the graph and to select a subset of queries to gather additional nodes, without significant loss …


Mixtures Of Hierarchical Topics With Pachinko Allocation, David Mimno, Wei Li, Andrew McCallum Jan 2007

Andrew McCallum

The four-level Pachinko Allocation model (PAM) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents Hierarchical PAM---an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically-discovered topics and human-generated categories such as journals …


Cryptogram Decoding For OCR Using Numerization Strings, Gary Huang, Erik Learned-Miller, Andrew McCallum Jan 2007

Andrew McCallum

OCR systems for printed documents typically require large numbers of font styles and character models to work well. When given an unseen font, performance degrades even in the absence of noise. In this paper, we perform OCR in an unsupervised fashion without using any character models by using a cryptogram decoding algorithm. We present results on real and artificial OCR data.


Organizing The OCA: Learning Faceted Subjects From A Library Of Digital Books, David Mimno, Andrew McCallum Jan 2007

Andrew McCallum

Large scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent ``topics'' that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model …
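
For reference, the Dirichlet Compound Multinomial (multivariate Polya) likelihood that DCM-LDA builds on assigns a word-count vector x with n total tokens the probability (standard form, not quoted from the paper):

    p(\mathbf{x} \mid \boldsymbol{\alpha}) =
      \frac{n!}{\prod_w x_w!}\;
      \frac{\Gamma\!\left(\sum_w \alpha_w\right)}{\Gamma\!\left(n + \sum_w \alpha_w\right)}\;
      \prod_w \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)}

Because this marginalizes out a per-document multinomial, it is often cited as capturing word burstiness while keeping inference cheap enough for very large corpora.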


Efficient Computation Of Entropy Gradient For Semi-Supervised Conditional Random Fields, Gideon S. Mann, Andrew McCallum Jan 2007

Andrew McCallum

Entropy regularization is a straightforward and successful method of semi-supervised learning that augments the traditional conditional likelihood objective function with an additional term that aims to minimize the predicted label entropy on unlabeled data. It has previously been demonstrated to provide positive results in linear-chain CRFs, but the published method for calculating the entropy gradient requires significantly more computation than supervised CRF training. This paper presents a new derivation and dynamic program for calculating the entropy gradient that is significantly more efficient---having the same asymptotic time complexity as supervised CRF training. We also present efficient generalizations of this method for …
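
Concretely, the entropy-regularized objective described above has the form (a generic statement; the paper's contribution is the efficient gradient computation, not this objective):

    O(\theta) = \sum_{(x,y)\in \mathcal{L}} \log p_\theta(y \mid x)
      \;-\; \lambda \sum_{x \in \mathcal{U}} H\!\left(p_\theta(\cdot \mid x)\right),
    \qquad
    H\!\left(p_\theta(\cdot \mid x)\right) = -\sum_{y} p_\theta(y \mid x)\,\log p_\theta(y \mid x),

where the sum over label sequences y in the entropy term is what makes a naive gradient computation expensive for linear-chain CRFs.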


Canonicalization Of Database Records Using Adaptive Similarity Measures, Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum Jan 2007

Andrew McCallum

It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as canonicalization. Despite its importance, there is very little existing work on canonicalization. …


Improved Dynamic Schedules For Belief Propagation, Charles Sutton, Andrew McCallum Jan 2007

Andrew McCallum

Belief propagation and its variants are popular methods for approximate inference, but their running time and even their convergence depend greatly on the schedule used to send the messages. Recently, dynamic update schedules have been shown to converge much faster on hard networks than static schedules, namely the residual BP schedule of Elidan et al. [2006]. But that RBP algorithm wastes message updates: many messages are computed solely to determine their priority, and are never actually performed. In this paper, we show that estimating the residual, rather than calculating it directly, leads to significant decreases in the number of messages …
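
The scheduling skeleton that residual BP and an estimated-residual variant share looks roughly like the Python sketch below. It is illustrative only: the residual computation is left abstract, which is exactly where the two methods differ.

    import heapq
    import itertools
    from collections import defaultdict

    def dynamic_bp_schedule(messages, compute_residual, send, neighbors,
                            max_updates=10000, tol=1e-6):
        """Skeleton of a residual-style dynamic schedule for belief propagation
        (a sketch, not the paper's algorithm).  `messages` is an iterable of
        message ids, `compute_residual(m)` estimates how much message m would
        change if recomputed, `send(m)` performs the update, and `neighbors(m)`
        returns the messages whose priorities depend on m."""
        version = defaultdict(int)      # invalidates outdated heap entries
        order = itertools.count()       # tie-breaker so the heap never compares ids
        heap = [(-compute_residual(m), next(order), m, 0) for m in messages]
        heapq.heapify(heap)
        for _ in range(max_updates):
            while heap:
                neg_r, _, m, v = heapq.heappop(heap)
                if v == version[m]:
                    break               # entry is still current
            else:
                return
            if -neg_r < tol:
                return                  # largest residual is negligible: converged
            send(m)                     # perform the highest-priority update
            version[m] += 1             # its own residual is now (approximately) zero
            for n in neighbors(m):      # re-prioritize messages that depend on m
                version[n] += 1
                heapq.heappush(heap, (-compute_residual(n), next(order), n, version[n]))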


Learning Extractors From Unlabeled Text Using Relevant Databases, Kedar Bellare, Andrew McCallum Jan 2007

Andrew McCallum

Supervised machine learning algorithms for information extraction generally require large amounts of training data. In many cases where labeling training data is burdensome, there may, however, already exist an incomplete database relevant to the task at hand. Records from this database can be used to label text strings that express the same information. For tasks where text strings do not follow the same format or layout, and additionally may contain extra information, labeling the strings completely may be problematic. This paper presents a method for training extractors which fill in missing labels of a text sequence that is partially labeled …


Nonparametric Bayes Pachinko Allocation, Wei Li, David Blei, Andrew McCallum Jan 2007

Andrew McCallum

Recent advances in topic models have explored complicated structured distributions to represent topic correlation. For example, the pachinko allocation model (PAM) captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). While PAM provides more flexibility and greater expressive power than previous models like latent Dirichlet allocation (LDA), it is also more difficult to determine the appropriate topic structure for a specific dataset. In this paper, we propose a …


Leveraging Existing Resources Using Generalized Expectation Criteria, Gregory Druck, Gideon Mann, Andrew McCallum Jan 2007

Andrew McCallum

It is difficult to apply machine learning to many real-world tasks because there are no existing labeled instances. In one solution to this problem, a human expert provides instance labels that are used in traditional supervised or semi-supervised training. Instead, we want a solution that allows us to leverage existing resources other than complete labeled instances. We propose the use of generalized expectation (GE) criteria to achieve this goal. A GE criterion is a term in a training objective function that assigns a score to values of a model expectation. In this paper, the expectations are model predicted class distributions …
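
One simple instance of a GE criterion, written out for concreteness (an assumed form, not necessarily the exact criterion used in the paper): given a target class distribution supplied by prior knowledge rather than labeled instances, add to the objective a term

    G(\theta) = -\lambda\, D\!\left(\tilde{p} \,\Big\|\, \frac{1}{|U|} \sum_{x \in U} p_\theta(y \mid x)\right),

which scores how closely the model's expected class distribution over the unlabeled set U matches the target distribution; D can be, for example, a KL divergence.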