Modern social and technological trends have resulted in an enormous increase in the amount of accessible data, a significant portion of which is massive and carries inherent imprecision and uncertainty. Such data, often referred to as “Big Data,” typically holds substantial social and/or business value if made amenable to computing machinery. Towards that goal, Computer Science has made significant progress over the years. The Artificial Intelligence (AI) community has established principled and well-practiced methodologies to model, explore, and learn realistic knowledge within imprecise and uncertain environments. The Database (DB) community has established fundamental concepts and machinery for the management of data, with a focus on scaling to high volumes and intensive access. Despite their complementary focuses, these two communities share key features and often tackle similar challenges. Yet until recently, research in the two has progressed independently, with limited collaboration. It is thus evident that the time has come to integrate their efforts towards unified concepts and methodologies that share the benefits of both worlds.
This workshop aims to bring together researchers and practitioners from the AI community and the DB community, in order to highlight relevant state-of-the-art research and to identify opportunities for collaboration and mutual enhancement. The organizers believe that research achievements from both communities provide significant ground to benefit from such an assembly. Examples include AI concepts such as Markov Logic, lifted inference, and learning/mining of relational data, and DB concepts such as probabilistic databases, query optimization, and descriptive complexity. Through the synergy between the two communities, the hope is that this workshop will serve as a stepping stone towards the realization of the value in Big Data.
- Dan Olteanu: Probabilistic Databases are Dead, Long Live Probabilistic Databases!
In this talk I will give a brief account of probabilistic databases, with a strong bias towards the SPROUT project at Oxford. After a fruitful decade that brought a good understanding of probabilistic data models and efficient query evaluation, our community's attention to this area is diminishing. Nevertheless, there is new excitement on the horizon! It comes in several flavors, some of which I will highlight in the talk: cross-fertilization with programming-languages work on probabilistic programming for expressing computation beyond queries, and with AI inference engines for probabilistic data integration; also, the availability of ever-larger probabilistic Web data repositories such as Google Knowledge Vault.
Dan Olteanu is an Associate Professor in the Department of Computer Science at the University of Oxford and a Fellow of St Cross College. He is currently spending his sabbatical as a Computer Scientist at LogicBlox and as a Visiting Associate Professor at UC Berkeley. His research interests are in databases; he has contributed to XML query processing, incomplete information and probabilistic databases, and more recently to factorized databases and an industry-strength Datalog engine. He is a co-author of the book "Probabilistic Databases" (2011). Olteanu has served on over 50 programme committees of international venues, as associate editor for PVLDB'13 and, since 2013, for IEEE TKDE, as PC chair for BNCOD'13, and will serve as track chair for ICDE'15 and group leader for SIGMOD'15.
- Guy Van den Broeck: Lifted Inference in Statistical Relational Models (mini-tutorial) [pdf]
Statistical relational models combine aspects of first-order logic, databases and probabilistic graphical models, enabling them to represent complex logical and probabilistic relations between large numbers of objects. But this level of expressivity comes at a price: inference (i.e., drawing conclusions on the probability of events) becomes highly intractable. Nevertheless, relational models of real-life applications often exhibit a high level of symmetry (i.e., substructures that are modeled in a similar manner). Lifted inference is the art of exploiting that symmetry towards efficient inference. The first part of this tutorial describes the basic ideas underlying lifted inference algorithms, why they work, and how they are fundamentally different from other probabilistic reasoning algorithms. The second part is a brief overview of theory and applications. Lifted inference theory looks at tractable classes of models, and at different notions of symmetry and exchangeability. Practical applications of lifted inference are in approximate reasoning and machine learning.
Guy Van den Broeck obtained his Ph.D. in Computer Science from KU Leuven, Belgium, in 2013. He was a postdoctoral researcher in the automated reasoning lab at the University of California, Los Angeles in 2014. He currently works as a postdoctoral researcher at KU Leuven and is supported by the Research Foundation-Flanders. His research interests are in artificial intelligence, machine learning, logical and probabilistic automated reasoning and statistical relational learning. His work was awarded the Alcatel-Lucent Innovation Award in 2009 and the ILP Turing Theory Prize (best student paper) in 2011. He is co-organizing the 4th International Workshop on Statistical Relational AI (StarAI) at AAAI 2014 and has served on the (senior) program committee of top-level conferences, including IJCAI, AAAI, ECML/PKDD, ECAI, KR, ILP, and as a reviewer for AI Journal, Machine Learning Journal, and Journal of Machine Learning Research.
- Lise Getoor: Unifying AI & DB Perspectives on Structured Probabilistic Models
Machine learning and database approaches to structured probabilistic models share many commonalities, yet exhibit certain important differences. Machine learning methods focus on learning probabilistic models from (certain) data and efficient inference, whereas probabilistic database approaches focus on storing and efficiently querying uncertain data. Nonetheless, the structured probabilistic models that both use are often (almost) identical. In this talk, I'll present an introduction to both perspectives, and then describe some of our recent work that attempts to bridge the gap.
Lise Getoor is a professor in the Computer Science Department at UC Santa Cruz. Her primary research interests are in machine learning and reasoning with uncertainty, applied to graphs and structured data. She has eight best paper and best student paper awards, an NSF Career Award, and is an Association for the Advancement of Artificial Intelligence (AAAI) Fellow.
Sunday, June 22nd, 2014, 1:30 pm.
- Silviu Maniu, Reynold Cheng and Pierre Senellart: ProbTree: A Query-Efficient Representation of Probabilistic Graphs
Information in many applications, such as mobile wireless systems, social networks, and road networks, is captured by graphs that are, in many cases, uncertain. We study the problem of querying a probabilistic graph; in particular, we examine “source-to-target” (ST) queries, such as computing the shortest path between two vertices. Evaluating ST-queries over probabilistic graphs is #P-hard, as it requires examining an exponential number of “possible worlds”. Existing solutions to the ST-query problem, which sample possible worlds, have two downsides: (i) many samples are needed for reasonable accuracy, and (ii) a possible world can be very large. To tackle these issues, we study the ProbTree, a data structure that stores a succinct representation of the probabilistic graph. Existing ST-query solutions are executed on top of this structure, with the number of samples and the possible-world sizes reduced.
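As a rough illustration of the possible-worlds semantics this line of work builds on (this is not ProbTree itself, and the graph and edge probabilities below are invented for the example), the naive approach that ProbTree improves upon estimates an ST-query by sampling worlds:

```python
import random

# Toy probabilistic graph: each undirected edge exists independently
# with the given probability (hypothetical data for illustration).
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3}

def reachable(present, s, t):
    """Check s-to-t reachability in one sampled possible world."""
    adj = {}
    for (u, v) in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def estimate_st_reliability(edges, s, t, samples=20000, seed=0):
    """Monte Carlo estimate of P(t is reachable from s): sample a
    possible world per trial, count how often the query holds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = [e for e, p in edges.items() if rng.random() < p]
        if reachable(world, s, t):
            hits += 1
    return hits / samples

# For this tiny graph the exact answer is 0.3 + 0.7 * (0.9 * 0.8) = 0.804,
# so the estimate should land close to 0.80.
estimate = estimate_st_reliability(edges, "a", "c")
```

Exact evaluation would require all 2^|E| worlds, which is exactly the #P-hardness the abstract refers to; sampling trades that for the accuracy and world-size issues the paper addresses.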
- Sushovan De, Yuheng Hu, Yi Chen and Subbarao Kambhampati: BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured Big Data
Recent efforts in data cleaning have focused exclusively on problems like data deduplication, record matching, and data standardization; none of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
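To convey the flavor of this generative approach (a toy sketch, not the BayesWipe system: the column data, the prior estimated from the noisy data itself, and the edit-based error model are all invented for illustration), one can score each candidate value by prior probability times an error-model likelihood:

```python
from collections import Counter

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Invented noisy column; the prior over values is estimated from the
# data itself, so no domain expert or master data is needed.
column = ["seattle", "seattle", "boston", "seattle", "boston", "seatle"]
prior = Counter(column)

def error_model(observed, v, eps=0.5):
    """Crude error model (an assumption): each edit costs a factor eps."""
    return eps ** edit_distance(observed, v)

def correct(observed):
    """Pick the candidate v maximizing P(v) * P(observed | v)."""
    return max(set(column),
               key=lambda v: (prior[v] / len(column)) * error_model(observed, v))
```

With these made-up numbers, `correct("seatle")` prefers "seattle": the higher prior outweighs the one-edit penalty, even though "seatle" itself appears in the data.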
- Eric Gribkoff, Guy Van den Broeck and Dan Suciu: The Most Probable Database Problem
This paper proposes a novel inference task for probabilistic databases: the most probable database (MPD) problem. The MPD is the most probable deterministic database where a given query or constraint is true. We highlight two distinctive applications, in database repair of key and dependency constraints, and in finding most probable explanations in statistical relational learning. The MPD problem raises new theoretical questions, such as the possibility of a dichotomy theorem for MPD, classifying queries as being either PTIME or NP-hard. We show that such a dichotomy would diverge from dichotomies for other inference tasks. We then prove a dichotomy for queries that represent unary functional dependency constraints. Finally, we discuss symmetric probabilities and the opportunities for lifted inference.
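To make the MPD task concrete, here is a brute-force sketch over a tiny tuple-independent database with a key constraint (the data is hypothetical; the point of the paper is precisely to characterize when one can avoid this exponential search):

```python
from itertools import product

# Tuple-independent toy database: each tuple is present independently
# with the given probability (invented data for illustration).
tuples = [("alice", "paris"), ("alice", "rome"), ("bob", "rome")]
probs = [0.6, 0.7, 0.5]

def satisfies_key(world):
    """Key constraint: each person maps to at most one city."""
    people = [t[0] for t in world]
    return len(people) == len(set(people))

def most_probable_database(tuples, probs, constraint):
    """Brute-force MPD: score every deterministic world and keep the
    most probable one satisfying the constraint. Exponential in the
    number of tuples -- for intuition only."""
    best, best_p = None, -1.0
    for bits in product([0, 1], repeat=len(tuples)):
        p = 1.0
        for b, q in zip(bits, probs):
            p *= q if b else 1 - q
        world = [t for b, t in zip(bits, tuples) if b]
        if constraint(world) and p > best_p:
            best, best_p = world, p
    return best, best_p

world, p = most_probable_database(tuples, probs, satisfies_key)
# Here the MPD keeps ("alice", "rome") -- dropping the less likely
# ("alice", "paris") repairs the key violation at the least cost.
```

This directly mirrors the database-repair application in the abstract: the MPD is the most probable deterministic repair of the key constraint.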
- Jose Picado, Arash Termehchy and Alan Fern: Schema Independence of Relational Learning Algorithms
Learning novel and useful relations between entities in structured data sets, such as relational databases, is an important problem with many applications in data management and machine learning. It is well established that the same data set may be represented under different schemas due to various reasons, such as efficiency, data quality, and usability. Further, the schema of a database may evolve over time. In this paper, we argue that relational learning algorithms should be schema independent, i.e., they should return basically the same results across various schemas of the same data set. Schema-independent relational learning algorithms require less manual tuning and are easier to use over real-world data sets. We formally define and explore this property for different types of relational learning algorithms. Our formalization provides a method to measure the amount of schema independence for a relational learning algorithm. We analyze the schema independence of some popular relational learning algorithms both theoretically and empirically. Our results indicate that these algorithms are not generally schema independent and will deliver different results and accuracies, or require different amounts of training data, over different schemas of the same data set.
- Leopoldo Bertossi and Babak Salimi: Unifying Causality, Diagnosis, Repairs and View-Updates in Databases
In this work we establish and point out connections between the notion of query-answer causality in databases and database repairs, model-based diagnosis in its consistency-based and abductive versions, and database updates through views. The mutual relationships among these areas of data management and knowledge representation shed light on each of them and help to share notions and results they have in common. In one way or another, these are all approaches to uncertainty management, which becomes even more relevant in the context of big data that have to be made sense of.
Highlights on Published Work:
- Serge Abiteboul and
Daniel Deutch: Managing Uncertainty and Conflicts
in a Distributed World
We consider a distributed setting where peers in a network exchange information and apply reasoning to derive further information. Uncertainty is common in such a setting. Peers may disagree and state or infer conflicting facts. Peers can settle conflicts by choosing between contradicting base or inferred facts, which introduces a first cause of uncertainty. Then there is the inherent uncertainty introduced by the asynchronous environment: the order in which messages are sent and received, as well as the order in which reasoning steps are applied, are both uncertain. In this short paper, we consider the problem of modeling the dynamics of such networks, accounting for uncertainty. We briefly recall a proposal for the management of uncertainty, namely datalog-fd, and consider extending it to Webdamlog, a recently introduced distributed Datalog dialect. We mention preliminary results.
- Eric Gribkoff, Guy Van den Broeck and Dan Suciu: Understanding the Complexity of Lifted Inference and Asymmetric Weighted Model Counting
We highlight our work on lifted inference for the asymmetric Weighted First-Order Model Counting problem (WFOMC), which counts the assignments that satisfy a given sentence in first-order logic. This work is at the intersection of probabilistic databases and statistical relational learning. First, we discuss how adding negation can lower the query complexity, and describe the essential element (resolution) necessary to extend a previous algorithm for positive queries to handle queries with negation. Second, we highlight the difficulties with extending the dichotomy result for positive queries to include negation. We outline how we overcame these difficulties to establish a novel dichotomy for a non-trivial fragment of first-order CNF sentences with negation. Finally, we discuss directions for future work.
- Peter Haas and Chris Jermaine: MCDB and SimSQL: Scalable Stochastic Analysis within the Database
This short paper highlights published work on the MCDB database system as well as its successor, SimSQL. These database systems are designed from the ground up to support stochastic analytics; that is, data analysis performed with the aid of stochastic simulations.
- Mathias Niepert and Guy Van den Broeck: Tractability through Exchangeability: A New Perspective on Efficient Probabilistic Inference
Exchangeability is a central notion in statistics and probability theory. The assumption that an infinite sequence of data points is exchangeable is at the core of Bayesian statistics. However, finite exchangeability as a statistical property that renders probabilistic inference tractable is less well-understood. We develop a theory of finite exchangeability and its relation to tractable probabilistic inference. The theory is complementary to that of independence and conditional independence. We show that tractable inference in probabilistic models with high treewidth and millions of variables can be explained with the notion of finite (partial) exchangeability. We also show that existing lifted inference algorithms implicitly utilize a combination of conditional independence and partial exchangeability.
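The tractability claim can be illustrated with a minimal sketch, under the assumption of full exchangeability: when the joint weight of an assignment to n binary variables depends only on the number k of ones, one can sum over the n+1 counts, weighted by binomial coefficients, instead of enumerating 2^n assignments. The weight function below is invented for the example.

```python
from math import comb

# Fully exchangeable toy model over n binary variables: the
# (unnormalized) weight of an assignment depends only on k, the
# number of ones (any function of k alone keeps it exchangeable).
n = 20
def weight(k):
    return 2.0 ** k  # hypothetical weight function for illustration

# Lifted computation: group the 2^n assignments into n + 1 classes
# by their count of ones; each class has comb(n, k) equal-weight
# members, so the partition function takes O(n) terms, not 2^n.
Z = sum(comb(n, k) * weight(k) for k in range(n + 1))

# Marginal query in the same O(n) style: probability that at least
# half of the variables are 1.
p_at_least_half = sum(comb(n, k) * weight(k)
                      for k in range(n // 2, n + 1)) / Z
```

With this particular weight function the partition function has a closed form, Z = (1 + 2)^n by the binomial theorem, which gives a handy sanity check on the counting argument.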
- Dan Olteanu and Sebastiaan van Schaik: ENFrame = (Programs + Queries) / Probabilistic Data
This paper overviews ENFrame, a framework for processing probabilistic data. In addition to relational query processing supported by existing probabilistic database management systems, ENFrame allows programming with loops, assignments, list comprehension, and aggregates to encode complex tasks such as clustering and classification of data retrieved via queries from probabilistic databases. We explain the design choices behind ENFrame, some distilled from the wealth of work on probabilistic databases and some new. We also highlight a few challenges lying ahead.
- Anja Gruenheid, Besmira Nushi and Donald Kossmann: Cost-Efficient Querying Strategies for the Crowd
Crowdsourcing is a mechanism that has evolved recently to enhance data processing, and it has been picked up by companies that refine large amounts of data through so-called crowd workers. Generally, crowdsourcing faces two challenges: first, the crowd is an uncertain input source, because a worker may not know the right answer or may simply answer wrongly; second, crowdsourcing comes at a monetary cost. As a result, finding ways to minimize the number of times the crowd is accessed is essential to making crowdsourcing a realistic option in the context of big data. To address these problems, crowd querying strategies have been devised that determine the task order and the budget spent per worker task. Existing querying strategies focus either on improving certainty in the current result or on optimizing the information gain by resolving the most uncertain parts of the dataset first. In this work, we first compare and evaluate both strategies, after which we combine their core ideas into a hybrid querying strategy. We then show experimentally that it is a robust and cost-efficient alternative for querying the crowd.
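As a toy illustration of budget-aware querying (this is not the paper's hybrid strategy; the independent worker-accuracy model and all parameters are assumptions), a sequential strategy can stop asking as soon as the posterior is confident enough, so that easy items cost few answers:

```python
def query_until_confident(ask, p_correct=0.8, threshold=0.95, max_asks=20):
    """Sequentially query workers about a yes/no task until the posterior
    probability of one answer exceeds `threshold`, or the budget runs out.

    `ask()` returns one worker's boolean answer; workers are assumed to
    answer correctly, independently, with probability `p_correct`.
    Returns (decided_answer, number_of_answers_paid_for).
    """
    odds = 1.0                         # prior odds of "True" vs "False"
    lr = p_correct / (1 - p_correct)   # likelihood ratio of one answer
    for n in range(1, max_asks + 1):
        odds *= lr if ask() else 1 / lr
        p_true = odds / (1 + odds)
        if p_true >= threshold or p_true <= 1 - threshold:
            return p_true >= threshold, n
    return odds >= 1.0, max_asks       # budget exhausted: majority call

# With p_correct = 0.8 and unanimous workers, three answers suffice:
answer, cost = query_until_confident(lambda: True)
# answer is True, cost is 3 (posterior 64/65 clears the 0.95 threshold)
```

A fixed-budget majority vote would spend the full budget on every item; the stopping rule spends more only where workers disagree, which is the cost-versus-certainty trade-off the abstract describes.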
Organizers:
- Kristian Kersting (Technical University of Dortmund, Fraunhofer IAIS)
- Benny Kimelfeld (LogicBlox)
Publicity and General Arrangements:
- André Hernich (University of Liverpool, UK)
Program Committee:
- Serge Abiteboul (INRIA-Saclay)
- Foto Afrati (National Technical University of Athens)
- Bogdan Cautis (University of Paris-Sud)
- Sara Cohen (The Hebrew University)
- James Cussens (University of York)
- Jesse Davis (KU Leuven)
- Luc De Raedt (Katholieke Universiteit Leuven)
- Amol Deshpande (University of Maryland at College Park)
- Daniel Deutch (Tel Aviv University)
- Wenfei Fan (University of Edinburgh)
- Rainer Gemulla (MPI)
- Vibhav Gogate (The University of Texas at Dallas)
- Fabian Hadiji (Technical University of Dortmund)
- Christopher Jermaine (Rice University)
- Kristian Kersting (Technical University of Dortmund, Fraunhofer IAIS)
- Roni Khardon (Tufts University)
- Benny Kimelfeld (LogicBlox)
- Daniel Lowd (University of Oregon)
- Martin Mladenov (Technical University of Dortmund, Fraunhofer IAIS)
- Sriraam Natarajan (Indiana University)
- Mathias Niepert (University of Washington)
- Dan Olteanu (Oxford University)
- Hoifung Poon (Microsoft Research)
- Christopher Ré (Stanford University)
- Oliver Schulte (Simon Fraser University)
- Pierre Senellart (Institut Mines–Télécom; Télécom ParisTech; CNRS LTCI)
- Dan Suciu (University of Washington)
- Guy Van den Broeck (UCLA)
- Maurice Van Keulen (University of Twente)
Important Dates:
- Submission: March 28th, 2014.
- Notification: April 30th, 2014.
- Submission of late-breaking papers: April 23rd, 2014.
- Notification for late-breaking papers: May 4th, 2014.
- Camera-ready submission: May 14th, 2014.
- Workshop: June 22nd, 2014.
Those interested in attending the workshop can submit papers of the following kinds, relating to the topic of the workshop as described above.
- A technical paper (ACM SIG style, 6 pages maximum including references). A paper of this kind should include novel, previously unpublished technical results. Accepted technical papers will be presented in both a spotlight talk and a poster during the workshop.
- A position, outrageous-idea or vision paper (ACM SIG style, 2 pages maximum). Accepted papers of this kind will also be presented in both a spotlight talk and a poster during the workshop.
- Highlights on published work (ACM SIG style, 2 pages maximum including references). Papers of this kind outline work by the authors that has already been published in other venues, and highlight the relevant future directions. The purpose of these papers is to publicise the authors' work. Accepted papers of this kind will be presented only in a poster during the workshop.
- A late-breaking paper (ACM SIG style, 3 pages maximum including references). Papers of this kind can be submitted later than the other kinds of papers (as specified below). The purpose of late-breaking papers is to provide workshop attendees with information about research that was initiated, enhanced, improved, or completed after the ordinary submission deadline. Accepted papers of this kind will be presented only in a poster during the workshop.
Submissions should state their type within the paper (e.g., in a title footnote). Submitted papers should be in PDF format and submitted via EasyChair. All submitted papers will be carefully peer-reviewed by multiple reviewers. The organizers will publish the accepted papers only on the workshop's website; these papers will not count as a formal publication.