Modern social and technological trends have resulted in an enormous increase in the amount of accessible data, a significant portion of which is massive and carries inherent imprecision and uncertainty. Such data, often referred to as “Big Data,” typically holds substantial social and/or business value if made amenable to computing machinery. Towards that goal, Computer Science has made significant progress over the years. The Artificial Intelligence (AI) community has established principled and well-practiced methodologies to model, explore, and learn realistic knowledge within imprecise and uncertain environments. The Database (DB) community has established fundamental concepts and machinery for the management of data, with a focus on scaling to high volumes and intensive access. Despite their complementary focuses, these two communities share key features and often tackle similar challenges. Yet until recently, research in the two has progressed independently, with limited collaboration. It is thus evident that the time has come to integrate their efforts towards unified concepts and methodologies that share the benefits of both worlds.
This workshop aims to bring together researchers and practitioners from the AI community and the DB community, in order to highlight relevant state-of-the-art research and to identify opportunities for collaboration and mutual enhancement. The organizers believe that research achievements from both communities provide significant ground to benefit from such an assembly. Examples include AI concepts such as Markov Logic, lifted inference, and learning/mining of relational data, and DB concepts such as probabilistic databases, query optimization, and descriptive complexity. Through the synergy between the two communities, the hope is that this workshop will serve as a stepping stone towards the realization of the value in Big Data.
- Dan Olteanu: Probabilistic Databases are Dead, Long Live Probabilistic Databases!
In this talk I will give a brief account of probabilistic databases, with a strong bias towards the SPROUT project at Oxford. After a fruitful decade that brought a good understanding of probabilistic data models and efficient query evaluation, our community's attention to this area is diminishing. Nevertheless, there is new excitement on the horizon! It comes in several flavors, some of which I will highlight in the talk: cross-fertilization with programming-languages work on probabilistic programming for expressing computation beyond queries, and with AI inference engines for probabilistic data integration; also, the availability of ever-larger probabilistic Web data repositories such as Google Knowledge Vault.
Dan Olteanu is an Associate Professor in the Department of Computer Science at the University of Oxford and a Fellow of St Cross College. He is currently spending his sabbatical as a Computer Scientist at LogicBlox and as a Visiting Associate Professor at UC Berkeley. His research interests are in databases; he has contributed to XML query processing, incomplete information and probabilistic databases, and more recently to factorized databases and an industry-strength Datalog engine. He is a co-author of the book "Probabilistic Databases" (2011). Olteanu has served on over 50 programme committees of international venues, as associate editor for PVLDB'13 and, since 2013, for IEEE TKDE, as PC chair for BNCOD'13, and will serve as track chair for ICDE'15 and group leader for SIGMOD'15.
- Guy Van den Broeck: Lifted Inference in Statistical Relational Models (mini-tutorial) [pdf]
Statistical relational models combine aspects of first-order logic, databases and probabilistic graphical models, enabling them to represent complex logical and probabilistic relations between large numbers of objects. But this level of expressivity comes at a price: inference (i.e., drawing conclusions on the probability of events) becomes highly intractable. Nevertheless, relational models of real-life applications often exhibit a high level of symmetry (i.e., substructures that are modeled in a similar manner). Lifted inference is the art of exploiting that symmetry towards efficient inference. The first part of this tutorial describes the basic ideas underlying lifted inference algorithms, why they work, and how they are fundamentally different from other probabilistic reasoning algorithms. The second part is a brief overview of theory and applications. Lifted inference theory looks at tractable classes of models, and at different notions of symmetry and exchangeability. Practical applications of lifted inference are in approximate reasoning and machine learning.
Guy Van den Broeck obtained his Ph.D. in Computer Science from KU Leuven, Belgium, in 2013. He was a postdoctoral researcher in the automated reasoning lab at the University of California, Los Angeles in 2014. He currently works as a postdoctoral researcher at KU Leuven and is supported by the Research Foundation-Flanders. His research interests are in artificial intelligence, machine learning, logical and probabilistic automated reasoning and statistical relational learning. His work was awarded the Alcatel-Lucent Innovation Award in 2009 and the ILP Turing Theory Prize (best student paper) in 2011. He is co-organizing the 4th International Workshop on Statistical Relational AI (StarAI) at AAAI 2014 and has served on the (senior) program committee of top-level conferences, including IJCAI, AAAI, ECML/PKDD, ECAI, KR, ILP, and as a reviewer for AI Journal, Machine Learning Journal, and Journal of Machine Learning Research.
- Lise Getoor: Unifying AI & DB Perspectives on Structured Probabilistic Models
Machine learning and database approaches to structured probabilistic models share many commonalities, yet exhibit certain important differences. Machine learning methods focus on learning probabilistic models from (certain) data and efficient inference, whereas probabilistic database approaches focus on storing and efficiently querying uncertain data. Nonetheless, the structured probabilistic models that both use are often (almost) identical. In this talk, I'll present an introduction to both perspectives, and then describe some of our recent work that attempts to bridge the gap.
Lise Getoor is a professor in the Computer Science Department at UC Santa Cruz. Her primary research interests are in machine learning and reasoning with uncertainty, applied to graphs and structured data. She has eight best paper and best student paper awards, an NSF Career Award, and is an Association for the Advancement of Artificial Intelligence (AAAI) Fellow.
Sunday, June 22nd, 2014, 1:30 pm.
- Silviu Maniu, Reynold Cheng and Pierre Senellart: ProbTree: A Query-Efficient Representation of Probabilistic Graphs
Information in many applications, such as mobile wireless systems, social networks, and road networks, is captured by graphs that are, in many cases, uncertain. We study the problem of querying a probabilistic graph; in particular, we examine “source-to-target” (ST) queries, such as computing the shortest path between two vertices. Evaluating ST-queries over probabilistic graphs is #P-hard, as it requires examining an exponential number of “possible worlds”. Existing solutions to the ST-query problem, which sample possible worlds, have two downsides: (i) many samples are needed for reasonable accuracy, and (ii) a possible world can be very large. To tackle these issues, we study the ProbTree, a data structure that stores a succinct representation of the probabilistic graph. Existing ST-query solutions are executed on top of this structure, with the number of samples and the possible-world sizes reduced.
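As a rough illustration of the possible-worlds semantics this line of work builds on (this is not ProbTree itself, and the graph and edge probabilities below are invented for the example), the naive approach that ProbTree improves upon estimates an ST-query by sampling worlds:

```python
import random

# Toy probabilistic graph: each undirected edge exists independently
# with the given probability (hypothetical data for illustration).
edges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3}

def reachable(present, s, t):
    """Check s-to-t reachability in one sampled possible world."""
    adj = {}
    for (u, v) in present:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return False

def estimate_st_reliability(edges, s, t, samples=20000, seed=0):
    """Monte Carlo estimate of P(t is reachable from s): sample a
    possible world per trial, count how often the query holds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        world = [e for e, p in edges.items() if rng.random() < p]
        if reachable(world, s, t):
            hits += 1
    return hits / samples

# For this tiny graph the exact answer is 0.3 + 0.7 * (0.9 * 0.8) = 0.804,
# so the estimate should land close to 0.80.
estimate = estimate_st_reliability(edges, "a", "c")
```

Exact evaluation would require all 2^|E| worlds, which is exactly the #P-hardness the abstract refers to; sampling trades that for the accuracy and world-size issues the paper addresses.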
- Sushovan De, Yuheng Hu, Yi Chen and Subbarao Kambhampati: BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured Big Data
Recent efforts in data cleaning have focused exclusively on problems like data deduplication, record matching, and data standardization; none of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.
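To convey the flavor of this generative approach (a toy sketch, not the BayesWipe system: the column data, the prior estimated from the noisy data itself, and the edit-based error model are all invented for illustration), one can score each candidate value by prior probability times an error-model likelihood:

```python
from collections import Counter

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Invented noisy column; the prior over values is estimated from the
# data itself, so no domain expert or master data is needed.
column = ["seattle", "seattle", "boston", "seattle", "boston", "seatle"]
prior = Counter(column)

def error_model(observed, v, eps=0.5):
    """Crude error model (an assumption): each edit costs a factor eps."""
    return eps ** edit_distance(observed, v)

def correct(observed):
    """Pick the candidate v maximizing P(v) * P(observed | v)."""
    return max(set(column),
               key=lambda v: (prior[v] / len(column)) * error_model(observed, v))
```

With these made-up numbers, `correct("seatle")` prefers "seattle": the higher prior outweighs the one-edit penalty, even though "seatle" itself appears in the data.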
- Eric Gribkoff, Guy Van den Broeck and Dan Suciu: The Most Probable Database Problem
This paper proposes a novel inference task for probabilistic databases: the most probable database (MPD) problem. The MPD is the most probable deterministic database where a given query or constraint is true. We highlight two distinctive applications, in database repair of key and dependency constraints, and in finding most probable explanations in statistical relational learning. The MPD problem raises new theoretical questions, such as the possibility of a dichotomy theorem for MPD, classifying queries as being either PTIME or NP-hard. We show that such a dichotomy would diverge from dichotomies for other inference tasks. We then prove a dichotomy for queries that represent unary functional dependency constraints. Finally, we discuss symmetric probabilities and the opportunities for lifted inference.
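To make the MPD task concrete, here is a brute-force sketch over a tiny tuple-independent database with a key constraint (the data is hypothetical; the point of the paper is precisely to characterize when one can avoid this exponential search):

```python
from itertools import product

# Tuple-independent toy database: each tuple is present independently
# with the given probability (invented data for illustration).
tuples = [("alice", "paris"), ("alice", "rome"), ("bob", "rome")]
probs = [0.6, 0.7, 0.5]

def satisfies_key(world):
    """Key constraint: each person maps to at most one city."""
    people = [t[0] for t in world]
    return len(people) == len(set(people))

def most_probable_database(tuples, probs, constraint):
    """Brute-force MPD: score every deterministic world and keep the
    most probable one satisfying the constraint. Exponential in the
    number of tuples -- for intuition only."""
    best, best_p = None, -1.0
    for bits in product([0, 1], repeat=len(tuples)):
        p = 1.0
        for b, q in zip(bits, probs):
            p *= q if b else 1 - q
        world = [t for b, t in zip(bits, tuples) if b]
        if constraint(world) and p > best_p:
            best, best_p = world, p
    return best, best_p

world, p = most_probable_database(tuples, probs, satisfies_key)
# Here the MPD keeps ("alice", "rome") -- dropping the less likely
# ("alice", "paris") repairs the key violation at the least cost.
```

This directly mirrors the database-repair application in the abstract: the MPD is the most probable deterministic repair of the key constraint.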
- Jose Picado, Arash Termehchy and Alan Fern: Schema Independence of Relational Learning Algorithms
Learning novel and useful relations between entities in structured data sets, such as relational databases, is an important problem with many applications in data management and machine learning. It is well established that the same data set may be represented under different schemas due to various reasons, such as efficiency, data quality, and usability. Further, the schema of a database may evolve over time. In this paper, we argue that relational learning algorithms should be schema independent, i.e., they should return basically the same results across various schemas of the same data set. Schema-independent relational learning algorithms require less manual tuning and are easier to use over real-world data sets. We formally define and explore this property for different types of relational learning algorithms. Our formalization provides a method to measure the amount of schema independence for a relational learning algorithm. We analyze the schema independence of some popular relational learning algorithms both theoretically and empirically. Our results indicate that these algorithms are not generally schema independent and will deliver different results and accuracies, or require different amounts of training data, over different schemas of the same data set.
- Leopoldo Bertossi and Babak Salimi: Unifying Causality, Diagnosis, Repairs and View-Updates in Databases
In this work we establish and point out connections between the notion of query-answer causality in databases and database repairs, model-based diagnosis in its consistency-based and abductive versions, and database updates through views. The mutual relationships among these areas of data management and knowledge representation shed light on each of them and help to share notions and results they have in common. In one way or another, these are all approaches to uncertainty management, which becomes even more relevant in the context of big data that have to be made sense of.
Highlights on Published Work:
- Serge Abiteboul and
Daniel Deutch: Managing Uncertainty and Conflicts
in a Distributed World
We consider a distributed setting where peers in a network exchange information and apply reasoning to derive further information. Uncertainty is common in such a setting. Peers may disagree and state or infer conflicting facts. Peers can settle conflicts by choosing between contradicting base or inferred facts, which introduces a first cause of uncertainty. Then there is the inherent uncertainty introduced by the asynchronous environment: the order in which messages are sent and received, as well as the order in which reasoning steps are applied, are both uncertain. In this short paper, we consider the problem of modeling the dynamics of such networks, accounting for uncertainty. We briefly recall a proposal for the management of uncertainty, namely datalog-fd, and consider extending it to Webdamlog, a recently introduced distributed Datalog dialect. We mention preliminary results.
- Eric Gribkoff, Guy Van den Broeck and Dan Suciu: Understanding the Complexity of Lifted Inference and Asymmetric Weighted Model Counting
We highlight our work on lifted inference for the asymmetric Weighted First-Order Model Counting problem (WFOMC), which counts the assignments that satisfy a given sentence in first-order logic. This work is at the intersection of probabilistic databases and statistical relational learning. First, we discuss how adding negation can lower the query complexity, and describe the essential element (resolution) necessary to extend a previous algorithm for positive queries to handle queries with negation. Second, we highlight the difficulties with extending the dichotomy result for positive queries to include negation. We outline how we overcame these difficulties to establish a novel dichotomy for a non-trivial fragment of first-order CNF sentences with negation. Finally, we discuss directions for future work.
- Peter Haas and Chris Jermaine: MCDB and SimSQL: Scalable Stochastic Analysis within the Database
This short paper highlights published work on the MCDB database system as well as its successor, SimSQL. These database systems are designed from the ground up to support stochastic analytics; that is, data analysis performed with the aid of stochastic simulations.
- Mathias Niepert and Guy Van den Broeck: Tractability through Exchangeability: A New Perspective on Efficient Probabilistic Inference
Exchangeability is a central notion in statistics and probability theory. The assumption that an infinite sequence of data points is exchangeable is at the core of Bayesian statistics. However, finite exchangeability as a statistical property that renders probabilistic inference tractable is less well-understood. We develop a theory of finite exchangeability and its relation to tractable probabilistic inference. The theory is complementary to that of independence and conditional independence. We show that tractable inference in probabilistic models with high treewidth and millions of variables can be explained with the notion of finite (partial) exchangeability. We also show that existing lifted inference algorithms implicitly utilize a combination of conditional independence and partial exchangeability.
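The tractability claim can be illustrated with a minimal sketch, under the assumption of full exchangeability: when the joint weight of an assignment to n binary variables depends only on the number k of ones, one can sum over the n+1 counts, weighted by binomial coefficients, instead of enumerating 2^n assignments. The weight function below is invented for the example.

```python
from math import comb

# Fully exchangeable toy model over n binary variables: the
# (unnormalized) weight of an assignment depends only on k, the
# number of ones (any function of k alone keeps it exchangeable).
n = 20
def weight(k):
    return 2.0 ** k  # hypothetical weight function for illustration

# Lifted computation: group the 2^n assignments into n + 1 classes
# by their count of ones; each class has comb(n, k) equal-weight
# members, so the partition function takes O(n) terms, not 2^n.
Z = sum(comb(n, k) * weight(k) for k in range(n + 1))

# Marginal query in the same O(n) style: probability that at least
# half of the variables are 1.
p_at_least_half = sum(comb(n, k) * weight(k)
                      for k in range(n // 2, n + 1)) / Z
```

With this particular weight function the partition function has a closed form, Z = (1 + 2)^n by the binomial theorem, which gives a handy sanity check on the counting argument.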
- Dan Olteanu and Sebastiaan van Schaik: ENFrame = (Programs + Queries) / Probabilistic Data
This paper overviews ENFrame, a framework for processing probabilistic data. In addition to relational query processing supported by existing probabilistic database management systems, ENFrame allows programming with loops, assignments, list comprehension, and aggregates to encode complex tasks such as clustering and classification of data retrieved via queries from probabilistic databases. We explain the design choices behind ENFrame, some distilled from the wealth of work on probabilistic databases and some new. We also highlight a few challenges lying ahead.
- Anja Gruenheid, Besmira Nushi and Donald Kossmann: Cost-Efficient Querying Strategies for the Crowd
Crowdsourcing is a mechanism that has evolved recently to enhance data processing, and it has been picked up by companies that refine large amounts of data through so-called crowd workers. Generally, crowdsourcing faces two challenges: first, the crowd is an uncertain input source, because a worker may not know the right answer or may simply answer wrongly; second, crowdsourcing comes at a monetary cost. As a result, finding ways to minimize the number of times the crowd is accessed is essential to making crowdsourcing a realistic option in the context of big data. To address these problems, crowd querying strategies have been devised that determine the task order and the budget spent per worker task. Existing querying strategies focus either on improving certainty in the current result or on optimizing the information gain by resolving the most uncertain parts of the dataset first. In this work, we first compare and evaluate both strategies, after which we combine their core ideas into a hybrid querying strategy. We then show experimentally that it is a robust and cost-efficient alternative for querying the crowd.
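As a toy illustration of budget-aware querying (this is not the paper's hybrid strategy; the independent worker-accuracy model and all parameters are assumptions), a sequential strategy can stop asking as soon as the posterior is confident enough, so that easy items cost few answers:

```python
def query_until_confident(ask, p_correct=0.8, threshold=0.95, max_asks=20):
    """Sequentially query workers about a yes/no task until the posterior
    probability of one answer exceeds `threshold`, or the budget runs out.

    `ask()` returns one worker's boolean answer; workers are assumed to
    answer correctly, independently, with probability `p_correct`.
    Returns (decided_answer, number_of_answers_paid_for).
    """
    odds = 1.0                         # prior odds of "True" vs "False"
    lr = p_correct / (1 - p_correct)   # likelihood ratio of one answer
    for n in range(1, max_asks + 1):
        odds *= lr if ask() else 1 / lr
        p_true = odds / (1 + odds)
        if p_true >= threshold or p_true <= 1 - threshold:
            return p_true >= threshold, n
    return odds >= 1.0, max_asks       # budget exhausted: majority call

# With p_correct = 0.8 and unanimous workers, three answers suffice:
answer, cost = query_until_confident(lambda: True)
# answer is True, cost is 3 (posterior 64/65 clears the 0.95 threshold)
```

A fixed-budget majority vote would spend the full budget on every item; the stopping rule spends more only where workers disagree, which is the cost-versus-certainty trade-off the abstract describes.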
Organizers:
- Kristian Kersting (Technical University of Dortmund, Fraunhofer IAIS)
- Benny Kimelfeld (LogicBlox)
Publicity and General Arrangements:
- André Hernich (University of Liverpool, UK)
Program Committee:
- Serge Abiteboul (INRIA-Saclay)
- Foto Afrati (National Technical University of Athens)
- Bogdan Cautis (University of Paris-Sud)
- Sara Cohen (The Hebrew University)
- James Cussens (University of York)
- Jesse Davis (KU Leuven)
- Luc De Raedt (Katholieke Universiteit Leuven)
- Amol Deshpande (University of Maryland at College Park)
- Daniel Deutch (Tel Aviv University)
- Wenfei Fan (University of Edinburgh)
- Rainer Gemulla (MPI)
- Vibhav Gogate (The University of Texas at Dallas)
- Fabian Hadiji (Technical University of Dortmund)
- Christopher Jermaine (Rice University)
- Kristian Kersting (Technical University of Dortmund, Fraunhofer IAIS)
- Roni Khardon (Tufts University)
- Benny Kimelfeld (LogicBlox)
- Daniel Lowd (University of Oregon)
- Martin Mladenov (Technical University of Dortmund, Fraunhofer IAIS)
- Sriraam Natarajan (Indiana University)
- Mathias Niepert (University of Washington)
- Dan Olteanu (Oxford University)
- Hoifung Poon (Microsoft Research)
- Christopher Ré (Stanford University)
- Oliver Schulte (Simon Fraser University)
- Pierre Senellart (Institut Mines–Télécom; Télécom ParisTech; CNRS LTCI)
- Dan Suciu (University of Washington)
- Guy Van den Broeck (UCLA)
- Maurice Van Keulen (University of Twente)
Important Dates:
- Submission: March 28th, 2014.
- Notification: April 30th, 2014.
- Submission of late-breaking papers: April 23rd, 2014.
- Notification for late-breaking papers: May 4th, 2014.
- Camera-ready submission: May 14th, 2014.
- Workshop: June 22nd, 2014.
Those interested in attending the workshop can submit papers of the following kinds, relating to the topic of the workshop as described above.
- A technical paper (ACM SIG style, 6 pages maximum including references). A paper of this kind should include novel, previously unpublished technical results. Accepted technical papers will be presented in both a spotlight talk and a poster during the workshop.
- A position, outrageous-idea or vision paper (ACM SIG style, 2 pages maximum). Accepted papers of this kind will also be presented in both a spotlight talk and a poster during the workshop.
- Highlights on published work (ACM SIG style, 2 pages maximum including references). Papers of this kind outline work by the authors that has already been published in other venues, and highlight the relevant future directions. The purpose of these papers is to publicise the authors' work. Accepted papers of this kind will be presented only in a poster during the workshop.
- A late-breaking paper (ACM SIG style, 3 pages maximum including references). Papers of this kind can be submitted later than the other kinds of papers (as specified below). The purpose of late-breaking papers is to provide workshop attendees with information about research that was initiated, enhanced, improved, or completed after the ordinary submission deadline. Accepted papers of this kind will be presented only in a poster during the workshop.
Submissions should state their type within the paper (e.g., in a title footnote). Submitted papers should be in PDF format and submitted via EasyChair. All submitted papers will be carefully peer-reviewed by multiple reviewers. The organizers will publish the accepted papers only on the workshop's website; these papers will not count as a formal publication.