30th British International Conference on Databases

July 6-8, 2015, Edinburgh, Scotland


Invited Speakers

Graham Cormode
University of Warwick

Streaming Methods in Data Analysis
A fundamental challenge in processing the massive quantities of information generated by modern applications lies in extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is the design and analysis of compact summaries: data structures that capture key features of the data and that can be built efficiently over distributed, streaming data. Popular summary structures include count-distinct algorithms, which compactly approximate the cardinality of a set of items, and sketches, which allow vector norms and products to be estimated. These are very attractive because they can be computed in parallel and combined to yield a single, compact summary of the data. This talk introduces the concepts behind compact summaries and gives examples of their use.
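As an illustration of the kind of summary the abstract describes (not code from the talk; all names and parameters here are illustrative), the Count-Min sketch estimates item frequencies in a stream using a small fixed-size table of counters, and two sketches built with the same parameters can be merged by addition, which is what makes them suitable for distributed data:

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: approximates item frequencies in a stream
    using a small 2-D array of counters; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row from a keyed digest of the item.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Taking the minimum over rows bounds the overcount
        # caused by hash collisions in any single row.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

    def merge(self, other):
        # Sketches with identical parameters combine by pointwise
        # addition, so partitions of a stream can be summarised
        # in parallel and then merged into one compact summary.
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```

For example, two sketches built over two partitions of a stream can be merged, and the merged sketch answers frequency queries for the whole stream with a one-sided (overcount-only) error.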
Daniel Keim
University of Konstanz

The Power of Visual Analytics: Unlocking the Value of Big Data
Never before in history has data been generated and collected at such high volumes as it is today. For the analysis of large data sets to be effective, it is important to include the human in the data exploration process and to combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and computational power of today's computers. Visual analytics helps to deal with the flood of information by integrating the human into the data analysis process and applying human perceptual abilities to large data sets. Presenting data in an interactive, graphical form often fosters new insights, encouraging the formation and validation of new hypotheses for better problem solving and deeper domain knowledge. Visual analytics techniques have proven to be of high value in exploratory data analysis. They are especially powerful in the first steps of the data exploration process, namely understanding the data and generating hypotheses about it, but they also contribute significantly to the actual knowledge discovery by guiding the search with visual feedback. In putting visual analysis to work on big data, it is not obvious what can be done by automated analysis and what should be done by interactive visual methods. In dealing with massive data, the use of automated methods is mandatory, and for some problems fully automated analysis may be sufficient, but there is also a wide range of problems where interactive visual methods are necessary. The presentation discusses when it is useful to combine visualization and analytics techniques, and the options for combining techniques from the two areas. Examples from a wide range of application areas illustrate the benefits of visual analytics techniques.
Renée Miller
University of Toronto

Big Data Curation
More than a decade ago, Peter Buneman used the term curated databases to refer to databases that are created and maintained using the (often substantial) effort and domain expertise of humans. These human experts clean the data, integrate it with new sources, prepare it for analysis, and share the data with other experts in their field. In data curation, one seeks to support human curators in all activities needed for maintaining and enhancing the value of their data over time. Curation includes data provenance, the process of understanding the origins of data and how it was created, cleaned, or integrated. Big Data offers opportunities to solve curation problems in new ways. The availability of massive data is making it possible to infer semantic connections among data, connections that are central to solving difficult integration, cleaning, and analysis problems. Some of the nuanced semantic differences that eluded enterprise-scale curation solutions can now be understood using evidence from Big Data. Big Data curation leverages the human expertise that has been embedded in Big Data, be it in general knowledge data created through mass collaboration, or in specialized knowledge bases created by incentivized user communities who value the creation and maintenance of high-quality data. In this talk, I describe our experience in Big Data curation. This includes our experience over the last five years curating NIH Clinical Trials data that we have published as Open Linked Data at linkedCT.org. I give an overview of how we have adapted some of the traditional solutions for data curation to account for (and take advantage of) Big Data.
Aaron Roth
University of Pennsylvania

Differential Privacy and Preserving Validity in Adaptive Data Analysis
In this talk, we briefly introduce differential privacy, a rigorous privacy solution concept developed over the last decade, and explain how it allows various sorts of accurate data analyses to be performed while giving very strong privacy guarantees to the individuals in the data set. Among other things, we will describe recent work that allows the private generation of synthetic data, accurate for large numbers of statistics, even on very high-dimensional data sets. We then explain a very recent and surprising connection between differential privacy and statistical validity in adaptive data analysis, in which the guarantees of differential privacy can actually *improve* the accuracy of an analysis!
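To make the privacy guarantee concrete, here is a minimal sketch (not from the talk; the data and parameters are invented for illustration) of the classic Laplace mechanism: a counting query has sensitivity 1, because changing one individual's record changes the count by at most 1, and adding Laplace noise with scale sensitivity/epsilon yields epsilon-differential privacy:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release a noisy version of a numeric query answer.

    Adding Laplace noise with scale sensitivity/epsilon gives
    epsilon-differential privacy for a query whose answer changes
    by at most `sensitivity` when one individual's record changes.
    """
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) by inverse transform on a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * (1 if u >= 0 else -1) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Example: privately count how many records satisfy a predicate.
# (The ages below are made-up illustration data.)
ages = [23, 35, 45, 52, 61, 29, 41]
true_count = sum(1 for a in ages if a >= 40)   # true_count == 4
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Smaller epsilon means stronger privacy but more noise; the released `noisy_count` is unbiased, so averaging many independent releases would recover the true answer, which is exactly why privacy budgets must be tracked across queries.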
Sir Nigel Shadbolt
University of Southampton

Dealing with a Web of Data
We live in an age of superabundant information. The Internet and World Wide Web have been the agents of this revolution. A deluge of information and data has led to a range of scientific discoveries and engineering innovations. Data at Web scale has allowed us to characterise the shape and structure of the Web itself and to efficiently search its billions of items of content. Data published on the Web has enabled the mobilisation of hundreds of thousands of humans to solve problems beyond any individual or single organisation. The last five years have seen increasing amounts of open data being published on the Web. Open data published on the Web is improving the efficiency of our public services and giving rise to open innovation. In particular, governments have made data available across a wide range of sectors: spending, crime and justice, education, health, transport, geospatial, environmental and much more. The data has been published in a variety of formats and has been reused with varying degrees of success. Commercial organisations have begun to exploit this resource and in some cases have elected to release their own open data. Data collected at scale by public and private agencies also gives rise to concerns about its use and abuse. Meanwhile, data science is emerging as an area of competitive advantage for individuals, companies, universities, public and private sector organisations and nation states. A Web of data offers new opportunities and challenges for science, government and business. These include issues of provenance and security, quality and curation, certification and citation, linkage and annotation.
Padhraic Smyth
University of California, Irvine

Statistical Thinking in Machine Learning
Machine learning began as a subfield of artificial intelligence several decades ago but has grown to become a major research area within computer science in its own right. In particular, in the past few years machine learning has played a key role in making progress on a variety of application problems in areas such as image recognition, speech recognition, online advertising, and ranking of Web search results. The field is enjoying continued attention with the resurgent interest in neural network models via deep learning, and the broad interest outside computer science in topics such as "data science" and "big data." In this talk we will discuss the role of statistics in the success of machine learning. Statistical theories and models have long provided a foundational basis for many of the techniques used in machine learning, particularly in models with explicit probabilistic semantics such as logistic regression, hidden Markov models, and so on. But even for models which appear on the surface to have no explicit probabilistic or statistical semantics, such as neural networks or decision trees, there are fundamental statistical trade-offs at play, lurking beneath the surface of problems that appear to be more closely related to optimization than they are to statistical modeling. Focusing primarily on predictive modeling (classification and regression) we will explore how statistical thinking is fundamental to machine learning even when statistical models do not appear to be involved.
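The statistical trade-off the abstract alludes to can be illustrated with a small sketch (my own example, not material from the talk): a flexible model driven purely by optimization will fit the training data ever more closely, while its error on fresh data behaves quite differently, the classic bias-variance trade-off.

```python
import numpy as np

# Illustrative data: a noisy sine curve, with fresh noise for the test set.
rng = np.random.default_rng(0)

def make_data(n):
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(20)

def fit_and_errors(degree):
    # Least-squares polynomial fit: a model with no probabilistic
    # semantics on the surface, yet governed by statistical trade-offs.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# A low-degree fit has some bias; a high-degree fit chases the
# training noise, so its training error shrinks while its test
# error reflects the variance it has absorbed.
train_lo, test_lo = fit_and_errors(3)
train_hi, test_hi = fit_and_errors(15)
```

Because the degree-15 model nests the degree-3 one, its training error can only be lower; whether its test error is also lower is a statistical question, not an optimization one, which is the point of the talk's framing.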