We consider two challenging ecological problems and show how they are naturally investigated using spatial survival models with time-varying covariates.
The first problem involves explaining the occurrence of wildfires. This process is of interest because over half of the world's terrestrial ecosystems depend on fire to maintain ecological structure and function, and the ecological role of fire regimes can be strongly influenced by weather and climate. To undertake this analysis, we developed an extensive database of observed fires, combined with high-resolution meteorological data, to explore fire regimes in the Mediterranean ecosystem of the Cape Floristic Region (CFR) of South Africa during the period 1980-2000. We need to consider the influence of seasonally (quarterly) anomalous weather on fire probability. In addition to these local-scale influences, the Antarctic Oscillation (AAO) is a potentially important large-scale influence operating through global circulation patterns.
The second problem involves explaining first flowering times. The objective here is to learn about changes in the length and onset of the growing season. This process has to be examined at the individual tree/plant level and in response to weather, in particular daily temperature, rather than by aggregating to climate. We are broadly interested in comparing first flowering time (or bud burst) across species, but here we focus on explaining spatial variation in first flowering time. We consider first flowering dates for trees of a single species in Japan at 45 locations over 52 years, collected through 2009. The challenge with this process is to provide suitable functions of the weather, heating and chilling functions, to employ in the explanation. The difficulty is that these functions are not explicitly defined, since they require measurement beginning from unknown starting dates as well as unknown thresholds; we therefore have uncertainty in the specification of the functional covariates. We present both analyses and our findings, along with some future challenges.
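To make the nature of these functional covariates concrete, a heating (forcing) function can be written as degree-days accumulated above a temperature threshold from a starting date, and a chilling function as degree-days below a threshold. The sketch below is illustrative only: the function names are ours, and the start date and threshold are fixed here, whereas in the model described above they are the unknown quantities.

```python
def heating(daily_temps, start_day, threshold):
    """Degree-days above `threshold`, accumulated from `start_day` onward.

    Both `start_day` and `threshold` are unknown in the model discussed
    above; they are fixed here only to illustrate the functional form.
    """
    return sum(max(t - threshold, 0.0) for t in daily_temps[start_day:])


def chilling(daily_temps, start_day, threshold):
    """Degree-days below `threshold`, accumulated from `start_day` onward."""
    return sum(max(threshold - t, 0.0) for t in daily_temps[start_day:])
```

Treating the start days and thresholds as unknown parameters makes the covariates themselves random, which is the source of the specification uncertainty mentioned above.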
Regression problems are traditionally analyzed via univariate characteristics such as the regression function, but in complex situations, more detailed information about the association is provided by the conditional density of the response Y given the predictors X. A popular approach is to obtain an estimate of the conditional density from an estimate of the joint density of X and Y.
In Bayesian nonparametrics, this is commonly achieved by modelling the joint density as a Dirichlet process mixture of parametric kernels. An appealing aspect of this approach is that computations are relatively easy. More problematic is that its performance depends on the relative degree of smoothing of the marginal and conditional densities. If X is high-dimensional, estimating its marginal density generally requires more kernels than estimating the conditional density of Y given X, and this cannot be reflected in the random partition model implied by the Dirichlet process.
We point out this problem by examining prediction in a regression setting with an increasing number of covariates, and suggest a solution using a different nonparametric prior, namely the Enriched Dirichlet process. Our proposal maintains a simple allocation rule, so that computations remain relatively simple. The advantages are shown through both predictive equations and examples.
Coauthors: Sara Wade, David Dunson, Lorenzo Trippa
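To make the joint-to-conditional construction concrete, here is a minimal sketch (with made-up mixture parameters, not a fitted Dirichlet process model) of how a conditional density is read off a joint mixture of product Gaussian kernels: the mixture weights are reweighted by how well each component explains the observed x.

```python
import numpy as np
from math import pi


def normal_pdf(v, mean, var):
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2 * pi * var)


# Hypothetical fitted joint mixture over (X, Y): two product-Gaussian kernels.
weights = np.array([0.5, 0.5])
mu_x, var_x = np.array([-2.0, 2.0]), np.array([1.0, 1.0])
mu_y, var_y = np.array([0.0, 3.0]), np.array([0.5, 0.5])


def conditional_density(y, x):
    # Conditioning on x reweights components by their marginal density at x,
    # then mixes the per-component densities of y.
    w = weights * normal_pdf(x, mu_x, var_x)
    w /= w.sum()
    return np.sum(w * normal_pdf(y, mu_y, var_y))
```

The smoothing tension described above arises because the same partition (here, the same two components) must serve both the marginal of X and the conditional of Y given X.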
Some people argue that Bayesian statistics will not be relevant in an age where data are abundant. I will review a number of reasons that point in the opposite direction and argue for the increasing need for Bayesian tools. However, the application of Bayesian methods to big data problems runs into a computational bottleneck that needs to be addressed with a new generation of (approximate) inference methods. I will review two classes of efficient inference procedures, both based on stochastic approximations: stochastic gradient Markov chain Monte Carlo sampling and stochastic variational Bayesian inference. I will also argue that these methods are very well suited to parallelization where the data are distributed over multiple machines. During the last part of the talk I will argue that new algorithmic advances are also needed for likelihood-free Bayesian problems where the model is given as an expensive simulation. I will conclude with some new speculative directions for research, such as privacy-preserving Bayesian methods.
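As a sketch of the first class of methods, stochastic gradient Langevin dynamics replaces the full-data gradient in a Langevin update with a rescaled minibatch gradient plus injected Gaussian noise. The toy model, prior, and step size below are illustrative choices, not the speaker's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: N observations from N(mu_true, 1); we sample the posterior of mu.
mu_true, N, batch = 3.0, 10_000, 100
x = rng.normal(mu_true, 1.0, size=N)

mu, eps = 0.0, 1e-4   # initial state and fixed step size (illustrative values)
samples = []
for step in range(2000):
    idx = rng.integers(0, N, size=batch)
    # Stochastic gradient of the log posterior: a weak N(0, 10^2) prior term
    # plus the minibatch likelihood gradient rescaled by N / batch.
    grad = -mu / 100.0 + (N / batch) * np.sum(x[idx] - mu)
    # Langevin update: half-step along the gradient plus N(0, eps) noise.
    mu += 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
    samples.append(mu)

posterior_mean = np.mean(samples[500:])   # discard burn-in
```

In practice the step size is decreased over iterations; a fixed small step is used here to keep the sketch short.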
In today's world of connectivity and ubiquitous data, we need to be able to solve ever more complex machine learning problems. Ideally, we would like the solutions to be provided automatically, given a definition of the problem. But how do we even describe a complex and intricate problem in a way a machine can understand?
Probabilistic programming gives us a powerful and flexible way to express such machine learning problems. It uses the rich expressivity of modern programming languages to describe a problem as a Bayesian inference query. Once you have a probabilistic program, the idea is to use an existing inference 'engine' to actually run the program and so answer the Bayesian inference query automatically. This allows the program to be tested rapidly and, if necessary, quickly modified and improved to meet the needs of the application.
In this lecture, I will describe efforts under way at Microsoft Research Cambridge to create a fast, general-purpose probabilistic programming language called probabilistic C#. As an example, I will show how probabilistic C# can be used to produce compact, complex models of unstructured text. To do inference in probabilistic C# (to run the program), we use the Infer.NET inference engine. I will discuss some of the (many) issues arising in trying to build a general-purpose inference engine and describe approaches we are investigating to make running probabilistic C# programs both fast and accurate. The aim is to make probabilistic C# an ideal language for solving the next generation of machine learning problems.
The performance of petroleum reservoirs is predicted with complex parametric models, which describe static geological porous-media properties and dynamic fluid flow. These models are highly parametric and subject to wide uncertainties that are commonly inferred by solving an inverse problem. The efficiency of stochastic sampling over the high-dimensional model parameter space to infer the uncertainty of the model prediction depends on restricting the space of models to only the geologically realistic ones. It is therefore important to use adequate priors to represent the model parameter uncertainty.
Geological realism is often difficult to achieve in reservoir models, as it requires more sophisticated algorithmic descriptions based on higher-order statistics. Furthermore, the priors for these descriptions may be even more difficult to elicit, because the model parameters often differ from the ones that can be observed in natural analogues. For instance, the spatial correlation of sand bodies can be directly observed and measured in an outcrop as distances, whereas the corresponding geostatistical model parameters, such as variogram range and nugget, cannot be directly observed from an outcrop and need to be interpreted. Finally, updating the model parameters whilst solving the inverse problem may result in unrealistic combinations of parameters if their correlation in the high-dimensional space is not handled a priori.
Proper geological priors can be elicited from a vast domain of natural analogue information using a machine learning approach. Thus, the multivariate relation between channel geometries (width, thickness, amplitude, wavelength) has been derived from modern river analogues for different types of depositional environments: meandering channels, deep marine channels and deltas. The probabilistic description of this relationship formed the informative prior for solving the inverse history matching problem. Machine learning techniques were used again to establish the relationship between the geostatistical model parameters and the geologically observable parameters.
The application of the proposed approach is illustrated with a realistic petroleum reservoir prediction case study. A comparison of informative priors versus non-informative flat priors shows that the inference is more efficient with the proper priors and that the resulting models have higher predictive capability.
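One simple way to encode such correlated, learned priors is to fit a joint distribution to analogue measurements so that prior draws respect the correlations between geometries rather than sampling each parameter independently. The sketch below uses a multivariate Gaussian on log-geometries with entirely synthetic numbers; the actual study would use richer machine learning models and a real analogue database.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical analogue measurements: log channel width, thickness and
# wavelength, standing in for a database of modern river analogues.
logs = rng.multivariate_normal([4.0, 0.5, 6.0],
                               [[0.30, 0.20, 0.25],
                                [0.20, 0.25, 0.15],
                                [0.25, 0.15, 0.40]], size=500)

# "Learn" the informative prior as a multivariate Gaussian over the
# log-geometries, so correlated parameters are drawn jointly.
mean, cov = logs.mean(axis=0), np.cov(logs.T)
prior_draws = rng.multivariate_normal(mean, cov, size=1000)
width, thickness, wavelength = np.exp(prior_draws).T   # back to physical units
```

Drawing jointly from the fitted prior avoids the unrealistic parameter combinations that arise when each geometry is perturbed independently during the inverse-problem update.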
We propose a method for knowledge transfer between semantically related classes in ImageNet. By transferring knowledge from images that have bounding-box annotations to those that do not, our method can automatically populate ImageNet with many more bounding boxes. The underlying assumption, that objects from semantically related classes look alike, is formalized in our novel Associative Embedding (AE) representation. AE recovers the latent low-dimensional space of appearance variations among image windows. We model the overlap of a window with an object using Gaussian Process (GP) regression, which spreads annotations smoothly through AE space. The probabilistic nature of GPs allows our method to perform self-assessment, i.e. to assign a quality estimate to its own output, enabling a trade-off between the amount of returned annotations and their quality. A large-scale experiment on 219 classes and 0.5 million images demonstrates that our method outperforms state-of-the-art methods and baselines for object localization. Using self-assessment, we can automatically return bounding-box annotations for 51% of all images with high localization accuracy (i.e. 71% average overlap with ground truth).
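The self-assessment mechanism can be sketched with plain GP regression: the predictive variance supplies the quality estimate used to trade off quantity against quality. Everything below (RBF kernel, synthetic 2-D points standing in for AE embeddings, noise level) is an illustrative assumption, not the paper's configuration.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    # Squared-exponential kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)

# Synthetic low-dimensional embeddings of image windows (stand-in for AE space)
# with synthetic "overlap with object" targets in [0, 1].
X_train = rng.uniform(-3, 3, size=(30, 2))
y_train = np.exp(-0.5 * (X_train ** 2).sum(axis=1))
X_test = rng.uniform(-3, 3, size=(5, 2))

noise = 1e-2
K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
Ks = rbf(X_test, X_train)
mean = Ks @ np.linalg.solve(K, y_train)   # predicted overlap for new windows
# Predictive variance: the self-assessment signal (higher = less trustworthy).
var = 1.0 + noise - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
```

Thresholding `var` then selects which predicted annotations to return, which is the quantity-versus-quality trade-off described above.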
Recently, new experimental techniques have allowed us to monitor a wide variety of biological processes at high temporal and spatial resolution: from gene expression in single cells to the enzymatic activity of single molecules. It is therefore becoming essential to develop general and optimal methodologies to analyze such data. I will present a Bayesian framework for analyzing biological time series, which combines measurement noise models with stochastic models describing the underlying dynamical process. This approach allows us to infer model parameters and, more importantly, to discriminate between competing models. I will show a particular application of this methodology to the kinetics of gene expression. Recent single-cell studies showed that most genes appear to be transcribed during short periods called transcriptional bursts, interspersed with silent intervals. Our analysis demonstrated that transcriptional bursting kinetics are highly gene-specific, reflecting refractory periods during which genes stay inactive for a certain time before switching on again.
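Bursting kinetics of this kind are often summarized by a two-state "telegraph" model. Below is a minimal simulation sketch with invented rate constants (not parameters inferred in the talk): the gene alternates between exponentially distributed off- and on-periods, and mRNA is produced only while it is on.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative rates: switching on, switching off, transcription, horizon.
k_on, k_off, k_tx, T = 0.5, 2.0, 20.0, 1000.0

t, on, mrna = 0.0, False, 0
while t < T:
    # Exponential dwell time in the current state.
    dwell = rng.exponential(1.0 / (k_on if not on else k_off))
    if on:
        # All transcripts of one on-period form a transcriptional burst.
        mrna += rng.poisson(k_tx * dwell)
    t += dwell
    on = not on
```

A refractory period, as found in the analysis above, would replace the exponential off-time with a less memoryless distribution (e.g. a gamma), which is exactly what model discrimination between competing kinetic schemes can detect.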
Recently, the standard statistical inference methods used across the sciences have come under increased scrutiny. Although critiques of classical statistical tools vary, Bayesian arguments in particular offer a compelling set of critiques of the traditional methods. As researchers become more open to Bayesian methods, Bayesians are under increasing pressure to give researchers principled, viable, and easy-to-use alternatives. Rouder, Morey, Speckman, and Province (2012) introduced a default family of prior distributions for linear mixed effects models, based on mixtures of g priors (Liang et al., 2008; Zellner and Siow, 1980). Bayes factors based on these prior distributions are implemented in the R package BayesFactor (Morey and Rouder, 2014), which offers an intuitive interface for computing and manipulating Bayes factors and provides a viable alternative to traditional t tests, ANOVA, and regression.
Data on disease occurrence at the field level are valuable resources for quantifying host genetic variation in disease resistance. However, they are often inaccurate due to incomplete information on exposure, disease prevalence and imperfect diagnostic tests. Quantitative genetic models of disease occurrence data do not typically account for these factors, leading to underestimation of the true extent of genetic variation. We propose a framework that integrates genetics and epidemiology, including genetic relationships between animals, observed disease state, prevalence of the disease, and the sensitivity and specificity of diagnostic tests. Bayesian inference allows quantification of host genetic variation while accounting for the complexities inherent in field disease data. Prior information, elicited from expert opinion, is incorporated. Application to simulated data shows that this novel approach provides reliable inferences on genetic and epidemiological parameters that are of practical relevance to animal breeders.
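The role of test sensitivity and specificity can be illustrated with the classical Rogan-Gladen correction, which a full Bayesian model like the one above generalizes: apparent prevalence mixes true positives and false positives, so the true prevalence is recovered by inverting that relation. This is a standalone sketch, not the proposed genetic-epidemiological framework itself.

```python
def apparent_prevalence(true_prev, sensitivity, specificity):
    # Observed test-positive fraction = true positives + false positives.
    return sensitivity * true_prev + (1.0 - specificity) * (1.0 - true_prev)

def rogan_gladen(apparent_prev, sensitivity, specificity):
    # Invert the relation above to recover the true prevalence.
    return (apparent_prev + specificity - 1.0) / (sensitivity + specificity - 1.0)
```

Ignoring this correction (i.e. treating observed disease state as the true state) is one way imperfect tests bias downward the apparent genetic variation.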
In immunology, T-cell receptor (TCR) diversity is instrumental in understanding how the immune system correctly discriminates harmful microorganisms from body components. As in species diversity studies, this quantity is only accessible through estimation based on a sample of TCR sequences. Analogous to the species-abundance distribution in ecology, the so-called sequence-abundance distribution is the core of the whole analysis. The primary aim is to estimate the total number of unseen TCR types by fitting simple parametric models to that distribution. However, theoretical studies of T-cell physiology have recently shown that the sequence-abundance distribution results from a complex and intricate mixture of different TCR populations. Thus, although they fit the available data well, current parametric solutions for TCR estimation seem unlikely to provide reliable estimates. Here we propose a flexible Bayesian semi-parametric model in which TCR sequences are sampled according to a Poisson distribution whose rate, in turn, varies according to a second, unknown probability distribution. We assume a Dirichlet process for this second-level distribution, using an appropriate Gamma distribution as the base measure. We illustrate the proposed model with previously published data on type 1 diabetes, a disease caused by erroneous immune responses to pancreatic cells. We also simulate data from realistic immunological settings to demonstrate the superiority of our method relative to current parametric counterparts.
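A minimal simulation sketch of the proposed hierarchy (via truncated stick-breaking; all hyperparameter values are invented for illustration): clone-specific Poisson rates are drawn from a Dirichlet process whose base measure is a Gamma distribution, and clone abundances are Poisson given those rates.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_atoms, rng):
    # Truncated stick-breaking weights of a Dirichlet process.
    betas = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

# Illustrative hyperparameters (not fitted to any data).
alpha, a, b = 2.0, 1.0, 0.5     # DP concentration; Gamma(a, rate=b) base measure
K, n_clones = 50, 10_000        # truncation level; number of TCR clones

weights = stick_breaking(alpha, K, rng)
weights = weights / weights.sum()        # renormalize after truncation
atoms = rng.gamma(a, 1.0 / b, size=K)    # Gamma-distributed rate atoms
z = rng.choice(K, size=n_clones, p=weights)
counts = rng.poisson(atoms[z])           # simulated clone abundances
```

Because the rate distribution is an (almost surely discrete) random measure rather than a single parametric family, the resulting abundance distribution can capture the mixture of TCR populations described above.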