Joris M. Mooij - Abstracts of publications
Joris M. Mooij
Published
Information-geometric approach to inferring causal directions
While conventional approaches to causal inference are mainly based on
conditional (in)dependences, recent methods also account for the shape of
(conditional) distributions. The idea is that the causal hypothesis "X causes
Y" imposes that the marginal distribution PX and the conditional distribution
P(Y|X) represent independent mechanisms of nature. Recently it has been
postulated that the shortest description of the joint distribution P(X,Y) should
therefore be given by separate descriptions of P(X) and P(Y|X). Since description
length in the sense of Kolmogorov complexity is uncomputable, practical
implementations rely on other notions of independence. Here we define
independence via orthogonality in information space. This way, we can
explicitly describe the kind of dependence that occurs between P(Y) and P(X|Y)
making the causal hypothesis "Y causes X" implausible. Remarkably, this
asymmetry between cause and effect becomes particularly simple if X and Y are
deterministically related. We present an inference method that works in this
case. We also discuss some theoretical results for the non-deterministic case
although it is not clear how to employ them for a more general inference
method.
On Causal Discovery with Cyclic Additive Noise Models
We study a particular class of cyclic causal models, where each variable is a
(possibly nonlinear) function of its parents and additive noise. We prove that
the causal graph of such models is generically identifiable in the bivariate,
Gaussian-noise case. We also propose a method to learn such models from
observational data. In the acyclic case, the method reduces to ordinary
regression, but in the more challenging cyclic case, an additional term arises
in the loss function, which makes it a special case of nonlinear independent
component analysis. We illustrate the proposed method on synthetic data.
Efficient inference in matrix-variate Gaussian models with iid observation noise
Inference in matrix-variate Gaussian models has major applications for
multi-output prediction and joint learning of row and column covariances from
matrix-variate data. Here, we discuss an approach for efficient inference in
such models that explicitly account for iid observation noise. Computational
tractability can be retained by exploiting the Kronecker product between row
and column covariance matrices. Using this framework, we show how to
generalize the Graphical Lasso in order to learn a sparse inverse covariance
between features while accounting for a low-rank confounding covariance between
samples. We show practical utility on applications to biology, where we model
covariances with more than 100,000 dimensions. We find greater accuracy in
recovering biological network structures and are able to better reconstruct the
confounders.
Learning of Causal Relations
To learn about causal relations between variables just by
observing samples from them, particular assumptions must be made about
those variables' distributions. This article gives a practical description of
how such a learning task can be undertaken based on different possible
assumptions. Two categories of assumptions lead to different methods,
constraint-based and Bayesian learning, and in each case we review both
the basic ideas and some recent extensions and alternatives to them.
Identifiability of Causal Graphs using Functional Models
This work addresses the following question: Under what assumptions on the data
generating process can one infer the causal graph from the joint distribution?
The approach taken by conditional independence-based causal discovery methods
is based on two assumptions: the Markov condition and faithfulness. It has
been shown that under these assumptions the causal graph can be identified up
to Markov equivalence (some arrows remain undirected) using methods like the PC
algorithm. In this work we propose an alternative by defining Identifiable
Functional Model Classes (IFMOCs). As our main theorem we prove that if the
data generating process belongs to an IFMOC, one can identify the complete
causal graph. To the best of our knowledge this is the first identifiability
result of this kind that is not limited to linear functional relationships. We
discuss how the IFMOC assumption and the Markov and faithfulness assumptions
relate to each other and explain why we believe that the IFMOC assumption can
be tested more easily on given data. We further provide a practical algorithm
that recovers the causal graph from finitely many data; experiments on
simulated data support the theoretical findings.
A Graphical Model Framework for Decoding in the Visual ERP-Based BCI Speller
We present a graphical model framework for decoding in the visual ERP-based
speller system. The proposed framework allows researchers to build generative
models from which the decoding rules are obtained in a straightforward manner.
We suggest two models for generating brain signals conditioned on the stimulus
events. Both models incorporate letter frequency information but assume
different dependencies between brain signals and stimulus events. For both
models, we derive decoding rules and perform a discriminative training. We show
on real visual speller data how decoding performance improves by incorporating
letter frequency information and using a more realistic graphical model for the
dependencies between the brain signals and the stimulus events. Furthermore, we
discuss how the standard approach to decoding can be seen as a special case of
the graphical model framework. The letter also gives more insight into the
discriminative approach for decoding in the visual speller system.
Probabilistic latent variable models for distinguishing between cause and effect
We propose a novel method for inferring whether X causes Y or vice versa from
joint observations of X and Y. The basic idea is to model the observed data
using probabilistic latent variable models, which incorporate the effects of
unobserved noise. To this end, we consider the hypothetical effect variable to
be a function of the hypothetical cause variable and an independent noise term
(not necessarily additive). An important novel aspect of our work is that we
do not restrict the model class, but instead put general non-parametric priors
on this function and on the distribution of the cause. The causal direction
can then be inferred by using standard Bayesian model selection. We evaluate
our approach on synthetic data and real-world data and report encouraging
results.
libDAI: A Free and Open Source C++ Library for Discrete Approximate Inference in Graphical Models
This paper describes the software package libDAI, a free & open source C++
library that provides implementations of various exact and approximate
inference methods for graphical models with discrete-valued variables. libDAI
supports directed graphical models (Bayesian networks) as well as undirected
ones (Markov random fields and factor graphs). It offers various approximations
of the partition sum, marginal probability distributions and maximum
probability states. Parameter learning is also supported. A feature comparison
with other open source software packages for approximate inference is given.
libDAI is licensed under the GPL v2+ license and is available at
http://www.libdai.org.
Inferring deterministic causal relations
We consider two variables that are related to each other by an invertible
function. While it has previously been shown that the dependence structure of
the noise can provide hints to determine which of the two variables is the
cause, we presently show that even in the deterministic (noise-free) case,
there are asymmetries that can be exploited for causal inference. Our method is
based on the idea that if the function and the probability density of the cause
are chosen independently, then the distribution of the effect will, in a
certain sense, depend on the function. We provide a theoretical analysis of
this method, showing that it also works in the low noise regime, and link it to
information geometry. We report strong empirical results on various real-world
data sets from different domains.
Remote Sensing Feature Selection by Kernel Dependence Measures
This letter introduces a nonlinear measure of independence between random
variables for remote sensing supervised feature selection. The so-called
HilbertSchmidt independence criterion (HSIC) is a kernel method for evaluating
statistical dependence and it is based on computing the HilbertSchmidt norm of
the cross-covariance operator of mapped samples in the corresponding Hilbert
spaces. The HSIC empirical estimator is easy to compute and has good
theoretical and practical properties. Rather than using this estimate for
maximizing the dependence between the selected features and the class labels,
we propose the more sensitive criterion of minimizing the associated HSIC
p-value. Results in multispectral, hyperspectral, and SAR data feature
selection for classification show the good performance of the proposed
approach.
Distinguishing between cause and effect
We describe eight data sets that together formed the CauseEffectPairs task in
the Causality Challenge #2: Pot-Luck competition. Each set consists of a sample
of a pair of statistically dependent random variables. One variable is known to
cause the other one, but this information was hidden from the participants; the
task was to identify which of the two variables was the cause and which one the
effect, based upon the observed sample. The data sets were chosen such that we
expect common agreement on the ground truth. Even though part of the
statistical dependences may also be due to hidden common causes, common sense
tells us that there is a significant cause-effect relation between the two
variables in each pair. We also present baseline results using three different
causal inference methods.
Identifying confounders using additive noise models
We propose a method for inferring the existence of a latent common cause
("confounder") of two observed random variables. The method assumes that the
two effects of the confounder are (possibly nonlinear) functions of the
confounder plus independent, additive noise. We discuss under which conditions
the model is identifiable (up to an arbitrary reparameterization of the
confounder) from the joint distribution of the effects. We state and prove a
theoretical result that provides evidence for the conjecture that the model is
generically identifiable under suitable technical conditions. In addition, we
propose a practical method to estimate the confounder from a finite i.i.d.
sample of the effects and illustrate that the method works well on both
simulated and real-world data.
Motivated by causal inference problems, we propose a novel method for
regression that minimizes the statistical dependence between regressors and
residuals. The key advantage of this approach to regression is that it does not
assume a particular distribution of the noise, i.e., it is non-parametric with
respect to the noise distribution. We argue that the proposed regression method
is well suited to the task of causal inference in additive noise models. A
practical disadvantage is that the resulting optimization problem is generally
non-convex and can be difficult to solve. Nevertheless, we report good results
on one of the tasks of the NIPS 2008 Causality Challenge, where the goal is to
distinguish causes from effects in pairs of statistically dependent variables.
In addition, we propose an algorithm for efficiently inferring causal models
from observational data for more than two variables. The required number of
regressions and independence tests is quadratic in the number of variables,
which is a significant improvement over the simple method that tests all
possible DAGs.
The discovery of causal relationships between a set of observed variables is a
fundamental problem in science. For continuous-valued data linear acyclic
causal models with additive noise are often used because these models are well
understood and there are well-known methods to fit them to data. In reality, of
course, many causal relationships are more or less nonlinear, raising some
doubts as to the applicability and usefulness of purely linear methods. In this
contribution we show that the basic linear framework can be generalized
to nonlinear models. In this extended framework, nonlinearities in the
data-generating process are in fact a blessing rather than a curse, as they
typically provide information on the underlying causal system and allow more
aspects of the true data-generating mechanisms to be identified. In addition to
theoretical results we show simulations and some simple real data experiments
illustrating the identification power provided by nonlinearities.
Bounds on marginal probability distributions
We propose a novel bound on single-variable marginal probability distributions in
factor graphs with discrete variables. The bound is obtained by propagating local
bounds (convex sets of probability distributions) over a subtree of the factor graph,
rooted in the variable of interest. By construction, the method not only bounds
the exact marginal probability distribution of a variable, but also its approximate
Belief Propagation marginal ("belief"). Thus, apart from providing a practical
means to calculate bounds on marginals, our contribution also lies in providing
a better understanding of the error made by Belief Propagation. We show that
our bound outperforms the state-of-the-art on some inference problems arising in
medical diagnosis.
Sufficient Conditions for Convergence of the Sum–Product Algorithm
Novel conditions are derived that guarantee convergence of the Sum-Product
Algorithm (also known as Loopy Belief Propagation or simply Belief Propagation
(BP)) to a unique fixed point, irrespective of the initial messages, for
parallel (synchronous) updates. The computational complexity of the conditions
is polynomial in the number of variables. In contrast with previously existing
conditions, our results are directly applicable to arbitrary factor graphs
(with discrete variables) and are shown to be valid also in the case of factors
containing zeros, under some additional conditions. The conditions are compared
with existing ones, numerically and, if possible, analytically. For binary
variables with pairwise interactions, sufficient conditions are derived that
take into account local evidence (i.e., single-variable factors) and the type
of pair interactions (attractive or repulsive). It is shown empirically that
this bound outperforms existing bounds.
Truncating the Loop Series Expansion for Belief Propagation
Recently, Chertkov and Chernyak (2006b) derived an exact expression
for the partition sum (normalization constant) corresponding to a
graphical model, which is an expansion around the belief propagation
(BP) solution. By adding correction terms to the BP free energy, one
for each "generalized loop" in the factor graph, the exact partition
sum is obtained. However, the usually enormous number of generalized
loops generally prohibits summation over all correction terms. In this
article we introduce truncated loop series BP (TLSBP), a particular
way of truncating the loop series of Chertkov and Chernyak by
considering generalized loops as compositions of simple loops. We
analyze the performance of TLSBP in different scenarios, including the
Ising model on square grids and regular random graphs, and on
PROMEDAS, a large probabilistic medical diagnostic system. We show
that TLSBP often improves upon the accuracy of the BP solution, at the
expense of increased computation time. We also show that the
performance of TLSBP strongly depends on the degree of interaction
between the variables. For weak interactions, truncating the series
leads to significant improvements, whereas for strong interactions it
can be ineffective, even if a high number of terms is considered.
Loop Corrections for Approximate Inference on Factor Graphs
We propose a method to improve approximate inference methods by
correcting for the influence of loops in the graphical model. The
method is a generalization and alternative implementation of a recent
idea from Montanari and Rizzo (2005). It is applicable to arbitrary
factor graphs, provided that the size of the Markov blankets is not
too large. It consists of two steps: (i) an approximate inference
method, for example, belief propagation, is used to approximate cavity
distributions for each variable (i.e., probability distributions on
the Markov blanket of a variable for a modified graphical model in
which the factors involving that variable have been removed); (ii) all
cavity distributions are improved by a message-passing algorithm that
cancels out approximation errors by imposing certain consistency
constraints. This loop correction (LC) method usually gives
significantly better results than the original, uncorrected,
approximate inference algorithm that is used to estimate the effect of
loops. Indeed, we often observe that the loop-corrected error is
approximately the square of the error of the uncorrected approximate
inference method. In this article, we compare different variants of
the loop correction method with other approximate inference methods on
a variety of graphical models, including "real world" networks, and
conclude that the LC method generally obtains the most accurate results.
Inference in the Promedas medical expert system
In the current paper, the Promedas model for internal medicine, developed
by our team, is introduced. The model is based on up-to-date medical knowledge
and consists of approximately 2000 diagnoses, 1000 findings and 8600 connections
between diagnoses and findings, covering a large part of internal medicine.
We show that Belief Propagation (BP) can be successfully applied as approximate
inference algorithm in the Promedas network. In some cases, however, we find
errors that are too large for this application. We apply a recently developed
method that improves the BP results by means of a loop expansion scheme. This
method, termed Loop Corrected (LC) BP, is able to improve the marginal
probabilities significantly, leaving a remaining error which is acceptable
for the purpose of medical diagnosis.
Loop Corrected Belief Propagation
We propose a method for improving Belief Propagation (BP) that takes into
account the influence of loops in the graphical model. The method is a
variation on and generalization of the method recently introduced by (Montanari
and Rizzo, 2005). It consists of two steps: (i) standard BP is used to
calculate cavity distributions for each variable (i.e. probability
distributions on the Markov blanket of a variable for a modified graphical
model, in which the factors involving that variable have been removed); (ii)
all cavity distributions are combined by a message-passing algorithm to obtain
consistent single node marginals. The method is exact if the graphical model
contains a single loop. The complexity of the method is exponential in the
size of the Markov blankets. The results are very accurate in general: the
error is often several orders of magnitude smaller than that of standard BP, as
illustrated by numerical experiments.
Sufficient conditions for convergence of Loopy Belief Propagation
We derive novel sufficient conditions for convergence of Loopy Belief
Propagation (also known as the Sum-Product algorithm) to a unique fixed point.
Our results improve upon previously known conditions. For binary variables with
(anti-)ferromagnetic interactions, our conditions seem to be sharp.
On the properties of the Bethe approximation and Loopy Belief Propagation on binary networks
We analyse the local stability of the high-temperature fixed point of the loopy
belief propagation (LBP) algorithm and how this relates to the properties of
the Bethe free energy which LBP tries to minimize. We focus on the case of
binary networks with pairwise interactions. In particular, we state sufficient
conditions for convergence of LBP to a unique fixed point and show that these
are sharp for purely ferromagnetic interactions. In contrast, in the purely
antiferromagnetic case, the undamped parallel LBP algorithm is suboptimal in
the sense that the stability of the fixed point breaks down much earlier than
for damped or sequential LBP; we observe that the onset of instability for the
latter algorithms is related to the properties of the Bethe free energy. For
spin-glass interactions, damping LBP only helps slightly. We estimate
analytically the temperature at which the high-temperature LBP fixed point
becomes unstable for random graphs with arbitrary degree distributions and
random interactions.
Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
We introduce a computationally efficient method to estimate the validity of the
BP method as a function of graph topology, the connectivity strength,
frustration and network size. We present numerical results that demonstrate the
correctness of our estimates for the uniform random model and for a real-world
network ("C. Elegans"). Although the method is restricted to pair-wise
interactions, no local evidence (zero "biases") and binary variables, we
believe that its predictions correctly capture the limitations of BP for
inference and MAP estimation on arbitrary graphical models. Using this
approach, we find that BP always performs better than MF. Especially for large
networks with broad degree distributions (such as scale-free networks) BP turns
out to significantly outperform MF.
Quantitative Imaging through a Spectrograph. 1. Principles and Theory
Laser-based optical diagnostics, such as planar laser-induced fluorescence and,
especially, Raman imaging, often require selective spectral filtering. We
advocate the use of an imaging spectrograph with a broad entrance slit as a
spectral filter for two-dimensional imaging. A spectrograph in this mode of
operation produces output that is a convolution of the spatial and spectral
information that is present in the incident light. We describe an analytical
deconvolution procedure, based on Bayesian statistics, that retrieves the
spatial information while it avoids excessive noise blowup. The method permits
direct imaging through a spectrograph, even under broadband illumination. We
introduce the formalism and discuss the underlying assumptions. The performance
of the procedure is demonstrated on an artificial but pathological example. In
a companion paper [Appl. Opt. 43, 5682-5690 (2004)] the method is applied to
the practical case of fuel equivalence ratio Raman imaging in a combustible
methane-air mixture.
Preprints
Novel Bounds on Marginal Probabilities
We derive two related novel bounds on single-variable marginal probability
distributions in factor graphs with discrete variables. The first method
propagates bounds over a subtree of the factor graph rooted in the variable,
and the second method propagates bounds over the self-avoiding walk tree
starting at the variable. By construction, both methods not only bound the
exact marginal probability distribution of a variable, but also its approximate
Belief Propagation marginal (``belief''). Thus, apart from providing a
practical means to calculate bounds on marginals, our contribution also lies in
an increased understanding of the error made by Belief Propagation.
Empirically, we show that our bounds often outperform existing bounds in terms
of accuracy and/or computation time. We also show that our bounds can yield
nontrivial results for medical diagnosis inference problems.
Loop corrections for approximate inference
We propose a method for improving approximate inference methods that corrects
for the influence of loops in the graphical model. The method is applicable to
arbitrary factor graphs, provided that the size of the Markov blankets is not
too large. It is an alternative implementation of an idea introduced recently
by Montanari and Rizzo (2005). In its simplest form, which amounts to the
assumption that no loops are present, the method reduces to the minimal Cluster
Variation Method approximation (which uses maximal factors as outer clusters).
On the other hand, using estimates of the effect of loops (obtained by some
approximate inference algorithm) and applying the Loop Correcting (LC) method
usually gives significantly better results than applying the approximate
inference algorithm directly without loop corrections. Indeed, we often observe
that the loop corrected error is approximately the square of the error of the
approximate inference method used to estimate the effect of loops. We compare
different variants of the Loop Correcting method with other approximate
inference methods on a variety of graphical models, including "real world"
networks, and conclude that the LC approach generally obtains the most accurate
results.
Truncating the loop series expansion for Belief Propagation
Recently, M. Chertkov and V.Y. Chernyak derived an exact expression for the
partition sum (normalization constant) corresponding to a graphical model,
which is an expansion around the Belief Propagation (BP) solution. By adding
correction terms to the BP free energy, one for each "generalized loop" in the
factor graph, the exact partition sum is obtained. However, the usually
enormous number of generalized loops generally prohibits summation over all
correction terms. In this article we introduce Truncated Loop Series BP
(TLSBP), a particular way of truncating the loop series of M. Chertkov and V.Y.
Chernyak by considering generalized loops as compositions of simple loops. We
analyze the performance of TLSBP in different scenarios, including the Ising
model, regular random graphs and on Promedas, a large probabilistic medical
diagnostic system. We show that TLSBP often improves upon the accuracy of the
BP solution, at the expense of increased computation time. We also show that
the performance of TLSBP strongly depends on the degree of interaction between
the variables. For weak interactions, truncating the series leads to
significant improvements, whereas for strong interactions it can be
ineffective, even if a high number of terms is considered.
Sufficient conditions for convergence of the Sum-Product Algorithm
We derive novel conditions that guarantee convergence of the Sum-Product
algorithm (also known as Loopy Belief Propagation or simply Belief Propagation)
to a unique fixed point, irrespective of the initial messages. The
computational complexity of the conditions is polynomial in the number of
variables. In contrast with previously existing conditions, our results are
directly applicable to arbitrary factor graphs (with discrete variables) and
are shown to be valid also in the case of factors containing zeros, under some
additional conditions. We compare our bounds with existing ones, numerically
and, if possible, analytically. For binary variables with pairwise
interactions, we derive sufficient conditions that take into account local
evidence (i.e., single variable factors) and the type of pair interactions
(attractive or repulsive). It is shown empirically that this bound outperforms
existing bounds.
Spin-glass phase transitions on real-world graphs
We use the Bethe approximation to calculate the critical temperature for the
transition from a paramagnetic to a glassy phase in spin-glass models on
real-world graphs. Our criterion is based on the marginal stability of the
minimum of the Bethe free energy. For uniform degree random graphs (equivalent
to the Viana-Bray model) our numerical results, obtained by averaging single
problem instances, are in agreement with the known critical temperature
obtained by use of the replica method. Contrary to the replica method, our
method immediately generalizes to arbitrary (random) graphs. We present new
results for Barabasi-Albert scale-free random graphs, for which no analytical
results are known. We investigate the scaling behavior of the critical
temperature with graph size for both the finite and the infinite connectivity
limit. We compare these with the naive Mean Field results. We observe that the
Belief Propagation algorithm converges only in the paramagnetic regime.