My publications dating before 2004 are available at
my NNRC page.
My current research at the Curious AI Company focuses on
combining unsupervised learning with currently successful supervised deep
learning techniques.
In the past, I have been active in these areas:
Deep learning
- Semi-supervised learning with ladder networks
- A. Rasmus, H. Valpola, M. Honkala, M. Berglund and T. Raiko
- Accepted for publication in NIPS 2015
- arXiv:1507.02672 [cs.NE]
-
- Lateral connections in denoising autoencoders support
supervised learning
- A. Rasmus, H. Valpola and T. Raiko
- arXiv:1504.08215 [cs.LG]
-
- Denoising autoencoder with modulated lateral connections learns
invariant representations of natural images
- A. Rasmus, T. Raiko and H. Valpola
- arXiv:1412.7210 [cs.NE]
-
- From neural PCA to deep unsupervised learning
- H. Valpola
- In E. Bingham, S. Kaski, J. Laaksonen and J. Lampinen, eds.,
Advances in Independent Component Analysis and Learning Machines,
pp. 143-171 (Chapter 8), 2015.
- arXiv:1411.7783 [stat.ML]
-
- Deep learning made easier by linear transformations in perceptrons
- T. Raiko,
H. Valpola and Y. LeCun.
- In Proceedings of the 15th Conference on AI and Statistics,
AISTATS 2012, pp. 924-932, 2012.
- [abstract]
[pdf]
-
Cognitive architecture
When I was running a computational neuroscience group at Aalto
University during 2005-2010, my main research topic was integrating
various components into a complete cognitive architecture.
- A cognitive architecture for developing sensory and motor
abstractions
- H. Valpola
- A presentation given at the First International Conference
on Biologically Inspired Cognitive Architectures,
BICA 2010.
-
- Oscillatory neural network for image
segmentation with biased competition for
attention
- T. Raiko
and H. Valpola.
- In the Brain Inspired Cognitive Systems (BICS 2010) symposium,
Madrid, Spain, 14-16 July, 2010.
-
- Selective attention improves learning
- A. Yli-Krekola,
J. Särelä and
H. Valpola.
- In Proceedings of the 19th International Conference on Artificial
Neural Networks, ICANN
2009, Limassol, Cyprus, Part II, pp. 285-294, 2009.
- [pdf 130 kB]
-
- From raw data to abstract concepts
- H. Valpola
- Keynote presentation in
AKRR'08: International
and Interdisciplinary Conference on Adaptive Knowledge Representation
and Reasoning, Porvoo, Finland, September 17-19, 2008.
- [pdf 2.2 MB, slides]
- This was the first presentation where I talked about modelling
correlation structures. (Just framing the problem, solution not
presented yet...) The other topic was the distributed selection
process, which gives rise to both attention and development
of abstract representations in the cerebral cortex. Read more about
the model here.
-
- The engine of thought -- a bio-inspired mechanism for distributed selection of useful information
- H. Valpola.
- Nokia Workshop on Machine Consciousness, Helsinki, Finland,
pp. 27-31, 2008.
- [pdf 172 kB, paper]
[pdf 291 kB, slides]
- This was a workshop organised by Pentti Haikonen, who invited the
papers. There wasn't any peer review beyond that. My paper discusses
the possible relations between consciousness and the distributed
selection process, which gives rise to both attention and development
of abstract representations in the cerebral cortex. Read more about
the model here.
-
- Computational model of co-operating covert attention and learning.
- A. Yli-Krekola and H. Valpola.
- Fifth
Nordic Neuroinformatics Workshop, Espoo, Finland, p. 34, 2007.
- [abstract]
[pdf 264 kB, poster]
- In his master's thesis, Antti Yli-Krekola implemented the model
about which I speculated earlier. In the
model, attention and development of internal representations are two
sides of the same coin: selection of useful information. Read more
about the model
here.
-
- A model of cerebellar automation of voluntary basal-ganglia control.
- M. Pihlaja and
H. Valpola.
- Fifth
Nordic Neuroinformatics Workshop, Espoo, Finland, p. 29, 2007.
- [abstract]
[pdf 301 kB, poster]
- The basal ganglia can learn by trial and error but the process is
slow and it is difficult to learn smooth motor control particularly if
there is a large number of degrees of freedom to control. We study how
the cerebellar
model can automate and perfect the control learned by basal
ganglia.
-
- Cerebellar model tested in control of a load-carrying robot.
- I. Aaltonen and H. Valpola.
- Fifth
Nordic Neuroinformatics Workshop, Espoo, Finland, p. 16, 2007.
- [abstract]
[pdf 58 kB,
slides]
- We investigate the limits of the
cerebellar
model to better understand what kind of processing and
representations are needed in the neocortex.
-
- Cerebellar model for coordination.
- T. J. Lukka and H. Valpola.
- Fifth
Nordic Neuroinformatics Workshop, Espoo, Finland, p. 25, 2007.
- [abstract]
[pdf 139 kB, poster]
- In humans, the cerebellum is important for smooth coordination of
movements. Here we
study the coordination of a simulated 2-joint actuator. We
assume spring-like properties for the joints, which makes it very
hard to control the tip of the actuator both quickly and accurately
at the same time, but the
cerebellar
model learns the task.
-
- Learning anticipatory behaviour using a simple cerebellar model.
- H. Valpola.
- In Proceedings of the Ninth Scandinavian Conference on Artificial
Intelligence, SCAI 2006,
Espoo, Finland, pp. 135-142, 2006.
- [pdf
416 kB]
- This is a review of a
control
scheme inspired by the cerebellar system.
-
- Development of representations, categories and concepts--a hypothesis.
- H. Valpola.
- In Proceedings of the 6th IEEE International Symposium on
Computational Intelligence in Robotics and Automation,
CIRA 2005, Espoo,
Finland, pp. 593-599, 2005.
- [pdf 72 kB]
- This is a brain-related "visions-and-ideas paper". I refer to
simulation results in many of the machine-learning-oriented
DSS papers, to Deco's results with an attention model and to some
biological findings. Based on these, I propose how the brain could
learn concepts and representations in active interaction with the
world. I could have said a lot more about many things (sparse
object representations, synchrony, planning and imagination, etc.)
but six pages is a bit short for a full-blown brain theory...
-
- Behaviourally meaningful representations from normalisation and
context-guided denoising.
- H. Valpola.
- AI Lab technical report, University of Zurich, 2004.
- [abstract and link to pdf]
- Invariant features resembling complex-cell properties are known to
develop if temporal slowness is the learning criterion. I argue that
this is a special case of expectation and show that lateral expectation
from adjacent image locations will also produce complex-cell-like feature
detectors. It also turned out that the expectation-driven learning with
DSS resembles in many ways Deco's model for attention. Finding invariant
features and attentional filtering are both selection processes, only on
different timescales. I discuss the connections and propose that
normalisation of activations of competing neuron assemblies makes the
attentional process robust in the same way as decorrelation of
inputs helps DSS.
-
Denoising source separation
The background of this research is that I was doing exercises for
an ICA seminar (a long time ago; the files seem to be dated to January
1997). The exercise signals were supposed to be separated with
FastICA. They had prominent temporal structure, but the separation was
supposed to use only information about the marginal distribution of
the signals. In other words, the temporal structure was there only for
visualising the results. (Here are the separation results
with FastICA.)
At the time, I had been working on variational Bayesian methods and
it occurred to me that the nonlinearity in FastICA could actually be
considered as denoising, or as the expectation step of an EM algorithm.
What if I used knowledge about the periodicity of the signals?
(The signals were artificially generated; one of them had an exact
periodicity of 23, another was a sinusoidal signal.)
Lo and behold! The algorithm converged in two steps to an apparently
perfect solution. FastICA is, well, fast, but this was something quite
phenomenal.
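To make the idea concrete, here is a minimal sketch (in Python/NumPy, my own
illustration rather than the original code) of such a denoising-driven
iteration. It assumes the data has already been whitened, and the names
dss_one_component and periodic_average are hypothetical; the periodic denoiser
simply averages the signal over its known period.

```python
import numpy as np

def dss_one_component(X, denoise, n_iter=10):
    """DSS-style iteration on whitened data X (dims x samples):
    project, denoise the projection, re-estimate the projection."""
    w = np.random.randn(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = w @ X            # current source estimate
        s_plus = denoise(s)  # denoising replaces FastICA's fixed nonlinearity
        w = X @ s_plus       # re-estimate the projection direction
        w /= np.linalg.norm(w)
    return w, w @ X

def periodic_average(s, period=23):
    """Keep only the part of s that repeats with the known period,
    by averaging the signal over that period."""
    n = len(s) - len(s) % period
    template = s[:n].reshape(-1, period).mean(axis=0)
    return np.tile(template, len(s) // period + 1)[:len(s)]
```

With a denoiser that captures what is known about the sources, a couple of
such iterations can already be enough, which is what happened with the
seminar exercise.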
I was busy doing "more important things" like my thesis, and somehow
the method seemed too weird and simple. Still, every now and then this
strange new algorithm kept bugging me, and finally in 2000 I wrote the
first article about the method (together
with Petteri Pajunen). Obviously we had to use some "respectable"
Bayesian filtering algorithms, at least, so that I could include this
work in my thesis. After defending my
thesis in 2000, I was starting
to see more and more that the theoretically very nice Bayesian
algorithms for generative models were not going to deliver what I
wanted from them (like implementing the brain :-). They were too slow
and unreliable. Things were better if this funny new algorithm was used
for initialising the methods (see, e.g., Särelä et al. in ICA 2001,
available here).
But which method was doing the real work...?
Trying to break the Bayesian models into distributed,
asynchronously communicating modules turned out to be a nightmare.
This became painfully obvious around 2003, and that was a bad sign
because I am trying to learn from what the brain does, and the
neocortex clearly has a hierarchy of distributed areas. There must
be some communication delays, and a method which cannot tolerate them
is probably missing something important.
In the meantime, together with Jaakko Särelä, we had started using
the funny method (we now thought semi-blind source separation would
be an appropriate name) for analysing MEG data, and in spring 2002
we wrote a very short technical report just
to put the idea down. We started writing a long
journal article about these ideas (it finally
came out in the beginning of 2005). We now called the method denoising
source separation (DSS) because it can separate sources by means of
a denoising procedure.
In September 2003 I moved to Zurich to work on the
ADAPT project
where my part was the neural control architecture (mostly for perception)
of the robot. I spent the first weeks reading and thinking. In many
learning algorithms for hierarchical generative models, the higher areas
send predictions to lower areas and the lower areas send back error
signals about the mismatches. I was trying to figure out how this fits
the known neocortical structure. It doesn't fit well. In the cortex,
long-range connections are excitatory. This is a bit problematic
if you are trying to find an error signal because you would expect
top-down signals to have an inhibitory effect. It occurred to me that
this looks more like denoising!
One reason that generative models need inhibitory top-down signals
is that those things which are already represented somewhere shouldn't
be represented over and over again in other places. Mutual inhibition
can implement competition which prevents everybody from learning the
same things, but it is not efficient to connect everybody with
everybody only to make sure that nobody else learns to represent the
same things I do. This is a standard issue in unsupervised learning:
there has to be competition which prevents units from learning the
same things from the same inputs. One obvious solution is that not
everybody gets the same inputs. Then only those neurons which get the
same inputs need to compete.
I decided to take DSS as the starting point for the "cortical"
feature extraction algorithm (a hierarchy of areas, each competing
locally and receiving similar inputs only locally). I had been working
on invariant features already since 1994 and knew it was pretty easy to
find invariances by combining "elementary features" into "invariant
features" using temporal slowness as a learning criterion. With DSS
this is straightforward, and I was supervising Allard Kamphuisen's
work on this topic. From some of his simulations it
occurred to me (December 2003) that expectations can drive the
development of invariant features. Temporal slowness is not required
(this is nice, e.g., for learning phonetic categories).
The idea of trying to learn predictable features was not new to
me. In 2000 I had been working on an unsupervised nonlinear dynamical
model which was able to learn a state representation which
- can represent the observations accurately,
- can predict the future state and
- is predictable by the past states.
The variational Bayesian method I used took weeks to learn even a
ten-dimensional representation (the results were quite amazing,
though! See our article in Neural
Computation), but with DSS this could be done extremely efficiently.
Past expectations can be used to filter away noise, and what is left
is the predictable part of the signal.
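Continuing the hypothetical sketch from the DSS background above, only the
denoiser needs to change to search for predictable components instead of
periodic ones. The version below is again my own illustrative code, not the
published algorithm; it uses an exponentially weighted average of past
samples as the expectation.

```python
import numpy as np

def expectation_denoise(s, alpha=0.9):
    """Replace each sample by an expectation computed from the past
    (an exponentially weighted running average), so that only the
    predictable part of the signal survives the denoising."""
    out = np.empty_like(s)
    expectation = 0.0
    for t, x in enumerate(s):
        expectation = alpha * expectation + (1.0 - alpha) * x
        out[t] = expectation
    return out

# plugged into the same hypothetical loop as before:
# w, s = dss_one_component(whitened_data, expectation_denoise)
```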
I have applied DSS to "engineering" applications (well, analysis of
climate data is not exactly engineering) but particularly the
expectation-driven learning of hierarchical invariant representations
is very much inspired by the cortical architecture. Some of the
papers listed below are therefore related to the
cognitive-architecture section above.
I started doing this work while at
CIS, but both I and, more recently,
Jaakko Särelä have moved to LCE.
Nevertheless, the
DSS pages are still
hosted by CIS, and there you can find more
publications, software and tutorials related to DSS.
- Finding interesting climate phenomena by exploratory statistical
techniques.
- A. Ilin, H.
Valpola and E. Oja.
- In Proceedings of the Fifth Conference on Artificial
Intelligence Applications to Environmental Science, 5AI,
as part of the
87th Annual Meeting of the
American Meteorological Society,
San Antonio, TX, USA, January 2007.
- [pdf 1.7 MB]
- This paper collects together results which have been published
in our previous papers. Alexander received an award for the best
student presentation.
-
- Extraction of climate components with structured variance.
- A. Ilin, H. Valpola and
E. Oja.
- In Proceedings of the IEEE World Congress on Computational
Intelligence, WCCI 2006,
Vancouver, BC, Canada, pp. 10528-10535, 2006.
- [pdf 1.7 MB]
- We present an efficient algorithm for extracting components with
structured variance. One of my earlier papers introduced a somewhat
similar method for estimating hierarchical models
of variance sources. Now we got similar results with a more efficient
DSS-based algorithm. There are some interesting components. For instance,
one component shows an abrupt decrease in its variance during the
mid-1970s.
-
- Exploratory analysis of climate data using source separation methods.
- A. Ilin, H. Valpola and
E. Oja.
- Neural
Networks, 19(2):155-167, 2006.
- [pdf 3.4 MB]
[html]
- This article combines our PKDD'05 and
IJCNN'05 articles. Some new results
are presented.
-
- Separation of nonlinear image mixtures by denoising source separation.
- M. S. C.
Almeida, H. Valpola and J.
Särelä.
- In Proceedings of the 6th International Conference on Independent
Component Analysis and Blind Signal Separation,
ICA 2006,
Charleston, SC, USA, pp. 8-15, 2006.
- [pdf 362 kB]
[abstract]
- We extend the DSS framework to nonlinear mixtures and apply it
to separation of
image
mixtures. Mariana received the best student paper award.
-
- Frequency-based separation of climate signals.
- A. Ilin and
H. Valpola.
- In Proceedings of the 9th European Conference on Principles and Practice
of Knowledge Discovery in Databases
(PKDD 2005),
Porto, Portugal, pp. 519-526, 2005.
- [abstract and link
to pdf]
- In this paper we extended the results of our
IJCNN'05 article by finding meaningful rotations among climate
components based on their frequency content. The rotation method
resembles the one used in our ICA'04
article.
-
- Semiblind source separation of climate data detects El Niño as
the component with the highest interannual variability.
- A. Ilin, H. Valpola and
E. Oja.
- In Proceedings of the International Joint Conference on Neural
Networks (IJCNN 2005),
Montréal, Québec, Canada, pp. 1722-1727, 2005.
- [pdf 1.4 MB]
- DSS can find features with certain characteristics. It turns out that
in a certain large climate dataset, the phenomenon with the highest
interannual variability is the well-known El Niño. Many other
interesting phenomena are found, too. The linear DSS method we used
here can only find a signal subspace, not a rotation in it. Real
separation results were published in our
PKDD 2005 paper.
-
- Denoising source separation: a novel approach to ICA and feature
extraction using denoising and Hebbian learning.
- J. Särelä and
H. Valpola.
- In AI 2005 special
session on
correlation learning, pp. 45-56, 2005.
- [12-page paper, pdf 1.7 MB]
[2-page abstract, pdf 800 kB]
[slides, pdf 2.7 MB]
- Description of DSS and its biological relevance.
-
- Denoising source separation.
- J. Särelä and
H. Valpola.
- Journal of
Machine Learning Research, 6:233-272, 2005.
- [abstract
and link to pdf]
- This is a comprehensive machine learning perspective on DSS. No
brain-related things fitted in. Here is a link to
DSS page at CIS
where further information and software is available.
-
- Accurate, fast and stable denoising source separation algorithms.
- H. Valpola and
J. Särelä.
- In Proceedings of
the 5th International Conference on Independent
Component Analysis and Blind Signal Separation,
ICA 2004, Granada, Spain,
pp. 65-72, 2004.
- [abstract
and link to pdf]
- Even faster than the famous FastICA and robust, too. Combining
this paper with our IJCNN 2005 paper
resulted in our PKDD 2005 paper.
-
- Denoising source separation: from temporal to contextual invariance.
- H. Valpola and
J. Särelä.
- Presented in Early
Cognitive Vision Workshop, Isle of Skye, Scotland, 2004.
- [pdf
46 kB (abstract)]
[pdf
2.6 MB (poster about DSS)]
[pdf
89 kB (poster about context-guided denoising)]
- The first poster gives an overview of DSS and the second explains
how context can be used for denoising, promoting the development of
invariant representations. Jaakko Särelä attended the workshop.
-
- A fast semi-blind source separation algorithm.
- H. Valpola and
J. Särelä.
- In Publications in Computer and Information Science, Report A66,
Helsinki University of Technology, Espoo, Finland, 4 p., 2002.
- [pdf 140 kB]
- Here we put the basic idea of DSS down before starting to write the
JMLR article.
-
- Fast algorithms for Bayesian independent component analysis.
- H. Valpola and
P. Pajunen.
- In Proceedings of the Second International Workshop on Independent
Component Analysis and Blind Signal Separation,
ICA 2000, Helsinki,
Finland, pp. 233-237, 2000.
- [html]
[pdf 493 kB]
- The first publication of the method that became DSS. I wanted to
include this in my thesis and therefore used variational Bayesian
methods for denoising.
Variational Bayesian learning
In my thesis, I developed
variational Bayesian methods which are suited for unsupervised
learning. I'm not doing this actively anymore but in the past, I have
supervised students of the
Bayes group at CIS
and my recent articles in this field are collaborations with them.
Several
software packages
related to the research are available on-line.
I was doing this research because Bayesian probability theory
combined with decision theory provides a very solid theoretical
framework for intelligent behaviour. I learned a lot from this
research and I think that the Bayesian viewpoint does give many useful
concepts and tools for thinking about the brain. However, I believe
that the Bayesian theory should only be used as one constraint among
many others when designing intelligent systems. Instead of a one-to-one
mapping from a generative model and its Bayesian learning algorithm
to the brain (or to our implementation of an intelligent, behaving
system), there is a kind of mixture. What is atomic in one system may
be a distributed process in the other. In other words, Bayesian
probability theory and decision theory provide the gold standard for
intelligent behaviour but no specific instructions about how to
implement an intelligent, behaving system.
- Compact modeling of data using independent variable group analysis.
- E. Alhoniemi,
A. Honkela,
K. Lagus,
J. Seppä, P. Wagner and H.
Valpola.
- IEEE Transactions on Neural Networks, 18(6):1762-1776, 2007.
- [pdf 471 kB]
[abstract]
- IVGA was
invented by Krista Lagus many years ago. The basic idea is to
cluster input components: dependencies between inputs are maximised
within the groups and minimised between the groups. Each group can
then be represented independently. I helped out with the variational
machinery which was used for model selection. This is the first
journal article about IVGA.
-
- Blind separation of nonlinear mixtures by variational Bayesian
learning.
- A. Honkela,
H. Valpola,
A. Ilin and
J. Karhunen.
- Digital Signal Processing, 17(5):914-934, 2007.
- [pdf 1.9 MB]
[abstract]
- This article brings together much of our research on nonlinear
source separation which hasn't been published in journals before:
an improved nonlinear factor analysis (NFA) method, hierarchical
NFA and post-nonlinear factor analysis.
-
- Building blocks for variational Bayesian learning
of latent variable models.
- T. Raiko,
H. Valpola, M. Harva
and J. Karhunen.
- Journal of Machine Learning
Research, 8:155-201, 2007.
- [abstract]
[pdf
417 kB]
- This paper collects together much of our research on the
Bayes
Blocks software library. We have applied the library for instance
in extracting variance sources and in
hierarchical nonlinear factor analysis.
-
- Hyperparameter adaptation in variational Bayes for the gamma
distribution.
- H. Valpola and
A. Honkela.
- Helsinki University of Technology, Publications in Computer and
Information Science, Espoo, Finland, Tech. Rep. E6, 2006.
- [pdf 93 kB]
- This short technical report explains how to update the
hyperparameters of gamma-distributed variables in the variational Bayesian
framework. We have used the method in
IVGA and the
main reason for writing this report was that we didn't want to
include all the details in our forthcoming article about IVGA.
-
- On the effect of the form of the posterior approximation in
variational learning of ICA models.
- A. Ilin and
H. Valpola.
- Neural
Processing Letters, 22(2):183-204, 2005.
- [pdf 489 kB]
- This is an extended version of our paper in ICA 2003. We show
that the functional form of the approximated posterior source density
has a large effect on separation capability in linear ICA models
with variational Bayesian learning. The
software for running the simulations is available on-line.
-
- Bayes Blocks: An implementation of the variational Bayesian
building blocks framework.
- M. Harva,
T. Raiko,
A. Honkela, H. Valpola
and J. Karhunen.
- In Proceedings of the 21st Conference on Uncertainty in Artificial
Intelligence (UAI 2005),
Edinburgh, Scotland, pp. 259-266, 2005.
- [pdf 164 kB]
- This is an update of our original Bayes Blocks paper.
-
- Unsupervised variational Bayesian learning of nonlinear models.
- A. Honkela and
H. Valpola.
- In L. K. Saul, Y. Weiss and L. Bottou, eds., Advances in
Neural Information Processing Systems 17
(NIPS 2004),
pp. 593-600, 2005.
- [pdf 118 kB]
- Antti Honkela has developed more accurate methods for
computing how probability distributions are transformed in nonlinear
mappings. This makes learning stabler and the results more
reliable.
-
- Using kernel PCA for initialisation of variational Bayesian nonlinear
blind source separation method.
- A. Honkela,
S. Harmeling,
L. Lundqvist and
H. Valpola
- In Proceedings of the 5th International Conference on Independent
Component Analysis and Blind Signal Separation,
ICA 2004, Granada, Spain,
pp. 65-72, 2004.
- [abstract
and link to pdf]
- Kernel PCA can get an answer quickly but it cannot evaluate it (tell
how probable the model is). Variational Bayesian methods can evaluate
the results but are slower and can suffer from local minima. Combining
the two gives the best of both worlds.
-
- Variational learning and bits-back coding: an information theoretic
view to Bayesian learning.
- A. Honkela and
H. Valpola.
- IEEE Transactions on Neural Networks, 15(4):800-810, 2004.
- [abstract]
- We discuss the (well-known) connections between variational learning
and information theoretic bits-back coding. Both viewpoints are useful
for understanding different things. We give examples which we have
encountered in our research. (I originally came up with variational
Bayesian learning after looking at the information-theoretic
minimum-message-length framework. It turned out that it had already
been invented, but at least I was applying it to different problems,
such as nonlinear factor analysis.)
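For reference, the connection can be written in one line (standard textbook
form, not quoted from the article): the variational free energy that is
minimised equals the expected bits-back code length of the data and the
parameters.

```latex
\mathcal{F}(q)
  = \mathbb{E}_{q(\theta)}\!\big[-\log p(X \mid \theta)\big]
    + \mathrm{KL}\big(q(\theta) \,\|\, p(\theta)\big)
  = \underbrace{\mathbb{E}_{q}\big[-\log p(X \mid \theta)\,p(\theta)\big]}_{\text{expected code length}}
    - \underbrace{\mathbb{E}_{q}\big[-\log q(\theta)\big]}_{\text{bits refunded by bits-back}}
```

Minimising the free energy is therefore the same as minimising the expected
total message length.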
-
- Nonlinear dynamical factor analysis for state change detection.
- A. Ilin, H. Valpola and
E. Oja.
- IEEE Transactions on Neural Networks, 15(3):559-575, 2004.
- [abstract]
- After learning a nonlinear state-space model
(with dynamics), we can use it for change detection. We report
simulations with artificial data (where we know the ground truth).
-
- Hierarchical models of variance sources.
- H. Valpola, M. Harva and
J. Karhunen.
- Signal Processing, 84(2):267-282, 2004.
- [abstract]
- We propose a hierarchical latent-variable model where the higher
levels model both the changes in variables (as in factor analysis)
and changes in the variances of variables. The
software and
data for running the simulations is available on-line.
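Schematically (my shorthand; the paper's exact parameterisation differs in
its details), one layer of such a model combines an ordinary factor-analysis
layer with higher-level sources that set the log-variances of the layer below:

```latex
\begin{align*}
x(t)   &= A\,s(t) + n(t)                              && \text{observations, as in linear factor analysis}\\
s_i(t) &\sim \mathcal{N}\!\big(0,\ e^{\,u_i(t)}\big)  && \text{each source has its own time-varying variance}\\
u(t)   &= B\,r(t) + m(t)                              && \text{higher-level variance sources drive the log-variances}
\end{align*}
```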
-
Most of my publications dating before 2004 are available only at
my NNRC page.
- An unsupervised ensemble learning method for nonlinear dynamic
state-space models.
- H. Valpola and
J. Karhunen.
- Neural Computation,
14(11):2647-2692, 2002.
- [abstract] [pdf 937 kB]
- The article is based on a technical report that was part of my
thesis. The model looks like an extended Kalman filter (or smoother)
where the nonlinear mappings have been implemented by MLP networks.
The catch is that the state representation is not fixed but is learned
from the data. The system is able to find state features which can
represent the data, can predict the future states and are predictable
by the past states. We also show that by suitably restricting the
functional form of the posterior approximation of the sources, we can
separate uncoupled dynamical processes. (This is quite the opposite of
the situation reported here.) The
software
for running the simulations is available on-line.
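In equations, the model class is a nonlinear state-space model (my schematic
summary, not copied from the article) in which both mappings are MLP networks
learned from the data:

```latex
\begin{align*}
x(t) &= f\big(s(t)\big) + n(t)   && \text{observation mapping } f \text{: an MLP network}\\
s(t) &= g\big(s(t-1)\big) + m(t) && \text{state dynamics } g \text{: another MLP network}
\end{align*}
```

Here x(t) are the observations, s(t) is the learned state representation,
and n(t), m(t) are Gaussian noise terms.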
-
- Bayesian ensemble learning for nonlinear factor analysis.
- H. Valpola.
- PhD thesis, Helsinki University of Technology, Espoo, 2000.
- Published in Acta Polytechnica Scandinavica, Mathematics and
Computing Series No. 108, 2000.
- [html]
- Eighty pages of introduction to Bayesian probability theory and decision
theory, practical Bayesian methods and variational methods in particular,
unsupervised learning, factor analysis, independent component analysis
and their nonlinear extensions. Out of the eight publications included
in the thesis, the one that least fits in (Pub.
VII) became the basis
of my current research.
Harri Valpola
Last modified: Fri Sep 19 23:00:38 EEST 2008