Publications of Harri Valpola

My publications dating before 2004 are available at my NNRC page.

My current research at the Curious AI Company focuses on combining unsupervised learning with currently successful supervised deep learning techniques.

In the past, I have been active in these areas:

Deep learning

Semi-supervised learning with ladder networks
A. Rasmus, H. Valpola, M. Honkala, M. Berglund and T. Raiko
Accepted for publication in NIPS 2015
arXiv:1507.02672 [cs.NE]
Lateral connections in denoising autoencoders support supervised learning
A. Rasmus, H. Valpola and T. Raiko
arXiv:1504.08215 [cs.LG]
Denoising autoencoder with modulated lateral connections learns invariant representations of natural images
A. Rasmus, T. Raiko and H. Valpola
arXiv:1412.7210 [cs.NE]
From neural PCA to deep unsupervised learning
H. Valpola
In E. Bingham, S. Kaski, J. Laaksonen and J. Lampinen, eds., Advances in Independent Component Analysis and Learning Machines, pp. 143-171 (Chapter 8), 2015.
arXiv:1411.7783 [stat.ML]
Deep learning made easier by linear transformations in perceptrons
T. Raiko, H. Valpola and Y. LeCun.
In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp. 924-932, 2012.
[abstract] [pdf]

Cognitive architecture

When I was running a computational neuroscience group at Aalto University during 2005-2010, my main research topic was integrating various components into a complete cognitive architecture.

A cognitive architecture for developing sensory and motor abstractions
H. Valpola
A presentation given at the First International Conference on Biologically Inspired Cognitive Architectures, BICA 2010.
Oscillatory neural network for image segmentation with biased competition for attention
T. Raiko and H. Valpola.
In the Brain Inspired Cognitive Systems (BICS 2010) symposium, Madrid, Spain, 14-16 July, 2010.
Selective attention improves learning
A. Yli-Krekola, J. Särelä and H. Valpola.
In Proceedings of the 19th International Conference on Artificial Neural Networks, ICANN 2009, Limassol, Cyprus, Part II, pp. 285-294, 2009.
[pdf 130 kb]
From raw data to abstract concepts
H. Valpola
Keynote presentation in AKRR'08: International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning, Porvoo, Finland, September 17-19, 2008.
[pdf 2.2 MB, slides]
This was the first presentation where I talked about modelling correlation structures. (Just framing the problem, solution not presented yet...) The other topic was the distributed selection process, which gives rise to both attention and development of abstract representations in the cerebral cortex. Read more about the model here.
The engine of thought -- a bio-inspired mechanism for distributed selection of useful information
H. Valpola.
Nokia Workshop on Machine Consciousness, Helsinki, Finland, pp. 27-31, 2008.
[pdf 172 kB, paper] [pdf 291 kB, slides]
This was a workshop organised by Pentti Haikonen, who invited the papers. There wasn't any peer review beyond that. My paper discusses the possible relations between consciousness and the distributed selection process, which gives rise to both attention and development of abstract representations in the cerebral cortex. Read more about the model here.
Computational model of co-operating covert attention and learning.
A. Yli-Krekola and H. Valpola.
Fifth Nordic Neuroinformatics Workshop, Espoo, Finland, p. 34, 2007.
[abstract] [pdf 264 kB, poster]
In his master's thesis, Antti Yli-Krekola implemented the model about which I speculated earlier. In the model, attention and development of internal representations are two sides of the same coin: selection of useful information. Read more about the model here.
A model of cerebellar automation of voluntary basal-ganglia control.
M. Pihlaja and H. Valpola.
Fifth Nordic Neuroinformatics Workshop, Espoo, Finland, p. 29, 2007.
[abstract] [pdf 301 kB, poster]
The basal ganglia can learn by trial and error, but the process is slow and it is difficult to learn smooth motor control, particularly if there is a large number of degrees of freedom to control. We study how the cerebellar model can automate and perfect the control learned by the basal ganglia.
Cerebellar model tested in control of a load-carrying robot.
I. Aaltonen and H. Valpola.
Fifth Nordic Neuroinformatics Workshop, Espoo, Finland, p. 16, 2007.
[abstract] [pdf 58 kB, slides]
We investigate the limits of the cerebellar model to better understand what kind of processing and representations are needed in the neocortex.
Cerebellar model for coordination.
T. J. Lukka and H. Valpola.
Fifth Nordic Neuroinformatics Workshop, Espoo, Finland, p. 25, 2007.
[abstract] [pdf 139 kB, poster]
In humans, the cerebellum is important for smooth coordination of movements. Here we study the coordination of a simulated 2-joint actuator. We assume spring-like properties for the joints, which makes it very hard to control the tip of the actuator both fast and accurately at the same time, but the cerebellar model learns the task.
Learning anticipatory behaviour using a simple cerebellar model.
H. Valpola.
In Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence, SCAI 2006, Espoo, Finland, pp. 135-142, 2006.
[pdf 416 kB]
This is a review of a control scheme inspired by the cerebellar system.
Development of representations, categories and concepts--a hypothesis.
H. Valpola.
In Proceedings of the 6th IEEE International Symposium on Computational Intelligence in Robotics and Automation, CIRA 2005, Espoo, Finland, pp. 593-599, 2005.
[pdf 72 kB]
This is a brain-related "visions-and-ideas paper". I refer to simulation results in many of the machine-learning-oriented DSS papers, to Deco's results with an attention model and to some biological findings. Based on these, I propose how the brain could learn concepts and representations in active interaction with the world. I could have said a lot more about many things (sparse object representations, synchrony, planning and imagination, etc.) but six pages is a bit short for a full-blown brain theory...
Behaviourally meaningful representations from normalisation and context-guided denoising.
H. Valpola.
AI Lab technical report, University of Zurich, 2004.
[abstract and link to pdf]
Invariant features resembling complex-cell properties are known to develop if temporal slowness is the learning criterion. I argue that this is a special case of expectation and show that lateral expectation from adjacent image locations will also produce complex-cell-like feature detectors. It also turned out that expectation-driven learning with DSS resembles in many ways Deco's model for attention. Finding invariant features and attentional filtering are both selection processes, only on different timescales. I discuss the connections and propose that normalisation of the activations of competing neuron assemblies makes the attentional process robust in the same way as decorrelation of inputs helps DSS.

Denoising source separation

The background of this research is that I was doing exercises for an ICA seminar (a long time ago; the files seem to be dated Jan 1997). The signals that we were supposed to separate with FastICA had prominent temporal structure, but the separation was supposed to use only information about the marginal distribution of the signals. In other words, the temporal structure was there only for visualising the results. (Here are the separation results with FastICA.)

At the time, I had been working on variational Bayesian methods and it occurred to me that the nonlinearity in FastICA could actually be considered as denoising, or as the expectation step in an EM algorithm. What if I used knowledge about the periodicity of the signals? (The signals were artificially generated and one of them had an exact periodicity of 23, another was a sinusoidal signal.) Lo and behold! The algorithm converged in two steps to an apparently perfect solution. FastICA is, well, fast, but this was something quite phenomenal.
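The idea of swapping the FastICA nonlinearity for a denoising step that exploits a known period can be sketched roughly as follows. This is a minimal illustration with NumPy, not the original seminar code or the published DSS software: the function names, the toy signals (a period-23 source, a sinusoid and a noise source), the mixing matrix and the iteration count are all assumptions made for the example.

```python
import numpy as np

def whiten(X):
    """Center the rows of X (channels x samples) and make their covariance the identity."""
    X = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T / X.shape[1])
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ X

def periodic_denoise(s, period):
    """Replace each sample with the average of all samples at the same phase."""
    n_full = len(s) // period * period
    template = s[:n_full].reshape(-1, period).mean(axis=0)
    return np.tile(template, len(s) // period + 1)[:len(s)]

def dss_periodic(Z, period, n_iter=10, seed=0):
    """One-unit iteration on whitened data Z: project, denoise, re-estimate, normalise."""
    w = np.random.default_rng(seed).standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = w @ Z                                 # current source estimate
        s_denoised = periodic_denoise(s, period)  # denoising replaces the ICA nonlinearity
        w = Z @ s_denoised                        # re-estimate the projection
        w /= np.linalg.norm(w)
    return w @ Z

# Hypothetical toy data: one source with exact period 23, a sinusoid and white noise.
rng = np.random.default_rng(1)
t = np.arange(23 * 200)
periodic = np.tile(rng.standard_normal(23), 200)
sources = np.vstack([periodic, np.sin(2 * np.pi * t / 50), rng.standard_normal(t.size)])
mixtures = rng.standard_normal((3, 3)) @ sources  # assumed random linear mixing

estimate = dss_periodic(whiten(mixtures), period=23)
corr = abs(np.corrcoef(estimate, periodic)[0, 1])
print(f"|correlation| with the true periodic source: {corr:.4f}")
```

Because phase averaging is a linear operation, this particular sketch amounts to a power iteration on the whitened data, which is one way to see why convergence can be so fast when only one source has the assumed period.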

I was busy doing "more important things" like my thesis and somehow the method was too weird and simple. Still, every now and then this strange new algorithm was bugging me and finally in 2000 I wrote the first article about the method (together with Petteri Pajunen). Obviously we had to use some "respectable" Bayesian filtering algorithms, at least, so that I could include this work in my thesis. After defending my thesis in 2000, I was starting to see more and more that the theoretically very nice Bayesian algorithms for generative models were not going to deliver what I wanted from them (like implementing the brain :-). They were too slow and unreliable. Things were better if this funny new algorithm was used for initialising the methods (see, e.g., Särelä et al. in ICA 2001, available here). But which method was doing the real work...?

Trying to break the Bayesian models into distributed, asynchronously communicating modules turned out to be a nightmare. This became painfully obvious around 2003 and that was a bad sign because I'm trying to learn from what the brain does and obviously the neocortex seems to have a hierarchy of distributed areas. There must be some communication delays and a method which cannot tolerate this is probably missing something important.

In the meantime we had started using the funny method (we now thought semi-blind source separation would be an appropriate name) together with Jaakko Särelä for analysing MEG data and in spring 2002 we wrote a very short technical report just to put the idea down. We started writing a long journal article about the things (it finally got out in the beginning of 2005). We now called the method denoising source separation (DSS) because it can separate sources by means of a denoising procedure.

In September 2003 I moved to Zurich to work on the ADAPT project where my part was the neural control architecture (mostly for perception) of the robot. I spent the first weeks reading and thinking. In many learning algorithms for hierarchical generative models, the higher areas send predictions to lower areas and the lower areas send back error signals about the mismatches. I was trying to figure out how this fits the known neocortical structure. It doesn't fit well. In the cortex, long-range connections are excitatory. This is a bit problematic if you are trying to find an error signal because you would expect top-down signals to have an inhibitory effect. It occurred to me that this looks more like denoising!

One reason that generative models need inhibitory top-down signals is that those things which are represented already somewhere shouldn't be represented over and over again in other places. Mutual inhibition can implement competition which prevents everybody from learning the same things but it is not efficient to connect everybody with everybody only to watch that nobody else learns to represent the same things I do. This is a standard issue in unsupervised learning: there has to be competition which prevents units from learning the same things from the same inputs. One obvious solution is that not everybody gets the same inputs. Then only those neurons need to compete which get the same inputs.

I decided to take DSS as the starting point for the "cortical" feature extraction algorithm (a hierarchy of areas, each competing locally and receiving similar inputs only locally). I had been working on invariant features already since 1994 and knew it is pretty easy to find invariances by combining "elementary features" into "invariant features" using temporal slowness as a learning criterion. With DSS this is pretty easy and I was supervising Allard Kamphuisen's work on this topic. From some of his simulations it occurred to me (December 2003) that expectations can drive the development of invariant features. Temporal slowness is not required (this is nice, e.g., for learning phonetic categories).

The idea of trying to learn predictable features was not new to me. In 2000 I had been working on an unsupervised nonlinear dynamical model which was able to learn a state representation which

  1. can represent the observations accurately,
  2. can predict the future state and
  3. is predictable by the past states.
The variational Bayesian method I used took weeks to learn even a ten-dimensional representation (the results were quite amazing, though; see our article in Neural Computation) but with DSS this could be done extremely efficiently. Past expectations can be used to filter away noise and what is left is the predictable part of the signal.
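As a rough illustration of using past expectations as the denoiser, the sketch below replaces the current source estimate with a least-squares linear prediction from its own past and then re-estimates the projection. It is an assumption-laden toy, not the method from the Neural Computation article or the DSS software: the linear predictor, the function names, the prediction order and the test signals are all hypothetical choices made for the example.

```python
import numpy as np

def whiten(X):
    """Center the rows of X (channels x samples) and make their covariance the identity."""
    X = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T / X.shape[1])
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ X

def predict_from_past(s, order=1):
    """Least-squares linear prediction of s[t] from s[t-1], ..., s[t-order]."""
    past = np.column_stack([s[order - 1 - k : len(s) - 1 - k] for k in range(order)])
    coeffs, *_ = np.linalg.lstsq(past, s[order:], rcond=None)
    prediction = np.zeros_like(s)       # the first `order` samples have no past
    prediction[order:] = past @ coeffs
    return prediction

def dss_predictable(Z, order=1, n_iter=30, seed=0):
    """One-unit iteration on whitened data Z with prediction from the past as the denoiser."""
    w = np.random.default_rng(seed).standard_normal(Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        s = w @ Z                                  # current source estimate
        expected = predict_from_past(s, order)     # keep only what the past predicts
        w = Z @ expected                           # re-estimate the projection
        w /= np.linalg.norm(w)
    return w @ Z

# Hypothetical toy data: one predictable source (a sinusoid) and two white-noise sources.
rng = np.random.default_rng(2)
t = np.arange(2000)
predictable = np.sin(2 * np.pi * t / 20)
sources = np.vstack([predictable, rng.standard_normal(t.size), rng.standard_normal(t.size)])
mixtures = rng.standard_normal((3, 3)) @ sources   # assumed random linear mixing

estimate = dss_predictable(whiten(mixtures))
corr = abs(np.corrcoef(estimate, predictable)[0, 1])
print(f"|correlation| with the predictable source: {corr:.4f}")
```

With a first-order predictor the loop reduces to a power iteration on the lag-one covariance of the whitened data, so it favours the component with the strongest one-step predictability; a richer predictor would change what counts as "predictable".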

I have applied DSS to "engineering" applications (well, analysis of climate data is not exactly engineering) but particularly the expectation-driven learning of hierarchical invariant representations is very much inspired by the cortical architecture. Some of the papers listed below are therefore related to the cognitive-architecture section above.

I started doing this work while in CIS, but both I and, more recently, Jaakko Särelä have moved to LCE. Nevertheless, the DSS pages are still hosted by CIS and there you can find more publications, software and tutorials related to DSS.

Finding interesting climate phenomena by exploratory statistical techniques.
A. Ilin, H. Valpola and E. Oja.
In Proceedings of the Fifth Conference on Artificial Intelligence Applications to Environmental Science, 5AI, as part of the 87th Annual Meeting of the American Meteorological Society, San Antonio, TX, USA, January 2007.
[pdf 1.7 MB]
This paper collects together results which have been published in our previous papers. Alexander received an award for the best student presentation.
Extraction of climate components with structured variance.
A. Ilin, H. Valpola and E. Oja.
In Proceedings of the IEEE World Congress on Computational Intelligence, WCCI 2006, Vancouver, BC, Canada, pp. 10528-10535, 2006.
[pdf 1.7 MB]
We present an efficient algorithm for extracting components with structured variance. One of my earlier papers introduced a somewhat similar method for estimating hierarchical models of variance sources. Now we got similar results with a more efficient DSS-based algorithm. There are some interesting components. For instance, one component shows an abrupt decrease in its variance during the mid-1970s.
Exploratory analysis of climate data using source separation methods.
A. Ilin, H. Valpola and E. Oja.
Neural Networks, 19(2):155-167, 2006.
[pdf 3.4 MB] [html]
This article combines our PKDD'05 and IJCNN'05 articles. Some new results are presented.
Separation of nonlinear image mixtures by denoising source separation.
M. S. C. Almeida, H. Valpola and J. Särelä.
In Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation, ICA 2006, Charleston, SC, USA, pp. 8-15, 2006.
[pdf 362 kB] [abstract]
We extend the DSS framework to nonlinear mixtures and apply it to separation of image mixtures. Mariana received the best student paper award.
Frequency-based separation of climate signals.
A. Ilin and H. Valpola.
In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), Porto, Portugal, pp. 519-526, 2005.
[abstract and link to pdf]
In this paper we extended the results of our IJCNN'05 article by finding meaningful rotations among climate components based on their frequency content. The rotation method resembles the one used in our ICA'04 article.

Semiblind source separation of climate data detects El Niño as the component with the highest interannual variability.
A. Ilin, H. Valpola and E. Oja.
In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2005), Montréal, Québec, Canada, pp. 1722-1727, 2005.
[pdf 1.4 MB]
DSS can find features with certain characteristics. It turns out that in a certain large climate dataset, the phenomenon with the highest interannual variability is the well-known El Niño. Many other interesting phenomena are found, too. The linear DSS method we used here can only find a signal subspace, not a rotation in it. Real separation results were published in our PKDD 2005 paper.
Denoising source separation: a novel approach to ICA and feature extraction using denoising and Hebbian learning.
J. Särelä and H. Valpola.
In AI 2005 special session on correlation learning, pp. 45-56, 2005.
[12-page paper, pdf 1.7 MB] [2-page abstract, pdf 800 kB] [slides, pdf 2.7 MB]
Description of DSS and its biological relevance.
Denoising source separation.
J. Särelä and H. Valpola.
Journal of Machine Learning Research, 6:233-272, 2005.
[abstract and link to pdf]
This is a comprehensive machine learning perspective on DSS. No brain-related things fitted in. Here is a link to the DSS page at CIS, where further information and software are available.
Accurate, fast and stable denoising source separation algorithms.
H. Valpola and J. Särelä.
In Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation, ICA 2004, Granada, Spain, pp. 65-72, 2004.
[abstract and link to pdf]
Even faster than the famous FastICA and robust, too. Combining this paper with our IJCNN 2005 paper resulted in our PKDD 2005 paper.
Denoising source separation: from temporal to contextual invariance.
H. Valpola and J. Särelä.
Presented at the Early Cognitive Vision Workshop, Isle of Skye, Scotland, 2004.
[pdf 46 kB (abstract)] [pdf 2.6 MB (poster about DSS)] [pdf 89 kB (poster about context-guided denoising)]
The first poster gives an overview of DSS and the second explains how context can be used for denoising, promoting the development of invariant representations. Jaakko Särelä attended the workshop.
A fast semi-blind source separation algorithm.
H. Valpola and J. Särelä.
In Publications in Computer and Information Science, Report A66, Helsinki University of Technology, Espoo, Finland, 4 p., 2002.
[pdf 140 kB]
Here we put the basic idea of DSS down before starting to write the JMLR article.
Fast algorithms for Bayesian independent component analysis.
H. Valpola and P. Pajunen.
In Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation, ICA 2000, Helsinki, Finland, pp. 233-237, 2000.
[html] [pdf 493 kB]
The first publication of the method that became DSS. I wanted to include this in my thesis and therefore used variational Bayesian methods for denoising.

Variational Bayesian learning

In my thesis, I developed variational Bayesian methods which are suited for unsupervised learning. I'm not doing this actively anymore but in the past, I have supervised students of the Bayes group at CIS and my recent articles in this field are collaborations with them. Several software packages related to the research are available on-line.

I was doing this research because Bayesian probability theory combined with decision theory provides a very solid theoretical framework for intelligent behaviour. I learned a lot from this research and I think that the Bayesian viewpoint does give many useful concepts and tools for thinking about the brain. However, I believe that the Bayesian theory should only be used as one constraint among many others when designing intelligent systems. Instead of a one-to-one mapping from a generative model and its Bayesian learning algorithm to the brain (or our implementation of an intelligent, behaving system), there is a kind of mixture. What is atomic in one system may be a distributed process in the other. In other words, Bayesian probability theory and decision theory provide the gold standard for intelligent behaviour but no specific instructions about how to implement an intelligent, behaving system.

Compact modeling of data using independent variable group analysis.
E. Alhoniemi, A. Honkela, K. Lagus, J. Seppä, P. Wagner and H. Valpola.
IEEE Transactions on Neural Networks, 18(6):1762-1776, 2007.
[pdf 471 kB] [abstract]
IVGA was invented by Krista Lagus many years ago. The basic idea is to cluster input components: dependences between inputs are maximised within the groups and minimised between the groups. Each group can then be represented independently. I helped out with the variational machinery which was used for model selection. This is the first journal article about IVGA.
Blind separation of nonlinear mixtures by variational Bayesian learning.
A. Honkela, H. Valpola, A. Ilin and J. Karhunen.
Digital Signal Processing, 17(5):914-934, 2007.
[pdf 1.9 MB] [abstract]
This article brings together much of our research on nonlinear source separation which hasn't been published in journals before: an improved nonlinear factor analysis (NFA) method, hierarchical NFA and post-nonlinear factor analysis.
Building blocks for variational Bayesian learning of latent variable models.
T. Raiko, H. Valpola, M. Harva and J. Karhunen.
Journal of Machine Learning Research, 8:155-201, 2007.
[abstract] [pdf 417 kB]
This paper collects together much of our research on the Bayes Blocks software library. We have applied the library, for instance, in extracting variance sources and in hierarchical nonlinear factor analysis.
Hyperparameter adaptation in variational Bayes for the gamma distribution.
H. Valpola and A. Honkela.
Helsinki University of Technology, Publications in Computer and Information Science, Espoo, Finland, Tech. Rep. E6, 2006.
[pdf 93 kB]
This short technical report explains how to update the hyperparameters of gamma-distributed variables in the variational Bayesian framework. We have used the method in IVGA and the main reason for writing this report was that we didn't want to include all the details in our forthcoming article about IVGA.
On the effect of the form of the posterior approximation in variational learning of ICA models.
A. Ilin and H. Valpola.
Neural Processing Letters, 22(2):183-204, 2005.
[pdf 489 kB]
This is an extended version of our paper in ICA 2003. We show that the functional form of the approximated posterior source density has a large effect on separation capability in linear ICA models with variational Bayesian learning. The software for running the simulations is available on-line.
Bayes Blocks: An implementation of the variational Bayesian building blocks framework.
M. Harva, T. Raiko, A. Honkela, H. Valpola and J. Karhunen.
In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005), Edinburgh, Scotland, pp. 259-266, 2005.
[pdf 164 kB]
This is an update of our original Bayes Blocks paper.
Unsupervised variational Bayesian learning of nonlinear models.
A. Honkela and H. Valpola.
In L. K. Saul, Y. Weiss and L. Bottou, eds., Advances in Neural Information Processing Systems 17 (NIPS 2004), pp. 593-600, 2005.
[pdf 118 kB]
Antti Honkela has developed more accurate methods for computing how probability distributions are transformed in nonlinear mappings. This makes learning more stable and the results more reliable.
Using kernel PCA for initialisation of variational Bayesian nonlinear blind source separation method.
A. Honkela, S. Harmeling, L. Lundqvist and H. Valpola
In Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation, ICA 2004, Granada, Spain, pp. 790-797, 2004.
[abstract and link to pdf]
Kernel PCA can get an answer quickly but it cannot evaluate it (tell how probable the model is). Variational Bayesian methods can evaluate the results but are slower and can suffer from local minima. Combining the two gives the best of both worlds.
Variational learning and bits-back coding: an information theoretic view to Bayesian learning.
A. Honkela and H. Valpola.
IEEE Transactions on Neural Networks, 15(4):800-810, 2004.
We discuss the (well-known) connections between variational learning and information-theoretic bits-back coding. Both viewpoints are useful for understanding different things. We give examples which we have encountered in our research. (I originally came up with variational Bayesian learning after looking at the information-theoretic minimum-message-length framework. It turned out that it had already been invented, but at least I was applying it to different problems, namely nonlinear factor analysis.)
Nonlinear dynamical factor analysis for state change detection.
A. Ilin, H. Valpola and E. Oja.
IEEE Transactions on Neural Networks, 15(3):559-575, 2004.
After learning a nonlinear state-space model (with dynamics), we can use it for change detection. We report simulations with artificial data (where we know the ground truth).
Hierarchical models of variance sources.
H. Valpola, M. Harva and J. Karhunen.
Signal Processing, 84(2):267-282, 2004.
We propose a hierarchical latent-variable model where the higher levels model both the changes in variables (as in factor analysis) and changes in the variances of variables. The software and data for running the simulations are available on-line.

Most of my publications dating before 2004 are available only at my NNRC page.

An unsupervised ensemble learning method for nonlinear dynamic state-space models.
H. Valpola and J. Karhunen.
Neural Computation, 14(11):2647-2692, 2002.
[abstract] [pdf 937 kB]
The article is based on a technical report that was part of my thesis. The model looks like an extended Kalman filter (or smoother) where the nonlinear mappings have been implemented by MLP networks. The catch is that the state representation is not fixed; it is learned from the data. The system is able to find state features which can represent the data, can predict the future states and are predictable by the past states. We also show that by suitably restricting the functional form of the posterior approximation of the sources, we can separate uncoupled dynamical processes. (This is quite opposite to the situation reported here.) The software for running the simulations is available on-line.
Bayesian ensemble learning for nonlinear factor analysis.
H. Valpola.
PhD thesis, Helsinki University of Technology, Espoo, 2000.
Published in Acta Polytechnica Scandinavica, Mathematics and Computing Series No. 108, 2000.
Eighty pages of introduction to Bayesian probability theory and decision theory, practical Bayesian methods and variational methods in particular, unsupervised learning, factor analysis, independent component analysis and their nonlinear extensions. Out of the eight publications included in the thesis, the one that least fits in (Pub. VII) became the basis of my current research.

Harri Valpola
Last modified: Fri Sep 19 23:00:38 EEST 2008