Yesterday I was watching Lucy, the Luc Besson film about a woman, Lucy, who develops the ability to use 100% of her brain power. While I was watching, my wife asked why we don’t use more of our brain’s processing power at any one time. I think it was meant as a hypothetical, but I said, "That’s interesting, I’ve got a paper on my iPad about that right now!"

I would like to note that I am neither a cognitive scientist nor an expert in biologically inspired systems, but I do enjoy exploring the topic through reading and going to some of the ICML workshops. When I was looking through the ICML 2015 accepted papers list, I was extremely interested in MADE: Masked Autoencoder for Distribution Estimation by Germain et al.

The paper starts with the definition of an autoencoder: a neural network with one hidden layer whose output layer is trained to reproduce the values presented at the input layer while minimizing a reconstruction error. Or, with equations:

\[\begin{aligned} \textbf{h} \left( \textbf{x} \right) &= \textbf{g} \left( \textbf{b}+\textbf{Wx} \right) \\ \hat { \textbf{x} } &= \text{sigm}\left( \textbf{c} + \textbf{Vh}\left( \textbf{x} \right) \right) \end{aligned}\]

where the first equation indicates that a non-linear activation function ( \(\textbf{g}\) ) is applied to an affine transformation of the input data, producing the values at the hidden layer. In the second equation, the sigmoid function is applied to an affine transformation of the hidden layer values produced by the first equation. The number of nodes in the input and output layers is the same (or the input layer may have an additional bias unit). Typically, the number of nodes in the hidden layer is less than the number of nodes in the input and output layers, but this is not a requirement by any means. Autoencoders can also be stacked to form deeper networks. Pictorially, a basic autoencoder might look something like:

(Figure: a basic autoencoder, with an input layer, a single hidden layer, and an output layer of the same size as the input.)
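To make the two equations above concrete, here is a minimal NumPy sketch of the forward pass of a single-hidden-layer autoencoder. The layer sizes, the choice of tanh for \(\textbf{g}\), the cross-entropy loss, and all variable names are my own illustration, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

D, H = 8, 4                               # input/output size D, hidden size H (arbitrary choices)
W = rng.normal(scale=0.1, size=(H, D))    # input-to-hidden weights
V = rng.normal(scale=0.1, size=(D, H))    # hidden-to-output weights
b = np.zeros(H)                           # hidden biases
c = np.zeros(D)                           # output biases

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # h(x) = g(b + Wx), here using tanh as the non-linearity g
    h = np.tanh(b + W @ x)
    # x_hat = sigm(c + V h(x)), suitable for binary-valued inputs
    return sigm(c + V @ h)

x = rng.integers(0, 2, size=D).astype(float)   # a random binary input vector
x_hat = forward(x)

# Cross-entropy reconstruction error for binary x
loss = -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
print(x_hat, loss)
```

Training would then adjust W, V, b, and c to drive this reconstruction error down over a dataset.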

The paper then shows how to modify the standard autoencoder to create a joint probability distribution estimator. The first step is recognizing that, by the product rule of probability, the joint probability can be factored into a product of conditional probabilities:

\[p\left( x \right) =\prod _{ d=1 }^{ D }{ p\left( { { x }_{ d } }|{ { \textbf{x} }_{ <d } } \right) }\]

An example of this might be:

\[p\left( { x }_{ 1 },{ x }_{ 2 },{ x }_{ 3 } \right) =p\left( { x }_{ 1 } \right) p\left( { { x }_{ 2 } } \mid { x }_{ 1 } \right) p\left( { { x }_{ 3 } } \mid { x }_{ 1 },{ x }_{ 2 } \right)\]

which means that the probability of the three events co-occurring is the same as the probability that the first event occurs, times the probability that the second event occurs given that the first occurred, times the probability that the third event occurs given that the first two occurred.
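As a quick sanity check of this factorization, here is a small NumPy sketch over a toy joint distribution of three binary variables. The distribution, the chosen outcome, and all variable names are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up joint distribution p(x1, x2, x3) over three binary variables,
# stored as a 2x2x2 table that sums to 1.
p = rng.random((2, 2, 2))
p /= p.sum()

x1, x2, x3 = 1, 0, 1   # a particular outcome to evaluate

# Marginals and conditionals computed directly from the table
p_x1 = p.sum(axis=(1, 2))[x1]                                     # p(x1)
p_x2_given_x1 = p.sum(axis=2)[x1, x2] / p.sum(axis=(1, 2))[x1]    # p(x2 | x1)
p_x3_given_x1_x2 = p[x1, x2, x3] / p.sum(axis=2)[x1, x2]          # p(x3 | x1, x2)

product = p_x1 * p_x2_given_x1 * p_x3_given_x1_x2
print(product, p[x1, x2, x3])   # the two values agree
```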

Now we know what the ADE in MADE stands for, and it’s time to expound upon the M, which is the real meat of the paper. M, as the title MADE: Masked Autoencoder for Distribution Estimation implies, stands for masking. The key to the paper is the development of clever masking matrices for the encode and decode phases of the autoencoder. Once we have the masking matrices, the modification to the first set of equations above is straightforward:

\[\begin{aligned} \textbf{h} \left( \textbf{x} \right) &= \textbf{g} \left( \textbf{b} + \left( \textbf{W} \odot { \textbf{M} }^{ \textbf{W} } \right) \textbf{x} \right) \\ \hat { \textbf{x} } &= \text{sigm}\left( \textbf{c} + \left( \textbf{V} \odot { \textbf{M} }^{ \textbf{V} } \right) \textbf{h}\left( \textbf{x} \right) \right) \end{aligned}\]

where \(\odot\) represents the Hadamard (element-wise) product. The key is the form of the masking matrices \({ \textbf{M} }^{ \textbf{W} }\) and \({ \textbf{M} }^{ \textbf{V} }\), which are used to remove connections from the input to the hidden layer and from the hidden layer to the output layer, respectively. These connections are removed so that each output \(\hat{x}_d\) is reconstructed only from the inputs that come before \(x_d\) in a chosen ordering. This is exactly the condition required above for factoring the joint probability into a product of conditional probabilities. For more details on constructing the masks, see the paper.
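To make the masking idea concrete, here is a minimal NumPy sketch of the single-hidden-layer construction described in the paper: each hidden unit \(k\) is assigned an integer \(m(k)\) between 1 and \(D-1\), the input-to-hidden mask connects hidden unit \(k\) only to inputs \(x_1, \dots, x_{m(k)}\), and the hidden-to-output mask connects output \(d\) only to hidden units with \(m(k) < d\). The layer sizes, random seed, choice of tanh for \(\textbf{g}\), and variable names are my own and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

D, H = 8, 16   # D inputs/outputs, H hidden units (sizes chosen for illustration)

# Each hidden unit k gets an integer m(k) in {1, ..., D-1}: the index of the
# last input it is allowed to depend on.
m = rng.integers(1, D, size=H)      # values in 1..D-1
d = np.arange(1, D + 1)             # input/output indices 1..D

# Input-to-hidden mask: hidden unit k connects to input d only if d <= m(k)
M_W = (d[None, :] <= m[:, None]).astype(float)   # shape (H, D)

# Hidden-to-output mask: output d connects to hidden unit k only if m(k) < d,
# so output d ends up depending only on inputs x_{<d}
M_V = (m[None, :] < d[:, None]).astype(float)    # shape (D, H)

# Masked forward pass, as in the modified equations above
W = rng.normal(scale=0.1, size=(H, D))
V = rng.normal(scale=0.1, size=(D, H))
b, c = np.zeros(H), np.zeros(D)

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.integers(0, 2, size=D).astype(float)
h = np.tanh(b + (W * M_W) @ x)
x_hat = sigm(c + (V * M_V) @ h)

# Sanity check: the product of the masks encodes which inputs reach each output.
# Output 1 may depend on no inputs, so its row is all zeros.
print((M_V @ M_W)[0])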

The interesting thing is that by disabling certain connections in the neural network, the authors devised a way of learning joint probability distributions. Again, I’m not suggesting that this is biologically inspired. I don’t know whether the brain disables synaptic pathways in order to calculate joint probability distributions. But I do think it’s an interesting thought.

References

Mathieu Germain, Karol Gregor, Iain Murray, Hugo Larochelle: MADE: Masked Autoencoder for Distribution Estimation. ICML 2015: 881-889