Hessian-Free Optimization: Supplementary Materials
Contents
1 Pseudo-code for the damped Gauss-Newton vector product
2 Details of the pathological synthetic problems
  2.1 The addition, multiplication, and XOR problem
  2.2 The temporal order problem
  2.3 The 3-bit temporal order problem
  2.4 The random permutation problem
  2.5 Noiseless memorization
3 Details of the natural problems
  3.1 The bouncing balls problem
  3.2 The MIDI dataset
  3.3 The speech dataset
1 Pseudo-code for the damped Gauss-Newton vector product
Algorithm 1 Computation of the matrix-vector product of the structurally-damped Gauss-Newton matrix with the vector $v$, for the case where the hidden non-linearity $e$ is tanh, the output non-linearity $g$ is the logistic sigmoid, and $D$ and $L$ are the corresponding matching loss functions. The notation reflects the "convex approximation" interpretation of the GN matrix, so that we apply the $\mathcal{R}$ operator to the forwards-backwards pass through the linearized and structurally damped objective $\tilde{k}$, and the desired matrix-vector product is given by $\mathcal{R}\frac{d\tilde{k}}{d\theta}$. All derivatives are implicitly evaluated at $\theta = \theta_n$. The previously defined parameter symbols $W_{ph}$, $W_{hx}$, $W_{hh}$, $b_h$, $b_p$, $b_h^{\mathrm{init}}$ correspond to the parameter vector $\theta_n$ when they carry no superscript, and to the input parameter vector $v$ when they carry the '$v$' superscript.
The $\mathcal{R}z$ notation follows Pearlmutter [1994], and for the purposes of reading the pseudo-code it can be interpreted as merely defining a new symbol. We assume that the intermediate quantities of the network (e.g. $h_i$) have already been computed (from $\theta_n$).
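As a rough illustration of the computation the caption describes, the sketch below implements a structurally-damped Gauss-Newton vector product for a simple tanh RNN with logistic outputs and cross-entropy (matching) loss, using Pearlmutter's R-operator. This is a minimal reading of the scheme, not the paper's exact pseudo-code: the function name, the parameter dictionary layout, and the combined damping coefficient `lam_mu` (standing for $\lambda\mu$) are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gauss_newton_vector_product(params, v, xs, lam_mu=0.0):
    """Sketch of G*v for a tanh RNN with logistic outputs, where G is the
    structurally-damped Gauss-Newton matrix.  `params` holds theta_n and
    `v` holds the input vector, both as dicts with the same keys."""
    Whx, Whh, Wph = params["Whx"], params["Whh"], params["Wph"]
    bh, bp = params["bh"], params["bp"]

    # Ordinary forward pass: these intermediate quantities (h_t, p_t)
    # are the ones assumed to be precomputed from theta_n.
    hs, ps = [params["h0"]], []
    for x in xs:
        hs.append(np.tanh(Whx @ x + Whh @ hs[-1] + bh))
        ps.append(sigmoid(Wph @ hs[-1] + bp))

    # R-forward pass: R{z} is the directional derivative of z along v.
    Rh, Rs = [v["h0"]], []
    for t, x in enumerate(xs):
        Ra = v["Whx"] @ x + v["Whh"] @ hs[t] + Whh @ Rh[-1] + v["bh"]
        Rh.append((1.0 - hs[t + 1] ** 2) * Ra)   # tanh'(a) = 1 - h^2
        Rs.append(v["Wph"] @ hs[t + 1] + Wph @ Rh[-1] + v["bp"])

    # Backward pass through the linearized objective.  For the matching
    # logistic / cross-entropy pair the output Hessian is diag(p(1-p)).
    # Structural damping injects lam_mu * tanh'(a) * R{a} = lam_mu * R{h}
    # at each hidden pre-activation (an assumption consistent with the
    # matching-loss Gauss-Newton form).
    Gv = {k: np.zeros_like(p) for k, p in params.items()}
    Rdh = np.zeros_like(params["h0"])
    for t in reversed(range(len(xs))):
        Rds = ps[t] * (1.0 - ps[t]) * Rs[t]
        Gv["Wph"] += np.outer(Rds, hs[t + 1])
        Gv["bp"] += Rds
        Rdh = Rdh + Wph.T @ Rds
        Rda = (1.0 - hs[t + 1] ** 2) * Rdh + lam_mu * Rh[t + 1]
        Gv["Whx"] += np.outer(Rda, xs[t])
        Gv["Whh"] += np.outer(Rda, hs[t])
        Gv["bh"] += Rda
        Rdh = Whh.T @ Rda
    Gv["h0"] = Rdh
    return Gv
```

Because the product has the form $G v = J^\top H J v$ with $H$ diagonal and positive semi-definite, any implementation should satisfy $u \cdot Gv = v \cdot Gu$ and $v \cdot Gv \ge 0$, which makes a convenient sanity check.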
References

J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, and N.L. Dahlgren. DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM. US Dept. of Commerce, National Institute of Standards and Technology, 1993.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.

J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

B.A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 1994.