Date

October 1995

Document Type

Dissertation

Degree Name

Ph.D.

Department

Dept. of Computer Science and Engineering

Institution

Oregon Graduate Institute of Science & Technology

Abstract

In this thesis we develop a mathematical formulation for the learning dynamics of stochastic, or on-line, learning algorithms in neural networks. We use this formulation to 1) model the time evolution of the weight-space densities during learning, 2) predict convergence regimes with and without momentum, and 3) develop a new, efficient algorithm with few adjustable parameters, which we call adaptive momentum. In stochastic learning, the weights are updated at each iteration based on a single exemplar randomly chosen from the training set. Treating the learning dynamics as a Markov process, we show that the weight-space probability density P(w, t) evolves according to a Kramers-Moyal series, ∂P(w, t)/∂t = L_KM P(w, t), where L_KM is an infinite-order linear differential operator whose terms involve powers of the learning rate µ. We present several approaches for truncating this series so that approximate solutions can be obtained. One approach is the small-noise expansion, in which the weights are modeled as the sum of a deterministic component and a noise component. To obtain more accurate solutions, we also develop a perturbation expansion in µ, and we demonstrate the technique on equilibrium weight-space densities.

Unlike batch updates, stochastic updates are noisy but fast to compute. The speedup can be dramatic when training sets are highly redundant, and the noise can decrease the likelihood of becoming trapped in poor local minima. However, acceleration techniques based on estimating the local curvature of the cost surface cannot be implemented stochastically, because stochastic estimates of second-order effects are far too noisy. Disregarding these effects can greatly hinder learning on problems where the condition number of the Hessian is large. What is needed is a matrix of learning rates (the inverse Hessian) that scales the step size according to the curvature along the different eigendirections of the Hessian.

We propose adaptive momentum as a solution. It yields an effective learning-rate matrix that approximates the inverse Hessian, without any explicit calculation of the Hessian or its inverse. The algorithm is only O(n) in both space and time, where n is the dimension of the weight vector.
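For reference, the stochastic (per-exemplar) update with classical momentum that the abstract contrasts with batch learning can be sketched as follows. This is a minimal illustration on an assumed quadratic cost; the toy data, learning rate mu, and momentum coefficient beta are illustrative choices only, and the sketch is not the adaptive-momentum algorithm developed in the thesis.

import numpy as np

# Minimal sketch: per-exemplar (stochastic) gradient descent with classical
# momentum on an assumed quadratic cost 0.5*(x.w - y)^2. The values of mu and
# beta are illustrative, not taken from the thesis.
rng = np.random.default_rng(0)

n = 5                                   # dimension of the weight vector
X = rng.normal(size=(200, n))           # training inputs
w_true = rng.normal(size=n)
y = X @ w_true                          # targets for the toy problem

w = np.zeros(n)                         # weights
v = np.zeros(n)                         # momentum ("velocity") term
mu, beta = 0.01, 0.9                    # learning rate, momentum coefficient

for t in range(5000):
    i = rng.integers(len(X))            # one randomly chosen exemplar per step
    grad = (X[i] @ w - y[i]) * X[i]     # stochastic gradient of the cost
    v = beta * v - mu * grad            # classical momentum update
    w = w + v                           # weight update

print("final weight error:", np.linalg.norm(w - w_true))

In contrast to the fixed beta used here, the adaptive-momentum scheme summarized in the abstract adapts the momentum term so that the effective learning-rate matrix approximates the inverse Hessian, while remaining O(n) in space and time.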

Identifier

doi:10.6083/M40P0WZ1
