I get the intuition that when using an AutoDelta guide with no batching, there should not be any random sampling going on, so I would expect the model to follow the gradient smoothly… why do I see the loss oscillating over the course of optimization?
Probably because of the choice of optimization algorithm you're using. You should only expect a smooth loss curve in the limit of infinitely small step sizes (learning rates); as the learning rate gets larger and larger, the loss curve will get jumpier and jumpier.
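A toy example of this effect, with no randomness anywhere: minimizing a 1-D quadratic with plain gradient descent versus momentum-style updates (the same mechanism behind the oscillations you see with momentum-based optimizers like Adam). The objective, learning rate, and momentum values below are illustrative choices, not anything specific to your model:

```python
def loss(x):
    return x * x

def grad(x):
    return 2.0 * x

def run(lr, momentum, steps=100, x0=1.0):
    """Deterministic gradient descent with heavy-ball momentum."""
    x, v = x0, 0.0
    losses = [loss(x)]
    for _ in range(steps):
        v = momentum * v + grad(x)  # velocity accumulates past gradients
        x = x - lr * v
        losses.append(loss(x))
    return losses

plain = run(lr=0.1, momentum=0.0)  # monotone, smooth decrease
heavy = run(lr=0.1, momentum=0.9)  # underdamped: the loss overshoots
                                   # and oscillates on its way down
```

Even though every gradient here is exact, the momentum run overshoots the minimum and the loss goes up and down before settling, while the plain small-step run decreases monotonically.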
So using LBFGS should partially fix the problem?
It might, at least assuming there isn't some other source of stochasticity (e.g. data subsampling, a.k.a. mini-batching). Note, however, that LBFGS is expected to be slow if the parameter space is sufficiently high-dimensional.
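For intuition about why a quasi-Newton method smooths things out in the deterministic case, here's a minimal pure-Python sketch of L-BFGS (the standard two-loop recursion with Armijo backtracking) on a toy ill-conditioned quadratic. The objective, memory size, and line-search constants are all illustrative assumptions; in practice you'd use a library implementation such as `torch.optim.LBFGS` rather than this:

```python
def f(x):
    # Ill-conditioned quadratic: 0.5 * (x0^2 + 100 * x1^2)
    return 0.5 * (x[0] ** 2 + 100.0 * x[1] ** 2)

def grad(x):
    return [x[0], 100.0 * x[1]]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lbfgs(x, iters=30, m=5):
    history = []  # recent (s, y) pairs, newest last
    g = grad(x)
    losses = [f(x)]
    for _ in range(iters):
        # Two-loop recursion: d approximates -H^{-1} g
        q = list(g)
        alphas = []
        for s, y in reversed(history):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y = history[-1]
            gamma = dot(s, y) / dot(y, y)  # initial inverse-Hessian scale
        else:
            gamma = 1.0
        r = [gamma * qi for qi in q]
        for (s, y), a in zip(history, reversed(alphas)):
            b = dot(y, r) / dot(y, s)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]
        # Armijo backtracking line search guarantees the loss never goes up
        t, fx, gd = 1.0, f(x), dot(g, d)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gd:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [xn - xo for xn, xo in zip(x_new, x)]
        y = [gn - go for gn, go in zip(g_new, g)]
        if dot(y, s) > 1e-12:  # curvature condition keeps H estimate valid
            history.append((s, y))
            history = history[-m:]
        x, g = x_new, g_new
        losses.append(f(x))
    return x, losses

x_opt, losses = lbfgs([1.0, 1.0])
```

The line search is why the loss curve is monotone here, and the curvature pairs are why convergence is fast. It's also where the cost comes from: each stored pair is a full parameter-sized vector, and each iteration does several extra passes over them, which is exactly the expense mentioned above for high-dimensional parameter spaces.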