I am reading the Pyro tutorial on normalizing flows (Normalizing Flows - Introduction (Part 1) — Pyro Tutorials 1.8.4 documentation) and I would like to better understand how the examples work under the hood. Specifically, I am referring to the architecture of the network used to obtain the marginal distributions in the concentric circles example. There, the base distribution (in the latent space) is a standard normal and the flow is a rational spline:
import torch
import pyro.distributions as dist
import pyro.distributions.transforms as T

base_dist = dist.Normal(torch.zeros(2), torch.ones(2))
spline_transform = T.Spline(2, count_bins=16)
flow_dist = dist.TransformedDistribution(base_dist, [spline_transform])
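To try to see this myself, I listed the transform's learnable parameters (if I understand correctly, T.Spline subclasses torch.nn.Module, so named_parameters() should work):

for name, p in spline_transform.named_parameters():
    print(name, tuple(p.shape))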
According to the tutorial, the knots of the spline and their derivatives are parameters that can be learnt, e.g. through stochastic gradient descent on a maximum likelihood objective. The tutorial shows how to do that:
%%time
steps = 1 if smoke_test else 1001
dataset = torch.tensor(X, dtype=torch.float)
optimizer = torch.optim.Adam(spline_transform.parameters(), lr=1e-2)
for step in range(steps):
    optimizer.zero_grad()
    # negative log-likelihood of the data under the flow
    loss = -flow_dist.log_prob(dataset).mean()
    loss.backward()
    optimizer.step()
    # invalidate cached values after the parameters change
    flow_dist.clear_cache()
    if step % 200 == 0:
        print('step: {}, loss: {}'.format(step, loss.item()))
Finally, the tutorial shows how to sample from the learned distribution in order to obtain new data:
X_flow = flow_dist.sample(torch.Size([1000,])).detach().numpy()
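To compare the samples against the training data, something like the following should work (a minimal sketch, assuming matplotlib is available and X holds the original circles data):

import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], s=5, alpha=0.5, label='data')
plt.scatter(X_flow[:, 0], X_flow[:, 1], s=5, alpha=0.5, label='flow samples')
plt.legend()
plt.show()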
I would like to know what the architecture of the NN used to learn those parameters is, and whether there is a (possibly simple) way to modify this architecture (e.g. to add or remove layers).
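For instance, is composing several transforms the intended way to "add layers"? This is a sketch of what I have in mind (untested beyond my reading of the docs; TransformedDistribution accepts a list of transforms, and nn.ModuleList is just a convenient way to collect their parameters for the optimizer):

transforms = [T.Spline(2, count_bins=16) for _ in range(3)]
flow_dist = dist.TransformedDistribution(base_dist, transforms)

# gather the parameters of all transforms into one optimizer
modules = torch.nn.ModuleList(transforms)
optimizer = torch.optim.Adam(modules.parameters(), lr=1e-2)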
More generally, I would like to adapt these simple examples to the univariate case of learning the density of time series data.
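For concreteness, this is roughly how I would try to set up the univariate version (a sketch, assuming for now that the observations can be treated as i.i.d. samples; series stands for a hypothetical 1-D numpy array of time-series values):

series_tensor = torch.tensor(series, dtype=torch.float).reshape(-1, 1)

base_dist = dist.Normal(torch.zeros(1), torch.ones(1))
spline_transform = T.Spline(1, count_bins=16)
flow_dist = dist.TransformedDistribution(base_dist, [spline_transform])
# then train exactly as above, with dataset replaced by series_tensor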