In the first example, by conditioning on obs=data, we get joint_prob = p(f, obs=data) = p(obs=data | f) * p(f). With obs=data, the value returned by the pyro.sample statement for p(obs | f) is data itself. Without obs, the pyro.sample statement returns a fresh sample from the Bernoulli(f) distribution.
I think that the best way to understand what pyro really does under the hood is to use pyro.poutine.
You can generate a trace with trace = pyro.poutine.trace(model).get_trace(data), then print out trace.nodes, which is a dict containing all the information about the param/sample sites generated by the model.
About question 2, I guess you mean how to build a model? I don't have a definitive answer; it depends on domain knowledge, intuition, the data, and so on. A good first step is to learn what the input (parameters) and output (value generated from the sample method) of each distribution are. In that post, the author used a Categorical distribution because it is the standard choice for a multi-class classification problem: its input is a vector of logits generated by the neural network, and its output is a sampled digit class.