How to estimate and include an unknown variable in a regression model?

I’m working on extending the following regression model to include an additional variable that is unknown. The new variable, x4, is the mean of the unknown scores of a given observation’s relationships. In other words, y = intercept + b_x1 * x1 + b_x2 * x2 + b_x3 * x3 + b_x4 * x4, where x4 = mean(subset of unknown values).
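Written out per observation, the extended model reads roughly as follows. The notation is an assumption added for illustration: $N(i)$ is the set of people connected to person $i$, and $s_j$ is the unknown score of person $j$.

```latex
y_i = \text{intercept} + b_{x1}\, x_{1,i} + b_{x2}\, x_{2,i} + b_{x3}\, x_{3,i} + b_{x4}\, x_{4,i} + \varepsilon_i,
\qquad
x_{4,i} = \frac{1}{|N(i)|} \sum_{j \in N(i)} s_j,
\qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```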

# Pyro model (linear regression)

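import pyro
import pyro.distributions as dist


def model(x1, x2, x3, y=None):
    """
    Linear regression of y on x1, x2 and x3 (1-D tensors of equal length).
    """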
    # Coefficients
    intercept = pyro.sample("intercept", dist.Normal(0., 1.0))
    b_x1 = pyro.sample("b_x1", dist.Normal(0., 1.0))
    b_x2 = pyro.sample("b_x2", dist.Normal(0., 1.0))
    b_x3 = pyro.sample("b_x3", dist.Normal(0., 1.0))
    sigma = pyro.sample("sigma", dist.Uniform(0., 1.0))

    mean = \
        intercept +\
        b_x1 * x1 +\
        b_x2 * x2 +\
        b_x3 * x3

    with pyro.plate("data", len(x1)):
        return pyro.sample("y", dist.Normal(mean, sigma), obs=y)

It’s not correctly coded, but I’d like to add something like the following in order to estimate the values for another variable, x4:

    # Unknown variable for each person to be estimated.
    latent_scores = [pyro.sample(f"latent_score_{i}", dist.Normal(0., 1.0)) for i in range(len(x1))]

    # For each person, calculate the unknown x4 value, where x4[i] is the mean
    # of unknown values of a given person's connections (i.e. network edges).
    x4[i] = (latent_scores[0] + latent_scores[1]) / 2

Basically, I want to estimate the unknown values for each person and the model coefficients at the same time. I’m new to Pyro, so if there is a more appropriate modeling approach, please let me know!

This sounds like some sort of hierarchical model or mixed-effects model. Could you provide a little more information:

  • What’s the relationship between data and people?
  • Where do edges and vertices and network models fit into your model?
  • What are the values of a person’s connection? Are they some of the ys? How do you know only some are observed?

Thank you for the reply. For your questions,

  • What’s the relationship between data and people?

    • Each row in the data corresponds to one person, and the columns are attributes.
  • Where do edges and vertices and network models fit into your model?

    • The vertices in the network are the people and the edges are the inputs to the average being calculated, i.e. x4[i] = (latent_scores[0] + latent_scores[1]) / 2 , where the indexes [0] and [1] are the given person’s (i.e. row) edges in the network.
  • What are the values of a person’s connection? Are they some of the ys? How do you know only some are observed?

    • The values of a person’s connections are unobserved. Each is assumed to follow a normal distribution with mean 0 and sigma 1, and their average is included in the regression model as an input to predict one of the attributes for the people.

If I need to clarify any of the details, please let me know!
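The averaging over edges described above can be written as a single matrix product, which may be easier to reason about than per-person indexing. The following is a minimal sketch, not part of the original posts: the `edges` dictionary, the matrix names `A` and `W`, and the stand-in score values are all hypothetical.

```python
import torch

# Hypothetical adjacency lists: edges[i] holds the indices of person i's connections.
edges = {0: [1, 2], 1: [0], 2: [], 3: [0, 1, 2]}
n_people = 4

# Dense adjacency matrix: A[i, j] = 1 if person j is a connection of person i.
A = torch.zeros(n_people, n_people)
for source, targets in edges.items():
    if targets:
        A[source, targets] = 1.0

# Row-normalize so that each row averages over that person's connections.
# Rows with no edges stay all-zero, so x4 is 0 for those people.
degree = A.sum(dim=1, keepdim=True)
W = A / degree.clamp(min=1.0)

# Stand-in values for the latent scores (inside a model these would be sampled).
latent_scores = torch.tensor([0.5, -1.0, 2.0, 0.0])

# x4[i] is the mean latent score of person i's connections.
x4 = W @ latent_scores
print(x4)
```

Because `W` depends only on the network, it can be computed once outside the model and passed in as data.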

Here is an updated example model to show what I am attempting to do with the edges.


import torch
import pyro
import pyro.distributions as dist


def model(person_idx, edges_df, x1, x2, x3, y=None):
    """
    person_idx: person index (i.e. unique identifier for each person)
    edges_df: Pandas dataframe with "source" and "target" columns. The values in
                     the "target" column are lists of edges, e.g. [2, 17, 39]. Edge values
                     are based on person_idx.
    x1 - x3: numeric input variables
    y: numeric target
    """
    # Coefficients
    intercept = pyro.sample("intercept", dist.Normal(0., 1.0))
    b_x1 = pyro.sample("b_x1", dist.Normal(0., 1.0))
    b_x2 = pyro.sample("b_x2", dist.Normal(0., 1.0))
    b_x3 = pyro.sample("b_x3", dist.Normal(0., 1.0))
    b_emls = pyro.sample("b_emls", dist.Normal(0., 1.0))
    sigma = pyro.sample("sigma", dist.Uniform(0., 1.0))

    # A unique latent score for each person
    mu_ls = pyro.sample("mu_ls", dist.Normal(0.0, 1.0))
    sigma_ls = pyro.sample("sigma_ls", dist.HalfNormal(1.0))
    n_people = len(person_idx)
    with pyro.plate("plate_ls", n_people):
        latent_score = pyro.sample("latent_score", dist.Normal(mu_ls, sigma_ls))

    # Mean latent score of each person's edges
    edges_mean_latent_score = torch.empty_like(x1)
    for p in person_idx:
        # Collect the edges for person p; each "target" cell holds a list of edges
        p_rows = edges_df.loc[edges_df['source'] == p, 'target']
        p_edges = [e for targets in p_rows for e in targets]
        num_p_edges = len(p_edges)
        if num_p_edges > 0:
            p_edges_mean_latent_score = sum([latent_score[e] for e in p_edges]) / num_p_edges
            edges_mean_latent_score[p] = p_edges_mean_latent_score
        else:
            # Set value to 0 (i.e. mean) for those with no edges
            edges_mean_latent_score[p] = 0

    mean = \
        intercept +\
        b_x1 * x1 +\
        b_x2 * x2 +\
        b_x3 * x3 +\
        b_emls * edges_mean_latent_score

    with pyro.plate("data", len(x1)):
        return pyro.sample("y_score", dist.Normal(mean, sigma), obs=y)