Supervised learning with uncertainty in observations

I’m working on a problem with two large datasets in which strings are mapped to numbers. In other words, it’s a supervised learning task where I want to learn a function that maps strings to numbers, and I have a few hundred thousand examples to learn from.

In dataset 1, the number is a bit (0/1), as in typical binary classification problems. In dataset 2, it’s a real number from 0 to 1. The bit in dataset 1 actually reflects experimental uncertainty: the lab technique couldn’t measure the exact real value. For bit 1, the underlying value is distributed in the interval 0.9–1 (and can be modeled); for bit 0, it follows a wider distribution with a much smaller mean.
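To make the label-noise assumption concrete, here is a small sketch of how the underlying values behind each bit could be simulated. The specific distributions (uniform on 0.9–1 for bit 1, a Beta for bit 0) are illustrative assumptions, not the real lab noise:

```python
import torch

# Hypothetical noise model for dataset 1's bits.
# bit 1 -> underlying value concentrated in [0.9, 1]
# bit 0 -> a wider distribution with a much smaller mean
def sample_underlying(bit: int, n: int) -> torch.Tensor:
    if bit == 1:
        return 0.9 + 0.1 * torch.rand(n)                     # uniform on [0.9, 1]
    return torch.distributions.Beta(1.0, 3.0).sample((n,))  # wider, mean 0.25

ones = sample_underlying(1, 1000)
zeros = sample_underlying(0, 1000)
```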

I already have a deep network that works well for dataset 2.

I plan to extend this network into a deep probabilistic one with two outputs, mu and sigma, as parameters of a normal observation distribution. For training data from dataset 2, I would supply a small sigma. An immediate advantage is that I can get uncertainty in predictions.

This would also let me introduce dataset 1, where I would map bit 1 to a normal whose mean is close to 1 but with a much bigger sigma than for dataset 2.
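As a rough sketch of the idea above: each label is treated as a noisy observation of the network’s prediction under a Normal, with a per-dataset observation sigma (small for the precise dataset 2, larger for the surrogate targets from dataset 1). The function name, sigma values, and toy tensors here are all illustrative assumptions:

```python
import torch

def observation_nll(mu_pred, y_obs, sigma_obs):
    """Negative log-likelihood of y_obs under Normal(mu_pred, sigma_obs)."""
    return -torch.distributions.Normal(mu_pred, sigma_obs).log_prob(y_obs).mean()

# Dataset 2: precise real-valued labels -> small observation sigma.
mu2 = torch.tensor([0.85, 0.40])   # network predictions (toy values)
y2 = torch.tensor([0.90, 0.38])    # observed real labels
loss2 = observation_nll(mu2, y2, torch.tensor(0.02))

# Dataset 1: bit 1 mapped to a surrogate target near 1 with a larger sigma.
mu1 = torch.tensor([0.88])
y1 = torch.tensor([0.95])          # surrogate target for bit 1
loss1 = observation_nll(mu1, y1, torch.tensor(0.1))

total_loss = loss2 + loss1         # both datasets contribute to one objective
```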

Is this approach correct? I couldn’t find a lot of literature on uncertainty in observations.

Hi @yarnton,

If I understand correctly, your two datasets can be matched row-by-row, so from one (or two?) text entries you want to predict (1) a bit in {0,1} that determines experimental accuracy, and (2) a number in the interval [0,1], and the latter number denotes a noisy observation whose noise distribution depends on the first bit? I think that makes sense. You might consider using a Bernoulli distribution for part (1) and a Beta distribution for part (2), since those distributions are constrained to the correct spaces. If I were modeling this via a deep neural net, I’d make the last layer output logits for the Bernoulli and concentration parameters for the Beta. For example, in a fused neural net, I’d try a model like:

import torch
import pyro
import pyro.distributions as dist

my_nn = ...  # some neural network whose last layer outputs 3 values per row

def model(input_features, data1, data2):
    assert len(input_features) == len(data1)
    assert len(input_features) == len(data2)
    pyro.module("my_nn", my_nn)  # register the network's parameters with Pyro
    with pyro.plate("batch", len(input_features)):  # vectorize over minibatches
        params = my_nn(input_features)
        logits, c1, c0 = params.unbind(-1)
        # softplus keeps the Beta concentration parameters positive
        concentration1 = torch.nn.functional.softplus(c1)
        concentration0 = torch.nn.functional.softplus(c0)
        pyro.sample("obs1", dist.Bernoulli(logits=logits),
                    obs=data1)
        pyro.sample("obs2", dist.Beta(concentration1, concentration0),
                    obs=data2)

Thanks for the writeup @fritzo.

Apologies, my problem description was confusing. The two datasets are not matched row by row, but I think you understood everything else correctly.

Just to recap: dataset 1 has short strings mapped to a bit {0,1}, and dataset 2 has short strings mapped to a number in the interval [0,1]. Entries are not matched; each dataset has different examples. However, they are closely related: bit 1 corresponds to a value drawn from a normal distribution with mean approximately 0.9 and a small standard deviation. The reason for having bit 1 rather than a number is that the data was obtained with a less precise lab technique.

Dataset 2 is smaller than dataset 1. Hence, I would like to somehow integrate both to train a classifier on a larger dataset.

A straightforward option is to transform all numbers in dataset 2 into bits by mapping those above a certain threshold (close to 0.9) to bit 1 and the rest to bit 0, then train a regular deep neural network, or a probabilistic one that models outputs as samples from a Bernoulli.
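The thresholding option above is trivial to implement; the 0.9 cutoff is the value suggested in the text, not a tuned choice:

```python
# Hypothetical thresholding of dataset 2's real-valued labels into bits.
THRESHOLD = 0.9  # cutoff suggested above, not tuned

def to_bit(y: float) -> int:
    """Map a real-valued label in [0, 1] to a bit, dataset-1 style."""
    return 1 if y >= THRESHOLD else 0

values = [0.95, 0.91, 0.3, 0.88]
bits = [to_bit(y) for y in values]  # [1, 1, 0, 0]
```

The downside, of course, is that thresholding discards the extra precision dataset 2 provides.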

In my first message I suggested it might be possible to avoid this and still merge both datasets by modeling observations as samples from something like a (constrained) normal, or any distribution where I can explicitly model the spread. Is that feasible?