I’m working on a problem with two large datasets where strings are mapped to numbers. In other words, it’s a supervised classification task where I want to learn a function to map strings to numbers, and I have a few hundred thousand examples to learn from.
In dataset 1, the label is a bit (0/1), as in typical binary classification problems. In dataset 2, it is a real number in the interval [0, 1]. The bit in dataset 1 actually reflects experimental uncertainty: the measurement technique could not recover the exact real value. For bit 1, the underlying value is distributed roughly in the interval 0.9–1 (and can be modeled); for bit 0, it follows some wider distribution with a much smaller mean.
I already have a deep network that works well for dataset 2.
I plan to extend this network into a deep probabilistic one with two outputs, mu and sigma, parameterizing each observation as a normal distribution. For training data from dataset 2, I would supply a small sigma. An immediate advantage is that I get uncertainty estimates for predictions.
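To make the idea concrete, here is a minimal numpy sketch of the loss this implies: a heteroscedastic Gaussian negative log-likelihood where each example carries its own sigma. The specific mu/sigma/y values are made up for illustration; in practice mu would come from the network and sigma from the per-dataset noise model.

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Per-example negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

# Hypothetical network outputs and targets for three dataset-2 examples.
mu = np.array([0.95, 0.30, 0.70])
sigma = np.array([0.02, 0.02, 0.02])  # small sigma: dataset-2 labels are precise
y = np.array([0.94, 0.33, 0.68])      # observed targets

loss = gaussian_nll(y, mu, sigma).mean()
```

Note that with a fixed small sigma this reduces to (scaled) mean squared error on dataset 2; the sigma only starts to matter once examples with different noise levels are mixed in.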
This would also let me incorporate dataset 1, where I would map bit 1 to a normal whose mean is close to 1 but whose sigma is much larger than for dataset 2.
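The merging step could then be as simple as mapping every example, from either dataset, to a (target mean, target sigma) pair. The numeric constants below are purely illustrative assumptions; the real values would come from calibrating the two lab techniques.

```python
# Illustrative noise levels (assumptions, not calibrated values).
SIGMA_PRECISE = 0.02   # dataset 2: precise measurements
SIGMA_BIT1    = 0.05   # dataset 1, bit 1: true value somewhere in ~[0.9, 1]
SIGMA_BIT0    = 0.15   # dataset 1, bit 0: wider, lower-mean distribution

def target_from_dataset2(value):
    # A precise real-valued measurement: trust it with a small sigma.
    return value, SIGMA_PRECISE

def target_from_dataset1(bit):
    # Map each bit to the (mean, spread) of its assumed noise model.
    return (0.95, SIGMA_BIT1) if bit == 1 else (0.2, SIGMA_BIT0)
```

Training then runs on the union of both datasets with the Gaussian likelihood, feeding each example's own sigma into the loss.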
Is this approach correct? I couldn’t find a lot of literature on uncertainty in observations.
If I understand correctly, your two datasets can be matched row-by-row, so from one (or two?) text entries you want to predict (1) a bit in {0,1} that indicates the experiment's accuracy, and (2) a number in the interval [0,1], where the latter number is a noisy observation whose noise distribution depends on the first bit? I think that makes sense. You might consider using a Bernoulli distribution for part (1) and a Beta distribution for part (2), since those distributions are constrained to the correct spaces. If I were modeling this with a deep neural net, I'd make the last layer output logits for the Bernoulli and concentration parameters for the Beta. For example, in a fused neural net, I'd try to model both targets with a shared trunk and two output heads: one producing the Bernoulli logit, the other the two Beta concentration parameters.
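A small sketch of the output layer I have in mind, in numpy for brevity (in practice this would be the final layer of your network). The only non-obvious part is constraining the raw outputs: a sigmoid keeps the Bernoulli probability in (0, 1), and a softplus keeps the Beta concentrations positive. The three-output layout is my assumption.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus; maps any real to a positive number.
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def heads(raw):
    """Map 3 raw network outputs to (Bernoulli p, Beta alpha, Beta beta)."""
    logit, a_raw, b_raw = raw
    return sigmoid(logit), softplus(a_raw), softplus(b_raw)

p, alpha, beta = heads(np.array([0.0, 1.0, -1.0]))
```

The training loss would then be the sum of the Bernoulli and Beta negative log-likelihoods for each matched row.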
Apologies, my problem description was confusing. The two datasets are not matched row by row, but I think you got everything else correctly.
Just to recap: dataset 1 has short strings mapped to a bit in {0,1}; dataset 2 has short strings mapped to a number in the interval [0,1]. Entries are not matched; each dataset has different examples. However, they are closely related. Bit 1 corresponds to a value drawn from a normal distribution with mean approximately 0.9 and a small standard deviation. The reason for having a bit rather than a number is that this data was obtained with a less precise lab technique.
Dataset 2 is smaller than dataset 1. Hence, I would like to somehow integrate both to train a classifier on a larger dataset.
A straightforward option is to transform all numbers in dataset 2 into bits by mapping those above a certain threshold (close to 0.9) to bit 1 and the rest to bit 0, then train a regular deep neural network, or a probabilistic one modelling outputs as samples from a Bernoulli distribution.
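The thresholding baseline is a one-liner; the only free choice is the cutoff, which here I set near the 0.9 boundary suggested by the bit-1 noise model (an assumption, not a calibrated value).

```python
import numpy as np

THRESHOLD = 0.9  # assumed boundary, matching the bit-1 distribution

def to_bit(value, threshold=THRESHOLD):
    # Collapse a precise measurement to the coarse binary label.
    return 1 if value >= threshold else 0

values = np.array([0.95, 0.40, 0.91, 0.10])
bits = [to_bit(v) for v in values]  # → [1, 0, 1, 0]
```

The obvious cost is that this throws away dataset 2's precision, which is what the likelihood-based merging above tries to avoid.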
In my first message I asked whether it might be possible to avoid this and still merge both datasets by modeling observations as samples from something like a (constrained) normal, or any distribution where I can explicitly model the spread?