Hi, I’m working on a use case that I’m trying to solve with a Pyro Bayesian nnet, but I cannot get good results. Is anyone interested in taking a look at my approach and critiquing it, or helping me improve it?
My main concern is that I cannot get my model anywhere near the performance of some extremely naive approaches (see below). I read in other posts that if the number of parameters is much higher than the number of datapoints, I shouldn’t expect good performance; but that is not the case here.
The dataset
The dataset is a simplified and preprocessed version of a Kaggle competition dataset. The goal is to predict the arrival delay of flights from a small set of features (24 in total), such as “flight distance”, “departure time”, “arrival time”, “origin”, “destination”, etc.
An example of some of the features:
The training set consists of 4,291,428 flight records (with 24 features each).
The test set has 922,928 additional samples.
My target distribution looks like the graph below:
The model
I’m modelling the delays with an Exponential distribution, and I’m fitting a Bayesian nnet to output its rate from the features.
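To make that concrete, here is a minimal sketch of the kind of model I mean (not my exact code; the single hidden layer, hidden size and Normal(0, 1) priors are placeholders):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.nn import PyroModule, PyroSample

class DelayNet(PyroModule):
    """One-hidden-layer Bayesian net; its output is the Exponential rate."""
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.fc1 = PyroModule[torch.nn.Linear](n_features, hidden)
        self.fc1.weight = PyroSample(dist.Normal(0., 1.).expand([hidden, n_features]).to_event(2))
        self.fc1.bias = PyroSample(dist.Normal(0., 1.).expand([hidden]).to_event(1))
        self.fc2 = PyroModule[torch.nn.Linear](hidden, 1)
        self.fc2.weight = PyroSample(dist.Normal(0., 1.).expand([1, hidden]).to_event(2))
        self.fc2.bias = PyroSample(dist.Normal(0., 1.).expand([1]).to_event(1))

    def forward(self, x, y=None):
        h = torch.relu(self.fc1(x))
        # softplus keeps the Exponential rate strictly positive
        rate = torch.nn.functional.softplus(self.fc2(h)).squeeze(-1)
        with pyro.plate("data", x.shape[0]):
            pyro.sample("obs", dist.Exponential(rate), obs=y)
        return rate
```

In this sketch, training would go through pyro.infer.SVI with an autoguide such as AutoNormal; my actual model and training setup are in the code linked below.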
Evaluation
In order to evaluate the output, I pick a delay threshold (e.g. 60 min) and use the model to predict the probability of a flight being delayed by more than this threshold. Then I use an uplift curve to compare models: I split the probability predictions into deciles and compute the precision within each decile. See an example in “the baseline” section.
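Roughly, the evaluation looks like the sketch below (the stand-in data is just for illustration). For an Exponential(rate), P(delay > t) = exp(-rate * t), so the threshold probability comes directly from the predicted rate:

```python
import numpy as np
import pandas as pd

def decile_precision(p_delay, delayed, n_bins=10):
    """Fraction of actually-delayed flights within each predicted-probability decile."""
    df = pd.DataFrame({"p": p_delay, "y": delayed})
    df["decile"] = pd.qcut(df["p"], n_bins, labels=False, duplicates="drop")
    return df.groupby("decile")["y"].mean()

threshold = 60.0                                           # minutes
rate = np.random.uniform(0.005, 0.05, size=1000)           # stand-in for predicted Exponential rates
delay = np.random.exponential(scale=1 / 0.02, size=1000)   # stand-in for true delays
p_delay = np.exp(-rate * threshold)                        # P(delay > threshold) under Exponential(rate)
print(decile_precision(p_delay, (delay > threshold).astype(int)))
```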
The baseline
As a baseline I’m using the following approaches:
- a groupby mean on categorical features (see the sketch after this list).
- fitting a (non-Bayesian) nnet whose loss is the mean negative log-likelihood of the data under different likelihood distributions (e.g. a Gamma).
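A sketch of the groupby baseline (the column names origin, destination and arrival_delay are assumptions, not necessarily the real ones): estimate P(delay > threshold) for each category combination on the training set and look it up for the test set.

```python
import pandas as pd

THRESHOLD = 60.0                       # minutes
CAT_COLS = ["origin", "destination"]   # assumed categorical features

def groupby_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    train = train.copy()
    train["delayed"] = (train["arrival_delay"] > THRESHOLD).astype(int)
    # empirical delay rate per category combination
    rates = (train.groupby(CAT_COLS)["delayed"].mean()
                  .rename("p_delay").reset_index())
    pred = test.merge(rates, on=CAT_COLS, how="left")
    # unseen category combinations fall back to the global delay rate
    return pred["p_delay"].fillna(train["delayed"].mean())
```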
Next, the results of the groupby baseline:
The code (plus data, plus instructions)