Predicting the occupancy of NMBS trains using logs from the SpitsGids app (v1.0)

Gilles Vandewiele
5 min read · Nov 11, 2016


I got in contact with Pieter Colpaert. He told me about the SpitsGids app, a neat initiative in Belgium where people can report how occupied their train is, so that others are encouraged to take an earlier or later train when theirs is going to be full. All the occupancy events coming from this app were logged and published. Now, wouldn’t it be cool if we combined this data with machine learning to try to predict whether a train is going to be full or not?

By the way, definitely check out similar blog posts by Kris Peeters (https://dataminded.be/blog/predicting-occupancy-nmbs-trains) and Nathan Bijnens (https://t.co/BSI8JTlr6x), who worked on the same dataset. I’m planning to incorporate some of their features into my model in the near future (together with event and delay data).

[Figure: information provided in the logs]
[Figure: our goal]

Loading and exploring the data

The two log files contained 3735 entries in total. Additionally, a CSV file with information about all stations (https://github.com/iRail/stations) was used, which links the URIs of the from_ and to_station to their string representations and coordinates. After parsing, 194 faulty logs were discarded. For now, I simply dropped duplicates based on the date (hour, quarter of the hour, day, month, year), vehicle and from_station, without keeping the most frequently occurring label. In total, 1555 duplicates were removed; 1400 of these matched even on the exact querytime, so one log file was probably a subset of the other. The final training set therefore contains 1986 entries. The features used in my final baseline model were the following (a sketch of the feature construction follows the list):

  • the weekday (one-hot encoded)
  • the from_station and to_station (string representation, one-hot encoded)
  • the latitude and longitude of the from_ and to_station
  • the vehicle type (IC, L, P, S and something called THA, presumably Thalys; again one-hot encoded)
  • the month (which was not very informative, since the data was only extensively collected during a limited timespan)
  • the number of seconds since midnight
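
For concreteness, here is a minimal sketch of how such a feature matrix could be built with pandas. The file and column names (parsed_logs.csv, querytime, from_lat, …) are illustrative assumptions, not the ones from my actual code:

import pandas as pd

# Assumed input: one parsed log per row, with the station names and
# coordinates already joined in from the iRail stations CSV.
logs = pd.read_csv('parsed_logs.csv', parse_dates=['querytime'])

# Number of seconds since midnight, derived from the query time.
qt = logs['querytime']
logs['seconds_since_midnight'] = (qt.dt.hour * 3600
                                  + qt.dt.minute * 60
                                  + qt.dt.second)
logs['weekday'] = qt.dt.weekday
logs['month'] = qt.dt.month

# One-hot encode the categorical features; the hundreds of distinct
# station names are what push the feature count up to 508.
features = pd.get_dummies(
    logs[['weekday', 'from_station', 'to_station',
          'vehicle_type', 'month']].astype(str))

# The numerical features are kept as-is.
for col in ['from_lat', 'from_lng', 'to_lat', 'to_lng',
            'seconds_since_midnight']:
    features[col] = logs[col]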

This results in 508 features in total (mostly due to the one-hot encoding of the stations). Now let’s visualize some of these features, starting with the distribution of the classes.

It seems that trains tend to be empty more often than full! Most of the time, though, trains are medium full. Now, on to the vehicle types.

The labels on the bars are the counts. It seems that most events come from IC trains, and that P trains (piekuur, peak hour) are full most often. Of the train types I know, the one I take (L) is most often empty. Does the day of the week matter as well?

It seems that Friday is the worst day to take a train. Quite surprisingly, Sunday comes second. The reason could be that students take the train on these days (and students probably make up the largest part of the SpitsGids user group).

The baseline model

I used XGBoost, the Kaggle-winning classification algorithm, as a model, and Bayesian optimization to tune its hyper-parameters (of which there are quite a lot). I then used the following configuration for the tests:

from xgboost import XGBClassifier

clf = XGBClassifier(learning_rate=0.075, n_estimators=1750,
                    gamma=0.9, subsample=0.75, colsample_bytree=0.7,
                    nthread=1, scale_pos_weight=1, reg_lambda=0.25,
                    min_child_weight=5, max_depth=13)
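
The post doesn’t pin down which Bayesian optimization implementation I used; as an illustration, here is a sketch with the bayesian-optimization package. The search bounds are made up, X and y are the feature matrix and labels from before, and the small n_estimators is just to keep tuning fast:

from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def xgb_cv(learning_rate, max_depth, subsample, colsample_bytree,
           gamma, min_child_weight, reg_lambda):
    # Integer-valued parameters arrive as floats, so cast them.
    clf = XGBClassifier(learning_rate=learning_rate,
                        max_depth=int(max_depth),
                        subsample=subsample,
                        colsample_bytree=colsample_bytree,
                        gamma=gamma,
                        min_child_weight=int(min_child_weight),
                        reg_lambda=reg_lambda,
                        n_estimators=250)
    return cross_val_score(clf, X, y, cv=3, scoring='accuracy').mean()

optimizer = BayesianOptimization(
    f=xgb_cv,
    pbounds={'learning_rate': (0.01, 0.3), 'max_depth': (3, 15),
             'subsample': (0.5, 1.0), 'colsample_bytree': (0.5, 1.0),
             'gamma': (0.0, 1.0), 'min_child_weight': (1, 10),
             'reg_lambda': (0.0, 1.0)},
    random_state=42)
optimizer.maximize(init_points=5, n_iter=25)
print(optimizer.max)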

Then, stratified 5-fold cross-validation was used to evaluate the model.
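The evaluation loop can look roughly like this (clf is the XGBClassifier configured above; features and labels are assumed to be the pandas objects holding the 508 features and the low/medium/high occupancy labels):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = features.values, labels.values
print('Features dataframe dimensions: %d x %d' % features.shape)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []
for i, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    print('Fold %d / 5' % i)
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])
    # Confusion matrix per fold: rows are true classes, columns predictions.
    print(confusion_matrix(y[test_idx], preds))
    acc = accuracy_score(y[test_idx], preds)
    print('accuracy:', acc)
    accuracies.append(acc)
print('Avg accuracy:', np.mean(accuracies))

Running this gave the following output: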

Features dataframe dimensions: 1986 x 508
Fold 1 / 5
[[105 27 29]
[ 42 37 31]
[ 34 31 59]]
accuracy: 0.508860759494
Fold 2 / 5
[[111 23 27]
[ 30 42 38]
[ 30 29 64]]
accuracy: 0.55076142132
Fold 3 / 5
[[106 24 30]
[ 41 43 26]
[ 33 28 62]]
accuracy: 0.5368956743
Fold 4 / 5
[[106 30 24]
[ 32 48 30]
[ 19 30 74]]
accuracy: 0.580152671756
Fold 5 / 5
[[95 28 37]
[45 35 30]
[41 27 55]]
accuracy: 0.470737913486
Avg accuracy: 0.529481688071

So: a rather bad performance for the medium class, and not too good for the high class either… And these are the 40 most important features from one of the five folds. It seems that the number of seconds since midnight is the most important feature (probably because of the higher occupancies during the morning and evening peaks), followed by the coordinates of the stations, then the month (which could be due to a bias in the data collection), the weekdays and the vehicle types. Applying feature selection with BorutaPy (a sketch is shown below) did not improve the accuracy.
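
For reference, applying BorutaPy looks roughly like this. BorutaPy wants a tree-based estimator that exposes feature_importances_, so a random forest is the usual choice; the exact parameters I used are not reproduced here:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# BorutaPy needs an estimator exposing feature_importances_.
rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)  # expects plain numpy arrays, not DataFrames

# Keep only the features Boruta marked as relevant.
X_selected = selector.transform(X)
print('kept %d of %d features' % (selector.support_.sum(), X.shape[1]))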

Finally, I plotted a learning curve using Logistic Regression:
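A curve like this is easy to reproduce with scikit-learn’s learning_curve helper; a sketch (the plotting details of my actual figure may differ):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Train on growing fractions of the data and cross-validate each time.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring='accuracy')

plt.plot(train_sizes, train_scores.mean(axis=1), label='training score')
plt.plot(train_sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()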

So it seems that more data can increase the model’s performance!

Using weather data

I got the weather information for June through November from https://www.worldweatheronline.com/. The following information was retrieved: temperature, humidity, wind speed, visibility and the weather category (available both as a numerical code and as a string; the string was one-hot encoded). This information was retrieved for both the from_ and the to_station, resulting in a total of 547 features. Unfortunately, 18 records had to be discarded because something went wrong when retrieving the weather for those stations (I fixed four of them manually by substituting the weather of a nearby provincial capital). I then ran XGBoost again with the same hyper-parameters.
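The join of the weather data onto the logs could look something like the sketch below; the weather table layout and column names are assumptions for illustration, building on the logs frame from earlier:

import pandas as pd

# Hypothetical weather table: one row per (station, date) with columns
# station, date, temperature, humidity, windspeed, visibility,
# weather_code and weather_desc.
weather = pd.read_csv('weather.csv', parse_dates=['date'])

# One-hot encode the textual weather description.
weather = pd.get_dummies(weather, columns=['weather_desc'])

# Attach the weather of both endpoints of each trip.
logs['date'] = logs['querytime'].dt.normalize()
for side in ['from', 'to']:
    # Prefix the weather columns so the from_ and to_ joins don't clash.
    w = weather.rename(columns=lambda c: c if c == 'date'
                       else side + '_' + c)
    logs = logs.merge(w, on=[side + '_station', 'date'], how='left')

Here’s the output: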

Fold 1 / 5
[[110 23 27]
[ 40 41 29]
[ 35 27 61]]
accuracy: 0.539440203562
Fold 2 / 5
[[114 24 22]
[ 31 45 34]
[ 32 27 64]]
accuracy: 0.567430025445
Fold 3 / 5
[[107 26 27]
[ 39 41 30]
[ 37 23 63]]
accuracy: 0.5368956743
Fold 4 / 5
[[97 27 36]
[39 38 33]
[24 29 69]]
accuracy: 0.520408163265
Fold 5 / 5
[[95 30 34]
[41 38 30]
[29 21 72]]
accuracy: 0.525641025641
Avg accuracy: 0.537963018443

So only a small improvement in average accuracy, but a much more stable model: the fold accuracies now range from 0.52 to 0.57, compared to 0.47 to 0.58 before. And the feature importances:

And another learning curve:
