Current Competition Q&A

1. Ordernumber (in TimeTable)

The description for the feature 'Order Number' is given as "An integer indicating which number this activity is in the plan of this train." From what I understand, 'activity' can be V, D, A, K_A or K_V. I don't understand what you mean by 'number' in this context. The ordernumber seems to go up to 25. Could you clarify what it represents?


It represents an index indicating the order of the activities. A 2, for example, indicates that this is the second activity of this train number.


2. Pattern (in TimeTable)

Regarding 'Pattern', the data dictionary describes this as : "This string describes the timetable pattern to which this trainnumber belongs. This string is made up of 2 parts. The first is a character indicating the pattern, which indicates the hour pattern. The second part indicates the trainseries.". What exactly is the 'hour pattern'?


An hour pattern is a timetable pattern that repeats itself hourly. For example, nearly all trains belonging to the B500 pattern depart from Zl at XX:45. The 515 departs at 06:45, the 519 departs at 06:45, the 523 departs at 07:45 etc. The only exceptions are trains that run very early or very late in the day. The D pattern, for example, can then be expected to have a different time pattern. The A, C, E, G, etc. patterns all travel in the same direction as each other, as do the B, D, F, H, etc. patterns.
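As a minimal sketch of how an hour pattern could be checked in the data, the snippet below groups planned departures by pattern and location and reports the most common minute past the hour. The function name, the tuple layout and the sample rows are assumptions made for illustration only; they are not taken from the competition files.

```python
from collections import Counter, defaultdict

def dominant_minute(rows):
    """Return {(pattern, location): most common minute past the hour}."""
    minutes = defaultdict(Counter)
    for pattern, location, planned_time in rows:
        # planned_time is assumed to be a string of the form "HH:MM".
        minute = int(planned_time.split(":")[1])
        minutes[(pattern, location)][minute] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in minutes.items()}

# Illustrative rows only: (pattern, location, planned departure time).
sample = [
    ("B500", "Zl", "06:45"),
    ("B500", "Zl", "07:45"),
    ("B500", "Zl", "08:45"),
    ("B500", "Zl", "23:52"),  # a late train that deviates from the pattern
]
result = dominant_minute(sample)
```

Trains whose minute differs from the dominant one are then candidates for the early or late exceptions mentioned above.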


3. AverageAllowedSpeed (in InfraOverview): Specifically how this differs from TimetableSpeed (in TimeTable)

Regarding AverageAllowedSpeed, the description is as follows: "The average allowed speed on this track. Note that this nearly always lies below the maximum allowed speed and does not take into account the distance over which each speed limit is allowed. In the majority of cases the maximum allowed speed will be allowed across the majority of the tracks. Lower speed limits often occur during departure, arrival, or the passing of certain stations." Also, the TimetableSpeed is described as "The maximum speed allowed by the time table in kilometers per hour." This value seems to be 140 almost everywhere. So how exactly is the average allowed speed calculated from the maximum speed?

We did not provide detailed data on the Dutch infrastructure network. Instead, we provided a rough overview to give an idea of what the network looks like. In reality, each track is divided into many track sections that could each have their own speed limit. The speed mentioned here is merely an estimate, to give an idea of track capacity.

The speed in Timetablespeed is more reliable. This is based on an actual calculation for the specific train and for example its stopping pattern.

4. In the problem statement, getting stuck behind a slower train is listed as one of the situations that can cause delay.

Using the information provided in timetable, we know whether there is a train in front of the targeted train or not. However, the front train can cause delay only if the front train and targeted train are on the same track. How can we be sure that the two trains are on the same track and not on parallel tracks?

You cannot be sure. However, in the vast majority of cases the trains will run over the same tracks every day. In order to figure out which trains are likely to cause delay for other trains you can use the Cause column within the RealisationData.

5. The data shows that it is possible for a train to start a ride with a delay.

Is this because the previous ride that the train was used for finished with a delay? Is it possible to provide us with train IDs, so that we know which rides (Trainnumbers) are associated with the same train and are consecutive rides?

Yes, it is possible that trains start a ride with a delay. This could indeed be caused by the late arrival of a previous train, but other issues could also be the reason, for example the absence of crew members. The datafile RollingStockConnections.txt contains a list of arriving and departing trains that share the same rolling stock. Connections are only listed there if the connecting time interval is less than 30 minutes.

For example on line 2 of the file:

1             3132      Ah          3032

Train 3032 starts its ride at station Ah. From this line it can be seen that on Mondays (day number 1) the previous ride of the rolling stock for this train was on trainnumber 3132.
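A line like the one above could be turned into a lookup from a departing train to the ride that previously used the same rolling stock. This is only a sketch: it assumes the four fields are whitespace-separated and ordered as day number, arriving trainnumber, station, departing trainnumber, as in the example line. The function name is hypothetical.

```python
def parse_connections(lines):
    """Map (day, departing train) -> (arriving train, station)."""
    lookup = {}
    for line in lines:
        parts = line.split()
        # Skip blank lines, headers and anything that is not a 4-field record.
        if len(parts) != 4 or not parts[0].isdigit():
            continue
        day, arriving, station, departing = parts
        lookup[(int(day), departing)] = (arriving, station)
    return lookup

# The example line from the file, preceded by a made-up header line.
sample_lines = [
    "Day Arriving Station Departing",
    "1             3132      Ah          3032",
]
connections = parse_connections(sample_lines)
previous_ride = connections[(1, "3032")]
```

With such a lookup, the arrival delay of the previous ride can be attached to the departing train as a candidate explanation for a delayed start.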

6. The distances between some of the stations are not provided (e.g. Ut and Utwa). Is it possible to provide us with all distances?

All locations should be within the LocationDistances file. If you filter the From column on Ut, you will see Utwa in the To column. If a distance is missing from this file, please let us know.

Further Questions and answers from 24 May:

Regarding the availability of a full paper of the benchmark: 


No, the full thesis is only available in Dutch. The summary is aimed to give an idea of what work is already done and its limitations. We encourage the participants to come up with their own solution approach.


What does “stop”  in the data set “RealisationData” in the column “cause” mean?


The “Stopping procedure” entry means that something happened while the train was stopped that caused the delay. Think of doors that did not close on time because passengers were still getting on, or a change of train driver where the new driver was late, etc. The most important part here is that the delayed departure is not caused by another train.


Are the entries in the data sets “DriverSwitches”, “RollingStockConnections” and “RollingStockCompositionChanges” weekly recurring planned actions (and thus do not change) or are they actually realized actions that may change from one week to another?


These are the recurring planned actions. On the day itself there can of course be exceptions.

Further Questions and answers from 13 June:

There are 179 entries where the distance between From and To location is non-zero. For example, the last entry is ‘Dn-Dn-6.2’. I was expecting it to be zero. Could you provide the reason behind this? 

This is a technical issue with the query. Please assume that the distance is zero whenever From = To.

Further Questions and answers from 25 June:

For us to model the delay propagation, we require the pairwise distance between locations provided on the timetable. We have compiled a list of location pairs whose distance is missing. Could you kindly provide the missing distances in the excel attached?

A complete pairwise list of all locations is not possible, because locations that are further apart can be connected by multiple routes and therefore by multiple distances. For this reason the list provided in LocationDistances only contains directly bordering locations. If you are looking for a distance that crosses multiple locations, you should use the sequence found in the TimeTable data. For example, Bkl to Ac is not in LocationDistances. If you look up the timetable of the train for which you need this information, say train 814, you see that its sequence is Bkl, Aco, Ac. The distances Bkl to Aco and Aco to Ac can both be found within the LocationDistances file.
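The approach described above can be sketched as a sum over consecutive location pairs in a train's timetable sequence. The function name is hypothetical, the distance values are made up, and the fallback to the reversed pair assumes that distances are the same in both directions, which the files do not guarantee.

```python
def route_distance(sequence, distances):
    """Sum the distances of consecutive location pairs along a sequence."""
    total = 0.0
    for a, b in zip(sequence, sequence[1:]):
        if a == b:
            continue  # same location: distance taken as zero
        if (a, b) in distances:
            total += distances[(a, b)]
        elif (b, a) in distances:
            # Assumption: a missing direction can be taken from the reverse pair.
            total += distances[(b, a)]
        else:
            raise KeyError(f"no distance for {a} -> {b}")
    return total

# Made-up distances for the bordering pairs from the example.
distances = {("Bkl", "Aco"): 3.0, ("Aco", "Ac"): 2.0}
total = route_distance(["Bkl", "Aco", "Ac"], distances)
```

Raising an error on a missing pair, rather than silently skipping it, makes it easier to spot sequences that need a closer look.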


Further Questions and answers from 4 July:

For us to model the delay propagation, we require the pairwise distance between locations provided on the timetable. We have compiled a list of location pairs whose distance is missing. Could you kindly provide the missing distances in the excel attached?

If I look in the timetable file for train 516, I see the sequence MDO, NWKI, NWK, CPS, and all of these legs can be found within the LocationDistances file. I do recognize that inconsistencies can still be present due to routing or other differences between planning and realization, but these should be very rare. For those cases where you really cannot find the path within the TimeTable data, you could use the GPS locations of the stations for a distance calculation. Note that the further apart the stations are, the larger your error will be. For this reason I strongly advise using the TimeTable + LocationDistances whenever possible, and using the GPS distances only as a last resort. Also note that you can reasonably assume that trains of the same trainseries that go in the same direction will pass the same stations. See the attachment for a full list of locations.
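For the last-resort GPS calculation, one common choice is the haversine great-circle distance. The sketch below assumes the station coordinates are available as latitude and longitude in decimal degrees and uses a mean Earth radius of 6371 km; the function name is hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# One degree of longitude along the equator, as a sanity check.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

This is a straight-line distance over the Earth's surface, so it can only underestimate the length of the track between two stations.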


In the RealisationData.csv, the TrainCharacteristic column, trains having an 'LL' characteristic have major data discontinuities. I believe there are only 4 days in the entire file where there is data for 'LL' trains. Is that on purpose?

LL trains are single loc trains. These are not used to carry passengers and can be disregarded.


Also, the LocationDistance.csv doesn't have all the locations in the RealisationData.csv 'Locations' column. Example: from Utand to Uto, from Htbaand to Wz, from Mpand to Gn, from Asnand to Gn, from Mdoand to Nwk.

Some of the locations mentioned do not exist within the RealisationData, like Utand, Htbaand, Mpand, Asnand and Mdoand. Besides that, it is possible that the realization data contains sequences that are not within the LocationDistances. In these cases I would advise looking up the location sequence of that train within the timetable data and putting back any locations that are missing from the realization data and thus cause the inconsistencies. These cases should be rare, however.
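Putting back missing locations could look like the sketch below. It assumes that every observed location also occurs in the planned sequence, that each location occurs only once in that sequence, and that both lists are in travel order. The function name and the sample sequences are illustrative only.

```python
def fill_from_timetable(observed, planned):
    """Reinsert planned locations that are missing between observed ones."""
    if not observed:
        return []
    position = {location: index for index, location in enumerate(planned)}
    filled = []
    for a, b in zip(observed, observed[1:]):
        # Copy the planned stretch from a up to, but not including, b.
        filled.extend(planned[position[a]:position[b]])
    filled.append(observed[-1])
    return filled

# Illustrative sequences: one planned location was not measured.
planned = ["Mdo", "Nwki", "Nwk", "Cps"]
observed = ["Mdo", "Nwk", "Cps"]
restored = fill_from_timetable(observed, planned)
```

Every consecutive pair in the restored sequence is then a pair of bordering locations, which is what the LocationDistances file contains.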


The number of unique entries in the TimeTable.csv 'Location' column is 523, versus 408 entries in the infra_overview.xlsx and 478 in the 'Location' column in the RealisationData.csv file, giving a total of 667 unique locations in all three files. These 667 unique locations include 94 locations not found in the list of abbreviations, for example: Kpzhtb, Drokpz, Stbdro, Llsstb, Llsoa.

The mismatch in the number of locations is likely due to not all locations being measured (thus missing in realisation) or planned (thus missing in the TimeTable data). The list on the website is also not complete if it is missing Llsoa for example.


The 'TimeTable.csv' has 279900 rows that include null values (all in the PlannedTime column), which is about a third of its size.

In the TimeTable file I find 3625 rows with a null value for the PlannedTime. You can leave these rows out. The count should not be 279900 rows, however, so I advise you to check the process or program you use to read in the data.
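A quick way to cross-check such a count without any data-frame library is to read the file with the standard csv module. The sketch below is an assumption-laden example: the delimiter, the tokens treated as null and the in-memory sample are all made up, and the function name is hypothetical. It also shows how a wrong delimiter can make every row look like it has a missing PlannedTime.

```python
import csv
import io

def count_missing(handle, column="PlannedTime", delimiter=","):
    """Return (total rows, rows where the column is missing or empty)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    total = 0
    missing = 0
    for row in reader:
        total += 1
        value = row.get(column)
        # Assumed null markers; adjust to whatever the real file uses.
        if value is None or value.strip() in ("", "NULL", "null", "NaN"):
            missing += 1
    return total, missing

# Made-up sample standing in for the real file.
sample = "TrainNumber,Location,PlannedTime\n515,Zl,06:45\n515,Ut,\n519,Zl,NULL\n"
correct = count_missing(io.StringIO(sample))
wrong_delimiter = count_missing(io.StringIO(sample), delimiter=";")
```

If the two counts differ as strongly as they do here, the reading step rather than the data is the likely cause.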


Another issue is that the glossary doesn't contain an explanation for some fields, such as Activity_x, and Activity_y in the Realisation table.

There should be no columns named Activity_x and Activity_y. I would check that you are reading in the data correctly and that you are not doing a join that causes this x and y naming.
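If the data is processed with pandas, which is an assumption about the participant's setup, this kind of naming is what DataFrame.merge produces by default when both tables contain a column with the same name that is not part of the join key. The two small frames below are made up to illustrate the effect.

```python
import pandas as pd

# Made-up frames that both carry an "Activity" column.
timetable = pd.DataFrame(
    {"TrainNumber": [515], "Location": ["Zl"], "Activity": ["V"]}
)
realisation = pd.DataFrame(
    {"TrainNumber": [515], "Location": ["Zl"], "Activity": ["V"]}
)

# "Activity" is not in the join key, so pandas keeps both copies and
# renames them with its default suffixes "_x" and "_y".
partial = timetable.merge(realisation, on=["TrainNumber", "Location"])

# Including "Activity" in the join key leaves a single "Activity" column.
full = timetable.merge(realisation, on=["TrainNumber", "Location", "Activity"])
```

Whether Activity belongs in the join key depends on the analysis. The suffixes argument of merge can give the two copies more descriptive names instead.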
