2018 Archive

The 2018 Railroad Problem Solving Competition
Advanced analytical approaches applied to real-world rail problems
First Prize: $2000 --- Second Prize: $1000 --- Third Prize: $750
Predicting Near Term Train Schedule Performance and Delay

Here are the instructions and files for the 2018 problem solving competition:

Problem Benchmark Method

Terms Glossary

Data Dictionary (spreadsheet)

First Cut Data Sample (spreadsheet)

Complete Analysis Data Set (zip archive)

Description:

Railroads experience intense congestion owing to high volumes, large trains, and highly constrained track infrastructure. A relatively minor event such as a late crew or a defective car can cause large ripple effects throughout a rail network, delaying a large number of trains. Delay is usually viewed from the perspective of a single train, even though a train’s delay is often the result of problems created by other trains and events in the rail network.

In the Netherlands, passenger rail is a very important mode of transport, with over a million passengers travelling by train every day. Netherlands Railways is the biggest passenger train operator and runs almost 6,000 trains daily. ProRail is the infrastructure manager; it is responsible for track maintenance and coordinates train operations for all operators.

For smooth, punctual operations, reliable real-time data is of the utmost importance. In recent years, real-time data on delays across the entire network has been collected and made available to the operators. This data is used for passenger information systems and in dispatching centers.

Competition Objective:

Netherlands Railways and ProRail would like accurate forecasts for the rail network in the ultra-short term, say 20 minutes ahead. Currently, the most commonly used estimate of future delay is that the current delay remains unchanged. Obviously, this is often not the case. A good estimate of what the delays will be over the next period of time will benefit both passenger information and the dispatching of railway operations.

The goal of this competition is to create better forecasts of train performance and delays. Entrants will apply any appropriate advanced analytical techniques to better predict near-term train performance.
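
As an illustration of the "current delay remains unchanged" estimate mentioned above, the following minimal sketch treats it as a baseline forecast and scores it against realized delays. The delay values are made up, and mean absolute error is only an assumed yardstick, not the competition's official accuracy measure.

```python
from statistics import mean


def persistence_forecast(current_delay_min: float) -> float:
    """Baseline: the delay 20 minutes ahead equals the current delay."""
    return current_delay_min


def mean_absolute_error(predicted, realized):
    """Average absolute difference between predicted and realized delays."""
    return mean(abs(p - r) for p, r in zip(predicted, realized))


# Hypothetical (current delay, delay 20 minutes later) pairs, in minutes.
observations = [(2.0, 3.0), (0.0, 0.0), (5.0, 2.0), (1.0, 4.0)]

baseline = [persistence_forecast(now) for now, _ in observations]
realized = [later for _, later in observations]
baseline_mae = mean_absolute_error(baseline, realized)
```

Any candidate forecasting method can be scored with the same function and compared against `baseline_mae`.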

No rail experience necessary! A full, detailed description of the problem is forthcoming, and questions from participants are welcome!

Data:

ProRail will provide comprehensive data on train deviations from the planned schedule, including planned and realized train timetables, crew schedules, and rolling stock schedules. Contestants can apply advanced analytical techniques to this data to forecast future delays.

A sample of the timetable plan and realization is available on the competition website for those interested in previewing a subset of the data. More complete data sets and descriptions are forthcoming.

Competition Criteria:

The criteria that will be used to evaluate a solution include:

  • The accuracy of the solution – How accurate is the predicted train performance?
  • The soundness and sophistication of the solution approach –
    How advanced is the methodology, and how robust is its application?
  • The quality of the paper describing the approach – How clear is the explanation?
  • The quality of the presentation, to be given by three finalists at the Rail Application Section Meeting at the INFORMS conference in Phoenix, Sunday, November 4-Tuesday, November 7. (Attendance and presentation by at least one team member is required to win.)

Competition Awards:

  • First Prize: $2000
  • Second Prize: $1000
  • Third Prize: $750

Apart from the cash prizes, the first prize winner’s contribution will also be considered for publication in the journal Networks. The winning paper will receive an expedited refereeing and publication process.

Competition Calendar:

  • Registration: Deadline is April 30, 2018
    Interested parties should register by sending an email to:
    railwayapplicationssection@gmail.com
    Your email should include all team members' names, contact information, and affiliations.
  • Full Problem Release:  February 28, 2018
    Additional data, more specifics of the problem
  • Question and Answer period:  February 28 – July 1
    Participants may ask questions; all questions and answers available to all participants.
  • Quiet Period: July 1 - Aug 1.
    Participants may continue to work on solutions; no additional information provided.
  • August 1 – Solution due
    Solution includes report on methodology, and solution data set
    (format of solution data set provided Feb 28)
  • September 1 – Finalists announced
    Finalists are expected to give a presentation at INFORMS conference,
    Phoenix, November 4, 2018
  • November 4, 2018, Phoenix - Finalist Presentation.
    Each finalist gives a 15-20 minute presentation on their approach;
    Judging panel asks questions.
    All finalists must attend.
  • November 4, 2018, Phoenix - Winner Announced

Good luck in the competition!

Professor Michael F. Gorman, Ph.D.
University of Dayton


2018 PSC Results


Over 40 teams participated in this year's Rail Problem Competition, which seeks to predict delays in the Netherlands Railway passenger rail network. The three finalists used three very diverse methods.  The titles, abstracts and contributors are below. They will be presenting on Sunday, November 4 at the INFORMS Phoenix conference for first, second and third place.

 
Award Order:

First Prize: Forecasting Train Delays in the Netherlands using Neural Networks

Abstract: We investigate to what extent low-maintenance and out-of-the-box machine learning models can provide accurate predictions of train delays. We focus on predicting the actual delay but also the delay development. The results on real-life data from the Netherlands indicate that our models can outperform a constant prediction model.

Authors:

  1. Jørgen Thorlund Haahr (PhD), Decision Scientist, QAMPO, jth@qampo.com
  2. Erik Hellsten, PhD candidate, Technical University of Denmark, erohe@dtu.dk
  3. Evelien van der Hurk (PhD), Assistant Professor, Technical University of Denmark, evdh@dtu.dk

Second Place: Predicting Near-Term Train Schedule Performance and Delay Using Bi-Level Random Forests 

Abstract: Near-term train delay prediction is critical for railway management. We propose a bi-level random forest approach to predict train delays. The primary level predicts the delay category, and the secondary level estimates the delay (in minutes). The proposed model is compared with several alternative approaches, validating its superior accuracy.
 
Authors:
 
1. Mohammad Amin Nabian, PhD Candidate, Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign (email: mnabia2@illinois.edu)
2. Negin Alemazkoor, PhD Candidate, Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign (email: alemazk2@illinois.edu)
3. Hadi Meidani, Assistant Professor, Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign (email: meidani@illinois.edu)
 

Third Place: A Railway Delay Prediction Model Based on Non-Homogeneous Markov Chains
 
Abstract: Assuming the delay at a certain station depends only on the delay attained at the previous station, we model the delay evolution over stations as a non-homogeneous Markov chain. By discretizing the state space and retrieving transition matrices from historical data, we accurately predict delay using a probabilistic approach.
 
Authors:
 
1. Gao, Zheming  zgao5@ncsu.edu,  Department of Operations Research, NC State University
2. Luo, Haochen  hcluo@tamu.edu,  ISEN Department, Texas A&M University
3. Wu, Qian  hi_qianwu@tamu.edu,   ISEN Department, Texas A&M University
4. Xu, Jin  jinxu@tamu.edu,  ISEN Department, Texas A&M University






Q&A:

1. Ordernumber (in TimeTable)

The description for the feature 'Order Number' is given as "An integer indicating which number this activity is in the plan of this train." From what I understand, 'activity' can be V, D, A, K_A or K_V. I don't understand what you mean by 'number' in this context. The ordernumber seems to go up to 25. Could you clarify what this represents?

Response: 

It represents an index indicating the order of the activities. So a 2, for example, indicates that this is the second activity of this train number.

 

2. Pattern (in TimeTable)

 
Regarding 'Pattern', the data dictionary describes this as : "This string describes the timetable pattern to which this trainnumber belongs. This string is made up of 2 parts. The first is a character indicating the pattern, which indicates the hour pattern. The second part indicates the trainseries.". What exactly is the 'hour pattern'?
 

Response:

An hour pattern is a timetable pattern that repeats itself hourly. For example, nearly all trains belonging to the B500 pattern depart from Zl at XX:45: the 515 departs at 05:45, the 519 at 06:45, the 523 at 07:45, etc. The only exceptions are trains at very early or very late times in the day; the D pattern, for example, can then be expected to have a different time pattern. The A, C, E, G, … patterns all travel in the same direction as one another, and the B, D, F, H, … patterns likewise all travel in the same direction as one another.
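
A minimal sketch of how the Pattern string could be split into its two parts, and how the letter could be used to group patterns by direction. It assumes the first part is always a single uppercase letter, as in B500; the function names are hypothetical, and the direction rule only restates the A, C, E, G versus B, D, F, H grouping from the response.

```python
def split_pattern(pattern: str) -> tuple[str, str]:
    """Split e.g. 'B500' into the hour-pattern character and the train series."""
    return pattern[0], pattern[1:]


def same_direction(pattern_a: str, pattern_b: str) -> bool:
    """True when both pattern letters fall in the same group:
    A, C, E, G, ... or B, D, F, H, ... (assumes uppercase letters)."""
    letter_a, _ = split_pattern(pattern_a)
    letter_b, _ = split_pattern(pattern_b)
    return (ord(letter_a) - ord("A")) % 2 == (ord(letter_b) - ord("A")) % 2


hour_pattern, train_series = split_pattern("B500")
```

Comparing directions is presumably only meaningful for patterns of the same train series, so a caller may want to check that the second parts match first.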

 
 

3. AverageAllowedSpeed (in InfraOverview): Specifically how this differs from TimetableSpeed (in TimeTable)


Regarding AverageAllowedSpeed, the description is as follows: "The average allowed speed on this track. Note that this nearly always lies below the maximum allowed speed and does not take into account the distance over which each speed limit is allowed. In the majority of cases the maximum allowed speed will be allowed across the majority of the tracks. Lower speed limits often occur during departure, arrival, or the passing of certain stations." Also, the TimetableSpeed is described as "The maximum speed allowed by the time table in kilometers per hour." This value seems to be 140 almost everywhere. So how exactly is the average allowed speed calculated from the maximum speed?

Response:

We did not provide detailed data on the Dutch infrastructure network. Instead, we provided a rough overview to give an idea of what the network looks like. In reality, each track is divided into many track sections that could each have their own speed limits. The speed mentioned here is merely an estimate, to give an idea of track capacity.


The speed in TimetableSpeed is more reliable. It is based on an actual calculation for the specific train that takes into account, for example, its stopping pattern.

4. In the problem statement, getting stuck behind a slower train is listed as one of the situations that can cause delay.


Using the information provided in timetable, we know whether there is a train in front of the targeted train or not. However, the front train can cause delay only if the front train and targeted train are on the same track. How can we be sure that the two trains are on the same track and not on parallel tracks?

Response:

You cannot be sure. However, in the vast majority of cases the trains will run over the same tracks every day. To figure out which trains are likely to cause delay for other trains, you can use the Cause column within the RealisationData.

5. The data show that a train can start a ride with a delay.

Is this because the previous ride that the train was used for was finished with a delay? Is it possible to provide us train IDs, so that we know which rides (Trainnumbers) are associated with the same train and are consecutive rides? 

Response:

Yes, it is possible that trains start a ride with a delay. This could indeed be caused by the late arrival of a previous train (but other issues could also be the reason, for example the absence of crew members). The data file RollingStockConnections.txt contains a list of arriving and departing trains that share the same rolling stock. Connections are only listed there if the connecting time interval is less than 30 minutes.

For example on line 2 of the file:

1             3132      Ah          3032

Train 3032 starts its ride at station Ah. From this line it can be seen that on Mondays (day number 1) the previous ride of the rolling stock for this train was on trainnumber 3132.
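
A minimal sketch of reading such a line, assuming the four whitespace-separated columns are, in order, the day number, the arriving train, the station, and the departing train. The field names are hypothetical and not taken from the data dictionary.

```python
from typing import NamedTuple


class Connection(NamedTuple):
    day_number: int        # 1 = Monday, per the example above
    arriving_train: str    # previous ride of the rolling stock
    station: str           # where the rolling stock turns around
    departing_train: str   # ride that starts with this rolling stock


def parse_connection(line: str) -> Connection:
    """Parse one whitespace-separated line of the connections file."""
    day, arriving, station, departing = line.split()
    return Connection(int(day), arriving, station, departing)


def previous_ride_lookup(lines):
    """Map (day number, departing train) to the train that ran before it."""
    connections = (parse_connection(line) for line in lines)
    return {(c.day_number, c.departing_train): c.arriving_train for c in connections}


sample = "1             3132      Ah          3032"
lookup = previous_ride_lookup([sample])
```

With this lookup, the arrival delay of train 3132 on a Monday can be attached as a feature to the departure of train 3032 at Ah.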


6. The distances between some of the stations are not provided (e.g., Ut and Utwa). Is it possible to provide us all distances?

Response:

All locations should be within the LocationDistances file. If you filter the From column on Ut, you will see Utwa in the To column. If a distance is missing from this file, please let us know.

Further Questions and answers from 24 May:

Regarding the availability of a full paper of the benchmark: 

 

No, the full thesis is only available in Dutch. The summary is aimed to give an idea of what work is already done and its limitations. We encourage the participants to come up with their own solution approach.

 

What does “stop”  in the data set “RealisationData” in the column “cause” mean?

 

The “Stopping procedure” entry means that something happened while the train was stopped that caused the delay. For example, the doors did not close on time because passengers were still boarding, or the new train driver was late for a driver switch. The most important point is that the delayed departure was not caused by another train.

 

Are the entries in the data sets “DriverSwitches”, “RollingStockConnections” and “RollingStockCompositionChanges” weekly recurring planned actions (and thus do not change) or are they actually realized actions that may change from one week to another?

 

These are the recurring planned actions. On the day itself there can of course be exceptions.

Further Questions and answers from 13 June:

There are 179 entries where the From and To locations are the same but the distance is non-zero. For example, the last entry is ‘Dn-Dn-6.2’. I was expecting it to be zero. Could you provide the reason behind this?

This is a technical issue with the query. Please assume that the distance is zero whenever From = To.

Further Questions and answers from 25 June:

For us to model the delay propagation, we require the pairwise distance between locations provided on the timetable. We have compiled a list of location pairs whose distance is missing. Could you kindly provide the missing distances in the excel attached?

A complete pairwise list of all locations is not possible because, for locations that are further apart, multiple routes and thus multiple distances are possible. For this reason the list provided in LocationDistances only contains directly bordering locations. If you are looking for a distance that crosses multiple locations, you should use the sequence found in the TimeTable data. For example, Bkl to Ac is not in LocationDistances. If you look up the timetable of the train for which you need this information, say train 814, you see that the sequence for this train is: Bkl, Aco, Ac. The distances Bkl to Aco and Aco to Ac can be found within the LocationDistances file.
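
The procedure described in this answer can be sketched as summing the distances of bordering locations along a train's timetable sequence. The distance values below are made up for illustration, the function name is hypothetical, and falling back to the reversed pair is an assumption about the file rather than something the organizers state.

```python
def route_distance(sequence, adjacent_distance):
    """Sum distances between consecutive locations in a timetable sequence.

    adjacent_distance maps (From, To) pairs of directly bordering locations
    to a distance, as in LocationDistances. Raises KeyError if a pair is
    missing in both orders.
    """
    total = 0.0
    for origin, destination in zip(sequence, sequence[1:]):
        if origin == destination:
            continue  # From = To is treated as zero distance
        if (origin, destination) in adjacent_distance:
            total += adjacent_distance[(origin, destination)]
        else:
            # Assumption: a distance listed in the opposite order is reusable.
            total += adjacent_distance[(destination, origin)]
    return total


# Hypothetical distances; the real values come from LocationDistances.
distances = {("Bkl", "Aco"): 2.5, ("Aco", "Ac"): 1.5}
bkl_to_ac = route_distance(["Bkl", "Aco", "Ac"], distances)
```

A KeyError here signals a sequence that needs to be repaired from the TimeTable data before distances can be summed.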

The following files have been released for information today:
DriverChangesALL_update.txt
INFORMS_RAS_Questions_1806.docx

Further Questions and answers from 4 July:

For us to model the delay propagation, we require the pairwise distance between locations provided on the timetable. We have compiled a list of location pairs whose distance is missing. Could you kindly provide the missing distances in the excel attached?

If I look in the timetable file for train 516, I see MDO to NWKI to NWK to CPS, and all of these can be found within the LocationDistances file. However, I do recognize that inconsistencies can still be present due to routing or other differences between planning and realization. These should be very rare. For those cases where you really cannot find the path within the TimeTable data, you could use the GPS locations of the stations for a distance calculation. Note that the further apart the stations are, the larger your error will be. For this reason I strongly advise using TimeTable + LocationDistances whenever possible, and using the GPS distances only as a last resort. Also note that you can reasonably assume that trains of the same trainseries that go in the same direction will pass the same stations. See the attachment for a full list of locations.
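
For the last-resort GPS fallback, one common choice is the haversine great-circle distance, sketched below. The organizers do not name a formula, so this is an assumption; the function name and the mean Earth radius of 6371 km are likewise choices made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_phi = radians(lat2 - lat1)
    delta_lambda = radians(lon2 - lon1)
    a = sin(delta_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(delta_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of latitude along a meridian, as a sanity check.
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
```

Because track rarely follows a straight line, this value underestimates the travelled distance, and the gap grows with the distance between stations, which matches the warning above.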

 

In the TrainCharacteristic column of RealisationData.csv, trains with an 'LL' characteristic have major data discontinuities. I believe there are only 4 days in the entire file where there is data for 'LL' trains. Is that on purpose?

LL trains are single-locomotive trains. These are not used to carry passengers and can be disregarded.

 

Also, the LocationDistance.csv doesn't have all the locations in the RealisationData.csv 'Locations' column. Example: from Utand to Uto, from Htbaand to Wz, from Mpand to Gn, from Asnand to Gn, from Mdoand to Nwk.

Some of the locations mentioned do not exist within the RealisationData, such as Utand, Htbaand, Mpand, Asnand and Mdoand. Apart from that, it is possible that sequences found within the realization data are not within LocationDistances. In these cases I would advise looking up the location sequence of that train within the timetable data and putting back any locations that are missing from the realization data, since these gaps can cause inconsistencies. These cases should be rare, however.

 

The number of unique entries in the TimeTable.csv 'Location' column is 523, versus 408 entries in the infra_overview.xlsx and 478 in the 'Location' column in the RealisationData.csv file giving a total of 667 unique locations in all three files. These 667 unique locations include 94 locations not found in the list of abbreviations found here: http://www.rolandrail.net/drgl/afko.htm, example: Kpzhtb, Drokpz, Stbdro, Llsstb, Llsoa.

The mismatch in the number of locations is likely due to not all locations being measured (thus missing in realisation) or planned (thus missing in the TimeTable data). The list on the website is also not complete if it is missing Llsoa for example.

 

The 'TimeTable.csv' has 279900 rows that include null values (all in the PlannedTime column), which is about a third of its size.

In the TimeTable file I find 3625 rows with a null value for the PlannedTime. You can leave these rows out. The count should not be 279900 rows, however, so I advise you to check the process or program you use to read in the data.
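
One way to check a reader against the count quoted in this answer is to count empty PlannedTime values on the raw text, before any type conversion. The sketch below assumes a comma-separated file with a header row and an assumed set of null spellings; the sample rows, times, and column names other than PlannedTime are made up.

```python
import csv
import io

# Assumed spellings of a null value; adjust to what the file actually uses.
NULL_MARKERS = {"", "NULL", "null", "NaN"}


def drop_null_planned_time(rows):
    """Return the rows with a usable PlannedTime and the number dropped."""
    kept = [
        row for row in rows
        if (row["PlannedTime"] or "").strip() not in NULL_MARKERS
    ]
    return kept, len(rows) - len(kept)


# Hypothetical three-row sample standing in for TimeTable.csv.
sample = io.StringIO(
    "TrainNumber,Location,PlannedTime\n"
    "516,Mdo,06:45\n"
    "516,Nwki,\n"
    "516,Nwk,06:52\n"
)
rows = list(csv.DictReader(sample))
kept, dropped = drop_null_planned_time(rows)
```

If the dropped count on the real file is far from 3625, the delimiter, quoting, or date parsing used when loading is a likely culprit.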

 

Another issue is that the glossary doesn't contain an explanation for some fields, such as Activity_x, and Activity_y in the Realisation table.

There should be no columns named Activity_x and Activity_y. I would check that you are reading in the data correctly and that you are not performing a join that produces this x and y naming.

New file posted today:

Station_Info.xlsx