Collection of railway traffic data


Traffic can be a huge challenge for most commuters, regardless of the transportation method they choose. Delays and congestion, for example, are inevitable during rush hours. Every mode of commuting has its own characteristics when it comes to delays: cars and buses suffer from traffic jams, and similar principles apply to railways as well. However, the causes of railway delays are not that straightforward and need further investigation. In my experience, most passengers are not aware of the reasons behind train delays, even though they may encounter them several times a day. To get a clearer understanding, I started collecting data from various sources at the beginning of 2019.

Traffic dataset

I found that the most reliable publicly available source of traffic data is the official map of Hungarian State Railways, where all trains can be tracked in real time. The map uses Google Maps to display the trains that are currently traveling. The trains are color-coded by delay: green means a delay of at most 5 minutes, yellow means 6 to 14 minutes, orange means 15 to 59 minutes, and red means an hour or more.
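The color coding can be written down as a small lookup. The following is a minimal sketch, not code from the map itself: the function name `delay_color` is hypothetical, delays are assumed to be whole minutes, and treating exactly 60 minutes as red is an assumption based on the orange band ending at 59.

```python
def delay_color(delay_minutes: int) -> str:
    """Map a delay in whole minutes to the color bucket used on the map."""
    if delay_minutes <= 5:
        return "green"
    if delay_minutes <= 14:
        return "yellow"
    if delay_minutes <= 59:
        return "orange"
    # Assumption: anything from 60 minutes upwards is shown in red.
    return "red"


colors = [delay_color(d) for d in (1, 6, 15, 60)]
```

Checking the smaller thresholds first keeps each branch to a single comparison.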

Real-time map of Hungarian State Railways

A snapshot of the map contains, in JSON format, the following information about each train that was active when the snapshot was taken:

| Field name | Example value | Note |
| --- | --- | --- |
| @CreationTime | “2020.06.25 17:41:05” | Timestamp of the snapshot |
| @ElviraID | “5740486_200625” | Daily unique ID of a given train |
| @Menetvonal | “MAV” | Operator of a given train |
| @Line | “100” | Current railway line of a given train |
| @TrainNumber | “552761” | Unique ID of a given train |
| @Relation | “Monor - Budapest-Nyugati” | Terminals of a given train |
| @Lat | 47.48300 | Current latitude of a given train |
| @Lon | 19.12723 | Current longitude of a given train |
| @Delay | 1 | Current delay of a given train, in minutes |
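To illustrate how such a raw record could be turned into something easier to work with, here is a minimal sketch. The function name `normalize_train` is hypothetical, and the output key names simply mirror the uploaded JSON shown later in this article. Deriving the `date` from `@CreationTime` is an assumption, since the stored timestamps may come from the collector's own clock.

```python
from datetime import datetime


def normalize_train(raw: dict) -> dict:
    """Convert one raw map record into a flat record with plain key names."""
    created = datetime.strptime(raw["@CreationTime"], "%Y.%m.%d %H:%M:%S")
    return {
        "date": created.isoformat(),
        "elvira_id": raw["@ElviraID"],
        "operator": raw["@Menetvonal"],
        "line": raw["@Line"],
        "train_id": raw["@TrainNumber"],
        # GeoJSON points list longitude first, then latitude.
        "loc": {"type": "Point", "coordinates": [raw["@Lon"], raw["@Lat"]]},
        "relation": raw["@Relation"],
        "delay": int(raw["@Delay"]),
    }


# The example values from the field list above.
raw_example = {
    "@CreationTime": "2020.06.25 17:41:05",
    "@ElviraID": "5740486_200625",
    "@Menetvonal": "MAV",
    "@Line": "100",
    "@TrainNumber": "552761",
    "@Relation": "Monor - Budapest-Nyugati",
    "@Lat": 47.48300,
    "@Lon": 19.12723,
    "@Delay": 1,
}
record = normalize_train(raw_example)
```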

Weather dataset

In addition to the traffic data, I also collected the corresponding weather data for every train, because I suspect that weather influences the delays as well. It was not easy to find a free provider capable of handling the necessary number of requests, but after many trials I decided to use OpenWeatherMap. Its free tier allows 60 location-based weather requests per minute. That is still not enough to query the weather for every individual train, but it is sufficient to place virtual weather stations all over Hungary with a resolution of approximately 35.5 km.

Virtual weather station. A virtual weather station is a GPS position which can be queried for up-to-date local weather information.
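As an illustration, a virtual weather station can be modeled as a named coordinate pair that knows how to build its own weather request. This is a minimal sketch: the class and method names are hypothetical, the Budapest coordinates are approximate, and the use of OpenWeatherMap's current-weather endpoint with metric units is an assumption about which API call the collector makes. No request is actually sent here.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# OpenWeatherMap current-weather endpoint (assumed to be the one queried).
OWM_CURRENT_WEATHER = "https://api.openweathermap.org/data/2.5/weather"


@dataclass(frozen=True)
class VirtualWeatherStation:
    name: str
    lat: float
    lon: float

    def request_url(self, api_key: str) -> str:
        """Build the URL that returns the current weather at this position."""
        query = urlencode({
            "lat": self.lat,
            "lon": self.lon,
            "units": "metric",  # Celsius and m/s, matching the units in this article
            "appid": api_key,
        })
        return f"{OWM_CURRENT_WEATHER}?{query}"


# Illustrative station with approximate coordinates and a placeholder key.
station = VirtualWeatherStation("Budapest", 47.4979, 19.0402)
url = station.request_url("YOUR_API_KEY")
```

With 60 stations and one request per station per minute, this stays within the free tier's limit of 60 requests per minute.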

Positions of the virtual weather stations

The first task is to distribute the 60 available slots uniformly, so that every train can be assigned to the closest virtual weather station. Finding an exact solution to the problem would have been infeasible, so I developed an approximation algorithm instead. It uses the GeoNames geographical database, which contains points of interest (POIs) in Hungary and can be downloaded free of charge under the Creative Commons Attribution 4.0 license. The algorithm is based on a k-d tree, which allows fast nearest-neighbor searches among the POIs. The result is an approximately uniform placement of virtual weather stations, located in densely populated areas where accurate weather information benefits more people.
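The exact placement algorithm is not spelled out here, so the following is only a sketch of one way a k-d tree could be used for this task, not the algorithm actually used. It lays a regular grid over the bounding box of the POIs and snaps the center of each cell to its nearest POI. All names are hypothetical, the POI list is illustrative with approximate coordinates rather than real GeoNames rows, and latitude and longitude are treated as planar coordinates for simplicity.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude), treated as planar


class Node:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[Node]", right: "Optional[Node]") -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Build a 2-d tree, splitting on latitude and longitude alternately."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest(node: Optional[Node], target: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only cross the splitting line if a closer point could lie beyond it.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


def place_stations(pois: List[Point], rows: int, cols: int) -> List[Point]:
    """Snap the center of each grid cell to its nearest POI, dropping duplicates."""
    tree = build(pois)
    lat_min, lat_max = min(p[0] for p in pois), max(p[0] for p in pois)
    lon_min, lon_max = min(p[1] for p in pois), max(p[1] for p in pois)
    stations: List[Point] = []
    for i in range(rows):
        for j in range(cols):
            center = (lat_min + (i + 0.5) * (lat_max - lat_min) / rows,
                      lon_min + (j + 0.5) * (lon_max - lon_min) / cols)
            poi = nearest(tree, center)
            if poi not in stations:
                stations.append(poi)
    return stations


# Illustrative POIs: six Hungarian cities with approximate coordinates.
pois = [(47.4979, 19.0402), (47.5316, 21.6273), (46.2530, 20.1414),
        (46.0727, 18.2323), (47.6875, 17.6504), (48.1035, 20.7784)]
stations = place_stations(pois, rows=2, cols=3)
```

With a dense POI set, a 6 × 10 grid would yield up to 60 stations. Because each cell center is snapped to an existing POI, the stations land on named places rather than in empty countryside.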

Virtual weather stations in Hungary

Extending the traffic snapshots with weather observations

When a traffic snapshot is taken, the closest virtual weather station is determined for each train, and that station's most recent observation is assigned to the train. An observation consists of the following fields:

| Field name | Example value |
| --- | --- |
| Weather | “Clouds” |
| Temperature | 26.28 °C |
| Pressure | 1020 hPa |
| Cloudiness | 98% |
| Humidity | 44% |
| Wind | 1.5 m/s |
| Visibility | 10000 m |
| Rain | 0 mm |
| Snow | 0 mm |
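The assignment step described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the two stations use approximate coordinates, the observation values are the example values from the field list, and a plain linear scan with the haversine great-circle distance is used because 60 stations are few enough not to need an index.

```python
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Tuple

EARTH_RADIUS_KM = 6371.0
Station = Tuple[str, float, float]  # (name, latitude, longitude)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_station(lat: float, lon: float, stations: List[Station]) -> Station:
    """Return the station with the smallest distance to the given position."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))


def attach_weather(train: dict, stations: List[Station],
                   latest: Dict[str, dict]) -> dict:
    """Copy the closest station's most recent observation onto the train."""
    name, _, _ = closest_station(train["lat"], train["lon"], stations)
    return {**train, **latest[name]}


# Illustrative stations and observations (hypothetical data).
stations = [("Budapest", 47.4979, 19.0402), ("Szeged", 46.2530, 20.1414)]
latest = {
    "Budapest": {"temperature": 26.28, "pressure": 1020, "wind": 1.5},
    "Szeged": {"temperature": 28.0, "pressure": 1019, "wind": 2.0},
}
train = {"train_id": "552761", "lat": 47.48300, "lon": 19.12723, "delay": 1}
extended = attach_weather(train, stations, latest)
```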


I am using the following system for the data collection, which operates essentially free of charge:

Architecture of the data collection system

A scheduler invokes a script on Heroku every minute. The script prepares an extended snapshot based on the real-time map and the most recent weather observations. The snapshot is then uploaded to pCloud in JSON format:

[{
  "date": "2020-06-25T17:41:05.717740",
  "elvira_id": "5740486_200625",
  "operator": "MAV",
  "line": "100",
  "train_id": "552761",
  "loc": {
    "type": "Point",
    "coordinates": [
      19.12723,
      47.483
    ]
  },
  "relation": "Monor - Budapest-Nyugati",
  "delay": 1,
  "weather": 804,
  "temperature": 26.28,
  "pressure": 1020,
  "cloudiness": 0.98,
  "humidity": 0.44,
  "wind": 1.5,
  "visibility": 10000,
  "rain": 0,
  "snow": 0
}, ...]
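A stored snapshot can be read back with nothing more than the standard library. The sketch below is illustrative only: the function name `iter_trains` is hypothetical, the embedded snapshot is shortened to a single record with a subset of the fields, and its coordinates are taken from the `@Lon` and `@Lat` example values in the field list, on the assumption that `loc` follows the GeoJSON convention of longitude before latitude.

```python
import json
from typing import Iterator, Tuple

# A shortened, single-record snapshot (illustrative subset of the fields).
snapshot_text = """
[{
  "date": "2020-06-25T17:41:05.717740",
  "train_id": "552761",
  "loc": {"type": "Point", "coordinates": [19.12723, 47.483]},
  "delay": 1,
  "weather": 804,
  "cloudiness": 0.98
}]
"""


def iter_trains(text: str) -> Iterator[Tuple[str, float, float, int]]:
    """Yield (train_id, latitude, longitude, delay) for every record."""
    for record in json.loads(text):
        # Assumed GeoJSON order: longitude first, then latitude.
        lon, lat = record["loc"]["coordinates"]
        yield record["train_id"], lat, lon, record["delay"]


trains = list(iter_trains(snapshot_text))
delayed = [t for t in trains if t[3] > 5]  # trains that would not be shown in green
```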

As of June 2020, the dataset consists of 700,000 snapshots containing over 135 million train records. In the next article I am going to evaluate the dataset and offer possible answers to the delay problem.