Failed switch caused Sydney Trains network outage
Cutover to backup network also failed.
Transport for NSW believes a failed network switch caused yesterday’s hour-long communications outage, compounded by the system’s failure to automatically switch to a backup network.
The outage halted all trains at stations because drivers, guards and the rail network’s management centre rely on the communications network to talk to one another.
Sydney Trains CEO Matt Longland told a press conference this morning that the network in question carries radio transmissions from central control to train drivers via 200 base stations.
Longland said the system had operated “reliably since 2016” and that this “is the first incident of its kind”.
When the issue first emerged at around 2.45 pm Wednesday, “staff here at the rail operations looked to do a remote reboot of the system,” Longland said.
“They looked at that process for around five minutes,” he said.
“When they worked out that was not possible, and the impact across the network, we activated our crisis management plan.”
Longland said an investigation would examine why an automatic failover to a redundant system did not occur.
“The investigation will really focus on why the system wasn’t able to cut over automatically, as it should have, in an incident like this,” he said.
“The system has the redundancy to automatically switch across to a backup. That should have occurred immediately … [but] didn’t occur.
“We’ve got a secondary backup, which is a secondary data centre that operates in parallel, that we were able to move to in the event of a significant issue.”
The passive backup, in Homebush, was mobilised and running in parallel with the main system, but Longland said production load was never cut over to it because a fix was found.
Longland said the performance of the replacement network switch is being monitored.
The investigation will also examine Sydney Trains’ use of incident response technology from vendor Frequentis.
So far, Longland added, there is “no suggestion” that a cyber security incident caused the problems.