NA - US - Denver network outage
Incident Report for HiperZ
Postmortem

Network Disruption RFO / RCA

Report Issued: April 12, 2019

Below are our reason for outage (RFO) and root cause analysis (RCA) reports for the network disruption we experienced on April 11, 2019.

Timeline (All times are MT)

10:00 AM – A member of our provisioning team begins preparing routine router configuration changes to support a customer requirement.

10:46 AM – The changes are committed to our routers using a “commit confirmed” statement, which automatically rolls the changes back after 5 minutes unless the commit is confirmed (a sketch of this mechanism appears after the timeline).

10:48 AM – We begin receiving a large volume of alerts indicating a major network disruption.

10:49 AM – Provisioning team engages management and senior network technical resources and begins the process of expediting the rollback of the configuration.

10:51 AM – It is evident that, despite the change having been rolled back, our network is still experiencing a disruption. Further troubleshooting efforts are underway.

11:00 AM – After reviewing logs and other technical data, we determine that our BGP peering to all of our external providers is not in a proper state. Specifically, all of our providers had forced our BGP sessions offline because we exceeded the maximum number of prefixes they allow. We begin the process of clearing our BGP sessions, which we hope will force them back online.

11:02 AM – The sessions still do not come back online, and we begin engaging all of our providers by phone to have them clear the BGP state on their side.

11:11 AM – The Hurricane Electric circuit comes back online, all alerts begin to clear, and network access is restored.

11:40 AM – Cogent circuits come back online.

11:49 AM – Zayo circuit comes back online.
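
For context on the 10:46 AM entry: “commit confirmed” is the safety mechanism that automatically rolls a commit back unless a confirming commit follows within the stated window (the quoted command suggests Junos routers). Below is a minimal sketch of that pattern using the junos-eznc (PyEZ) library; the hostname, username, and configuration file are placeholders and do not describe our actual provisioning tooling.

    # Minimal sketch of a confirmed commit with junos-eznc (PyEZ).
    # Hostname, user, and config file below are placeholders.
    from jnpr.junos import Device
    from jnpr.junos.utils.config import Config

    with Device(host="border1.example.net", user="netops") as dev:
        with Config(dev, mode="exclusive") as cu:
            cu.load(path="customer-change.set", format="set")
            cu.pdiff()  # show the candidate diff for review

            # Commit with an automatic rollback if not confirmed within 5 minutes.
            cu.commit(confirm=5, comment="customer provisioning change")

            # ... verify alarms, BGP state, reachability here ...

            # A second commit within the window makes the change permanent;
            # skipping it lets the router roll the change back on its own.
            cu.commit(comment="confirming customer provisioning change")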

Root Cause Analysis

Human error was the primary cause of this issue. A review of the logs and configurations on our border routers indicates that a typo in a routing “policy statement” resulted in our routers inadvertently advertising full BGP routes to our transit providers. As soon as our announcements exceeded the maximum number of prefixes each provider allows, all of our providers automatically shut down our BGP sessions. We had to contact each of our 4 providers and have them manually reset our BGP state to restore service.
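
To illustrate the failure state identified at 11:00 AM: when a neighbor exceeds a provider's maximum-prefix limit, the provider tears the BGP session down, and in our case the sessions stayed down until each provider cleared them manually. The rough detection sketch below uses junos-eznc and assumes Junos border routers with the usual peer-address/peer-state fields in the “show bgp summary” XML output; treat it as illustrative rather than a description of our monitoring.

    # Rough sketch: flag BGP peers that are not in the Established state.
    # Assumes junos-eznc; hostname and username are placeholders.
    from jnpr.junos import Device

    def down_bgp_peers(host, user="netops"):
        """Return (peer-address, peer-state) for every non-Established peer."""
        with Device(host=host, user=user) as dev:
            summary = dev.rpc.get_bgp_summary_information()
            problems = []
            for peer in summary.findall(".//bgp-peer"):
                address = peer.findtext("peer-address", default="?").strip()
                state = peer.findtext("peer-state", default="?").strip()
                if state != "Established":
                    problems.append((address, state))
            return problems

    if __name__ == "__main__":
        for address, state in down_bgp_peers("border1.example.net"):
            print(f"peer {address} is {state}, expected Established")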

After Action Plan

We will be conducting additional internal training to ensure that everyone on our team is better prepared to make routing changes of this kind safely. Additionally, we are investigating the use of commit scripts to validate that such errors are not present before a change is committed; however, researching and developing these scripts will take some time.
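
To make the commit-script idea concrete, the sketch below shows, in plain Python and against a made-up policy excerpt, the kind of check such a script could perform: flagging any export-policy term that accepts routes without a matching condition, which is the class of typo behind this incident. It is an offline illustration, not an actual on-box commit script, and the policy text and term names are hypothetical.

    # Offline sketch of a pre-commit check: flag export-policy terms that
    # "then accept" without any "from" match condition, the kind of typo
    # that can leak a full table. The policy below is a made-up example.
    import re

    CANDIDATE_POLICY = """
    policy-statement EXPORT-TRANSIT {
        term customers {
            from community CUSTOMER-ROUTES;
            then accept;
        }
        term leak {
            then accept;    /* typo'd term: accepts everything */
        }
        term reject-all {
            then reject;
        }
    }
    """

    def unguarded_accept_terms(policy_text):
        """Return names of terms that accept routes with no 'from' condition."""
        flagged = []
        # Each term block looks like: term NAME { ... } (no nested braces).
        for name, body in re.findall(r"term\s+(\S+)\s*\{([^}]*)\}", policy_text):
            if "then accept" in body and "from " not in body:
                flagged.append(name)
        return flagged

    if __name__ == "__main__":
        for term in unguarded_accept_terms(CANDIDATE_POLICY):
            print(f"WARNING: term '{term}' accepts routes with no match condition")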

Posted Apr 15, 2019 - 18:51 UTC

Resolved
This incident has been resolved.
Posted Apr 11, 2019 - 18:18 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 11, 2019 - 17:16 UTC
Investigating
NA - US - Denver network outage
Posted Apr 11, 2019 - 16:46 UTC
This incident affected: Gameservers NA (USA - Denver).