Incident handled by the on-call team
Incident Report for AT Internet
Postmortem

Incident explanation:

  • Two incidents occurred at the same time:

    • A datacenter incident: an intervention by our main ISP interrupted internet access to the datacenter.
    • An AWS incident: an outage in Zone C of the AWS Frankfurt region.
  • The impacts were:

    • The datacenter incident made the interfaces unavailable for about 35 minutes and the VPN unavailable for about 1 hour 30 minutes.
    • The AWS incident delayed real-time data by approximately one hour for NDF and 2:44 for CDF.

Actions performed:

  • Traffic migrated to our backup ISP
  • Frankfurt data collection migrated to other regions
  • Kiss EMR in Frankfurt launched in a new zone
  • Traffic rolled back to the main ISP

Actions to come:

  • Update some of the documentation
  • Improve one of the on-call alerts
  • Add the ability to use the VPN over the backup ISP
  • Expand the list of people the main ISP contacts in case of scheduled maintenance
  • Many other minor actions
Posted Jun 16, 2021 - 08:08 UTC

Resolved
The issue has been resolved and all our tools and interfaces are fully available.
We are currently catching up on the delay in real-time data retrieval.
Posted Jun 10, 2021 - 23:50 UTC
Investigating
We are currently experiencing a production incident.
We apologize for any inconvenience caused.
Posted Jun 10, 2021 - 20:05 UTC
This incident affected: API (API Rest, API Flow) and Data collection, Data processing, Interfaces.