SQL Importer Network Issue
Incident Report for hull
Postmortem

Summary

Between 18:30 UTC on 16th April 2020 and 19:20 UTC on 17th April 2020, the SQL Importer experienced network issues causing import jobs to not execute on schedule.

What happened?

An automatic update of the underlying K8s cluster caused the static IP addresses to be detached at 18:30 UTC on 16th April 2020. Because the SQL Importer connector was no longer using static IPs, firewall rules blocked network traffic to the databases from the new, unknown IP addresses, so no queries executed successfully. As a result, customers did not see any import jobs.
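The failure mode described above can be illustrated with a small sketch of a source-address allowlist check, which is roughly what a database firewall applies to incoming connections. The addresses and the `is_allowed` helper are hypothetical (documentation-range IPs, not the connector's real addresses), and a real firewall is configured in its own rule language rather than in Python.

```python
import ipaddress

# Hypothetical allowlist of static source addresses a customer firewall
# would admit. These are documentation-range IPs, not real ones.
ALLOWED_SOURCES = [
    ipaddress.ip_network("203.0.113.10/32"),
    ipaddress.ip_network("203.0.113.11/32"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True if the source address falls inside an allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWED_SOURCES)
```

While the static IPs are attached, connections arrive from an allowlisted address and pass. Once they are detached, traffic leaves from some other address and is rejected, even though nothing changed on the database side.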

Our support team was notified about failing SQL queries at 11:37 UTC on April 17th and escalated the issue at 11:52 UTC to the engineering team.

At 13:00 UTC the engineering team determined that the detached IP addresses were the root cause of the outage and re-attached them to bring the connector back to an operational state. After confirming that the connector was operational again, the engineering team started to replay jobs at 13:59 UTC for all customers.
Once the connector was fully operational again, the engineering team analyzed why the IP addresses had been detached.

At 19:36 UTC the engineering team took additional measures to bring the data of all organizations up to date, and all jobs were fully imported and processed at 19:20 UTC on 17th April 2020.

What are we doing about this?

  • We have deactivated automatic updating of the K8s cluster to avoid further uncontrolled detachments.

  • We implemented more rigorous alerting for the SQL Importer that notifies us immediately when the number of failed import jobs increases. This would have alerted us within 86 minutes of the incident start.

  • Add alerting for all connectors that detects error spikes, where multiple instances across various customers go into error mode, to help detect infrastructure issues.

  • Determine further DevOps actions to ensure that static IPs remain properly attached at all times.
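The cross-customer error-spike idea in the list above could be sketched as follows. This is a minimal illustration, not the alerting that was actually deployed: the `JobResult` record, the `connectors_with_error_spike` function, and both thresholds are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class JobResult(NamedTuple):
    """Hypothetical record of one import job run in the observation window."""
    connector: str
    customer: str
    ok: bool


def connectors_with_error_spike(
    results: Iterable[JobResult],
    min_customers: int = 3,        # assumed threshold, tune per connector
    min_error_ratio: float = 0.5,  # assumed threshold, tune per connector
) -> List[str]:
    """Return connectors whose failures span many customers at once.

    Failures spread across several customers point to shared infrastructure
    rather than a single customer's credentials or query.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    failing_customers = defaultdict(set)
    for r in results:
        total[r.connector] += 1
        if not r.ok:
            failed[r.connector] += 1
            failing_customers[r.connector].add(r.customer)
    return sorted(
        c for c in total
        if len(failing_customers[c]) >= min_customers
        and failed[c] / total[c] >= min_error_ratio
    )
```

Counting distinct failing customers, rather than raw failures, keeps one misconfigured customer from triggering an infrastructure alert while still catching an outage like this one, where every instance fails at once.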

Posted Apr 20, 2020 - 11:29 EDT

Resolved
Summary of Impact: Between 19:30 UTC on 16th April 2020 and 13:45 UTC on 17th April 2020, the SQL Importer experienced network issues causing import jobs to not execute on schedule. The root cause was a failing networking component which was supposed to apply static IP addresses to the cluster.

Mitigation: Our engineers have recovered the failing networking component and resumed or replayed all import jobs to ensure data is up-to-date again.

Next Steps: We will publish a post-mortem for this incident with further details, including steps to ensure that this cannot happen again. Thank you for your patience.
Posted Apr 17, 2020 - 10:58 EDT
Monitoring
Summary of Impact: Starting around 19:30 UTC on 16th April 2020 the SQL Importer experienced network issues causing import jobs to not execute on schedule. The root cause was a failing networking component which was supposed to apply static IP addresses to the cluster.

Mitigation: Our engineers have recovered the failing networking component and resumed or replayed all import jobs.

Next Steps: Engineering will continue to monitor the situation over the next few hours, and we will issue an update within the next 4 hours at the latest or as events warrant. Thank you for your patience.
Posted Apr 17, 2020 - 08:34 EDT
Identified
Summary of Impact: Starting around 19:30 UTC on 16th April 2020 the SQL Importer experienced network issues causing import jobs to not execute on schedule. The root cause was a failing networking component which was supposed to apply static IP addresses to the cluster.

Mitigation: Our engineers have recovered the failing networking component and are working to resume and, if necessary, replay import jobs.

Next Steps: We will issue an update within the next 4 hours at the latest or as events warrant. Thank you for your patience.
Posted Apr 17, 2020 - 08:06 EDT
Investigating
Starting around 19:30 UTC on 16th April 2020 the SQL Importer experienced network issues causing import jobs to not execute on schedule. Our engineering team is investigating the root cause.

Further updates will be posted as events warrant or within the next 120 minutes at the latest.
Posted Apr 17, 2020 - 07:07 EDT
This incident affected: Hull Connectors (SQL Connector).