Between 18:30 UTC on 16th April 2020 and 19:20 UTC on 17th April 2020, network issues prevented SQL Importer import jobs from running on schedule.
An automatic update of the underlying K8s cluster detached the connector's static IP addresses at 18:30 UTC on 16th April 2020. Because the SQL Importer connector was no longer using its static IPs, firewall rules blocked network traffic to the databases from the new, unknown IP addresses, and no queries executed successfully. As a result, customers did not see any import jobs.
Our support team was notified about failing SQL queries at 11:37 UTC on 17th April and escalated the issue to the engineering team at 11:52 UTC.
At 13:00 UTC the engineering team identified the detached IP addresses as the root cause of the outage and re-attached them to bring the connector back to an operational state. After confirming that the connector was operational again, the engineering team started replaying jobs for all customers at 13:59 UTC.
Once the connector was fully operational again, the engineering team began root cause analysis.
At 19:36 UTC the engineering team took additional measures to bring the data of all organizations up to date, and all jobs were fully imported and processed at 19:20 UTC on 17th April 2020.
We have deactivated automatic updating of the K8s cluster to avoid further uncontrolled detachments.
We implemented more rigorous alerting for the SQL Importer so that we are notified immediately when the number of failed import jobs increases; this would have alerted us within 86 minutes of the incident start.
Add alerting for all connectors that detects error spikes in which multiple instances across various customers go into error mode at the same time, to help detect infrastructure issues.
Determine further DevOps actions to ensure that static IPs remain attached at all times.
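The two open follow-up checks could look roughly like the sketch below. It is illustrative only and does not describe the alerting that was actually deployed: the function names, the thresholds, the `(customer_id, succeeded)` input shape, and the example IP addresses (taken from the documentation range 203.0.113.0/24) are all assumptions.

```python
from ipaddress import ip_address

# Hypothetical thresholds; they would need tuning against real job volumes.
MIN_FAILING_CUSTOMERS = 3
MAX_ERROR_RATIO = 0.5


def failing_customers(job_results, max_error_ratio=MAX_ERROR_RATIO):
    """Return the customers whose share of failed jobs exceeds max_error_ratio.

    job_results is an iterable of (customer_id, succeeded) tuples.
    """
    totals, failures = {}, {}
    for customer_id, succeeded in job_results:
        totals[customer_id] = totals.get(customer_id, 0) + 1
        if not succeeded:
            failures[customer_id] = failures.get(customer_id, 0) + 1
    return sorted(
        customer
        for customer, total in totals.items()
        if failures.get(customer, 0) / total > max_error_ratio
    )


def is_infrastructure_spike(job_results, min_customers=MIN_FAILING_CUSTOMERS):
    """True when enough distinct customers fail at once to suggest a shared cause."""
    return len(failing_customers(job_results)) >= min_customers


def egress_ip_is_static(observed_ip, static_ips):
    """True when the observed egress IP is one of the reserved static IPs."""
    expected = {ip_address(ip) for ip in static_ips}
    return ip_address(observed_ip) in expected
```

A monitoring job could call `is_infrastructure_spike` on the job results of a recent time window and `egress_ip_is_static` on the address the connector is currently seen from, and page the on-call engineer when the first returns True or the second returns False. Counting distinct failing customers, rather than total failed jobs, keeps a single customer with broken credentials from triggering an infrastructure alert.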