Outage Summary
The outage was caused by excessive ‘CPU Steal’ on our backend database, which led to connection issues and timeouts.
To scale the database cluster in minimal time, we intentionally scaled down the servers, which temporarily dropped connections.
Services were restored after the cluster was upgraded, starting with the event and SDK servers.
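For context, CPU steal is the share of time a virtual CPU was ready to run but had to wait while the hypervisor served other guests. The sketch below shows one way to measure it, assuming a Linux host that exposes the aggregate `cpu` line of `/proc/stat`. This report does not say which operating system or monitoring tool was involved, so the function names and sample counter values are hypothetical.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    counters = [int(v) for v in fields[1:]]
    if len(counters) < 8:
        raise ValueError("kernel does not report a steal counter")
    return counters


def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen between two /proc/stat samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    # Field order: user nice system idle iowait irq softirq steal ...
    # guest and guest_nice are already counted in user and nice,
    # so only the first eight counters are summed.
    delta = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(delta)
    return 100.0 * delta[7] / total if total else 0.0


# Hypothetical samples taken a few seconds apart (illustrative values only).
sample_before = "cpu 100 0 50 800 10 0 5 35 0 0"
sample_after = "cpu 200 0 100 1500 20 0 10 170 0 0"
print(f"steal: {steal_percent(sample_before, sample_after):.1f}%")
```

On a real host the two samples would come from reading the first line of `/proc/stat` twice with a short sleep in between. A sustained high value suggests the instance is competing for physical CPU with other tenants.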
Outage impact:
- 99.74% of feature flag requests were returned successfully via our global CDN
- ~10,000 successful requests per minute for feature flags
- 18 failed responses per minute for feature flags
- All feature analytics event postings were affected
- Sep 13, 18:19 UTC to Sep 13, 20:18 UTC - the SDK and Event APIs were degraded.
- Sep 13, 18:19 UTC to Sep 13, 20:23 UTC - the Admin APIs were degraded.
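The length of each degraded window follows directly from the timestamps above. A minimal sketch of the arithmetic is shown below. The report does not state the year, so `PLACEHOLDER_YEAR` is an arbitrary value used only to make the timestamps parseable, and the helper name `window_minutes` is hypothetical.

```python
from datetime import datetime

# Assumption: the year is not given in the report; any value works here
# because both ends of each window fall on the same day.
PLACEHOLDER_YEAR = 2000


def window_minutes(start: str, end: str) -> int:
    """Minutes between two timestamps written like 'Sep 13, 18:19 UTC'."""
    fmt = "%Y %b %d, %H:%M UTC"
    t0 = datetime.strptime(f"{PLACEHOLDER_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds() // 60)


sdk_and_event = window_minutes("Sep 13, 18:19 UTC", "Sep 13, 20:18 UTC")
admin = window_minutes("Sep 13, 18:19 UTC", "Sep 13, 20:23 UTC")
print(sdk_and_event, admin)  # 119 124
```

The SDK and Event APIs were therefore degraded for 119 minutes (1 h 59 min) and the Admin APIs for 124 minutes (2 h 4 min).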