A network issue to our backend database is causing intermittent disruption
Incident Report for Featureflow
Postmortem

Outage Summary

The issue was caused by excessive ‘CPU Steal’ on our backend database causing connection issues and timeouts.

We temporarily and intentionally dropped connections by scaling down the servers to scale the database cluster in minimal time.

Services were restored after the cluster was upgraded, starting with the event and SDK servers.

Outage impact:

  • 99.74% of feature flag requests were returned successfully via our global CDN
  • ~10,000 Successful requests per minute for feature flags.
  • 18 Failed responses per minute for feature flags.
  • All feature analytic event postings were affected
  • Sep 13, 18:19 UTC to Sep 13, 20:18 UTC - the SDK and Event APIs were degraded.
  • Sep 13, 18:19 UTC to Sep 13, 20:23 UTC - the Admin APIs were degraded.
Posted Sep 13, 2023 - 21:27 UTC

Resolved
This incident has been resolved.
Posted Sep 13, 2023 - 20:26 UTC
Update
All Services are operational.
Posted Sep 13, 2023 - 20:23 UTC
Update
Feaatureflow SDK and Event Server is recovered. Admin Api is being recovered.
Posted Sep 13, 2023 - 20:18 UTC
Update
Issue has been identified in a rogue process causing database CPU issues. We are recovering services currently.
Posted Sep 13, 2023 - 20:07 UTC
Investigating
We are currently investigating this issue.
Posted Sep 13, 2023 - 18:19 UTC
This incident affected: Admin API, Event Server, and SDK Server.