A network issue to our backend database is causing intermittent disruption

Incident Report for Featureflow

Postmortem

Outage Summary

The issue was caused by excessive ‘CPU Steal’ on our backend database causing connection issues and timeouts.

We temporarily and intentionally dropped connections by scaling down the servers to scale the database cluster in minimal time.

Services were restored after the cluster was upgraded, starting with the event and SDK servers.

Outage impact:

  • 99.74% of feature flag requests were returned successfully via our global CDN
  • ~10,000 Successful requests per minute for feature flags.
  • 18 Failed responses per minute for feature flags.
  • All feature analytic event postings were affected
  • Sep 13, 18:19 UTC to Sep 13, 20:18 UTC - the SDK and Event APIs were degraded.
  • Sep 13, 18:19 UTC to Sep 13, 20:23 UTC - the Admin APIs were degraded.
Posted Sep 13, 2023 - 21:27 UTC

Resolved

This incident has been resolved.
Posted Sep 13, 2023 - 20:26 UTC

Update

All Services are operational.
Posted Sep 13, 2023 - 20:23 UTC

Update

Feaatureflow SDK and Event Server is recovered. Admin Api is being recovered.
Posted Sep 13, 2023 - 20:18 UTC

Update

Issue has been identified in a rogue process causing database CPU issues. We are recovering services currently.
Posted Sep 13, 2023 - 20:07 UTC

Investigating

We are currently investigating this issue.
Posted Sep 13, 2023 - 18:19 UTC
This incident affected: Admin API, Event Server, and SDK Server.