North America - Production Outage

Incident Report for Shibumi

Postmortem

Issue Summary

  • During scheduled patching by AWS, the managed Kafka cluster experienced rolling restarts across its three service brokers.
  • A critical Shibumi service requires all three service brokers to be available, which caused intermittent service unavailability as each service was patched and rebooted.

Immediate Resolution and Recovery

  • A temporary configuration change was applied to help restore service stability during the AWS maintenance.
  • Full availability was confirmed once all service brokers completed their restart sequence.

Preventive Measures

  • A hotfix for the affected critical service will be implemented to improve fault tolerance, allowing the service to operate when one of the Kafka service brokers is unavailable.
Posted Oct 21, 2025 - 18:50 UTC

Resolved

This incident has been resolved.
Posted Oct 21, 2025 - 18:00 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 21, 2025 - 18:00 UTC

Identified

The cause of the issue has been identified and we are working with our Cloud Vendor to remediate.
Posted Oct 21, 2025 - 17:46 UTC

Investigating

We are currently looking into sporadic outage with the Shibumi platform.
Posted Oct 21, 2025 - 17:38 UTC
This incident affected: North America (APP).