Failed Kafka upgrade on production
Incident Report for traceableai

We have identified the following fixes to make the Kafka upgrade process more fault tolerant:

  1. Increase the Kafka termination grace period from 5 minutes to 30 minutes. This gives the Kafka process more time to shut down cleanly by flushing all unflushed segments.
  2. Disable rolling restarts of Kafka pods in prod. In high-volume clusters, upgrading the Kafka version will no longer trigger an automatic pod restart; instead, we will delete each pod manually, one at a time.
  3. Increase the number of brokers to 5. We currently have three pods and three replicas for each topic. With five pods, topic partitions will be distributed across more brokers.
  4. Fix the hypertrace-collector issue. Internal bugs have been logged with P1 priority.
  5. Review the retention configuration of the data services' changelog topics. This will allow the services to start consuming faster after rebalances.
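Fix 1 above is a single Kubernetes pod-spec change. A minimal sketch of the relevant fragment of the Kafka StatefulSet pod template (`terminationGracePeriodSeconds` is the standard Kubernetes field; the surrounding structure is abbreviated):

```yaml
# Fragment of the Kafka StatefulSet pod template (abbreviated).
spec:
  template:
    spec:
      # 30 minutes, up from the 5-minute (300 s) default grace period,
      # so the broker can flush unflushed segments before being killed.
      terminationGracePeriodSeconds: 1800
```

Kubernetes sends SIGTERM at the start of this window and SIGKILL only after it expires, so a larger value gives the broker the full window to complete a clean shutdown.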
Posted Sep 19, 2022 - 19:34 UTC

We upgraded Kafka from version 2.6.0 to 3.2.1. The upgrade was validated on our sandbox, dev, staging, and prod-next (mini production) clusters before being deployed to the prod cluster. Upgrading Kafka on the prod cluster triggered a rolling restart of the Kafka pods. Most pods restarted just fine, but one pod took a long time to come back up because its Kafka process was not shut down cleanly. After an unclean shutdown, the broker must recover its log segments on startup, which takes a long time.
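One way to tell whether a broker stopped cleanly is to look for the clean-shutdown marker file (`.kafka_cleanshutdown`) that Kafka writes into each log directory on a graceful stop; if the marker is missing, log recovery will run on the next start. A minimal sketch, assuming the log directory path is passed in as an argument:

```shell
# Report whether a Kafka log directory was left by a clean shutdown.
# Kafka writes a .kafka_cleanshutdown marker file into each log dir
# on a graceful stop and removes it on startup; if the marker is
# absent after a stop, the broker will run log recovery next start.
check_clean_shutdown() {
  local log_dir=$1
  if [ -f "$log_dir/.kafka_cleanshutdown" ]; then
    echo "clean"
  else
    echo "unclean: log recovery will run on next start"
  fi
}
```

Running this against each broker's data volume after a stop (and before the next start) would have flagged the unclean shutdown before the long recovery began.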

As a result, our hypertrace-collector did not re-connect to the Kafka brokers after its established connection failed. This led to the hypertrace-collector pod getting OOMKilled, and to loss of customer data. We restarted the hypertrace-collector multiple times, but it was OOMKilled repeatedly.
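The missing behavior in the collector is a bounded reconnect loop. The actual hypertrace-collector is a separate service and its fix is tracked internally; purely as an illustration of the pattern, a generic retry-with-exponential-backoff wrapper (all names here are hypothetical):

```shell
# Generic retry with exponential backoff: re-run a command until it
# succeeds, doubling the delay between attempts, giving up after a
# maximum number of attempts. Illustrative only; the real collector
# would implement the same pattern around its broker connection.
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=1 attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1  # give up after max_attempts failures
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 connect_to_broker` (where `connect_to_broker` is whatever re-establishes the connection) retries for up to 1 + 2 + 4 + 8 = 15 seconds before failing, instead of buffering indefinitely until the process is OOMKilled.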

We lost customer data for 70 minutes (12:00 AM to 1:10 AM PST) on September 18, 2022.

Please reach out for more information.
Posted Sep 18, 2022 - 07:00 UTC