Increase the number of brokers from 3 to 5. We currently run three pods, with three replicas for each topic. With 5 pods, topic partitions will be distributed across more brokers, reducing the load on each one.
Fix the hypertrace-collector reconnection issue. Internal bugs are logged with P1 priority.
Review the retention configuration of the changelog topics used by the data services. Tighter retention will allow the services to restore state and start consuming faster after rebalances.
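As a sketch of the retention action item, changelog retention can be inspected and adjusted with Kafka's CLI tools. The topic name, bootstrap address, and values below are illustrative assumptions, not our actual production settings:

```shell
# Inspect the current config overrides on a changelog topic
# (topic name is hypothetical; substitute the real changelog topics).
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name data-service-changelog \
  --describe

# Changelog topics are typically compacted; keeping cleanup.policy=compact
# and bounding segment size can help consumers restore state faster
# after a rebalance. Values here are examples only.
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name data-service-changelog \
  --alter --add-config cleanup.policy=compact,segment.bytes=104857600
```

The actual retention values would need to be chosen per topic based on how much history each data service needs to rebuild its state.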
Posted Sep 19, 2022 - 19:34 UTC
We have upgraded the Kafka version from 2.6.0 to 3.2.1 (https://github.com/hypertrace/kafka/pull/32). The upgrade was validated on our sandbox, dev, staging, and prod-next (mini production) clusters before deploying it on the prod cluster. Upgrading Kafka on the prod cluster triggered a rolling restart of the Kafka pods. Most pods restarted just fine, but one pod took a long time to come back. This happened because the Kafka process on that pod was not shut down cleanly; after an unclean shutdown, the broker must recover its log segments, which takes a long time.
Meanwhile, the hypertrace-collector does not re-connect to the Kafka brokers when an established connection fails. Unable to send data, the hypertrace-collector pod accumulated memory until it was OOMKilled, causing loss of customer data. We restarted the hypertrace-collector multiple times, but it was OOMKilled repeatedly.
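The reconnection fix for the collector amounts to retrying a failed broker connection with capped exponential backoff, rather than letting unsent data pile up until the pod is OOMKilled. Below is a minimal sketch of that pattern, not the actual hypertrace-collector code; the function names and delay parameters are assumptions for illustration:

```python
import time
from typing import Callable


def backoff_delays(base: float = 0.5, cap: float = 30.0):
    """Yield capped exponential backoff delays: base, 2*base, ... up to cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def connect_with_retry(connect: Callable[[], object],
                       max_attempts: int = 10,
                       sleep: Callable[[float], None] = time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    Raises the last ConnectionError once max_attempts is exhausted,
    so the caller can surface a clear failure instead of buffering forever.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

Pairing a retry loop like this with a bounded in-memory queue (dropping or spilling data past a size limit) would also have kept the collector's memory from growing until the OOMKill.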
We lost customer data for 70 minutes (12:00 am to 1:10 am PST) on Sept 18, 2022.