Resolved -
Service Disruption Report: Some Background Tasks Not Starting or Delayed
Overview: We experienced a service disruption in which some tasks in a subset of our background queues either did not start or were delayed.
Incident Timeline:
The disruption began shortly after the release of version 1.92.1 at 05:05 UTC on May 8th and affected only a subset of queues. At 17:57 UTC, the customer support team reported that load updates were not being processed. At 19:06 UTC, a rapid fix was deployed that removed a few queue restrictions. This restored functionality, but the resulting increase in message throughput caused high latency and database locking. Further optimizations followed, including scaling down low-priority tasks and implementing a new solution, and processing began showing significant improvement by 22:30 UTC.
Root Cause: The issue was linked to our transition to a new backend framework, which was incompatible with a library we use for queues. This incompatibility prevented tasks from being published, as they were erroneously locked in the system.
Resolution and Recovery Steps:
Immediate action was taken to resolve the publishing issue by implementing a simpler, interim, custom-built mechanism. A more sustainable solution was then developed and validated in our staging environment to ensure full compatibility and functionality going forward.
Moving Forward: We will continue testing the sustainable solution in staging before moving it to production within the next two days. We are committed to maintaining stable and efficient operations, and we are enhancing our testing environments so that they better replicate production conditions and verify compatibility for all updates. We apologize for any inconvenience caused and appreciate your patience and understanding. For further assistance or inquiries, please contact our support team.
May 8, 22:00 UTC