Services Affected: Clarity Inbound Calling, Clarity Outbound Calling
Start Date/Time: 9:00 AM (ET), October 21, 2025
End Date/Time: 11:22 AM (ET), October 21, 2025
At approximately 9:00 AM (ET), Cloudli Support received multiple customer reports of failed inbound and outbound calls. The Network Operations and Engineering teams were engaged immediately to investigate.
Initial analysis confirmed that the issue was not related to the prior night’s maintenance, which had already been rolled back. Further investigation revealed that the `shortlocation` entries in Redis for some Registrar Servers were incomplete, resulting in failed SIP registrations and call setup errors for a subset of customers.
At 9:13 AM, the GlobalSBC Registrar microservice was restarted to re-establish proper Redis synchronization. Service behavior normalized immediately after the restart, and call completion rates returned to expected levels. At 9:45 AM, the incident was marked resolved and placed under monitoring.
By 10:15 AM, Engineering confirmed that all SIP registrations were stable and that previously impacted customers could again complete calls successfully. Further validation through test calls, log reviews, and metric analysis confirmed full service restoration, and the incident was closed at 11:22 AM.
The incident on October 21, 2025, was caused by the AWS US-East regional outage of October 20, 2025, which disrupted connectivity to Cloudli’s Kafka infrastructure. This interruption caused a transient desynchronization between Kafka, Redis, and the Java microservices that manage some of the Registrar servers. The desynchronization left incomplete `shortlocation` entries in Redis that persisted after connectivity returned, resulting in the SIP registration failures observed the following morning.
Once AWS service availability returned for Cloudli’s Kafka infrastructure, restarting the affected Registrar component restored proper Redis state and normal call routing.
To prevent recurrence, Cloudli Engineering will:
These measures will strengthen platform resilience and reduce the platform’s sensitivity to interruptions in external cloud infrastructure.
At Cloudli, we take any interruption of service very seriously and are continuously evaluating new processes and mitigation measures that can be proactively implemented to ensure service continuity.
When service interruptions do occur, our incident management procedure prioritizes prompt, clear notification to our customers and partners, followed by timely status and resolution updates.
Thank you for your continued support. Please feel free to reach out if you would like to discuss the details of this incident further.