Increased error rates with SDCX servers

Incident Report for Signiant Cloud Services

Postmortem

Between 1:39 PM EST and 2:28 PM EST on Thursday, February 23rd, service degradation of AWS IoT (in the us-east-1 region) caused intermittent connectivity issues for Signiant SDCX servers. Signiant uses AWS IoT to exchange messages with SDCX servers without the need for inbound HTTP access to the SDCX Server.

Between 1:44 PM EST and 1:53 PM EST, our monitoring alerted us to a problem with browsing share portals backed by SDCX server-based storage.

During the investigation into the above failures, our monitoring also alerted us to a potential issue with Jet transfers.

The AWS status page did not indicate any problems with AWS services at that time. To mitigate the issues highlighted by our monitoring, we opted to fail over the services that were showing increased error rates to another region while we continued to investigate. We also decided to re-route IoT traffic to another region. These changes were made between 2:17 PM EST and 2:21 PM EST.

We began to see improvement immediately after the change to the IoT endpoint, and a subsequent check of the AWS status page showed an incident posted there for IoT in the us-east-1 region. The details posted by AWS were as follows:

[10:57 AM PST] We are investigating increased API error rates in the US-EAST-1 Region.
[11:22 AM PST] We have identified the root cause for the elevated API rates and latency in the US-EAST-1 Region and are working towards recovery.
[11:34 AM PST] Between 10:39 AM and 11:28 AM PST, we experienced elevated API errors and latency for Publish operation in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

With all of our monitoring systems recovering after the above change to the IoT endpoint, we continued to closely monitor services and eventually closed out this incident at 3:11 PM EST.

Signiant SaaS services are designed to withstand major outages in cloud provider infrastructure; however, we do rely on AWS services for connectivity with SDCX servers, and in this case, those services were impacted. Going forward, we are investigating enhancements that will allow us to react more quickly to issues with the AWS IoT service.
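
As an illustration only, one way to react more quickly to this kind of degradation is to track the recent publish error rate and switch the messaging endpoint to another region once it crosses a threshold, rather than waiting on a provider status page. The sketch below is a minimal, hypothetical example and does not describe Signiant's actual implementation: the class name, the window size, the threshold, and the secondary region (`us-west-2`) are all assumptions chosen for the example.

```python
from collections import deque


class RegionFailover:
    """Hypothetical helper: picks a region based on recent publish outcomes."""

    def __init__(self, primary, secondary, window=50, threshold=0.2, min_samples=10):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold      # error rate above which traffic is re-routed
        self.min_samples = min_samples  # avoid reacting to one or two failures
        self.active = primary
        # Only the most recent `window` outcomes are kept.
        self._results = deque(maxlen=window)

    def error_rate(self):
        """Fraction of failed publishes among the recent outcomes."""
        if not self._results:
            return 0.0
        return self._results.count(False) / len(self._results)

    def record(self, ok):
        """Record one publish outcome and return the region to use next."""
        self._results.append(ok)
        if (
            self.active == self.primary
            and len(self._results) >= self.min_samples
            and self.error_rate() > self.threshold
        ):
            # One-way switch; failing back is left as a manual decision.
            self.active = self.secondary
        return self.active


# Example: us-east-1 is from the incident; us-west-2 is an assumed secondary.
failover = RegionFailover("us-east-1", "us-west-2", window=10, min_samples=5)
for outcome in [True, True, True, True, True, False, False]:
    region = failover.record(outcome)
```

In this sketch the switch is deliberately one-way, so that traffic does not flap between regions while the primary is still recovering.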

Posted Feb 24, 2023 - 12:21 EST

Resolved

We are no longer seeing increased error rates for SDCX connectivity and all systems are operational.

Posted Feb 23, 2023 - 15:11 EST

Monitoring

We are no longer seeing increased error rates for SDCX connectivity. MediaShuttle portal browsing and Jet transfers should now be operational. We continue to monitor the situation.

Posted Feb 23, 2023 - 14:48 EST

Identified

We have identified an issue with a cloud service currently in an outage state. We have applied a mitigation to move traffic away from the affected resources. We expect a full recovery of connectivity to SDCX servers as error rates begin to decrease. We will continue to monitor to make sure services are operational once more. Customers may still see intermittent issues with MediaShuttle On-Premises share portals and Jet jobs while the issue resolves.

Posted Feb 23, 2023 - 14:34 EST

Investigating

We are investigating intermittent connectivity issues with SDCX servers. Customers may have issues browsing on-premises share portals to start new transfers with MediaShuttle. Jet customers may have issues with existing jobs being delayed or failing. Cloud-based storage and transfers are not affected at this moment.

Posted Feb 23, 2023 - 14:21 EST

This incident affected: Media Shuttle (Portal Interface) and Jet (Transfers).