Between 2:58 PM EDT and 5:30 PM EDT on Tuesday, June 13th, a service disruption within AWS's us-east-1 region caused degradation of multiple Signiant services. (AWS has not yet posted a post-event summary for this incident; when available, it should appear at https://aws.amazon.com/premiumsupport/technology/pes/)
At 2:58 PM EDT our monitoring alerted us to a problem with Signiant service logins, and the Signiant Status page was updated to this effect. At that point we began failing over impacted services to our backup region. Additional alerts indicated that the problem was more widespread than console logins alone, and the Signiant Status page was updated to reflect that additional services were affected.
At 3:08 PM EDT AWS posted a message on their status page indicating an issue with AWS Lambda in the us-east-1 region.
Once failover of impacted services was complete, Signiant Console logins recovered, but we continued to see elevated error rates when browsing Media Shuttle share portals. Further investigation uncovered a configuration error in a microservice involved in portal browsing that prevented it from operating correctly in the failover region. Once this configuration error was corrected, share portal browsing recovered.
It should be noted that ongoing transfers were not affected throughout this incident, but given the impact on logins and share portal browsing, it was not possible to start new transfers under some circumstances.
Based on information obtained during the investigation, customers may also have experienced intermittent issues with Jet Hot Folder jobs between 3:00 PM EDT and 4:40 PM EDT. The impact was possible delays in processing hot folder events; due to built-in retries, however, all events were eventually processed and all transfers completed.
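The retry behavior described above can be sketched as follows. This is an illustrative example only, not Signiant's actual implementation; the function name, attempt count, and backoff schedule are all assumptions for the sake of the sketch.

```python
import time

def process_with_retries(event, handler, max_attempts=5, base_delay=1.0):
    """Process a hot folder event, retrying with exponential backoff.

    Hypothetical sketch: if `handler` raises on a transient failure,
    retry with delays of base_delay, 2*base_delay, 4*base_delay, ...
    until it succeeds or max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With a scheme like this, events delayed by transient regional errors are eventually processed once the underlying service recovers, which matches the behavior observed during the incident.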
Signiant SaaS services are designed to withstand major outages in cloud provider infrastructure without customer impact. Our services run in multiple regions and multiple availability zones within each region. Some services are active in multiple regions at the same time (e.g. transfer services), while others fail over between regions when there is an underlying cloud provider issue. In this specific incident we experienced issues with regional failover for some of our services, and although our services recovered more rapidly than the underlying AWS services, we have identified several opportunities for improvement.
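The distinction between active-active and failover services can be illustrated with a small routing sketch. The service names, region identifiers, and health-check model below are hypothetical; they are not a description of Signiant's actual topology.

```python
# Hypothetical topology for illustration only.
ACTIVE_ACTIVE = {"transfer"}              # served from all healthy regions at once
PRIMARY, BACKUP = "us-east-1", "us-west-2"

def serving_regions(service, healthy):
    """Return the region(s) that should serve `service`.

    `healthy` is the set of regions currently passing health checks.
    Active-active services use every healthy region; failover services
    use the primary region unless it is unhealthy, then the backup.
    """
    if service in ACTIVE_ACTIVE:
        return sorted(healthy)
    if PRIMARY in healthy:
        return [PRIMARY]
    return [BACKUP] if BACKUP in healthy else []
```

Under this model, a transfer-style service keeps running through a regional outage with no failover step at all, while a failover-style service depends on the switch from primary to backup happening quickly, which is why failover time matters for the latter category.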
In particular, we use internal tooling to automate failover between regions, and during this incident the effectiveness of that tooling was impaired by a regional dependency. Accordingly, we had to fall back to manual failover of our services, which increased the time required to fail over. While we regularly test service failover, this specific failure scenario, and its impact on time to failover, was not covered by our testing. Going forward, we are making changes to ensure that all tooling that affects time to failover is appropriately redundant, and we are incorporating scenarios that might impair our tooling into our failover testing.
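The redundancy principle described above, that the failover tooling itself must not depend on the region it is failing away from, can be sketched as a simple endpoint-selection rule. The function and endpoint names are hypothetical, introduced only to illustrate the idea.

```python
def pick_orchestrator(endpoints, impacted_region):
    """Pick a failover-orchestration endpoint outside the impacted region.

    `endpoints` maps region name -> orchestrator URL (illustrative only).
    Running the orchestrator from an unaffected region avoids the
    circular dependency where the tooling that performs failover is
    itself degraded by the regional outage it is responding to.
    """
    for region, url in sorted(endpoints.items()):
        if region != impacted_region:
            return url
    raise RuntimeError("no orchestrator available outside impacted region")
```

Exercising exactly this kind of scenario, where the automation's home region is the one that fails, is the gap in failover testing that the changes above are intended to close.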