Increased error rates in AWS us-east-1 region

Incident Report for Signiant Cloud Services

Postmortem

Between 10:48 AM EST and 7:27 PM EST on Tuesday, December 7th, service interruptions within AWS (mostly in the us-east-1 region) caused intermittent issues for some Signiant services. AWS has not yet posted a post-event summary for this incident, but it should be available in the future at: https://aws.amazon.com/premiumsupport/technology/pes/

At 10:48 AM EST our monitoring alerted us to a problem with app-less transfers in the us-east-1 region. Traffic was immediately directed away from this region to another region, so impact to customers would have been very minimal or non-existent.

During the course of the investigation into the above failure, our monitoring started alerting us to other anomalies in the us-east-1 region of AWS. The AWS status page did not indicate any problems at this time. In order to mitigate the issues highlighted by our monitoring, we opted to fail over services that appeared to be showing increased error rates to another region, while we continued to investigate. The services in question were all related to our Jet product.
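As an illustration only, the decision to fail over a service once its error rate rises can be sketched as follows. This is a minimal sketch and not Signiant's actual tooling: the `ErrorRateMonitor` class, the window size, the 20% threshold, and the fallback region name `us-west-2` are all assumptions made for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests to one region (hypothetical helper)."""

    def __init__(self, window=100, threshold=0.2):
        # Only the last `window` outcomes are kept; older ones fall off automatically.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome: True for success, False for an error."""
        self.results.append(bool(ok))

    def error_rate(self):
        """Fraction of failed requests in the current window (0.0 if empty)."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_fail_over(self):
        # Wait for a full window so a single early error does not trigger failover.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.error_rate() > self.threshold


def pick_region(monitors, primary, fallback):
    """Return the fallback region if the primary's error rate is above its threshold."""
    return fallback if monitors[primary].should_fail_over() else primary


# Example: half of the last 10 requests to us-east-1 failed.
monitor = ErrorRateMonitor(window=10, threshold=0.2)
for i in range(10):
    monitor.record(i % 2 == 0)
region = pick_region({"us-east-1": monitor}, "us-east-1", "us-west-2")
```

A sliding window is used here so that the decision reflects recent behaviour rather than the whole history of the service; a real system would likely also add a delay or manual confirmation before failing back.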

We continued to investigate possible causes of the observed errors, and at 12:37 PM EST AWS posted a message on their status page stating, “We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.” At this time, we noticed some intermittent difficulty with browsing share portals, and interacting with the Signiant console. Given the AWS status message, we opted to fail over Media Shuttle services to another region to mitigate the intermittent issues that our monitoring had picked up. It should be noted that Media Shuttle transfers were not affected at all during the course of this event. However, customers who had portals configured with S3 compatible storage may have experienced some problems with file operations for files in those portals between 11:08 AM EST and 7:27 PM EST.

As the day progressed, we continued to monitor AWS for updates and also closely watched for any issues with our own services. Some unexpected errors were observed in some of the Jet services that had been failed over, so a deeper investigation was performed. The root cause of the observed problem was determined to be a configuration issue in the failover region, and a fix was deployed to return service to normal. Based on information obtained during the investigation, it appears that customers may have had intermittent issues with Jet Hot Folder jobs between 11:00 AM EST and 3:00 PM EST. The impact would have been possible delays in processing the hot folder events, but due to built-in retries, all events were eventually processed and transfers would have been completed.
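The retry behaviour described above can be illustrated with a generic retry loop using exponential backoff. This is a sketch under stated assumptions and does not describe how Jet actually implements its retries: the function name `process_with_retries`, the attempt limit, and the delay values are hypothetical.

```python
import time


def process_with_retries(handler, event, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call handler(event), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised once max_attempts is exhausted.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(event)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Example: a handler that fails twice before succeeding (hypothetical event name).
calls = {"count": 0}
delays = []


def flaky_handler(event):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary upstream error")
    return f"processed {event}"


result = process_with_retries(flaky_handler, "hot-folder-event", sleep=delays.append)
```

With a pattern like this, a temporary upstream failure delays an event rather than dropping it, which matches the observed outcome of delayed but eventually completed processing.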

By 7:27 PM EST, all indications were that all issues had been resolved. We continued to closely monitor our services and finally closed out this incident at 8:43 PM EST.

Signiant SaaS services are designed to withstand major outages in cloud provider infrastructure, and although we fared quite well compared to many other SaaS services, the issues with underlying AWS services weren't quite as transparent as we would have liked them to be (which was compounded by multiple AWS services failing intermittently and concurrently). In particular, we rely on some AWS services to perform some of our failover activities, and in this case, those services were impacted. Going forward, we are looking at enhancements to some of our failover procedures that will add resiliency in the unlikely event that both our existing tooling and backup console access are failing.

Posted Dec 09, 2021 - 16:14 EST

Resolved

We are no longer seeing increased error rates related to AWS infrastructure outages.

Posted Dec 07, 2021 - 20:43 EST

Monitoring

We are no longer seeing any increased error rates. We will continue to monitor AWS to confirm a resolution to any of their outstanding issues. All transfers are operational.

Posted Dec 07, 2021 - 20:27 EST

Update

We have applied mitigations and are no longer seeing increased error rates in S3 compatible storage file operations. All products and transfers should be operational. We continue to monitor our services and AWS's status to confirm resolution to any of their outstanding issues.

Posted Dec 07, 2021 - 19:49 EST

Update

We continue to monitor the recovery of AWS services. Customers may experience timeouts conducting file operations on portals with S3 compatible storage. All other products and transfers should be operational. We continue to address any outstanding issues.

Posted Dec 07, 2021 - 19:06 EST

Update

AWS is starting to see recovery on most of its services. Customers may experience timeouts conducting file operations on portals with S3 compatible storage assigned. All other products and transfers should be operational. We expect all services to return to operational as we continue to address any outstanding issues.

Posted Dec 07, 2021 - 18:17 EST

Update

There continue to be issues with S3 compatible storage Media Shuttle portals. Customers may experience timeouts conducting file operations on portals with S3 compatible storage assigned. All other products and transfers should be operational. We continue to monitor error rates and AWS for resolution of its service outage.

Posted Dec 07, 2021 - 17:41 EST

Identified

While monitoring, we found another issue with S3 compatible storage Media Shuttle portals. Customers may experience issues conducting file operations on portals with S3 compatible storage assigned. All other products and transfers should be operational. We continue to monitor error rates and AWS for resolution of its service outage, as we are starting to see their infrastructure return to normal.

Posted Dec 07, 2021 - 17:08 EST

Monitoring

We have addressed remaining issues related to the Signiant console. All products and transfers should be operational. We continue to monitor error rates and AWS for resolution of the outage today (https://status.aws.amazon.com).

Posted Dec 07, 2021 - 16:20 EST

Update

Some transfers and other web UI features are still experiencing issues in the Signiant console. Work to address these issues continues. All other transfers should be operational. We continue to monitor AWS for resolution of their wider outage via https://status.aws.amazon.com. They currently have no ETR.

Posted Dec 07, 2021 - 15:40 EST

Update

Additional fixes have been applied to Jet resulting in significantly reduced error rates. Some transfers and other web UI features are still experiencing issues, and work to address these is still ongoing. All other transfers should be operational. We continue to monitor AWS for resolution of their wider outage via https://status.aws.amazon.com. They currently have no ETR.

Posted Dec 07, 2021 - 15:10 EST

Update

We are still seeing issues related to Jet jobs where the status and triggers are not reporting correctly. We are working towards a mitigation for this issue. All other transfers should be operational. We continue to monitor AWS for resolution of their wider outage via https://status.aws.amazon.com/. They currently have no ETR.

Posted Dec 07, 2021 - 14:30 EST

Update

We have restored service to the Signiant console. We are monitoring a new issue where Jet job transfers may have issues reporting status. We will continue to work on mitigation strategies for resolving these errors. We are also monitoring AWS for status updates on the larger outage. All other transfers are operational.

Posted Dec 07, 2021 - 13:57 EST

Update

We have signs of recovery on Media Shuttle portals. We continue to see issues with browsing the Signiant console and are working to apply mitigations to combat AWS issues. AWS is reporting some recovery in us-east-1 on https://status.aws.amazon.com/. Transfers continue to be operational at this time.

Posted Dec 07, 2021 - 13:43 EST

Update

We continue to see increased error rates on Media Shuttle portals and the Signiant console. Customers may experience timeouts when browsing portals or loading the Signiant console. We await fixes to issues reported by AWS on https://status.aws.amazon.com/. We also continue to apply mitigations to combat the continued issues. Transfers continue to be operational at this time.

Posted Dec 07, 2021 - 13:23 EST

Update

We are seeing increased error rates on Media Shuttle portals and the Signiant console. Customers may experience timeouts when browsing portals or loading the Signiant console. We continue to apply infrastructure mitigations to work around issues with AWS as reported on https://status.aws.amazon.com/. Transfers continue to be operational at this time.

Posted Dec 07, 2021 - 12:42 EST

Update

We are starting to see increased error rates on Media Shuttle portals and the Signiant console. Customers may experience timeouts when browsing portals or loading the Signiant console. We are applying infrastructure mitigations to work around continued issues with AWS as reported on https://status.aws.amazon.com/. Transfers should not be affected at this time.

Posted Dec 07, 2021 - 12:28 EST

Identified

We have mitigated the continuing AWS infrastructure issues for now. All services and transfers should be operational, but intermittent issues remain possible while AWS continues its investigation. We will continue to monitor AWS and make changes as required.

Posted Dec 07, 2021 - 11:42 EST

Update

We are continuing to investigate the issues with services interacting with the AWS us-east-1 region. We have determined that customers may have issues browsing local storage on Media Shuttle portals. App-less transfers to storage in the us-east-1 region may also have issues. We are narrowing in on an outage in AWS infrastructure and are awaiting updates from AWS to determine what services are affected. In the meantime, we continue to take steps to mitigate these issues as much as possible.

Posted Dec 07, 2021 - 11:19 EST

Investigating

We are investigating increased error rates with services interacting with the AWS us-east-1 region. Customers may have issues transferring to storage in the AWS us-east-1 region. Jet, Flight, or Media Shuttle transfers may be affected by this issue. We are taking steps to mitigate issues in that region and will provide further updates as we continue to investigate.

Posted Dec 07, 2021 - 11:05 EST

This incident affected: Flight (Flight Gateway Transfers), Media Shuttle (Portal Interface, Cloud Transfers), and Jet (Console).