Between 10:48 AM EST and 7:27 PM EST on Tuesday, December 7th, service interruptions within AWS (mostly in the us-east-1 region) caused intermittent issues for some Signiant services. AWS has not yet posted a post-event summary for this incident, but once published it should be available at: https://aws.amazon.com/premiumsupport/technology/pes/
At 10:48 AM EST our monitoring alerted us to a problem with app-less transfers in the us-east-1 region. Traffic was immediately directed away from this region to another region, so the impact on customers would have been minimal or non-existent.
While investigating the above failure, our monitoring started alerting us to other anomalies in the us-east-1 region of AWS. The AWS status page did not indicate any problems at this time. To mitigate the issues highlighted by our monitoring, we opted to fail over the services that were showing increased error rates to another region while we continued to investigate. The services in question were all related to our Jet product.
We continued to investigate possible causes of the observed errors, and at 12:37 PM EST AWS posted a message on their status page stating, “We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.” At this time, we noticed some intermittent difficulty with browsing share portals and interacting with the Signiant console. Given the AWS status message, we opted to fail over Media Shuttle services to another region to mitigate the intermittent issues that our monitoring had picked up. Media Shuttle transfers were not affected at any point during this event. However, customers who had portals configured with S3-compatible storage may have experienced some problems with file operations for files in those portals between 11:08 AM EST and 7:27 PM EST.
As the day progressed, we continued to monitor AWS for updates and also closely watched for any issues with our own services. Some unexpected errors were observed in some of the Jet services that had been failed over, so a deeper investigation was performed. The root cause of the observed problem was determined to be a configuration issue in the failover region, and a fix was deployed to return service to normal. Based on information obtained during the investigation, it appears that customers may have had intermittent issues with Jet Hot Folder jobs between 11:00 AM EST and 3:00 PM EST. In practice, processing of hot folder events may have been delayed, but due to built-in retries, all events were eventually processed and transfers would have been completed.
By 7:27 PM EST, all indications were that all issues had been resolved. We continued to closely monitor our services and finally closed out this incident at 8:43 PM EST.
Signiant SaaS services are designed to withstand major outages in cloud provider infrastructure, and although we fared quite well compared to many other SaaS services, the issues with underlying AWS services weren't quite as transparent as we would have liked them to be (which was compounded by multiple AWS services failing intermittently and concurrently). In particular, we rely on some AWS services to perform some of our failover activities, and in this case, those services were impacted. Going forward, we are looking at enhancements to some of our failover procedures that will add resiliency in the unlikely event that both our existing tooling and backup console access are failing.