At 9:16 AM EDT on April 8th, 2020, our monitoring systems alerted us to an issue with browsing Media Shuttle portals. After initial investigation and manual verification of the problem, we switched all Media Shuttle web traffic to our standby environment and posted a status page update. Once traffic was switched, both manual verification and our automated monitoring systems showed that the issue was no longer occurring as of 9:23 AM EDT.
Between the detection of the problem and the switch to the standby environment, some customers experienced slow page load times, and in some cases timeouts, when accessing Media Shuttle portals. There was no impact on Media Shuttle transfers in progress or on the functionality of any other Signiant SaaS products.
After the issue was resolved, further investigation showed that two servers responsible for serving the Media Shuttle web interface had failed their healthchecks simultaneously. Healthchecks are a standard mechanism for determining whether a server is able to process requests, and the failures caused both servers to be removed from the pool of available servers. Although two servers represent a small portion of the overall server capacity, the resulting reduction was enough to trigger the observed issues. Our environment automatically scales the pool when load increases or when servers are removed and need to be replaced, but in this case the scaling did not happen quickly enough, so manual intervention was required.
Going forward, to prevent this problem from recurring, we are increasing the spare capacity in the pool of servers responsible for the Media Shuttle web interface and lowering the thresholds that trigger scaling of that pool. We are also continuing to investigate what caused the healthchecks on two servers to fail simultaneously.