To start, we would like to let everyone know that connectivity to our GNAX data center in Atlanta has been restored and all sites are back online. Please keep reading for an explanation of the events that led up to the outage and what we did to restore connectivity.
At 9:30am EST we started receiving alerts from our monitoring systems that one of our customers' servers was no longer accessible via SSH. There were only a few alerts at first, so we began blocking the source IPs at the server level and investigating each event to correlate a cause.
At 9:45am EST the frequency of the alerts increased, and a trend emerged pointing to a single IP range as the target of the connection flooding. At this stage we were still blocking the offending IP addresses at the server level, since the attack appeared isolated to only a few target servers.
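For readers curious what "blocking at the server level" means in practice, it amounts to adding a firewall drop rule for the offending source address on each affected machine. Below is a minimal, illustrative sketch of that idea in Python wrapping iptables; it is not our exact tooling, and it assumes a Linux host with iptables available and root privileges.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: drop SSH traffic from an offending source IP.

Illustrative only, not our actual tooling. Assumes a Linux host with
iptables installed and root privileges.
"""
import subprocess
import sys

def block_ssh_from(source_ip: str) -> None:
    # Insert a DROP rule at the top of the INPUT chain for TCP port 22
    # traffic coming from the offending address.
    subprocess.run(
        ["iptables", "-I", "INPUT",
         "-s", source_ip,
         "-p", "tcp", "--dport", "22",
         "-j", "DROP"],
        check=True,
    )

if __name__ == "__main__":
    block_ssh_from(sys.argv[1])
```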
At 9:50am EST, when the alerts started increasing rapidly, we confirmed the target was a single IP range and not a specific customer site. Even as we blocked the incoming traffic, the failed SSH connection counts kept climbing as the attack progressed through the IPs in the target range. At this point the attack had not affected the availability of applications hosted on the target servers, only access to them via SSH.
At 9:53am EST the inbound SSH connections flooded our edge networking gear, and additional team members were pulled in to assist. We contacted our data center team to help drop all inbound traffic so we could apply a fix to our edge networking gear.
By 10:00am EST we had blocked all incoming traffic from the attack source on our edge networking gear. After verifying with our data center that these measures were in place, all traffic was allowed to pass again and our monitoring systems began reporting that alerts were recovering.
By 10:10am EST all alerts had cleared and we had completed manual verification of the affected sites.
One of the countermeasures most data centers employ to mitigate incoming DoS and DDoS attacks is to null route all inbound traffic to the target. This prevents the attack from saturating upstream connectivity and causing a larger-scale outage across all data center customers sharing a particular incoming uplink.
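For those unfamiliar with the term, a null route (or blackhole route) simply tells a router to silently discard packets destined for the targeted address, so the attack traffic is dropped before it can fill the uplink. As a rough sketch of the concept only (not our data center's actual process), on a Linux-based router it could look like this; the address below is a placeholder from the documentation range.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: null route (blackhole) a targeted address.

Illustrative only. Assumes a Linux-based router with iproute2 and root
privileges; the target address is a documentation-range placeholder.
"""
import subprocess

def null_route(target_ip: str) -> None:
    # Packets destined for target_ip are silently discarded by the kernel,
    # so attack traffic is dropped before it can saturate the uplink.
    subprocess.run(
        ["ip", "route", "add", "blackhole", f"{target_ip}/32"],
        check=True,
    )

if __name__ == "__main__":
    null_route("203.0.113.10")  # placeholder target address
```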
An attack like this is slow to detect at the data center level because of the sheer volume of traffic needed to flood their upstream connections, but it is more easily identified at the data center customer level (us), where the targeted attack is taking place. Our monitoring systems, by default, check SSH connectivity as well as HTTP/HTTPS connectivity and response times, among other things. This helps us identify the type and source of an attack more quickly and lets us put stopgap measures in place until a final solution is determined and applied.
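As a rough idea of the kind of per-host checks those alerts are based on, here is a small illustrative sketch in Python; the hostname and thresholds are placeholders, not our actual monitoring configuration.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: simple SSH and HTTP reachability/response-time checks.

Illustrative only; the hostname below is a placeholder, not a real
monitored site.
"""
import socket
import time
import urllib.request

def check_ssh(host: str, timeout: float = 5.0) -> float:
    # Time a plain TCP connect to port 22; raises an exception on failure.
    start = time.monotonic()
    with socket.create_connection((host, 22), timeout=timeout):
        pass
    return time.monotonic() - start

def check_http(url: str, timeout: float = 5.0) -> float:
    # Time a simple GET; raises an exception on connection failure or an
    # HTTP error status.
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout):
        pass
    return time.monotonic() - start

if __name__ == "__main__":
    host = "www.example.com"  # placeholder host
    print("ssh  %.3fs" % check_ssh(host))
    print("http %.3fs" % check_http(f"http://{host}/"))
```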
At this time, all connectivity has been restored and steps to more quickly mitigate these types of attacks have been put in place to help prevent future outages.
Please file a support request if you're still having issues. Thanks!