We had the chance to fully review last night's outage with our system admin. The main cause of the problem was a disk space limit set in our virtual environment. After the initial migration period of moving to SoftLayer, everything has been very smooth and the servers have been handling our fast growth very well.
The reason we hit the limit was pretty embarrassing, but I guess it can happen to all of us. We have more than enough storage on our servers, including iSCSI storage that allows us to expand quickly, but a backup service that we set up in the beginning as a test was processing backups on the server and eating up disk space each time it ran. At a certain point, we hit the limit and it killed our sites.
We do have backups on S3 and in SoftLayer's remote datacenter; this was just the result of an initial process from our migration that was still running on the server.
The problem was trivial and should have been resolved much faster. I admit that our response time in fixing the issue was awful, as was our ability to communicate about it. We are updating our monitoring services to provide more redundancy, and we are working on ways to communicate with you if this ever happens again, possibly a Twitter feed or hosting our blog in a remote datacenter. Once it is ready, we will email everyone.
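For anyone curious what a basic disk space check looks like, here is a minimal sketch in Python. It is only an illustration, not our actual monitoring setup: the function names, the "/" path, and the 80% threshold are all assumptions made up for this example, and a real check would send an alert rather than print.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return how full the filesystem containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0):
    """Return a warning message if usage is at or above `warn_at` percent.

    `warn_at` is a hypothetical threshold chosen for illustration.
    Returns None when usage is below the threshold.
    """
    percent = disk_usage_percent(path)
    if percent >= warn_at:
        return f"WARNING: {path} is {percent:.1f}% full"
    return None


if __name__ == "__main__":
    # Run periodically (for example from cron) so a slowly filling
    # disk is noticed well before it reaches its limit.
    message = check_disk("/")
    if message:
        print(message)
```

A check like this, run every few minutes against the volume the backups were writing to, would flag a disk that is filling up long before it hits the limit.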
We apologize for any problems this may have caused. Please get in touch if you have any further concerns.
To receive updates faster, please subscribe to our RSS feed for now.