Last Tuesday around 5pm EST, Beanstalk experienced a pretty major outage. It reminded me of the days before we had Engine Yard around. While we do battle with performance issues and periodic slowdowns, an outage like this is pretty rare. I want to explain what happened.
At Engine Yard, we have a group of slices that run SVN, our web servers, and our backend processes for things like deployments, integration tools, and so on. All of these slices use a shared storage system called GFS. This allows us to scale to many slices while each one still has access to the same essential data.
When the outage occurred, one of our slices had a severe memory problem. We attempted to fix it with Engine Yard’s help, but due to the nature of GFS we first had to make sure there was no data corruption. This forced us to shut down the entire environment while we checked the disks. It’s basically a precaution to make sure we don’t reboot with corrupted data.
Since we have so much data, the disk checks took a really long time, which turned the shutdown into an extended, full-environment outage. It was the best approach to make sure we safely brought the environment back, but obviously not ideal for those who needed access to their code. After the disk checks and reboot, everything was back to normal and the data was verified.
When things like this happen, we always need to reflect and figure out ways to prevent them from happening again. In this situation, we need to make sure we have enough resources to handle severe spikes that might cause full environment problems. I am confident we have added the necessary resources and can keep this under control in the future.
More updates coming
We recently got back from RailsConf. I have to say that hanging out with Engine Yard was the most valuable part of being there. We had a great opportunity to sit down with the team and discuss better ways for us to scale Beanstalk. There are many options, so choosing the right path can be tricky. We feel like we came up with a good plan to not only scale the service, but also isolate our shards (nodes) to avoid system-wide issues when things go wrong. As of now, many of you should have already noticed SVN speed improvements.
We have tons of cool features to work on, but our first priority is performance. It’s amazing how challenging and rewarding a successful web product can be at the same time. We’re excited about our growth and ready to take on the challenge of continuing to scale the service. Thanks for being patient and helping us grow.