Andrey, November 12th, 2008

Using Amazon S3 to improve performance (using ASP.NET)

Most websites store a large number of media files: photos, audio, video and so on. That data takes up a lot of space on the server and, even more importantly, it eats your bandwidth and server resources. When we were working on Nuzizo we faced this problem. To improve performance, we set up an automated system using ASP.NET and Amazon S3.

The challenge: improve front-end performance

Nuzizo is a big social app, well designed, with a lot of great functionality (great to use and really fun to implement). It allows people to share and listen to music in real time, share photos, video, etc. One day we were faced with the challenge to speed up the site.

We took a lot of steps guided by YSlow, like reducing the number of JavaScript references, compressing scripts, reducing the size of each page, etc. One of our steps was to move all of the website's and users’ personal data, like images, video and scripts, to another server that is more distributed than ours. We decided to use S3. It’s really awesome to use since you don’t have to worry about the server side or how the data is stored. Of course it’s not free, but it’s pretty cheap.

It’s easy to say that you will move all media files to another server, but it is very hard to do. By the time we faced the problem, our project was already very large. This meant that we had to review each page, find all links and references, and replace them with references to the S3 server. Imagine how hard this was. Also imagine that Amazon could go down. I understand, it’s almost impossible, but it has happened already. So every time we place a link to an image or media file, we need to know which link to use: our server or S3.

The solution: Response.Filter

All this stuff makes a developer’s life completely crazy. So we were faced with the problem of how to do it with as little impact on the business logic as possible.

One option is to use some kind of URL rewriter that catches requests for images or data and redirects them to S3. That works, but the server still processes every request for images or data. So our goal was to replace all references to media and images in one place. Another goal was to make it easy to switch between serving data from our server and from S3.

I started to dig into the challenge and realized that the only way was to process the final page after it had already been rendered by ASP.NET. At that point we can use regular expressions to replace all the links we need. One more benefit is that we can log the final page to a file and then check the HTML consistency with other tools.
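To make the idea concrete, here is a minimal sketch of rewriting links in an already rendered HTML string. The class name MediaLinkRewriter, the simplified pattern and the example URLs are assumptions made for illustration; this is not the code used on the site.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original project.
public static class MediaLinkRewriter
{
    // Matches absolute http/https URLs that end in a known media or script extension.
    private static readonly Regex MediaUrl = new Regex(
        @"https?://[\w./-]+\.(bmp|png|jpg|gif|js|css|flv|swf|mp3)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Replaces the site root at the start of each matched URL with the S3 root.
    public static string Rewrite(string html, string siteRoot, string s3Root)
    {
        return MediaUrl.Replace(html, delegate(Match m)
        {
            return m.Value.StartsWith(siteRoot, StringComparison.OrdinalIgnoreCase)
                ? s3Root + m.Value.Substring(siteRoot.Length)
                : m.Value; // URL on some other host: leave it untouched
        });
    }
}
```

Calling MediaLinkRewriter.Rewrite(html, "http://example.com/", "http://media.example.test/") would change image and script URLs under the site root while leaving ordinary page links such as .aspx URLs alone.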

OK, what was next? I had to find a way to get the rendered page. My first idea was to intercept the event that fires right after the page is rendered and read the buffer of the HttpContext.Response object. For some reason I could not make it work: I got only part of the final HTML, even when I was sure that rendering had completed. I looked for a fix but nothing came of it. After some time I decided to try another approach: substitute the output buffer with my own, so that the system puts the rendered page in there, and then replace the links and send the result to the client. After some digging, I found the solution.

The Response.Filter property was my real lifesaver

I never knew about it, but it’s really awesome. So, what does MSDN say about this property?

Gets or sets a wrapping filter object that is used to modify the HTTP entity body before transmission.

When you create a Stream object and set the Filter property to the Stream object, all HTTP output sent by Write passes through the filter.

That was the solution!
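As a rough illustration of what such a wrapping filter looks like, here is a minimal sketch of a Stream that transforms everything written through it before passing it on. The class name UpperCaseFilter and the upper-casing transform are assumptions chosen only to keep the example small, and it assumes the output is UTF-8 text.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical filter: upper-cases everything written through it.
public class UpperCaseFilter : Stream
{
    private readonly Stream _inner;

    public UpperCaseFilter(Stream inner) { _inner = inner; }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Decode only the bytes that belong to this call, transform, and forward.
        string text = Encoding.UTF8.GetString(buffer, offset, count);
        byte[] bytes = Encoding.UTF8.GetBytes(text.ToUpperInvariant());
        _inner.Write(bytes, 0, bytes.Length);
    }

    public override void Flush() { _inner.Flush(); }

    // This sketch is write-only, so the remaining members are stubs.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

Inside a page it would be installed with Response.Filter = new UpperCaseFilter(Response.Filter); so that the existing filter stays at the end of the chain.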

Next steps were much easier for me:

I created a base page that inherits from System.Web.UI.Page and named it CustomContentPage.
I changed the base class of all pages on the site so that they all inherit from my new CustomContentPage.
In my custom content page I implemented a check that lets us easily switch between S3 and our server. The flag lives in web.config. So if S3 is down, or you just want to use the web server (for instance when developing the site), you can flip the flag and the server will use local storage automatically:

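using System;
using System.Web.UI;

// Base class for every page on the site. It installs the link-rewriting
// filter when S3 is enabled in web.config.
public class CustomContentPage : Page
{
    private string _newImagesPath = AppSettings.S3Preferences.ImagesPath;
    private string _originalPath = AppSettings.AppAbsoluteRoot;

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);

        // Skip the filter when S3 is switched off or points at our own server.
        if (AppSettings.S3Preferences.IsEnabled && _newImagesPath != _originalPath)
            Response.Filter = new CustomContentStream(Response.Filter);
    }
}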

As you can see, before I create a new intercepting Stream I check whether S3 usage is enabled and whether the path to the S3 server is different from the path to the application.
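The AppSettings.S3Preferences object used in the listing above is not shown in this post. A minimal sketch of how such a settings holder could be built is given below. The key names S3.Enabled and S3.ImagesPath and the FromSettings method are assumptions for illustration, not the project's real configuration.

```csharp
using System;
using System.Collections.Specialized;

// Hypothetical stand-in for the AppSettings.S3Preferences object used above.
public class S3Preferences
{
    public bool IsEnabled { get; private set; }
    public string ImagesPath { get; private set; }

    // In a web application the collection would typically be
    // System.Configuration.ConfigurationManager.AppSettings, which exposes the
    // <appSettings> section of web.config. The key names are assumptions.
    public static S3Preferences FromSettings(NameValueCollection settings)
    {
        bool enabled;
        // A missing or malformed flag falls back to "disabled" (serve locally).
        bool.TryParse(settings["S3.Enabled"], out enabled);

        S3Preferences prefs = new S3Preferences();
        prefs.IsEnabled = enabled;
        prefs.ImagesPath = settings["S3.ImagesPath"] ?? string.Empty;
        return prefs;
    }
}
```

Defaulting to "disabled" when the flag is absent means a misconfigured server keeps serving files locally instead of pointing pages at a bucket that may not exist.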

I created CustomContentStream, where all the rewriting logic is implemented:

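using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using log4net; // assumption: ILog/LogManager come from log4net; adjust for your logging library

public class CustomContentStream : System.IO.Stream
{
    private static ILog log = LogManager.GetLogger(typeof(CustomContentStream));
    private string _newImagesPath = AppSettings.S3Preferences.ImagesPath;
    private string _originalPath = AppSettings.AppAbsoluteRoot;
    private int _notProcessedCount = 0;
    private Stream _originalStream;

    public CustomContentStream(Stream responseStream)
    {
        _originalStream = responseStream;
    }

    // The members below simply delegate to the wrapped response stream.
    public override bool CanRead { get { return _originalStream.CanRead; } }
    public override bool CanSeek { get { return _originalStream.CanSeek; } }
    public override bool CanWrite { get { return _originalStream.CanWrite; } }
    public override long Length { get { return _originalStream.Length; } }
    public override long Position
    {
        get { return _originalStream.Position; }
        set { _originalStream.Position = value; }
    }
    public override void Flush() { _originalStream.Flush(); }
    public override long Seek(long offset, SeekOrigin origin) { return _originalStream.Seek(offset, origin); }
    public override void SetLength(long value) { _originalStream.SetLength(value); }
    public override int Read(byte[] buffer, int offset, int count) { return _originalStream.Read(buffer, offset, count); }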

So far each abstract member of Stream just forwards to the original stream that we pass in the constructor. The Write method is where the rewriting happens:

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Decode only the bytes that belong to this call (offset/count), not the whole buffer.
        StringBuilder originalContent = new StringBuilder(Encoding.UTF8.GetString(buffer, offset, count));
        RewriteImages(originalContent);
        byte[] bytes = Encoding.UTF8.GetBytes(originalContent.ToString());
        _originalStream.Write(bytes, 0, bytes.Length);
    }

    private StringBuilder _tmpBuilder;

    private void RewriteImages(StringBuilder originalContent)
    {
        _tmpBuilder = new StringBuilder();
        // Verbatim string (@) so that \w and \. reach the regex engine as written.
        DoRewrite(@"(http|https|~)?://[\w./-]+\/[\w./-]+\.(bmp|png|jpg|gif|js|css|flv|swf|php|mp3)", originalContent);
    }

    // Runs the pattern over the content and replaces the content in place.
    private void DoRewrite(string matchToRewrite, StringBuilder originalContent)
    {
        Regex reg = new Regex(matchToRewrite);
        MatchEvaluator matchEvaluator = new MatchEvaluator(ReplaceImagesPath);
        _tmpBuilder.Length = 0;
        _tmpBuilder.Append(reg.Replace(originalContent.ToString(), matchEvaluator));
        originalContent.Length = 0;
        originalContent.Append(_tmpBuilder.ToString());
    }

    // Called once per matched URL; returns the URL to emit in its place.
    public string ReplaceImagesPath(Match m)
    {
        // These resources must keep being served from our own server.
        if (m.Value.IndexOf("tiny_mce.js") != -1)
            return m.Value;
        else if (m.Value.IndexOf("/Players/") != -1)
            return m.Value;
        else if (m.Value.IndexOf("map.swf") != -1)
            return m.Value;

        string result = m.Value.Replace(_originalPath, _newImagesPath);
        result = result.Replace("~", _newImagesPath);

        // Count matches that were left unchanged (for example, external URLs).
        if (result == m.Value)
            _notProcessedCount++;
        return result;
    }
}

As you can see, all I did was create a wrapper around the real stream. When the page is loading, Response.Filter contains the original stream that is used by default. Whatever HTML the page and its controls render is sent through that property’s value. In other words, once our filter is installed, the rendered output reaches the client through calls to our Write(byte[] buffer, int offset, int count) method.

So this is the point where we can rewrite our image links. I used regular expressions, as you can see in the DoRewrite method. Every time output is written, DoRewrite is called and we can do anything we like with the final HTML.
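One caveat of rewriting inside each Write call is that a URL (or a multi-byte character) can be split across two calls when the output arrives in several chunks, and the regular expression would then miss it. A possible variant, sketched below, collects the whole response and rewrites it once. The class name BufferedRewriteStream is hypothetical, and the sketch assumes that ASP.NET closes the filter at the end of the request and that the response is UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical variant: collects the whole response, then rewrites it once.
public class BufferedRewriteStream : Stream
{
    private readonly Stream _inner;
    private readonly Func<string, string> _rewrite;
    private readonly MemoryStream _buffer = new MemoryStream();
    private bool _closed;

    public BufferedRewriteStream(Stream inner, Func<string, string> rewrite)
    {
        _inner = inner;
        _rewrite = rewrite;
    }

    // Only accumulate; no rewriting happens per chunk.
    public override void Write(byte[] buffer, int offset, int count)
    {
        _buffer.Write(buffer, offset, count);
    }

    // Nothing is forwarded until the stream is closed.
    public override void Flush() { }

    public override void Close()
    {
        if (!_closed)
        {
            _closed = true;
            // Decode the complete page, rewrite it, and send it on in one piece.
            string html = Encoding.UTF8.GetString(_buffer.ToArray());
            byte[] bytes = Encoding.UTF8.GetBytes(_rewrite(html));
            _inner.Write(bytes, 0, bytes.Length);
            _inner.Flush();
        }
        base.Close();
    }

    // Write-only stream: the remaining members are stubs.
    public override bool CanRead { get { return false; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return true; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override int Read(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
}
```

The trade-off is memory: the entire page is held in the buffer until the end of the request, so this suits ordinary HTML pages better than large downloads.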

As you can see, there is no need to rewrite all the pages to implement this. The only thing you need is to make each page inherit from the new CustomContentPage class. I usually set up such a base class by default when starting a project, because there are always situations where you need to do something common for all pages.

The results

The final result was fantastic. Here are the main benefits we achieved:

By using a CDN, images and media now download simultaneously, since they come from a different host than the page itself.
We reduced the load of processing requests on our server.
Storage and bandwidth usage on our server were greatly reduced, with S3 as a cheaper and more effective alternative.
The ability to switch back and forth not only gives us a quick fallback if S3 is down, but also lets us refresh cached items if needed.

If you have recommendations or feedback, I’d love to hear it.