Eric Florenzano's Blog

I am me.

Reducing Code Nesting

"This guy’s code sucks!" It’s something we’ve all said or thought when we run into code we don’t like. Sometimes it’s because it’s buggy, sometimes it’s because it conforms to a style we don’t like, and sometimes it’s because it just feels wrong. Recently I found myself thinking this, and automatically jumping to the conclusion that the developer who wrote it was a novice. The code had a distinct property that I dislike: lots of nesting. But the more I think about it, the more I realized that it’s not really something I’ve heard discussed much.

So let’s talk about it. I’m going to first talk about what I mean by nesting, why I think it’s a bad quality, and then I’m going to go over some tricks I’ve learned over the years to reduce it.

What do I mean by code nesting, and why is it bad?

It’s easier to demonstrate rather than talk about it. This is what I mean by deep code nesting, with my apologies for the contrived example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
def get_cached_user(user_id=None, username=None):
    """
    Returns a cached user object either by their user_id or by their username.
    """
    user = cache.get_user_by_id(user_id)
    if not user:
        user = cache.get_user_by_username(username)
        if not user:
            user = db.get_user_by_id(user_id)
            if not user:
                user = db.get_user_by_username(username)
                if not user:
                    raise ValueError('User not found')
            cache.set_user(user, id=user.id, username=user.username)
    return user

You can see in this Python code just by looking at the indentation level that there’s lots of nesting. Before we can determine that the user was not found, we must pass through four conditionals, and each conditional is nested within the previous conditional.

I argue that this is bad code. Every added level of nesting is another piece of context that your brain has to keep track of. Each nested block is one you have to line up by eye to see what conditional it lines up with (even if your editor helps at this with visuals, it doesn’t remove the issue entirely.) And this is just a straightforward example where we just return the user at the end, let’s take a look at an example that does something more complicated:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def get_media_details(media):
    """
    Returns a dictionary of extra data about the given media object.
    """
    data = {}
    if media.is_video:
        data['kind'] = 'video'
        if media.is_youtube:
            data['url'] = 'http://youtube.com/'
        if media.is_vimeo:
            data['vimeo'] = True
            if media.vimeo_version == 2:
                data['url'] = 'http://vimeo.com/v2/'
        if 'url' in data:
            data['secure_url'] = data['url'].replace('http:', 'https:')
    elif media.is_audio:
        data['kind'] = 'audio'
    elif media.is_text:
        data['kind'] = 'text'
    if 'kind' in data:
        data['kind_verbose'] = {
            'video': 'Video Stream',
            'audio': 'Audio File',
            'text': 'Text Content',
        }[data['kind']]
    return data

It was unbelievably hard for me to even write that last example. It’s obviously contrived and such, but the point is that it’s so difficult to even understand what it’s doing. Unlike the previous example, this doesn’t simply nest and then return; it nests and then un-nests, and then nests again, and then finally returns.

How to Avoid Nesting

The best way that I’ve discovered to avoid nesting is to return early. Caching is the perfect example of this. Instead of testing for a cache failure and fetching from the database inside the conditional, check for cache success and return that early.

So this code:

1
2
3
4
5
6
7
def get_cached_user(user_id):
    user = cache.get_user_by_id(user_id)
    # The main logic all happens in this nested block
    if not user:
        user = db.get_user_by_id(user_id)
        cache.set_user_for_id(user_id, user)
    return user

Becomes this:

1
2
3
4
5
6
7
8
def get_cached_user(user_id):
    user = cache.get_user_by_id(user_id)
    if user:
        return user
    # The main logic happens outside of the nested block
    user = db.get_user_by_id(user_id)
    cache.set_user_for_id(user_id, user)
    return user

In the simple case, it doesn’t seem to improve much, but what happens if we apply this technique to our first example? It’s dramatically improved:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
def get_cached_user(user_id, username):
    # First check the cache by id
    user = cache.get_user_by_id(user_id)
    if user:
        return user

    # Now check the cache by username
    user = cache.get_user_by_username(username)
    if user:
        return user

    # Both caches failed, so try hitting the db for the id
    user = db.get_user_by_id(user_id)
    if user:
        cache.set_user(user, id=user.id, username=user.username)
        return user

    # Looks like that didn't exist, try the username
    user = db.get_user_by_username(username)
    if not user:
        raise ValueError('User not found')

    # Cache our final user value for future use
    cache.set_user(user, id=user.id, username=user.username)
    return user

Not only does it make it easier to read top-to-bottom, and force us to keep track of way less context, and make our code editors do less line wrapping, but it also makes it easier to separate the blocks of code and more easily comment them.

So what other techniques can we use? It starts to depend more on the situation. Are you nesting because you’re writing a bunch of callbacks? If so, you can usually restructure your code to use named functions instead of anonymous functions. Here’s how that would might look before refactoring:

1
2
3
4
5
6
7
8
9
10
11
12
function getCachedUser(userId, callback) {
    cache.getUser(userId, function(user) {
        if(user) {
            return callback(user);
        }
        db.getUser(userId, function(user) {
            cache.setUser(userId, user, function() {
                callback(user);
            });
        });
    });
}

Note that in this example we even applied the technique of returning early in the first callback function, but as you can see there’s still a bunch of nesting going on. Now if we switch to using named functions?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
function curry(fn) {
    var slice = Array.prototype.slice;
    var args = slice.apply(arguments, [1]);
    return function () {
        return fn.apply(null, args.concat(slice.apply(arguments)));
    };
}

function final(callback, user) {
    callback(user);
}

function dbResult(callback, userId, user) {
    cache.setUser(userId, user, curry(final, callback, user));
}

function cacheResult(callback, userId, user) {
    if(user) {
        return callback(user);
    }
    db.getUser(userId, curry(dbResult, callback, userId));
}

function getCachedUser(userId, callback) {
    cache.getUser(userId, curry(cacheResult, callback, userId));
}

This is a lot better in terms of nesting. Unfortunately we had to write a helper function called curry, but that only has to be written once and can be re-used for all code written in this style. Also unfortunately I still find this kind of code difficult to follow, which is why I avoid writing much callback-style code. However, at least you can reduce the nesting. In all honesty, there are probably better ways of reducing nesting that I’m not aware of. If you can rewrite the getCachedUser function in JS in a better way, please blog it!

Another way to reduce nesting is to assign an intermediate variable. Here’s an example in Erlang of some file function that nests a case statement within another case statement.

1
2
3
4
5
6
7
8
9
10
11
12
13
do_some_file_thing(File) ->
    case file:open(File, [raw, binary, read]) of
        {ok, Fd} ->
            Start = now(),
            case process_file_data(Fd) of
                {ok, Processed} ->
                    {ok, Start, now(), Processed};
                Error ->
                    Error
            end;
        Error ->
            Error
    end.

We can assign to an intermediate "Resp" variable, and bring that second case statement out into the function’s main code block, like so:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
do_some_file_thing(File) ->
    Resp = case file:open(File, [raw, binary, read]) of
        {ok, Fd} ->
            {timestamp, now(), process_file_data(Fd)};
        Error ->
            Error
    end,
    case Resp of
        {timestamp, Start, {ok, Processed}} ->
            {ok, Start, now(), Processed};
        {timestamp, Start, Error} ->
            Error;
        Error ->
            Error
    end.

What does this all mean?

At the end of the day, this isn’t going to make or break you as a programmer. In fact, nothing I’ve mentioned even changes the code’s logic, but simply its implementation. It’s simply something to think about as you code, as you read other people’s code. Hopefully you agree with me that less nesting is an admirable goal, and you find more and more ways to achieve it.

Discuss this post on Hacker News.

The Technology Behind Convore

We launched Convore last week, and the first question developers tend to ask when they find Convore is "what technology powers this site?" It is asked so often, in fact, that we have started to copy and paste the same short response again and again. That response was good enough to satisfy people who simply wanted to know if we were Rails or Django, or whether we were using node.js for the real-time stuff, but this article will expand upon that– not only giving more details for the curious, but also giving us a link to point people at when they ask the question in the future. I always wish other people were totally open about their architectures, so that I can learn from their good choices and their bad, so I’d like to be as open as possible about ours. Let’s dive in!

The basics

All of our application code is powered by Python. Our front-end html page generation is done by Django, which we use in a surprisingly traditional way given the real-time nature of Convore as a product. Everything is assembled at once: all messages, the sidebar, and the header are all rendered on the server instead of being pulled in after-the-fact with JavaScript. All of the important data is canonically stored in PostgreSQL, including messages, topics, groups, unread counts, and user profiles. Search functionality is provided by Solr, which is interfaced into our application by way of the handy Haystack Django application.

The message lifecycle

When a new message comes into the system, first it’s parsed by a series of regular expressions designed to pull out interesting bits of information from the message. Right now all we’re looking for is username references and links (and further, whether those links point at images which should be rendered in-line.) At the end of this parsing stage, we have a structured message parse list, which is converted into JSON.

So, for example if someone posted the message:

1
@ericflo @simonw Here's how we connect/disconnect from Redis in production: http://dpaste.com/406797/

The resulting JSON parse list would look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
[
    {
        "type": "username",
        "user_id": 1,
        "username": "ericflo",
        "markup": "<a href=\"/users/ericflo/\">@ericflo</a>"
    },
    {
        "type": "username",
        "user_id": 56,
        "username": "simonw",
        "markup": " <a href=\"/users/simonw/\">@simonw</a>"
    },
    {
        "type": "text",
        "markup": " Here&#39;s how we connect/disconnect from Redis in production: "
    },
    {
        "type": "url",
        "url": "http://dpaste.com/406797/",
        "markup": "<a href=\"http://dpaste.com/406797/\" target=\"_blank\">http://dpaste.com/406797/</a>"
    }
]

After this is constructed, we log all our available information about this message, and then save to the database– both the raw message as it was received, and the JSON-encoded parsed node list.

Now a task is sent to Celery (by way of Redis) notifying it that this new message has been received. This Celery task now increments the unread count for everyone who has access to the topic that the message was posted in, and then it publishes to a Redis pub/sub for the group that the message was posted to. Finally, the task scans through the message, looking for any users that were mentioned in the message, and writes entries to the database for every mention.

On the other end of that pub/sub are the many open http requests that our users have initiated, which are waiting for any new messages or information. Those all simultaneously return the new message information, at which point they reconnect again, waiting for the next message to arrive.

The real-time endpoint

Our live updates endpoint is actually a very simple and lightweight pure-WSGI Python application, hosted using Eventlet. It spawns off a coroutine for each request, and in that coroutine, it looks up all the groups that a user is a member of, and then opens a connection to Redis subscribing to all of those channels. Each of these Eventlet-hosted Python applications has the ability to host hundreds-to-thousands of open connections, and we run several instances on each of our front-end machines. It has a few more responsibilities, like marking a topic as read before it returns a response, but the most important thing is to be a bridge between the user and Redis pub/sub.

Future improvements

There are so many places where our architecture can be improved. This is our first version, and now that real users are using the system, already some of our initial assumptions are being challenged. For instance, we thought that pub/sub to a channel per group would be enough, but what that means is that everyone in a group sees the exact same events as everyone else in that group.

This means we don’t have the ability to customize each user’s experience based on their preferences–no way to put a user on ignore, filter certain messages, etc. It also means that we aren’t able to sync up a user’s experience across tabs or browsers, since we don’t really want to broadcast to everyone in the group that one user has visited a topic, thereby removing any unread messages in that topic. So going forward we’re going to have to break up that per-group pub/sub into per-user pub/sub.

Another area that could be improved is our unread counts. Right now they’re stored as rows in our PostgreSQL database, which makes it extremely easy to batch update them and do aggregate queries on them, but the number of these rows is increasing rapidly, and without some kind of sharding scheme, it will at some point become more difficult to work with such a large amount of rows. My feeling is that this will eventually need to be moved into a non-relational data store, and we’ll need to write a service layer in front of it to deal with pre-aggregating and distributing updates, but nothing is set in stone just yet.

Finally, Python may not be the best language for this real-time endpoint. Eventlet is a fantastic Python library and it allowed us to build something extremely fast that has scaled to several thousand concurrent connections without breaking a sweat on launch day, but it has its limits. There is a large body of work out there on handling a large number of open connections, using Java’s NIO framework, Erlang’s mochiweb, or node.js.

That’s all folks

We’re pretty proud of what we’ve built in a very short time, and we’re glad it has held up as well as it has on our launch day and afterwards. We’re excited about the problems we’re now being faced with, both scaling the technology, and scaling the product. I hope this article has quenched any curiosity out there about how Convore works. If there are any questions, feel free to join Convore and ask away!

(Or discuss it on Hacker News)

An Object Lesson in How to Respond to Criticism

In the moments leading up to my DjangoCon keynote this year (called Why Django Sucks, and How We Can Fix It), I pictured the room an hour in the future and imagined what things would be like. I envisioned all kinds of scenarios: one where people were literally booing, or another where I needed to sneak back to my hotel room to avoid embarassment. I imagined all of these scenarios because I was about to level some harsh criticism at a technology that everyone in the room was enthusiastic about.

And then I went on stage and gave the talk.

When it was finished, I braced myself for one of these scenarios to unfold. Instead, what happened in the following moments, the following days, and indeed the following month since then has been what I consider to be the most gracious and useful response to criticism that I’ve ever seen. And I think it deserves to be highlighted, because I believe that reacting well to criticism is vital for any successful community.

So what happened in the moments directly following the talk? Russell Keith-Magee, arguably the most active core developer at the time, made a special point to march up to the front of the stage and publicly shake my hand. This gesture not only legitimized some of the things that I said (some of which were quite extreme), but it also set the tone for the discourse around my talk; civility.

Afterwards, privately, I apologized to several of the Django core developers. These are people I respect and admire, and I wanted to make sure that they knew that, at least on my end, it wasn’t personal. I was expecting forgiveness, but instead I received thanks. Thanking me for criticising their project? Not what I was expecting.

One of the things that I suggested was to make Alex Gaynor a core developer. A few people at the conference made comments wondering how many days it would take before that would happen. Days went by, and the conference ended without this happening. A week went by. A few more. But I knew better, because the Django core committers don’t do anything brashly. They took their time, let all dust and emotions settle down, and listened to the discussion.

And then Jacob Kaplan-Moss issued an official announcement. I particularly agreed with one of the top comments on hackernews:

"I don't know enough about the specifics of the case to
assess the new policy on the merits, but I will say that
as a piece of writing this is a model of how to change
policy gracefully: clear, forthright, and non-defensive."

Clear. Forthright. Non-defensive.

Some people saw this as an opening to change even more about Django, and even when some of the discussion was phrased rudely or accusatory, the Django team’s responses were thoughtful and calm. Now, over a week later (again demonstrating that things aren’t done brashly), we see that a half-dozen new people have been added to the core team. Yes, that includes Alex Gaynor. So not only was the new policy clear, forthright, and non-defensive, but there was concrete follow-through so that everyone can see that it wasn’t just words.

And even in that short time since the new committers were added, Django’s code is already reaping the benefits–we’ve seen a flurry of bugs fixed and tickets closed. Django’s future has never looked so promising!

Why do I think this is so important? Because a community which welcomes constructive criticism is one in which people feel they can make a difference. And it’s because they can make a difference. It’s a community that’s never satisfied with the status quo. It’s a community that can grow and change and adapt, when the world around it changes. It’s a community that we can be proud of.

NB: I was not the only one to criticize Django at DjangoCon. I was merely one voice. This is just written from my perspective.

Why node.js Disappoints Me

Node.js is currently at the center of a huge cyclone of hype. At this point it’s clear that Node is going to be a major player in the next few years of web development. It’s no wonder, either! It represents a fresh start, one with no legacy synchronous baggage in our increasingly asynchronous, increasingly real-time web. And it’s accessible to anyone who’s written JavaScript (read: all web developers.)

Oh, and those benchmarks! Compared trivially to the currently popular web technologies, it seems an obvious leap forward.

Yet one of Node’s advantages that you’ll hear repeated again and again by its proponents is that you can now code in One Language, and you won’t have to deal with the cognitive load of context switching between different languages. Especially as ORMs and NoSQL continue to rise in popularity, there’s no need to even deal with SQL. At the end of the day you’re writing JavaScript, HTML, and CSS, and that’s it.

Seeing this happen with all of these pieces falling into place–a fresh start with a unified language–I started to get excited. This would change the way we thought about our web code. The frontend is the backend is the query language is the storage layer is JavaScript! It’s a revolution.

And then I saw what people did with this opportunity. They effectively ported Sinatra/Django/Rails to JavaScript–and did it in such a way that it would only run on the server, with a specific feature set of JavaScript that only Node can reasonably understand.

Not exactly the revolution I was hoping for.

Instead of coding in one language, we’re actually coding in two. One is the subset JavaScript that can be run in all browsers, and another is the set of JavaScript that can be run by Node. Knowing the difference between the two languages and context switching between them is simply a required skill.

You know what would be awesome? If we wrote our libraries so that they could run either on the server or on the client, and they did so in a transparent way. Maybe it would help to give a concrete example of how this could be awesome. Let’s talk about HTML templating.

Imagine a framework where the first page-load was always rendered server-side, meaning the client gets a single fully-rendered page. Then for desktop browsers, browsing around the site just made calls to API endpoints returning JSON or XML, and the client rendered the templates for the changed portions of the page. For mobile browsers with less power or for search engines, the rendering would always be done on the server. Imagine that the templating library could record some key metrics to determine how long things were taking to render, and dynamically switch between rendering on the server and client based on server load or client speed.

Imagine a case where a back-end service fails temporarily. In this case the rendering of that particular component could be deferred, the browser could be told to poll a resource. When the back-end service is recovered, it could send the data for the client to render on its own.

How awesome would that be?

It’s not just HTML templating, either. This same principle could be applied to any number of things: URL routing, form validation, hell even most application logic could be done using this style.

But it’s going to take discipline. Instead of reaching for those fancy V8 features, code will need to be written in a strict subset of the JavaScript that’s available. Maybe Node could detect incompatible code and throw warnings, that would be cool.

I just really hope that someday we stop re-inventing the same exact wheel, and instead build something substantially different and better.

We Need a New Word for “Open”

When I say I’m developing an "open" API, what does that mean to you?

To Facebook, it means something where you can get at some of your own data, as long as you abide by their restrictions on what you can do with it.

To Twitter, it means something where you get back what you put in.

To Flickr, it means something where you can get every single piece of data that you’ve ever put in the system, enriched by any value that Flickr can provide.

To Google, it means something where you can get back what you put in, you can host your own version of the API on your own server, and your server has exactly the same status as any Google server has, and that Google’s servers and your server will interact and work as a system.

We need new words. Words that have more specific meaning–which encompass the broad idea of "open", yet remain accessible and recognizable by everyone. Federated is a good word, but it’s possible to have a closed, federated system.

What words should we use?

How Do We Kick Our Synchronous Addiction?

Asynchronous programming is superior both in memory usage and in overall throughput when compared to synchronous programming . We’ve known this fact for years. If we look at Django or Ruby on Rails, arguably the two most promising new web application frameworks to emerge in the past few years, both of them are written in such a way that synchronous programming is assumed. Why is it that even in 2010 we’re still writing programs that rely on synchronous programming ?

The reason that we’re stuck on synchronous programming is twofold. Firstly, the programming model required for straightforward asynchronous implementations is inconvenient. Secondly, popular and/or mainstream languages lack the built-in language constructs that are needed to implement a less-straightforward approach to asynchronous programming.

Asynchronous programming is too hard

Let’s first examine the straightforward implementation: an event loop. In this programming model, we have a single process with a single loop that runs continuously. Functionality is achieved by writing functions to execute small tasks quickly, and inserting those functions into that event loop. One of those functions might read some bytes from a socket, while another function might write a few bytes to a file, and yet another function might do something computational like calculating an XOR on the data that’s been buffered from that first socket.

The most important part about this event loop is that only one thing is ever happening at a time. That means that you really have to break your logic up into small chunks that can be performed incrementally. If any one of our functions blocks, it hogs the event loop and nothing else can execute during that time.

We have some really great frameworks geared towards making this event loop model easier to work with. In Python, there’s Twisted and, more recently, Tornado. In Ruby there’s EventMachine. In PERL there’s POE. What these frameworks do is twofold: provide constructs for more easily working with an event loop (e.g. Deferreds or Promises), and provide asynchronous implementations of common tasks (e.g. HTTP clients and DNS resolution).

But these frameworks stop very short of making asynchronous programming easy for two reasons. The first reason is that we really do have to completely change our coding style. Consider what it would take to render a simple blog web page with comments. Here’s some JavaScript code to demonstrate how this might work in a synchronous framework:

1
2
3
4
5
6
7
8
9
function handleBlogPostRequest(request, response, postSlug) {
    var db = new DBClient();
    var post = db.getBlogPost(postSlug);
    var comments = db.getComments(post.id);
    var html = template.render('blog/post.html',
        {'post': post, 'comments': comments});
    response.write(html);
    response.close();
}

Now here’s some JavaScript code to demonstrate how this might look in an asynchronous framework. Note several things here: We’ve specifically written this in such a way that it doesn’t become nested four levels deep. We’ve also written these callback functions inside of the handleBlogPostRequest function to take advantage of closure so as to retain access to the request and response objects, the template context, and the database client. Both the desire to avoid nesting and the closure are things that we need to think about as we write this code, that were not even considerations in the synchronous version:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
function handleBlogPostRequest(request, response, postSlug) {
    var context = {};
    var db = new DBClient();
    function pageRendered(html) {
        response.write(html);
        response.close();
    }
    function gotComments(comments) {
        context['comments'] = comments;
        template.render('blog/post.html', context).addCallback(pageRendered);
    }
    function gotBlogPost(post) {
        context['post'] = post;
        db.getComments(post.id).addCallback(gotComments);
    }
    db.getBlogPost(postSlug).addCallback(gotBlogPost);
}

I’ve chosen JavaScript here to prove a point, by the way. People are very excited about node.js right now, and it’s a very cool framework, but it doesn’t hide all of the complexities involved in doing things asynchronously. It only hides some of the implementation details of the event loop.

The second reason why these frameworks fall short is because not all IO can be handled properly by a framework, and in these cases we have to resort to bad hacks. For example, MySQL does not offer an asynchronous database driver, so most of the major frameworks end up using threads to ensure that this communication happens out of band.

Given the inconvenient API, the added complexity, and the simple fact that most developers haven’t switched to using this style of programming, leads us to the conclusion that this type of framework is not a desirable final solution to the problem (even though I do concede that you can get Real Work done today using these techniques, and many people do). That being the case, what other options do we have for asynchronous programming? Coroutines and lightweight processes, which brings us to our next major problem.

Languages don’t support easier asynchronous paradigms

There are a few language constructs that, if implemented properly in modern programming languages, could pave the way for alternative methods of doing asynchronous programming that don’t have the drawbacks of the event loop. These constructs are coroutines and lightweight processes.

A coroutine is a function that can suspend and resume its execution at certain, programmatically specified, locations. This simple concept can serve to transform blocking-looking code to be non-blocking. At certain critical points in your IO library code, the low-level functions that are doing IO can choose to "cooperate". That is, it can choose to suspend execution in order for another function to resume execution and continue on.

Here’s an example (it’s Python, but fairly understandable for all I hope):

1
2
3
def download_pages():
    google = urlopen('http://www.google.com/').read()
    yahoo = urlopen('http://www.yahoo.com/').read()

Normally the way this would work is that a socket would be opened, connected to Google, an HTTP request sent, and the full response would be read, buffered, and assigned to the google variable, and then in turn the same series of steps would be taken for the yahoo variable.

Ok, now imagine that the underlying socket implementation were built using coroutines that cooperated with each other. This time, just like before, the socket would be opened and a connection would be made to Google, and then a request would be fired off. This time, however, after sending the request, the socket implementation suspends its own execution.

Having suspended its execution (but not yet having returned a value), execution continues on to the next line. The same thing happens on the Yahoo line: once its request has been fired off, the Yahoo line suspends its execution. But now there’s something else to cooperate with–there’s actually some data ready to be read on the Google socket–so it resumes execution at that point. It reads some data from the Gooogle socket, and then suspends its execution again.

It jumps back and forth between the two coroutines until one has finished. Let’s say that the Yahoo socket has finished, but the Google one has not. In this case, the Google socket just continues to read from its socket until it has completed, because there are no other coroutines to cooperate with. Once the Google socket is finally finished, the function returns with all of the buffered data.

Then the Yahoo line returns with all of its buffered data.

We’ve preserved the style of our blocking code, but we’ve used asynchronous programming to do it. Best of all, we’ve preserved our original program flow–the google variable is assigned first, and then the yahoo variable is assigned. In truth, we’ve got a smart event loop going on underneath the covers to control who gets to execute, but it’s hidden from us due to the fact that coroutines are in play.

Languages like PHP, Python, Ruby, and Perl simply don’t have built-in coroutines that are robust enough to implement this kind of behind-the-scenes transformation. So what about these lightweight processes?

Lightweight processes are what Erlang uses as its main concurrency primitive. Essentially these are processes that are mostly implemented in the Erlang VM itself. Each process has approximately 300 words of overhead and its execution is scheduled primarily by the Erlang VM, sharing no state at all amongst processes. Essentially, we don’t have to think twice about spawning a process, as it’s essentially free. The catch is that these processes can only communicate via message passing.

Implementing these lightweight processes at the VM level gets rid of the memory overhead, the context switching, and the relative sluggishness of interprocess communication provided by the operating system. Since the VM also has insight into the memory stack of each process, it can freely move or resize those processes and their stacks. That’s something that the OS simply cannot do.

With this model of lightweight processes, it’s possible to again revert back to the convenient model of using a separate process for all of our asynchronous programming needs. The question becomes this: can this notion of lightweight processes be implemented in languages other than Erlang? The answer to that is "I don’t know." To my knowledge, Erlang takes advantage of some features of the language itself (such as having no mutable data structures) in its lightweight process implementation.

Where do we go from here?

The key to moving forward is to drop the notion that developers need to learn to think about all of their code in terms of callbacks and asynchrony, as the asynchronous event loop frameworks require them to do. Over the past ten years, we can see that most developers, when faced with that decision, simply choose to ignore it. They continue to use the inferior blocking methodologies of yesteryear.

We need to look at these alternative implementations like coroutines and lightweight processes, so that we can make asynchronous programming as easy as synchronous programming. Only then will we be able to kick this synchronous addiction.

You Care About Facebook, You Just Might Not Know It Yet

I recently had a conversation with a few friends that I know through the Python community. These are people who I respect a great deal, and look to for advice and insight when it comes to web development. During the course of this conversation, the subject of Facebook came up. What I wasn’t expecting at all was that a large number of them said that they have no interest in Facebook, and quite a few of them proudly didn’t even have accounts.

To put it bluntly, I believe that ignorance of Facebook is a major handicap today, both for developers and for entrepreneurs, and people who are not paying attention to it are burying their head in the sand.

Facebook has gotten to a point where it’s the destination where the most time is spent online. It gets an estimated 260 billion (that’s a B on that number) pageviews per month. In a single week, new properties are able to garner 8.6 million new active users. Sites which implement Facebook Connect typically see a 1.3x-2x boost in registrations. Companies building on Facebook’s platform are being sold for 400 million dollars and are making 100 million dollars yearly. There are over 300 million active registered users. In other words, if you talk to a person that spends any time on the internet at all, they likely have a Facebook account. These numbers, by the way, are trending up and to the right–.

Take the word Facebook out of the above paragraph, and replace it with icanhazcheeseburger. Replace it with 4chan, or somethingawful, or barbie, or anything. No matter what you replace as the company name, it doesn’t change the fact that those numbers are compelling enough to be irresponsible to ignore.

Let me be perfectly clear: I don’t particularly enjoy using Facebook. I find its UI cluttered, its privacy controls confusing, and its content fairly trivial. From the development side, Facebook’s APIs are clunky at best. I’m definitely not advocating that you should log in every day and love it. I don’t. The mass populous, however, does. I’m simply advocating that you should at least have an account, log in every once in a while, and keep tabs on the announcements that they make for developers.

As people spend more and more time consuming and producing content inside of the Facebook ecosystem, it’s going to be those who change and embrace it who succeed, and those who fail to adapt that stand to lose. Note that I said producing content "inside of the Facebook ecosystem", and not "inside Facebook", as Facebook has been making very strong pushes to extend the reach of its platform well outside of its destination site. You can augment your site to have people pushing content into Facebook from your own destination sites.

Facebook is willing to give you access to a person’s information, to their entire social graph, and to allow your user to become an advocate of your site, all while making registration on your site as simple as one click. Sure, you may evaluate all of that as an option and decide that it’s not important enough to implement, but surely it’s an important enough option to warrant your cognizance.

Do I think that Facebook will be relevant in 10 years? Probably. I’m not willing to bet much on it. One thing you can be absolutely sure about though, is that if another service with this much power comes along, I’ll have an account and be acutely aware of its developer resources.

My Thoughts on NoSQL

Over the past few years, relational databases have fallen out of favor for a number of influential people in our industry. I’d like to weigh in on that, but before doing so, I’d like to give my executive summary of the events leading up to this movement:

In the late nineties and early thousands, websites were mostly read-only–a publisher would create some content and users would consume that content. The data access patterns for these types of applications became very well-understood, and as a result many tools were created and much research and development was done to further develop these technologies.

As the web has grown more social, however, more and more it’s the people themselves who have become the publishers. And with that fundamental shift away from read-heavy architectures to read/write and write-heavy architectures, a lot of the way that we think about storing and retrieving data needed to change.

Most people have done this by relying less on the features provided by traditional relational databases and engineering more database logic in their application code. Essentially, they stop using relational databases the way they were intended to be used, and they instead use them as dumb data stores.

Other people have engineered new database systems from the ground up, each with a different set of tradeoffs and differences from their relational database brethren. It’s these new databases that have some in our industry excited, and it’s these databases that I’m going to focus on primarily in this post.

(By the way, there’s a whole lot more theory behind the movement away from SQL. Primarily of interest is the CAP theorem and the Dynamo paper. Both of these illustrate the necessary tradeoffs of between different approaches to designing databases.)

Let’s get this out of the way

I love SQL. More than even that, I love my precious ORM and being able to query for whatever information I want whenever I want it. For the vast majority of sites out there (we’re talking 99.9% of the sites out there, folks) it suits their needs very well, providing a good balance of ease of use and performance.

There’s no reason for them to switch away from SQL, and there’s no way they will. If there’s one thing I don’t like about this whole NoSQL movement, it’s the presumption that everyone who’s interested in alternative databases hates the status quo. That’s simply not true.

But we’re not talking about most sites out there, we’re not talking about the status quo, we’re talking about the few applications that need something totally different.

Tokyo Cabinet / Tokyo Tyrant

Tokyo Cabinet (and its network interface, Tokyo Tyrant) is the logical successor to Berkeley DB–a blazing fast, open-source, embeddable key-value store that does just about what you would expect from its description. It supports 3 modes of operation: hashtable mode, b-tree mode, and table mode.

(Table mode still pretty much sucks, and I’m not convinced it’s a good idea for the project since it’s added bloat and other systems like RDBMs are probably better for storing tabular data, so I’m going to skip it.)

Essentially, the API into Tokyo Cabinet is that of a gigantic associative array. You give it a key and a value, and then later, given a key, it will give you back the value you put in. Its largest assets are that it’s fast and straightforward.

If your problem is such that you have a small to medium-sized amount of data, which needs to be updated rapidly, and can be easily modeled in terms of keys and values (almost all scenarios can be rewritten in terms of keys and values, but some problems are easier to convert than others), then Tokyo Cabinet and Tokyo Tyrant are the way to go.

CouchDB

CouchDB is similar to Tokyo Cabinet in that it essentially maps keys to data, but CouchDB’s philosophy is completely different. Instead of arbitrary data, its data has structure–it’s a JSON object. Instead of only being able to query by keys, you can upload functions that index your data for you and then you can call those functions. All of this is done over a very simple REST interface.

But none of this really matters. None of these really set CouchDB apart, because you could just encode JSON data and store it in Tokyo Cabinet, you can maintain your own indexes of data fairly easily, and you can build a simple REST API in a matter of days, if not hours.

What really sets CouchDB apart from the pack is it’s innovative replication strategy. It was written in such a way that nodes which are disconnected for long periods of time can reconnect, sync with each other, and reconcile their differences in a way that no other database (since Lotus Notes?) could do.

It’s functionality that allows for interesting and new distributed types of applications and data that I think could possibly change the way we take our applications offline. I imagine that some day every computer will come with CouchDB pre-installed and it’ll be a data store that we use without even knowing that we’re using it.

However, I wouldn’t choose it for a super high scalability site with lots of data and sharding and replication and high availability and all those buzzwords, because I’m not convinced it’s the right tool for that job, but I am convinced that its replication strategy will keep it relevant for years to come.

Redis

Wow, looking at the bullet points this database seems to do just about everything, perfectly! Yeah, it’s a bit prone to hyperbole and there are some great things about it, but a lot of it is hot air. For example, it claims to support sharding but really all it does is have the client run a hash function on its key and use that to determine which server to send its value to. This is something that any database can do.

When you get down to it, Redis is a key-value store which provides a richer API than something like Tokyo Cabinet. It does more operations in memory, only periodically flushing to disk, so there’s more of a risk that you could lose data on a crash. The tradeoff is that it’s extremely fast, and it does some neat things like allow you to append a value to the end of a list of items already stored for a given key.

It also has atomic operations. This is honestly the only reason I find this project interesting, because the atomic operation support that it has means that it can be turned into a best-of-breed tally server. If you are building a server to keep real-time counts of various things, you would be remiss to overlook Redis as a very viable option.

Cassandra

It’s good to save the best for last, and that’s exactly what I’ve done as I find Cassandra to be easily the most interesting non-relational database out there today. Originally developed by Facebook, it was developed by some of the key engineers behind Amazon’s famous Dynamo database.

Cassandra can be thought of as a huge 4-or-5-level associative array, where each dimension of the array gets a free index based on the keys in that level. The real power comes from that optional 5th level in the associative array, which can turn a simple key-value architecture into an architecture where you can now deal with sorted lists, based on an index of your own specification. That 5th level is called a SuperColumn, and it’s one of the reasons that Cassandra stands out from the crowd.

Cassandra has no single points of failure, and can scale from one machine to several thousands of machines clustered in different data centers. It has no central master, so any data can be written to any of the nodes in the cluster, and can be read likewise from any other node in the cluster.

It provides knobs that can be tweaked to slide the scale between consistency and availability, depending on your particular application and problem domain. And it provides a high availability guarantee, that if one node goes down, another node will step in to replace it smoothly.

Writing about all the features of Cassandra is a whole different post, but I am convinced that its data model is rich enough to support a wide variety of applications while providing the kind of extreme scalability and high availability features that few other databases can achieve–all while maintaining a lower latency than other solutions out there.

Conclusion

There are many other non-relational databases out there: HBase and Hypertable, which are replicating Google’s BigTable despite its complexity and problems with single points of failure. MongoDB is another database that has been getting some traction, but it seems to be a jack of all trades, master of none. In short, the above databases are the ones that I find interesting right now, and I would use each of them for different use cases.

What do you all think about this whole non-relational database thing? Do you agree with my thoughts or do you think I’m full of it?

Flojax: A Unobtrusive and Easy Strategy for Creating AJAX-style Web Applications

Writing AJAX-style web applications can be very tedious. If you’re using XML as your transport layer, you have to parse the XML before you can work with it. It’s a bit easier if you’re using JSON, but once you have parsed the data, the data still needs to be turned into HTML markup that matches the current markup on the page. Finally, the newly created markup needs to be inserted into the correct place in the DOM, and any event handlers need to be attached to the appropriate newly-inserted markup.

So there’s the parsing, the markup assembly, the DOM insertion, and finally the event handler attachment. Most of the time, people tend to write custom code for each element that needs asynchronous updating. There are several drawbacks with this scenario, but the most frustrating part is probably that the presentation logic is implemented twice–once in a templating language on the server which is designed specifically for outputting markup, and again on the client with inline Javascript. This leads to problems both in the agility and in the maintainability of this type of application.

With flojax, this can all be accomplished with one generalized implementation. The same server-side logic that generates the data for the first synchronous request can be used to respond to subsequent asynchronous requests, and unobtrusive attributes specify what to do for the rest.

The Basics

The first component for creating an application using the flojax strategy is to break up the content that you would like to reload asynchronously into smaller fragments. As a basic example of this, let’s examine the case where there is a panel of buttons that you would like to turn into asynchronous requests instead of full page reloads.

The rendered markup for a fragment of buttons could look something like this:

1
2
3
4
5
<div class="buttons">
    <a href="/vote/up/item1/">Vote up</a>
    <a href="/vote/down/item1/">Vote down</a>
    <a href="/favorite/item1/">Add to your favorites</a>
</div>

In a templating language, the logic might look something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
<div class="buttons">
    {% if voted %}
        <a href="/vote/clear/{{ item.id }}/">Clear your vote</a>
    {% else %}
        <a href="/vote/up/{{ item.id }}/">Vote up</a>
        <a href="/vote/down/{{ item.id }}/">Vote down</a>
    {% endif %}
    {% if favorited %}
        <a href="/favorite/{{ item.id }}/">Add to your favorites</a>
    {% else %}
        <a href="/unfavorite/{{ item.id }}/">Remove from your favorites</a>
    {% endif %}
</div>

(Typically you wouldn’t use anchors to do operations that can change state on the server, so you can imagine this would be accomplished using forms. However, for demonstration and clarity purposes I’m going to leave these as links.)

Now that we have written a fragment, we can start using it in our larger templates by way of an include, which might look something like this:

1
2
3
4
...
<p>If you like this item, consider favoriting or voting on it:</p>
{% include "fragments/buttons.html" %}
...

To change this from being standard links to being asynchronously updated, we just need to annotate a small amount of data onto the relevant links in the fragment.

1
2
3
4
5
6
7
8
9
10
11
12
13
<div class="buttons">
    {% if voted %}
        <a href="/vote/clear/{{ item.id }}/" class="flojax" rel="buttons">Clear your vote</a>
    {% else %}
        <a href="/vote/up/{{ item.id }}/" class="flojax" rel="buttons">Vote up</a>
        <a href="/vote/down/{{ item.id }}/" class="flojax" rel="buttons">Vote down</a>
    {% endif %}
    {% if favorited %}
        <a href="/favorite/{{ item.id }}/" class="flojax" rel="buttons">Add to your favorites</a>
    {% else %}
        <a href="/unfavorite/{{ item.id }}/" class="flojax" rel="buttons">Remove from your favorites</a>
    {% endif %}
</div>

That’s it! At this point, all of the click events that happen on these links will be changed into POST requests, and the response from the server will be inserted into the DOM in place of this div with the class of "buttons". If you didn’t catch it, all that was done was to add the "flojax" class onto each of the links, and add a rel attribute that refers to the class of the parent node in the DOM to be replaced–in this case, "buttons".

Of course, there needs to be a server side component to this strategy, so that instead of rendering the whole page, the server just renders the fragment. Most modern Javascript frameworks add a header to the request to let the server know that the request was made asynchronously from Javascript. Here’s how the code on the server to handle the flojax-style request might look (in a kind of non-web-framework-specific Python code):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def vote(request, direction, item_id):
    item = get_item(item_id)

    if direction == 'clear':
        clear_vote(request.user, item)
    elif direction == 'up':
        vote_up(request.user, item)
    elif direction == 'down':
        vote_down(request.user, item)

    context = {'voted': direction != 'clear', 'item': item}

    if request.is_ajax():
        return render_to_response('fragments/buttons.html', context)

    # ... the non-ajax implementation details go here

    return render_to_response('items/item_detail.html', context)

There are several advantages to writing your request handlers in this way. First, note that we were able to totally reuse the same templating logic from before–we just render out the fragment instead of including it in a larger template. Second, we have provided a graceful degradation path where users without javascript are able to interact with the site as well, albeit with a worse user experience.

That’s really all there is to writing web applications using the flojax strategy.

Implementation Details

I don’t believe that the Javascript code for this method can be easily reused, because each web application tends to have a different way of showing errors and other such things to the user. In this post, I’m going to provide a reference implementation (using jQuery) that can be used as a starting point for writing your own versions. The bulk of the work is done in a function that is called on every page load, called flojax_init.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
function flojax_clicked() {
    var link = $(this);
    var parent = link.parents('.' + link.attr('rel'));

    function successCallback(data, textStatus) {
        parent.replaceWith(data);
        flojax_init();
    }
    function errorCallback(request, textStatus, errorThrown) {
        alert('There was an error in performing the requested operation');
    }

    $.ajax({
        'url': link.attr('href'),
        'type': 'POST',
        'data': '',
        'success': successCallback,
        'error': errorCallback
    });

    return false;
}

function flojax_init() {
    $('a.flojax').live('click', flojax_clicked);
}

There’s really not a lot of code there. It POSTS to the given URL and replaces the specified parent class with the content of the response, and then re-initializes the flojax handler. The re-initialization could even be done in a smarter way, as well, by targeting only the newly inserted content. Also, you might imagine that an alert message probably wouldn’t be such a great user experience, so you could integrate error messages into some sort of Javascript messaging or growl-style system.

Extending Flojax

Often times you’ll want to do other things on the page when the asynchronous request happens. For our example, maybe there is some kind of vote counter that needs to be updated or some other messages that need to be displayed.

In these cases, I have found that using hidden input elements in the fragments can be useful for transferring that information from the server to the client. As long as the value in the hidden elements adheres to some predefined structure that your client knows about (it could even be something like JSON if you need to go that route).

If what you want can’t be done by extending the fragments in this way, then flojax isn’t the right strategy for that particular feature.

Limitations

This technique cannot solve all of the world’s problems. It can’t even solve all of the problems involved in writing an AJAX-style web application. It can, however, handle a fair amount of simple cases where all you want to do is quickly set up a way for a user’s action to replace content on a page.

Some specific examples of things that flojax can’t help with are if a user action can possibly update many items on a page, or if something needs to happen without a user clicking on a link. In these situations, you are better off coding a custom solution instead of trying to shoehorn it into the flojax workflow.

Conclusion

Writing AJAX-style web applications is usually tedious, but using the techniques that I’ve described, a large majority of the tedious work can be reduced. By using the same template code for rendering the page initially as with subsequent asynchronous requests, you ensure that code is not duplicated. By rendering HTML fragments, the client doesn’t have to go through the effort of parsing the output and converting the result into correct DOM objects. Finally, by using a few unobtrusive conventions (like the rel attribute and the flojax class), the Javascript code that a web application developer writes is able to be reused again and again.

I don’t believe that any of the details that I’m describing are new. In fact, people have been doing most of these things for years. What I think may in fact be new is the generalization of the sum of these techniques in this way. It’s still very much a work in progress, though. As I use flojax more and more, I hope to find not only places where it can be extended to cover more use cases, but also its limitations and places where it makes more sense to use another approach.

What do you think about this technique? Are you using any techniques like this for your web applications? If so, how do they differ from what I’ve described?

Tagging Cache Keys for O(1) Batch Invalidation

Recently I’ve been spending some quality time trying to decrease page load times and decrease the number of database accesses on a site I’m working on. As you would probably suspect, that means dealing with caching. One common thing that I need to do, however, is invalidate a large group of cache keys when some action takes place. I’ve devised a pattern for doing this, and while I’m sure it’s not novel, I haven’t seen any recent write-ups of this technique. The base idea is that we’re going to add another thin cache layer, and use the value from that first layer in the key to the second layer.

First, let me give a concrete example of the problem that I’m trying to solve. I’m going to use Django/Python from here on in, but you could substitute anything else, as this pattern should work across other frameworks and even other languages.

1
2
3
4
5
6
7
8
9
10
import datetime
from django.db import models

class Favorite(models.Model):
    user = models.ForeignKey(User)
    item = models.ForeignKey(Item)
    date_added = models.DateTimeField(default=datetime.datetime.now)

    def __unicode__(self):
        return u'%s has favorited %s' % (self.user, self.item)

Given this model, now let’s say that we have a function that gets the Favorite instances for a given user, which might look like this:

1
2
3
def get_favorites(user, start=None, end=None):
    faves = Favorite.objects.filter(user=user)
    return list(faves[start:end])

There’s not much here yet–we’re simply filtering to only include the Favorite instances for the given user, slicing it based on the given start and end numbers, and forcing evaluation before returning a list. Now let’s start thinking about how we will cache this. We’ll start by just implementing a naive cache strategy, which in this case simply means that the cache is never invalidated:

1
2
3
4
5
6
7
8
9
10
from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key = 'get_favorites-%s-%s-%s' % (user.id, start, end)
    faves = cache.get(key)
    if faves is not None:
        return faves
    faves = Favorite.objects.filter(user=user)[start:end]
    cache.set(key, list(faves), 86400 * 7)
    return faves

Now we come to the hard part: how do we invalidate those cache keys? It’s especially tricky because we don’t know exactly what keys have been created. What combinations of start/end have been given? We could invalidate all combinations of start/end up to some number, but that’s horribly inefficient and wasteful. So what do we do? My solution is to introduce another layer. Let me explain with code:

1
2
3
4
5
6
7
8
9
10
11
12
import uuid
from django.core.cache import cache

def favorite_list_hash(user):
    key = 'favorite-list-hash-%s' % (user.id,)
    cached_key_hash = cache.get(key)
    if cached_key_hash:
        key_hash = cached_key_hash
    else:
        key_hash = str(uuid.uuid4())
        cache.set(key, key_hash, 86400 * 7)
    return (key_hash, not cached_key_hash)

Essentially what this gives us is a temporary unique identifier for each user, that’s either stored in cache or generated and stuffed into the cache. How does this help? We can use this identifier in the keys to the get_favorites function:

1
2
3
4
5
6
7
8
9
10
11
12
from django.core.cache import cache

def get_favorites(user, start=None, end=None):
    key_hash, created = favorite_list_hash(user)
    key = 'get_favorites-%s-%s-%s-%s' % (user.id, start, end, key_hash)
    if not created:
        faves = cache.get(key)
        if faves is not None:
            return faves
    faves = Favorite.objects.filter(user=user)[start:end]
    cache.set(key, list(faves), 86400 * 7)
    return faves

As you can see, the first thing we do is grab that hash for the user, then we use it as the last part of the key for the function. The whole if not created thing is just an optimization that helps to avoid cache fetches when we know they will fail. Here’s the great thing now: invalidating all of the different cached versions of get_favorite for a given user is a single function call:

1
2
3
4
from django.core.cache import cache

def clear_favorite_cache(user):
    cache.delete('favorite-list-hash-%s' % (user.id,))

By deleting that single key, the next time get_favorites is called, it will call favorite_list_hash which will result in a cache miss, which will mean it will generate a new unique identifier and stuff it in cache, meaning that all of the keys for get_favorites are instantly different. I think that this is a powerful pattern that allows for coarser-grained caching without really sacrificing much of anything.

There is one aspect of this technique that some people will not like: it leaves old cache keys around taking up memory. I don’t consider this a problem because memory is cheap these days and Memcached is generally smart about evicting the least recently used data.

I’m interested though, since I don’t see people posting much about nontrivial cache key generation and invalidation. How are you doing this type of thing? Are most people just doing naive caching and calling that good enough?