Lately I've really fallen in love with writing utilities whose interface is simply HTTP. Making a utility accessible via HTTP makes it really easy to write clients that talk to it and, if the need arises, lots of tools already exist for doing things with HTTP, like load balancing and caching.
While it would be easy to use a framework to build these utilities, lately I've been choosing not to do so. Web frameworks like Django and Pylons are great when you need to build a fully-featured web application that will be accessible by people. When it will only be computers talking to the service, however, a lot of the machinery provided by frameworks is unneeded and will only slow your utility down. Instead of using a framework, we're going to write a pure WSGI application.
An Example: Music Discovery Website
This has all been very abstract, so let's take an example: Suppose you run a music discovery website that lets you play songs online. Next to each song, you simply want to display how many times the song has been played.
One solution to that problem could be to have a play_count column on the table where the song metadata is stored. Every time someone plays the song, you could issue an UPDATE on the row and increase the play_count by one. This solution will work while your site is small, but as more and more people begin using the application, the number of writes to your database is going to kill its performance.
A much more robust and scalable solution is to append a new line to a text log file every time a song is played, and have a process run regularly to scoop up all of the log files and update those play_count fields in the database.
However, even if you have that regular process run once every hour, there's still too great a lag time between when a user takes an action and when they see the results of that action. This is where our WSGI utility comes into play. It can serve as a realtime play counter to count the plays in between the time when the logs are analyzed and the play_count columns updated.
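To make the log-based approach concrete, here is a minimal sketch in Python 3. The log format (one song ID per line) and both function names are assumptions made up for illustration; the real logs could look quite different. It also assumes the realtime counter only holds plays that have not yet been folded into the database.

```python
from collections import Counter

def aggregate_play_log(lines):
    """Count plays per song from log lines (assumed: one song ID per line)."""
    counts = Counter()
    for line in lines:
        song_id = line.strip()
        if song_id:  # skip blank lines
            counts[song_id] += 1
    return counts

def displayed_play_count(db_count, realtime_count):
    """Plays already stored in the database plus plays counted since then."""
    return db_count + realtime_count

# A tiny in-memory stand-in for a log file.
log_lines = ["17\n", "42\n", "17\n", "\n", "17\n"]
batch = aggregate_play_log(log_lines)
print(batch["17"], batch["42"])     # 3 1
print(displayed_play_count(10, 2))  # 12
```

The periodic job would add each value in batch to the matching play_count column, and the page would show the stored value plus whatever the realtime counter reports.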
Song Play Counter
We can design the interface for our WSGI song play counter utility any way that we like, but I'm going to try to keep it as RESTful as I can. The interface will look like this:
- GET /song/SONGID will return the current play count of the given song
- POST /song/SONGID will increment the play count of the given song by one, and return its new value
- GET / will return a mapping of all songs registered to their respective play counts
- DELETE / will clear the whole mapping
So let's get started. First, I always like to start with a very basic skeleton:
def application(environ, start_response):
    start_response('200 OK', [('content-type', 'text/plain')])
    return ('Hello world!',)
This does what you would imagine: it returns Hello world! to each and every request that it receives. That's not very useful, so let's make it more interesting:
from collections import defaultdict

# Maps a song ID (the string taken from the URL) to its play count
counts = defaultdict(int)

def application(environ, start_response):
    global counts
    path = environ['PATH_INFO']
    method = environ['REQUEST_METHOD']
    if path.startswith('/song/'):
        song_id = path[6:]  # everything after '/song/'
        if method == 'GET':
            start_response('200 OK', [('content-type', 'text/plain')])
            return (str(counts[song_id]),)
        elif method == 'POST':
            counts[song_id] += 1
            start_response('200 OK', [('content-type', 'text/plain')])
            return (str(counts[song_id]),)
        else:
            start_response('405 METHOD NOT ALLOWED', [('content-type', 'text/plain')])
            return ('Method Not Allowed',)
    # Any other path is unknown to this utility
    start_response('404 NOT FOUND', [('content-type', 'text/plain')])
    return ('Not Found',)
We've now added the data structure that we're using to keep track of the counts, which in this case is a defaultdict(int). We're also now looking at the request path and method. If it's a GET starting with /song/, we look up the count and return it, and if it's a POST starting with /song/, we increment it by one before returning it. We're also doing the proper thing if we detect a method that's not allowed: we're returning HTTP error code 405.
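Because a WSGI application is just a callable, you can poke at it without starting a server at all. The sketch below is a Python 3 rendition of the counter (so response bodies are bytes, unlike the Python 2 code in this post), and call_app is a hypothetical helper that builds a bare-bones environ by hand. A real server passes many more environ keys; this only supplies the two the app reads.

```python
from collections import defaultdict

counts = defaultdict(int)

def application(environ, start_response):
    """Python 3 rendition of the song counter: response bodies are bytes."""
    path = environ['PATH_INFO']
    method = environ['REQUEST_METHOD']
    headers = [('Content-Type', 'text/plain')]
    if path.startswith('/song/'):
        song_id = path[len('/song/'):]
        if method == 'POST':
            counts[song_id] += 1
        elif method != 'GET':
            start_response('405 Method Not Allowed', headers)
            return [b'Method Not Allowed']
        start_response('200 OK', headers)
        return [str(counts[song_id]).encode('ascii')]
    start_response('404 Not Found', headers)
    return [b'Not Found']

def call_app(method, path):
    """Hypothetical helper: invoke the app directly, return (status, body)."""
    captured = {}

    def start_response(status, response_headers):
        captured['status'] = status

    environ = {'REQUEST_METHOD': method, 'PATH_INFO': path}
    body = b''.join(application(environ, start_response))
    return captured['status'], body.decode('ascii')

print(call_app('GET', '/song/1'))   # ('200 OK', '0')
print(call_app('POST', '/song/1'))  # ('200 OK', '1')
print(call_app('PUT', '/song/1'))   # ('405 Method Not Allowed', 'Method Not Allowed')
print(call_app('GET', '/nope'))     # ('404 Not Found', 'Not Found')
```

This kind of in-process call is handy for quick checks of the routing logic before any server is involved.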
Now let's add the final bit of functionality:
from collections import defaultdict

counts = defaultdict(int)

def application(environ, start_response):
    # ... start of app (including the `global counts` declaration)
    if path.startswith('/song/'):
        # ... song-specific logic
    elif path == '/':
        if method == 'GET':
            # Serialize every count as comma-separated songid=count pairs
            res = ','.join(['%s=%s' % (k, v) for k, v in counts.iteritems()])
            start_response('200 OK', [('content-type', 'text/plain')])
            return (res,)
        elif method == 'DELETE':
            # Rebinding the name only works because counts is declared global
            counts = defaultdict(int)
            start_response('200 OK', [('content-type', 'text/plain')])
            return ('OK',)
        else:
            start_response('405 METHOD NOT ALLOWED', [('content-type', 'text/plain')])
            return ('Method Not Allowed',)
    # ... rest of app
We've done basically the same thing here as we did with the previous example: we are looking at the request path and method and doing the appropriate action. There really is nothing very tricky going on here. We're inventing our own format for the case where we return the counts for all songs, but it's nothing that will be hard to parse.
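As a rough illustration of how little work that format takes, here is a Python 3 sketch of writing and reading it. The function names format_counts and parse_counts are invented for this example. It assumes song IDs never contain a comma or an equals sign, since the format has no escaping.

```python
def format_counts(counts):
    """Serialize a mapping as comma-separated songid=count pairs."""
    return ','.join('%s=%s' % (k, v) for k, v in counts.items())

def parse_counts(text):
    """Parse the format back into a dict of song ID (str) to count (int)."""
    if not text:
        return {}  # an empty body means no songs are registered
    result = {}
    for pair in text.split(','):
        song_id, count = pair.split('=')
        result[song_id] = int(count)
    return result

wire = format_counts({'1': 1, '5': 2})
print(wire)                # 1=1,5=2
print(parse_counts(wire))  # {'1': 1, '5': 2}
```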
NOTE: Generally you would want to use some sort of threading lock primitive before accessing a global dictionary like this. I will be using Spawning to run this WSGI application, with a threadpool size of 0 to use cooperative coroutines instead of standard threads, so I am able to get away without locks for this application. To install Spawning for yourself, just type:
sudo easy_install Spawning
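If you do run this under real threads, a lock around the dictionary is the usual fix. This is a minimal Python 3 sketch of that idea, not what the utility above does; the helper names increment, get, and worker are made up for the example.

```python
import threading
from collections import defaultdict

counts = defaultdict(int)
counts_lock = threading.Lock()  # guards every read and write of counts

def increment(song_id):
    """Add one play and return the new total, atomically."""
    with counts_lock:
        counts[song_id] += 1
        return counts[song_id]

def get(song_id):
    """Read a count without creating an entry for unseen songs."""
    with counts_lock:
        return counts.get(song_id, 0)

def worker(song_id, plays):
    for _ in range(plays):
        increment(song_id)

# Eight threads each record 1000 plays of the same song.
threads = [threading.Thread(target=worker, args=('1', 1000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(get('1'))  # 8000
```

Holding the lock across both the increment and the read is what guarantees each caller gets back the value its own increment produced.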
Running the Utility
Let's just take a quick look at how this utility works, from the command line:
$ spawn -t 0 -p 8000 counter.application
...and in another window:
$ curl http://127.0.0.1:8000/song/1
0
$ curl -X POST http://127.0.0.1:8000/song/1
1
$ curl http://127.0.0.1:8000/song/1
1
$ curl -X POST http://127.0.0.1:8000/song/5
1
$ curl -X POST http://127.0.0.1:8000/song/5
2
$ curl http://127.0.0.1:8000/
1=1,5=2
$ curl -X DELETE http://127.0.0.1:8000/
OK
As you can see, the play counter is behaving as expected.
Writing a Client to Talk to our Utility
Now that we have our WSGI utility written to keep track of the counts on our songs, we should write a client library to communicate with this server.
import httplib

class CountClient(object):
    def __init__(self, servers=None):
        # Avoid a mutable default argument; fall back to a single local server
        self.servers = servers or ['127.0.0.1:8000']

    def _get_server(self, song_id):
        # song_id must be an integer; the same ID always maps to the same server
        return self.servers[song_id % len(self.servers)]

    def _song_request(self, song_id, method):
        conn = httplib.HTTPConnection(self._get_server(song_id))
        conn.request(method, '/song/%s' % (song_id,))
        resp = conn.getresponse()
        play_count = int(resp.read())
        conn.close()
        return play_count

    def get_play_count(self, song_id):
        return self._song_request(song_id, 'GET')

    def increment_play_count(self, song_id):
        return self._song_request(song_id, 'POST')

    def get_all_play_counts(self):
        # Ask every server for its mapping and merge the results
        dct = {}
        for server in self.servers:
            conn = httplib.HTTPConnection(server)
            conn.request('GET', '/')
            counts = conn.getresponse().read()
            conn.close()
            if not counts:
                continue
            dct.update(dict([map(int, pair.split('=')) for pair in counts.split(',')]))
        return dct

    def reset_all_play_counts(self):
        # Returns True only if every server answered 'OK'
        status = True
        for server in self.servers:
            conn = httplib.HTTPConnection(server)
            conn.request('DELETE', '/')
            resp = conn.getresponse().read()
            if resp != 'OK':
                status = False
            conn.close()
        return status
What we have here is a simple class that converts Python method calls to the RESTful HTTP equivalents that we have written for our WSGI utility. The best part about this setup, though, is that the client picks which server to connect to from the song_id itself (here, the song_id modulo the number of servers). If you only ever do per-song operations, this setup scales out about as far as you'd ever need. You could have thousands of servers keeping track of song counts, none of them knowing about each other. Since the decision about which server to talk to happens on the client side, there needs to be no communication between the servers whatsoever.
However, get_all_play_counts and reset_all_play_counts have to contact every server, one after another, so those two calls get slower as more servers are added.
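Here is the server-selection idea on its own, as a small Python 3 sketch. pick_server mirrors the client's modulo rule; pick_server_for_key is a hypothetical variant for string IDs that I've added for illustration, using CRC32 because it gives the same answer in every process. The server addresses are made up.

```python
import zlib

def pick_server(song_id, servers):
    """Same rule as the client: integer ID modulo the number of servers."""
    return servers[song_id % len(servers)]

def pick_server_for_key(key, servers):
    """Hypothetical variant for string IDs, using CRC32 as a stable hash."""
    return servers[zlib.crc32(key.encode('utf-8')) % len(servers)]

# Hypothetical addresses, for illustration only.
three = ['10.0.0.1:8000', '10.0.0.2:8000', '10.0.0.3:8000']
four = three + ['10.0.0.4:8000']

# Every client computes the same answer without talking to anyone.
print(pick_server(7, three))  # 10.0.0.2:8000

# Growing the pool changes the mapping for most IDs.
moved = [i for i in range(12) if pick_server(i, three) != pick_server(i, four)]
print(len(moved))  # 9
```

One thing this makes visible: with plain modulo, changing the server list sends most song IDs to a different server, so counts held in memory on the old server are no longer where the client looks. That is worth keeping in mind before resizing the pool.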
Let's explore this client:
>>> from countclient import CountClient
>>> c = CountClient()
>>> c.get_play_count(1)
0
>>> c.increment_play_count(1)
1
>>> c.increment_play_count(1)
2
>>> c.get_play_count(1)
2
>>> c.increment_play_count(5)
1
>>> c.get_all_play_counts()
{1: 2, 5: 1}
>>> c.reset_all_play_counts()
True
>>> c.get_all_play_counts()
{}
Benchmarks!
I'm not a benchmarking nut in any way, shape, or form these days. However, in Python it's quite tough to beat pure-WSGI applications for raw speed. Using my MacBook Pro with a 2.5GHz Intel Core 2 Duo and 2 GB 667 MHz DDR2 SDRAM I got these results from ApacheBench:
e:Desktop ericflo$ ab -n 10000 http://127.0.0.1:8000/song/1
...
Concurrency Level: 1
Time taken for tests: 7.792 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1020000 bytes
HTML transferred: 10000 bytes
Requests per second: 1283.31 [#/sec] (mean)
Time per request: 0.779 [ms] (mean)
Time per request: 0.779 [ms] (mean, across all concurrent requests)
Transfer rate: 127.83 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       2
Processing:     0    1   0.8      1      43
Waiting:        0    1   0.5      0      43
Total:          1    1   0.8      1      43
Take these results with a huge grain of salt, but suffice it to say, it's fast. It would probably be even faster using mod_wsgi instead of Spawning.
Drawing Conclusions From This Exercise
I don't want my standpoint on this to be misconstrued: frameworks definitely have their place. There's no way you would want to write an entire user-facing application with pure WSGI unless you were using lots of middleware, and at some point you're just recreating Pylons. But when you're writing an HTTP utility like we did here, I think that pure-WSGI is the way to go.
I'd like to touch on one more nice side effect of using pure-WSGI: You can run it in any application server that supports WSGI. That means Google App Engine, Apache, Spawning, CherryPy, and many other containers. It can also easily be served by pure Python, so even on very restrictive shared hosting it's possible to run your utility.
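As one example of the pure Python route, the standard library's wsgiref module can serve a WSGI callable with nothing else installed. This is a Python 3 sketch using the hello-world skeleton rather than the full counter; the QuietHandler class and the background thread are only there so the example can start the server, make one request, and shut down again.

```python
import threading
import urllib.request
from wsgiref.simple_server import make_server, WSGIRequestHandler

def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello world!']  # bytes, as Python 3 WSGI requires

class QuietHandler(WSGIRequestHandler):
    """Suppress the per-request log line that wsgiref prints by default."""
    def log_message(self, format, *args):
        pass

# Port 0 asks the operating system for any free port.
server = make_server('127.0.0.1', 0, application, handler_class=QuietHandler)
port = server.server_port

thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

body = urllib.request.urlopen('http://127.0.0.1:%d/' % port).read()
print(body)  # b'Hello world!'

server.shutdown()
server.server_close()
```

For everyday use you would more likely pick a fixed port and call server.serve_forever() directly in the main thread. wsgiref is meant for development and light duty, so a production setup would probably still use one of the servers named above.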
What do you think of pure-WSGI utilities? Are you using them in your app? I'd love to hear about it--leave me a comment and tell me your thoughts on this subject.
Yesterday was the day of "2008 in review" posts. I really enjoyed reading over what people had accomplished during the year, and what they planned on doing in the new year. I hadn't planned on writing this post, but for some reason I'm doing it anyway. So here goes, not much technical stuff ahead, so if personal stuff bores you then click away now.
2008 in Review
What a fantastic year. What a crazy year. I don't think there's ever been a year in my life so filled with change. For starters, I...
- Learned Erlang
- Joined Twitter and FriendFeed
- Drove 3 hours to caucus in Iowa (what a strange ritual that is!)
- Met some really cool people at PyCon 2008
- Became involved in the Pinax project
- Wrote and released my first 7 open source Django applications
- Guest hosted a podcast
- Learned how to use Git
- Graduated from college with bachelor's degree in Computer Science
- Accepted a job offer with Mochi Media
- Moved to San Francisco
- Released a series of 14 screencasts building a project from the ground up
- Wrote one blog post every day in the month of November
- Quit drinking soda
- Participated in the most notable election of my lifetime
- Started a Django San Francisco local user group
...but those are just the highlights. It's been really interesting transitioning from college student to the real world. Somehow I thought that not having homework would mean that I would have more free time to do other things, but it turns out that you actually have less time if you have a job to go to every day.
But I don't think that being a student ever stopped. I've learned so much from my coworkers: different ways of thinking about problems, theoretical problems, real world challenges, prioritizing work, RESTful principles of the web, and a whole lot about databases. This is what I like about this industry. However much you know, there's more to learn, and people are generally willing to share their expertise with you.
I've learned a lot about what it means to be part of a community; part of a company. I've learned a lot about life, this year.
But the learning's not over.
Goals for 2009
While I'm wary of posting my goals for all the world to see, I think it's important to codify them and make them public. That way, maybe it'll compel me to actually follow through on these goals. I've tried to keep them realistic and, with a few exceptions, specific enough to be testable. Without further ado, here are my goals for 2009 (in no particular order):
- Learn a concatenative/stack-based language
- Meet new people, especially those that I wouldn't normally meet
- Write a virtual machine
- Double both my blog and Twitter followership
- Be more consistent in my contributions to the Pinax project, instead of helping in fits and starts like I currently do
- Post at least twice a month to this blog
- Release some non-open source software, and maybe even charge for it
- Work out an average of 2 times a week or more over the whole year
- Learn Processing
- Learn about investments, and invest 10% of my earnings each month
- Give a talk at a conference
I think that all of these goals are achievable. In fact, I'd like to do more than just what I listed here, but these are the ones I'm willing to commit to. Frankly, I can't wait to see how this year shapes up. If it turns out to be anything like last year, I'm in for quite a ride.