Lately I've really fallen in love with writing utilities whose interface is simply HTTP. Because the utility is accessible via HTTP, it's really easy to write clients that talk to it and, if the need arises, there are lots of existing tools for doing things with HTTP, like load balancing and caching.
While it would be easy to use a framework to build these utilities, lately I've been choosing not to do so. Web frameworks like Django and Pylons are great when you need to build a fully-featured web application that will be accessible by people. When it will only be computers talking to the service, however, a lot of the machinery provided by frameworks is unneeded and will only slow your utility down. Instead of using a framework, we're going to write a pure WSGI application.
An Example: Music Discovery Website
This has all been very abstract, so let's take an example: Suppose you run a music discovery website that lets you play songs online. Next to each song, you simply want to display how many times the song has been played.
One solution to that problem could be to have a play_count column on the table where the song metadata is stored. Every time someone plays the song, you could issue an UPDATE on the row and increase the play_count by one. This solution will work while your site is small, but as more and more people begin using the application, the number of writes to your database is going to kill its performance.
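The per-play UPDATE described above might look like the following minimal sketch. It is written for Python 3 and uses the built-in sqlite3 module with an in-memory database; the songs table layout and the record_play and get_play_count helpers are assumptions made for illustration, not a real schema.

```python
import sqlite3

def record_play(conn, song_id):
    """Increment play_count for one song with a single UPDATE."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE songs SET play_count = play_count + 1 WHERE id = ?",
            (song_id,),
        )

def get_play_count(conn, song_id):
    """Return the stored play count, or 0 for an unknown song."""
    row = conn.execute(
        "SELECT play_count FROM songs WHERE id = ?", (song_id,)
    ).fetchone()
    return row[0] if row else 0

# Hypothetical schema, created in memory for the example
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs ("
    "id INTEGER PRIMARY KEY, title TEXT, "
    "play_count INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO songs (id, title) VALUES (1, 'Example Song')")

record_play(conn, 1)
record_play(conn, 1)
print(get_play_count(conn, 1))
```

Every play costs one write transaction here, which is exactly the load that becomes a problem as traffic grows.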
A much more robust and scalable solution is to append a new line to a text log file every time a song is played, and have a process run regularly to scoop up all of the log files and update those play_count fields in the database.
However, even if you have that regular process run once every hour, there's still too great a lag time between when a user takes an action and when they see the results of that action. This is where our WSGI utility comes into play. It can serve as a realtime play counter to count the plays in between the time when the logs are analyzed and the play_count columns updated.
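To make the log-and-batch idea concrete, here is a rough Python 3 sketch. The one-song-ID-per-line log format and the names log_play, aggregate_plays, and displayed_count are all assumptions for illustration. The last function shows how a page could combine the total from the most recent batch with the realtime counter, assuming the realtime counter is cleared whenever a batch finishes.

```python
import os
import tempfile
from collections import Counter

def log_play(log_path, song_id):
    """Append one line per play; an append is cheap next to a database write."""
    with open(log_path, "a") as f:
        f.write("%s\n" % song_id)

def aggregate_plays(log_path):
    """Count plays per song ID from a log file (the periodic batch job)."""
    totals = Counter()
    with open(log_path) as f:
        for line in f:
            song_id = line.strip()
            if song_id:
                totals[song_id] += 1
    return totals

def displayed_count(db_count, realtime_count):
    """What the page shows: the last batch total plus plays since that batch."""
    return db_count + realtime_count

# Throwaway log file for the example
log_path = os.path.join(tempfile.mkdtemp(), "plays.log")
for song_id in ("1", "5", "1"):
    log_play(log_path, song_id)

totals = aggregate_plays(log_path)
print(totals["1"], totals["5"])
print(displayed_count(db_count=totals["1"], realtime_count=3))
```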
Song Play Counter
We can design the interface for our WSGI song play counter utility any way that we like, but I'm going to try to keep it as RESTful as I can. The interface will look like this:
- GET /song/SONGID will return the current play count of the given song
- POST /song/SONGID will increment the play count of the given song by one, and return its new value
- GET / will return a mapping of all songs registered to their respective play counts
- DELETE / will clear the whole mapping
So let's get started. First, I always like to start with a very basic skeleton:
def application(environ, start_response):
    start_response('200 OK', [('content-type', 'text/plain')])
    return ('Hello world!',)
This does what you would imagine: it returns Hello world! to each and every request that it receives. Not very useful, so let's make it more interesting:
from collections import defaultdict
counts = defaultdict(int)
def application(environ, start_response):
    global counts
    path = environ['PATH_INFO']
    method = environ['REQUEST_METHOD']
    if path.startswith('/song/'):
        song_id = path[6:]
        if method == 'GET':
            start_response('200 OK', [('content-type', 'text/plain')])
            return (str(counts[song_id]),)
        elif method == 'POST':
            counts[song_id] += 1
            start_response('200 OK', [('content-type', 'text/plain')])
            return (str(counts[song_id]),)
        else:
            start_response('405 METHOD NOT ALLOWED', [('content-type', 'text/plain')])
            return ('Method Not Allowed',)
    start_response('404 NOT FOUND', [('content-type', 'text/plain')])
    return ('Not Found',)
We've now added the data structure that we're using to keep track of the counts, which in this case is a defaultdict(int). We're also now looking at the request path and method. If it's a GET on a path starting with /song/, we look up the count and return it, and if it's a POST, we increment the count by one before returning it. We're also doing the proper thing if we detect a method that's not allowed: we return HTTP status code 405. Any path that doesn't start with /song/ gets a 404.
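Because a WSGI application is just a callable, you can exercise it without starting a server at all. The sketch below is a Python 3 port of the counter (under Python 3, PEP 3333 requires response bodies to be bytes, which is why the return values are encoded) together with a hypothetical call_app helper that builds a fake request using wsgiref.util.setup_testing_defaults. Treat it as one possible way to poke at the app, not the only one.

```python
from collections import defaultdict
from wsgiref.util import setup_testing_defaults

counts = defaultdict(int)

def application(environ, start_response):
    # Python 3 port of the counter above: WSGI bodies must be bytes here
    path = environ['PATH_INFO']
    method = environ['REQUEST_METHOD']
    headers = [('Content-Type', 'text/plain')]
    if path.startswith('/song/'):
        song_id = path[len('/song/'):]
        if method == 'GET':
            start_response('200 OK', headers)
            return [str(counts[song_id]).encode('ascii')]
        elif method == 'POST':
            counts[song_id] += 1
            start_response('200 OK', headers)
            return [str(counts[song_id]).encode('ascii')]
        start_response('405 Method Not Allowed', headers)
        return [b'Method Not Allowed']
    start_response('404 Not Found', headers)
    return [b'Not Found']

def call_app(app, method, path):
    """Hypothetical helper: invoke a WSGI app directly, return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ['REQUEST_METHOD'] = method
    environ['PATH_INFO'] = path
    captured = {}

    def start_response(status, response_headers, exc_info=None):
        captured['status'] = status

    body = b''.join(app(environ, start_response))
    return captured['status'], body.decode('ascii')

print(call_app(application, 'POST', '/song/1'))
print(call_app(application, 'GET', '/song/1'))
print(call_app(application, 'PUT', '/song/1'))
print(call_app(application, 'GET', '/nope'))
```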
Now let's add the final bit of functionality:
from collections import defaultdict
counts = defaultdict(int)
def application(environ, start_response):
    # ... start of app (keep the `global counts` line: DELETE rebinds counts)
    if path.startswith('/song/'):
        # ... song-specific logic
    elif path == '/':
        if method == 'GET':
            res = ','.join(['%s=%s' % (k, v) for k, v in counts.iteritems()])
            start_response('200 OK', [('content-type', 'text/plain')])
            return (res,)
        elif method == 'DELETE':
            counts = defaultdict(int)
            start_response('200 OK', [('content-type', 'text/plain')])
            return ('OK',)
        else:
            start_response('405 METHOD NOT ALLOWED', [('content-type', 'text/plain')])
            return ('Method Not Allowed',)
    # ... rest of app
We've done basically the same thing here as we did with the previous example: we are looking at the request path and method and doing the appropriate action. There really is nothing very tricky going on here. We're inventing our own format for the case where we return the counts for all songs, but it's nothing that will be hard to parse.
NOTE: Generally you would want to use some sort of threading lock primitive before accessing a global dictionary like this. I will be using Spawning to run this WSGI application, with a threadpool size of 0 to use cooperative coroutines instead of standard threads, so I am able to get away without locks for this application. To install Spawning for yourself, just type:
sudo easy_install Spawning
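If you do run the counter under real threads, the lock mentioned in the note could be sketched like this. The LockedCounter class is hypothetical and written for Python 3; it simply guards each read-modify-write on the dictionary with a threading.Lock.

```python
import threading
from collections import defaultdict

class LockedCounter(object):
    """Hypothetical thread-safe wrapper around the counts dictionary."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def increment(self, song_id):
        # The read-modify-write must not interleave with other threads
        with self._lock:
            self._counts[song_id] += 1
            return self._counts[song_id]

    def get(self, song_id):
        with self._lock:
            return self._counts.get(song_id, 0)

    def reset(self):
        with self._lock:
            self._counts.clear()

counter = LockedCounter()

def play_many(n):
    for _ in range(n):
        counter.increment('1')

# Eight threads each record 1000 plays of the same song
threads = [threading.Thread(target=play_many, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.get('1'))
```

The WSGI application would then call counter.increment(song_id) and counter.get(song_id) instead of touching the dictionary directly.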
Running the Utility
Let's just take a quick look at how this utility works, from the command line:
$ spawn -t 0 -p 8000 counter.application
...and in another window:
$ curl http://127.0.0.1:8000/song/1
0
$ curl -X POST http://127.0.0.1:8000/song/1
1
$ curl http://127.0.0.1:8000/song/1
1
$ curl -X POST http://127.0.0.1:8000/song/5
1
$ curl -X POST http://127.0.0.1:8000/song/5
2
$ curl http://127.0.0.1:8000/
1=1,5=2
$ curl -X DELETE http://127.0.0.1:8000/
OK
As you can see, the play counter is behaving as expected.
Writing a Client to Talk to our Utility
Now that we have our WSGI utility written to keep track of the counts on our songs, we should write a client library to communicate with this server.
import httplib
class CountClient(object):
    def __init__(self, servers=None):
        # Avoid a mutable default argument shared between instances
        self.servers = servers or ['127.0.0.1:8000']
    def _get_server(self, song_id):
        # song_id must be an integer: it picks the server by modulo
        return self.servers[song_id % len(self.servers)]
    def _song_request(self, song_id, method):
        conn = httplib.HTTPConnection(self._get_server(song_id))
        conn.request(method, '/song/%s' % (song_id,))
        resp = conn.getresponse()
        play_count = int(resp.read())
        conn.close()
        return play_count
    def get_play_count(self, song_id):
        return self._song_request(song_id, 'GET')
    def increment_play_count(self, song_id):
        return self._song_request(song_id, 'POST')
    def get_all_play_counts(self):
        dct = {}
        for server in self.servers:
            conn = httplib.HTTPConnection(server)
            conn.request('GET', '/')
            counts = conn.getresponse().read()
            conn.close()
            if not counts:
                continue
            # Parse the "id=count,id=count" format into {id: count}
            dct.update(dict([map(int, pair.split('=')) for pair in counts.split(',')]))
        return dct
    def reset_all_play_counts(self):
        status = True
        for server in self.servers:
            conn = httplib.HTTPConnection(server)
            conn.request('DELETE', '/')
            resp = conn.getresponse().read()
            if resp != 'OK':
                status = False
            conn.close()
        return status
What we have here is a simple class that converts Python method calls to the RESTful HTTP equivalents that we have written for our WSGI utility. The best part about this setup, though, is that it picks which server to connect to from the song_id itself (the ID modulo the number of servers). If you only ever do per-song operations, this setup scales horizontally about as far as you care to take it: you could have thousands of servers keeping track of song counts, none of them knowing about each other. Since the decision about which server to talk to happens on the client side, there needs to be no communication between the servers whatsoever.
However, get_all_play_counts and reset_all_play_counts have to contact every server in turn, so those two calls get slower as more servers are added.
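To see how the client spreads songs across servers, here is the selection rule from _get_server pulled out into a standalone Python 3 function. The pick_server name and the three server addresses are made up for illustration.

```python
def pick_server(servers, song_id):
    """Same rule as CountClient._get_server: integer song ID modulo server count."""
    return servers[song_id % len(servers)]

# Hypothetical server addresses, for illustration only
servers = ['10.0.0.1:8000', '10.0.0.2:8000', '10.0.0.3:8000']

# Group song IDs 1 through 6 by the server that owns them
assignments = {}
for song_id in range(1, 7):
    assignments.setdefault(pick_server(servers, song_id), []).append(song_id)

for server in servers:
    print(server, assignments[server])
```

A per-song call touches exactly one server, while the two "all" calls touch len(servers) of them. Note also that every client has to be given the same server list in the same order, otherwise two clients would disagree about where a song's count lives.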
Let's explore this client:
>>> from countclient import CountClient
>>> c = CountClient()
>>> c.get_play_count(1)
0
>>> c.increment_play_count(1)
1
>>> c.increment_play_count(1)
2
>>> c.get_play_count(1)
2
>>> c.increment_play_count(5)
1
>>> c.get_all_play_counts()
{1: 2, 5: 1}
>>> c.reset_all_play_counts()
True
>>> c.get_all_play_counts()
{}
Benchmarks!
I'm not a benchmarking nut in any way, shape, or form these days. However, in Python it's quite tough to beat pure-WSGI applications for raw speed. Using my MacBook Pro with a 2.5GHz Intel Core 2 Duo and 2 GB 667 MHz DDR2 SDRAM I got these results from ApacheBench:
e:Desktop ericflo$ ab -n 10000 http://127.0.0.1:8000/song/1
...
Concurrency Level: 1
Time taken for tests: 7.792 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 1020000 bytes
HTML transferred: 10000 bytes
Requests per second: 1283.31 [#/sec] (mean)
Time per request: 0.779 [ms] (mean)
Time per request: 0.779 [ms] (mean, across all concurrent requests)
Transfer rate: 127.83 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       2
Processing:     0    1   0.8      1      43
Waiting:        0    1   0.5      0      43
Total:          1    1   0.8      1      43
Take these results with a huge grain of salt, but suffice it to say, it's fast. It would probably be even faster using mod_wsgi instead of Spawning.
Drawing Conclusions From This Exercise
I don't want my standpoint on this to be misconstrued: frameworks definitely have their place. There's no way you would want to write an entire user-facing application with pure WSGI unless you were using lots of middleware, and at some point you're just recreating Pylons. But when you're writing an HTTP utility like we did here, I think that pure WSGI is the way to go.
I'd like to touch on one more nice side effect of using pure WSGI: you can run it in any application server that supports WSGI. That means Google App Engine, Apache, Spawning, CherryPy, and many other containers. It can easily be served by pure Python, so even on very restrictive shared hosting it's possible to run your utility.
What do you think of pure-WSGI utilities? Are you using them in your app? I'd love to hear about it--leave me a comment and tell me your thoughts on this subject.
Most people in the Django community are deploying their apps these days with mod_wsgi. If not, then you're at least using WSGI as a communication layer with your application server, in one way or another. The great thing about WSGI is that it gives everyone a common interface through which to talk. It also has the added benefit of being a common abstraction that many people have built these great, really useful tools on top of.
Consider Repoze. If you navigate to the middleware section of their website, they have some really cool stuff available! There are utilities for logging, authentication, security, profiling, templating, etc. All of these pieces of middleware are designed to be totally pluggable, because they are designed to work solely based on what's available through WSGI.
My personal favorite of that lot is repoze.profile. It accumulates Python profiling information about whatever app is being run, and allows you to view that profile information via a web interface by visiting a special URL. There is absolutely no reason that the Pylons, TurboGears, or CherryPy guys should be able to get away with keeping this stuff for themselves, so I want to show just how easy it is to integrate this profiling module with Django.
First, though, here's a typical .wsgi file that might be used in conjunction with mod_wsgi:
import os, sys
sys.stdout = sys.stderr
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'
import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()
There's really nothing special going on here, and if you would like to learn more about how to set up this WSGI file, visit mod_wsgi's documentation on the subject. Now if you'll notice, application is simply an instance of WSGIHandler, which is simply a callable. A WSGI middleware is just a wrapper around that callable. Here's how easy it is to add the profiling middleware:
from repoze.profile.profiler import AccumulatingProfileMiddleware
application = AccumulatingProfileMiddleware(
    application,
    log_filename='/tmp/djangoprofile.log',
    discard_first_request=True,
    flush_at_shutdown=True,
    path='/__profile__')
There we go! We have imported the profiling middleware, and passed the Django WSGI application as the first argument. The rest is just setting options for the middleware. Restart Apache and the WSGI profiling middleware is already working.
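The wrapping pattern used above is easy to write by hand, which helps demystify it. Below is a minimal Python 3 sketch of a middleware; TimingMiddleware and hello_app are hypothetical names, and this is in no way how repoze.profile is implemented. It only shows that a middleware is a callable that takes the same arguments as the app it wraps and delegates to it.

```python
import time

class TimingMiddleware(object):
    """Hypothetical middleware: wraps a WSGI app and records call durations."""

    def __init__(self, app):
        self.app = app
        self.timings = []  # (path, seconds) pairs, kept in memory for the example

    def __call__(self, environ, start_response):
        started = time.time()
        try:
            # Delegate to the wrapped application unchanged
            return self.app(environ, start_response)
        finally:
            # Measures the call only, not iteration over the response body
            elapsed = time.time() - started
            self.timings.append((environ.get('PATH_INFO', ''), elapsed))

def hello_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello world!']

application = TimingMiddleware(hello_app)

# Call it by hand with a minimal environ to show the wrapper is transparent
statuses = []
body = application({'PATH_INFO': '/demo'},
                   lambda status, headers: statuses.append(status))
print(statuses, body, len(application.timings))
```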
Sometimes, though, you don't want all of Apache just to run some middleware. You want to be able to do the same thing, but locally. Believe it or not, Django's local development server is just a WSGI server itself, so one option would be to do the wrapping directly inside Django. But you really don't want to be hacking inside of Django internals if you don't have to. Fortunately there are many alternative WSGI servers out there. Brian Rosner, for one, has written on his blog about a custom management command for using the excellent CherryPy WSGI server with Django.
Let's say you just want to try this out quickly after reading this blog post, though. If you're running Python 2.5 or greater, you're in luck, because a script less than 10 lines long can get you up and running:
#!/usr/bin/env python
import sys
from wsgiref.simple_server import make_server
if __name__ == "__main__":
    execfile(sys.argv[1])
    httpd = make_server('', 8000, application)
    httpd.serve_forever()
Now, to run it, simply invoke it like this:
python runserver.py my_wsgi_file.wsgi
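The script above relies on execfile, which only exists in Python 2. If you are on Python 3, one possible equivalent is to load the .wsgi file with runpy.run_path from the standard library, as in the sketch below. The load_application and serve names are my own, and the demo.wsgi file is a throwaway written just for this example.

```python
import os
import runpy
import tempfile
from wsgiref.simple_server import make_server

def load_application(wsgi_path):
    """Run a .wsgi file and return the `application` callable it defines."""
    namespace = runpy.run_path(wsgi_path)
    return namespace['application']

def serve(wsgi_path, port=8000):
    """Serve the loaded application with wsgiref (blocks until interrupted)."""
    httpd = make_server('', port, load_application(wsgi_path))
    httpd.serve_forever()

# Demonstration with a throwaway .wsgi file; serve() is not called here
wsgi_path = os.path.join(tempfile.mkdtemp(), 'demo.wsgi')
with open(wsgi_path, 'w') as f:
    f.write(
        "def application(environ, start_response):\n"
        "    start_response('200 OK', [('Content-Type', 'text/plain')])\n"
        "    return [b'Hello world!']\n"
    )

app = load_application(wsgi_path)
print(callable(app))
```

Calling serve(sys.argv[1]) from a __main__ block would give you the same command-line behavior as the original script.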
Now, navigate around your app for a little bit and then point your browser to the profile URL (/__profile__ with the settings above) and see how freaking awesome middleware can be.
I'm not trying to stir up any controversy, and I'm not saying we should stop making Django middleware or anything like that. But I seriously, seriously hope that someone tries this out and realizes the multitudes of great WSGI apps out there that can be taken advantage of. Mark Ramm wasn't full of hot air when he talked about this at DjangoCon or blogged about it later. He was right, and I for one wish I had listened sooner.