Posted on Jan. 2, 2009 at 9:19 P.M.

Yesterday was the day of "2008 in review" posts. I really enjoyed reading over what people had accomplished during the year, and what they planned on doing in the new year. I hadn't planned on writing this post, but for some reason I'm doing it anyway. So here goes. There's not much technical stuff ahead, so if personal stuff bores you, click away now.

2008 in Review

What a fantastic year. What a crazy year. I don't think there's ever been a year in my life so filled with change. For starters, I...

...but those are just the highlights. It's been really interesting transitioning from college student to the real world. Somehow I thought that not having homework would mean that I would have more free time to do other things, but it turns out that you actually have less time if you have a job to go to every day.

But I don't think that being a student ever stopped. I've learned so much from my coworkers: different ways of thinking about problems, theoretical problems, real world challenges, prioritizing work, RESTful principles of the web, and a whole lot about databases. This is what I like about this industry. However much you know, there's more to learn, and people are generally willing to share their expertise with you.

I've learned a lot about what it means to be part of a community and part of a company. I've learned a lot about life this year.

But the learning's not over.

Goals for 2009

While I'm wary of posting my goals for all the world to see, I think it's important to codify them and make them public. That way, maybe it'll compel me to actually follow through on these goals. I've tried to keep them realistic and, with a few exceptions, specific enough to be testable. Without further ado, here are my goals for 2009 (in no particular order):

  • Learn a concatenative/stack-based language
  • Meet new people, especially those that I wouldn't normally meet
  • Write a virtual machine
  • Double both my blog and Twitter followership
  • Be more consistent in my contributions to the Pinax project, instead of helping in fits and starts as I currently do
  • Post at least twice a month to this blog
  • Release some non-open source software, and maybe even charge for it
  • Work out an average of 2 times a week or more over the whole year
  • Learn Processing
  • Learn about investments, and invest 10% of my earnings each month
  • Give a talk at a conference

I think that all of these goals are achievable. In fact, I'd like to do more than just what I listed here, but these are the ones I'm willing to commit to. Frankly, I can't wait to see how this year shapes up. If it turns out to be anything like last year, I'm in for quite a ride.


Posted on Nov. 30, 2008 at 5:58 P.M.

This post is my final post in the blog-post-per-day challenge. There have been days when I really didn't want to blog (resulting in posts like this), and there have been days when I was excited about what I was writing about and spent a lot of time on a post (resulting in posts like this one). It was much more difficult and time-consuming than I was expecting. Another thing I wasn't expecting, which kept motivating me to write more, was actual traffic to this site. Previously this site saw maybe 100 hits/day, mostly from Google searches on certain Django topics. When I started doing the blog-post-per-day thing, however, this is what my traffic turned into:

http://media.eflorenzano.com/img/analytics.png

But as I started to look at where that traffic was coming from, I realized it wasn't organic at all. (insert sad puppy face here) In fact, most of the traffic was coming from just a few sites. Here's what my top 10 referring sites were:

http://media.eflorenzano.com/img/analytics-2.png

Once I found out that nearly all my traffic was coming from reddit, I started to try to cater towards that audience. To reddit's credit, however, the more I tried to target content towards what I thought redditors would like, the worse the articles did over there. Either I was doing a bad job of writing articles for that audience, or they wised up to my act (I'm thinking the latter is more likely).

I was also very surprised by which articles turned out to be the most popular. Here's the list of articles that I thought were my best:

  1. Easy Multi-Database Support for Django
  2. Writing an Markov-Chain IRC Bot with Twisted and Python
  3. Lambda Calculus
  4. Drop-dead simple Django caching
  5. Using CouchDB with Django
  6. Why CouchDB Rocks
  7. "Web Hooks"
  8. Reverse HTTP

Here's the list of my top 8 most popular articles over the past few days, in order, by traffic:

  1. Gems of Python
  2. Why CouchDB Sucks
  3. It's caches all the way down
  4. The internet is in immediate danger of collapse
  5. Why use VARCHAR when you can use TEXT?
  6. Using CouchDB with Django
  7. Why CouchDB Rocks
  8. Secrets of the Django ORM

Interestingly enough, there are only two posts that appear on both lists. Really, it surprised me which articles got picked up and which didn't. I suppose it has something to do with the sensational titles and the sometimes-controversial posts. Specifically, the VARCHAR/TEXT post got much, much more push-back than I was expecting. In hindsight, I was wrong to mention anything other than PostgreSQL and SQLite, as those are the only databases I've actually done the TEXT-only experimentation in.

After doing 14 screencasts and now 30 blog posts over the past few months, I'm pretty well spent in terms of creating new, original content. That's not to say that I'm going to stop writing here or anything like that, but certainly I won't be posting quite as much. When I do, it will be because I legitimately have something to say, instead of because of an obligation.

Thanks for visiting my site while I participated in this challenge. I hope you stick around.



Posted on Nov. 29, 2008 at 4:31 P.M.

Last week I wrote an article called Why CouchDB Sucks, which many people correctly said should have been called "What CouchDB Sucks at Doing". Nearly everyone pointed out that it was not designed to do the things that I was mentioning in the article. This time around, I'd like to focus on some of the features of CouchDB that I think absolutely rock.

CouchDB is schema-free

One of the most annoying parts of dealing with a traditional SQL database is that you invariably need to change your schemata. This can usually be done with some ALTER TABLE statements, but other times it requires scripts and careful use of transactions, etc. In CouchDB, the solution is to just start using your new schema. No migration needed. If it's a significant change, then you might need to change your views slightly, but nothing as annoying as what would be needed with SQL.

The other advantage of having no schema is that some types of data just aren't well suited to having a strict schema enforced upon them. My CouchDB-based lifestreaming application is a perfect example: the inherent flexibility of CouchDB's schemaless design means that all kinds of disparate information can be stored alongside each other and sorted and aggregated. There's also no requirement that you use its schema-free nature this way. You could, for example, manually enforce a schema for certain databases, if needed.

CouchDB is RESTful HTTP

When was the last time you tried to install MySQL or PostgreSQL drivers for your web development platform of choice? If you're using apt-get it's not so bad, but for just about every other platform, it's a total pain to get these drivers up and running. With CouchDB, there's no need. It speaks HTTP. Want to create a new database? Send an HTTP PUT request. Want to retrieve a document from the database? Send an HTTP GET. Want to delete a database? Send an HTTP DELETE. As you can see, the API is quite straightforward, and if a client library doesn't already exist for your language of choice (hint: it does), then it will take you only a few minutes to write one.
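To give a rough idea of how thin such a client can be, here is a sketch in Python 3 using only the standard library. The server address, the `lifestream` database and the `first-entry` document id are made up for the example, and `couch_request` and `send` are hypothetical helpers, not part of any CouchDB library.

```python
import json
import urllib.request

# Assumption: a CouchDB server listening locally on its usual port.
COUCH_URL = "http://localhost:5984"


def couch_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the given CouchDB path."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        COUCH_URL + path, data=data, headers=headers, method=method
    )


def send(request):
    """Send a request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# "lifestream" and "first-entry" are hypothetical names.
create_db = couch_request("PUT", "/lifestream")
save_doc = couch_request("PUT", "/lifestream/first-entry", {"type": "tweet"})
get_doc = couch_request("GET", "/lifestream/first-entry")
delete_db = couch_request("DELETE", "/lifestream")
```

With a server running, each step is then just a call like `send(create_db)`. Nothing here is sent until you do that, so the snippet is safe to run without CouchDB installed.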

But the best part about this is that we already have so many amazing and well-tested tools to deal with HTTP. For example, let's say you want to store one database on one server and another database on another server? It's as simple as setting up nginx or perlbal or varnish as a reverse proxy and having each URL go to a different machine. The same thing goes for transparent caching, etc. Oh, and also, web browsers know how to speak HTTP, too. You could easily write whole web apps served only from CouchDB.
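In practice you'd express that routing in the proxy's own configuration, but the decision it makes is small enough to sketch in a few lines of Python. The backend host names below are invented, and `backend_for` and `upstream_url` are hypothetical helpers that only illustrate the idea of routing on the database name in the URL.

```python
# Hypothetical backends: each database lives on its own machine.
BACKENDS = {
    "lifestream": "http://couch-a.internal:5984",
    "blog": "http://couch-b.internal:5984",
}
# Anything that doesn't name a known database falls through to this server.
DEFAULT_BACKEND = "http://couch-a.internal:5984"


def backend_for(path):
    """Pick the upstream server from the first URL segment (the database name)."""
    database = path.lstrip("/").split("/", 1)[0]
    return BACKENDS.get(database, DEFAULT_BACKEND)


def upstream_url(path):
    """Return the full URL a reverse proxy would forward this request to."""
    return backend_for(path) + path


routed = upstream_url("/blog/some-post")
```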

Map/Reduce

Map/Reduce will kill every traditional data warehousing vendor in the market. Those who adapt to it as a design/deployment pattern will survive, the rest won't.

Sounds like someone from Google must have said this, or some Hadoop evangelist, or maybe someone who works on CouchDB. In fact, this comes from Brian Aker, a MySQL hacker who was Director of Architecture at MySQL AB and is now developing the open source fork of MySQL named Drizzle (also a very exciting project in its own right). He's right, too. Google was on to something in a big way when they unveiled their whitepaper on Map/Reduce. It's not the be-all end-all for processing and generating large data sets, but it certainly is a proven technology for that task.

Brian talks about massively multi-core machines, which seem inevitable these days, and says we will need to start writing logic that is massively parallelizable to take advantage of these masses of CPUs. Map/Reduce is one way to force ourselves to write logic that can be parallelized. It is a good choice for any new database system to adopt for this reason, and that's why it's great to see that CouchDB has adopted it. It's just one more reason why CouchDB rocks.
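To make the shape of the idea concrete, here's a toy, in-memory version in Python. This is not CouchDB's view API (CouchDB views are normally written in JavaScript); the documents and the `map_doc`, `reduce_values` and `map_reduce` names are invented for illustration. The point is that the map step looks at one document at a time, which is what makes it easy to spread across many CPUs.

```python
from collections import defaultdict

# Documents of different shapes living side by side, as in a schema-free store.
documents = [
    {"type": "tweet", "text": "hello"},
    {"type": "bookmark", "url": "http://example.com/"},
    {"type": "tweet", "text": "world"},
]


def map_doc(doc):
    """Emit (key, value) pairs for one document, independently of all others."""
    yield doc["type"], 1


def reduce_values(key, values):
    """Combine all the values that were emitted for one key."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:  # each map call is independent, so it could run anywhere
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


counts = map_reduce(documents, map_doc, reduce_values)
# counts == {"tweet": 2, "bookmark": 1}
```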

So much more

I could talk about how it can handle 2,500 concurrent requests in 10 MB of resident memory. I could talk about its pluggable view server backends, so that instead of writing views in JavaScript you can write them in Python or any other language (given the correct bindings). I could talk about CouchDBX, which makes installing it on the Mac, quite literally, one click. I could even talk about how it's written in Erlang, with an eye towards scalability. Or maybe about how its database store is append-only.

I could talk about any of those things, and more. It just comes down to this: CouchDB rocks. But don't take my word for it--try it out for yourself!


Posted on Nov. 28, 2008 at 11:30 P.M.

Caching is easy to screw up. Usually it's a manual process which is error-prone and tedious. It's actually quite easy to cache, but knowing when to invalidate which caches is a lot harder. There is a subset of the caching problem that, with Django, can be handled quite easily. The underlying idea is that every Django model has a primary key, which makes for an excellent key to a cache. Using this basic idea, we can cover a fairly large use case for caching, automatically, in a much more deterministic way. Let's begin.

First, we need to decide upon a setting for how long each individual item should be saved in the cache. I'm going to call that SIMPLE_CACHE_SECONDS and grab it like so:

from django.conf import settings

# Default time-to-live: 2592000 seconds is 30 days
SIMPLE_CACHE_SECONDS = getattr(settings, 'SIMPLE_CACHE_SECONDS', 2592000)

The next thing we need to do is be able to generate a cache key from an instance of a model. Thanks to Django's _meta information, we can get the app label and model name, plus the primary key, and we're all set.

def key_from_instance(instance):
    opts = instance._meta
    return '%s.%s:%s' % (opts.app_label, opts.module_name, instance.pk)
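To see what those keys look like without setting up a Django project, here's the same function run against a stand-in object. The `fake_post` object and its `blog`/`blogpost` labels are made up; a real model instance would supply `_meta` and `pk` itself.

```python
from types import SimpleNamespace


def key_from_instance(instance):
    opts = instance._meta
    return '%s.%s:%s' % (opts.app_label, opts.module_name, instance.pk)


# Stand-in for a saved model instance: only the attributes the function reads.
fake_post = SimpleNamespace(
    _meta=SimpleNamespace(app_label='blog', module_name='blogpost'),
    pk=42,
)

key = key_from_instance(fake_post)
# key == 'blog.blogpost:42'
```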

So now let's start setting the cache! My preferred way to do it is via a signal, but you could do it in a less generic way by overriding save on a model. My signal looks like this:

from django.core.cache import cache
from django.db.models.signals import post_save

def post_save_cache(sender, instance, **kwargs):
    cache.set(key_from_instance(instance), instance, SIMPLE_CACHE_SECONDS)
post_save.connect(post_save_cache)

Now that we're putting items in the cache, we should probably delete them from the cache when the model instance is deleted:

from django.db.models.signals import pre_delete

def pre_delete_uncache(sender, instance, **kwargs):
    cache.delete(key_from_instance(instance))
pre_delete.connect(pre_delete_uncache)

This is all good and well, but right now we don't really have a way to get at that information. Cache is pretty useless if we never use it! Our interface to the database is through the model's QuerySet, so let's make sure that our QuerySet is making good use of our newly-populated cache. To do so, we'll subclass QuerySet:

from django.db.models.query import QuerySet

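class SimpleCacheQuerySet(QuerySet):
    def filter(self, *args, **kwargs):
        # Look for a lookup by primary key among the keyword arguments
        pk = None
        for val in ('pk', 'pk__exact', 'id', 'id__exact'):
            if val in kwargs:
                pk = kwargs[val]
                break
        # filter() returns a clone of this queryset, so build the clone first
        clone = super(SimpleCacheQuerySet, self).filter(*args, **kwargs)
        if pk is not None:
            opts = self.model._meta
            key = '%s.%s:%s' % (opts.app_label, opts.module_name, pk)
            obj = cache.get(key)
            if obj is not None:
                # Prime the result cache of the queryset we actually return
                clone._result_cache = [obj]
        return clone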

The only method that we really need to override is filter, since get and get_or_create both just rely on filter anyway. The first for loop in the filter method just checks to see if there is a query by id or pk, and if so, then we construct a key and try to fetch it from the cache. If we found the item in the cache, then we place it into Django's internal result cache. At that point we're as good as done. Then we just let Django do the rest!

This SimpleCacheQuerySet won't be used all on its own, though; we need to actually force a model to use it. How do we do that? We create a manager:

from django.db import models

class SimpleCacheManager(models.Manager):
    def get_query_set(self):
        return SimpleCacheQuerySet(self.model)

Now that we have this transparent caching library set up, we can go around to all of our models and import it and attach it as needed. Here's how that might look:

from django.db import models
from django_simplecache import SimpleCacheManager

class BlogPost(models.Model):
    title = models.TextField()
    body = models.TextField()

    objects = SimpleCacheManager()

That's it! Just by attaching this manager to our model we're getting all the benefits of per-object caching right away. Of course, this isn't comprehensive. It does hit the vast majority of use cases, though. If you were to use this for a real site, however, then you wouldn't be able to use the update method, because it bypasses save and would leave stale objects in the cache. It's a little bit trickier since there's no post_update signal, but it's nowhere near impossible. Let's just say that, for now, it's being left unimplemented as an exercise for the reader. in_bulk would actually be quite fun to implement, too, because you could get all of the results possible from cache, get all the rest from the database, then merge those two dictionaries before returning.
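Here's a rough sketch of that in_bulk idea, written as a plain function so it runs without Django. Everything in it is a stand-in: `in_bulk_cached` is a hypothetical helper, the dict plays the role of the cache, and `fake_db_in_bulk` plays the role of the real queryset's in_bulk. In a real version this logic would live on SimpleCacheQuerySet.

```python
def in_bulk_cached(pks, cache_get, db_in_bulk, make_key):
    """Return {pk: object}, reading from the cache first and the database second."""
    found = {}
    missing = []
    for pk in pks:
        obj = cache_get(make_key(pk))
        if obj is not None:
            found[pk] = obj
        else:
            missing.append(pk)
    if missing:
        # One database round trip for everything the cache could not supply.
        found.update(db_in_bulk(missing))
    return found


# Stand-ins for the cache and the database table.
fake_cache = {"blog.blogpost:1": "cached post 1"}
fake_table = {1: "db post 1", 2: "db post 2", 3: "db post 3"}
db_calls = []


def fake_db_in_bulk(pks):
    """Pretend database lookup that records which keys it was asked for."""
    db_calls.append(list(pks))
    return {pk: fake_table[pk] for pk in pks if pk in fake_table}


result = in_bulk_cached(
    [1, 2, 99],
    cache_get=fake_cache.get,
    db_in_bulk=fake_db_in_bulk,
    make_key=lambda pk: "blog.blogpost:%s" % pk,
)
# result == {1: "cached post 1", 2: "db post 2"}; the database only saw [2, 99]
```

Primary key 1 comes from the cache, 2 comes from the database, and 99 doesn't exist anywhere, so it's simply absent from the result, which matches how in_bulk treats unknown ids.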

I think this would be a really good reusable Django application. Essentially, we've grown a library from the ground up that really isn't all that much code. I think it took me 20 minutes to write the actual code, but with some serious polish and love, this library could evolve into something that I think many reusable apps would use to great benefit. What do you think? What should a good, simple, Django caching library have?


Posted on Nov. 27, 2008 at 12:31 P.M.

Yesterday I wrote about Web Hooks and how powerful it could be if one web service sends HTTP requests to another web service. Today I want to take that concept one step further. What if you tell that service that you would like it to send a POST request back to you, whenever an event happens? This slight modification makes for a very powerful tool.

Let's take the example of popular real-time web applications like Facebook's instant messenger or FriendFeed's "Real-time" view. Both of these services make use of a technique called long polling, where the client sends an HTTP request and the server does not respond until it has some event to deliver. The client can only keep the request open for so long, so it periodically times out and re-sends the request. (It also re-sends the request if it does receive some data).
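The client side of long polling boils down to a loop. The sketch below fakes the server with an in-process queue so it can run on its own; `wait_for_event` and `long_poll` are invented names, and the very short timeout is only there to keep the demo quick.

```python
import queue

# Stand-in for the server side: events show up on this queue when they happen.
pending_events = queue.Queue()


def wait_for_event(timeout=0.05):
    """Simulate one long-poll request: block until an event arrives or time runs out."""
    try:
        return pending_events.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("request timed out with nothing to deliver")


def long_poll(request, handle, max_requests):
    """Keep one request outstanding, re-sending it after every timeout or event."""
    for _ in range(max_requests):
        try:
            event = request()
        except TimeoutError:
            continue  # nothing happened in time: send the request again
        handle(event)  # got data: process it, then loop around to re-send


received = []
pending_events.put("new instant message")
long_poll(wait_for_event, received.append, max_requests=3)
# received == ["new instant message"]; the other two requests simply timed out
```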

The problem with this technique is that it's really trying to turn a client into a server. It's really fighting against the way that HTTP wants to work. So why fight it? Imagine that all of our browsers have simple, lightweight, HTTP servers installed. The client could request to upgrade to reverse HTTP, and then the server could initiate a connection with the client. Now, as events come in to the web service, the service could directly send those updates to the client.

Going back to the example of Facebook IM, here's how that would work: When I open a Facebook page, my client sends a request to Facebook's IM server. Facebook's IM server sends a response with the HTTP/1.1 Upgrade header reading "PTTH/0.9" (funny, huh?). Then, the client knows to accept an HTTP connection from Facebook's IM server. Facebook's IM server then opens that connection with the client, and sends HTTP POSTs every time it receives a new instant message that the client should receive. The client's web browser would have some JavaScript hooks to parse the body of those requests, so that it could update the content of the instant message window on the page.
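Here's a small sketch of the two interesting moments in that exchange: the client noticing the upgrade, and the server building a POST for a new message. It only works on strings and bytes, with no sockets involved. The "PTTH/0.9" value comes from the description above; the 101 status line, the `/im/events` path, the `browser.invalid` host and all of the function names are my own assumptions for illustration.

```python
UPGRADE_TOKEN = "PTTH/0.9"  # the Upgrade value described above


def parse_response_head(raw_response):
    """Split a raw HTTP response into its status line and a dict of headers."""
    head = raw_response.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


def should_reverse(raw_response):
    """True when the response asks to flip the connection to reverse HTTP."""
    _, headers = parse_response_head(raw_response)
    return headers.get("upgrade") == UPGRADE_TOKEN


def build_event_post(host, path, body):
    """Build the POST the server would send to the client for one event."""
    payload = body.encode("utf-8")
    head = (
        "POST %s HTTP/1.1\r\n"
        "Host: %s\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (path, host, len(payload))
    )
    return head.encode("ascii") + payload


# Assumed shape of the server's reply to the client's first request.
server_response = (
    "HTTP/1.1 101 Switching Protocols\r\n"
    "Upgrade: PTTH/0.9\r\n"
    "Connection: Upgrade\r\n"
    "\r\n"
)

message = None
if should_reverse(server_response):
    # From here on the roles are swapped: the server sends requests like this one.
    message = build_event_post("browser.invalid", "/im/events", "hi there")
```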

Isn't this brilliant? It meshes directly with the HTTP protocol, and makes this system, which seems like a hack right now, instantly become an elegant solution. I really wish I could take credit for thinking this up, but I did not. My coworker Donovan Preston blew my mind with this a few weeks back. If you're looking for a more visual example of how this might work, or a reference implementation of the protocol in action, check out this wiki page.
