Posted on May 6, 2008 at 3:24 A.M.

There are many times when the programming task at hand is to iterate over some semi-structured text, transform parts of that text in some way, and reintegrate those transformed parts back into the original text.

Typically using a regular expression with re.sub and a callback function, but sometimes you want a bit more control of the process (especially over those parts that dont match the regex). Usually my solution is to write a one-off function that does it, but today I had to write that function yet again and decided to generalize it and post it here.

To be completely honest, this post is more for my own archival purposes than for the internet as a whole, but if anyone else finds it useful, then I'm ecstatic.

def re_parts(regex_list, text):
    """
    An iterator that returns the entire text, but split by which regex it
    matched, or none at all.  If it did, the first value of the returned tuple
    is the index into the regex list, otherwise -1.

    >>> first_re = re.compile('asdf')
    >>> second_re = re.compile('an')
    >>> list(re_parts([first_re, second_re], 'This is an asdf test.'))
    [(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]

    >>> list(re_parts([first_re, second_re], 'asdfasdfasdf'))
    [(0, 'asdf'), (0, 'asdf'), (0, 'asdf')]

    >>> list(re_parts([], 'This is an asdf test.'))
    [(-1, 'This is an asdf test.')]

    >>> third_re = re.compile('sdf')
    >>> list(re_parts([first_re, second_re, third_re], 'This is an asdf test.'))
    [(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]
    """
    def match_compare(x, y):
        return x.start() - y.start()
    prev_end = 0
    iters = [r.finditer(text) for r in regex_list]
    matches = []
    while iters:
        if matches:
            match = matches.pop(0)
            (start, end) = match.span()
            if start > prev_end:
                yield (-1, text[prev_end:start])
                yield (regex_list.index(match.re), text[start:end])
            elif start == prev_end:
                yield (regex_list.index(match.re), text[start:end])
            prev_end = end
        else:
            matches = []
            for iterator in iters:
                try:
                    matches.append(iterator.next())
                except StopIteration:
                    iters.remove(iterator)
            matches = sorted(matches, match_compare)
    last_bit = text[prev_end:]
    if len(last_bit) > 0:
        yield (-1, last_bit)
at 2:07 p.m.
on May 6, 2008

So, this is useful for what, exactly?

at 2:30 p.m.
on May 6, 2008

I would be interested to see an example of how you use this that is a bit more specific than the doctests.

at 11:26 a.m.
on May 7, 2008

Looks kinda like a sre scanner, there's one in simplejson:
<a href="http://code.google.com/p/simplejson/source/browse/trunk/simplejson/scanner.py">scanner.py</a>

Arien
at 7:35 p.m.
on May 8, 2008

Alternatively, you could let the regex engine to do its work. :-) Combine the patterns into a single regex using capturing and alternation:

regex = re.compile('|'.join('(%s)' % s for s in patterns))

This way you won't have to do the extra work of matching "in parallel", and finding the index of the matched pattern will be as simple as match.lastindex - 1:

prev_end = 0
for match in regex.finditer(text):
start, end = match.span()
if prev_end < start:
yield (-1, text[prev_end:start])
prev_end = end
yield (match.lastindex - 1, text[start:end])
if prev_end < len(text):
yield (-1, text[prev_end:])

Simpler, shorter and faster. What's not to like?

Well... the above recipe assumes there will be no captures in the patterns in your and that any flags to be used are embedded in the patterns. I don't think that's a problem in practice, though.

Now let's see if I can find docs for those SRE scanners...

Arien
at 7:35 a.m.
on May 9, 2008

Okay, so it looks like sre exposes the machinery required to fairly easily(?) solve those two points.

What do you use this for btw?

Search

 

Recent Links

  • git-issues: A distributed issue tracker built-in to Git.
  • I predicted this back in March--can't believe a solution has surfaced so soon. It makes so much sense to build in an issue tracker to a revision control system. Since you're working with the code, might as well track the issues in the same system and take advantage of the extra metadata. This is really cool (and as a bonus, it's written in Python) so I hope to see it really grow and flourish!

  • How I Work Daily
  • Daily blog by Kevin Fricovsky. In addition to having some really great content, he has started to post audio interviews with people from the Django community. This is a site to keep an eye on in the coming days and months.

  • django-arcade
  • Demo site for django-arcade, an open source reusable Django app to add new flash games to any django-powered site. Looks very cool for easily creating game portals. It also comes from my future employer.

  • Facebook Chat and Scalability (with Erlang)
  • Eugene Letuchy talks about how they they took Facebook Chat from no users to 70 million users, with the help of Erlang.

  • Simon Willison: The Implications of OpenID
  • I somehow missed this presentation when it came out, but it's an absolutely fantastic overview and defense of OpenID by Simon Willison. If you are in any way interested in what OpenID is and what it can offer, you owe it to yourself to check out this presentation.

  • StupidXML
  • Probably the simplest XML library that I've seen for Python. Sometimes you just want to generate some stupid XML, and this is the perfect tool for the job.

  • See the rest of my links...

Pownce

Badges

  • django badge
  • apache badge
  • GeoURL
  • XFN Friendly
  • Valid HTML 4.01 Transitional