There are many times when the programming task at hand is to iterate over some semi-structured text, transform parts of that text in some way, and reintegrate those transformed parts back into the original text.
Typically using a regular expression with re.sub and a callback function, but sometimes you want a bit more control of the process (especially over those parts that dont match the regex). Usually my solution is to write a one-off function that does it, but today I had to write that function yet again and decided to generalize it and post it here.
To be completely honest, this post is more for my own archival purposes than for the internet as a whole, but if anyone else finds it useful, then I'm ecstatic.
def re_parts(regex_list, text):
"""
An iterator that returns the entire text, but split by which regex it
matched, or none at all. If it did, the first value of the returned tuple
is the index into the regex list, otherwise -1.
>>> first_re = re.compile('asdf')
>>> second_re = re.compile('an')
>>> list(re_parts([first_re, second_re], 'This is an asdf test.'))
[(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]
>>> list(re_parts([first_re, second_re], 'asdfasdfasdf'))
[(0, 'asdf'), (0, 'asdf'), (0, 'asdf')]
>>> list(re_parts([], 'This is an asdf test.'))
[(-1, 'This is an asdf test.')]
>>> third_re = re.compile('sdf')
>>> list(re_parts([first_re, second_re, third_re], 'This is an asdf test.'))
[(-1, 'This is '), (1, 'an'), (-1, ' '), (0, 'asdf'), (-1, ' test.')]
"""
def match_compare(x, y):
return x.start() - y.start()
prev_end = 0
iters = [r.finditer(text) for r in regex_list]
matches = []
while iters:
if matches:
match = matches.pop(0)
(start, end) = match.span()
if start > prev_end:
yield (-1, text[prev_end:start])
yield (regex_list.index(match.re), text[start:end])
elif start == prev_end:
yield (regex_list.index(match.re), text[start:end])
prev_end = end
else:
matches = []
for iterator in iters:
try:
matches.append(iterator.next())
except StopIteration:
iters.remove(iterator)
matches = sorted(matches, match_compare)
last_bit = text[prev_end:]
if len(last_bit) > 0:
yield (-1, last_bit)
All Content


So, this is useful for what, exactly?
I would be interested to see an example of how you use this that is a bit more specific than the doctests.
Looks kinda like a sre scanner, there's one in simplejson:
<a href="http://code.google.com/p/simplejson/source/browse/trunk/simplejson/scanner.py">scanner.py</a>
Alternatively, you could let the regex engine to do its work. :-) Combine the patterns into a single regex using capturing and alternation:
regex = re.compile('|'.join('(%s)' % s for s in patterns))
This way you won't have to do the extra work of matching "in parallel", and finding the index of the matched pattern will be as simple as match.lastindex - 1:
prev_end = 0
for match in regex.finditer(text):
start, end = match.span()
if prev_end < start:
yield (-1, text[prev_end:start])
prev_end = end
yield (match.lastindex - 1, text[start:end])
if prev_end < len(text):
yield (-1, text[prev_end:])
Simpler, shorter and faster. What's not to like?
Well... the above recipe assumes there will be no captures in the patterns in your and that any flags to be used are embedded in the patterns. I don't think that's a problem in practice, though.
Now let's see if I can find docs for those SRE scanners...
Okay, so it looks like sre exposes the machinery required to fairly easily(?) solve those two points.
What do you use this for btw?