mercredi 1 avril 2015

Cool stuff with generators in Python

You can do great things using generator in Python.

A generator is a special function or method that keeps its state in memory. Its most common usage is to return a result at each iteration of a loop. You can recognize a genrator in a loop easily since it uses the keyword yield instead of return.

Example:

def simple_generator():
    for i in range(10):
        yield i

for value in simple_generator():
    print(value)

# Result:

0
1
2
3
4
5
6
7
8
9

You can also use this persistent state to do more complicated things. Some days ago I wanted to write a little parser and formatter for performance logs. It would list calling methods, called methods and time spent in calls. The goal was to track the relation between calling and called methods to produce sequence diagrams and CSV data to look for performance pitfalls.

The logs display the thread information: it should be used to avoid a complete mess in calling sequence. So the log file may look like:

timestamp [1] called execution called_method1: 1.432 seconds
timestamp [1] called execution called_method2: 0.546 seconds
timestamp [2] called execution called_method1bis: 0.456 seconds
timestamp [2] called execution called_method2bis: 0.456 seconds
timestamp [2] called execution called_method3bis: 0.456 seconds
timestamp [1] calling execution calling_method1: 3.000 seconds
timestamp [2] calling execution calling_method2: 2.697 seconds

Where the thread number is between braces. I want something like this as an output:

[{'name': 'calling_method1',
'thread': 1,
'duration': '3.000',
'called_methods': [
    {name: 'called_method1',
    …
    }
]
},
{'name': 'calling_method2',
'thread': 2,
'duration': '2.697',
'called_methods': [
    {name: 'called_method1bis',
    …
    }
]
},
]

Parsing code using a generator could be like:

# Parsing regexes. They use named groups defined by (?P<name>) as they improve
# readability and extensibility.
CALLING_RE = re.compile(
    r'^.*\[(?P<thread>\d+)\].*calling execution (?P<method>.*)'
    r': (?P<duration>.*) secondes')
CALLED_RE = re.compile(
    r'^.*\[(?P<thread>\d+)\].*called execution (?P<method>.*)'
    r':(?P<duration>.*) secondes')

def extract(file_lines):
    # The dict keep the state of the generator. It is used to organize
    # calling / called chains by thread number.
    current_elts = {}

    for line in file_lines:

        g = CALLING_RE.search(line)
        f = CALLED_RE.search(line)

        # If I have a calling match, I build the whole calling chain for the
        # thread and return a copy of the CallingElement.
        if g:
            current_thread = g.group("thread")
            # I could have no CalledElement here, hence the use of
            # setdefault method.
            elt = current_elts.setdefault(current_thread, CallingElement())
            elt.calling_method = g.group("method")
            elt.calling_duration = g.group("duration")
            elt.calling_thread = current_thread
            # A copy is returned since I want to get rid of the original
            # reference.
            new_elt = copy.deepcopy(elt)
            # I del the current thread entry, so it can be reused for another
            # calling chain.
            del current_elts[current_thread]
            yield new_elt
        if f:
            # In regular cases, the calling chain starts with a called
            # method.
            elt = current_elts.setdefault(f.group("thread"), GraphElement())
            elt.called_method.append(
                CalledElement(f.group("method"), f.group("duration"))
            )
            # I yield None. It doesn't seem so elegant to me but I'll stick
            # with it. I yield anyway because the generator has to return
            # something at each iteration. It can be seen as a map function
            # over a stream content, not a filter.
            yield None

# These two classes are data structures convenient to store logged
# information.

class CallingElement:
    def __init__( self):
        self.calling_method = None
        self.calling_duration = None
        self.calling_thread = None
        self.called_method = []

class CalledElement:
    def __init__(self, name, duration):
        self.name = name
        self.duration = duration

# One can then extract the chains from a log file like this:
# I filter the generator to leave None entries behind.
for calling_meth in (m for m in extract(open("file.log")) if m):
    …

I like the atomicity of the approach. The exctraction process does carry a state and yet it remains within a single function, so it reduces side effects that I would have with a complete object.

I also like the fact that the input is processed as a stream, with obviously a low memory foot print and a good performance level.