
Thursday, 24 May 2018

Advanced scraping

I recently wrote web scraping scripts for a client to download invoices. I really like web scraping: it is a kind of automation that saves a lot of time, and it often requires reverse engineering a site. Just like a real hacker, ma! Here are some techniques I use.

Disclaimer: a site's owner may prohibit scraping it with a robot, so check that you stay within its terms of use, and please behave fairly.

Tooling

Mostly, I use Python 3, Requests and BeautifulSoup. Everything can be installed with pip.

Session

Send all your requests through a Session. It has several advantages:

it handles session cookies for you,
it keeps the TCP sockets open,
it allows you to set default headers for all the requests you'll make, for instance the User-Agent header.

Like this:

import requests

# Wow, the UA is quite old!
session = requests.Session()
session.headers.update(
    {
        "User-Agent": (
            # Mind the trailing space: adjacent literals are joined as-is.
            "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) "
            "Gecko/20100101 Firefox/58.0"
        )
    }
)

Authentication

When you gather data from the Internet, websites will often require you to sign in. Mostly, you'll have to send your credentials within your session in an HTTP POST request:

session.post(your_url, data=form_data)

CSRF protection

A good practice when handling forms on a website is to associate each of them with a random token. Hence you have to load the form from the website before submitting it. That's basically CSRF protection: it prevents a malicious site from using an existing session cookie to discreetly perform POST requests.

It's a good practice in general, though it's questionable on an authentication form!

By the way, when authenticating, you’ll often need to load the sign-in page, gather all the input fields then update them with username and password. Something like this:

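# self._session is the requests session, self._url the sign-in page URL.
soup = BeautifulSoup(self._session.get(self._url).text, "html5lib")
login_data = {}
for field in soup.find_all("input"):
    # Use .get(): an input without name, type or value would raise a KeyError.
    if field.get("name") and field.get("type") in {"hidden", "text", "password"}:
        login_data[field["name"]] = field.get("value", "")

# The field names depend on the form you are scraping.
login_data["credential"] = my_username
login_data["password"] = my_password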

SAML

Sometimes, the authentication process is complicated. For one website, I needed to go through the SAML authentication protocol.

SAML authentication consists of several exchanges between client and server to generate the authentication token. The token has to be extracted from the redirection history.

To do so, the requests library lets you browse the redirection history of a response and update the session cookies.

In requests, access the response history with:

response.history  # the chain of redirect responses, as a list

You can then get the cookies from one of the responses as a dict:

saml_response_cookies = requests.utils.dict_from_cookiejar(
    response.history[1].cookies
)

Finally, update your session cookie jar with these cookies:

session.cookies.update(saml_response_cookies)

The information is not in the source code

Classic web scraping works well when all the content of the page is sent in the HTML on page load. What if it is a single-page app where all the content is loaded dynamically by JavaScript?

Look for XHR (AJAX calls) in the "Network" tab of the browser's dev tools. You can then replay these XHRs directly with requests and parse the response.

Most of the time, the response will be a JSON document. That's good news, as parsing JSON content is far easier than parsing HTML. Just use the json module from the Python standard library.
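As a minimal sketch of that last step, here is how the body of a replayed XHR could be parsed. The payload and its field names (invoices, id, amount) are made up for illustration; with requests you would typically get the same structure from response.json().

```python
import json

# Hypothetical body of an XHR response; the field names are assumptions.
payload_text = (
    '{"invoices": [{"id": "A-1", "amount": 12.5}, {"id": "A-2", "amount": 30}]}'
)


def invoice_ids(text):
    """Return the invoice identifiers found in a JSON document."""
    document = json.loads(text)
    # .get() keeps the function tolerant of a response without invoices.
    return [invoice["id"] for invoice in document.get("invoices", [])]


print(invoice_ids(payload_text))
```

No HTML parser and no CSS selector involved: once the document is loaded, it is plain dicts and lists.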

Sometimes the requests are built by the JavaScript code. I once scraped a website that converted session information to base64 to build a request query string in JavaScript. Luckily, these functions were readable, so I reimplemented them in my Python code.
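To give an idea of what such a reimplementation can look like, here is a sketch using only the standard library. The parameter name token and the JSON serialization are assumptions made for the example, not what that website actually did.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode


def build_query_string(session_info):
    """Encode session information as base64 and wrap it in a query string.

    The "token" parameter name and the JSON serialization are hypothetical.
    """
    raw = json.dumps(session_info, sort_keys=True).encode("utf-8")
    token = base64.urlsafe_b64encode(raw).decode("ascii")
    # urlencode takes care of percent-encoding the base64 padding.
    return urlencode({"token": token})


query = build_query_string({"user": "me", "page": 2})

# Round trip, to check that nothing was lost on the way.
decoded = json.loads(base64.urlsafe_b64decode(parse_qs(query)["token"][0]))
print(query)
print(decoded)
```

The resulting string can then be appended to the URL of the replayed request.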

I don’t see the whole document!

Sometimes, BeautifulSoup silently fails to parse your whole HTML document. It's really annoying, because you can spend a ridiculous amount of time before figuring out that the error comes from BeautifulSoup's parsing.

Try different parsers with BeautifulSoup. html5lib gave me the best results. I think you can add this module directly to your dependencies and use it by default:

soup = BeautifulSoup(content, "html5lib")

When everything fails

Sometimes though, reverse engineering how a website works is too time-consuming (the JavaScript is minified, a lot of state is kept on the client side…). I then switch to headless browser automation, using Selenium and Chrome with its headless option.

This is a heavier approach, as you need to script all the actions a real user would perform: loading a page, waiting for all the content to be loaded, filling in form fields, clicking on buttons.

Usually, I fire up Chrome without the headless option while writing the script. I switch back to headless when everything works.

Saturday, 7 April 2018

Download files with Python, Selenium and Chrome headless

These days I have recurring tasks that consist of automatically downloading files from the Internet. Usually, requests, BeautifulSoup and some tricks do the job effectively. Sometimes though, I have to play hardball and ask Selenium and headless Chromium* to do the heavy lifting. Alas, how to make Chromium download files automatically is not obvious.

I found the solution in the Chrome issue tracker, and you can read it below, written in Python:

import tempfile

from selenium import webdriver

# Let's create some options to make Chromium go headless
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')

# Launch the browser
# (older Selenium releases took this argument as chrome_options=)
browser = webdriver.Chrome(options=options)
# mkdtemp creates the directory and leaves it in place until you remove it
download_dir = tempfile.mkdtemp()

# Send a command to tell chrome to download files in download_dir without
# asking.
browser.command_executor._commands["send_command"] = (
    "POST",
    '/session/$sessionId/chromium/send_command'
)
params = {
    'cmd': 'Page.setDownloadBehavior',
    'params': {
        'behavior': 'allow',
        'downloadPath': download_dir
    }
}
browser.execute("send_command", params)

There you go, happy scraping!

* Of course, it works with regular Chrome too!

Wednesday, 20 September 2017

Recaptcha in Django

I've been writing a small website with Django these days. I wanted to put a captcha on a form to prevent an evil hacker from blowing up my database quotas. So I looked at Google's popular reCAPTCHA service.

According to the documentation, when processing the form, you have to send a token to Google and analyze the response to actually validate the form. Well, OK, it's not so complex, but I expected a two-minute solution!

Fortunately, Django is easily extensible through third-party apps, and an app exists for my reCAPTCHA needs.

So the message here is: if you want to add an external service to your website, look at what Django offers out of the box, or whether a Django app exists for that purpose.

You (are old and you) want to add an RSS / Atom feed? It comes directly with Django!

More on reCAPTCHA: if you want the current reCAPTCHA widget, which lets Google guess automagically whether you're a robot, add the option below to your settings.py:

NOCAPTCHA = True

And a last word about privacy. Google's reCAPTCHA relies on DoubleClick and on the fact that Google knows everything about everyone. I chose it because I believe users are used to it and because of its simplicity. If privacy is an issue for you (a legitimate concern), other kinds of captcha exist: color-based, math-based, etc.

Friday, 14 April 2017

logging.exception

While skimming the Python documentation today, I realized that the logging module provides an exception function. It logs at ERROR level, like logging.error, and adds the stack trace when you call it from an except block.

You may have noticed that if you log an exception with logging.error, you only get the exception message:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import logging

def send_unexpected_exception():
    raise Exception("I did not expect this")

logging.basicConfig(level=logging.DEBUG)

try:
    send_unexpected_exception()
except Exception as e:
    logging.error(e)

Output:

ERROR:root:I did not expect this

Now use logging.exception instead:
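The snippet for this step seems to have gone missing from the page. Presumably it was the same script with logging.exception in place of logging.error, something like this sketch:

```python
import logging


def send_unexpected_exception():
    raise Exception("I did not expect this")


logging.basicConfig(level=logging.DEBUG)

try:
    send_unexpected_exception()
except Exception as e:
    # Logs at ERROR level and appends the traceback of the current exception.
    logging.exception(e)
```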

Output:

ERROR:root:I did not expect this
Traceback (most recent call last):
  File "logging_exception.py", line 12, in <module>
    send_unexpected_exception()
  File "logging_exception.py", line 7, in send_unexpected_exception
    raise Exception("I did not expect this")
Exception: I did not expect this

I think it’s an interesting way to provide feedback and context if your app raises an unexpected exception.

Wednesday, 23 November 2016

My talk at Pyconfr 2016

This year I got the chance to speak at PyConFR! It was motivating for me, as it is a national event and, as you may know, I really like Python.

I was a bit nervous, of course, but as I rehearsed before the talk, I gained confidence. That was lucky, because I was not very motivated to rehearse. But if I had not, my talk would have been lame. I mean, lamer than it actually was!

For the record, I submitted 2 talks to the CFP this year. The first one was about Hypothesis and property-based testing. I really wished it would be accepted, as it would have motivated me to explore this library.

The one that was accepted is Help, we do not have any Python project in my company, which was not so easy to present. You can watch the video here!

And, yes, it's in French.

Tuesday, 25 October 2016

Namedtuples in Python

Lately, I added namedtuples to my Python programming vocabulary. They allow you to create data structure classes in one line.

Start by importing them from the collections module:

from collections import namedtuple

Define a new class as a namedtuple:

Person = namedtuple("Person", ("firstname", "lastname"))

You can now create new instances of Person, as you would with any other class:

john = Person("John", "Doe")

And if you print it:

In [8]: print(john)
Person(firstname='John', lastname='Doe')

(yes, I use IPython, don't you?)

Now let's see what namedtuples give you.

Unpacking:

In [9]: f,l = john

In [10]: f
Out[10]: 'John'

In [11]: l
Out[11]: 'Doe'

Field access by name:

In [12]: john.firstname
Out[12]: 'John'

In [13]: john.lastname
Out[13]: 'Doe'

You also have access by index:

In [27]: john[0]
Out[27]: 'John'

In [28]: john[1]
Out[28]: 'Doe'

(OK, I tried some stuff while writing the article)

And that means you can iterate on them, great!

In [31]: for value in john:
   ....:     print(value)
   ....:
John
Doe

You can retrieve the indexes of defined values (like in tuples):

In [29]: john.index('Doe')
Out[29]: 1

And count the occurrences of a value for free (also like in tuples, and useless in my example):

In [30]: john.count('John')
Out[30]: 1

There's more. Unlike with regular classes, equality is defined for you for free:

In [32]: john2 = Person('John', 'Doe')

In [33]: john == john2
Out[33]: True

And last but not least, like standard tuples, namedtuples are immutable:

In [34]: john.firstname = "Billy"
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-a7d9f29302d8> in <module>()
----> 1 john.firstname = "Billy"

AttributeError: can't set attribute

This last one is awesome.

So what is cool about all that? The great strength of tuples is that they are immutable, unlike lists. That's why, when you have a sequence of values that is not subject to change, you should consider creating it as a tuple by default. Tuples are memory efficient and give you the assurance that nothing will alter them in your application. Besides, the syntax to create them is a bit shorter:

In [39]: t = 1, 2, 3, 4, 5 # you don't even need the parens!

In [40]: t
Out[40]: (1, 2, 3, 4, 5)

Named tuples extend this ability to any data structure you could create, giving you access to fields by name for readability.

The only drawback is that the one-line declaration leaves no room for methods or properties, as immutable data structures in other languages would; to add some, you have to subclass the generated class. Yet it is still a nice feature of Python.
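For the record, here is a minimal sketch of that subclassing route. The fullname property is a made-up example, not something from the collections module.

```python
from collections import namedtuple


class Person(namedtuple("Person", ("firstname", "lastname"))):
    """A Person namedtuple extended with a computed property."""

    # Keep instances as light as plain tuples (no per-instance __dict__).
    __slots__ = ()

    @property
    def fullname(self):
        return "{} {}".format(self.firstname, self.lastname)


john = Person("John", "Doe")
print(john.fullname)
print(john == ("John", "Doe"))  # still compares like a tuple
```

The instance stays immutable and still behaves like a tuple; it just costs a few more lines than the one-liner.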

Tuesday, 23 August 2016

Playing with Tkinter

I committed to producing a GUI for a utility at work. My idea was to use the Tkinter module in Python. It was a great pretext to use it for the first time!

Tkinter is a GUI toolkit provided with Python's standard distribution. That's great, since it avoids the burden of installing an external dependency on target systems. I'm impressed by how simple development is. I faced far fewer difficulties than I had with wxWindow in the past (I'm also more experienced now, though).

Before coding, I believed that the toolkit would produce bad-looking UIs. Actually that's the case, unless you use the themed widgets from the ttk submodule.

I struggled to find decent documentation, as there is too little of it on Python's website. I found something relevant here (doc also available in PDF). Stack Overflow took care of the rest.

Epilogue: eventually the piece of software will be coded in VB.NET and integrated into a larger app. It's a shame! Still, I had fun working with Tkinter.

Friday, 2 October 2015

Now I'm using py.test and you should too

You all know that Python comes with batteries included. In particular, it ships with the unittest module, a test framework inspired by JUnit. Overall, it's a great module, to which nice features have been added in the latest versions of the language:

New assertions
Mocks
Test discovery

However, it has annoying drawbacks: its verbosity and the fact that it's not PEP 8 compliant (or let's say not Pythonic). Even JUnit got better with Java 1.5 and the use of annotations, freeing the programmer from extending a base TestCase class and naming the testing methods test_something.

If you're tired of this, or if you just want to spend less effort on testing, give py.test a try.

Name a function test_something and write your assertions using Python's assert statement. Then simply run:

py.test

This will discover and run all the tests in your project directory structure. The results are then displayed in a detailed way (configurable, of course).

Though not everything is so bright, I enjoy using it on a daily basis. Here is my feedback.
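To make this concrete, a test module in this style could look like the sketch below. The slugify function is a hypothetical piece of code under test, invented for the example; saved in a file named test_slugify.py, the module would be picked up by the runner.

```python
def slugify(title):
    """Hypothetical function under test: lower-case a title, join words with dashes."""
    return "-".join(title.lower().split())


# No class, no self.assertEqual: plain functions and plain assert statements.
def test_slugify_joins_words_with_dashes():
    assert slugify("The Zen Of Python") == "the-zen-of-python"


def test_slugify_ignores_extra_spaces():
    assert slugify("  Zen   of Python ") == "zen-of-python"
```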

Good points:

It's very concise: no subclassing (not even a class) needed, no self.assertWeirdStuff either. Avoiding this boilerplate is a measurable time saver. It's a reason to switch by itself.
It also runs your unittest-style tests and your doctests, so you can take your time to migrate your test base.
Output is configurable and very helpful. For instance, the -vv switch provides a detailed diff, which I used lately to compare blocks of text. It can also help to integrate with a CI platform.
Lots of plugins are available, for instance to write test cases in Gherkin or to implement property-based testing.

Not so good points:

It is not included in the standard distribution, so you have to install it through pip. Not so long ago, you also had to install pip. For Python beginners, it is not so simple (believe me).
Forget what you know about setup and teardown methods. These features are available in py.test through the @pytest.fixture decorator. The thing is, while implementing setup through a fixture is straightforward, teardown is a bit weird when you're not used to it. See below for the explanation.
Lots, lots, lots of options. Not an absolutely bad thing, you'll say. But all I want to do is test!

So you see the bad points are not so bad ^_^

Here is the trick to introduce teardown behaviour, extracted from py.test official documentation:

import smtplib
import pytest

@pytest.fixture(scope="module")
def smtp(request):
    smtp = smtplib.SMTP("merlinux.eu")
    def fin():
        print ("teardown smtp")
        smtp.close()
    request.addfinalizer(fin)
    return smtp # provide the fixture value

Explanations:

The scope parameter lets you choose whether the fixture should be run for each test function, for each test class or once for the whole test module.
The request parameter of the fixture function gives you access to the requesting test context. The teardown function should be registered on the request with addfinalizer, which accepts a callable.

To conclude, use py.test. It is really easy to put tests in place and it's highly versatile. And you won't lose your time writing boilerplate.

Tuesday, 29 September 2015

The Zen Of Python

Do you know the Zen of Python? It is also known as PEP 20. There are good programming principles in it! It's a shame that it is not applied more by practitioners of other languages.

In case of emergency, if you are off-line, just type in your favorite REPL:

In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Thursday, 11 June 2015

Check if a string is a substring of another in Python

I learned a simple yet extremely useful trick in Python lately. I think it was while watching a talk from Raymond Hettinger.

To check whether a string contains a substring in Python, you need neither str.find nor str.index. You can use the in operator instead:

In [5]: if 'bar' in 'foobarbaz': print('bam!')
bam!

Notice that it is not as obvious as you may think. As you may know, a str is a sequence, much like a list. You can access a character or a range just like with a list:

In [6]: 'foobarbaz'[2]
Out[6]: 'o'

You can iterate on it:

In [7]: for i, c in enumerate('foobarbaz'):
   ...:     print('Char {0} is {1}.'.format(i, c))
   ...:
Char 0 is f.
Char 1 is o.
Char 2 is o.
Char 3 is b.
Char 4 is a.
Char 5 is r.
Char 6 is b.
Char 7 is a.
Char 8 is z.

But in cannot be used to check whether a list is included in another:

In [9]: [2, 3] in [1, 2, 3]
Out[9]: False

That's why I was a bit amazed!
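On a list, in only tests whether the left operand is one of the elements. If you do need the list equivalent of the substring test, you have to write it yourself. Here is one possible sketch; contains_sublist is a name I made up, nothing from the standard library.

```python
def contains_sublist(haystack, needle):
    """Return True if needle appears as a contiguous run inside haystack."""
    n = len(needle)
    # Compare needle with every slice of the same length.
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


print(contains_sublist([1, 2, 3], [2, 3]))  # contiguous run, order matters
print(set([2, 3]) <= set([1, 2, 3]))        # subset test, order ignored
```

The set comparison is the way to go when order and duplicates do not matter.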

That’s all for today: I realized I have some scripts to simplify using this nice feature!

Monday, 1 June 2015

Please learn a scripting language and use it.

A conversation I had last Friday:

Alex: “Okay, so I have to separate table creation statements and alter table statements. But it’ll be tedious copy-and-paste work!”

Me: “Why don't you script that? At least use Excel to generate your queries from a list of tables…” (can’t believe I said that)

Alex: “It'll make the job even harder!”

I sighed.

I’m shocked by the number of my fellow programmers who don’t take the time to learn a scripting language. I don’t care which one you use, Python, Ruby, Perl, Bash or PowerShell… but learn (at least) one. Automate all repetitive tasks when feasible: you are paid to bring value by being creative, not to do a machine’s job. And guess what, the machine is infinitely more reliable and faster at this than you are.

I write lots of scripts, and it has saved my butt lots of times. I script app deployment, I script KPI gathering, I script commercial proposal generation, I script service reference modification when WCF makes a mess. I get furious when a piece of software I have to use provides neither an API nor a CLI.

Please learn a scripting language and stop wasting your time, wasting my time and screwing up the whole project with your hazardous copy-and-paste scenarios.

Wednesday, 13 May 2015

Getting Started With Python

When I confess my love for the Python language, people often tell me they never really got into it. They may just be polite… while thinking “using indentations to delimit code blocks?! Come on!”

…

or they may actually want to learn! Here is my advice to get started with this beautiful piece of technology.

Start with Python 3

As you may know, there are still two branches of Python: Python 2 and Python 3. Today, Python 2 is maintained for compatibility reasons, as some specific libs have not been upgraded yet. However, as a beginner, you should start with Python 3, as it comes with fewer gotchas (better Unicode support and a single style of classes are two examples of improvements).

Install it with your favorite package manager on Linux, Homebrew on OS X (choose the python3 package) or download it from python.org if you’re on Windows.

Choose a good editor

If you already are a programmer, you may have a favorite editor, so stick with it. If not, you may start with IDLE, which normally comes with the Python distribution. I think the mandatory features for an editor in this context are:

Syntax highlighting, to help you learn Python’s syntax,
Auto-indentation, because typing the indentation yourself every time is cumbersome.

Choose a good REPL

Run without arguments, python provides a REPL (Read-Eval-Print Loop). It’s a good idea, but it’s not very convenient. You should install IPython. It will give you:

Autocompletion with the Tab key, for both the Python language and file system paths. It lets you browse package contents easily.
Easy access to documentation. Just type a function name, then ?, then Return.
Lots of other cool things that I’ll let you dig into by yourself.

Most of the time, I have an IPython session running in a terminal to check stuff while I’m coding.

You can install IPython with pip, the package manager that comes with Python 3:

pip install ipython

On Windows, you'll need to add the pyreadline package to get colors and Unix-like terminal keystrokes.

pip install pyreadline

(you'll be prompted to do so when you start ipython without it anyway)

IPython also comes with a feature called the notebook. It allows you to run a Python interpreter from within a web app. It brings a lot of cool stuff, like editing Markdown or displaying graphics and animations. To use the notebook, you'll have to install the ZeroMQ wrapper for Python and jsonschema.

pip install zmq jsonschema

Then run ipython3 with the notebook option:

ipython3 notebook

Unit testing

To be comfortable when learning a new language, I need a unit test framework. Python ships with the unittest module, which is very much like the early versions of JUnit. I used it a lot. Then I tried py.test and I'll never look back!

py.test offers a very concise syntax to write tests, and the runner comes with a lot of options. Moreover, you can install additional plugins to do BDD or to use it with a web framework. You can also install it with pip.

pip install pytest

Launch it with py.test command, and it will discover and run all the tests in a given directory structure.

Web framework

If you want to show what you've coded to your wife and kids, you'd better display the result in a web browser than in an austere terminal window.

I advise you to start with Bottle, a very simple yet efficient framework if you want to do simple things. Write functions to serve HTTP routes, launch it and you're done.

Conclusion

I hope this article gave you the keys to enter the magical realm of the Python programming language. I hope you'll have fun while learning it!

Thursday, 9 April 2015

dict.setdefault versus defaultdict

There is something annoying with dict in Python: when you want to use a key that is not defined, you get a KeyError. The solution is to be able to get a default value for the key when no value has been defined. Actually, there are two solutions.

The first one is to use the dict.setdefault method. It returns the value for a key and, if the key is undefined, first stores the given default under it:

In [3]: t = {}

In [4]: a = t.setdefault("foo", "bar")

In [5]: a
Out[5]: 'bar'

(In case you wonder, this is IPython's output. You must use IPython as a REPL)

The thing is, this solution makes assignment cumbersome. For instance, when I want to count or aggregate values for a given key, I cannot do it this way.

In [6]: t = {}

In [7]: t.setdefault("foo", 0) += 1
  File "<ipython-input-7-15225840c433>", line 1
      t.setdefault("foo", 0) += 1
                                 ^
 SyntaxError: can't assign to function call
 # shit.

So you'll need something heavier: collections.defaultdict. A defaultdict is a dict-like object that returns a default value, provided by a factory function, for every undefined key. You just need to import it and set up the factory (I'll do it with a lambda).

In [8]: from collections import defaultdict

In [9]: dd = defaultdict(lambda: 0)

In [10]: dd["foo"]
Out[10]: 0

In [11]: dd["bar"] += 1

In [12]: dd["bar"]
Out[12]: 1
# Hooray!

To conclude, you'll often need a default value when querying an unknown key from a dict. To get one, the most flexible way might be the defaultdict, as you can use it like a standard dictionary.

Last remark: the collections module contains a lot of cool stuff. For instance, if you actually want to count (hashable) elements from an iterable, you should look at collections.Counter, which does the job directly for you.
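A quick sketch of what that looks like; the list of words is just sample data for the example.

```python
from collections import Counter

words = ["foo", "bar", "foo", "baz", "foo", "bar"]
counts = Counter(words)

print(counts["foo"])          # number of times "foo" was seen
print(counts["missing"])      # unknown keys count as 0, no KeyError
print(counts.most_common(2))  # the two most frequent elements
```

No setdefault, no factory: the counting loop is gone altogether.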

Wednesday, 1 April 2015

Cool stuff with generators in Python

You can do great things using generators in Python.

A generator is a special function or method that keeps its state between calls. Its most common use is to produce one result at each iteration of a loop. You can easily recognize a generator since it uses the keyword yield instead of return.

Example:

def simple_generator():
    for i in range(10):
        yield i

for value in simple_generator():
    print(value)

# Result:

0
1
2
3
4
5
6
7
8
9

You can also use this persistent state to do more complicated things. Some days ago I wanted to write a little parser and formatter for performance logs. It would list calling methods, called methods and time spent in calls. The goal was to track the relation between calling and called methods to produce sequence diagrams and CSV data to look for performance pitfalls.

The logs display the thread information: it should be used to avoid a complete mess in the calling sequences. So the log file may look like:

timestamp [1] called execution called_method1: 1.432 seconds
timestamp [1] called execution called_method2: 0.546 seconds
timestamp [2] called execution called_method1bis: 0.456 seconds
timestamp [2] called execution called_method2bis: 0.456 seconds
timestamp [2] called execution called_method3bis: 0.456 seconds
timestamp [1] calling execution calling_method1: 3.000 seconds
timestamp [2] calling execution calling_method2: 2.697 seconds

Where the thread number is between square brackets. I want something like this as an output:

[
    {
        'name': 'calling_method1',
        'thread': 1,
        'duration': '3.000',
        'called_methods': [
            {
                'name': 'called_method1',
                …
            }
        ]
    },
    {
        'name': 'calling_method2',
        'thread': 2,
        'duration': '2.697',
        'called_methods': [
            {
                'name': 'called_method1bis',
                …
            }
        ]
    },
]

Parsing code using a generator could look like:

import copy
import re

# Parsing regexes. They use named groups defined by (?P<name>) as they improve
# readability and extensibility. The suffix must match the log: "seconds".
CALLING_RE = re.compile(
    r'^.*\[(?P<thread>\d+)\].*calling execution (?P<method>.*)'
    r': (?P<duration>.*) seconds')
CALLED_RE = re.compile(
    r'^.*\[(?P<thread>\d+)\].*called execution (?P<method>.*)'
    r': (?P<duration>.*) seconds')

def extract(file_lines):
    # The dict keeps the state of the generator. It is used to organize
    # calling / called chains by thread number.
    current_elts = {}

    for line in file_lines:

        g = CALLING_RE.search(line)
        f = CALLED_RE.search(line)

        # If I have a calling match, I build the whole calling chain for the
        # thread and return a copy of the CallingElement.
        if g:
            current_thread = g.group("thread")
            # I could have no CalledElement here, hence the use of
            # setdefault method.
            elt = current_elts.setdefault(current_thread, CallingElement())
            elt.calling_method = g.group("method")
            elt.calling_duration = g.group("duration")
            elt.calling_thread = current_thread
            # A copy is returned since I want to get rid of the original
            # reference.
            new_elt = copy.deepcopy(elt)
            # I del the current thread entry, so it can be reused for another
            # calling chain.
            del current_elts[current_thread]
            yield new_elt
        if f:
            # In regular cases, the calling chain starts with a called
            # method.
            # CallingElement is the container class defined below.
            elt = current_elts.setdefault(f.group("thread"), CallingElement())
            elt.called_method.append(
                CalledElement(f.group("method"), f.group("duration"))
            )
            # I yield None. It doesn't seem so elegant to me but I'll stick
            # with it. I yield anyway because the generator has to return
            # something at each iteration. It can be seen as a map function
            # over a stream content, not a filter.
            yield None

# These two classes are data structures convenient to store logged
# information.

class CallingElement:
    def __init__(self):
        self.calling_method = None
        self.calling_duration = None
        self.calling_thread = None
        self.called_method = []

class CalledElement:
    def __init__(self, name, duration):
        self.name = name
        self.duration = duration

# One can then extract the chains from a log file like this:
# I filter the generator to leave None entries behind.
for calling_meth in (m for m in extract(open("file.log")) if m):
    …

I like the atomicity of the approach. The extraction process does carry a state, and yet it remains within a single function, which reduces the side effects I would have with a full-blown object.

I also like the fact that the input is processed as a stream, with obviously a low memory footprint and a good performance level.

Thursday, 19 February 2015

Experimenting with event sourcing 3

Third and last article on this topic. You can find the others here and there.

I left off with a working implementation of the reading ability for the PersonRegistry. It was straightforward but not efficient. The idea to improve it is to make the EventTimeLine trigger a write into a more read-friendly data structure, such as a dictionary. To do so, we can simply implement the Observer pattern: the registry will declare itself as a subscriber to the timeline, to be notified of every event.

class PersonRegistry:
    """Provide services to deal with people"""

    def __init__(self, timeline):
        """Initialize timeline event"""
        self.timeline = timeline
        timeline.add_subscriber(self)
        self._current_id = 0
        # _registry will allow me to store 
        # the state of the system in order to
        # simplify querying
        self._registry = {}

    …

    def notify(self, data):
        """Get notified that something happened in the
        system. It should be called by an event emitter
        only to update internal data representation."""

        # data is the event data, sent back by the EventTimeLine

        if 'personId' in data.keys():
            person_id = data['personId']
            if data['type'] == EventTimeLine.PERSON_CREATION:
                self._registry[person_id] = {
                    'name': data['name'],
                    'address': data['address'],
                    'status': data['status']
                }
            if data['type'] == EventTimeLine.PERSON_STATUS_CHANGE:
                self._registry[person_id]['status'] = data['newStatus']

The EventTimeLine also allows subscription and notifies its listeners on every event.

import datetime


class EventTimeLine:

    PERSON_CREATION = 1
    PERSON_STATUS_CHANGE = 2

    def __init__(self):
        """Initialize the inner event list"""
        self.event_list = []
        # Yes subscribers are in a set, it'll prevent multiple notifications
        # if a subscriber is added several times.
        self._subscribers = set()

    def add_event(self, event_data):
        """Add an event in the event list."""
        event_data['_datetime'] = datetime.datetime.today()
        self.event_list.append(event_data)
        # Notify all the listeners
        self._notify_all(event_data)

    def _notify_all(self, event_data):
        # trivial…
        for subs in self._subscribers:
            subs.notify(event_data)
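To see the subscription mechanism in isolation, here is a minimal sketch. The body of add_subscriber and the RecordingSubscriber class are assumptions made for the illustration; they are not taken from the repository.

```python
import datetime


class EventTimeLine:
    """Minimal event store that notifies subscribers on every new event."""

    def __init__(self):
        self.event_list = []
        # A set ignores duplicates, so subscribing twice notifies only once.
        self._subscribers = set()

    def add_subscriber(self, subscriber):
        """Register an object exposing a notify(event_data) method (assumed body)."""
        self._subscribers.add(subscriber)

    def add_event(self, event_data):
        """Store the event, then tell every subscriber about it."""
        event_data['_datetime'] = datetime.datetime.today()
        self.event_list.append(event_data)
        for subscriber in self._subscribers:
            subscriber.notify(event_data)


class RecordingSubscriber:
    """Hypothetical listener that just remembers what it was told."""

    def __init__(self):
        self.received = []

    def notify(self, data):
        self.received.append(data)


timeline = EventTimeLine()
listener = RecordingSubscriber()
timeline.add_subscriber(listener)
timeline.add_subscriber(listener)  # second call changes nothing
timeline.add_event({'type': 1, 'personId': 1})
```

Since the listener is held in a set, it must be hashable. A plain Python object already is, by identity, as long as it does not define its own equality.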

And… it works! You can find the whole code in commit 5488ee5 on GitHub. Yet we can go a bit further and adopt CQRS simply by splitting the PersonRegistry service into a service for commands (still PersonRegistry) and a service for queries (PersonRegistryReader). The latter subscribes to the EventTimeLine and bears the get_person_by_id method. This is a big change since it modifies the API, but it is acceptable to me (I'm the only user BTW :p).

class PersonRegistryReader:
    """PersonRegistry aimed at read access"""

    def __init__(self):
        """Initialize the read registry"""
        self._registry = {}

    def notify(self, data):

        if 'personId' in data.keys():
            person_id = data['personId']
            if data['type'] == EventTimeLine.PERSON_CREATION:
                self._registry[person_id] = {
                    'name': data['name'],
                    'address': data['address'],
                    'status': data['status']
                }
            if data['type'] == EventTimeLine.PERSON_STATUS_CHANGE:
                self._registry[person_id]['status'] = data['newStatus']

    def get_person_by_id(self, demanded_id):
        """Retrieve a person from its identifier in the system."""
        stored_person = self._registry[demanded_id]
        returned_person = Person(
            stored_person['status'],
            Address(
                stored_person['address']['street'],
                stored_person['address']['city']
            ),
            Name(
                stored_person['name']['firstname'],
                stored_person['name']['lastname']
                )
        )
        return returned_person

    def __hash__(self):
        return id(self)

Extracting was easy: it is just a cut and paste from PersonRegistry. Two remarks:

I keep the Person value object structure that I use for commands, but I really don't have to. I could also return a dict directly or a JSON string. That is a strength of the CQRS approach: the command and query domain models can even differ.
I did not mention it, but I've been TDDing this example. I focused on keeping the tests green as long as possible. To do so, I keep the existing API alive while I'm building the new one. I switch to the new one when it's done, then clean up while the tests stay green. I want to be able to ship at any time.

The creation and injection of all these objects are done in the web app. In the last commits, I added an event listener to produce logs from the timeline and versioning for persons, just for fun and because it was easy.
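As a rough idea of what that wiring could look like, here is a stripped-down sketch. Persons are reduced to a name and a status, and the EventLogger listener, the change_status name and the way everything is assembled are assumptions for the illustration, not the code from the repository.

```python
class EventTimeLine:
    PERSON_CREATION = 1
    PERSON_STATUS_CHANGE = 2

    def __init__(self):
        self.event_list = []
        self._subscribers = set()

    def add_subscriber(self, subscriber):
        self._subscribers.add(subscriber)

    def add_event(self, event_data):
        self.event_list.append(event_data)
        for subscriber in self._subscribers:
            subscriber.notify(event_data)


class PersonRegistry:
    """Command side: only emits events, never answers queries."""

    def __init__(self, timeline):
        self.timeline = timeline
        self._current_id = 0

    def create(self, name, status):
        self._current_id += 1
        self.timeline.add_event({
            'type': EventTimeLine.PERSON_CREATION,
            'personId': self._current_id,
            'name': name,
            'status': status,
        })
        return self._current_id

    def change_status(self, person_id, new_status):
        self.timeline.add_event({
            'type': EventTimeLine.PERSON_STATUS_CHANGE,
            'personId': person_id,
            'newStatus': new_status,
        })


class PersonRegistryReader:
    """Query side: keeps a dict up to date from the events it receives."""

    def __init__(self):
        self._registry = {}

    def notify(self, data):
        person_id = data['personId']
        if data['type'] == EventTimeLine.PERSON_CREATION:
            self._registry[person_id] = {
                'name': data['name'],
                'status': data['status'],
            }
        elif data['type'] == EventTimeLine.PERSON_STATUS_CHANGE:
            self._registry[person_id]['status'] = data['newStatus']

    def get_person_by_id(self, demanded_id):
        # Return a copy so callers cannot alter the read model.
        return dict(self._registry[demanded_id])


class EventLogger:
    """Hypothetical listener collecting one log line per event."""

    def __init__(self):
        self.lines = []

    def notify(self, data):
        self.lines.append(
            'event type=%s person=%s' % (data['type'], data['personId'])
        )


# Wiring, roughly as the start-up code of a web app might do it.
timeline = EventTimeLine()
reader = PersonRegistryReader()
logger = EventLogger()
timeline.add_subscriber(reader)
timeline.add_subscriber(logger)
commands = PersonRegistry(timeline)

person_id = commands.create('Ada', 'single')
commands.change_status(person_id, 'married')
```

The command service never sees the reader or the logger. Adding a new listener only touches the wiring.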

To conclude, I had great pleasure building this example and getting a sense of what event sourcing means. I wanted to show that it is actually a simple yet very powerful concept. I also wanted to propose a way to build apps incrementally, by making some compromises at first (see the previous article), then continuously improving without breaking anything. Everything was made within iterations of 30 minutes to 1 hour. Finally, I wanted to show that programming or design concepts can be expressed using only language capabilities and data structures, without introducing a fancy NoSQL system or AMQP message queues hosted in the cloud and blah blah blah.

Of course, the example is really simple and there are ways to improve it: physically separate the query and command models, externalize the EventTimeLine, add more controls, etc., but I'll leave it that way. That would just be tooling, and it would make the architecture too complex for the example.

I hope you enjoyed reading these articles!

Tuesday, February 17, 2015

Experimenting with event sourcing 2 - reading

Second article about playing with event sourcing. You can have a look at the first one here.

I left you with the need to read your model. As you will see, my first implementation is straightforward: it consists of iterating through the EventTimeLine, filtering down to what we need and mutating an object that is convenient to read. The retrieval method is in the PersonRegistry service:

def get_person_by_id(self, demanded_id):
    returned_person = None
    person_events = (
        p for p in self.timeline 
        if p['personId'] == demanded_id
    )
    for event in person_events:
        if event['type'] == EventTimeLine.PERSON_CREATION:
            returned_person = Person(
                event['status'],
                Address(
                    event['address']['street'],
                    event['address']['city']
                ),
                Name(
                    event['name']['firstname'],
                    event['name']['lastname']
                    )
            )
        if event['type'] == EventTimeLine.PERSON_STATUS_CHANGE:
            returned_person.status = event['newStatus']
    return returned_person

It works, but there are obvious drawbacks.

I have to iterate through the whole EventTimeLine every time I need a person, which will take longer and longer as my timeline grows (list → O(n) reads).
I may have to mutate my entity object many times, for nothing.

EventTimeLine is great for writing but not so convenient for reading. The solution is to use the events to populate an intermediate data structure with better read access. To do so, we can use the Observer pattern. That is for the next article.

The repository is updated with the model code, tests and a UI written using Bottle.

Saturday, February 14, 2015

Livereload FTW

If you deal with HTML on a daily basis, you should really use a livereload tool. What do these tools do?

watch a set of files,
send a notification to the web browser to trigger a page reload. To do so, a browser extension is used.
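The watching half can be illustrated with the standard library alone. This is only a sketch of the general idea (compare modification times between two snapshots); it is an assumption for the sake of the example, not a description of how the livereload package is implemented, and the function names are made up.

```python
import os
import tempfile


def snapshot(paths):
    """Map each existing path to its last-modification time in nanoseconds."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed_files(before, after):
    """Return paths that are new or whose mtime differs between snapshots."""
    return sorted(p for p, mtime in after.items() if before.get(p) != mtime)


with tempfile.TemporaryDirectory() as folder:
    page = os.path.join(folder, 'index.htm')
    with open(page, 'w') as handle:
        handle.write('<p>v1</p>')
    before = snapshot([page])

    with open(page, 'w') as handle:
        handle.write('<p>v2</p>')
    # Force a distinct mtime so the demo does not depend on clock resolution.
    later = before[page] + 2 * 10**9
    os.utime(page, ns=(later, later))
    after = snapshot([page])

    changed = changed_files(before, after)
    unchanged = changed_files(after, snapshot([page]))
```

A real tool would run this comparison in a loop (or rely on operating system notifications) and push a message to the browser extension whenever the list of changed files is not empty.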

I use an implementation of livereload in Python. Install it with pip:

pip install livereload

It comes with a livereload command that serves on port 35729 and watches everything in the current working directory:

romain:~/Documents/sites/blogs/romCodeCorner なに? livereload
Serving on 127.0.0.1:35729
[I 150214 21:50:02 handlers:99] Browser Connected: file:///Users/romain/Documents/sites/blogs/romCodeCorner/cqrs.htm
[I 150214 21:50:02 handlers:104] Watch current working directory
[I 150214 21:50:02 handlers:108] Start watching changes
[I 150214 21:50:03 handlers:49] File /Users/romain/Documents/sites/blogs/romCodeCorner/tools_should_help_our_creativity.md changed

Note that on Windows, you should run it through python with the absolute path. For instance:

python C:\Python34\scripts\livereload

Then open a watched file in your browser and activate the extension. Modify a watched file and enjoy auto refresh.

There is more, of course, as you can read on the PyPI page. You can trigger a command when a watched file is modified. I wrote the following script to watch markdown files, transform them into HTML and reload the page.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from livereload import Server

# create a server
server = Server()
# run make every time a markdown file is modified
server.watch('*.md', 'make')
# serve on livereload extension default listening port.
server.serve(port=35729)

It's neat.

I'm a huge fan of GNU make… The Makefile I use to compile markdown files is:

all: $(shell ls *.md | sed 's/\.md/.htm/')

%.htm: %.md
        markdown_py -e utf-8 -f $@ $^

clean: 
        rm *.htm

Of course, I use the Python implementation of markdown ;-)

Experimenting with event sourcing

Some months ago I got interested in CQRS and the event sourcing approach. Yesterday, I took some time to experiment a bit with event sourcing in Python.

The goal of the event sourcing approach is to induce changes in a system by saving all the events that happen in it rather than mutating its state. Here are some possibilities that it brings:

you do not have to know the whole state of the entities of your system. All you need is their unique identifier to create events related to them,
all you have to care about at first is the persistence of your event store,
you can theoretically bring the system back to any state of its history, take a snapshot and represent it the way you want,
as you store functional events (create user, sell item, whatever), data migration is not needed. All you have to think about is the way you propagate the events through your whole information system.
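The "any state of its history" point can be sketched as a replay over a list of events. The event shapes and the replay function below are assumptions made for the illustration.

```python
PERSON_CREATION = 1
PERSON_STATUS_CHANGE = 2


def replay(events, upto=None):
    """Fold the first `upto` events (all of them by default) into a state dict."""
    state = {}
    for event in events[:upto]:
        person_id = event['personId']
        if event['type'] == PERSON_CREATION:
            state[person_id] = {'status': event['status']}
        elif event['type'] == PERSON_STATUS_CHANGE:
            state[person_id]['status'] = event['newStatus']
    return state


events = [
    {'type': PERSON_CREATION, 'personId': 1, 'status': 'single'},
    {'type': PERSON_STATUS_CHANGE, 'personId': 1, 'newStatus': 'married'},
    {'type': PERSON_CREATION, 'personId': 2, 'status': 'single'},
]

state_after_first_event = replay(events, upto=1)
current_state = replay(events)
```

The event list itself is never modified. Each snapshot is just a different reading of the same history.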

So the idea is pretty cool. It seems disconcerting at first. Note, though, that this is the way our well-known relational database management systems work. The details of every transaction are first written to redo logs, even if the transaction is in progress and not committed. This way, the RDBMS is able to rerun a set of operations in case of a crash, reducing the risk of losing an important part of a transaction. This is event sourcing! And it is considered the safest way to track all the transactions in the system.

Let's look at some code. The stuff I wrote on the train is about creating person entities, or in other words associating them with an identifier in the system, and being able to change their situation, like their marital status or the fact that they have moved. It could be useful for an income declaration app.

I begin with creating a service to deal with people. I call it PersonRegistry. With it, I can create a person in the system and change her marital status:

class PersonRegistry:
    """Provide services to deal with people"""

    # timeline parameter is my event store
    def __init__(self, timeline):
        """Initialize timeline event"""
        self.timeline = timeline
        self._currentId = 0

    def create(self, person):
        """Create a person in the system"""
        self._currentId += 1
        self.timeline.addEvent({
            'type': EventTimeLine.PERSON_CREATION,
            'personId': self._currentId,
            'status': person.status,
            'address': person.address.to_dict(),
            'name': person.name.to_dict()
        })
        return self._currentId

    def changeStatus(self, personId, newStatus):
        self.timeline.addEvent({
            'type': EventTimeLine.PERSON_STATUS_CHANGE,
            'personId': personId,
            'newStatus': newStatus
        })

I wanted the way events are stored to be very straightforward. They're simple dicts, easily serializable and readable.

To act as a serious Domain Driven Design guy, I created some value objects as well:

class Person:

    SINGLE = 1
    MARRIED = 2

    def __init__(self, status, address, name):
        self.status = status
        self.address = address
        self.name = name


class Name:
    def __init__(self, firstname, lastname):
        """Blah Blah Blah"""
        self.firstname, self.lastname = firstname, lastname

    def to_dict(self):
        return {'firstname': self.firstname, 'lastname': self.lastname}


class Address:
    def __init__(self, street, city):
        self.street = street
        self.city = city

    def to_dict(self):
        return {'street': self.street, 'city': self.city}

The event store is called EventTimeLine. It lets you add events to the system with addEvent and retrieve them by iterating over the object.

import datetime


class EventTimeLine:
    """Basically, the list of all events in the system. Allows to add an event.
    The object itself is iterable if you want to browse the created events."""

    PERSON_CREATION = 1
    PERSON_STATUS_CHANGE = 2

    def __init__(self):
        """Initialize the inner event list"""
        self.event_list = []

    def addEvent(self, eventData):
        """Add an event in the event list."""
        eventData['_datetime'] = datetime.datetime.today()
        self.event_list.append(eventData)

    def __iter__(self):
        for e in self.event_list:
            yield e

I can store my events. It is a good start for the first iteration. I can ship it to the users, so they can start populating their system with people.
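Since the events are plain dicts, persisting them could be as simple as writing one JSON string per event. Here is a hedged sketch: the encode_event and decode_event helpers are hypothetical, and the only subtlety is the `_datetime` field, which the json module cannot serialize without help. It assumes Python 3.7 or later for fromisoformat.

```python
import datetime
import json


def encode_event(event):
    """Serialize one event dict to JSON; datetimes become ISO 8601 text."""
    def fallback(value):
        # Called by json.dumps only for values it cannot handle itself.
        if isinstance(value, datetime.datetime):
            return value.isoformat()
        raise TypeError('Cannot serialize %r' % (value,))

    return json.dumps(event, default=fallback, sort_keys=True)


def decode_event(line):
    """Inverse of encode_event: restore the `_datetime` field as a datetime."""
    event = json.loads(line)
    event['_datetime'] = datetime.datetime.fromisoformat(event['_datetime'])
    return event


event = {
    'type': 1,
    'personId': 1,
    'status': 1,
    '_datetime': datetime.datetime(2015, 2, 14, 21, 50, 2),
}
line = encode_event(event)
restored = decode_event(line)
```

Appending such lines to a file would give a very crude persistent event store, which is enough for an experiment like this one.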

The next step is to be able to get a snapshot of the system state. It'll be my Monday iteration! For now, you can find the code on GitHub.

Friday, January 9, 2015

Is it responsible to push Python on my professional projects?

I'm a huge fan of the Python programming language and I can't stop myself from pushing Python scripts into my professional projects. In a way, it's no big deal since it's not deliverable code but tooling to automate stuff (project build, document generation, etc.). And guys, Python is simple to learn, right?!

However, I have a bad feeling about this. Indeed, it's rare that my teammates know enough Python to maintain these scripts efficiently. Eventually, I remain the main user and maintainer of these tools, and I don't know whether they are still used after I leave a project.

My other objective is to share my passion for this particular language and to promote it. However, this initiative is not often taken seriously. Maybe it's because developers in my teams have had bad experiences with Python, maybe it's because they still think that dynamic languages are toys. Maybe it's because they think that Python is for GNU/Linux geeks and that real developers use Java or C#.

Meanwhile, some developers do get into it and discover that the language is worth learning.

So it does not discourage me, because I'm aware of what Python has to offer. I take these situations as opportunities to share it with people. It motivates me to dig deeper into it in order to teach it to teams.

Is it irresponsible to introduce Python on projects? Of course, it is! Will I keep on doing it anyway? Sure I will!

Maybe some of my teammates will be thankful when they get the job they're secretly dreaming of in a Python shop in the Bay Area!

Saturday, November 8, 2014

About BDD pt 1

I went to PyConFR this year and attended a talk about BDD with behave by chbrun. After the talk, I felt disappointed, since I have known about BDD for a while now without practicing it at work, and I'm still waiting for a successful experience report of BDD applied to a whole project. Since then, I have felt the urge to write about it, and I have material for 2 posts, one on the principles and one on the practices. I'll start with the principles.

Behavior Driven Development, or BDD, is a concept proposed by Dan North. It came from a reflection on TDD which, according to Dan, semantically leads to bad development practices. By its name, TDD seems to be a testing practice. Actually, it is not. The so-called testing code is rather a runnable specification for the software under development. The non-regression test suite is a by-product, an argument that sells the practice to your project manager.

Consider that TDD provides a specification. The problem is that it is written in code. Even with the cleanest code, it's a barrier for non-developers. You get the idea: there is room to improve TDD so that it communicates the system specification to the WHOLE team. That's called BDD.

BDD is not about parsing regexes to produce test cases. BDD is not about system testing from GUI to database. BDD is not about replacing asserts with shoulds. BDD is about making runnable specifications, formerly known as unit tests, understandable for the average Joe.

Next time, I'll write about practices and tools. That's when the problems begin.