Saturday 16 June 2018

Programming with dates is hard

Handling time and dates is not a trivial problem in programming. You have to deal with different precisions, different time zones, and different formats when you convert dates from and to strings.

You may remember some date-related issues that happened in the IT world. Take Y2K, which obliged companies to check and fix the way their programs handled dates before the millennium. I guess it was a costly procedure in some cases, in terms of research, fixing and testing. For non-tech people though, it's a ridiculous case: in our daily life, we use time and dates casually without thinking about it.

Another sign that handling dates is not trivial: the plethora of libraries available. In Python, the standard datetime module ships without a time zone database, so we have pytz. But as it's not human friendly, everyone started using Arrow. Then Pendulum came out, claiming that Arrow got it wrong, and it became the new standard for the community. Meanwhile, Kenneth Reitz (requests, pipenv,…) released Maya to do the same thing. Help, we're lost!

It's not specific to Python. In Java, Joda-Time has been the preferred library to handle dates, rather than the standard library. Then Java 8 came out with a new java.time package that can be used instead of Joda. I recently worked on a code base that mixed both libraries. That inspired this post.

On the other hand, when you have chosen a library, getting the current date is super simple, using a call like mytimelib.today(). Cool. How about testing then?

Actual time does not belong to your system: it's a dependency, bound to the real world. If your business domain relies on dates in some way, you should avoid using your date library directly.

When I need to integrate business logic based on dates, I use my own Date and Calendar classes, which allow me to:

  • Get the precision I want for comparison. For instance, my domain can rely on lapsing dates with month precision and I don't need more.
  • Have a reference implementation. My dates can be created from instances created with different date libs.
  • Get default formats for date as String
  • Create a fake calendar for testing that gives me control over the current date. So I do not need to scratch my head when I have to compare dates in tests.
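To make the list above concrete, here is a minimal sketch of what such classes could look like in Python. The names YearMonth, SystemCalendar, FakeCalendar and has_lapsed are made up for this illustration; they are not taken from a real project or library.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True, order=True)
class YearMonth:
    """A domain date with month precision only (hypothetical class)."""
    year: int
    month: int

    @classmethod
    def from_date(cls, value: date) -> "YearMonth":
        # Accepts datetime.date as well as datetime.datetime instances
        return cls(value.year, value.month)

    def __str__(self) -> str:
        # Default string format for the domain
        return f"{self.year:04d}-{self.month:02d}"


class SystemCalendar:
    """Production calendar: asks the real clock."""

    def today(self) -> date:
        return date.today()


class FakeCalendar:
    """Test calendar: the current date is whatever the test sets."""

    def __init__(self, current: date) -> None:
        self.current = current

    def today(self) -> date:
        return self.current


def has_lapsed(term: YearMonth, calendar) -> bool:
    """A contract has lapsed once the current month is past its term."""
    return YearMonth.from_date(calendar.today()) > term


calendar = FakeCalendar(date(2018, 6, 16))
print(has_lapsed(YearMonth(2018, 5), calendar))  # True
calendar.current = date(2018, 5, 31)
print(has_lapsed(YearMonth(2018, 5), calendar))  # False
```

The business code only ever talks to a calendar object, so production injects SystemCalendar and tests inject FakeCalendar.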

In the past I worked on a project for an insurance company. We naively used mylib.today() calls all over the place. When the main features were implemented and tested, business people told us that they wanted to test the system's behaviour when contracts reach their term. To do so, they wanted to be able to alter the system's current date.

Guess who replaced all the calls by injecting a custom Calendar and provided a service endpoint that allows users to modify the current date for testing…

So pay attention to your business domain. If it relies on dates in some way, you'd better treat them as an external dependency. Then you'll be able to use the lib you want, and even to change it seamlessly if it gets old-fashioned.

Sunday 10 June 2018

My typing is bad (but getting better, thanks for asking)

I've always found I'm bad at typing. I make a lot of typos, my speed is not that fast and I don't manage to use all my fingers.

Recently, I have been working on a Java code base and I got very frustrated by the fact that I'm a slow typist!

As my productivity is bound to a typing speed that could be improved, I decided to train myself in touch typing. It's a bit frustrating for now because my speed is about 40 wpm when I'm training on gtypist. But I'm training every day, and even if my improvements are not as fast as I'd like, I have to admit I'm already a better typist:

  • I can now type with my ten fingers,
  • I discovered that I can use the left shift key,
  • I can type without looking at the keys with a decent accuracy.

Till now, I was using tricks to compensate for my lack of speed:

  • Using Vim,
  • Switching to a QWERTY layout though I'm French: the accessibility of keys [, ], { and } on the AZERTY keyboard is soooo painful!
  • Adopting a programming language that requires less typing (Python vs Java)

Though being a proficient developer doesn't rely only on typing speed, typing faster maximizes the throughput between idea and code.

I'm still far from Gary Bernhardt. Seriously, what is this guy's typing speed? 120 wpm? It's insane!

I'm also thinking about investing in a mechanical keyboard. Maybe an ergonomic one, like the Ergodox EZ or the Keyboardio.

Edit 1: according to @fabi1cazenave, the Keyboardio is far better than the Ergodox and French people should use the QWERTY Lafayette layout.

Edit 2: ten years ago, Steve Yegge wrote this nice article in favor of touch typing

Thursday 24 May 2018

Advanced scraping

I wrote web scraping scripts lately for a client to download invoices. I really like web scraping because it’s an automation that saves a lot of time, and it often requires reverse engineering a site. Just like a real hacker, ma! Here are some techniques I use.

Disclaimer: using a robot to scrape a website can be prohibited by its owner, so check that you stay within the terms of use and please behave fairly.

Tooling

Mostly, I use Python 3, Requests and BeautifulSoup. Everything can be installed with pip.

Session

Send all your requests through a Session. It has several advantages:

  • it handles session cookies for you,
  • it keeps the TCP sockets open,
  • it allows you to set default headers for all the requests you’ll send, for instance the User-Agent header.

Like this:

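import requests

# Wow, the UA is quite old!
session = requests.Session()
session.headers.update(
    {
        "User-Agent": (
            # adjacent string literals are concatenated, so keep the
            # trailing space after the closing parenthesis
            "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) "
            "Gecko/20100101 Firefox/58.0"
        )
    }
)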

Authentication

When you want to gather data from the Internet, web sites will often require you to sign in. Mostly, you’ll have to send your credentials within your session in an HTTP POST request:

session.post(your_url, data=form_data)

CSRF protection

A good practice when handling forms in a website is to associate them with a random token. Hence you have to load the form from the website before submitting it. That’s basically CSRF protection: it prevents a malicious site from using an existing session cookie to discreetly perform POST requests.

It’s a good general practice, though it’s questionable on an authentication form!

By the way, when authenticating, you’ll often need to load the sign-in page, gather all the input fields then update them with username and password. Something like this:

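from bs4 import BeautifulSoup

# self._session and self._url come from the surrounding scraper class
soup = BeautifulSoup(self._session.get(self._url).text, "html5lib")
login_data = {}
for field in soup.find_all("input"):
    # .get() avoids a KeyError on inputs that lack a type, name or value
    if field.get("type") in {"hidden", "text", "password"} and field.get("name"):
        login_data[field["name"]] = field.get("value", "")

login_data["credential"] = my_username
login_data["password"] = my_password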
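If you want to try the field-gathering idea without any third-party package, here is a rough sketch that swaps BeautifulSoup for html.parser from the standard library. The sign-in page, the FormFieldCollector class and the field names are all invented for the example; a real form will differ.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of the inputs a login form would submit."""

    WANTED_TYPES = {"hidden", "text", "password"}

    def __init__(self) -> None:
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        # An input without a type attribute is a text input in HTML
        if name and attributes.get("type", "text") in self.WANTED_TYPES:
            # A missing value is submitted as an empty string
            self.fields[name] = attributes.get("value") or ""


# Hypothetical sign-in page, standing in for session.get(url).text
SIGN_IN_PAGE = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="a1b2c3">
  <input type="text" name="credential">
  <input type="password" name="password">
  <input type="submit" value="Sign in">
</form>
"""

collector = FormFieldCollector()
collector.feed(SIGN_IN_PAGE)

login_data = collector.fields
login_data["credential"] = "my_username"
login_data["password"] = "my_password"
print(login_data)
```

The hidden CSRF token is carried over untouched, which is the whole point: the server expects to get back the token it generated for this form.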

SAML

Sometimes, the authentication process is complicated. For one website, I needed to go through a SAML authentication protocol.

SAML authentication consists of several exchanges between client and server to generate the authentication token. The token has to be extracted from the redirection history.

To do so, the requests library lets you browse the redirection history of a response and update the session cookies.

In requests, access the response history with:

response.history  # get the response chain as a list

You can then get the cookies from one of the responses as a dict:

saml_response_cookies = requests.utils.dict_from_cookiejar(
    response.history[1].cookies
)

Finally, update your session cookie jar with these cookies:

session.cookies.update(saml_response_cookies)

The information is not in the source code

Classic web scraping works well when all the content of the page is sent in the HTML on page load. What if it is a Single Page App in which all the content is loaded dynamically with JavaScript?

Look for XHR (use of AJAX) in the "Network" tab in the browser’s dev tools. You can then replay these XHR directly with requests and parse the response.

Most of the time, the response will be a JSON document. It’s good news, as parsing JSON content is far easier than parsing HTML. Just use the json module from the Python stdlib.
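As a small illustration, here is how a replayed XHR body could be parsed. The payload is invented for the example; with requests you would get the same text from response.text (or call response.json() directly).

```python
import json

# Hypothetical body of an XHR response, as response.text would return it
payload = """
{
  "invoices": [
    {"id": "2018-041", "amount": 120.5, "pdf": "/invoices/2018-041.pdf"},
    {"id": "2018-042", "amount": 89.0, "pdf": "/invoices/2018-042.pdf"}
  ]
}
"""

document = json.loads(payload)

# Plain dict and list access replaces all the HTML parsing
pdf_paths = [invoice["pdf"] for invoice in document["invoices"]]
total = sum(invoice["amount"] for invoice in document["invoices"])

print(pdf_paths)
print(total)
```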

Sometimes the requests are built by the JavaScript code. I once scraped a website that converted session information to base64 to build a request query string in JavaScript. Luckily, these functions were readable, so I reimplemented them in my Python code.
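I can't show the original functions, but a reimplementation of that kind could look like the sketch below. Everything specific here is an assumption made for the example: the JSON serialization, the "ctx" parameter name and the build_query_string helper.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode


def build_query_string(session_info: dict) -> str:
    """Encode session information as base64 and wrap it in a query string."""
    raw = json.dumps(session_info, separators=(",", ":")).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    # urlencode percent-escapes "+", "/" and "=" from the base64 alphabet
    return urlencode({"ctx": token})


query = build_query_string({"user": "42", "lang": "fr"})
print(query)

# Round trip, to check that the server side could decode it
decoded = json.loads(base64.b64decode(parse_qs(query)["ctx"][0]))
print(decoded)
```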

I don’t see the whole document!

Sometimes, BeautifulSoup silently fails to parse your whole HTML document. It’s really annoying because you often spend a ridiculous amount of time figuring out that the error comes from the BeautifulSoup parsing.

Try different parsers with BeautifulSoup. Using html5lib gave me the best results. I think you can add this module directly to your dependencies and use it by default:

soup = BeautifulSoup(content, "html5lib")

When everything fails

Sometimes though, reverse engineering how a website works is too time consuming (JavaScript is minified, a lot of state information is kept on the client side,…). I then switch to headless browser automation. I do so with Chrome in headless mode and Selenium.

This is a heavier approach as you need to script all the actions which would be performed by a real user: loading a page, waiting for all the content to be loaded, filling in form fields, clicking on buttons.

Usually, I fire up Chrome without the headless option while writing the script. I switch back to headless when everything works.

Saturday 7 April 2018

Download files with Python, Selenium and Chrome headless

I have recurring tasks these days that consist in automatically downloading files from the Internet. Usually, requests, beautifulsoup and some tricks do the job effectively. Sometimes though, I have to play it hard and ask Selenium and Chromium* headless to do the heavy lifting. Alas, getting Chromium to download files automatically is not straightforward.

I found the solution in the Chrome issue tracker, and you can read it below, written in Python:

import tempfile

from selenium import webdriver

# Let's create some options to make Chromium go headless
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('disable-gpu')

# Launch the browser (recent Selenium releases take `options=`;
# the older `chrome_options=` keyword is deprecated)
browser = webdriver.Chrome(options=options)

# mkdtemp() creates the directory and leaves it in place for the downloads
download_dir = tempfile.mkdtemp()

# Send a command to tell chrome to download files in download_dir without
# asking.
browser.command_executor._commands["send_command"] = (
    "POST",
    '/session/$sessionId/chromium/send_command'
)
params = {
    'cmd': 'Page.setDownloadBehavior',
    'params': {
        'behavior': 'allow',
        'downloadPath': download_dir
    }
}
browser.execute("send_command", params)
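Once downloads are allowed, the script still has to know when a file has arrived. Here is a rough polling sketch using only the standard library. It relies on Chrome writing in-progress downloads with a .crdownload suffix; the wait_for_downloads helper and its parameters are made up for this example, and the check is a heuristic rather than a guarantee.

```python
import os
import time


def wait_for_downloads(download_dir: str, timeout: float = 30.0,
                       poll_interval: float = 0.5) -> list:
    """Return the files in download_dir once no partial download remains."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(download_dir)
        # Chrome names in-progress downloads "<name>.crdownload"
        if names and not any(n.endswith(".crdownload") for n in names):
            return sorted(names)
        time.sleep(poll_interval)
    raise TimeoutError(f"no completed download in {download_dir}")
```

You would call wait_for_downloads(download_dir) right after clicking the link that triggers the download.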

There you go, happy scraping!

* Of course, it works with regular Chrome too!

Sunday 14 January 2018

Some talks

I'm watching tech talks less often these days. Most of the time, the talks are too long (45+ minutes). Nowadays, I don't even try to watch a talk about a new piece of software or framework: if I want to spend my energy on these topics, I'd rather play with an online tutorial.

Here are two talk videos I watched and liked lately. The first one is Hammock Driven Development by Rich Hickey, the creator of the Clojure programming language. In this talk, he explores the fact that we have two modes of thinking: the first is conscious, the second is unconscious and works by itself. It's about these "Eureka" moments that we experience as developers when we are away from the keyboard. You'll see why the hammock is a smart choice to encourage unconscious thinking!

The second talk is by Greg Young, who is known for his work on the CQRS and Event Sourcing architecture patterns. He asks us to stop over-engineering. It's OK not to automate everything by programming. It's about struggling against the desire to show the power of our programmer brains by building overcomplicated frameworks.

Good watching!

PS: if you think videos on YouTube still last too long, you can increase or decrease the speed with the shortcuts shift+. and shift+,. Look at this page for a more complete list of shortcuts.

Sunday 7 January 2018

First months as a freelancer

I realize I hardly take time to write these days. Actually, I've had a lot to do for the last few months, between building my activity as a freelancer, the work on my house and taking care of my family. But I'm having a great time!

I quit my salaried job last summer to start my own business as a freelancer. I barely had a professional network, just one or two contacts that led to nothing. So I created a profile on various freelancer platforms. And it worked! Even if what I earn is below my goals, I think it's a good start.

Currently, I'm focusing on short missions. I did not want to go full-time at the beginning since I wanted to be present for my family after more than 10 years working a lot. I also wanted to start slowly, because I ended my previous job quite unmotivated, and basically I wondered if I still wanted to do programming for a living.

And I found some missions, I delivered the work on time and clients have been quite happy with the result. That's a good point. Here is what I have done so far:

  • A script to save NASDAQ stock events in a database and to perform simple analytics on it,
  • Writing functional specifications and mock-ups in Balsamiq,
  • Adding features to a WYSIWYG editor for a printing company,
  • Some web scrapers to automatically retrieve invoices from websites.

I'm glad because I worked with Python and JavaScript, which is what I want to become specialized in. It's not much but I learned a lot:

  • I can find missions and get paid.
  • I do a great job.
  • I like it.
  • It's not a big deal if I don't get all the missions I apply for. There will be other opportunities.

So to me it's a good start and I hope to work more in 2018, to meet great people, and get decent revenue from this!

So, do I still want to write software for a living? Of course I do.

Monday 20 November 2017

Poor man's pomodoro

It seems that I have not written about it until now: I have to confess I'm a big fan of the Pomodoro Technique.

Now that I have a new laptop, I've been looking for timer software. On Windows and macOS, I used Tomighty. Cool, but the Qt version is not in the Debian repo and the Java version… well it runs on the JVM!

Then I learned about a magical command, notify-send, which sends desktop notifications. So here we go, my timer will be super simple and written in shell script:

# pomodoro aliases
alias PS="pomodoro_start"
alias PSB="pomodoro_short_break"
alias PLB="pomodoro_long_break"

# launch a 25 minute pomodoro
pomodoro_start() {
    # I write the start time, in case I miss the notification
    echo "pomodoro start `date +%H:%M:%S`..."
    sleep 1500

    # using the critical level of urgency forces me to click on the
    # notification pop-up to dismiss it.
    notify-send  --urgency=critical "Poor man's pomodoro" "Time for a break"

    # cvlc is VLC CLI. I play a sound to help me to get out of the zone.
    cvlc --play-and-exit --quiet /some/sound.mp3 > /dev/null 2>&1

    echo "pomodoro end `date +%H:%M:%S`..."
}


pomodoro_short_break() {
    echo "pomodoro break `date +%H:%M:%S`..."
    sleep 300
    notify-send --urgency=critical "Poor man's pomodoro" "Break over, go back to work"
    cvlc --play-and-exit --quiet /some/sound.mp3 > /dev/null 2>&1
    echo "pomodoro end `date +%H:%M:%S`..."
}


pomodoro_long_break() {
    echo "pomodoro long break `date +%H:%M:%S`..."
    sleep 900
    notify-send --urgency=critical "Poor man's pomodoro" "Long break over, go back to work"
    cvlc --play-and-exit --quiet /some/sound.mp3 > /dev/null 2>&1
    echo "pomodoro end `date +%H:%M:%S`..."
}

The Pomodoro Technique is far from unanimously accepted. I do not practice it strictly by the book. What I like is that it forces me to get out of the zone when I work on a long running task. Though being in the zone and experiencing the flow feels good, relying on my timer helps me to get out of it, take a step back from my work and think. Is what I'm doing right now really worth it?

Contrary to what is usually said about the flow, I don't think it's more productive. It's just pleasant. So while it's a good thing for your hobbies, to get the most energy out of them, it may not be so appropriate when you have stuff to do at work.