Thursday, May 24, 2018

Advanced scraping

I have been writing web scraping scripts lately for a client to download invoices. I really like web scraping: it's an automation that saves a lot of time, and it often requires reverse engineering a site. Look ma, just like a real hacker! Here are some techniques I use.

Disclaimer: using a robot to scrape a website may be prohibited by its owner, so check that you stay within the site's terms of use and please behave fairly.

Tooling

Mostly, I use Python 3, Requests and BeautifulSoup. Everything can be installed with Pip.
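
For the record, a one-liner is enough (html5lib is included here because I use it later in this post):

pip install requests beautifulsoup4 html5lib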

Session

Send all your requests through a Session. It has several advantages:

  • it handles session cookies for you,
  • it keeps TCP connections open between requests (connection pooling),
  • it allows you to set default headers for all the requests you make, for instance the User-Agent header.

Like this:

import requests

session = requests.Session()
# Wow, the UA is quite old!
session.headers.update(
    {
        "User-Agent": (
            "Mozilla/5.0 (X11; Linux x86_64; rv:58.0) "
            "Gecko/20100101 Firefox/58.0"
        )
    }
)

Authentication

When you want to gather data from the Internet, websites will often require you to sign in. Most of the time, you'll have to send your credentials within your session in an HTTP POST request:

session.post(your_url, data=form_data)
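
For instance, a minimal sketch; the login URL and the form field names ("username", "password") are assumptions, so inspect the actual sign-in form to find the real ones:

login_url = "https://example.com/login"  # hypothetical sign-in endpoint
form_data = {"username": my_username, "password": my_password}
response = session.post(login_url, data=form_data)
response.raise_for_status()  # fail loudly if the sign-in request was rejected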

CSRF protection

A good practice when handling forms on a website is to associate them with a random token, so you have to load the form from the website before submitting it. That's basically CSRF protection: it prevents a malicious site from using an existing session cookie to discreetly perform POST requests.

It's a good general practice, though it's questionable on authentication forms!

By the way, when authenticating you'll often need to load the sign-in page, gather all its input fields (including the CSRF token), then fill in the username and password. Something like this:

from bs4 import BeautifulSoup

# Collect every relevant input field from the sign-in page, CSRF token included
soup = BeautifulSoup(session.get(login_url).text, "html5lib")
login_data = {}
for field in soup.find_all("input"):
    if field.get("type", "text") in {"hidden", "text", "password"}:
        login_data[field["name"]] = field.get("value", "")

# Then overwrite the fields that hold the credentials
login_data["credential"] = my_username
login_data["password"] = my_password

SAML

Sometimes the authentication process is more involved. For one website, I had to go through a SAML authentication flow.

SAML authentication consists of several exchanges between client and server to generate the authentication token. The token has to be extracted from the redirection history.

To do so, the Requests library lets you browse the redirection history of a response and update the session cookies accordingly.

In requests, access the response history with:

response.history  # get the response chain as a list

You can then get the cookies from one of the responses as a dict:

saml_response_cookies = requests.utils.dict_from_cookiejar(
    response.history[1].cookies
)

Finally, update your session cookie jar with these cookies:

session.cookies.update(saml_response_cookies)

The information is not in the source code

Classic web scraping works well when all the content of the page is sent in the HTML on page load. What if it's a Single Page App in which all the content is loaded dynamically with JavaScript?

Look for XHRs (AJAX calls) in the "Network" tab of the browser's dev tools. You can then replay these requests directly with Requests and parse the response.

Most of the time, the response will be a JSON document. That's good news, as parsing JSON is far easier than parsing HTML. Just use the json module from the Python stdlib.
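
For instance, a minimal sketch of replaying such a request; the endpoint and its parameters are assumptions you would copy from what the "Network" tab shows:

import json

# Hypothetical API endpoint observed in the "Network" tab
api_url = "https://example.com/api/invoices"
response = session.get(api_url, params={"page": 1})
invoices = json.loads(response.text)  # or simply response.json()
print(invoices)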

Sometimes the requests are built by JavaScript code. I once scraped a website that converted session information to Base64 to build a request query string in JavaScript. Luckily, those functions were readable, so I reimplemented them in my Python code.
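
As an illustration only, here is a hypothetical sketch of what such a reimplementation could look like; the state structure and the parameter name are invented:

import base64
import json

# Invented session state; in practice it comes from whatever the JavaScript code reads
state = {"account_id": "12345", "locale": "fr"}
encoded_state = base64.b64encode(json.dumps(state).encode("utf-8")).decode("ascii")
response = session.get(
    "https://example.com/api/data", params={"state": encoded_state}
)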

I don’t see the whole document!

Sometimes, BeautifulSoup silently fails to parse your whole HTML document. It's really annoying because you can spend a ridiculous amount of time figuring out that the error comes from the BeautifulSoup parsing step.

Try varying the parser you give to BeautifulSoup. Using html5lib gave me the best results. You can add this module directly to your dependencies and use it by default:

soup = BeautifulSoup(content, "html5lib")

When everything fails

Sometimes though, reverse engineering how a website works is too time consuming (the JavaScript is minified, a lot of state is kept on the client side, …). I then switch to headless browser automation, using Chrome in headless mode driven by Selenium.

This is a heavier approach, as you need to script all the actions a real user would perform: loading a page, waiting for all the content to load, filling in form fields, clicking buttons.

Usually, I fire up Chrome without the headless option while writing the script, then switch back to headless once everything works.
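
As a closing example, a minimal sketch with Selenium and headless Chrome; the URL and the element selectors are assumptions for illustration:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # comment this out while writing the script
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/login")  # hypothetical sign-in page
    driver.find_element(By.NAME, "username").send_keys(my_username)
    driver.find_element(By.NAME, "password").send_keys(my_password)
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    # Wait for some content that only appears once the page has fully loaded
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table.invoices"))
    )
finally:
    driver.quit()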