Comparison of HTML5 Parsers: Gumbo vs html5lib

July 29, 2016

When developing a content plugin for the Kodi media center, the most important question is where to get the content from. One possible way is to scrape websites that host multimedia content. Yes, the legality of that content is another question, but legal matters are beyond the scope of this post.

In the Python world, the BeautifulSoup library (BS for short) in combination with the html5lib parser is a popular choice. However, according to the BeautifulSoup documentation, html5lib is the slowest, albeit the most reliable, of the supported HTML parsers. So I googled for alternatives and found the Gumbo parser made by Google itself. According to its description, it is fully HTML5-compliant and written in pure C99 with no external dependencies. It also has Python bindings compatible with popular Python HTML parsing libraries, including BeautifulSoup. The BeautifulSoup binding was written for BS 3, but making it compatible with BS 4 was relatively easy, which I did and submitted as a pull request on GitHub (which seems to have been ignored by the repo maintainers).
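
For illustration, here is a minimal sketch of how the two approaches look from the caller's point of view. It assumes the soup_adapter module from the Gumbo Python bindings (with the BS 4 patch mentioned above) is importable; your import path may differ depending on how the bindings are installed.

# coding: utf-8
import requests
from bs4 import BeautifulSoup
from soup_adapter import parse  # BeautifulSoup adapter from the Gumbo Python bindings

html = requests.get('http://www.vidsplay.com').text

# Pure-Python path: html5lib builds the tree for BeautifulSoup
soup_html5lib = BeautifulSoup(html, 'html5lib')
# Gumbo path: the C library parses the page, the adapter wraps it in a soup
soup_gumbo = parse(html)

# Both objects are navigated through the same BeautifulSoup API afterwards
print soup_html5lib.find('title').text
print soup_gumbo.find('title').text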

So at first glance the Gumbo parser looks promising. Yes, it requires compiling a binary shared library, which may be problematic if you are going to support multiple platforms, but since it's written in pure C99, all you need is a C compiler such as GCC. And because the parser is written in C, one would expect it to be faster than pure Python solutions. Let's see if that is true.

To test both parsers I wrote a Python script that is close to the real-life parsers I use in my pet projects:

# coding: utf-8

import time
from collections import namedtuple
import requests
from bs4 import BeautifulSoup
from soup_adapter import parse

MediaItem = namedtuple('MediaItem', ['title', 'thumb', 'path'])
iterations = 50


def parse_vidsplay(soup):
    # Extract a title, a thumbnail URL and an item page link
    # from each entry in the "Featured" block of the start page
    row_tag = soup.find('div', id='featured')
    a_tags = row_tag.find_all('a', href=True)
    img_tags = row_tag.find_all('img', class_='image full')
    p_tags = row_tag.find_all('p')
    for p, img, a in zip(p_tags, img_tags, a_tags):
        yield MediaItem(p.text, img['src'], a['href'])


html = requests.get('http://www.vidsplay.com').text

# Testing html5lib
start1 = time.time()
for _ in xrange(iterations):
    soup1 = BeautifulSoup(html, 'html5lib')
    list(parse_vidsplay(soup1))
print 'html5lib Soup parsing: {0}s average'.format((time.time() - start1) / iterations)
# Testing Gumbo parser
start2 = time.time()
for _ in xrange(iterations):
    soup2 = parse(html)
    list(parse_vidsplay(soup2))
print 'Gumbo Soup parsing: {0}s average'.format((time.time() - start2) / iterations)

The script loads the start page of www.vidsplay.com (a site hosting free sample videos) and parses the "Featured" items to extract a title, a thumbnail and a link to the respective item page. Each parser is run 50 times and the average parsing time for each solution is calculated. On a system with an Intel Core i5-4440 @ 3.10 GHz processor and 8 GB RAM running Windows 7, the script showed the following results:

html5lib Soup parsing: 0.0447800016403s average
Gumbo Soup parsing: 0.0425999975204s average

Strangely enough, the Gumbo parser showed only a minimal performance improvement over html5lib. According to the Gumbo description, parsing speed was not among its design goals, but you would still expect a C-based solution to be much faster than a Python-based one. I cannot fully explain these results; a likely factor is that the Python bindings still build the BeautifulSoup tree object by object in pure Python, so the actual C parsing accounts for only a fraction of the total time.

In my tests I used the following software versions:

Python: 2.7.11

BeautifulSoup: 4.5.0

html5lib: 0.9999999.0

Gumbo: 0.10.1

Other Parsing Alternatives

I also tried Python's built-in html.parser and the lxml library. The html.parser was not able to extract any data from the test page and was disqualified. The lxml parser was much faster than html5lib. However, it is written in Cython, so it may be hard to compile for some of the platforms that run Kodi (if we are talking about plugins for the Kodi media center).
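
For completeness, lxml plugs into the same benchmark simply by passing a different tree builder name to BeautifulSoup. The following sketch assumes it is appended to the script above (so html, iterations and parse_vidsplay are already defined) and that the lxml package is installed:

# Testing lxml (appended to the benchmark script above)
start3 = time.time()
for _ in xrange(iterations):
    soup3 = BeautifulSoup(html, 'lxml')
    list(parse_vidsplay(soup3))
print 'lxml Soup parsing: {0}s average'.format((time.time() - start3) / iterations)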

Conclusion

If you are writing webpage parsers in Python, the BeautifulSoup library in combination with the html5lib parser is still one of the optimal choices. This combination shows decent speed, is resilient to malformed HTML, and is truly cross-platform, which is important if you are targeting multiple platforms, as with plugins for Kodi.

Google's Gumbo parser shows only minimal speed improvements over html5lib, and since it requires compiling a binary library for your target platforms, it is safe to say that Gumbo does not offer any real advantages in Python. The parser may be a good choice for other programming languages that lack good HTML parsing libraries equivalent to html5lib, but not for Python.

And if you are developing webpage parsers for mainstream platforms like Windows, Linux or OS X (essentially, the platforms that support compiling binary Python extensions from Cython sources), then you may consider the lxml parser, which is much faster than html5lib but, according to the BeautifulSoup documentation, not as resilient to malformed HTML.

Tags: HTML, Python