Comparison of HTML5 Parsers: Gumbo vs html5lib
July 29, 2016 0 comments
When developing a content plugin for the Kodi media center, the most important question is where to get the content from. One possible way is to scrape websites that host multimedia content. Yes, the legality of that content is another question, but legal matters are beyond the scope of this post.
In the Python world, the BeautifulSoup library (BS for short) in combination with the html5lib parser is a popular choice. However, according to the BeautifulSoup documentation, html5lib is the slowest, albeit the most reliable, of all HTML parsers. So I googled for alternatives and found the Gumbo parser made by Google itself. According to its description, it's fully HTML5-compliant and written in pure C99 with no external dependencies. It also has Python bindings compatible with popular Python HTML parsing libraries, including BeautifulSoup. The BeautifulSoup binding was written for BS 3, but making it compatible with BS 4 was relatively easy, which I did and submitted a pull request on GitHub (which seems to be ignored by the repo maintainers). (...)
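To give an idea of how such a speed comparison could be set up, here is a minimal sketch of a timing harness. It is not the benchmark from the full post: the helper names (`time_parser`, `available_bs4_parsers`, `compare_parsers`) and the sample markup are made up for illustration, and it only covers the parsers that BeautifulSoup 4 can find in the current environment. Gumbo is left out because its BS binding needs the patch mentioned above.

```python
import timeit

# Hypothetical sample document: a page with many repeated link blocks.
SAMPLE_HTML = (
    "<html><body>"
    + "<div class='item'><a href='/video/1'>Title</a></div>" * 200
    + "</body></html>"
)


def time_parser(parse, html, repeat=5):
    """Return the best wall-clock time (in seconds) of parse(html) over several runs."""
    return min(timeit.repeat(lambda: parse(html), number=1, repeat=repeat))


def available_bs4_parsers(candidates=("html.parser", "lxml", "html5lib")):
    """Return the BeautifulSoup parser names that are usable in this environment."""
    try:
        from bs4 import BeautifulSoup, FeatureNotFound
    except ImportError:
        return []  # BeautifulSoup 4 itself is not installed
    usable = []
    for name in candidates:
        try:
            BeautifulSoup("<p>probe</p>", name)
        except FeatureNotFound:
            continue  # the parser backend is not installed
        usable.append(name)
    return usable


def compare_parsers(html):
    """Map each usable parser name to its best parse time for the given document."""
    results = {}
    for name in available_bs4_parsers():
        from bs4 import BeautifulSoup
        results[name] = time_parser(lambda doc, n=name: BeautifulSoup(doc, n), html)
    return results


if __name__ == "__main__":
    for parser_name, seconds in compare_parsers(SAMPLE_HTML).items():
        print(f"{parser_name}: {seconds:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes. For realistic numbers the sample should be replaced with pages saved from the sites a plugin actually scrapes.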
Wsgi-Boost-Server: A Python WSGI Server Written in C++
July 15, 2016 0 comments
At last I found some time to write about my recent project: WsgiBoostServer. I started it to learn C++ and, specifically, to learn writing binary extension modules for Python using Boost.Python. As the name implies, this is a WSGI server, that is, an HTTP server for Python web applications. In addition to Python applications, WsgiBoostServer can also serve static files, which makes it suitable for hosting standalone Python micro-services together with all their static content.
Because WsgiBoostServer is written in C++ using the Boost.Asio library, it is faster than pure-Python WSGI servers like Waitress or CherryPy. And since it can be used as a regular Python module (although a binary one), it does not require a complex set-up and can be included in any Python application. More info about WsgiBoostServer and its source code can be found in my GitHub repository. It's MIT-licensed, so feel free to use it as you like if you find this little side project of mine interesting.
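For readers unfamiliar with WSGI, here is a minimal sketch of the kind of application any WSGI server is expected to run. It uses only the standard library and does not show the WsgiBoostServer API. The names `app` and `call_app` are made up for this example, and `call_app` is just a tiny test driver that plays the role of the server.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A minimal WSGI application: a callable taking environ and start_response."""
    body = b"Hello from a WSGI application\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # an iterable of bytes chunks


def call_app(application, path="/"):
    """Invoke a WSGI application the way a server would; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI environ keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(application(environ, start_response))
    return captured["status"], body


# To serve the app over HTTP with the reference server from the standard library:
#     from wsgiref.simple_server import make_server
#     make_server("127.0.0.1", 8000, app).serve_forever()
```

Any WSGI server, whether pure Python or binary, ultimately does what `call_app` does: it builds the `environ` dictionary from an HTTP request, calls the application, and sends the returned bytes back to the client.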
Update: Unfortunately, deeper testing revealed serious problems. WsgiBoostServer works fine with pure-Python WSGI applications but crashes because of memory corruption if I add some binary Python modules into the mix. My guess is that Boost.Asio does not work well inside a Python interpreter, which does its own memory management. Since diagnosing such arcane memory problems is way over my head, I had to abandon this project.
Featured Posts
- Running Multiple Celery Beat Instances in One Python Project (Feb. 1, 2021)
- Setting Up MySQL in LibreELEC on Raspberry Pi (Nov. 17, 2017)
- Autodocumenting your Python code with Sphinx - part 2 (Feb. 24, 2016)