How I Built a Fast HTML Parser Using Regex and TypeScript

Elson Correia
10 min read · Nov 13
Photo by Jay Zhang on Unsplash

While working on a side project, the need to parse HTML came up, and to save time, I tried the fastest HTML parsers I could find. After fighting with them and trying to hack around their limits, I realized I needed either a custom parser or a highly customizable one to fit all the project's needs. I had no luck finding one, so I created my own.

I thought it was a simple enough thing to do…

The Motivation

For my specific project, I needed something fast, which was easy to find, but I also needed it to be customizable enough. Everything I found failed mainly in two areas:

  • They offered no way to tap into nodes while they were being parsed — That’s something I desperately needed.
  • They offered no way to specify a custom API for the parsed result, forcing me either to learn whatever API they came up with or to stay stuck with really non-performant ones. That ability would let me adapt the parser to the project, not vice versa.
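
To make those two requirements concrete, here is a minimal sketch of what they could look like in a regex-driven parser. It is not the API of the parser this article describes: `parseHtml`, `NodeFactory`, and `OnNode` are hypothetical names invented for illustration, and the regexes are deliberately naive (no comments, doctype, or `>` inside attribute values).

```typescript
// Hypothetical sketch: these names are assumptions, not the real parser's API.

// Supplied by the caller so the parsed result uses the project's own node shape.
interface NodeFactory<T> {
  createElement(tag: string, attrs: Record<string, string>): T;
  createText(value: string): T;
  appendChild(parent: T, child: T): void;
}

// Invoked as soon as a node is created, while parsing is still in progress.
type OnNode<T> = (node: T, depth: number) => void;

const VOID_TAGS = new Set<string>(["br", "hr", "img", "input", "link", "meta"]);

// Turns a raw attribute string such as ` id="app" disabled` into a record.
function parseAttrs(raw: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  const attrRe = /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s]+)))?/g;
  let m: RegExpExecArray | null;
  while ((m = attrRe.exec(raw)) !== null) {
    const name = m[1];
    if (name === undefined) continue;
    attrs[name] = m[2] ?? m[3] ?? m[4] ?? "";
  }
  return attrs;
}

function parseHtml<T>(html: string, factory: NodeFactory<T>, onNode?: OnNode<T>): T {
  const root = factory.createElement("#root", {});
  const stack: T[] = [root];
  // Alternatives, in order: closing tag | opening tag | run of text.
  const tokenRe = /<\/([a-zA-Z][\w-]*)\s*>|<([a-zA-Z][\w-]*)([^>]*?)(\/?)>|([^<]+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenRe.exec(html)) !== null) {
    const [, closeTag, openTag, rawAttrs, selfClosing, text] = match;
    const parent = stack[stack.length - 1] ?? root;
    if (closeTag !== undefined) {
      // Naive: assumes well-formed nesting and never pops the root.
      if (stack.length > 1) stack.pop();
    } else if (openTag !== undefined) {
      const tag = openTag.toLowerCase();
      const node = factory.createElement(tag, parseAttrs(rawAttrs ?? ""));
      factory.appendChild(parent, node);
      onNode?.(node, stack.length - 1);
      if (selfClosing !== "/" && !VOID_TAGS.has(tag)) stack.push(node);
    } else if (text !== undefined && text.trim() !== "") {
      const node = factory.createText(text);
      factory.appendChild(parent, node);
      onNode?.(node, stack.length - 1);
    }
  }
  return root;
}

// Usage: the project decides what a node looks like.
type MyNode = { name: string; attrs: Record<string, string>; children: MyNode[]; text?: string };

const myFactory: NodeFactory<MyNode> = {
  createElement: (name, attrs) => ({ name, attrs, children: [] }),
  createText: (value) => ({ name: "#text", attrs: {}, children: [], text: value }),
  appendChild: (parent, child) => {
    parent.children.push(child);
  },
};

const seen: string[] = [];
const tree = parseHtml('<div id="app"><p>Hello</p><br/></div>', myFactory, (node) => {
  seen.push(node.name); // observe each node during the parse, not afterwards
});
```

Because the factory is generic, the parser never has to know the final node type, so swapping in a heavier DOM implementation or a bare object like `MyNode` only changes the factory, not the parsing loop.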

Some offer customizations that often come with performance loss — I wanted both performance and customization. Additionally, I needed it to work in any JavaScript runtime environment, and because I was going to use it in a client library, it needed to be light.

Here is a list of the best parsers I tried: html-parser, htmljs-parser (good callback options), html-dom-parser, html5parser, cheerio (really good, offering more than just a parsing solution), parse5 (as good as cheerio), htmlparser2, htmlparser, and node-html-parser (really fast).

Is my parser better? (disclaimer)

I am not claiming my parser is better than any of the above or that everyone should use it instead. I created a parser specifically to solve a problem I was having. I don’t believe in single solutions, and one should always try to find the best tool for the job, or create one if necessary.

The Result

  • It is a ~40kB package, ~4kB minified, and <2kB CDN size when used in the browser.
  • Using the htmlparser-benchmark package, it benchmarks at around 1.68957 ms/file ± 1.11577 by default, and at around 26.3847 ms/file ± 18.8658 when using a third-party API like jsdom.