Parsing HTML with Node

The perils of using regular expressions to parse HTML are well documented. Take a look at the articles here, here, here, and here for why parsing HTML with regex is such a bad idea.

Using regex

That said, if you are at least a moderate regex user, it's hard to resist the pull of a non-greedy regular expression for parsing HTML.

The other day I needed to pull the <link> tags with rel="stylesheet" out of an HTML document with Node. In a hurry, I threw this code at the problem (fileContents is the contents of an HTML file):

and it spit out this result:

which at first seemed correct, but one of those link tags was embedded in an HTML comment. Notice that the result returned for the first line above ends with -->. While that would mostly do what I wanted, it just didn't feel right.
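To make the pitfall concrete, here is a minimal sketch of this kind of regex approach. The sample markup and the pattern are assumptions made up for illustration, not the code I originally ran, so the exact matched text will differ depending on the pattern you use. The point is only that a regex has no idea what an HTML comment is:

```javascript
// Sample markup (an assumption for illustration): the first link is commented out.
const fileContents = `
<head>
  <!-- <link rel="stylesheet" href="old.css"> -->
  <link rel="stylesheet" href="main.css">
  <link rel="icon" href="favicon.ico">
  <link rel="stylesheet" href="theme.css">
</head>`;

// Non-greedy match of any <link ...> tag that contains rel="stylesheet".
// [^>] keeps each match inside a single tag.
const stylesheetLinkPattern = /<link\b[^>]*?rel="stylesheet"[^>]*?>/g;

// String.prototype.match returns null when nothing matches, hence the fallback.
const regexMatches = fileContents.match(stylesheetLinkPattern) || [];

// The commented-out old.css link is reported alongside the two live links.
console.log(regexMatches);
```

The icon link is correctly skipped, but the commented-out stylesheet is not, and teaching the pattern about comments is where a regex solution starts to get ugly.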

Using JSSoup

Having used Python's Beautiful Soup HTML-parsing library, I then looked at the JSSoup NPM package.

This produced the correct results:

JSSoup has a Beautiful Soup-like syntax, and using it was almost a go when I noticed this comment at the bottom of its NPM package home page (sic): "There's a lot of work need to be done." I agree. JSSoup is not nearly as effective as Python's Beautiful Soup is.

Yikes. No thanks. While I'm not sure if that comment reflects incompleteness (which may be OK) or some tests not passing (which is definitely not OK), I moved on.

Using node-html-parser

Next up was the node-html-parser NPM package. This package has a very JavaScript-like API, is mature, and has many users.

which also produced the correct result:

Note that the node-html-parser documentation says that querySelectorAll supports only tagName, #id, and .class selectors.

Putting node-html-parser to work

The HTML is being parsed to add cache-busting query strings to the end of stylesheet link tags. The nanoid NPM package creates the unique query string values. Even though I noticed that doc.querySelectorAll('link[rel="stylesheet"]'); did work, I had trouble with other advanced CSS selectors, so I only used querySelectorAll as it is documented.
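The query string step on its own can be sketched without any parser. In this rough sketch, addCacheBuster is a hypothetical helper name, the v parameter name is an assumption, and Node's built-in crypto module stands in for nanoid so the snippet runs with no dependencies:

```javascript
const crypto = require('crypto');

// Hypothetical helper: set a "v" query parameter on an href, replacing any
// previous value so that repeated runs do not pile up parameters.
// The parameter name "v" is an assumption; any name would work.
function addCacheBuster(href, token) {
  const [pathPart, query = ''] = href.split('?');
  const params = new URLSearchParams(query);
  params.set('v', token);
  return `${pathPart}?${params.toString()}`;
}

// Stand-in for nanoid(): 6 random bytes rendered as 12 hex characters.
const token = crypto.randomBytes(6).toString('hex');

console.log(addCacheBuster('css/main.css', token));
console.log(addCacheBuster('css/theme.css?v=old', token));
```

Replacing an existing v value rather than blindly appending matters here because the same HTML file gets processed on every build.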

The HTML file can be updated like this:

The code above transformed these two stylesheet links:

into these two stylesheet links:

With the links updated, the file contents can be written back out to the original file using doc.outerHTML as the new file contents:
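The write-back step is plain Node file I/O. Here is a minimal sketch, assuming a throwaway file in the OS temp directory and a hard-coded updatedHtml string standing in for doc.outerHTML; writeHtmlBack and the file name are hypothetical:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: overwrite the original file with the updated markup.
function writeHtmlBack(filePath, html) {
  fs.writeFileSync(filePath, html, 'utf8');
}

// Demo against a temp file so nothing real gets overwritten.
const demoPath = path.join(os.tmpdir(), `cache-bust-demo-${process.pid}.html`);
fs.writeFileSync(demoPath, '<link rel="stylesheet" href="main.css">', 'utf8');

// Stand-in for doc.outerHTML after the links have been updated.
const updatedHtml = '<link rel="stylesheet" href="main.css?v=abc123">';
writeHtmlBack(demoPath, updatedHtml);

// Read the file back to confirm the new contents, then clean up.
const roundTrip = fs.readFileSync(demoPath, 'utf8');
fs.unlinkSync(demoPath);
console.log(roundTrip === updatedHtml);
```

Since this overwrites the source file in place, it is worth keeping the HTML under version control or writing to a copy first until you trust the output.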
