The perils of using regular expressions to parse HTML are well documented. Take a look at the articles here and here and here and here for why parsing HTML with regex is such a bad idea.
That said, if you are at least a moderate regex user, it's hard to resist the pull of a non-greedy regular expression for parsing HTML.
I needed to parse the <link> tags whose rel is stylesheet out of an HTML document with Node the other day. In a hurry I threw this code at the problem (fileContents is the contents of an HTML file):
and it spit out this result:
which at first seemed correct, but one of those link tags was embedded in an HTML comment. Notice that the result returned for the first line above ends with -->. While that would mostly do what I wanted, it just didn't feel right.
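To make the pitfall concrete, here is a minimal sketch rather than the code from that day. The sample HTML and the pattern are assumptions chosen for illustration; the point is only that a regex has no notion of comments, so the commented-out tag comes back alongside the live one.

```javascript
// Hypothetical sample input: one commented-out link and one live link.
const fileContents = [
  '<head>',
  '  <!-- <link rel="stylesheet" href="old.css"> -->',
  '  <link rel="stylesheet" href="main.css">',
  '</head>',
].join('\n');

// Non-greedy match from "<link" through the first ">" after rel="stylesheet".
// The pattern knows nothing about comments, so it cannot skip the first tag.
const stylesheetLinkPattern = /<link.*?rel="stylesheet".*?>/g;
const linkTags = fileContents.match(stylesheetLinkPattern) || [];

console.log(linkTags);
```

Both tags are returned, including the one inside the comment. The exact text of each match depends on the pattern and the markup.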
Having used Python's Beautiful Soup HTML-parsing library, I then looked at the JSSoup NPM package.
This produced the correct results:
JSSoup has a Beautiful Soup-like syntax and using it was almost a go when I noticed this comment at the bottom of its NPM package home page (sic): "There's a lot of work need to be done." I agree. JSSoup is not nearly as effective as Python's Beautiful Soup is.
Yikes. No thanks. While I'm not sure if that comment reflects incompleteness (which may be OK) or some tests not passing (which is definitely not OK), I moved on to the node-html-parser NPM package, which also produced the correct result:
Note that the node-html-parser documentation says that querySelectorAll supports only tagName, #id, and .class selectors.
Putting node-html-parser to work
The HTML is being parsed to add cache-busting query strings to the end of stylesheet link tags. The nanoid NPM package creates unique query string values. Even though I noticed that doc.querySelectorAll('link[rel="stylesheet"]'); did work, I had trouble with other advanced CSS selectors, so I only used querySelectorAll as it is documented.
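The cache-busting step itself is just string handling. Here is a minimal sketch of a hypothetical helper, addCacheBuster, that appends a v parameter to an href. The parameter name is an assumption, and Node's built-in crypto module stands in for nanoid so the snippet has no dependencies.

```javascript
const crypto = require('crypto');

// Hypothetical helper: append a cache-busting "v" parameter to an href.
function addCacheBuster(href, token) {
  // Drop a previously added "v" parameter so repeated runs do not stack up.
  const cleaned = href.replace(
    /([?&])v=[^&]*(&|$)/,
    (match, lead, trail) => (trail ? lead : '')
  );
  const separator = cleaned.includes('?') ? '&' : '?';
  return `${cleaned}${separator}v=${token}`;
}

// Stand-in for a nanoid value: 8 random hex characters.
const token = crypto.randomBytes(4).toString('hex');

console.log(addCacheBuster('css/main.css', token));
console.log(addCacheBuster('css/main.css?v=old', token));
```

Stripping an existing v value first means the build can be run repeatedly without the query string growing each time.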
The HTML file can be updated like this:
The code above transformed these two stylesheet links:
into these two stylesheet links:
With the links updated, the file contents can be written back out to the original file using doc.outerHTML as the new file contents: