The perils of using regular expressions to parse HTML are well documented. Plenty of articles explain why parsing HTML with regex is such a bad idea.
Using regex
That said, if you are at least a moderate regex user, it's hard to resist the pull of a quick regular expression for parsing HTML.
I needed to parse the <link> tags where rel="stylesheet" out of an HTML document with Node the other day. In a hurry I threw this code at the problem (fileContents is the contents of an HTML file):
    const regexp = /\<link.*stylesheet.*\>/gmi;
    const links = [...fileContents.matchAll(regexp)];
    links.forEach(link => {
      console.log(`with regex = ${link.toString()}`);
    });
and it spit out this result:
    with regex = <link rel="stylesheet" href="./assets/css/global.css?BkWXyVMYKB21uTEkkQPoD"> -->
    with regex = <link rel="stylesheet" href="./assets/css/global.min.css?I1juSFg_GA0s6pyS5olWN">
    with regex = <link href="./assets/css/main.css?o5qR8xufWxXzm-BKPb6m7" rel="stylesheet">
which at first seemed correct, but one of those link tags was embedded in an HTML comment. Notice that the result returned for the first line above ends with -->. While that would mostly do what I wanted, it just didn't feel right.
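To see where the trailing --> comes from, here is a minimal sketch on a made-up two-line input (the sample HTML and variable names are assumptions for illustration, not the original file). The greedy .* runs to the last > on the line, so it swallows the comment terminator. Bounding the match with [^>]* trims that off, but the commented-out tag is still matched, which is the real problem a regex can't easily see.

```javascript
// Hypothetical sample input: one commented-out link and one live link.
const sampleHtml = [
  '<!-- <link rel="stylesheet" href="./old.css"> -->',
  '<link rel="stylesheet" href="./main.css">',
].join('\n');

// Greedy: `.*` runs to the last `>` on the line, so ` -->` is swallowed too.
const greedy = /<link.*stylesheet.*>/gim;
// Bounded: `[^>]*` cannot cross a `>`, so each match stops at the tag's own end.
const bounded = /<link[^>]*stylesheet[^>]*>/gim;

const greedyMatches = [...sampleHtml.matchAll(greedy)].map(m => m[0]);
const boundedMatches = [...sampleHtml.matchAll(bounded)].map(m => m[0]);

console.log(greedyMatches);
console.log(boundedMatches);
```

Both patterns report two matches here, even though only one link is live. Neither pattern knows it is inside a comment; a real parser does.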
Using JSSoup
Having used Python's Beautiful Soup HTML-parsing library, I then looked at the JSSoup NPM package.
    const JSSoup = require('jssoup').default;

    const soup = new JSSoup(fileContents);
    const links = soup.findAll('link');

    links.forEach(link => {
      if (link.attrs.rel === 'stylesheet') {
        console.log(`jssoup = ${link}`);
      }
    });
This produced the correct results:
    jssoup = <link rel="stylesheet" href="./assets/css/global.min.css?I1juSFg_GA0s6pyS5olWN" />
    jssoup = <link rel="stylesheet" href="./assets/css/main.css?o5qR8xufWxXzm-BKPb6m7"/>
JSSoup has a Beautiful Soup-like syntax and using it was almost a go when I noticed this comment at the bottom of its NPM package home page (sic): "There's a lot of work need to be done." I agree. JSSoup is not nearly as effective as Python's Beautiful Soup is.
Yikes. No thanks. While I'm not sure whether that comment reflects incompleteness (which may be OK) or some tests not passing (which is definitely not OK), I moved on.
Using node-html-parser
Next up was the node-html-parser NPM package. This package has a very JavaScript-like API, is mature, and has many users.
    const parser = require('node-html-parser');

    const doc = parser.parse(fileContents);
    const links = doc.querySelectorAll('link');

    links.forEach(link => {
      console.log(`node-html-parser = ${link.outerHTML}`);
    });
which also produced the correct result:
    node-html-parser = <link rel="stylesheet" href="./assets/css/global.min.css?I1juSFg_GA0s6pyS5olWN">
    node-html-parser = <link rel="stylesheet" href="./assets/css/main.css?o5qR8xufWxXzm-BKPb6m7">
Note that the node-html-parser documentation says that querySelectorAll supports only tagName, #id, and .class selectors.
Putting node-html-parser to work
The HTML is being parsed to add cache-busting query strings to the href of each stylesheet link tag. The nanoid NPM package creates the unique query string values. Even though I noticed that doc.querySelectorAll('link[rel="stylesheet"]') did work, I had trouble with other advanced CSS selectors, so I only used querySelectorAll as it is documented.
The HTML file can be updated like this:
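    const fs = require('fs');
    const parser = require('node-html-parser');
    const { nanoid } = require('nanoid'); // require() works with nanoid 3.x

    const fileContents = fs.readFileSync(filename, 'utf8');
    const doc = parser.parse(fileContents);

    const links = doc.querySelectorAll('link');
    links.forEach(link => {
      // Guard against <link> tags that have no rel attribute.
      const rel = (link.getAttribute('rel') || '').toLowerCase();
      if (rel === 'stylesheet') {
        const href = link.getAttribute('href');
        // Drop any existing query string before appending a fresh one.
        const fileRef = href.replace(/\?.*$/, '');
        link.setAttribute('href', `${fileRef}?v=${nanoid()}`);
      }
    });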
The code above transformed these two stylesheet links:
    <link rel="stylesheet" href="./assets/css/global.min.css?I1juSFg_GA0s6pyS5olWN">
    <link rel="stylesheet" href="./assets/css/main.css?o5qR8xufWxXzm-BKPb6m7">
into these two stylesheet links:
    <link rel="stylesheet" href="./assets/css/global.min.css?v=V2hRK2kZz2fHzaKUjB7aj">
    <link rel="stylesheet" href="./assets/css/main.css?v=DsS0PcGdPBi4zfPas1AaK">
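The query string swap is the heart of that transformation, so it may help to see it on its own. The following is only a sketch: bustHref and makeToken are hypothetical names, and the token generator is passed in so the example stays deterministic (in the real script it would be nanoid). The v= key is taken from the output shown above.

```javascript
// Hypothetical helper: strip any existing query string from an href and
// append a fresh cache-busting token under the `v` key.
function bustHref(href, makeToken) {
  const fileRef = href.replace(/\?.*$/, ''); // drop "?..." if present
  return `${fileRef}?v=${makeToken()}`;
}

// A fixed token keeps the example deterministic; this stands in for nanoid.
const fixedToken = () => 'TOKEN';

const withOldQuery = bustHref('./assets/css/main.css?o5qR8xufWxXzm-BKPb6m7', fixedToken);
const withoutQuery = bustHref('./assets/css/main.css', fixedToken);

console.log(withOldQuery);
console.log(withoutQuery);
```

Because the old query string is removed before the new one is added, running the script repeatedly replaces the token rather than stacking a new one on each time.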
With the links updated, the file contents can be written back out to the original file using doc.outerHTML as the new file contents:
    fs.writeFileSync(filename, doc.outerHTML);