
Using regular expressions to parse HTML: why not?

230

It seems like every question on stackoverflow where the asker is using regex to grab some information from HTML will inevitably have an “answer” that says not to use regex to parse HTML.

Why not? I’m aware that there are quote-unquote “real” HTML parsers out there like Beautiful Soup, and I’m sure they’re powerful and useful, but if you’re just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there just something fundamental that I don’t understand about regex that makes them a bad choice for parsing in general?

7

230

Fully parsing HTML is not possible with regular expressions, because it depends on matching each opening tag with its closing tag, which regexps cannot do.

Regular expressions can only match regular languages, but HTML is a context-free language and not a regular one. (As @StefanPochmann pointed out, regular languages are also context-free, so context-free doesn’t necessarily mean not regular.) The only thing you can do with regexps on HTML is apply heuristics, and those will not work in every case. For any regular expression, it should be possible to construct an HTML file that it matches wrongly.
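To make the tag-matching problem concrete, here is a minimal sketch using Python's standard `re` module. The sample strings and variable names are made up for illustration. It shows the two usual heuristics: a lazy quantifier, which stops at the first closing tag, and a greedy one, which runs on to the last.

```python
import re

# Illustrative input: one <div> nested inside another.
nested = "<div>outer <div>inner</div> tail</div>"

# Lazy quantifier: stops at the FIRST </div>, which actually closes the
# inner element, so the captured "content" is cut short.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Illustrative input: two sibling <div> elements.
siblings = "<div>a</div><div>b</div>"

# Greedy quantifier: runs to the LAST </div>, so it swallows the boundary
# between the two siblings.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))
print(repr(greedy))
```

Neither pattern knows which closing tag belongs to which opening tag. Each just picks the first or the last one it sees.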

12

  • 27

    Best answer so far. If it can only match regular grammars then we would need an infinitely large regexp to parse a context-free grammar like HTML. I love when these things have clear theoretical answers.

    – ntownsend

    Feb 26, 2009 at 14:48

  • 2

    I assumed we were discussing Perl-type regexes where they aren’t actually regular expressions.

    – Hank Gay

    Feb 26, 2009 at 16:12

  • 6

    Actually, .Net regular expressions can match opening with closing tags, to some extent, using balancing groups and a carefully crafted expression. Containing all of that in a regexp is still crazy of course, it would look like the great code Cthulhu and would probably summon the real one as well. And in the end it still won’t work for all cases. They say that if you write a regular expression that can correctly parse any HTML the universe will collapse onto itself.

    Sep 16, 2010 at 19:55

  • 5

    Some regex libs can do recursive regular expressions (effectively making them non-regular expressions) 🙂

    Aug 5, 2011 at 22:38

  • 46

    -1 This answer draws the right conclusion (“It’s a bad idea to parse HTML with Regex”) from wrong arguments (“Because HTML isn’t a regular language”). The thing that most people nowadays mean when they say “regex” (PCRE) is well capable not only of parsing context-free grammars (that’s trivial actually), but also of context-sensitive grammars (see stackoverflow.com/questions/7434272/…).

    – NikiC

    Sep 17, 2011 at 22:14

37

For quick’n’dirty jobs, a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.

The reason is that regexps can’t handle arbitrarily nested expressions. See Can regular expressions be used to match nested patterns?
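What nesting needs is some memory of how deep you currently are, which a parser keeps as a counter or stack. As a rough sketch, assuming Python and its standard `html.parser` module, the hypothetical `DivText` class below tracks `<div>` depth while collecting text. The class, the `div_text` helper, and the sample input are illustrative only.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect text found inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # how many <div> elements are currently open
        self.max_depth = 0    # deepest nesting seen so far
        self.chunks = []      # text pieces seen while inside any <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html):
    """Return (text inside divs, deepest nesting level) for an HTML string."""
    parser = DivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.max_depth


text, depth = div_text("<div>outer <div>inner</div> tail</div>")
print(repr(text), depth)
```

The counter can grow without bound as nesting gets deeper. That is exactly the state a classical regular expression has no way to hold.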

1

  • 1

    Some regex libs can do recursive regular expressions (effectively making them non-regular expressions) 🙂

    Aug 5, 2011 at 22:40

30

(From http://htmlparsing.com/regexes)

Say you’ve got a file of HTML where you’re trying to extract URLs from
<img> tags.

<img src="http://example.com/whatever.jpg">

So you write a regex like this in Perl:

if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}

In this case, $url will indeed contain
http://example.com/whatever.jpg. But what happens when
you start getting HTML like this:

<img src='http://example.com/whatever.jpg'>

or

<img src=http://example.com/whatever.jpg>

or

<img border=0 src="http://example.com/whatever.jpg">

or

<img
    src="http://example.com/whatever.jpg">

or you start getting false positives from

<!-- // commented out
<img src="http://example.com/outdated.png">
-->

It looks so simple, and it might be simple for a single, unchanging file, but for anything that you’re going to be doing on arbitrary HTML data, regexes are just a recipe for future heartache.
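For comparison, here is a hedged sketch of the same task in Python, using only the standard library (`re` and `html.parser`) rather than any particular third-party parser. The `ImgSrcCollector` class, the `img_srcs` helper, and the `a.jpg` to `e.jpg` file names are invented for illustration. The sample input mirrors the variants listed above.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def img_srcs(html):
    """Return the list of <img> src values found by the parser."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.srcs


# Variants from the discussion above: double quotes, single quotes,
# no quotes, an extra attribute, a line break, and a commented-out tag.
SAMPLES = """
<img src="http://example.com/a.jpg">
<img src='http://example.com/b.jpg'>
<img src=http://example.com/c.jpg>
<img border=0 src="http://example.com/d.jpg">
<img
    src="http://example.com/e.jpg">
<!-- // commented out
<img src="http://example.com/outdated.png">
-->
"""

# Python equivalent of the naive Perl pattern.
naive = re.findall(r'<img src="(.+)"', SAMPLES)
parsed = img_srcs(SAMPLES)

print("regex :", naive)
print("parser:", parsed)
```

On this input the naive pattern finds only the first image plus the commented-out one. The parser returns the five live images and skips the comment.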

3

  • 4

    This looks to be the real answer – while it’s probably possible to parse arbitrary HTML with regex, since today’s regexes are more than just finite automata, in order to parse arbitrary HTML and not just a concrete page you have to reimplement an HTML parser in regexp, and the regexes surely become 1000 times less readable.

    Aug 6, 2015 at 13:25


  • 1

    Hey Andy, I took the time to come up with an expression that supports your mentioned cases. stackoverflow.com/a/40095824/1204332 Let me know what you think! 🙂

    Oct 17, 2016 at 21:22

  • 2

    The reasoning in this answer is way outdated, and applies even less today than it did originally (which I think it didn’t). (Quoting OP: “if you’re just doing something simple, quick, or dirty…”.)

    – Sz.

    Mar 15, 2017 at 22:03