Categories
html regex xhtml

RegEx match open tags except XHTML self-contained tags

2024

I need to match all of these opening tags:

<p>
<a href="https://stackoverflow.com/questions/1732348/foo">

But not these:

<br />
<hr class="https://stackoverflow.com/questions/1732348/foo" />

I came up with this and wanted to make sure I’ve got it right. I am only capturing the a-z.

<([a-z]+) *[^/]*?>

I believe it says:

  • Find a less-than, then
  • Find (and capture) a-z one or more times, then
  • Find zero or more spaces, then
  • Find any character zero or more times, greedy, except /, then
  • Find a greater-than

Do I have that right? And more importantly, what do you think?
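One way to check a pattern like this is to run it against the four sample tags. The sketch below assumes Python's `re` module (the question does not name a language). Two things it brings out: `*?` is a lazy quantifier rather than a greedy one, and `[^/]` rejects any `/` before the closing `>`, including one inside an attribute value, so the `<a>` sample with a URL in its `href` is not matched.

```python
import re

# The pattern from the question, unchanged.
OPEN_TAG = re.compile(r'<([a-z]+) *[^/]*?>')

samples = [
    '<p>',
    '<a href="https://stackoverflow.com/questions/1732348/foo">',
    '<br />',
    '<hr class="https://stackoverflow.com/questions/1732348/foo" />',
]

def captured_name(tag):
    """Return the captured tag name, or None if the pattern does not match."""
    m = OPEN_TAG.search(tag)
    return m.group(1) if m else None

results = {tag: captured_name(tag) for tag in samples}
for tag, name in results.items():
    print(f'{tag!r} -> {name}')
```

Only `<p>` is captured here. The two self-closing tags are rejected as intended, but so is the `<a>` tag, because of the slashes in its URL.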

0

    4409

    You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer’s consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ


    Have you tried using an XML parser instead?
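As a rough illustration of the parser route, here is a minimal sketch that uses Python's standard-library `html.parser` (an HTML parser rather than an XML one, and not something the answer itself names). The class name `OpenTagCollector` and the helper `open_tags` are made up for this example. The parser reports XHTML-style self-closing tags through a separate callback, so they can simply be ignored.

```python
from html.parser import HTMLParser

class OpenTagCollector(HTMLParser):
    """Collect names of start tags that are not written as <tag ... />."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary start tags such as <p> or <a href="...">.
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br />; deliberately ignored.
        pass

def open_tags(markup):
    """Return the names of all non-self-closing start tags in markup."""
    parser = OpenTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.open_tags

sample = (
    '<p>'
    '<a href="https://stackoverflow.com/questions/1732348/foo">'
    '<br />'
    '<hr class="https://stackoverflow.com/questions/1732348/foo" />'
)
print(open_tags(sample))
```

The slashes inside the attribute values cause no trouble here, because the parser tokenizes quoted values before deciding how the tag ends.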


    Moderator’s Note

    This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look – there are no problems with its content. Please do not flag it for our attention.

    2

    • 179

      Kobi: I think it’s time for me to quit the post of Assistant Don’t Parse HTML With Regex Officer. No matter how many times we say it, they won’t stop coming every day… every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It’s only broken code, not life and death.

      – bobince

      Nov 13, 2009 at 23:18

    • 2

      If you can’t see this post, here’s a screencapture of it in all its glory: imgur.com/gOPS2.png

      Nov 19, 2009 at 14:37


    3483

    +50

    While parsing arbitrary HTML with only a regex is impossible, it’s sometimes appropriate to use regexes for parsing a limited, known set of HTML.

    If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament’s web site. This was a limited, one-time job.

    Regexes worked just fine for me, and were very fast to set up.
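A minimal sketch of that kind of one-off scrape, assuming every record on the page follows one fixed template. The markup, class names, and values below are hypothetical placeholders, not the Parliament site's actual HTML.

```python
import re

# Hypothetical markup: every row follows exactly this template.
PAGE = """
<tr><td class="name">Example Member A</td><td class="party">Party X</td><td class="district">District 1</td></tr>
<tr><td class="name">Example Member B</td><td class="party">Party Y</td><td class="district">District 2</td></tr>
"""

# One named group per field; [^<]+ stops at the next tag.
ROW = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>'
    r'<td class="party">(?P<party>[^<]+)</td>'
    r'<td class="district">(?P<district>[^<]+)</td>'
)

records = [m.groupdict() for m in ROW.finditer(PAGE)]
for record in records:
    print(record)
```

This only holds up while the template stays byte-for-byte stable, which is the "limited, known set" condition the answer describes.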

    27

    • 147

      Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser. And if you are comfortable with coding regexes, way faster to code than coding xpaths. And almost certainly less fragile to changes in what you are scraping. So bleh.

      Apr 17, 2012 at 20:47

    • 298

      @MichaelJohnston “Less fragile”? Almost certainly not. Regexes care about text-formatting details that an XML parser can silently ignore. Switching between &foo; encodings and CDATA sections? Using an HTML minifier to remove all whitespace in your document that the browser doesn’t render? An XML parser won’t care, and neither will a well-written XPath statement. A regex-based “parser”, on the other hand…

      Jul 11, 2012 at 16:03
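The fragility point can be sketched with Python's standard-library `xml.etree.ElementTree`. The four documents below are made-up examples that carry the same data with different encodings and whitespace, and the `NAIVE` pattern is a hypothetical regex written against the first variant only.

```python
import re
import xml.etree.ElementTree as ET

# Four spellings of the same data (hypothetical sample documents).
variants = [
    '<item><name>Tom &amp; Jerry</name></item>',          # entity reference
    '<item><name><![CDATA[Tom & Jerry]]></name></item>',  # CDATA section
    '<item>\n  <name>Tom &amp; Jerry</name>\n</item>',    # pretty-printed
    '<item><name >Tom &#38; Jerry</name></item>',         # space in tag, numeric reference
]

# A regex written against the first variant only.
NAIVE = re.compile(r'<item><name>([^<]*)</name></item>')

def with_parser(doc):
    """Extract the name via an XML parser."""
    return ET.fromstring(doc).findtext('name')

def with_regex(doc):
    """Extract the name via the naive regex, or None if it does not match."""
    m = NAIVE.search(doc)
    return m.group(1) if m else None

parsed = [with_parser(d) for d in variants]
matched = [with_regex(d) for d in variants]
print(parsed)
print(matched)
```

The parser returns `Tom & Jerry` for all four variants. The regex matches only the first one, and even there it returns the undecoded `Tom &amp; Jerry`.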

    • 44

      @CharlesDuffy for a one-time job it’s OK, and for spaces we use \s+

      – quantum

      Jul 12, 2012 at 13:50

    • 76

      @xiaomao indeed, if having to know all the gotchas and workarounds to get an 80% solution that fails the rest of the time “works for you”, I can’t stop you. Meanwhile, I’m over on my side of the fence using parsers that work on 100% of syntactically valid XML.

      Jul 12, 2012 at 16:07

    • 417

      I once had to pull some data off ~10k pages, all with the same HTML template. They were littered with HTML errors that caused parsers to choke, and all their styling was inline or with <font> etc.: no classes or IDs to help navigate the DOM. After fighting all day with the “right” approach, I finally switched to a regex solution and had it working in an hour.

      Sep 7, 2012 at 7:14

    2222

    I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and a regular expression is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), it is mathematically impossible to parse XML with a regular expression.

    But many will try, and some will even claim success, but only until others find the fault and totally mess you up.
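A small sketch of where nesting bites, in Python and assuming well-formed input with only bare `<div>` tags. The helper `outer_div_content` is hypothetical. It still uses a regex to find tags, but it adds a depth counter, which is exactly the memory a regular expression in the formal sense does not have.

```python
import re

nested = '<div>a<div>b</div>c</div>'
siblings = '<div>x</div><div>y</div>'

# Lazy matching stops at the first </div>, which belongs to the inner element.
lazy = re.search(r'<div>(.*?)</div>', nested).group(1)

# Greedy matching runs to the last </div>, swallowing a sibling element.
greedy = re.search(r'<div>(.*)</div>', siblings).group(1)

def outer_div_content(text):
    """Return the content of the first <div>, tracking nesting depth."""
    depth = 0
    start = None
    for m in re.finditer(r'<div>|</div>', text):
        if m.group() == '<div>':
            if depth == 0:
                start = m.end()  # content begins after the outermost <div>
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                return text[start:m.start()]
    return None  # no balanced <div> found

print(lazy)
print(greedy)
print(outer_div_content(nested))
print(outer_div_content(siblings))
```

Neither quantifier choice gets both inputs right, while the counter-based version returns `a<div>b</div>c` and `x`.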

    12

    • 259

      The OP is asking to parse a very limited subset of XHTML: start tags. What makes (X)HTML a CFG is its potential to have elements between the start and end tags of other elements (as in a grammar rule A -> s A e). (X)HTML does not have this property within a start tag: a start tag cannot contain other start tags. The subset that the OP is trying to parse is not a CFG.

      – LarsH

      Mar 2, 2012 at 8:43
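Since a start tag cannot contain another start tag, a single pattern can tokenize it without any nesting. The following is a sketch only, in Python, and not a pattern taken from the thread. It assumes attribute values are quoted, or unquoted without any `/`, and it ignores comments, CDATA sections, and `<script>` bodies, where tag-like text can appear.

```python
import re

START_TAG = re.compile(
    r'''<([A-Za-z][A-Za-z0-9]*)      # tag name (captured)
        (?:\s+[^\s"'<>/=]+           # attribute name
           (?:\s*=\s*
              (?:"[^"]*"|'[^']*'|[^\s"'=<>`/]+)   # quoted or simple unquoted value
           )?
        )*
        \s*>                         # closing bracket with no slash before it
    ''',
    re.VERBOSE,
)

sample = (
    '<p>'
    '<a href="https://stackoverflow.com/questions/1732348/foo">'
    '<br />'
    '<hr class="https://stackoverflow.com/questions/1732348/foo" />'
)
names = [m.group(1) for m in START_TAG.finditer(sample)]
print(names)

# A ">" inside a quoted value does not end the tag early.
tricky = '<a title="a > b" href="x">'
whole = START_TAG.search(tricky).group(0)
print(whole)
```

Consuming quoted values as units is what lets slashes and `>` characters inside attributes pass through without ending the tag.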

    • 118

      In CS theory, regular languages are a strict subset of context-free languages, but regular expression implementations in mainstream programming languages are more powerful. As noulakaz.net/weblog/2007/03/18/… describes, so-called “regular expressions” can check for prime numbers in unary, which is certainly something that a regular expression from CS theory can’t accomplish.

      Mar 19, 2012 at 23:50
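The unary primality trick is usually written with a backreference, as in the sketch below. The pattern is the commonly circulated one, shown here with Python's `re` as an assumption about the engine, and the helper `is_prime` is made up for this example. It matches strings of `1`s whose length is 0, 1, or a composite number, which is possible only because `\1` goes beyond what a regular expression in the formal sense can express.

```python
import re

# Matches a unary string whose length is NOT prime:
#   ^1?$           length 0 or 1
#   ^(11+?)\1+$    a block of two or more 1s repeated at least twice
NOT_PRIME = re.compile(r'^1?$|^(11+?)\1+$')

def is_prime(n):
    """Return True if n is prime, judged by the unary regex."""
    return NOT_PRIME.match('1' * n) is None

primes = [n for n in range(30) if is_prime(n)]
print(primes)
```

The set of prime-length unary strings is not a regular language, so an engine that accepts this pattern is strictly more powerful than a finite automaton.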

    • 17

      @eyelidlessness: the same “only if” applies to all CFGs, does it not? I.e. if the (X)HTML input is not well-formed, not even a full-blown XML parser will work reliably. Maybe if you give examples of the “(X)HTML syntax errors implemented in real world user agents” you’re referring to, I’ll understand what you’re getting at better.

      – LarsH

      May 22, 2012 at 5:09

    • 93

      @AdamMihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (e.g. non-greedy matching, backrefs). Some regex engines (such as Perl’s) are Turing complete. It’s true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why.

      May 31, 2012 at 13:44

    • 30

      This is the most “full and short” answer here. It leads people to learn the basics of formal grammars and languages, and hopefully some maths, so they will not waste time on hopeless things like solving NP-tasks in polynomial time.

      Apr 19, 2013 at 12:15