What to do when a regular expression pattern doesn’t match anywhere in the string?

193

I am trying to match <input> fields of type “hidden” using this pattern:

/<input type="hidden" name="([^"]*?)" value="([^"]*?)" />/

This is sample form data:

<input type="hidden" name="SaveRequired" value="False" /><input type="hidden" name="__VIEWSTATE1" value="1H4sIAAtzrkX7QfL5VEGj6nGi+nP" /><input type="hidden" name="__VIEWSTATE2" value="0351118MK" /><input type="hidden" name="__VIEWSTATE3" value="ZVVV91yjY" /><input type="hidden" name="__VIEWSTATE0" value="3" /><input type="hidden" name="__VIEWSTATE" value="" /><input type="hidden" name="__VIEWSTATE" value="" />

But I am not sure that the type, name, and value attributes will always appear in the same order. If the type attribute comes last, the match will fail because in my pattern it’s at the start.
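To make the order dependence concrete, here is a minimal Python sketch (Python is used purely for illustration; the names `PATTERN`, `in_order`, and `reordered` and the second sample tag are assumptions, not part of the original form data). It applies the pattern above to one tag in the original attribute order and to one with `type` moved to the end.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r'<input type="hidden" name="([^"]*?)" value="([^"]*?)" />')

# Hypothetical sample tags: same attributes, different order.
in_order = '<input type="hidden" name="SaveRequired" value="False" />'
reordered = '<input name="SaveRequired" value="False" type="hidden" />'

match_in_order = PATTERN.search(in_order)
match_reordered = PATTERN.search(reordered)

print(match_in_order.groups() if match_in_order else None)  # ('SaveRequired', 'False')
print(match_reordered)  # None
```

The second search finds nothing because the pattern hard-codes the literal text `type="hidden" name=` at the start of the tag.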

Question:
How can I change my pattern so it will match regardless of the positions of the attributes in the <input> tag?

P.S.: By the way, I am using the Adobe AIR-based RegEx Desktop Tool for testing regular expressions.

2

  • 4

    Regex is an excellent solution if you have control over the generated HTML (that is the whole regular vs. not regular debate), but in my case I don’t know how the HTML is going to change in the future, so the best thing is to use a parser instead of regex. I have used regex in my project in the parts that I have control over.

    – Salman

    Nov 20, 2010 at 21:18

  • 1

    The Stack Overflow classic is the question for which an answer starts with “You can’t parse [X]HTML with regex.”

    Jul 22, 2020 at 12:51

112

Contrary to all the answers here, for what you’re trying to do, regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags; THAT would be impossible with regex! But you are only matching what’s in one tag, and that’s perfectly regular.

Here’s the problem, though. You can’t do it with just one regex… you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values have a > character in them, so it’s not perfect, but it should suffice for sane inputs.

Here’s some Perl (pseudo)code to show you what I mean:

use strict;
use warnings;

# Placeholder: readLargeInputFile() stands in for whatever routine returns the HTML as one string.
my $html = readLargeInputFile();

# m{...} is used as the delimiter so that "/" needs no escaping, in the pattern or in its comments.
my @input_tags = $html =~ m{
    (
        <input                      # Starts with "<input"
        (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden" is in the tag
        [^>]+                       # Grab the rest of the tag...
        />                          # ...except for the closing slash and bracket, grabbed here
    )}xg;

# Now each member of @input_tags is something like <input type="hidden" name="SaveRequired" value="False" />

my @fields;
foreach my $input_tag (@input_tags)
{
  my $hash_ref = {};
  # Now extract each of the fields one at a time.

  ($hash_ref->{"name"}) = $input_tag =~ /name="([^"]*)"/;
  ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

  # Collect the result; process @fields however you need afterwards.
  push @fields, $hash_ref;
}

The basic principle here is, don’t try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you’re trying to extract, then do submatching on the data you want.
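The same two-pass idea can be sketched in Python. This is only an illustration under the same assumption as the Perl version (no attribute value contains a `>` character, and attributes are double-quoted); the names `HIDDEN_INPUT`, `NAME`, `VALUE`, `hidden_fields`, and the `sample` string are hypothetical and not from the original answer.

```python
import re

# Pass 1: find whole <input ... /> tags that contain type="hidden" somewhere.
# Like the Perl version, this assumes no attribute value contains ">".
HIDDEN_INPUT = re.compile(r'<input(?=[^>]*?type="hidden")[^>]+/>')

# Pass 2: pull individual attributes out of a single tag, in any order.
NAME = re.compile(r'name="([^"]*)"')
VALUE = re.compile(r'value="([^"]*)"')


def hidden_fields(html):
    """Return a list of {'name': ..., 'value': ...} dicts for hidden inputs."""
    fields = []
    for tag in HIDDEN_INPUT.findall(html):
        name = NAME.search(tag)
        value = VALUE.search(tag)
        fields.append({
            "name": name.group(1) if name else None,
            "value": value.group(1) if value else None,
        })
    return fields


# Hypothetical sample: two hidden inputs in different attribute orders,
# plus one visible input that should be skipped.
sample = (
    '<input type="hidden" name="SaveRequired" value="False" />'
    '<input value="3" name="__VIEWSTATE0" type="hidden" />'
    '<input type="text" name="visible" value="x" />'
)
print(hidden_fields(sample))
```

Because the second pass searches each captured tag separately for `name` and `value`, the order of the attributes inside the tag no longer matters.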

EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. 🙂 But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.

17

  • 15

    Not contrary to all the answers here. 🙂

    – tchrist

    Nov 20, 2010 at 20:02

  • 6

    @tchrist: Your answer wasn’t here when I posted mine. 😉

    Nov 20, 2010 at 21:51

  • 8

    yah well — for some reason it took me longer to type than yours did. I think my keyboard must need greasing. 🙂

    – tchrist

    Nov 20, 2010 at 21:53

  • 6

    That’s invalid HTML – it should be value="&lt;Are you really sure about this?&gt;" If the place he’s scraping does a poor job escaping things like this, then he’ll need a more sophisticated solution – but if they do it right (and if he has control over it, he should make sure it’s right) then he’s fine.

    Jul 8, 2011 at 12:45

  • 14

    Obligatory link to the best SO answer on the subject (possibly best SO answer period): stackoverflow.com/questions/1732348/…

    Jul 8, 2011 at 14:29

130

  1. You can write a novel like tchrist did
  2. You can use a DOM library, load the HTML, and query it with the XPath expression //input[@type="hidden"]. Or, if you don’t want to use XPath, just get all the inputs and filter for the hidden ones with getAttribute.

I prefer #2.

<?php

$d = new DOMDocument();
$d->loadHTML(
    '
    <p>fsdjl</p>
    <form><div>fdsjl</div></form>
    <input type="hidden" name="blah" value="hide yo kids">
    <input type="text" name="blah" value="hide yo kids">
    <input type="hidden" name="blah" value="hide yo wife">
');
// Query the parsed document with XPath; the attribute order inside each tag is irrelevant here.
$x = new DOMXPath($d);
$inputs = $x->evaluate('//input[@type="hidden"]');

foreach ( $inputs as $input ) {
    echo $input->getAttribute('value'), '<br>';
}

Result:

hide yo kids<br>hide yo wife<br>
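For comparison, here is a rough Python counterpart that uses only the standard library's `html.parser` module. It follows the "get all inputs and filter" variant rather than XPath, since the standard library has no XPath support for loose HTML. The class name `HiddenInputCollector` and the sample markup are assumptions made for this sketch.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect the attributes of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.hidden_inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; order does not matter here.
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            self.hidden_inputs.append(attributes)


collector = HiddenInputCollector()
collector.feed(
    '<p>fsdjl</p>'
    '<input type="hidden" name="blah" value="hide yo kids">'
    '<input type="text" name="blah" value="hide yo kids">'
    '<input value="hide yo wife" name="blah" type="hidden" />'
)
collector.close()

values = [attributes.get("value") for attributes in collector.hidden_inputs]
print(values)  # ['hide yo kids', 'hide yo wife']
```

The parser hands over each tag's attributes already split into pairs, so neither attribute order nor the self-closing slash needs special handling.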

5

  • 77

    That was kinda my point, actually. I wanted to show how hard it is.

    – tchrist

    Nov 20, 2010 at 19:38

  • 19

    Very good stuff there. I had really hoped people would show how much easier it is using a parsing class, so thanks! I just wanted a working example of the extreme trouble you have to go through to do it from scratch using regexes. I sure hope most people decide to use prefab parsers on generic HTML instead of rolling their own. Regexes are still great for simple HTML they made themselves, though, because that gets rid of 99.98% of the complexity.

    – tchrist

    Nov 20, 2010 at 19:48

  • 5

    What would be nice after reading those 2 very interesting approaches would be comparing the speed/memory usage/CPU of one approach against another (i.e. regex-based VS parsing class).

    Aug 31, 2014 at 10:25

  • 1

    @Avt’W Yeah, not that you should go write a ‘novel’ if regexes happen to be faster, but it really would be interesting to know. 🙂 My guess, though, is that a parser takes fewer resources, too.

    – Dennis98

    Sep 20, 2016 at 13:42

  • This is actually why XPath was invented in the first place!

    Jul 22, 2017 at 7:38
