
How do you access the matched groups in a JavaScript regular expression?

1633

I want to match a portion of a string using a regular expression and then access that parenthesized substring:

    var myString = "something format_abc"; // I want "abc"

    var arr = /(?:^|\s)format_(.*?)(?:\s|$)/.exec(myString);

    console.log(arr);     // Prints: [" format_abc", "abc"] .. so far so good.
    console.log(arr[1]);  // Prints: undefined  (???)
    console.log(arr[0]);  // Prints: format_undefined (!!!)

What am I doing wrong?


I’ve discovered that there was nothing wrong with the regular expression code above: the actual string which I was testing against was this:

    "date format_%A"

Reporting that “%A” is undefined seems a very strange behaviour, but it is not directly related to this question, so I’ve opened a new one, Why is a matched substring returning “undefined” in JavaScript?.


The issue was that console.log takes its parameters like a printf statement, and since the string I was logging ("%A") had a special value, it was trying to find the value of the next parameter.
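
A minimal sketch of one way to sidestep this, assuming a console that follows the usual printf-style convention (browsers and Node.js do): pass a fixed "%s" format string first, so the logged value is only ever treated as data. The helper name logLiteral is hypothetical.

```javascript
// Hypothetical helper: logs a value literally, even if it contains "%" sequences.
function logLiteral(value) {
  // Only the first argument is treated as a format string, so `value`
  // is substituted for "%s" and never interpreted itself.
  console.log("%s", value);
}

var arr = /(?:^|\s)format_(.*?)(?:\s|$)/.exec("date format_%A");
logLiteral(arr[1]); // %A
```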

0

    1913

    You can access capturing groups like this:

    var myString = "something format_abc";
    var myRegexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
    // Equivalent constructor form; backslashes must be doubled inside a string literal:
    // var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g");
    var match = myRegexp.exec(myString);
    console.log(match[1]); // abc

    And if there are multiple matches you can iterate over them:

    var myString = "something format_abc";
    var myRegexp = new RegExp("(?:^|\\s)format_(.*?)(?:\\s|$)", "g"); // note the doubled backslashes
    var match = myRegexp.exec(myString);
    while (match != null) {
      // matched text: match[0]
      // match start: match.index
      // capturing group n: match[n]
      console.log(match[0])
      match = myRegexp.exec(myString);
    }
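
    The loop above terminates because a regular expression with the g flag remembers where the previous match ended in its lastIndex property, and exec resumes from there. A small sketch that records this; the sample string and the positions array are made up for illustration:

```javascript
var re = /(?:^|\s)format_(.*?)(?:\s|$)/g;
var str = "something format_abc something format_def";
var positions = [];
var m;

while ((m = re.exec(str)) !== null) {
  // group 1, where the match starts, and where the next search will resume
  positions.push([m[1], m.index, re.lastIndex]);
}

console.log(positions);    // [ [ 'abc', 9, 21 ], [ 'def', 30, 41 ] ]
console.log(re.lastIndex); // 0, reset once exec returns null
```

    Without the g flag, lastIndex is never advanced, so exec would return the first match on every call and the loop would not end.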

    Edit: 2019-09-10

    As you can see, the way to iterate over multiple matches was not very intuitive. This led to the proposal of the String.prototype.matchAll method. This new method is expected to ship in the ECMAScript 2020 specification. It gives us a clean API and solves multiple problems. It has started to land in major browsers and JS engines: Chrome 73+ / Node 12+ and Firefox 67+.

    The method returns an iterator and is used as follows:

    const string = "something format_abc";
    const regexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
    const matches = string.matchAll(regexp);
        
    for (const match of matches) {
      console.log(match);
      console.log(match.index)
    }

    Because it returns an iterator, it is lazy: matches are computed only as they are requested, which is useful when handling a particularly large number of matches or very large strings. If you need to, you can easily turn the result into an Array with the spread syntax or the Array.from method:

    function getFirstGroup(regexp, str) {
      const array = [...str.matchAll(regexp)];
      return array.map(m => m[1]);
    }
    
    // or:
    function getFirstGroup(regexp, str) {
      return Array.from(str.matchAll(regexp), m => m[1]);
    }
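
    To illustrate the laziness mentioned earlier, here is a small sketch that pulls matches one at a time from the iterator instead of collecting them all. The sample string is made up for illustration:

```javascript
const text = "something format_abc something format_def";
const iterator = text.matchAll(/(?:^|\s)format_(.*?)(?:\s|$)/g);

// Each call to next() runs the regexp just far enough to find one more match.
const first = iterator.next();
console.log(first.value[1]); // abc

const second = iterator.next();
console.log(second.value[1]); // def

const end = iterator.next();
console.log(end.done); // true
```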
    

    In the meantime, while this proposal gains wider support, you can use the official shim package.

    Also, the internal workings of the method are simple. An equivalent implementation using a generator function would be as follows:

    function* matchAll(str, regexp) {
      const flags = regexp.global ? regexp.flags : regexp.flags + "g";
      const re = new RegExp(regexp, flags);
      let match;
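      while ((match = re.exec(str)) !== null) {
        yield match;
        // Step past zero-length matches so the loop always advances.
        if (match[0] === "") re.lastIndex++;
      }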
    }
    

    A copy of the original regexp is created; this is to avoid side effects due to the mutation of the lastIndex property when going through the multiple matches.

    Also, we need to ensure the regexp has the global flag to avoid an infinite loop.
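
    For comparison, a short sketch of how the built-in method behaves in current engines, as far as I know: the regexp passed in keeps its lastIndex, and a regexp without the g flag is rejected with a TypeError rather than being given one. The sample regexp and string are made up for illustration.

```javascript
const regexp = /format_(\w+)/g;
const groups = Array.from("format_abc format_def".matchAll(regexp), m => m[1]);
console.log(groups);           // [ 'abc', 'def' ]
console.log(regexp.lastIndex); // 0, the original object was not advanced

let rejected = false;
try {
  "format_abc".matchAll(/format_(\w+)/); // no g flag
} catch (e) {
  rejected = e instanceof TypeError;
}
console.log(rejected); // true in engines that follow the final specification
```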

    I’m also happy to see that even this Stack Overflow question was referenced in the discussions of the proposal.

    6

    • 116

      +1 Please note that in the second example you should use the RegExp object (not only “/myregexp/”), because it keeps the lastIndex value in the object. Without using the Regexp object it will iterate infinitely

      – ianaz

      Aug 28, 2012 at 12:06


    • 7

      @ianaz: I don’t believe ’tis true? http://jsfiddle.net/weEg9/ seems to work on Chrome, at least.

      Oct 16, 2012 at 7:26

    • 17

      Why do the above instead of: var match = myString.match(myRegexp); // alert(match[1])?

      – JohnAllen

      Dec 30, 2013 at 17:39

    • 29

      No need for explicit “new RegExp”, however the infinite loop will occur unless /g is specified

      Jun 6, 2014 at 18:33


    • Important to note that the 0th index is the entire match, so const [_, group1, group2] = myRegex.exec(myStr); is my pattern.

      Aug 24, 2021 at 7:32

    200

    Here’s a method you can use to get the nth capturing group for each match:

    function getMatches(string, regex, index) {
      if (index === undefined) index = 1; // default to the first capturing group; 0 still selects the whole match
      var matches = [];
      var match;
      while (match = regex.exec(string)) {
        matches.push(match[index]);
      }
      return matches;
    }
    
    
    // Example :
    var myString = 'something format_abc something format_def something format_ghi';
    var myRegEx = /(?:^|\s)format_(.*?)(?:\s|$)/g;
    
    // Get an array containing the first capturing group for every match
    var matches = getMatches(myString, myRegEx, 1);
    
    // Log results
    document.write(matches.length + ' matches found: ' + JSON.stringify(matches))
    console.log(matches);

    1

    • 13

      This is a far superior answer to the others because it correctly shows iteration over all matches instead of only getting one.

      – Rob Evans

      May 11, 2013 at 12:08

    64

    var myString = "something format_abc";
    var arr = myString.match(/\bformat_(.*?)\b/);
    console.log(arr[0] + " " + arr[1]);

    The \b isn’t exactly the same thing. (It works on --format_foo/, but doesn’t work on format_a_b) But I wanted to show an alternative to your expression, which is fine. Of course, the match call is the important thing.
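
    A small sketch comparing the two expressions on a few inputs may make the difference concrete. The helper firstGroup and the "format_a-b" input are hypothetical, added only for illustration:

```javascript
var wordBoundary = /\bformat_(.*?)\b/;
var whitespace = /(?:^|\s)format_(.*?)(?:\s|$)/;

// Hypothetical helper: first capturing group, or null when there is no match.
function firstGroup(regex, str) {
  var m = str.match(regex);
  return m ? m[1] : null;
}

console.log(firstGroup(wordBoundary, "--format_foo/")); // foo
console.log(firstGroup(whitespace, "--format_foo/"));   // null
console.log(firstGroup(wordBoundary, "format_a_b"));    // a_b
console.log(firstGroup(whitespace, "format_a_b"));      // a_b
console.log(firstGroup(wordBoundary, "format_a-b"));    // a
console.log(firstGroup(whitespace, "format_a-b"));      // a-b
```

    Since "_" counts as a word character, \b does not split format_a_b, but it does stop at punctuation such as "-" or "/".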

    4

    • 2

      It’s exactly reverse. ‘\b’ delimits words. word= ‘\w’ = [a-zA-Z0-9_] . “format_a_b” is a word.

      – B.F.

      Apr 22, 2015 at 21:09


    • 1

      @B.F. Honestly, I added “doesn’t work on format_a_b” as an afterthought 6 years ago, and I don’t recall what I meant there… 🙂 I suppose it meant “doesn’t work to capture a only”, i.e. the first alphabetical part after format_.

      – PhiLho

      Apr 23, 2015 at 7:41


    • 1

      I wanted to say that \b(–format_foo/}\b do not return “–format_foo/” because “-” and “/” are no \word characters. But \b(format_a_b)\b do return “format_a_b”. Right? I refer to your text statement in round brackets. (Did no down vote!)

      – B.F.

      Apr 23, 2015 at 10:43

    • 1

      Note that the g flag is important here. If the g flag is added to the pattern, you’ll get an array of matches disregarding capture groups. "a b c d".match(/(\w) (\w)/g); => ["a b", "c d"] but "a b c d".match(/(\w) (\w)/); => ["a b", "a", "b", index: 0, input: "a b c d", groups: undefined].

      – ggorlen

      May 15, 2021 at 19:18
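
    The point about the g flag in the last comment can be checked with a runnable sketch. It reuses the comment's sample string; the matchAll call at the end is one possible way (an exec loop would be another) to keep the groups for every match:

```javascript
var str = "a b c d";

// With g: every full match, capture groups discarded.
var withG = str.match(/(\w) (\w)/g);
console.log(withG); // [ 'a b', 'c d' ]

// Without g: only the first match, but with its groups and index.
var withoutG = str.match(/(\w) (\w)/);
console.log(withoutG[0], withoutG[1], withoutG[2]); // a b a b
console.log(withoutG.index); // 0

// Groups for every match:
var pairs = Array.from(str.matchAll(/(\w) (\w)/g), function (m) {
  return [m[1], m[2]];
});
console.log(pairs); // [ [ 'a', 'b' ], [ 'c', 'd' ] ]
```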