My regex is matching too much. How do I make it stop? [duplicate]

113

I have this gigantic ugly string:

J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully

I’m trying to extract pieces from it using regex. In this case, I want to grab everything after Project name: up to the part where it says J0000011: (the 11 is going to be a different number every time).

Here’s the regex I’ve been playing with:

Project name:\s+(.*)\s+J[0-9]{7}:

The problem is that it doesn’t stop until it hits the J0000020: at the end.

How do I make the regex stop at the first occurrence of J[0-9]{7}?

1

  • Project name:[^\n]*\n(J[0-9]{7})

    – Aphton

    May 6, 2019 at 20:36

168

Make .* non-greedy by adding ‘?’ after it:

Project name:\s+(.*?)\s+J[0-9]{7}:
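To see the difference side by side, here is a minimal Python sketch. It assumes the log is one multi-line string and that the dot is allowed to match newlines (re.DOTALL in Python), which is one way to reproduce the over-matching described in the question. The variable names are illustrative only.

```python
import re

# The log text from the question, one entry per line.
text = (
    "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM\n"
    "J0000010: Project name: E:\\foo.pf\n"
    "J0000011: Job name: MBiek Direct Mail Test\n"
    "J0000020: Document 1 - Completed successfully\n"
)

# Assumption: "." may cross newlines (re.DOTALL), as in the question's setup.
greedy = re.search(r"Project name:\s+(.*)\s+J[0-9]{7}:", text, re.DOTALL)
lazy = re.search(r"Project name:\s+(.*?)\s+J[0-9]{7}:", text, re.DOTALL)

# Greedy runs on to the last J-number; lazy stops at the first one.
print(repr(greedy.group(1)))  # 'E:\\foo.pf\nJ0000011: Job name: MBiek Direct Mail Test'
print(repr(lazy.group(1)))    # 'E:\\foo.pf'
```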

    15

    Using a non-greedy quantifier here is probably the best solution, partly because it is more efficient than the greedy alternative: a greedy match first goes as far as it can (here, to the end of the text!) and then backtracks character by character, each time trying to match the part that comes afterwards.

    However, consider using a negative character class instead:

    Project name:\s+(\S*)\s+J[0-9]{7}:

    \S means “anything except whitespace”, and this is exactly what you want.
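As a rough illustration in Python (not the flavor the asker necessarily uses), the \S* version needs no lazy quantifier and no special flags. The second sample string is made up for this sketch, to show the one caveat: a value that itself contains a space is not matched by \S*.

```python
import re

pattern = r"Project name:\s+(\S*)\s+J[0-9]{7}:"

text = (
    "J0000010: Project name: E:\\foo.pf\n"
    "J0000011: Job name: MBiek Direct Mail Test\n"
)
match = re.search(pattern, text)
project = match.group(1)
print(project)  # E:\foo.pf

# Hypothetical input with a space inside the value: \S* cannot span it,
# so the overall pattern fails instead of over-matching.
spaced = (
    "J0000010: Project name: E:\\my project.pf\n"
    "J0000011: Job name: x\n"
)
no_match = re.search(pattern, spaced)
print(no_match)  # None
```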

    1

    • When possible to implement, a greedy negative (or positive) character class will usually perform notably better than a lazy quantifier. Laziness requires the engine to forward-track character by character, checking the pattern that follows each time until it matches; a greedy character class can mindlessly repeat just the desired characters, which can be a lot quicker. So, you might consider making a stronger case for a negative character class, seeing as this is the greedy-vs-lazy canonical.

      Oct 30, 2018 at 9:26

    6

    Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, before the "." consumes each further character, the regex engine first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
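A tiny Python sketch of that last point, using a throwaway string "abc" chosen only for illustration:

```python
import re

text = "abc"

# Greedy: takes as much as possible.
greedy_all = re.match(r".*", text).group()
# Lazy with nothing after it: the empty match already satisfies the pattern.
lazy_empty = re.match(r".*?", text).group()
# Lazy with something after it: expands only until "c" can match.
lazy_anchored = re.match(r".*?c", text).group()

print(repr(greedy_all))     # 'abc'
print(repr(lazy_empty))     # ''
print(repr(lazy_anchored))  # 'abc'
```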

    Here’s what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.

    string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
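For readers outside .NET, a comparable sketch in Python might look like the following. It is not a line-for-line port: Python spells the named group (?P<name>...), and this version assumes the entries may be separated by newlines, so it uses \s+ where the .NET pattern has a literal space. Falling back to an empty string when nothing matches is a choice made for this sketch.

```python
import re

# Two entries from the question's log, separated by a newline.
s = (
    "J0000010: Project name: E:\\foo.pf\n"
    "J0000011: Job name: MBiek Direct Mail Test\n"
)

# Lazy named group, stopping at the first J-number that follows.
m = re.search(r"Project name:\s+(?P<name>.*?)\s+J\d+", s)
name = m.group("name") if m else ""
print(name)  # E:\foo.pf
```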