Categories
pattern-matching regex regex-negation

Regular expression to match a line that doesn’t contain a word

4999

I know it’s possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?

Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha

6

  • 97

    Probably a couple years late, but what’s wrong with: ([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$)))*? The idea is simple. Keep matching until you see the start of the unwanted string, then only match in the N-1 cases where the string is unfinished (where N is the length of the string). These N-1 cases are “h followed by non-e”, “he followed by non-d”, and “hed followed by non-e”. If you managed to pass these N-1 cases, you successfully didn’t match the unwanted string so you can start looking for [^h]* again

    Sep 29, 2011 at 3:44

  • 395

    @stevendesu: try this for ‘a-very-very-long-word’ or even better half a sentence. Have fun typing. BTW, it is nearly unreadable. Don’t know about the performance impact.

    Jan 30, 2012 at 18:45

  • 14

    @PeterSchuetze: Sure it’s not pretty for very very long words, but it is a viable and correct solution. Although I haven’t run tests on the performance, I wouldn’t imagine it being too slow since most of the latter rules are ignored until you see an h (or the first letter of the word, sentence, etc.). And you could easily generate the regex string for long strings using iterative concatenation. If it works and can be generated quickly, is legibility important? That’s what comments are for.

    Feb 2, 2012 at 3:14

  • 64

    @stevendesu: i’m even later, but that answer is almost completely wrong. for one thing, it requires the subject to contain “h” which it shouldn’t have to, given the task is “match lines which [do] not contain a specific word”. let us assume you meant to make the inner group optional, and that the pattern is anchored: ^([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$))?)*$ this fails when instances of “hede” are preceded by partial instances of “hede” such as in “hhede”.

    – jaytea

    Sep 10, 2012 at 10:41


  • 16

    This question has been added to the Stack Overflow Regular Expression FAQ, under “Advanced Regex-Fu”.

    Apr 10, 2014 at 1:30

6906

The notion that regex doesn’t support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

Non-capturing variant:

^(?:(?!:hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string ‘hede’. As mentioned, this is not something regex is “good” at (or should do), but still, it is possible.

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/

Explanation

A string is just a list of n characters. Before, and after each character, there’s an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
    
index    0      1      2      3      4      5      6      7

where the e‘s are the empty strings. The regex (?!hede). looks ahead to see if there’s no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width-assertions because they don’t consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there’s no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).

13

  • 37

    I would not go so far as to say that this is something regex is bad at. The convenience of this solution is pretty obvious and the performance hit compared to a programmatic search is often going to be unimportant.

    Mar 3, 2016 at 16:09

  • 49

    Strictly speaking negative loook-ahead makes you regular expression not-regular.

    – Peter K

    Nov 18, 2016 at 15:03

  • 87

    @PeterK, sure, but this is SO, not MathOverflow or CS-Stackexchange. People asking a question here are generally looking for a practical answer. Most libraries or tools (like grep, which the OP mentions) with regex-support all have features that mke them non-regular in a theoretical sense.

    Nov 18, 2016 at 15:08

  • 23

    @Bart Kiers, no offense to you answer, just this abuse of terminology irritates me a bit. The really confusing part here is that regular expressions in the strict sense can very much do what OP wants, but the common language to write them does not allow it, which leads to (mathematically ugly) workarounds like look-aheads. Please see this answer below and my comment there for (theoretically aligned) proper way of doing it. Needless to say it works faster on large inputs.

    – Peter K

    Nov 18, 2016 at 15:33

  • 21

    In case you ever wondered how to do this in vim: ^\(\(hede\)\@!.\)*$

    – baldrs

    Nov 24, 2016 at 11:58

878

Note that the solution to does not start with “hede”:

^(?!hede).*$

is generally much more efficient than the solution to does not contain “hede”:

^((?!hede).)*$

The former checks for “hede” only at the input string’s first position, rather than at every position.

7

  • 7

    Thanks, I used it to validate that the string dosn’t contain squence of digits ^((?!\d{5,}).)*

    – Samih A

    May 10, 2015 at 10:42

  • 5

    Hello! I can’t compose does not end with “hede” regex. Can you help with it?

    – Aleks Ya

    Oct 18, 2015 at 21:33

  • 3

    @AleksYa: just use the “contain” version, and include the end anchor into the search string: change the string to “not match” from “hede” to “hede$”

    – Nyerguds

    May 4, 2016 at 10:42


  • 5

    @AleksYa: the does not end version could be done using negative lookbehind as: (.*)(?<!hede)$. @Nyerguds’ version would work as well but completely misses the point on performance the answer mentions.

    Sep 14, 2017 at 16:53

  • 8

    Why are so many answers saying ^((?!hede).)*$ ? Is it not more efficient to use ^(?!.*hede).*$ ? It does the same thing but in fewer steps

    – JackPRead

    Jan 15, 2019 at 10:53


235

If you’re just using it for grep, you can use grep -v hede to get all lines which do not contain hede.

ETA Oh, rereading the question, grep -v is probably what you meant by “tools options”.

6

  • 27

    Tip: for progressively filtering out what you don’t want: grep -v “hede” | grep -v “hihi” | …etc.

    May 5, 2014 at 22:08


  • 55

    Or using only one process grep -v -e hede -e hihi -e ...

    Apr 26, 2015 at 5:42

  • 19

    Or just grep -v "hede\|hihi" 🙂

    – Putnik

    Dec 9, 2016 at 15:29

  • 4

    If you have many patterns that you want to filter out, put them in a file and use grep -vf pattern_file file

    Mar 11, 2018 at 18:35


  • 9

    Or simply egrep or grep -Ev "hede|hihi|etc" to avoid the awkward escaping.

    Jun 3, 2018 at 10:54