
What special characters must be escaped in regular expressions?

448

I am tired of always having to guess whether I should escape special characters like ‘()[]{}|’ etc. when using the many different implementations of regexps.

The rules differ between, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on.
Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?

8

412

Which characters you must and which you mustn’t escape indeed depends on the regex flavor you’re working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\
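As a rough illustration of the two lists above, here is a minimal sketch using Python's `re` module. Python's flavor is Perl-style rather than PCRE itself, so treat it as an approximation; the sample strings are made up for the example.

```python
import re

# Outside a character class: put a backslash before each metacharacter
# that should be matched literally.
outside = re.compile(r"\(a\+b\)\*2\.0")
print(bool(outside.fullmatch("(a+b)*2.0")))  # True

# Inside a character class: only ^ - ] \ need the backslash.
inside = re.compile(r"[\^\-\]\\]")
print(inside.findall(r"x^y-z]w\v"))  # ['^', '-', ']', '\\']

# Other metacharacters lose their special meaning inside the brackets.
unescaped = re.compile(r"[.(+]")
print(unescaped.findall("a.b(c+"))  # ['.', '(', '+']
```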

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use “clever placement” if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
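The placement trick can be tried without a POSIX tool at hand. The sketch below uses Python's `re` module, which is not a POSIX engine but happens to parse this particular class the same way; the input string is invented for the demonstration.

```python
import re

# ] comes first, ^ is not first, - comes last: all three are literals,
# and no backslash is needed anywhere in the class.
clever = re.compile(r"[]^-]")

found = clever.findall("a]b^c-d")
print(found)  # [']', '^', '-']
```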

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.
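To make the per-flavor lists concrete, here is a small sketch of a helper that escapes literal text for use outside a character class. The names `METACHARS` and `escape_literal` are hypothetical, and the character sets are copied from the lists above rather than from any library.

```python
# Metacharacters to escape outside character classes, per the lists above.
# (Hypothetical table for illustration; not part of any library.)
METACHARS = {
    "pcre": ".^$*+?()[{\\|",
    "ere": ".^$*+?()[{\\|",
    "bre": ".^$*[\\",
}


def escape_literal(text: str, flavor: str) -> str:
    """Backslash-escape only the characters that are special in `flavor`."""
    special = METACHARS[flavor]
    return "".join("\\" + ch if ch in special else ch for ch in text)


print(escape_literal("a+b (c)", "ere"))  # a\+b \(c\)
print(escape_literal("a+b (c)", "bre"))  # a+b (c)
print(escape_literal("1.5*[x]", "bre"))  # 1\.5\*\[x]
```

Note how the BRE result leaves `+`, `(` and `)` alone: adding a backslash there would switch their special meaning on instead of off.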

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

5

  • 13

    / is not a metacharacter in any of the regular expression flavors I mentioned, so the regular expression syntax does not require escaping it. When a regular expression is quoted as a literal in a programming language, then the string or regex formatting rules of that language may require / or " or ' to be escaped, and may even require `\` to be doubly escaped.

    Feb 6, 2015 at 23:39


  • 2

    What about the colon, “:”? Should it be escaped inside character classes as well as outside? en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions says “PCRE has consistent escaping rules: any non-alpha-numeric character may be escaped to mean its literal value […]”

    May 22, 2015 at 14:05

  • 5

    MAY be escaped is not the same as SHOULD be escaped. The PCRE syntax never requires a literal colon to be escaped, so escaping literal colons only makes your regex harder to read.

    Jun 9, 2015 at 7:52

  • 1

    For non-POSIX ERE (the one I use most often because it’s what’s implemented by Tcl), escaping other things doesn’t generate errors.

    – slebetman

    Aug 21, 2015 at 4:47

  • For JavaScript developers: const escapePCRE = string => string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); from Mozilla developer network.

    Sep 24, 2016 at 14:28


78

Modern RegEx Flavors (PCRE)

Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.
PCRE compatibility may vary

    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |


Legacy RegEx Flavors (BRE/ERE)

Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.
PCRE support may be enabled in later versions or by using extensions

ERE/awk/egrep/emacs

    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]

BRE/ed/grep/sed

    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don’t escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|


Notes

  • If unsure about a specific character, it can be escaped like \xFF
  • Alphanumeric characters cannot be escaped with a backslash
  • Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
  • Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
  • Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live
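The note about quoted expression strings applies to Python as well. The following sketch shows the two layers of escaping (string literal first, regex second); the sample path is made up.

```python
import re

# In a normal string literal the backslash must be doubled to reach the
# regex engine; a raw string is written exactly as the engine sees it.
dot_plain = "\\."
dot_raw = r"\."
print(dot_plain == dot_raw)  # True

# Matching one literal backslash needs \\ in the regex itself,
# so a normal string literal ends up with four backslashes.
backslash_plain = "\\\\"
backslash_raw = r"\\"
print(backslash_plain == backslash_raw)  # True

hits = re.findall(backslash_raw, r"C:\temp\file")
print(len(hits))  # 2
```

Raw strings remove the string-literal layer only; the regex-level escaping rules described in the answers still apply.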

6

  • 1

    There are many errors in your answer, including but not limited to: None of your “modern” flavors require - or ] to be escaped outside character classes. POSIX (BRE/ERE) doesn’t have an escape character inside character classes. The regex flavor in Delphi’s RTL is actually based on PCRE. Python, Ruby, and XML have their own flavors that are closer to PCRE than to the POSIX flavors.

    Feb 23, 2017 at 8:05

  • 1

    @JanGoyvaerts Thanks for the correction. The flavors you mentioned are indeed closer to PCRE. As for the escapes, I kept them that way for simplicity; it’s easier to remember just to escape everywhere than a few exceptions. Power users will know what’s up, if they want to avoid a few backslashes. Anyway, I updated my answer with a few clarifications that hopefully address some of this stuff.

    – Beejor

    Mar 7, 2017 at 3:15

  • I have been trying to find this for days! You are the BEST!

    – Caterina

    Oct 2, 2021 at 12:12

  • Do you need to escape “\” inside a character class as well?

    – Caterina

    Oct 2, 2021 at 12:17

  • Also, what about single quotes, double quotes and “/”? How do you get the literal values of them in BRE and ERE syntax?

    – Caterina

    Oct 2, 2021 at 12:20


23

Unfortunately, there really isn’t a fixed set of escape codes, since it varies based on the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.

1

  • 1

    The Addedbytes cheat sheet is grossly oversimplified, and has some glaring errors. For example, it says \< and \> are word boundaries, which is true only (AFAIK) in the Boost regex library. But elsewhere it says < and > are metacharacters and must be escaped (to \< and \>) to match them literally, which is not true in any flavor.

    Mar 7, 2017 at 5:00