Categories
character-properties python regex ucd unicode

Python regex matching Unicode properties

75

Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don’t see support for this in either the 2.x or the 3.x line of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

1

  • 18

    Actually, Perl supports all Unicode properties, not just the general categories. Examples include \p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}, and around 1,000 more. Only Perl’s and ICU’s regexes bother to cover the full complement of Unicode properties. Everybody else covers a tiny few, usually not even enough for minimal Unicode work.

    – tchrist

    Apr 25, 2011 at 23:03

25

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

2

7

You can painstakingly use unicodedata on each character:

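import unicodedata

def strip_accents(x):
    # Decompose to NFD so accents become separate combining marks,
    # then drop every character in category Mn (Mark, nonspacing).
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')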
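
If you do want something usable inside a regex, one homegrown option is to build a character class from unicodedata once and splice it into your pattern. The following is only a rough sketch, assuming Python 3; the helper name category_class is made up for illustration and is not part of any library. It covers general categories only, since unicodedata exposes nothing more.

```python
import re
import sys
import unicodedata


def category_class(*categories):
    """Build a regex character class for the given general categories.

    Hypothetical helper: scans every code point once and collapses
    consecutive matches into ranges, e.g. category_class('Ll', 'Lu').
    """
    wanted = set(categories)
    ranges = []  # list of [first, last] code point runs
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in wanted:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp          # extend the current run
            else:
                ranges.append([cp, cp])     # start a new run
    parts = []
    for first, last in ranges:
        # \UXXXXXXXX escapes avoid any quoting issues inside [...]
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"


# Rough equivalent of Perl's \p{Ll}+ and \p{Zs}
lower_run = re.compile(category_class("Ll") + "+")
space_sep = re.compile(category_class("Zs"))
print(lower_run.findall("Straße und Ärger"))
```

The scan over all code points takes a noticeable fraction of a second, so you would probably want to build each class once at import time and reuse the compiled pattern.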

2

  • Thanks. Although this is outside regexes, it might be a viable alternative in certain cases.

    – ThomasH

    Nov 13, 2010 at 21:09

  • It seems that the Python unicodedata module doesn’t presently contain information about e.g. the script or Unicode block of a character. See also stackoverflow.com/questions/48058402/…

    – tripleee

    Jan 10, 2018 at 6:44