Regular expressions

Regular expressions (regexes) are patterns that describe a set of strings. A pattern is typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
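The three uses named above can be sketched with Python's standard re module. This is a minimal illustration, not a recommended design: the sample text, the replacement word "hue", and the helper name is_eight_digits are assumptions made up for the example.

```python
import re

text = "The colour of the color wheel"

# "Find": collect every substring that matches the pattern.
found = re.findall(r"colou?r", text)

# "Find and replace": substitute each match with another string.
replaced = re.sub(r"colou?r", "hue", text)


def is_eight_digits(value: str) -> bool:
    """Input validation: True only if the whole string is exactly 8 digits."""
    # fullmatch requires the pattern to cover the entire string.
    return re.fullmatch(r"\d{8}", value) is not None


print(found)
print(replaced)
print(is_eight_digits("12345678"), is_eight_digits("1234"))
```

For validation, re.fullmatch is used instead of re.search so that extra characters before or after the digits cause a rejection.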

 

Metacharacters:

  • ?: Zero or one occurrence of the preceding element, e.g. colou?r matches both color and colour.
  • +: One or more occurrences of the preceding element, e.g. ab+c matches abc, abbc, abbbc, and so on, but not ac.
  • *: Zero or more occurrences of the preceding element, e.g. ab*c matches ac, abc, abbc, abbbc, and so on.
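  • ^: Matches the start of the string (or of a line, in multiline mode), e.g. ^cat matches cat only at the beginning. A regular expression is not required to start with ^. Inside square brackets, a leading ^ negates the set, e.g. [^b] matches any character except b.
  • $: Matches the end of the string (or of a line, in multiline mode), e.g. cat$ matches cat only at the end. A regular expression is not required to end with $.
  • |: Alternation (OR), e.g. (91|93) matches either 91 or 93 (not 9193).
  • (): Groups part of a pattern so that a quantifier or an alternation applies to the whole group, and captures the text it matched, e.g. (ab)+ matches ab, abab, ababab, and so on.
  • {}: Counted repetition of the preceding element, e.g. \d{8} matches exactly 8 digits.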
    • {n}: The preceding item is matched exactly n times.
    • {min,}: The preceding item is matched min or more times.
    • {min,max} (also written {m,n}): The preceding item is matched at least min times, but not more than max times, e.g. a{3,5} matches only "aaa", "aaaa", and "aaaaa".
  • -: Denotes a range inside square brackets, e.g. [0-9] matches any single digit from 0 to 9.
  • []: Character class: matches any single character from the set, e.g. [a-z] matches any one lowercase letter from a to z.
  • ,: The comma is not a metacharacter on its own; it only separates the bounds in {min,max}. For example, (1,2,3) matches the literal text 1,2,3. To match 1, 2 or 3, use (1|2|3) or [123].
  • .: Matches any single character (except a newline, by default), e.g. gr.y matches gray, grey, gr%y, etc.
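  • \: Placed before a special character, it removes the special meaning so that the character is matched literally, e.g. \+ matches a literal + sign instead of meaning "one or more". It also introduces shorthand classes and anchors:
    • \d: Matches any digit (the same as [0-9] for ASCII text). To match only the digits from 1 to 9, use [1-9].
    • \D: Matches non-digits.
    • \w: Matches a word character: a letter, a digit or an underscore ([A-Za-z0-9_] for ASCII text). To match only lowercase letters, use [a-z].
    • \W: Matches non-word characters.
    • \s: Matches a whitespace character (space, tab, newline), e.g. ^\s+ matches one or more whitespace characters at the start of the string.
    • \b: Matches a word boundary, i.e. the position between a word character and a non-word character, e.g. \bcat\b matches cat as a whole word but not the cat inside concatenate.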
    • \A: Only ever matches at the start of the string.
    • \Z: Only ever matches at the end of the string.
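Several patterns from the list above can be checked mechanically. The following is a small sketch using Python's re module; the CASES table and the whole_match helper are hypothetical names introduced for this illustration, and other regex flavors may differ in details.

```python
import re

# (pattern, candidate, should_match) triples based on the list above.
CASES = [
    (r"colou?r", "color", True),
    (r"colou?r", "colour", True),
    (r"ab+c", "abbc", True),
    (r"ab+c", "ac", False),
    (r"ab*c", "ac", True),
    (r"(91|93)", "93", True),
    (r"(91|93)", "9193", False),
    (r"a{3,5}", "aaaa", True),
    (r"a{3,5}", "aa", False),
    (r"gr.y", "gr%y", True),
    (r"\+", "+", True),
    (r"[0-9]", "7", True),
    (r"[a-z]", "Q", False),
]


def whole_match(pattern: str, candidate: str) -> bool:
    """Return True when the pattern matches the entire candidate string."""
    return re.fullmatch(pattern, candidate) is not None


# Each entry is True when the engine agrees with the expectation in CASES.
results = [whole_match(p, s) == expected for p, s, expected in CASES]
print(all(results))

# ^ anchors the match to the start; \s+ then takes the leading whitespace.
leading = re.search(r"^\s+", "   indented")
print(repr(leading.group()))

# \b requires a word boundary, so "cat" inside "concatenate" is not found.
whole_word = re.search(r"\bcat\b", "a cat sat")
inside_word = re.search(r"\bcat\b", "concatenate")
print(whole_word is not None, inside_word is not None)
```

The table uses re.fullmatch so that each pattern must account for the whole candidate, which is why (91|93) rejects 9193 here.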

 

 

Examples:

  • a.: a is the literal character a, and . is a metacharacter that matches any character (except a newline). This regex matches, for example, 'a ', 'ax', or 'a0'.
  • seriali[sz]e: matches both "serialise" and "serialize".
  • ^[ \t]+|[ \t]+$: matches excess whitespace at the beginning or end of a line.
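  • [+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?: matches a decimal number with an optional sign, an optional fractional part and an optional exponent, for example 42, -3.14, .5 and 6.02e23.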

  • H(ae?|ä)ndel, H(a|ae|ä)ndel and H(ä|ae?)ndel: three equivalent patterns, each matching "Handel", "Haendel" and "Händel".
  • a|b*: denotes {ε, "a", "b", "bb", "bbb", …}.
  • (a|b)*: denotes the set of all strings with no symbols other than "a" and "b", including the empty string: {ε, "a", "b", "aa", "ab", "ba", "bb", "aaa", …}.
  • ab*(c|ε): denotes the set of strings starting with "a", then zero or more "b"s and finally optionally a "c": {"a", "ac", "ab", "abc", "abb", "abbc", …}.
  • (0|(1(01*0)*1))*: denotes the set of binary numbers that are multiples of 3: { ε, "0", "00", "11", "000", "011", "110", "0000", "0011", "0110", "1001", "1100", "1111", "00000", … }.
  • .at: matches any three-character string ending with "at", including "hat", "cat", and "bat".
  • [hc]at: matches "hat" and "cat".
  • [^b]at: matches all strings matched by .at except "bat".
  • [^hc]at: matches all strings matched by .at other than "hat" and "cat".
  • ^[hc]at: matches "hat" and "cat", but only at the beginning of the string or line.
  • [hc]at$: matches "hat" and "cat", but only at the end of the string or line.
  • \[.\]: matches any single character surrounded by "[" and "]" since the brackets are escaped, for example: "[a]" and "[b]".
  • s.*: matches s followed by zero or more characters, for example: "s" and "saw" and "seed".
  • [hc]?at: matches "at", "hat", and "cat".
  • [hc]*at: matches "at", "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on.
  • [hc]+at: matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
  • cat|dog: matches "cat" or "dog".
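  • \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b: matches most common email addresses when used case-insensitively (the character classes list only uppercase letters). It is a practical approximation, not a complete validator.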
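A few of the examples above can be tried out directly. The sketch below uses Python's re module; the sample strings, the placeholder address jane.doe@example.com and the variable names are assumptions made up for illustration, and the re.MULTILINE and re.IGNORECASE flags are choices made here, not part of the patterns themselves.

```python
import re

# Trim excess whitespace at the start or end of every line.
# re.MULTILINE is assumed here so that ^ and $ apply to each line.
TRIM = re.compile(r"^[ \t]+|[ \t]+$", re.MULTILINE)
trimmed = TRIM.sub("", "  first line\t\n\tsecond line  ")

# The numeral pattern, checked against whole strings.
NUMBER = re.compile(r"[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?")
candidates = ["42", "-3.14", ".5", "6.02e23", "1.", "e5"]
numbers = [s for s in candidates if NUMBER.fullmatch(s)]

# The email pattern lists only uppercase letters, so it is compiled
# case-insensitively. The address is a made-up placeholder.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)
email = EMAIL.search("Contact: jane.doe@example.com, or call.")

# Character-class variants of "at".
words = ["at", "hat", "cat", "bat", "chat"]
optional = [w for w in words if re.fullmatch(r"[hc]?at", w)]
one_or_more = [w for w in words if re.fullmatch(r"[hc]+at", w)]

# Binary multiples of 3: compare the pattern with arithmetic for 0..255.
MULT3 = re.compile(r"(0|(1(01*0)*1))*")
agrees = all(
    (MULT3.fullmatch(format(n, "b")) is not None) == (n % 3 == 0)
    for n in range(256)
)

print(repr(trimmed))
print(numbers)
print(email.group())
print(optional, one_or_more)
print(agrees)
```

The last check only compares the pattern with n % 3 for the numbers 0 to 255; it is a spot check, not a proof that the pattern is correct for all binary strings.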

 

 

Tip: To learn more about regular expressions, see the following: