Unicode and Regular Expressions

Version 0.5

Contents

Credits are due to

Foreword

Regular expressions are a powerful way to search certain patterns in a text. Apart of simply giving the text you want to find, you can build a set of rules to be matched, allowing for much more complex search patterns.

Unicode

Unicode is a way of encoding characters that aims to support all possible character sets in the world. Using Unicode we can assing an unique ID to every char, disregarding whether it's Arabic, Hebrew or Cyrillic. The teletext engine uses an Unicode flavour called UCS-2, that uses 16 bits to represent each char.

Introduction

Regular expressions have a syntax in which a few characters are special constructs and the rest are "ordinary". An ordinary character is a simple regular expression which matches that same character and nothing else. The special characters are $, ^, ., *, +, ?, [, ] and \. Any other character appearing in a regular expression is ordinary, unless a \ precedes it.

For example, f is not a special character, so it is ordinary, and therefore f is a regular expression that matches the string "f" and no other string. (It does not match the string "ff".) Likewise, o is a regular expression that matches only "o". (When case distinctions are being ignored, these regexps also match "F" and "O", but we consider this a generalization of "the same string," rather than an exception.)

Any two regular expressions A and B can be concatenated. The result is a regular expression which matches a string if A matches some amount of the beginning of that string and B matches the rest of the string.

As a simple example, we can concatenate the regular expressions f and o to get the regular expression fo, which matches only the string "fo". Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.

. (Period)
is a special character that matches any single character except a newline. Using concatenation, we can make regular expressions like a.b, which matches any three-character string that begins with "a" and ends with "b".

*
is not a construct by itself; it is a postfix operator that means to match the preceding regular expression repetitively as many times as possible. Thus, o* matches any number of "o"s (including no "o"s).

* always applies to the smallest possible preceding expression. Thus, fo* has a repeating o, not a repeating fo. It matches "f", "fo", "foo", and so on.

The matcher processes a * construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the *-modified construct in case that makes it possible to match the rest of the pattern. For example, in matching ca*ar against the string "caaar", the a* first tries to match all three "a"s; but the rest of the pattern is ar and there is only "r" left to match, so this try fails. The next alternative is for a* to match only two "a"s. With this choice, the rest of the regexp matches successfully.

+
is a postfix operator, similar to * except that it must match the preceding expression at least once. So, for example, ca+r matches the strings "car" and "caaaar" but not the string "cr", whereas ca*r matches all three strings.

?
is a postfix operator, similar to * except that it can match the preceding expression either once or not at all. For example, ca?r matches "car" or "cr"; nothing else.

[ ... ]
is a "character set", which begins with [ and is terminated by ]. In the simplest case, the characters between the two brackets are what this set can match.

Thus, [ad] matches either one "a" or one "d", and [ad]* matches any string composed of just "a"s and "d"s (including the empty string), from which it follows that c[ad]*r matches "cr", "car", "cdr", "caddaar", etc.

You can also include character ranges in a character set, by writing the starting and ending characters with a - between them. Thus, [a-z] matches any lower-case ASCII letter. Ranges may be intermixed freely with individual characters, as in [a-z$%.], which matches any lower-case ASCII letter or "$", "%" or period.

Note that the usual regexp special characters are not special inside a character set. A completely different set of special characters exists inside character sets: ], - and ^.

To include a ] in a character set, you must make it the first character. For example, []a] matches "]" or "a". To include a -, write - as the first or last character of the set, or put it after a range. Thus, []-] matches both "]" and "-".

To include ^ in a set, put it anywhere but at the beginning of the set.

When you use a range in case-insensitive search, you should write both ends of the range in upper case, or both in lower case, or both should be non-letters.

[^ ... ]
[^ begins a "complemented character set", which matches any character except the ones specified. Thus, [^a-z0-9A-Z] matches all characters except letters and digits. ^ is not special in a character set unless it is the first character. The character following the ^ is treated as if it were first (in other words, - and ] are not special there).

^
is a special character that matches the empty string, but only at the beginning of a line in the text being matched. Otherwise it fails to match anything. Thus, ^foo matches a "foo" that occurs at the beginning of a line.

$
is similar to ^ but matches only at the end of a line. Thus, x+$ matches a string of one "x" or more at the end of a line.

\
has two functions: it quotes the special characters (including \), and it introduces additional special constructs.

Because \ quotes special characters, \$ is a regular expression that matches only "$", and "\[" is a regular expression that matches only "[", and so on.

You can find a more comprehensive introduction to regular expressions in the manual page: man 7 regex

Formal description

The current implementation is by no means complete, but it is pretty stable, and has most of the commonly used features. An interesting missing operator is {a[,[b]]} (matching a definite amount of times, similar to * and friends, but finite).

From the URE README:
This is a simple regular expression package for matching
against Unicode text in UCS2 form. The implementation of this
URE package is a variation on the RE->DFA algorithm done by
Mark Hopkins (markh@csd4.csd.uwm.edu). Mark Hopkins' algorithm
had the virtue of being very simple, so it was used as a model.

Assumptions:

  o  Regular expression and text already normalized.

  o  Conversion to lower case assumes a 1-1 mapping.

Definitions:

  Separator - any one of U+2028, U+2029, '\n', '\r'.

Operators:
  .   - match any character.
  *   - match zero or more of the last subexpression.
  +   - match one or more of the last subexpression.
  ?   - match zero or one of the last subexpression.
  ()  - subexpression grouping.

  Notes:

    o  The "." operator normally does not match separators, but a flag is
       available for the ure_exec() function that will allow this operator to
       match a separator.

Literals and Constants:

  c       - literal UCS2 character.
  \x....  - hexadecimal number of up to 4 digits.
  \X....  - hexadecimal number of up to 4 digits.
  \u....  - hexadecimal number of up to 4 digits.
  \U....  - hexadecimal number of up to 4 digits.

Character classes:

  [...]           - Character class.
  [^...]          - Negated character class.
  \pN1,N2,...,Nn  - Character properties class.
  \PN1,N2,...,Nn  - Negated character properties class.

  POSIX character classes recognized:

    :alnum:
    :alpha:
    :cntrl:
    :digit:
    :graph:
    :lower:
    :print:
    :punct:
    :space:
    :upper:
    :xdigit:

  And a Unicode specific character class:
    :title:

Aliases
  #url# -> Any internet address (http://www.nowhe.re). Expands to
    "https?://([:alnum:]|[-~./?%_=+])+"
  #email# -> Email address (somebody@somebody.sth). Expands to
    "([:alnum:]|[-~.])+@([:alnum:]|[-~.])+"

  Notes:

    o  Character property classes are \p or \P followed by a comma separated
       list of integers between 1 and 32.  These integers are references to
       the following character properties:

        N   Character Property
        --------------------------
    1   _URE_ALNUM
    2   _URE_ALPHA
    3   _URE_CNTRL
    4   _URE_DIGIT
    5   _URE_GRAPH
    6   _URE_LOWER
    7   _URE_PRINT
    8   _URE_PUNCT
    9   _URE_SPACE
    10  _URE_UPPER
    11  _URE_XDIGIT
    12  _URE_TITLE
    13  _URE_DEFINED
    14  _URE_WIDE
    15  _URE_NONSPACING

    o  Character classes can contain literals, constants, and character
       property classes. Example:

       [abc\U10A\p1,3,4]