For example, f is not a special character, so it is ordinary, and therefore f is a regular expression that matches the string "f" and no other string. (It does not match the string "ff".) Likewise, o is a regular expression that matches only "o". (When case distinctions are being ignored, these regexps also match "F" and "O", but we consider this a generalization of "the same string," rather than an exception.)
Any two regular expressions A and B can be concatenated. The result is a regular expression which matches a string if A matches some amount of the beginning of that string and B matches the rest of the string.
As a simple example, we can concatenate the regular expressions f and o to get the regular expression fo, which matches only the string "fo". Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.
* always applies to the smallest possible preceding expression. Thus, fo* has a repeating o, not a repeating fo. It matches "f", "fo", "foo", and so on.
The matcher processes a * construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the *-modified construct in case that makes it possible to match the rest of the pattern. For example, in matching ca*ar against the string "caaar", the a* first tries to match all three "a"s; but the rest of the pattern is ar and there is only "r" left to match, so this try fails. The next alternative is for a* to match only two "a"s. With this choice, the rest of the regexp matches successfully.
Thus, [ad] matches either one "a" or one "d", and [ad]* matches any string composed of just "a"s and "d"s (including the empty string), from which it follows that c[ad]*r matches "cr", "car", "cdr", "caddaar", etc.
You can also include character ranges in a character set, by writing the starting and ending characters with a - between them. Thus, [a-z] matches any lower-case ASCII letter. Ranges may be intermixed freely with individual characters, as in [a-z$%.], which matches any lower-case ASCII letter or "$", "%" or period.
Note that the usual regexp special characters are not special inside a character set. A completely different set of special characters exists inside character sets: ], - and ^.
To include a ] in a character set, you must make it the first character. For example, []a] matches "]" or "a". To include a -, write - as the first or last character of the set, or put it after a range. Thus, []-] matches both "]" and "-".
To include ^ in a set, put it anywhere but at the beginning of the set.
When you use a range in case-insensitive search, you should write both ends of the range in upper case, or both in lower case, or both should be non-letters.
Because \ quotes special characters, \$ is a regular expression that matches only "$", and "\[" is a regular expression that matches only "[", and so on.
The current implementation is by no means complete, but it is pretty stable, and has most of the commonly used features. An interesting missing operator is {a[,[b]]} (matching a definite amount of times, similar to * and friends, but finite).
From the URE README:This is a simple regular expression package for matching against Unicode text in UCS2 form. The implementation of this URE package is a variation on the RE->DFA algorithm done by Mark Hopkins (markh@csd4.csd.uwm.edu). Mark Hopkins' algorithm had the virtue of being very simple, so it was used as a model. Assumptions: o Regular expression and text already normalized. o Conversion to lower case assumes a 1-1 mapping. Definitions: Separator - any one of U+2028, U+2029, '\n', '\r'. Operators: . - match any character. * - match zero or more of the last subexpression. + - match one or more of the last subexpression. ? - match zero or one of the last subexpression. () - subexpression grouping. Notes: o The "." operator normally does not match separators, but a flag is available for the ure_exec() function that will allow this operator to match a separator. Literals and Constants: c - literal UCS2 character. \x.... - hexadecimal number of up to 4 digits. \X.... - hexadecimal number of up to 4 digits. \u.... - hexadecimal number of up to 4 digits. \U.... - hexadecimal number of up to 4 digits. Character classes: [...] - Character class. [^...] - Negated character class. \pN1,N2,...,Nn - Character properties class. \PN1,N2,...,Nn - Negated character properties class. POSIX character classes recognized: :alnum: :alpha: :cntrl: :digit: :graph: :lower: :print: :punct: :space: :upper: :xdigit: And a Unicode specific character class: :title: Aliases #url# -> Any internet address (http://www.nowhe.re). Expands to "https?://([:alnum:]|[-~./?%_=+])+" #email# -> Email address (somebody@somebody.sth). Expands to "([:alnum:]|[-~.])+@([:alnum:]|[-~.])+" Notes: o Character property classes are \p or \P followed by a comma separated list of integers between 1 and 32. These integers are references to the following character properties: N Character Property -------------------------- 1 _URE_ALNUM 2 _URE_ALPHA 3 _URE_CNTRL 4 _URE_DIGIT 5 _URE_GRAPH 6 _URE_LOWER 7 _URE_PRINT 8 _URE_PUNCT 9 _URE_SPACE 10 _URE_UPPER 11 _URE_XDIGIT 12 _URE_TITLE 13 _URE_DEFINED 14 _URE_WIDE 15 _URE_NONSPACING o Character classes can contain literals, constants, and character property classes. Example: [abc\U10A\p1,3,4]