7.4.1 Regexp Syntax Summary
This is a quick-reference summary of the regular expression syntax
used in Monotone.
Quoting
\x
- a literal x, where x is any non-alphanumeric character
\Q...\E
- treat enclosed characters as literal
Characters
\a
- alarm, that is, the BEL character (hex 07)
\cx
- “control-x”, where x is any character
\e
- escape (hex 1B)
\f
- formfeed (hex 0C)
\n
- newline (hex 0A)
\r
- carriage return (hex 0D)
\t
- tab (hex 09)
\ddd
- character with octal code ddd, or backreference
\xhh
- character with hex code hh
\x{hhh...}
- character with hex code hhh...
Character Types
.
- any character except newline;
in dotall mode, any character whatsoever
\C
- one byte, even in UTF-8 mode (best avoided)
\d
- a decimal digit
\D
- a character that is not a decimal digit
\h
- a horizontal whitespace character
\H
- a character that is not a horizontal whitespace character
\p{xx}
- a character with the xx property
\P{xx}
- a character without the xx property
\R
- a newline sequence
\s
- a whitespace character
\S
- a character that is not a whitespace character
\v
- a vertical whitespace character
\V
- a character that is not a vertical whitespace character
\w
- a “word” character
\W
- a “non-word” character
\X
- an extended Unicode sequence
‘\d’, ‘\D’, ‘\s’, ‘\S’, ‘\w’, and ‘\W’
recognize only ASCII characters.
General category property codes for ‘\p’ and ‘\P’
C
- Other
Cc
- Control
Cf
- Format
Cn
- Unassigned
Co
- Private use
Cs
- Surrogate
L
- Letter
Ll
- Lower case letter
Lm
- Modifier letter
Lo
- Other letter
Lt
- Title case letter
Lu
- Upper case letter
L&
- Ll, Lu, or Lt
M
- Mark
Mc
- Spacing mark
Me
- Enclosing mark
Mn
- Non-spacing mark
N
- Number
Nd
- Decimal number
Nl
- Letter number
No
- Other number
P
- Punctuation
Pc
- Connector punctuation
Pd
- Dash punctuation
Pe
- Close punctuation
Pf
- Final punctuation
Pi
- Initial punctuation
Po
- Other punctuation
Ps
- Open punctuation
S
- Symbol
Sc
- Currency symbol
Sk
- Modifier symbol
Sm
- Mathematical symbol
So
- Other symbol
Z
- Separator
Zl
- Line separator
Zp
- Paragraph separator
Zs
- Space separator
Script names for ‘\p’ and ‘\P’
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo,
Hebrew, Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer,
Lao, Latin, Limbu, Linear_B, Malayalam, Mongolian, Myanmar,
New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,
Phags_Pa, Phoenician, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac,
Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan,
Tifinagh, Ugaritic, Yi.
Character Classes
[...]
- positive character class
[^...]
- negative character class
[x-y]
- range (can be used for hex characters)
[[:xxx:]]
- positive POSIX named set
[[:^xxx:]]
- negative POSIX named set
alnum
- alphanumeric
alpha
- alphabetic
ascii
- 0-127
blank
- space or tab
cntrl
- control character
digit
- decimal digit
graph
- printing, excluding space
lower
- lower case letter
print
- printing, including space
punct
- printing, excluding alphanumeric
space
- whitespace
upper
- upper case letter
word
- same as ‘\w’
xdigit
- hexadecimal digit
In PCRE, POSIX character set names recognize only ASCII
characters. You can use ‘\Q...\E’ inside a character class.
Quantifiers
?
- 0 or 1, greedy
?+
- 0 or 1, possessive
??
- 0 or 1, lazy
*
- 0 or more, greedy
*+
- 0 or more, possessive
*?
- 0 or more, lazy
+
- 1 or more, greedy
++
- 1 or more, possessive
+?
- 1 or more, lazy
{n}
- exactly n
{n,m}
- at least n, no more than m, greedy
{n,m}+
- at least n, no more than m, possessive
{n,m}?
- at least n, no more than m, lazy
{n,}
- n or more, greedy
{n,}+
- n or more, possessive
{n,}?
- n or more, lazy
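The difference between greedy and lazy quantifiers is easiest to see on a small input. The following is only a sketch using Python's `re` module, which is an assumption made for illustration: it is not the PCRE engine described here, but it spells the greedy, lazy, and bounded quantifiers shown below the same way. The sample strings are made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: '+' takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: '+?' takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", text).group(0)

# Bounded, greedy: {2,4} matches at least two and no more than four digits.
bounded = re.findall(r"\d{2,4}", "7 42 123456")

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
print(bounded)  # ['42', '1234', '56']
```

Possessive quantifiers are left out of this sketch because older Python versions do not accept them.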
Anchors and Simple Assertions
\b
- word boundary
\B
- not a word boundary
^
- start of subject
also after internal newline in multiline mode
\A
- start of subject
$
- end of subject
also before newline at end of subject
also before internal newline in multiline mode
\Z
- end of subject
also before newline at end of subject
\z
- end of subject
\G
- first matching position in subject
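As a rough illustration of how ‘^’, ‘$’, and ‘\b’ behave, here is a short sketch in Python's `re` module. Python is an assumption chosen for illustration and is not the PCRE engine described here; the three anchors shown behave as listed above, while Python's ‘\Z’ differs from the one in this table and is deliberately not used. The sample strings are made up.

```python
import re

subject = "alpha\nbeta\n"

# Without multiline mode, '^' matches only at the start of the subject.
single = re.findall(r"^\w+", subject)

# With (?m), '^' also matches after each internal newline.
multi = re.findall(r"(?m)^\w+", subject)

# '$' matches before the newline that ends the subject.
before_final_newline = re.search(r"beta$", subject) is not None

# '\b' matches only where a word starts or ends, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(single)                # ['alpha']
print(multi)                 # ['alpha', 'beta']
print(before_final_newline)  # True
print(whole_words)           # ['cat', 'cat']
```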
Match Point Reset
\K
- reset start of match
Alternation
expr|expr|expr...
Capturing
(...)
- capturing group
(?<name>...)
- named capturing group (like Perl)
(?'name'...)
- named capturing group (like Perl)
(?P<name>...)
- named capturing group (like Python)
(?:...)
- non-capturing group
(?|...)
- non-capturing group; reset group numbers for
capturing groups in each alternative
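The sketch below shows how capturing, named, and non-capturing groups are numbered. It uses Python's `re` module as an assumed stand-in for illustration; Python accepts only the ‘(?P<name>...)’ spelling of named groups, so that is the form used. The pattern and input are invented for the example.

```python
import re

# (?:...) groups without capturing, so it does not consume a group number.
# (?P<surname>...) is group 1 and can also be read back by name.
# (\d+) is an ordinary capturing group and becomes group 2.
pattern = re.compile(r"(?:Mr|Ms) (?P<surname>\w+) (\d+)")

m = pattern.search("Ms Lovelace 36")

print(m.groups())          # ('Lovelace', '36')
print(m.group("surname"))  # Lovelace
print(m.group(2))          # 36
```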
Atomic Groups
(?>...)
- atomic, non-capturing group
Comment
(?#....)
- comment (not nestable)
Option Setting
(?i)
- caseless
(?J)
- allow duplicate names
(?m)
- multiline
(?s)
- single line (dotall)
(?U)
- default ungreedy (lazy)
(?x)
- extended (ignore white space)
(?-...)
- unset option(s)
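Inline options can be tried out with a sketch like the following. It uses Python's `re` module, which is an assumption made for illustration and not the PCRE engine described here; Python accepts ‘(?i)’, ‘(?s)’, and ‘(?x)’ at the very start of a pattern with the meanings listed above. The inputs are made up.

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)monotone", "MONOTONE") is not None

# By default '.' does not match a newline; with (?s) it does.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"(?s)a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored and '#' starts a comment.
verbose = re.search(r"""(?x)
    \d+   # integer part
    \.    # literal dot
    \d+   # fractional part
""", "pi is 3.14")

print(caseless)          # True
print(dot_default)       # False
print(dot_all)           # True
print(verbose.group(0))  # 3.14
```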
Lookahead and Lookbehind Assertions
(?=...)
- positive look ahead
(?!...)
- negative look ahead
(?<=...)
- positive look behind
(?<!...)
- negative look behind
Each top-level branch of a look behind must be of a fixed length.
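The four assertions can be compared side by side with a small sketch. It is written for Python's `re` module, an assumption chosen for illustration rather than the PCRE engine described here; the lookaround syntax is the same, and the lookbehinds below are fixed-length as required. The sample string is invented.

```python
import re

prices = "cost: $40, weight: 40kg, discount: $5"

# Positive look behind: digits immediately preceded by '$'.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative look behind: digits not preceded by '$' or by another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", prices)

# Positive look ahead: digits immediately followed by 'kg'.
before_kg = re.findall(r"\d+(?=kg)", prices)

# Negative look ahead: digits not followed by another digit or by 'kg'.
not_before_kg = re.findall(r"\d+(?!\d|kg)", prices)

print(after_dollar)      # ['40', '5']
print(not_after_dollar)  # ['40']
print(before_kg)         # ['40']
print(not_before_kg)     # ['40', '5']
```

The extra digit checks in the negative assertions keep the engine from matching only part of a number after backtracking.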
Backreferences
\n
- reference by number (can be ambiguous)
\gn
- reference by number
\g{n}
- reference by number
\g{-n}
- relative reference by number
\k<name>
- reference by name (like Perl)
\k'name'
- reference by name (like Perl)
\g{name}
- reference by name (like Perl)
\k{name}
- reference by name (like .NET)
(?P=name)
- reference by name (like Python)
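A backreference matches the same text that a group matched earlier. The sketch below uses Python's `re` module, which is an assumption made for illustration; of the forms listed above it accepts only the numbered ‘\1’ style and the Python-style ‘(?P=name)’, so those are the two shown. The inputs are made up.

```python
import re

# Reference by number: '\1' must repeat whatever group 1 matched,
# which finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it was was the the best")

# Reference by name: the closing quote must be the same character
# as the opening quote captured in the group named 'quote'.
quoted = re.search(r"""(?P<quote>['"]).*?(?P=quote)""", 'say "it\'s" now')

print(doubled)          # ['was', 'the']
print(quoted.group(0))  # "it's"
```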
Subroutine References (possibly recursive)
(?R)
- recurse whole pattern
(?n)
- call subpattern by absolute number
(?+n)
- call subpattern by relative number
(?-n)
- call subpattern by relative number
(?&name)
- call subpattern by name (like Perl)
(?P>name)
- call subpattern by name (like Python)
Conditional Patterns
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n)...
- absolute reference condition
(?(+n)...
- relative reference condition
(?(-n)...
- relative reference condition
(?(<name>)...
- named reference condition (like Perl)
(?('name')...
- named reference condition (like Perl)
(?(name)...
- named reference condition (PCRE only)
(?(R)...
- overall recursion condition
(?(Rn)...
- specific group recursion condition
(?(R&name)...
- specific recursion condition
(?(DEFINE)...
- define subpattern for reference
(?(assert)...
- assertion condition
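An absolute reference condition tests whether a numbered group took part in the match. The following sketch uses Python's `re` module, an assumption chosen for illustration and not the PCRE engine described here; it supports the ‘(?(n)yes-pattern)’ form with the same meaning. The pattern and the test strings are invented: a closing parenthesis is required only if an opening one was matched.

```python
import re

# Group 1 optionally matches '('. The condition (?(1)\)) demands a ')'
# only when group 1 was set; otherwise it matches the empty string.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

samples = ["123", "(123)", "(123", "123)"]
results = [balanced.search(s) is not None for s in samples]

print(results)  # [True, True, False, False]
```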
Backtracking Control
The following act as soon as they are reached:
(*ACCEPT)
- force successful match
(*FAIL)
- force backtrack; synonym ‘(*F)’
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
(*COMMIT)
- overall failure, no advance of starting point
(*PRUNE)
- advance to next starting character
(*SKIP)
- advance start to current matching position
(*THEN)
- local failure, backtrack to next alternation
Newline Conventions
These are recognized only at the very start of the pattern or after a
‘(*BSR_...)’ option.
(*CR)
(*LF)
(*CRLF)
(*ANYCRLF)
(*ANY)
What ‘\R’ Matches
These are recognized only at the very start of the pattern or after a
‘(*...)’ option that sets the newline convention.
(*BSR_ANYCRLF)
(*BSR_UNICODE)