Regex Creator
3.1 Summary of regular-expression constructs
3.1.1 Backslashes, escapes, and quoting
Regex Creator is a Java application that helps you create and test regular expressions.
Definitions:
Tag: a delimiter used to recognize a pattern
Field: a piece of data to extract
1- Paste the text to parse in the “Text Sample” box.
2- Specify the ending position of one search: place the cursor just after the last word the search should cover (the word “room” in the sample), then click “Set End Pos”. The end position is displayed (52 in this case).
3- Select text that you don’t necessarily want to extract but that is constant and can serve as an anchor for recognizing patterns, then press the “Create Delimiter from Selection” button. This creates an entry in the “Fields Definition” table. Since a delimiter is supposed to be constant, its regex pattern is the same as the selected text. If the delimiter can have more than one value, select the text that will serve as the new value and press the “Add value” button. Repeat for each string that should become a delimiter.
4- Using the mouse, select a string you want to extract in the “Text Sample” box and press the “Create Group from Selection” button. This creates an entry in the “Fields Definition” table, and a regex pattern is created automatically. You can modify it if needed. Repeat for each string to extract.
5- Select ON and press tag
6- If you create a field by mistake, you can remove it by pressing the “Delete Field” button.
7- When all your fields are created, press the “Generate Regex” button. The application creates a regex pattern from the fields and delimiters you defined and puts it in the “Regex Pattern” box in the next section. If there was already a pattern, it is replaced with the new one.
8- To test your new regex pattern, press the “Search” button. The text in the “Text Sample” box is searched using the regex pattern, and the results are displayed in the “Search Results” section.
9- If nothing is displayed in the “Search Results” section, the regex did not find a match. In that case, press the “Debug Regex” button and the application will tell you which part of the regex pattern can find a match.
10- If you want to modify the regex pattern, you can do so directly in the “Regex Pattern” box or by modifying the fields. Remember that if you modify the pattern directly and later press the “Generate Regex” button, your modifications will be lost.
11- When you’re satisfied with your pattern, you can double the backslash ( \ ) characters by pressing the “Double “\”” button. This is useful if you want to copy and paste your regex pattern into your source code and the backslash has a special meaning to the compiler (e.g. C++). You can remove the doubled backslashes by pressing the “Remove “\\” ” button.
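The workflow above can be pictured with a small Java sketch. It is not the application's actual implementation: the class name `RegexCreatorSketch` and the helpers `delimiter`, `field`, `doubleBackslashes`, and `removeDoubleBackslashes` are hypothetical names chosen for illustration. The sketch assumes that a delimiter is matched literally (with several values joined by alternation) and that each field becomes a capturing group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCreatorSketch {
    // Hypothetical helper: a delimiter is constant text, so each value is quoted
    // literally; several values become a non-capturing alternation.
    static String delimiter(String... values) {
        StringBuilder sb = new StringBuilder("(?:");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(values[i]));
        }
        return sb.append(')').toString();
    }

    // Hypothetical helper: a field is a capturing group around an editable pattern.
    static String field(String pattern) {
        return "(" + pattern + ")";
    }

    // Assumed analogue of the "Double \" button: every backslash becomes two.
    static String doubleBackslashes(String regex) {
        return regex.replace("\\", "\\\\");
    }

    // Assumed analogue of the "Remove \\" button: every pair becomes one backslash.
    static String removeDoubleBackslashes(String regex) {
        return regex.replace("\\\\", "\\");
    }

    public static void main(String[] args) {
        // Delimiter with two possible values, followed by one numeric field.
        String regex = delimiter("room ", "suite ") + field("\\d+");
        Matcher m = Pattern.compile(regex).matcher("Meeting in room 52 today");
        if (m.find()) {
            System.out.println("field = " + m.group(1)); // field = 52
        }
        // The pattern (\d+) as it would have to be typed inside a string literal.
        System.out.println(doubleBackslashes("(\\d+)")); // (\\d+)
    }
}
```

Quoting the delimiter values with `Pattern.quote` is a design choice of this sketch; it keeps characters such as `.` or `(` in a delimiter from being read as regex constructs.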
Construct | Matches
Characters
x | The character x
\\ | The backslash character
\0n | The character with octal value 0n (0 <= n <= 7)
\0nn | The character with octal value 0nn (0 <= n <= 7)
\0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh | The character with hexadecimal value 0xhh
\uhhhh | The character with hexadecimal value 0xhhhh
\t | The tab character
\n | The newline (line feed) character
\r | The carriage-return character
\f | The form-feed character
\a | The alert (bell) character
\e | The escape character
\cx | The control character corresponding to x
Character classes
[abc] | a, b, or c (simple class)
[^abc] | Any character except a, b, or c (negation)
[a-zA-Z] | a through z or A through Z, inclusive (range)
[a-d[m-p]] | a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] | d, e, or f (intersection)
[a-z&&[^bc]] | a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] | a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
. | Any character (may or may not match line terminators)
\d | A digit: [0-9]
\D | A non-digit: [^0-9]
\s | A whitespace character: [ \t\n\x0B\f\r]
\S | A non-whitespace character: [^\s]
\w | A word character: [a-zA-Z_0-9]
\W | A non-word character: [^\w]
POSIX character classes (US-ASCII only)
\p{Lower} | A lower-case alphabetic character: [a-z]
\p{Upper} | An upper-case alphabetic character: [A-Z]
\p{ASCII} | All ASCII: [\x00-\x7F]
\p{Alpha} | An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit} | A decimal digit: [0-9]
\p{Alnum} | An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct} | Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph} | A visible character: [\p{Alnum}\p{Punct}]
\p{Print} | A printable character: [\p{Graph}]
\p{Blank} | A space or a tab: [ \t]
\p{Cntrl} | A control character: [\x00-\x1F\x7F]
\p{XDigit} | A hexadecimal digit: [0-9a-fA-F]
\p{Space} | A whitespace character: [ \t\n\x0B\f\r]
Classes for Unicode blocks and categories
\p{InGreek} | A character in the Greek block (simple block)
\p{Lu} | An uppercase letter (simple category)
\p{Sc} | A currency symbol
\P{InGreek} | Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction)
Boundary matchers
^ | The beginning of a line
$ | The end of a line
\b | A word boundary
\B | A non-word boundary
\A | The beginning of the input
\G | The end of the previous match
\Z | The end of the input but for the final terminator, if any
\z | The end of the input
Greedy quantifiers
X? | X, once or not at all
X* | X, zero or more times
X+ | X, one or more times
X{n} | X, exactly n times
X{n,} | X, at least n times
X{n,m} | X, at least n but not more than m times
Reluctant quantifiers
X?? | X, once or not at all
X*? | X, zero or more times
X+? | X, one or more times
X{n}? | X, exactly n times
X{n,}? | X, at least n times
X{n,m}? | X, at least n but not more than m times
Possessive quantifiers
X?+ | X, once or not at all
X*+ | X, zero or more times
X++ | X, one or more times
X{n}+ | X, exactly n times
X{n,}+ | X, at least n times
X{n,m}+ | X, at least n but not more than m times
Logical operators
XY | X followed by Y
X|Y | Either X or Y
(X) | X, as a capturing group
Back references
\n | Whatever the nth capturing group matched
Quotation
\ | Nothing, but quotes the following character
\Q | Nothing, but quotes all characters until \E
\E | Nothing, but ends quoting started by \Q
Special constructs (non-capturing)
(?:X) | X, as a non-capturing group
(?idmsux-idmsux) | Nothing, but turns match flags on - off
(?idmsux-idmsux:X) | X, as a non-capturing group with the given flags on - off
(?=X) | X, via zero-width positive lookahead
(?!X) | X, via zero-width negative lookahead
(?<=X) | X, via zero-width positive lookbehind
(?<!X) | X, via zero-width negative lookbehind
(?>X) | X, as an independent, non-capturing group
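A few of the constructs in the summary are easier to tell apart when run side by side. The following is a minimal sketch; the class `ConstructSketch`, its helper `first`, and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConstructSketch {
    // Returns the first match of regex in input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String tags = "<a><b>";
        // Greedy: .+ takes as much as it can, so the match runs to the last '>'.
        System.out.println(first("<.+>", tags));   // <a><b>
        // Reluctant: .+? takes as little as it can, so the match stops at the first '>'.
        System.out.println(first("<.+?>", tags));  // <a>
        // Positive lookahead: " USD" must follow but is not part of the match.
        System.out.println(first("\\d+(?= USD)", "pay 100 USD")); // 100
        // Quotation: between \Q and \E the '+' is an ordinary character.
        System.out.println(first("\\Q1+1\\E", "so 1+1=2"));       // 1+1
        // POSIX classes: one upper-case letter followed by lower-case letters.
        System.out.println(first("\\p{Upper}\\p{Lower}+", "hello World")); // World
    }
}
```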
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
Backslashes within string literals in Java source code are interpreted as required by the Java Language Specification as either Unicode escapes or other character escapes. It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
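The difference between what the Java compiler sees and what the regex engine sees can be checked directly. This is a small sketch; the class `BackslashSketch`, the helper `found`, and the sample inputs are illustrative only.

```java
import java.util.regex.Pattern;

class BackslashSketch {
    // True if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\b" reaches the regex engine as \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false
        // "\b" is already a backspace character (U+0008) when the regex engine sees it.
        System.out.println(found("\b", "a\bb"));               // true
        // "\\(hello\\)" reaches the engine as \(hello\) and matches the text (hello).
        System.out.println(found("\\(hello\\)", "say (hello)")); // true
        // Four backslashes in the literal are \\ for the engine: one literal backslash.
        System.out.println(found("\\\\", "C:\\temp"));         // true
    }
}
```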
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
1 | Literal escape | \x
2 | Grouping | [...]
3 | Range | a-z
4 | Union | [a-e][i-u]
5 | Intersection | [a-z&&[aeiou]]
Note that a different set of metacharacters is in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range-forming metacharacter.
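The class operators and the change of metacharacters inside brackets can be tried out with `String.matches`, which tests the whole input against the pattern. The sketch below uses a made-up class `CharClassSketch` with a thin wrapper `m`.

```java
class CharClassSketch {
    // True if the whole string s matches regex.
    static boolean m(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        // Union: a nested class adds its characters to the outer class.
        System.out.println(m("[a-d[m-p]]", "n"));   // true
        System.out.println(m("[a-d[m-p]]", "e"));   // false
        // Intersection: only characters present in both operands remain.
        System.out.println(m("[a-z&&[def]]", "e")); // true
        System.out.println(m("[a-z&&[def]]", "a")); // false
        // Subtraction: intersection with a negated class.
        System.out.println(m("[a-z&&[^bc]]", "b")); // false
        System.out.println(m("[a-z&&[^bc]]", "d")); // true
        // Inside a class, '.' is a literal dot.
        System.out.println(m("[.]", "x"));          // false
        System.out.println(m("[.]", "."));          // true
        // Inside a class, '-' forms a range unless it is escaped.
        System.out.println(m("[a-c]", "b"));        // true
        System.out.println(m("[a\\-c]", "b"));      // false
        System.out.println(m("[a\\-c]", "-"));      // true
    }
}
```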
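A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
· A newline (line feed) character ('\n'),
· A carriage-return character followed immediately by a newline character ("\r\n"),
· A standalone carriage-return character ('\r'),
· A next-line character ('\u0085'),
· A line-separator character ('\u2028'), or
· A paragraph-separator character ('\u2029').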
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
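The effect of the three flags can be observed with a short experiment. This is only a sketch: the class `LineModeSketch`, the helper `count`, and the sample text are assumptions made for the example, while `Pattern.MULTILINE`, `Pattern.DOTALL`, and `Pattern.UNIX_LINES` are the standard flag constants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineModeSketch {
    // Counts how many times p matches in input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";
        // Default: ^ and $ only match at the ends of the whole input.
        System.out.println(count(Pattern.compile("^\\w+$"), text));                    // 0
        // MULTILINE: ^ and $ also match around each line terminator.
        System.out.println(count(Pattern.compile("^\\w+$", Pattern.MULTILINE), text)); // 3
        // Default: '.' does not match a line terminator.
        System.out.println(Pattern.compile("one.two").matcher(text).find());                 // false
        // DOTALL: '.' matches line terminators too.
        System.out.println(Pattern.compile("one.two", Pattern.DOTALL).matcher(text).find()); // true
        // UNIX_LINES: only '\n' counts as a terminator, so '.' matches '\r'.
        System.out.println(Pattern.compile("a.b").matcher("a\rb").matches());                     // false
        System.out.println(Pattern.compile("a.b", Pattern.UNIX_LINES).matcher("a\rb").matches()); // true
    }
}
```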
Capturing groups are numbered by counting their opening parentheses from
left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1 | ((A)(B(C)))
2 | (A)
3 | (B(C))
4 | (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of
the input sequence that matches such a group is saved. The captured subsequence
may be used later in the expression, via a back reference, and may also be
retrieved from the matcher once the match operation is complete.
The captured input associated with a group is always the subsequence
that the group most recently matched. If a group is evaluated a second time
because of quantification then its previously-captured value, if any, will be
retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set
to "b". All captured input is discarded at
the beginning of each match.
Groups beginning with (? are pure, non-capturing groups that do not capture text and do
not count towards the group total.
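The numbering and capture rules described here can be checked against a matcher. The sketch below reuses the two expressions from the text; the class `GroupSketch`, the helper `groups`, and the extra inputs are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
    // Returns groups 0..groupCount() after matching the whole input,
    // or null if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) return null;
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i < out.length; i++) out[i] = m.group(i);
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + " = " + g[i]); // 0 = ABC, 1 = ABC, 2 = A, 3 = BC, 4 = C
        }
        // Group two keeps "b" from the first iteration even though the
        // optional (b) does not take part in the second iteration.
        String[] r = groups("(a(b)?)+", "aba");
        System.out.println(r[1] + ", " + r[2]); // a, b
        // A back reference reuses the text a group captured.
        System.out.println("hello".matches("hel(l)\\1?o")); // true
        // (?:...) groups do not count towards the group total.
        System.out.println(groups("(?:ab)+(c)", "ababc").length - 1); // 1
    }
}
```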
Perl constructs not supported by this class:
· The conditional constructs (?{X}) and (?(condition)X|Y),
· The embedded code constructs (?{code}) and (??{code}),
· The embedded comment syntax (?#comment), and
· The preprocessing operations \l, \u, \L, and \U.
Constructs supported by this class but not by Perl:
· Possessive quantifiers, which greedily match as much as they can and do not back off, even when doing so would allow the overall match to succeed.
· Character-class union and intersection as described above.
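The "do not back off" behaviour of possessive quantifiers shows up when the rest of the pattern needs a character that the quantifier has already consumed. A minimal sketch, with a hypothetical class `PossessiveSketch` and helper `fullMatch`:

```java
class PossessiveSketch {
    // True if the whole input matches regex.
    static boolean fullMatch(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // Greedy a* first takes all three a's, then gives one back for the final 'a'.
        System.out.println(fullMatch("a*a", "aaa"));  // true
        // Possessive a*+ takes all three a's and never gives one back,
        // so the final 'a' has nothing left to match.
        System.out.println(fullMatch("a*+a", "aaa")); // false
        // When no backing off is needed, both forms behave the same.
        System.out.println(fullMatch("a*+b", "aaab")); // true
    }
}
```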
Notable differences from Perl:
· In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is less than or equal to the existing number of groups or it is one digit.
· Perl uses the g flag to request a match that resumes where the last match left off. This functionality is provided implicitly by the Matcher class: repeated invocations of the find method will resume where the last match left off, unless the matcher is reset.
· In Perl, embedded flags at the top level of an expression affect the whole expression. In this class, embedded flags always take effect at the point at which they appear, whether they are at the top level or within a group; in the latter case, flags are restored at the end of the group just as in Perl.
· Perl is forgiving about malformed matching constructs, as in the expression *a, as well as dangling brackets, as in the expression abc], and treats them as literals. This class also accepts dangling brackets but is strict about dangling metacharacters like +, ? and *, and will throw a PatternSyntaxException if it encounters them.
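Several of these differences can be observed from Java code. The sketch below is illustrative only: the class `PerlDifferenceSketch` and its helpers `findAll` and `compiles` are made-up names, while `Matcher.find` and `PatternSyntaxException` are the standard `java.util.regex` members the text refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PerlDifferenceSketch {
    // Collects every match; each find() call resumes where the last match ended.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) out.add(m.group());
        return out;
    }

    // True if the expression is accepted by Pattern.compile.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Repeated find() plays the role of Perl's g flag.
        System.out.println(findAll("\\d+", "10 20 30")); // [10, 20, 30]
        // A dangling metacharacter is rejected, a dangling bracket is accepted.
        System.out.println(compiles("*a"));   // false
        System.out.println(compiles("abc]")); // true
        // An embedded flag takes effect from the point where it appears:
        // only the 'b' is matched case-insensitively here.
        System.out.println("aB".matches("a(?i)b")); // true
        System.out.println("AB".matches("a(?i)b")); // false
    }
}
```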