Groups
Phonex allows defining groups by placing any subpattern between the parenthesis metacharacters ( and ).
Some reasons to use groups:
Repeat subpatterns
Extract information for further processing
Exclude part of the pattern from the final match
Denote different possible subpatterns
Capture Group
Capture groups are used to extract portions of matches for further processing. In Phon this is often used to create a new column in query result listings containing the data matched by the group subpattern.
For example, say you were searching for any CV pattern (e.g., \c\v) but you wanted the consonant and vowel in their own separate columns in a Phon query report. You would place each phone matcher into a group using parentheses. (It is also required to name the group in this situation; see 'Group Names' below.)
E.g.
(\c)(\v)
Capture groups may be quantified. The following expression will match a consonant followed by a vowel repeatedly:
(\c\v)+
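As a rough analogy only, the same idea can be sketched with Python's re module. This is not phonex: the ASCII character classes below are assumed stand-ins for the \c and \v phone matchers, and the function name cv_columns is hypothetical.

```python
import re

# Assumed ASCII stand-ins for the phonex \c and \v phone matchers.
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"
VOWEL = "[aeiou]"

# Two capture groups, analogous to the phonex pattern (\c)(\v).
cv = re.compile(f"({CONSONANT})({VOWEL})")

# A quantified capture group, analogous to the phonex pattern (\c\v)+.
cv_repeated = re.compile(f"({CONSONANT}{VOWEL})+")


def cv_columns(text):
    """Return one (consonant, vowel) pair per CV match, like two report columns."""
    return [(m.group(1), m.group(2)) for m in cv.finditer(text)]
```

Each tuple returned by cv_columns plays the role of one row in a result listing, with the consonant and vowel in separate columns.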
Lookahead and Lookbehind Groups
Lookahead and lookbehind groups allow matching subpatterns around a pattern without including the content matched by the lookahead or lookbehind group. These groups are considered to be zero-width assertions (i.e., the length of matched content is zero) like the start-of-input ^ and end-of-input $ boundary matchers.
Lookahead patterns are contained within parentheses like regular groups, with the special prefix ?>.
An example of using a lookahead group would be to search for all consonants \c which are followed by a high vowel {v, high}.
\c(?>{v, high})
Lookbehind patterns are specified by the group prefix ?<. They behave in the same manner as lookahead groups, but look backwards in the input rather than forwards. An example would be to search for all vowels \v which are preceded by a b.
(?<b)\v
Lookahead and lookbehind groups can be used together in the same pattern.
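The zero-width behaviour can be illustrated with Python's re module, again as an analogy rather than phonex itself. Note that Python writes lookahead as (?=...) and lookbehind as (?<=...), which differs from the phonex prefixes ?> and ?<. The character classes, including the approximation of high vowels as i and u, are assumptions made for this sketch.

```python
import re

# Assumed ASCII stand-ins for phonex phone matchers.
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"
VOWEL = "[aeiou]"
HIGH_VOWEL = "[iu]"  # rough approximation of {v, high}

# Analogous to phonex \c(?>{v, high}): a consonant followed by a high vowel.
before_high = re.compile(f"{CONSONANT}(?={HIGH_VOWEL})")

# Analogous to phonex (?<b)\v: a vowel preceded by b.
after_b = re.compile(f"(?<=b){VOWEL}")

# Both assertions in one pattern: a vowel preceded by b and followed by d.
between = re.compile(f"(?<=b){VOWEL}(?=d)")
```

In each case the match contains only the single consonant or vowel; the text checked by the lookahead or lookbehind is not part of the result.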
Conditional Groups
Conditional groups allow for choices within patterns. To specify choices, subpatterns in a group are separated by the logical-or (or pipe) | metacharacter. The following example will match the sequence bab as well as dib.
(ba|di)b
Conditional groups may be quantified.
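Because the example pattern contains only literal characters, parentheses, and the pipe, the same string is also a valid Python regular expression. The sketch below uses Python's re module to show the alternation behaviour; it is an analogy and does not run the phonex engine. The quantified variant (ba|di)+b is an illustrative pattern that does not appear in the original text.

```python
import re

# Same text as the phonex example; here it is compiled by Python's re module.
alt = re.compile(r"(ba|di)b")

# Illustrative quantified variant: one or more of ba or di, then b.
alt_quantified = re.compile(r"(ba|di)+b")


def matches_fully(regex, text):
    """Return True when the whole of text is matched by regex."""
    return regex.fullmatch(text) is not None
```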
Group Numbers
Groups in a phonex pattern are numbered left to right. Each open parenthesis ( metacharacter increments the group index by 1 unless the group is non-capturing, as lookbehind and lookahead groups are. The following example pattern has two groups: the first group includes both the consonant \c and vowel \v matchers; the second group includes only the vowel \v matcher:
(\c(\v))
The next example also has two groups as the lookbehind group is not included in group indexing:
(?<^\s)(\c(\v))
Phonex includes syntax to exclude a group from indexing (the group's content will not be stored). These groups are called non-capturing or organizational groups. To exclude a group from indexing, the group content must start with ?=.
The following phonex pattern has two capturing groups: group 1 includes a syllable boundary \S, consonant \c, and vowel \v matcher; group 2 includes just the vowel \v matcher. There is one non-capturing group containing the consonant \c matcher.
(\S(?=\c)(\v))
Note that while the consonant is considered part of a non-capturing group it will still be included in the enclosing group's matched data.
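The numbering rule can be sketched as a small Python function that scans a phonex pattern string and assigns numbers to open parentheses. This is a minimal illustration and not Phon's parser: the function name is hypothetical, and it assumes balanced parentheses with no escaped or literal parentheses in the pattern.

```python
# Group prefixes described in this section that do not receive a group number:
# ?> (lookahead), ?< (lookbehind), ?= (non-capturing).
NON_CAPTURING_PREFIXES = ("?>", "?<", "?=")


def capturing_group_spans(pattern):
    """Map each capturing group number to the pattern text of that group."""
    spans = {}
    stack = []  # (group number or None, index of the open parenthesis)
    next_number = 1
    for i, ch in enumerate(pattern):
        if ch == "(":
            if pattern.startswith(NON_CAPTURING_PREFIXES, i + 1):
                stack.append((None, i))  # not counted in group indexing
            else:
                stack.append((next_number, i))
                next_number += 1
        elif ch == ")":
            number, start = stack.pop()
            if number is not None:
                spans[number] = pattern[start:i + 1]
    return spans
```

For the three example patterns in this section the function reports two numbered groups each, and the text of group 1 in the last pattern still contains the non-capturing (?=\c) subpattern.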
Group Names
Capturing groups may also be named. To name a group, the group content should start with the desired group name followed by an equals = metacharacter. The group name must start with a letter and consist of only letters, numbers, and underscore _. The following expression has two named groups: the first group name is 'onset' and will match a consonant in the onset position \c:O; the second group name is 'nucleus' and will match a vowel in the nucleus position \v:N.
(onset=\c:O)(nucleus=\v:N)
When used in Phon queries, named groups will be added to result listings in a new column with a title matching the phonex group name. The group name X is reserved in Phon queries to mark the portion of the phonex pattern to be used as the query result.
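The naming rule can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, and 'letters' is assumed here to mean ASCII letters, which may be narrower than what Phon accepts.

```python
import re

# A letter followed by letters, numbers, or underscore (ASCII assumption).
GROUP_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# An open parenthesis, a valid group name, then the = metacharacter.
NAMED_GROUP_OPEN = re.compile(r"\(([A-Za-z][A-Za-z0-9_]*)=")


def is_valid_group_name(name):
    """Return True when name follows the group naming rule described above."""
    return GROUP_NAME.fullmatch(name) is not None


def group_names(pattern):
    """List the names of named groups in a phonex pattern, left to right."""
    return NAMED_GROUP_OPEN.findall(pattern)
```

Applied to the example pattern, group_names returns the names that would become column titles in a result listing.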
Back References
Back references are used to match a subpattern previously matched by a capture group. Back references can be specified using either the group number or group name. For a numbered back reference, enter a backslash \ metacharacter followed by the group number. The following pattern will match a consonant, store the value of the matched consonant in group number 1, and then match the value of group 1 again (i.e., it will match repeated consonants).
(\c)\1
To use a named group reference, enter a backslash \ metacharacter followed by the group name enclosed in braces { and }.
The following pattern will match a consonant, store the value of the matched consonant in a group named C1, and then match the sequence stored in group C1.
(C1=\c)\{C1}
Group names are case sensitive, so in the above example \{c1} would result in an error as there is no group named c1 with a lower-case C.
Another caveat is that the \{C1} back reference will not match syllable position information (e.g., :O) or other supplementary matchers specified in the capture group. Quantifiers may be applied to back references but supplementary matchers are not allowed.
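Numbered and named back references behave in a comparable way in Python's re module, which is used below purely as an analogy. Python writes a named group as (?P<name>...) and a named back reference as (?P=name) rather than the phonex forms (name=...) and \{name}. The consonant class is an assumed ASCII stand-in for \c, and the helper names are hypothetical.

```python
import re

# Assumed ASCII stand-in for the phonex \c phone matcher.
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"

# Analogous to phonex (\c)\1: a consonant followed by the same consonant.
doubled = re.compile(f"({CONSONANT})\\1")

# Analogous to phonex (C1=\c)\{C1}, using Python's named-group syntax.
doubled_named = re.compile(f"(?P<C1>{CONSONANT})(?P=C1)")


def doubled_consonants(text):
    """Return every doubled consonant sequence found in text."""
    return [m.group(0) for m in doubled.finditer(text)]


def compiles(pattern):
    """Return True when Python accepts pattern, False when it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```

Python group names are also case sensitive, so a reference to c1 fails to compile when only C1 is defined.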