It's often necessary to extract a single chunk of data from a text field or other string. The standard VBA string handling functions such as InStr() and Mid() can do a great deal, but require laborious coding when it is a matter of identifying and extracting the part of a string that matches a given pattern - e.g. a Zip code (five digits plus optionally a hyphen and four more digits) or a UK postcode (e.g. “B1 3AA”, “SA21 4FH”, “SW1P 5RH”).
rgxExtract() short-circuits this. It is a versatile
function for pattern matching and parsing with regular expressions.
rgxExtract() can be used in VBA, in classic Visual Basic, and in queries in
an Access database. Before you can use it, you need to copy the program code
below into a standard module in your application. If you
copy it into a new module, be sure to save the module with a name other than “rgxExtract”.
To test whether it's working, open the Immediate pane (e.g. with Ctrl-G), type
? rgxExtract("John Paul Jones", "J\w+", 0)
and hit Enter. The function should return the first word beginning with J,
namely
John
Now try
? rgxExtract("John Paul Jones", "J\w+", 1)
This should return the second word beginning with J, namely
John
If you get an error message like this
rgxExtract(): Could not create VBScript.Regexp object
it means that rgxExtract() has been unable to initialise the VBScript regular
expression
object which it uses for the pattern matching. Likely reasons include:
If the function worked OK, go back to the Immediate pane and type
? rgxExtract("(713) 555-1212", "(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})", 0)
and hit Enter. Be careful to distinguish between (parentheses) and {braces}. The function should return the area code 713.
Change the final 0 to 1 and you should get the exchange 555.
Change the 1 to 2 and you should get the last part of the phone
number 1212. Now experiment with different phone numbers, formatted
differently: e.g. “212-555-1212” or “2125551212”. Embed the phone number in
other stuff, e.g. “Home telephone: (344) 123-4567 except on Wednesdays” and try
again.
rgxExtract() takes up to six arguments, although normally only the first
two or three are needed.
These are as follows:
Target As VariantTarget is the string to be matched against Pattern.
It is declared as a Variant to make it easier to use rgxExtract() in Access queries:
if Target is Null, the function simply
returns Null.Pattern As StringPattern is the regular expression to be used. Even though it is
declared as Optional, you must always pass the pattern.
Item As LongItem is an optional argument for use when the regular expression
is expected to produce more than one match (as in the John Paul Jones example above),
or when there are grouping/capturing parentheses in the pattern (as in
the telephone number example).Item is omitted, the function returns the first (or only)
match found. Item >= 0, the function returns the corresponding
match. 0 returns the first match, 1 the second, and so on. If Item
exceeds the number of matches, the function returns a range error (or Null if you
have specified that it should not fail on errors (see below). Item = -1; 0, the function returns the last match. In the
John Paul Jones example, using -1 would return "Jones". Similarly, -2 returns the
second last match, and so on. The remaining arguments are all optional and are less often needed. They are:
CaseSensitive As BooleanCaseSensitive to True to override this. FailOnError As BooleanTrue, and any run-time
error will be passed back to the calling routine with an appropriate
message. This is convenient for debugging, but can be a great nuisance
when you are calling rgxExtract() from an Access query and get a messagebox
reporting an error - repeated for each of tens or hundreds of records. If you
set FailOnError to FalseItem range errors mentioned above), returning Null instead.
Persist as BooleanPersist to True,
the regular expression object will be created once and left in memory between
successive calls to rgxExtract(). This can greatly speed up tight loops.
Persist, the regular
expression object will be left in memory after the last call to rgxExtract(). You
can dispose of it by calling rgxExtract() one more time without any arguments. Click here for some useful links and references.
Copy everything between the two horizontal lines into a standard code module and save it with any name except rgxExtract (modules and procedures share the same namespace).
Public Function rgxExtract(Optional ByVal Target As Variant, _
Optional Pattern As String = "", _
Optional ByVal Item As Long = 0, _
Optional CaseSensitive As Boolean = False, _
Optional FailOnError As Boolean = True, _
Optional Persist As Boolean = False) _
As Variant
'Regular expression matching function suitable for use
'in VB/A generally and in Access queries.
'By John Nurick. Updated 14 Jan 06.
'Takes a search string (Target) and a regular expression
'(Pattern), and an optional Item argument.
'- If Item is omitted and a substring of Target matches Pattern,
' returns that substring.
'- If Pattern includes grouping parentheses, a substring of Target
' matches Pattern, and Item is an integer, returns the submatch
' specified by Item (first submatch is item 0). If there aren't
' enough submatches, returns Null. Negative values of Item start
' counting with the last submatch.
'- If no match, returns Null.
'- Returns Null on error unless FailOnError is True.
' Always matches against the entire Target (i.e. Global and
' Multiline are True).
'CaseSensitive matches regardless of case.
'Persist controls whether the compiled RegExp object
'remains in memory ready for the next call to the
'function or whether it is disposed of immediately. This
'means the function can be used in queries without having
'to create, compile, use and destroy a new RegExp object for
'each row being processed. But it also means that the object
'remains in memory after the query has run. To destroy the
'object and release the memory, call this function one
'last time with no arguments.
'
'Calling the function with different arguments (e.g. a new
'Pattern) recompiles the RegExp object, so
'the function can be used in different queries. However there
'may be problems if two threads are calling the function at
'the same time.
Const rgxPROC_NAME = "rgxExtract"
Static oRE As Object 'VBScript_RegExp_55.RegExp
'Static declaration means we don't have to create
'and compile the RegExp object every single time
'the function is called.
Dim oMatches As Object 'VBScript_RegExp_55.MatchCollection
On Error GoTo ErrHandler
rgxExtract = Null 'Default return value
'NB: if FailOnError is false, returns Null on error
If IsMissing(Target) Then
'This is the signal to dispose of oRE
Set oRE = Nothing
Exit Function 'with default value
End If
'Create the RegExp object if necessary
If oRE Is Nothing Then
Set oRE = CreateObject("VBScript.Regexp")
End If
With oRE
'Check whether the current arguments (other than Target)
'are different from those stored in oRE, and update them
'(thereby recompiling the regex) only if necessary.
If CaseSensitive = .IgnoreCase Then
.IgnoreCase = Not .IgnoreCase
End If
.Global = True
.Multiline = True
' If Multiline <> .Multiline Then
' .Multiline = Multiline
' End If
If Pattern <> .Pattern Then
.Pattern = Pattern
End If
'Finally, execute the match
If IsNull(Target) Then
rgxExtract = Null
Else
Set oMatches = oRE.Execute(Target)
If oMatches.Count > 0 Then
If oMatches(0).SubMatches.Count = 0 Then
'No ( ) group in Pattern: return the match
If Item < 0 Then 'we're counting from last item
'convert to count from the first item
Item = oMatches.Count + Item
End If
Select Case Item
Case Is < 0
'Negative Item originally passed exceeded the
'number of matches
rgxExtract = Null
If FailOnError Then
Err.Raise 9
End If
Case Is >= oMatches.Count
'Positive Item exceeded the number of matches
rgxExtract = Null
If FailOnError Then
Err.Raise 9
End If
Case Else
rgxExtract = oMatches(Item)
End Select
Else 'There are one or more ( ) captured groups in Pattern
'return the one specified by Item
With oMatches(0).SubMatches
If Item < 0 Then 'we're counting from last item
'convert to count from the first item
Item = .Count + Item
End If
Select Case Item
Case Is < 0
'Negative Item originally passed exceeded the
'number of submatches
rgxExtract = Null
If FailOnError Then
Err.Raise 9
End If
Case Is >= .Count
'Positive Item exceeded the number of submatches
rgxExtract = Null
If FailOnError Then
Err.Raise 9
End If
Case Else 'valid Item number
rgxExtract = .Item(Item)
End Select
End With
End If
Else
rgxExtract = Null
End If
End If
End With
'Tidy up and normal exit
If Not Persist Then Set oRE = Nothing
Exit Function
ErrHandler:
If FailOnError Then
With Err
Select Case .Number
'Replace the default "object-defined error" message
Case 9: .Description = "Subscript out of range (the Item number requested " _
& "was greater than the number of matches found, or than the number of " _
& "(...) grouping/capturing parentheses in the Pattern)."
Case 13: .Description = "Type mismatch, probably because " _
& "the ""Target"" argument could not be converted to a string"
Case 5017: .Description = "Syntax error in regular expression"
Case 5018: .Description = "Unexpected quantifier in regular expression"
Case 5019: .Description = "Expected ']' in regular expression"
Case 5020: .Description = "Expected ')' in regular expression"
Case Else
If oRE Is Nothing Then 'Failed to create Regexp object
.Description = "Could not create VBScript.RegExp object. " & Err.Description
Else 'Unexpected error
.Description = rgxPROC_NAME & ": " & .Description
End If
End Select
Set oRE = Nothing
.Raise Err.Number, rgxPROC_NAME, _
rgxPROC_NAME & "(): " & .Description
End With
Else 'Fail silently
Err.Clear
Set oRE = Nothing
End If
End Function