rgxExtract: use pattern matching to parse data out of a string

Home page

Description

It's often necessary to extract a single chunk of data from a text field or other string. The standard VBA string handling functions such as InStr() and Mid() can do a great deal, but require laborious coding when it is a matter of identifying and extracting the part of a string that matches a given pattern - e.g. a Zip code (five digits plus optionally a hyphen and four more digits) or a UK postcode (e.g. “B1 3AA”, “SA21 4FH”, “SW1P 5RH”).

rgxExtract() short-circuits this. It is a versatile function for pattern matching and parsing with regular expressions.

Installation and testing

rgxExtract() can be used in VBA, in classic Visual Basic, and in queries in an Access database. Before you can use it, you need to copy the program code below into a standard module in your application. If you copy it into a new module, be sure to save the module with a name other than “rgxExtract”. To test whether it's working, open the Immediate pane (e.g. with Ctrl-G), type
    ? rgxExtract("John Paul Jones", "J\w+", 0)
and hit Enter. The function should return the first word beginning with J, namely
    John
Now try
    ? rgxExtract("John Paul Jones", "J\w+", 1)
This should return the second word beginning with J, namely
    John

If you get an error message like this
  rgxExtract(): Could not create VBScript.Regexp object
it means that rgxExtract() has been unable to initialise the VBScript regular expression object which it uses for the pattern matching. Likely reasons include:

If the function worked OK, go back to the Immediate pane and type
    ? rgxExtract("(713) 555-1212", "(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})", 0)
and hit Enter. Be careful to distinguish between (parentheses) and {braces}. The function should return the area code 713. Change the final 0 to 1 and you should get the exchange 555. Change the 1 to 2 and you should get the last part of the phone number 1212. Now experiment with different phone numbers, formatted differently: e.g. “212-555-1212” or “2125551212”. Embed the phone number in other stuff, e.g. “Home telephone: (344) 123-4567 except on Wednesdays” and try again.

Arguments

rgxExtract() takes up to six arguments, although normally only the first two or three are needed. These are as follows:

Target As Variant
Target is the string to be matched against Pattern. It is declared as a Variant to make it easier to use rgxExtract() in Access queries: if Target is Null, the function simply returns Null.
Pattern As String
Pattern is the regular expression to be used. Even though it is declared as Optional, you must always pass the pattern.
Item As Long
Item is an optional argument for use when the regular expression is expected to produce more than one match (as in the John Paul Jones example above), or when there are grouping/capturing parentheses in the pattern (as in the telephone number example).
    If Item is omitted, the function returns the first (or only) match found.
    If Item >= 0, the function returns the corresponding match. 0 returns the first match, 1 the second, and so on. If Item exceeds the number of matches, the function returns a range error (or Null if you have specified that it should not fail on errors (see below).
    If Item = -1; 0, the function returns the last match. In the John Paul Jones example, using -1 would return "Jones". Similarly, -2 returns the second last match, and so on.

The remaining arguments are all optional and are less often needed. They are:

CaseSensitive As Boolean
By default, rgxExtract is not case-sensitive (thus matching the default behaviour for other searching and matching in Access). Set CaseSensitive to True to override this.
FailOnError As Boolean
By default this is set to True, and any run-time error will be passed back to the calling routine with an appropriate message. This is convenient for debugging, but can be a great nuisance when you are calling rgxExtract() from an Access query and get a messagebox reporting an error - repeated for each of tens or hundreds of records. If you set FailOnError to False
, rgxExtract() will simply ignore most errors (such as faulty regular expressions and the Item range errors mentioned above), returning Null instead.
Persist as Boolean
Normally, each time you call rgxExtract() it has to create and initialise the VBScript regular expression object that it uses. The time this takes becomes significant if the function is being called in a tight loop or from a query that is processing thousands of records. If you set Persist to True, the regular expression object will be created once and left in memory between successive calls to rgxExtract(). This can greatly speed up tight loops.
    If you use Persist, the regular expression object will be left in memory after the last call to rgxExtract(). You can dispose of it by calling rgxExtract() one more time without any arguments.

More about regular expressions

Click here for some useful links and references.

Code

Copy everything between the two horizontal lines into a standard code module and save it with any name except rgxExtract (modules and procedures share the same namespace).





Public Function rgxExtract(Optional ByVal Target As Variant, _
    Optional Pattern As String = "", _
    Optional ByVal Item As Long = 0, _
    Optional CaseSensitive As Boolean = False, _
    Optional FailOnError As Boolean = True, _
    Optional Persist As Boolean = False) _
  As Variant
   
  'Regular expression matching function suitable for use
  'in VB/A generally and in Access queries.
  'By John Nurick. Updated 14 Jan 06.
   
  'Takes a search string (Target) and a regular expression
  '(Pattern), and an optional Item argument.
  '- If Item is omitted and a substring of Target matches Pattern,
  '  returns that substring.
  '- If Pattern includes grouping parentheses, a substring of Target
  '  matches Pattern, and Item is an integer, returns the submatch
  '  specified by Item (first submatch is item 0). If there aren't
  '  enough submatches, returns Null. Negative values of Item start
  '  counting with the last submatch.
  '- If no match, returns Null.
  '- Returns Null on error unless FailOnError is True.
  '  Always matches against the entire Target (i.e. Global and
  '  Multiline are True).
  
  'CaseSensitive matches regardless of case.
 
  'Persist controls whether the compiled RegExp object
  'remains in memory ready for the next call to the
  'function or whether it is disposed of immediately. This
  'means the function can be used in queries without having
  'to create, compile, use and destroy a new RegExp object for
  'each row being processed. But it also means that the object
  'remains in memory after the query has run. To destroy the
  'object and release the memory, call this function one
  'last time with no arguments.
  '
  'Calling the function with different arguments (e.g. a new
  'Pattern) recompiles the RegExp object, so
  'the function can be used in different queries. However there
  'may be problems if two threads are calling the function at
  'the same time.
 
  Const rgxPROC_NAME = "rgxExtract"
  Static oRE As Object 'VBScript_RegExp_55.RegExp
    'Static declaration means we don't have to create
    'and compile the RegExp object every single time
    'the function is called.
  Dim oMatches As Object 'VBScript_RegExp_55.MatchCollection
   
  On Error GoTo ErrHandler
  rgxExtract = Null 'Default return value
    'NB: if FailOnError is false, returns Null on error
 
  If IsMissing(Target) Then
    'This is the signal to dispose of oRE
    Set oRE = Nothing
    Exit Function 'with default value
  End If
   
  'Create the RegExp object if necessary
  If oRE Is Nothing Then
    Set oRE = CreateObject("VBScript.Regexp")
  End If
 
  With oRE
    'Check whether the current arguments (other than Target)
    'are different from those stored in oRE, and update them
    '(thereby recompiling the regex) only if necessary.
    If CaseSensitive = .IgnoreCase Then
      .IgnoreCase = Not .IgnoreCase
    End If
    .Global = True
    .Multiline = True
'    If Multiline <> .Multiline Then
'      .Multiline = Multiline
'    End If
    If Pattern <> .Pattern Then
      .Pattern = Pattern
    End If
 
  'Finally, execute the match
    If IsNull(Target) Then
      rgxExtract = Null
    Else
      Set oMatches = oRE.Execute(Target)
      If oMatches.Count > 0 Then
        If oMatches(0).SubMatches.Count = 0 Then
          'No ( ) group in Pattern: return the match
          If Item < 0 Then 'we're counting from last item
            'convert to count from the first item
            Item = oMatches.Count + Item
          End If
          Select Case Item
            Case Is < 0
              'Negative Item originally passed exceeded the
              'number of matches
              rgxExtract = Null
              If FailOnError Then
                Err.Raise 9
              End If
            Case Is >= oMatches.Count
              'Positive Item exceeded the number of matches
              rgxExtract = Null
              If FailOnError Then
                Err.Raise 9
              End If
            Case Else
              rgxExtract = oMatches(Item)
          End Select
         
        Else  'There are one or more ( ) captured groups in Pattern
              'return the one specified by Item
          With oMatches(0).SubMatches
            If Item < 0 Then 'we're counting from last item
              'convert to count from the first item
              Item = .Count + Item
            End If
            Select Case Item
              Case Is < 0
                'Negative Item originally passed exceeded the
                'number of submatches
                rgxExtract = Null
                If FailOnError Then
                  Err.Raise 9
                End If
              Case Is >= .Count
                'Positive Item exceeded the number of submatches
                rgxExtract = Null
                If FailOnError Then
                  Err.Raise 9
                End If
              Case Else 'valid Item number
                rgxExtract = .Item(Item)
            End Select
          End With
        End If
      Else
        rgxExtract = Null
      End If
    End If
  End With
 
  'Tidy up and normal exit
  If Not Persist Then Set oRE = Nothing
  Exit Function
 
ErrHandler:
  If FailOnError Then
    With Err
      Select Case .Number
        'Replace the default "object-defined error" message
        Case 9: .Description = "Subscript out of range (the Item number requested " _
          & "was greater than the number of matches found, or than the number of " _
          & "(...) grouping/capturing parentheses in the Pattern)."
        Case 13: .Description = "Type mismatch, probably because " _
          & "the ""Target"" argument could not be converted to a string"
        Case 5017: .Description = "Syntax error in regular expression"
        Case 5018: .Description = "Unexpected quantifier in regular expression"
        Case 5019: .Description = "Expected ']' in regular expression"
        Case 5020: .Description = "Expected ')' in regular expression"
      Case Else
        If oRE Is Nothing Then 'Failed to create Regexp object
          .Description = "Could not create VBScript.RegExp object. " & Err.Description
        Else 'Unexpected error
          .Description = rgxPROC_NAME & ": " & .Description
        End If
      End Select
      Set oRE = Nothing
      .Raise Err.Number, rgxPROC_NAME, _
          rgxPROC_NAME & "(): " & .Description
    End With
  Else 'Fail silently
    Err.Clear
    Set oRE = Nothing
  End If
End Function

Home page

Email (remove reversed spamtrap) j.mapson.nurick@dial.pipex.com. Last updated 14/01/05