Skip to main content

Regex gem

The Regex gem enables pattern matching and text extraction using regular expressions. This gem provides four distinct output methods for processing text data: Replace, Tokenize, Parse, and Match.

Input and Output

The Regex gem uses the following ports:

PortDescription
in0The source table containing text data that needs to be processed with regex patterns.
outThe output table containing:
  • Original columns preserved
  • New columns created based on the selected output method (Replace, Tokenize, Parse, or Match)
The output schema depends on the chosen method and configuration.

Parameters

Configure the Regex gem using the following parameters.

Common configuration

These parameters are available for all regex operations:

ParameterDescription
Select Column to SplitChoose the input column containing the text data you want to process with regex patterns.
Output MethodSelect how the regex operation should handle matches:
  • Replace: Substitute matched text with replacement values.
  • Tokenize: Split text into tokens or columns based on regex patterns.
  • Parse: Extract specific groups from regex matches into separate columns.
  • Match: Determine whether text matches the pattern.
RegexEnter your regular expression pattern. The field supports standard regex syntax with capture groups for extracting specific portions of matched text.
Case Insensitive MatchingEnable this option to perform pattern matching without regard to letter case.

Replace configuration

The Replace method substitutes matched portions of text with specified replacement values. When using this method, the gem outputs an additional column with the replaced values.

ParameterDescription
Replacement TextSpecify replacement text or use capture group references.
Copy Unmatched Text to OutputWhen enabled, non-matching text is preserved in the appended output column.

Example

Use this method to standardize phone number formats from 555-123-4567 to (555) 123-4567.

  • Select Column to Split: phone_number

  • Regex: (\d{3})-(\d{3})-(\d{4})

  • Replacement text: ($1)$2-$3

    This inserts capture groups 1, 2, and 3 into the replacement pattern to create the new formatted string. The result is written to a new output column, while the original value is preserved.

Input table

idphone_number
1555-332-1234
2555-034-9876

Output table

idphone_numberphone_number_replaced
1555-332-1234(555)332-1234
2555-034-9876(555)034-9876

Tokenize configuration

The Tokenize method splits text into tokens based on regex patterns and capture groups. Each capture group becomes a token. This method creates either new columns or rows depending on your configuration.

ParameterDescription
Select Split StrategyChoose how to split the data:
  • Split to columns: Breaks text into tokens and places each token into a new column in the same row. Requires a fixed number of columns.
  • Split to rows: Breaks text into tokens and outputs each token as a new row in a single column. Rows are generated dynamically, making this option useful when the number of tokens varies.
Allow Blank Tokens (Split to columns only)If there are fewer tokens than the defined number of columns, allow empty strings to fill the extra columns. Otherwise, those columns are set to NULL.
Number of columns (Split to columns only)Specify the number of output columns to create for tokenized data.
For Extra Columns (Split to columns only)Define how to handle cases where there are more tokens than columns.
  • Drop Extra with Warning: Skip writing excess tokens and log a warning message to indicate this.
  • Drop Extra without Warning: Skip writing excess tokens silently without generating warnings.
  • Error: Stop processing and raise an error when the number of tokens exceeds the defined number of columns.
Output Root NameBase name for the new column(s) containing the tokens.

Example

Use this method to parse email addresses into username and domain components.

  • Select Column to Split: email
  • Regex: ([^@]+)@(.+)
  • Select Split Strategy: Split to columns
  • Number of columns: 2
  • Output root name: token

Input table

idemail
1support@example.com
2sales@company.org

Output table

idemailtoken_1token_2
1support@example.comsupportexample.com
2sales@company.orgsalescompany.org

Parse configuration

The Parse method extracts capture groups from regex matches and outputs each group as a separate column. Prophecy automatically generates one output column for every capture group in the regex.

ParameterDescription
New Column NameSpecify the name for the new column.
Select Data TypeChoose the data type.
Regex ExpressionView the capture group that will populate the column.
If you edit this value, Prophecy will automatically revert it to the original value.
info

Rows in the Parse Configuration table are determined by the number of capture groups in the Regex field. You cannot add additional rows to or remove rows from this table.

Example

Use this method to parse phone numbers into area_code, exchange, and number columns.

  • Select Column to Split: phone_number

  • Regex: ([0-9]{3})-([0-9]{3})-([0-9]{4})

  • Parse Configuration:

    New Column NameSelect Data TypeRegex Expression
    area_codeString([0-9]{3})
    exchangeString([0-9]{3})
    numberString([0-9]{4})

Input table

idphone_number
1555-332-1234
2555-034-9876

Output table

idphone_numberarea_codeexchangenumber
1555-332-12345553321234
2555-034-98765550349876

Match configuration

The Match method determines whether text matches the specified regex pattern. Adds a column with 1 for matches and 0 for non-matches.

ParameterDescription
Column name for match statusSpecify the name for the new column containing match results.
Error if not MatchedEnable to raise an error when no match is found. When disabled, non-matching rows will receive a 0 value.

Example

Use this method to validate email addresses and create a binary match column.

  • Select Column to Split: email
  • Regex: ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$
  • Column name for match status: is_valid_email
  • Error if not matched: Disabled

Input table

idemail
1support@example.com
2sales.team
3engineering@company.org

Output table

idemailis_valid_email
1support@example.com1
2sales.team0
3engineering@company.org1