
RegexTokenizer

This node creates a new DataFrame by taking text (such as a sentence) and breaking it into individual terms (usually words) based on a regular expression.

Type

transform

Fields

| Name      | Title            | Description                                |
|-----------|------------------|--------------------------------------------|
| inputCol  | Column           | Input column to tokenize                   |
| outputCol | Tokenized Column | New output column after tokenization       |
| pattern   | Pattern          | The regex pattern used to match delimiters |
| gaps      | Gaps             | Indicates whether the regex splits on gaps |
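If the node is backed by Spark ML's RegexTokenizer (an assumption based on the field names, not stated on this page), the fields above map roughly to the transformer's parameters. A minimal configuration sketch:

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// Hypothetical mapping of the node fields to Spark ML parameters.
// Note: in Spark ML, gaps = true treats the pattern as a delimiter (split on
// matches), while gaps = false treats the pattern as the tokens themselves;
// how the node's Gaps field maps onto this flag is an assumption.
val tokenizer = new RegexTokenizer()
  .setInputCol("message")        // inputCol  (Column)
  .setOutputCol("token_output")  // outputCol (Tokenized Column)
  .setPattern("\\s+")            // pattern   (Pattern)
  .setGaps(true)                 // gaps      (Gaps)
```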

Examples

Input

| label (DoubleType) | message (StringType) | id (DoubleType) |
|--------------------|----------------------|-----------------|
| 1.0                | this is a spam       | 2.0             |
| 0.0                | i am going to work   | 1.0             |

Parameters

| Name             | Value        |
|------------------|--------------|
| Column           | message      |
| Tokenized Column | token_output |
| Pattern          | \s+          |
| Gaps             | false        |

Output

| label (DoubleType) | message (StringType) | id (DoubleType) | token_output (ArrayType(StringType, true)) |
|--------------------|----------------------|-----------------|--------------------------------------------|
| 1.0                | this is a spam       | 2.0             | WrappedArray(this, is, a, spam)            |
| 0.0                | i am going to work   | 1.0             | WrappedArray(i, am, going, to, work)       |
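For reference, a sketch of the same example in plain Spark, assuming the node delegates to Spark ML's RegexTokenizer. The SparkSession setup is illustrative, and `setGaps(true)` is what Spark ML itself needs to split on the `\s+` delimiter and produce the word tokens shown above:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RegexTokenizerExample")
  .master("local[*]")
  .getOrCreate()

// Input DataFrame matching the example above.
val df = spark.createDataFrame(Seq(
  (1.0, "this is a spam", 2.0),
  (0.0, "i am going to work", 1.0)
)).toDF("label", "message", "id")

// Tokenize the message column on whitespace into token_output.
val tokenizer = new RegexTokenizer()
  .setInputCol("message")
  .setOutputCol("token_output")
  .setPattern("\\s+")
  .setGaps(true)

tokenizer.transform(df).show(truncate = false)
```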