Version: v2.5
RegexTokenizer
This node creates a new DataFrame by taking text (such as a sentence) and breaking it into individual terms (usually words) based on a regular expression pattern.
Type
transform
Fields
Name | Title | Description |
---|---|---|
inputCol | Column | Input column containing the text to tokenize |
outputCol | Tokenized Column | New output column holding the tokens |
pattern | Pattern | The regex pattern, matching either the delimiters (when gaps is true) or the tokens themselves (when gaps is false) |
gaps | Gaps | Indicates whether the regex splits on gaps/delimiters (true) or matches tokens directly (false) |
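The interaction between pattern and gaps can be sketched with Python's re module (an analogy, not the node's own implementation): when gaps is true the pattern is treated as a delimiter and the text is split on it, and when gaps is false the pattern is treated as a token matcher.

```python
import re

text = "this is a spam"

# gaps = true: the pattern matches the DELIMITERS, so the text is split on it
tokens_split = re.split(r"\s+", text)
# ['this', 'is', 'a', 'spam']

# gaps = false: the pattern matches the TOKENS themselves
# (here a different pattern, \w+, is needed to capture the words)
tokens_match = re.findall(r"\w+", text)
# ['this', 'is', 'a', 'spam']
```

Both calls yield the same tokens for this input, but only because the two patterns are complementary; with gaps set to false a whitespace pattern like \s+ would match the spaces, not the words.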
Examples
Input
label | message | id |
---|---|---|
DoubleType | StringType | DoubleType |
1.0 | this is a spam | 2.0 |
0.0 | i am going to work | 1.0 |
Parameters
Name | Value |
---|---|
Column | message |
Tokenized Column | token_output |
Pattern | \s+ |
Gaps | true |
Output
label | message | id | token_output |
---|---|---|---|
DoubleType | StringType | DoubleType | ArrayType(StringType,true) |
1.0 | this is a spam | 2.0 | WrappedArray(this, is, a, spam) |
0.0 | i am going to work | 1.0 | WrappedArray(i, am, going, to, work) |
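The example output above can be reproduced outside the node with a small sketch (plain Python standing in for the transform; the regex_tokenize helper is hypothetical, not part of the node's API):

```python
import re

# Rows from the Input table: (label, message, id)
rows = [
    (1.0, "this is a spam", 2.0),
    (0.0, "i am going to work", 1.0),
]

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    # gaps=True splits the text on the pattern (pattern = delimiters);
    # gaps=False extracts substrings matching the pattern (pattern = tokens)
    return re.split(pattern, text) if gaps else re.findall(pattern, text)

for label, message, id_ in rows:
    token_output = regex_tokenize(message)
    print(label, message, id_, token_output)
```

Each printed token_output list corresponds to one WrappedArray in the Output table.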