Version: v2.4 print this page

RegexTokenizer

This node creates a new DataFrame by the process of taking text (such as a sentence) and breaking it into individual terms (usually words) based on regular express

Type

transform

Fields

Name	Title	Description
inputCol	Column	input column for tokenizing
outputCol	Tokenized Column	New output column after tokenization
pattern	Pattern	The regex pattern used to match delimiters
gaps	Gaps	Indicates whether the regex splits on gaps

Examples

Input

label	message	id
DoubleType	StringType	DoubleType
1.0	this is a spam	2.0
0.0	i am going to work	1.0

Parameters

Name	Value
Column	message
Tokenized Column	token_output
Pattern	\s+
Gaps	false

Output

label	message	id	token_output
DoubleType	StringType	DoubleType	ArrayType(StringType,true)
1.0	this is a spam	2.0	WrappedArray(this, is, a, spam)
0.0	i am going to work	1.0	WrappedArray(i, am, going, to, work)

Type​

Fields​

Examples

Input​

Parameters​

Output​

Type

Fields

Input

Parameters

Output