Version: v2.3

Workflow Nodes

Following are the fields required to create each of these nodes:

ETL Job Node

This node runs an ETL job and accepts arguments to be used within the job. For instance, if we have an ETL job that finds the highest paying job and its salary, we can use this node to run that job as part of a larger workflow and to trigger subsequent jobs based on its output.

Attribute	Description
Module Type	This field is preselected as 'ETL Job' for etl nodes.
Resource	Select an etl job from the dropdown list of jobs you have access to.
Node Name	Name given to the node for quick and easy identification.
Input Configurations	Arguments which can be used in the job.

ML Model Inference Node

This node is used to run machine learning models and uses input arguments to make decisions within the model. For example, the ML Model Node takes in customer data and runs a pre-trained machine learning model to predict customer churn. The output, the probability of each customer churning, is passed on to the next node in the workflow.

Attribute	Description
Module Type	This field is preselected as 'ML Model' for ml model nodes.
Resource	Select a machine learning model from the dropdown list of ml models you have access to.
Node Name	Name given to the node for quick and easy identification.
Input Dataset	The dataset which contains the file to perform ml model inference on.
Select Latest File	Selecting yes automatically picks the latest file in the input dataset to perform ml model inference on.
File Name Execution Property Key (must be an execution property key)	When the above field is set to no, you must provide a file name from the input dataset.
Target Dataset	The dataset to which ml model inference results are written.
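
The interplay between Select Latest File and the file name execution property key can be sketched as below. This is only an illustration of the selection rule described in the table, not the platform's implementation: the function name, the (file_name, last_modified) tuple shape and the inference_file key are all hypothetical.

```python
from datetime import datetime, timezone


def choose_inference_file(files, select_latest, execution_properties=None, file_name_key=None):
    """Return the name of the file inference would run on.

    files: list of (file_name, last_modified) tuples (hypothetical shape).
    """
    if not files:
        raise ValueError("input dataset has no files")
    if select_latest:
        # 'Select Latest File' = yes: pick the most recently modified file.
        return max(files, key=lambda item: item[1])[0]
    # 'Select Latest File' = no: the file name is read from an execution property.
    file_name = (execution_properties or {}).get(file_name_key)
    if file_name not in {name for name, _ in files}:
        raise ValueError(f"{file_name!r} is not a file in the input dataset")
    return file_name


files = [
    ("patients_jan.csv", datetime(2024, 1, 31, tzinfo=timezone.utc)),
    ("patients_feb.csv", datetime(2024, 2, 29, tzinfo=timezone.utc)),
]
print(choose_inference_file(files, select_latest=True))
print(choose_inference_file(files, False, {"inference_file": "patients_jan.csv"}, "inference_file"))
```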

Email Node

This node is used to send out an email when arguments for the recipient, subject and body are given.

Attribute	Description
Module Type	This field is preselected as 'email' for email nodes.
Node Name	Name given to the node for quick and easy identification.
Email Recipient (must be an execution property key)	List of email addresses to notify. E.g.: ['john.doe@amorphicdata.com','jane.doe@amorphicdata.com']. Note that emails are not delivered to external domains from SES. If needed, please raise a support ticket to bring the SES service out of sandbox status.
Email Subject (must be an execution property key)	Subject of the email.
Email Body (must be an execution property key)	Body of the email.

Creation of a workflow that predicts readmission risk of patients and notifies them via email:

workflow_readmission_prediction

In the above example, the execution properties email_to and email_body are defined as n/a during workflow creation. At run time, Post_Processing_ETL_Job determines which email addresses to notify based on the patient details returned by the risk model, and populates email_to and email_body. The code snippets in the section workflow execution properties can be used to populate execution properties.
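
As a rough sketch of what the post-processing step might compute, the function below builds values for email_to and email_body from model output. The record fields (email, readmission_risk), the threshold and the function name are hypothetical; writing the values back to the workflow is deliberately left out, since that depends on the snippets in the workflow execution properties section.

```python
def build_email_properties(patients, risk_threshold=0.5):
    """Build values for the email_to and email_body execution properties.

    patients: list of dicts with hypothetical keys 'email' and 'readmission_risk'.
    """
    recipients = [p["email"] for p in patients if p["readmission_risk"] >= risk_threshold]
    body = (
        f"{len(recipients)} of {len(patients)} patients scored at or above "
        f"the readmission risk threshold of {risk_threshold}."
    )
    # email_to uses the list-of-addresses format shown in the Email Recipient row.
    return {"email_to": recipients, "email_body": body}


patients = [
    {"email": "john.doe@amorphicdata.com", "readmission_risk": 0.82},
    {"email": "jane.doe@amorphicdata.com", "readmission_risk": 0.17},
]
print(build_email_properties(patients))
```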

Textract Node

This node is used to extract text from documents, images and other types of files.

Attribute	Description
Module Type	This field is preselected as 'Textract' for textract nodes.
Node Name	Name given to the node for quick and easy identification.
Input Dataset	The dataset which contains files to extract text from. Supported files are of type - PDF, JPG, PNG.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Three features are available: Text, Forms and Tables. Text: extracts all text from the document. Forms: extracts forms as key-value pairs. Tables: extracts tables in CSV format.
Target Dataset	The dataset to which extracted text is written to.
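
The three file processing modes can be read as three different filters over the files in the source dataset. The sketch below illustrates that reading under stated assumptions: files are modelled as (file_name, landed_at) tuples, and the function and parameter names are hypothetical rather than part of the product.

```python
from datetime import datetime, timezone


def select_files(files, mode, previous_execution=None, start=None, end=None):
    """Return the file names a node would process for the given mode."""
    if mode == "All":
        return [name for name, _ in files]
    if mode == "Change Data Capture":
        if previous_execution is None:
            # Assumption: with no earlier execution, nothing is filtered out.
            return [name for name, _ in files]
        return [name for name, landed in files if landed > previous_execution]
    if mode == "Time Based":
        if start is None or end is None:
            raise ValueError("Time Based mode needs a start and an end")
        return [name for name, landed in files if start <= landed <= end]
    raise ValueError(f"unknown file processing mode: {mode}")


utc = timezone.utc
files = [
    ("cioms_john_doe.pdf", datetime(2024, 1, 5, tzinfo=utc)),
    ("cioms_jane_doe.pdf", datetime(2024, 1, 20, tzinfo=utc)),
]
print(select_files(files, "Change Data Capture", previous_execution=datetime(2024, 1, 10, tzinfo=utc)))
```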

The picture below shows a textract node being created with source dataset - patient_cioms_docs_pdf and target dataset - patient_cioms_text with file processing mode: all and text as the feature to be extracted to output.

edit textract node

List of files in the source dataset - cioms_john_doe.pdf, cioms_jane_doe.pdf.

list of files in source dataset

List of files in the target dataset post workflow execution - cioms_john_doe.pdf-text.txt, cioms_jane_doe.pdf-text.txt.

list of files in target dataset

Textract node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by textract node
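
A downstream node could use the checkpoint roughly as sketched below. The format of checkpoint_for_target_dataset_utc is not stated here, so the code assumes an ISO 8601 UTC timestamp string and assumes that files written at or after the checkpoint belong to the previous node; both points, along with the function name and the file tuple shape, are assumptions.

```python
from datetime import datetime, timezone


def files_since_checkpoint(files, execution_properties):
    """Return the files written at or after the checkpoint (assumed ISO 8601, UTC).

    files: list of (file_name, written_at) tuples with timezone-aware datetimes.
    """
    raw = execution_properties["checkpoint_for_target_dataset_utc"]
    checkpoint = datetime.fromisoformat(raw)
    if checkpoint.tzinfo is None:
        # The property name says UTC, so treat naive values as UTC.
        checkpoint = checkpoint.replace(tzinfo=timezone.utc)
    return [name for name, written_at in files if written_at >= checkpoint]


properties = {"checkpoint_for_target_dataset_utc": "2024-01-10T00:00:00"}
target_files = [
    ("older_run.pdf-text.txt", datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ("cioms_john_doe.pdf-text.txt", datetime(2024, 1, 10, 0, 5, tzinfo=timezone.utc)),
]
print(files_since_checkpoint(target_files, properties))
```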

Note

No two textract nodes can have same input dataset in a workflow.
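
A workflow definition could be checked against this rule before it is created. The sketch below assumes a hypothetical in-memory representation of nodes as dicts with module_type, node_name and input_dataset keys; it is not the product's validation logic.

```python
from collections import defaultdict


def find_shared_input_datasets(nodes):
    """Return {(module_type, input_dataset): [node names]} for rule violations."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[(node["module_type"], node["input_dataset"])].append(node["node_name"])
    # A violation is any dataset used as input by two or more nodes of one type.
    return {key: names for key, names in grouped.items() if len(names) > 1}


nodes = [
    {"module_type": "Textract", "node_name": "extract_a", "input_dataset": "patient_cioms_docs_pdf"},
    {"module_type": "Textract", "node_name": "extract_b", "input_dataset": "patient_cioms_docs_pdf"},
    {"module_type": "Translate", "node_name": "translate_a", "input_dataset": "german_news_articles"},
]
print(find_shared_input_datasets(nodes))
```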

Rekognition Node

This node is used to analyze images and videos to detect and identify objects, people, and text.

Attribute	Description
Module Type	This field is preselected as 'Rekognition' for rekognition nodes.
Node Name	Name given to the node for quick and easy identification.
Input Dataset	The dataset which contains the files to analyze. Supported files are of type - MP4, JPG, PNG.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Five features are available: Text, Faces, Content Moderation, Celebrities and Labels. Text: extracts all text from the document. Faces: detects faces in an image or video. Content Moderation: extracts the analysis results for inappropriate, unwanted, or offensive content. Celebrities: extracts the name of and additional information about a celebrity. Labels: extracts the label name and the percentage confidence in the accuracy of the detected label.
Target Dataset	The dataset to which extracted json is written to.

The picture below shows a rekognition node being created with source dataset - celebrity_input_jpg and target dataset - celebrity_output_json with file processing mode: all and Celebrities as the feature to be extracted to output.

edit rekognition node

List of files in the source dataset - celebrity_input_jpg.

list of files in source dataset

List of files in the target dataset post workflow execution - barack_obama.jpg-celebrities.json.

list of files in target dataset

Rekognition node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by rekognition node

Note

No two rekognition nodes can have same input dataset in a workflow.

Translate Node

This node is used to translate text from one language to another.

Attribute	Description
Module Type	This field is preselected as 'Translate' for translate nodes.
Node Name	Name given to the node for quick and easy identification.
Source Dataset	The dataset which contains files to translate text from. Supported files are of type - TXT.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Source Language	The language of the text to be translated in the source dataset.
Target Language	The language into which the text is translated before being written to the target dataset.
Target Dataset	The dataset to which the translated text is written. Supported files are of type - TXT.

The picture below shows a translate node being created with source dataset - german_news_articles and target dataset - german_news_articles_translated_into_english with file processing mode: all.

edit translate node

List of files in the source dataset - news_german_nov_04.text, news_german_nov_03.text, news_german_nov_02.text

list of files in source dataset

List of files in the target dataset post workflow execution - news_german_nov_04.text, news_german_nov_03.text, news_german_nov_02.text.

list of files in target dataset

Translate node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by translate node

Note

No two translate nodes can have same input dataset in a workflow.

Limits: Currently the translate node only translates the first 5000 characters of the text; the rest of the text is ignored.
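
One possible way to stay within this limit is to split long text into pieces of at most 5000 characters before the files land in the source dataset. The sketch below is a generic splitter written for illustration, not a platform feature; the function name is hypothetical, and it prefers to break on a space so that words are not cut in half.

```python
def split_for_translation(text, limit=5000):
    """Split text into consecutive chunks of at most `limit` characters."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + limit, len(text))
        if end < len(text):
            # Prefer to cut just after the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


article = "Nachrichten " * 1000  # 12000 characters of sample text
parts = split_for_translation(article)
print([len(part) for part in parts])
```

Joining the chunks in order reproduces the original text, so each piece could be written as its own TXT file and translated separately.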

Comprehend Node

This node is used to extract insights and relationships from text.

Attribute	Description
Module Type	This field is preselected as 'Comprehend' for comprehend nodes.
Node Name	Name given to the node for quick and easy identification.
Input Dataset	The dataset which contains files to find insights and relationships in text. Supported files are of type - TXT.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Three features are available: Entities, KeyPhrases and Sentiment. Entities: named entities such as people, places and locations in a document. KeyPhrases: key phrases or talking points in a document. Sentiment: overall sentiment of a text (positive, negative, neutral or mixed).
Target Dataset	The dataset to which the insights and relationships found in the text are written.

The picture below shows a comprehend node being created with source dataset - sample_news_articles and target dataset - news_articles_extracted_entities with file processing mode: all and three features - Entities, KeyPhrases, Sentiment to be extracted to output.

edit comprehend node

List of files in the source dataset - sample_news_article.txt.

list of files in source dataset

List of files in the target dataset post workflow execution - sample_news_article.txt-entities.json, sample_news_article.txt-keyphrases.json.

list of files in target dataset

Comprehend node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by comprehend node

Note

No two comprehend nodes can have same input dataset in a workflow.

Medical Comprehend Node

This node is used for natural language processing of medical text to extract insights and relationships.

Attribute	Description
Module Type	This field is preselected as 'Medical Comprehend' for medical comprehend nodes.
Node Name	Name given to the node for quick and easy identification.
Input Dataset	The dataset which contains files to find medications, medical conditions etc., in text. Supported files are of type - TXT.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Four features are available: Medications, Medical conditions, Personal health information and Medical entities. Medications: detects medication and dosage information for the patient. Medical conditions: detects the signs, symptoms, and diagnosis of medical conditions. Personal health information: detects the patient's personal information. Medical entities: all the medical and personal information in the document.
Target Dataset	The dataset to which extracted medical information in text is written to.

The picture below shows a medical comprehend node being created with source dataset - medical_articles and target dataset - medical_articles_extracted_information with file processing mode: all and features - Medications, MedicalConditions, PersonalHealthInformation, MedicalEntities to be extracted to output.

edit medical comprehend node

List of files in the source dataset - cancer_medication_article.txt.

list of files in source dataset

List of files in the target dataset post workflow execution - cancer_medication_article.txt-medications.json, cancer_medication_article.txt-medicalconditions.json.

list of files in target dataset

Medical comprehend node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by medical comprehend node

Note

No two medical comprehend nodes can have same input dataset in a workflow.

Transcribe Node

This node is used to transcribe audio files into text.

Attribute	Description
Module Type	This field is preselected as 'Transcribe' for transcribe nodes.
Node Name	Name given to the node for quick and easy identification.
Source Dataset	The dataset which contains audio files to extract content in the form of text. Supported files are of type - MP3, WAV.
Source Language	The language of the audio files in input dataset.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Four features are available: Text, ConversationBySpeaker, RedactedText and RedactedConversationBySpeaker. Text: raw text extracted from the audio file. ConversationBySpeaker: raw conversation showing each speaker and the sentence the speaker spoke. RedactedText: text extracted from the audio file with some content obscured for legal and security purposes. RedactedConversationBySpeaker: conversation showing each speaker and the sentence the speaker spoke, with some content obscured for legal and security purposes.
Target Dataset	The dataset to which extracted content is written in the form of txt files.

The picture below shows a transcribe node being created with source dataset - sample_news_articles_mp3 and target dataset - sample_news_articles_extracted_entities with file processing mode: all and four features - Text, ConversationBySpeaker, RedactedText and RedactedConversationBySpeaker to be extracted to output.

edit transcribe node

List of files in the source dataset - sample_news_article.mp3.

list of files in source dataset

List of files in the target dataset post workflow execution - sample_news_article.mp3-redacted_conversation_by_speaker.txt, sample_news_article.mp3-redacted_text.txt, sample_news_article.mp3-conversation_by_speaker.txt, sample_news_article.mp3-text.txt.

list of files in target dataset
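
The target file names in this example appear to follow the pattern source file name, a hyphen, the feature name in snake case, then .txt. The helper below reproduces that pattern as inferred from the file list above; it is a guess at the convention, not a documented rule, and the function name is hypothetical. Other nodes' examples differ slightly (for instance, the Comprehend example shows keyphrases without an underscore and a .json extension).

```python
import re


def transcribe_target_name(source_file, feature, extension="txt"):
    """Guess the output file name for a transcribe feature such as 'RedactedText'."""
    # Insert an underscore before every capital letter except the first, then lowercase.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", feature).lower()
    return f"{source_file}-{snake}.{extension}"


for feature in ("Text", "ConversationBySpeaker", "RedactedText", "RedactedConversationBySpeaker"):
    print(transcribe_target_name("sample_news_article.mp3", feature))
```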

Transcribe node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by transcribe node

Note

No two transcribe nodes can have same input dataset in a workflow.

Medical Transcribe Node

This node is used to transcribe audio files of medical content into text.

Attribute	Description
Module Type	This field is preselected as 'Medical Transcribe' for medical transcribe nodes.
Node Name	Name given to the node for quick and easy identification.
Source Dataset	The dataset which contains audio files to extract content in the form of text. Supported files are of type - MP3, WAV.
Source Language	The language of the audio files in the input dataset. Only English-US (en-US) is supported as of now.
File Processing Mode	Three modes are available: All, Change Data Capture and Time Based. All: processes all documents in the source dataset. Change Data Capture: processes the documents that have landed after the previous workflow execution. Time Based: processes the documents based on the custom time period chosen.
Features	Choose one or more features that you want to extract to the target dataset. Two features are available: Text and ConversationBySpeaker. Text: raw text extracted from the audio file. ConversationBySpeaker: raw conversation showing each speaker and the sentence the speaker spoke.
Target Dataset	The dataset to which extracted content is written in the form of txt files.

The picture below shows a medical transcribe node being created with source dataset - sample_medical_article_mp3 and target dataset - sample_medical_article_extracted_entities with file processing mode: all and two features - Text and ConversationBySpeaker to be extracted to output.

edit medical transcribe node

List of files in the source dataset - sample_medical_article.mp3.

list of files in source dataset

List of files in the target dataset post workflow execution - sample_medical_article.mp3-conversation_by_speaker.txt, sample_medical_article.mp3-text.txt.

list of files in target dataset

Medical transcribe node updates workflow execution properties with checkpoint_for_target_dataset_utc so that subsequent nodes can identify files generated from previous node in workflow execution.

Workflow execution properties updated by medical transcribe node

Note

No two medical transcribe nodes can have same input dataset in a workflow.

Workflow Node

This node is used to combine previously created workflows and run them in parallel or sequentially. For example, if we have a workflow that has an ETL Job Node with an Email Node and another workflow which has a Translate Node with an Email Node, we can use this Workflow Node to run those two workflows in parallel or one after the other.

Attribute	Description
Module Type	This field is preselected as 'Workflow' for workflow nodes.
Resource	Select a workflow from the dropdown list of workflows you have access to.
Node Name	Name given to the node for quick and easy identification.

If multiple workflows workflow_a, workflow_b and workflow_c are chained as nodes in workflow_root, we refer to workflow_a, workflow_b and workflow_c as child workflows and to workflow_root as the parent workflow.

Once a parent workflow is stopped, it triggers the stoppage of all of its child workflows.

The following graphic shows how a workflow with workflow nodes is created: Workflow_Creation

Parent workflow execution shows details of node execution status: Parent_workflow_execution

Workflow execution properties can be defined at the parent workflow level, and they can be accessed and updated from any of the child workflows. Execution properties of a parent workflow override execution properties of a child workflow. Refer to workflow execution properties for creating execution properties.
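
The override rule can be pictured as a dictionary merge in which the parent's values win on conflicting keys. The snippet below is only a mental model with hypothetical property values and a hypothetical function name; it does not show how the platform stores or resolves properties.

```python
def effective_child_properties(parent_properties, child_properties):
    """Merge properties so that parent values override child values on the same key."""
    # Later entries win in a dict merge, so the parent is unpacked last.
    return {**child_properties, **parent_properties}


parent = {"email_to": ["john.doe@amorphicdata.com"]}
child = {"email_to": ["jane.doe@amorphicdata.com"], "email_body": "n/a"}
print(effective_child_properties(parent, child))
```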

Child workflow execution shows details of node execution status and properties inherited from parent workflow: Child_workflow_execution

The graphic below shows the parent workflow execution properties and details of the child workflows: Parent_workflow_execution_properties

File Load Validation Node

This node is used to validate and check the data before loading into the system.

Attribute	Description
Module Type	This field is preselected as 'File Load Validation' for file load validation nodes.
Node Name	Name given to the node for quick and easy identification.
Timeout	Optional field to set the timeout value (in minutes) for the node to stop execution. Default is set to 60 minutes.

The following graphic shows how a file load validation node can be created to validate if all files written by an etl job to a dataset are ready to be consumed:

Creation of file_load_validation node

Note

A file load validation node can only follow an etl job node; it cannot exist by itself or follow other types of nodes in a workflow.

A sample workflow execution with file load validation looks as shown below:

Execution properties of file_load_validation node

node_name.total_number_of_files - Number of files written by preceding etl job node.

node_name.number_of_completed_files - Number of files successfully validated and ready to be consumed from datasets.

node_name.number_of_processing_files - Number of files in process of being validated.

node_name.number_of_pending_files - Number of files in pending state waiting for success file to proceed further in processing. This is only applicable for datasets of type reload.

node_name.number_of_failed_files - Number of files that failed to be validated and are not available for consumption.

node_name.output_manifest_file - Manifest file showing list of file details.
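
A later step could interpret these counters along the lines of the sketch below. It assumes that node_name in each key stands for the actual node name and that the values can be read as integers; the function name, the sample node name and the readiness rule (all files completed, none failed) are illustrative assumptions.

```python
def summarize_file_load_validation(execution_properties, node_name):
    """Summarize the counters written by a file load validation node."""

    def count(suffix):
        # Missing counters are treated as zero (assumption).
        return int(execution_properties.get(f"{node_name}.{suffix}", 0))

    total = count("total_number_of_files")
    completed = count("number_of_completed_files")
    failed = count("number_of_failed_files")
    in_flight = count("number_of_processing_files") + count("number_of_pending_files")
    return {
        "ready": total > 0 and completed == total and failed == 0,
        "in_flight": in_flight,
        "failed": failed,
    }


properties = {
    "validate_load.total_number_of_files": 3,
    "validate_load.number_of_completed_files": 2,
    "validate_load.number_of_processing_files": 1,
    "validate_load.number_of_pending_files": 0,
    "validate_load.number_of_failed_files": 0,
}
print(summarize_file_load_validation(properties, "validate_load"))
```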

By clicking 'Download' button next to output manifest file, you will be able to download output manifest file as shown below:

Output manifest file of file_load_validation node

It is advised to set number of concurrent runs to 1 on etl job when file load validation node is used. For advanced use cases which involve concurrent workflow executions, please refer to: Advanced usage of file load validation node

Sync To S3 Node

This node is used to synchronize data to the S3 storage.

Note

The Sync to S3 function for a dataset can now be run from multiple places within the Amorphic console, but only one sync can run at a time for a specific dataset within the same domain. Simultaneous attempts to synchronize the same dataset may fail with an error message; this prevents data inconsistency.

Attribute	Description
Module Type	This field is preselected as 'Sync To S3' for Sync To S3 nodes.
Node Name	Name given to the node for quick and easy identification.
Concurrency Factor	The number of datasets that the user wants to sync in parallel. Valid values range from 1-10.
Domain	The name of the domain for which datasets need to be synced from the Data Warehouse (Redshift or AuroraMySql) to S3.
Sync All Datasets	Indicates whether all datasets in the domain should be synced (when set to 'Yes') or not. Only registered datasets with TargetLocation as Redshift or AuroraMySql are included.
Select Datasets	List of datasets (only registered datasets with TargetLocation as Redshift or AuroraMySql) that you want to sync from the Data Warehouse to S3. This attribute is required only if you select 'No' for the 'Sync All Datasets' attribute.
Timeout	Optional field to set the timeout value (in minutes) for the node to stop execution. Default is set to 60 minutes.

The following graphic shows how a Sync To S3 node can be created to sync data for all the datasets present in the 'Insurance' domain, with a maximum of 10 datasets synced in parallel at a time:

Sync to S3 Node

Users can view or download the sync progress report for all selected datasets in the node execution properties. Progress is updated only when the syncing of a batch of datasets (sized by the Concurrency Factor, here 10) has completed or failed. The graphic below shows the execution properties of the Sync to S3 node:

Sync to S3 Execution Properties
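
The batch-wise progress updates can be illustrated by grouping the selected datasets into batches of Concurrency Factor size. The sketch below only shows that grouping, with hypothetical dataset names and a hypothetical function name; how the node actually schedules the syncs is not described here.

```python
def sync_batches(datasets, concurrency_factor):
    """Group datasets into batches of at most `concurrency_factor` items."""
    if not 1 <= concurrency_factor <= 10:
        raise ValueError("Concurrency Factor must be between 1 and 10")
    return [
        datasets[i:i + concurrency_factor]
        for i in range(0, len(datasets), concurrency_factor)
    ]


datasets = [f"insurance_dataset_{n}" for n in range(1, 26)]
for number, batch in enumerate(sync_batches(datasets, 10), start=1):
    # Progress would be reported once per batch, after it completes or fails.
    print(f"batch {number}: {len(batch)} datasets")
```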

Users can click 'Download' to get the manifest file from the execution properties of the Sync To S3 node. The graphic below shows the manifest file that is generated after a workflow with a Sync To S3 node is executed:

Sync to S3 Manifest File
