Schedules
Amorphic Schedules automate data ingestion: you can schedule batch and streaming data ingestion on a regular basis. This eliminates the need for manual intervention and ensures that data is always up to date. You can set up custom schedules based on your specific needs.
How to create a Schedule?
Click on + New Schedule to create a schedule and fill in the information shown below.
Type | Description |
---|---|
Schedule Name | A unique name that identifies the schedule's specific purpose |
Job Type | You can pick a specific job type from the dropdown list (details given in the Job Type table below) |
Schedule Type | There are two schedule types |
Schedule Expression | Time-based schedules require a schedule expression, e.g., every 15 minutes, daily, etc. |
If the schedule's job type is 'Data Ingestion' and the dataset is of 'reload' type, then the schedule execution will load the data and reload it automatically.
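Since Amorphic runs on AWS, time-based schedule expressions typically follow an AWS-style rate/cron syntax. The expressions below are illustrative assumptions only, not confirmed Amorphic syntax:

```json
{
  "every15Minutes": "rate(15 minutes)",
  "daily": "rate(1 day)",
  "weekdaysAt6amUTC": "cron(0 6 ? * MON-FRI *)"
}
```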
Type | Description |
---|---|
ETL Job | This option is used to schedule an ETL job. |
JDBC CDC | This option is utilized to synchronize data between a data warehouse and S3 for tasks related to Change Data Capture (CDC). It's important to note that only tasks with the "SyncToS3" option set to "yes" will be visible and can be scheduled. |
Data Ingestion | This option is used to schedule a data ingestion job for normal JDBC, S3 and external API connections. |
JDBC FullLoad | This option is used to schedule a JDBC Bulk Data Load full-load task. |
Forecast Predictors | This option is used to schedule a forecast predictor. |
Forecast Reports | This option is used to schedule a forecast report. |
Workflows | This option is used to schedule a workflow. |
HCLS-Store | This option is used to schedule an import job for a Healthlake Store, Omics Storage (Sequence Store), Omics Analytics (Variant Store, Annotation Store), or HealthImaging store |
Health Image Data Conversion | This option is used to schedule a job which converts DICOM files in a dataset to NDJSON format and stores them in a different dataset. |
Data Ingestion
This type of schedule is used to schedule a data ingestion job for a normal data load connection type. The supported arguments for this schedule are:
For JDBC connection schedules
- NumberOfWorkers : This parameter specifies the number of worker nodes allocated for your Glue job. Allowed values are between 2 and 100.
- WorkerType : This parameter specifies the type of worker (computing resources) you want to use for the jobs. The worker type determines the amount of memory, CPU, and overall processing power allocated to each worker. Allowed values are Standard, G.1X, and G.2X only.
- query : Users can use this argument to specify a SELECT SQL query and ingest the data retrieved by that query from the source database.
- prepareQuery : This argument specifies a prefix that, together with the query argument, forms the final SQL query. It offers a way to run complex queries. Read here for more information.
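For instance, the arguments of a JDBC data-ingestion schedule could be supplied as a key-value map like the one below. Only the argument names come from the list above; the payload shape, table names, and values are assumptions for illustration:

```json
{
  "NumberOfWorkers": 4,
  "WorkerType": "G.1X",
  "prepareQuery": "WITH recent AS (SELECT * FROM orders WHERE order_date > '2024-01-01')",
  "query": "SELECT * FROM recent"
}
```

Here prepareQuery carries the WITH-clause prefix, and the final SQL executed against the source database would be the concatenation of prepareQuery and query.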
For S3 and Ext-API connection schedules
- MaxTimeOut : Can be provided during creation to override the connection's timeout setting for the specific schedule. It accepts values from 1 to 2880.
- MaxCapacity : This parameter defines the number of AWS Glue data processing units (DPUs) that can be allocated when the job runs. Allowed values are 1 and 0.0625 only.
- FileConcurrency : This argument is unique to S3 connections. It determines the number of parallel file uploads that happen during S3 ingestion.
If the dataset is of 'reload' type, then the schedule execution will load the data and also reload it automatically.
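Putting the S3/Ext-API arguments above together, a schedule's argument map might look like the following sketch (the key-value shape and the specific values are assumptions; the argument names and allowed ranges come from the list above):

```json
{
  "MaxTimeOut": 120,
  "MaxCapacity": 0.0625,
  "FileConcurrency": 20
}
```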
Health Image Data Conversion
This type of schedule job is used to convert DICOM files in a dataset to NDJSON format so they can be uploaded to a Healthlake store, which supports only NDJSON file formats when importing data. The input dataset for these jobs is a dataset containing DICOM files. Users must specify an output dataset id in the arguments with the key outputDatasetId, and its value should be the id of a valid S3 'other' type dataset. Converted NDJSON files will be stored in the specified output dataset. An optional argument selectFiles with the value all selects all files in the input dataset during data conversion. The default value of this key is latest, which only selects the files uploaded to the dataset after the last job run.
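A Health Image Data Conversion schedule's arguments could therefore look like the sketch below. The key names come from the text above; the map shape is an assumption, and the dataset id is a placeholder, not a real value:

```json
{
  "outputDatasetId": "<valid-s3-other-type-dataset-id>",
  "selectFiles": "latest"
}
```

Setting selectFiles to "all" instead would reconvert every DICOM file in the input dataset rather than only those uploaded since the last run.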
Schedule details
Once you have created a schedule, you can view it on the schedules listing page, and perform various actions on it, such as running, disabling, enabling, editing, cloning, or deleting the schedule.
Run Schedule
To schedule a job, you can utilize the Run Schedule option located in the top right corner of the page. After running the schedule, you can review its status in the Execution Status tab. This tab will indicate whether the job is currently running, or if it has completed either successfully or with a failure.
- Schedule execution will error out if the related S3 connection uses any of the Amorphic S3 buckets as a source, for example:
<projectshortname-region-accountid-env-dlz>
- For Data Ingestion Schedules, the following arguments can be provided during schedule runs:
- MaxTimeOut: This argument allows users to override the timeout setting of the connection for the specific run. It accepts values from 1 to 2880.
- FileConcurrency: This argument enables users to configure the number of parallel file ingestions that occur for S3 connections. It accepts values from 1 to 100 and has a default value of 20.
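As a sketch, a run-time override for a Data Ingestion schedule might pass these two arguments together (the key-value shape and the specific values are assumptions; the names and ranges come from the list above):

```json
{
  "MaxTimeOut": 60,
  "FileConcurrency": 40
}
```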
Schedule use case
When the schedule execution is completed, an email notification will be sent out, based on the notification setting and schedule execution status. You can also view the execution logs of each schedule run, which includes Output Logs, Output Logs (Full), and Error Logs.
For example, if you need to create a schedule that runs an ETL job and sends out important emails every 4 hours, you can create a workflow with an ETL Job Node followed by a Mail Node. This workflow can then be scheduled to run every 4 hours, every day.