S3 Connections
From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) have been updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.
S3 Connections are used to migrate data from a remote S3 bucket to Amorphic Data Cloud. There are two types of S3 Connections available: Bucket Policy and Access Keys.
Bucket Policy
To create an S3 Connection using a Bucket Policy, you have to first select the Bucket from which the data needs to be migrated and specify the Bucket Region.
Once the connection is created, a bucket policy and a KMS key policy are available for the user to download. The generated bucket policy should be attached to the source bucket that was added to the connection during creation. If the source bucket is encrypted with a KMS key, the Amorphic-generated KMS key policy should also be added to that key's policy.
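The policy to attach is always the one downloaded from Amorphic. Purely to illustrate the general shape such a document tends to have, the sketch below builds a generic read-only S3 bucket policy for an external principal. The function name, the bucket name, and the principal ARN are hypothetical, and the statements are an assumption about what a read-only grant typically contains, not a copy of what Amorphic generates.

```python
import json


def build_read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Return a generic read-only bucket policy (illustration only).

    This is NOT the Amorphic-generated policy; always attach the
    downloaded file instead.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "AllowGetObject",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical values, for illustration only.
policy = build_read_only_bucket_policy(
    "my-source-bucket", "arn:aws:iam::111122223333:role/example-ingestion-role"
)
policy_json = json.dumps(policy, indent=2)
```

Comparing the downloaded policy against a shape like this can help when reviewing what access is being granted before attaching it to the source bucket.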
Access Keys
To create an S3 Connection using Access Keys, you have to select the bucket from which the data has to be migrated and provide the Access Key and Secret Access Key of a user who has permission to read data from the bucket.
How to create an S3 Connection?
Attribute | Description |
---|---|
Connection Name | Give your connection a unique name. |
Connection Type | Type of connection, in this case, it is S3. |
Description | You can describe the connection purpose and important information about the connection. |
Authorized Users | Amorphic users who can access this connection. |
S3 Bucket | Name of the bucket from which the dataset files have to be imported. |
Connection Access Type | There are two access types for this connection: Access Keys and Bucket Policy. |
Version | Selects which version of the ingestion scripts to use (Amorphic specific). Whenever a new feature or Glue version is added to the underlying ingestion script, a new version is added to Amorphic. |
S3 Bucket Region | Region where the source S3 bucket was created. If the source bucket is in one of the regions eu-south-1, af-south-1, me-south-1, or ap-east-1, this property must be provided and the region must be enabled in Amorphic; otherwise ingestion fails. |
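
The S3 Bucket Region rule in the table can be checked before creating a connection. The following is a minimal sketch, assuming the four regions listed above are the complete set; the constant and function names are hypothetical and not part of any Amorphic API.

```python
# Regions that the table lists as requiring an explicit S3 Bucket Region
# (and that must also be enabled in Amorphic).
REGIONS_REQUIRING_EXPLICIT_SETTING = frozenset(
    {"eu-south-1", "af-south-1", "me-south-1", "ap-east-1"}
)


def bucket_region_required(region: str) -> bool:
    """Return True if the S3 Bucket Region property must be provided."""
    return region.strip().lower() in REGIONS_REQUIRING_EXPLICIT_SETTING


needs_region = bucket_region_required("eu-south-1")
default_region = bucket_region_required("us-east-1")
```
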
For Redshift use cases with a large number of incoming files, turn on dataload throttling and set the maximum limit for Redshift to 90.
Data migration to Amorphic
To migrate data to Amorphic, users must create a new Dataset, choosing the Connection Type as S3, and then select this S3 connection from the drop-down list. During dataset creation, users have the option to provide the FileType and DirectoryPath, ensuring that only the data from the specified path and file type is extracted.
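The effect of FileType and DirectoryPath can be pictured as a filter over the object keys in the source bucket. The sketch below is an assumption about how such a filter could behave (prefix match on the directory, case-insensitive match on the file extension); the function and parameter names are hypothetical, and Amorphic's exact matching rules may differ.

```python
from typing import Iterable, List, Optional


def select_keys(
    keys: Iterable[str],
    directory_path: Optional[str] = None,
    file_type: Optional[str] = None,
) -> List[str]:
    """Keep only keys under directory_path that end with file_type."""
    # Normalise "sales/2024", "/sales/2024/" and similar to "sales/2024/".
    prefix = directory_path.strip("/") + "/" if directory_path else ""
    # Normalise "csv" and ".CSV" to ".csv".
    suffix = "." + file_type.lstrip(".").lower() if file_type else ""

    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue  # outside the requested directory
        if suffix and not key.lower().endswith(suffix):
            continue  # different file type
        selected.append(key)
    return selected


keys = ["sales/2024/a.csv", "sales/2024/b.json", "hr/c.csv"]
csv_in_sales = select_keys(keys, directory_path="sales/2024", file_type="csv")
```

With no directory or file type given, the sketch returns every key unchanged, which mirrors the fact that both settings are optional during dataset creation.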
After dataset creation, the next step for the user is to set up a Schedule for data ingestion. They need to create a schedule with the Type specified as DataIngestion and select the previously created dataset from the list. This schedule will be responsible for ingesting data from the source S3 bucket into the target dataset that was created earlier.
Once the schedule has been set up and has run successfully, the user can check the Amorphic dataset to verify that the files from the source S3 bucket have been ingested and are now present in the dataset.
Upgrade S3 Connection
You can upgrade a connection if a new version is available. Upgrading a connection upgrades the underlying Glue version and the data ingestion script with new features.
Downgrade S3 Connection
You can downgrade a connection to a previous version if the upgrade does not meet your needs. A connection can only be downgraded if it has been upgraded. The option to downgrade is available in the top right corner if the connection is downgrade compatible.
Connection Versions
1.6
In this version of the S3 connection, data ingestion works by comparing the ETags of the files in the source and the target.
First the file name is checked. If a file with that name already exists in the dataset and its size is the same, the ETags of the source file and the existing file are compared; if they match, the file is treated as unchanged and is not ingested again, otherwise it is ingested. The file name is checked first because the ETag of a file does not change when the file is renamed, so a user who intends to duplicate files under new names would not be able to do so if only the ETag were considered.
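The layered comparison described above (file name, then size, then ETag) can be sketched as a small helper. The `FileInfo` record and the function name are hypothetical; the sketch only reports whether a source file is identical to a file already in the dataset and leaves the ingestion decision to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    name: str
    size: int
    etag: str


def is_identical(source: FileInfo, existing: Optional[FileInfo]) -> bool:
    """Compare name first, then size, then ETag."""
    if existing is None or source.name != existing.name:
        # A different name is never identical, even with the same ETag,
        # so renamed copies are treated as separate files.
        return False
    if source.size != existing.size:
        return False
    return source.etag == existing.etag


original = FileInfo("report.csv", 1024, "abc123")
renamed_copy = FileInfo("report_copy.csv", 1024, "abc123")
same_file = is_identical(original, original)
copy_matches = is_identical(renamed_copy, original)
```

Because the name is compared before the ETag, a renamed copy with an unchanged ETag is not reported as identical to the original.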
In this version, only files stored in the S3 Standard storage class are supported for S3 data ingestion. If the source contains files in other storage classes, the ingestion process fails.
1.7
In this version of the S3 connection, the storage classes of files do not affect the flow of ingestion.
That is, the ingestion process is not terminated even if the source contains files in S3 Glacier storage classes. Those files are skipped, their details are shown, and all other files are ingested without failure.
Files that are stored in the S3 Glacier and S3 Glacier Deep Archive classes are skipped during ingestion.
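The skip behaviour can be sketched as a partition over an object listing. The example assumes each object is a dictionary carrying the `StorageClass` value that S3 reports in object listings (`GLACIER` and `DEEP_ARCHIVE` for the two skipped classes); the function name is hypothetical, and treating a missing value as `STANDARD` is an assumption of this sketch.

```python
from typing import Dict, Iterable, List, Tuple

# Storage classes the text says are skipped during ingestion.
SKIPPED_STORAGE_CLASSES = frozenset({"GLACIER", "DEEP_ARCHIVE"})


def partition_by_storage_class(
    objects: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split objects into (to_ingest, skipped) without raising on archived files."""
    to_ingest, skipped = [], []
    for obj in objects:
        # Assumption: objects without a StorageClass are treated as STANDARD.
        storage_class = obj.get("StorageClass", "STANDARD")
        if storage_class in SKIPPED_STORAGE_CLASSES:
            skipped.append(obj)
        else:
            to_ingest.append(obj)
    return to_ingest, skipped


listing = [
    {"Key": "a.csv", "StorageClass": "STANDARD"},
    {"Key": "b.csv", "StorageClass": "GLACIER"},
    {"Key": "c.csv", "StorageClass": "DEEP_ARCHIVE"},
]
to_ingest, skipped = partition_by_storage_class(listing)
```

Collecting the skipped objects in their own list, rather than discarding them, matches the description that details of skipped files are shown after the run.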
1.8
This version of the S3 connection adds support for the Skip LZ feature.
This feature enables users to upload data directly to the data lake zone by skipping data validation. Refer to the Skip LZ documentation for more details.