S3 Connections
From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) have been updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.
S3 Connections are used to migrate data from a remote S3 bucket to Amorphic Data Cloud. There are two types of S3 Connections available: Bucket Policy and Access Keys.
Bucket Policy
To create an S3 Connection using a Bucket Policy, you have to first select the Bucket from which the data needs to be migrated and specify the Bucket Region.
After the connection is established, the user will have the option to download a Bucket policy and a KMS key policy. The generated bucket policy should be attached to the source bucket that was added to the connection during its creation. If the source bucket is associated with a KMS key, then the KMS key policy generated by Amorphic should also be attached to the policy of that KMS key.
Access Keys
To create an S3 Connection using Access Keys, select the bucket from which the data has to be migrated and provide the Access Key and Secret Access Key of a user who has permission to read data from that bucket.
How to create an S3 Connection?
Attribute | Description |
---|---|
Connection Name | Give your connection a unique name. |
Connection Type | Type of connection, in this case, it is S3. |
Description | You can describe the connection purpose and important information about the connection. |
Authorized Users | Amorphic users who can access this connection. |
S3 Bucket | Name of the bucket from which the dataset files have to be imported. |
Connection Access Type | There are two access types for this connection: Access Keys and Bucket Policy. |
Version | Enables you to select which version of the ingestion scripts to use (Amorphic specific). Whenever a new feature or Glue version is added to the underlying ingestion script, a new version is added to Amorphic. |
S3 Bucket Region | Region where the source S3 bucket is created. If the source bucket is in one of the regions eu-south-1, af-south-1, me-south-1, or ap-east-1, this property must be provided and the region must be enabled in Amorphic; otherwise ingestion fails. |
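
The region rule in the last table row can be expressed as a small pre-flight check. The following is a minimal sketch, not Amorphic's own validation: the function name `check_bucket_region` and its parameters are hypothetical, and only the four region names come from the table.

```python
# Regions named in the table as requiring an explicit S3 Bucket Region.
OPT_IN_REGIONS = {"eu-south-1", "af-south-1", "me-south-1", "ap-east-1"}


def check_bucket_region(source_region, provided_region=None, enabled_regions=()):
    """Return a list of problems that would cause ingestion to fail.

    Hypothetical helper: `provided_region` stands for the S3 Bucket Region
    attribute, `enabled_regions` for the regions enabled in Amorphic.
    """
    problems = []
    if source_region in OPT_IN_REGIONS:
        if provided_region != source_region:
            problems.append("S3 Bucket Region must be set to " + source_region)
        if source_region not in enabled_regions:
            problems.append(source_region + " must be enabled in Amorphic")
    return problems


if __name__ == "__main__":
    print(check_bucket_region("ap-east-1", provided_region="ap-east-1"))
```

An empty list means neither condition from the table is violated; any other region passes without further checks in this sketch.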
For Redshift use cases involving a substantial volume of incoming files, it is advisable for the user to enable data load throttling and configure a maximum limit of 90 for Redshift.
Additionally, the timeout for the ingestion process can be set during connection creation by adding the key IngestionTimeout to ConnectionDetails in the input payload. The value is expressed in minutes and must be between 1 and 2880. If no value is provided, the default of 480 (8 hours) is used. Please note that this feature is available exclusively via API.
{
  "ConnectionDetails": {
    "S3Bucket": "example-test-bucket",
    "S3ConnectionType": "bucket_policy",
    "S3BucketRegion": "us-east-1",
    "IngestionTimeout": 222
  }
}
This timeout can be overridden during schedule creation and schedule run by providing an argument MaxTimeOut.
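The timeout rules described here (range 1 to 2880 minutes, default 480, schedule-level override) can be sketched as follows. This is an illustration only: the key names `ConnectionDetails`, `IngestionTimeout`, and `MaxTimeOut` come from the text, while the helper functions and the assumption that `MaxTimeOut` is also given in minutes are hypothetical.

```python
import json

DEFAULT_TIMEOUT_MINUTES = 480  # default stated in the text (8 hours)
MIN_TIMEOUT_MINUTES = 1
MAX_TIMEOUT_MINUTES = 2880


def build_connection_details(bucket, connection_type, region, ingestion_timeout=None):
    """Build the ConnectionDetails part of the payload, validating the timeout."""
    details = {
        "S3Bucket": bucket,
        "S3ConnectionType": connection_type,
        "S3BucketRegion": region,
    }
    if ingestion_timeout is not None:
        valid = (
            isinstance(ingestion_timeout, int)
            and not isinstance(ingestion_timeout, bool)
            and MIN_TIMEOUT_MINUTES <= ingestion_timeout <= MAX_TIMEOUT_MINUTES
        )
        if not valid:
            raise ValueError("IngestionTimeout must be an integer between 1 and 2880")
        details["IngestionTimeout"] = ingestion_timeout
    return {"ConnectionDetails": details}


def effective_timeout(connection_details, schedule_arguments=None):
    """MaxTimeOut on the schedule wins, then IngestionTimeout, then the default."""
    schedule_arguments = schedule_arguments or {}
    if "MaxTimeOut" in schedule_arguments:
        return int(schedule_arguments["MaxTimeOut"])
    return connection_details.get("IngestionTimeout", DEFAULT_TIMEOUT_MINUTES)


if __name__ == "__main__":
    payload = build_connection_details(
        "example-test-bucket", "bucket_policy", "us-east-1", ingestion_timeout=222
    )
    print(json.dumps(payload, indent=2))
```

Validating the range on the client side surfaces an out-of-range value before the API call is made; the service remains the authority on what it accepts.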
Test Connection
This functionality allows users to quickly verify connectivity to the specified S3 bucket. Running the test confirms whether the S3 bucket details provided are accurate and whether the bucket is accessible.
Data migration to Amorphic
To migrate data to Amorphic, users must follow these steps:
- Create a new Dataset and select the Connection Type as S3. Choose the S3 connection from the drop-down list.
- Provide the FileType and DirectoryPath if required, so that only data from the specified path and of the specified file type is extracted.
- If necessary, create Partitions for the dataset (for S3Athena and LakeFormation only). Files are then ingested into the specified partition, which improves read and query performance.
After creating the dataset, the next step for the user is to set up a Schedule for data ingestion.
- Create a schedule with the Type specified as DataIngestion and select the previously created dataset from the list.
- If the user created any Partitions for the dataset, specify values for each partition.
This schedule is responsible for ingesting data from the source S3 bucket into the target dataset created earlier. Once the schedule has run successfully, the user can check the Amorphic dataset to verify that the files from the source S3 bucket have been ingested and are present in the dataset.
On the details page, the Estimated Cost of the Connection is also displayed, showing the approximate cost incurred since creation.
Upgrade S3 Connection
You can upgrade a connection if a new version is available. Upgrading a connection upgrades the underlying Glue version and the data ingestion script with new features.
Downgrade S3 Connection
You can downgrade a connection to a previous version if the upgrade does not meet your needs. A connection can only be downgraded if it has been upgraded. The option to downgrade is available in the top right corner if the connection is downgrade-compatible.
Connection Versions
1.6
In this version of the S3 connection, data ingestion compares the ETags of the files in the source and the target.
The first step involves checking the filename and confirming if the file exists and its size remains unchanged. Subsequently, we retrieve the ETags of both the source file and the current file in the dataset. If these ETags match, we proceed with the ingestion process. This approach is adopted because an ETag of a file remains consistent even if the filename changes. Therefore, if a user intends to duplicate files by altering their names, this method ensures that ingestion is not affected solely by the ETag.
In this version, only files stored in S3 Standard class are supported for S3 data ingestion. If there exist files from other storage classes, the ingestion process will fail.
1.7
In this version of the S3 connection, the storage class of a file does not affect the ingestion flow.
That is, the ingestion process is not terminated even if files from S3 Glacier storage classes are present. Those files are skipped, their details are reported, and ingestion of all other files completes successfully.
Files stored in the S3 Glacier and S3 Glacier Deep Archive classes are skipped during ingestion.
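The skip behaviour can be pictured as a filter over the object listing. The sketch below is an assumption-laden illustration rather than the actual ingestion script: the function name is hypothetical, the input dictionaries mimic the `Key` and `StorageClass` fields of an S3 object listing, and mapping the two skipped classes to the values `GLACIER` and `DEEP_ARCHIVE` is an assumption.

```python
# Assumed StorageClass values for S3 Glacier and S3 Glacier Deep Archive.
SKIPPED_STORAGE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def split_by_storage_class(objects):
    """Separate object keys into those to ingest and those to skip."""
    to_ingest, skipped = [], []
    for obj in objects:
        # Treat a missing StorageClass as STANDARD in this sketch.
        storage_class = obj.get("StorageClass", "STANDARD")
        if storage_class in SKIPPED_STORAGE_CLASSES:
            skipped.append(obj["Key"])
        else:
            to_ingest.append(obj["Key"])
    return to_ingest, skipped


if __name__ == "__main__":
    listing = [
        {"Key": "data/a.csv", "StorageClass": "STANDARD"},
        {"Key": "data/b.csv", "StorageClass": "GLACIER"},
        {"Key": "data/c.csv", "StorageClass": "DEEP_ARCHIVE"},
    ]
    ingest, skip = split_by_storage_class(listing)
    print("ingest:", ingest)
    print("skipped:", skip)
```

Returning the skipped keys alongside the ingested ones matches the described behaviour of reporting skipped files instead of failing the run.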
1.8
In this version of the S3 connection, support for the Skip LZ feature was added.
This feature enables users to upload data directly to the data lake zone, skipping data validation. Please refer to the Skip LZ documentation for more details.
1.9
In this version of the S3 connection, multithreading was incorporated to achieve faster data ingestion. It can be configured by providing the argument 'FileConcurrency' during schedule execution for the ingestion. The argument accepts values from 1 to 100. If 'FileConcurrency' is not provided, it defaults to 20.
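To illustrate how a concurrency setting of this kind typically drives a thread pool, here is a minimal sketch. Only the argument name 'FileConcurrency', its range, and its default come from the text; the helper names and the `copy_one` callback are hypothetical and do not reflect the actual ingestion script.

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_FILE_CONCURRENCY = 20  # default stated in the text


def resolve_file_concurrency(arguments):
    """Read FileConcurrency from schedule arguments, enforcing the 1-100 range."""
    value = int(arguments.get("FileConcurrency", DEFAULT_FILE_CONCURRENCY))
    if not 1 <= value <= 100:
        raise ValueError("FileConcurrency must be between 1 and 100")
    return value


def ingest_files(keys, copy_one, arguments=None):
    """Run copy_one over all keys using a pool sized by FileConcurrency."""
    workers = resolve_file_concurrency(arguments or {})
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input keys.
        return list(pool.map(copy_one, keys))


if __name__ == "__main__":
    results = ingest_files(
        ["a.csv", "b.csv", "c.csv"],
        copy_one=lambda key: "copied " + key,
        arguments={"FileConcurrency": 2},
    )
    print(results)
```

Threads suit this workload because copying files is I/O-bound, so a higher value mainly helps when there are many small files.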
2.0
The update in this version is specifically to ensure FIPS compliance, with no changes made to the script.
2.1
In this S3 connection update, we've introduced Dynamic Partitioning support.
For partitioned datasets, users can now enter wildcard (*) patterns during schedule creation to dynamically generate partitions upon S3 ingestion. Each partition captures values from source S3 objects based on user-defined patterns.
This feature is exclusive to Append-type datasets. For other dataset types there is no change: ingestion creates a single partition with the value specified at schedule creation, following the usual flow.
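The idea of capturing partition values from object keys with a wildcard can be sketched as below. The exact pattern syntax Amorphic accepts is not described here, so this example assumes a hypothetical form in which each `*` stands for one path segment of the object key; the function name and the sample partition names are likewise illustrative.

```python
import re


def extract_partitions(key, pattern, partition_names):
    """Capture one value per '*' in the pattern from an object key.

    Returns a dict mapping partition name to captured value, or None when
    the key does not match the pattern.
    """
    # Escape the literal parts and let each '*' match a single path segment.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    regex = "([^/]+)".join(literal_parts)
    match = re.match(regex, key)
    if match is None:
        return None
    return dict(zip(partition_names, match.groups()))


if __name__ == "__main__":
    print(extract_partitions("sales/us/2024/orders.csv", "sales/*/*/", ["region", "year"]))
```

Under this assumption, every distinct combination of captured values found in the source bucket would correspond to one dynamically created partition.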
2.2
In this S3 connection update, we've introduced additional configuration options for Dynamic Partitioning. These enhancements provide users with more control over which partitions are ingested, including the ability to rename partitions.
Configuration Explanation:
- Include: Specifies which partitions should be included for ingestion.
- Exclude: Defines which partitions should be excluded from ingestion into Amorphic.
- Rename: Allows users to rename partitions based on specified criteria.
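One possible reading of these three options is shown in the sketch below. The real configuration format and precedence rules are not given in this section, so the function, its parameters, and the choice to apply Include first, then Exclude, then Rename are all assumptions made for illustration.

```python
def select_partitions(values, include=None, exclude=(), rename=None):
    """Filter and rename discovered partition values.

    include: if given, only these values are kept (assumed semantics).
    exclude: values dropped even if included (assumed semantics).
    rename:  mapping from source value to target partition name.
    """
    rename = rename or {}
    selected = []
    for value in values:
        if include is not None and value not in include:
            continue
        if value in exclude:
            continue
        selected.append(rename.get(value, value))
    return selected


if __name__ == "__main__":
    discovered = ["us", "eu", "apac"]
    print(select_partitions(discovered, exclude=["apac"], rename={"us": "united_states"}))
```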
2.3
This version does not introduce any new features but focuses on optimizing performance and enhancing error handling capabilities.
2.4
This version also focuses on optimizing performance and enhancing error handling capabilities.