Datasets | Amorphic Docs

📄️ Intro

The Amorphic Dataset portal allows you to create unstructured, semi-structured, and structured Datasets, while also providing comprehensive data lake visibility. These Datasets can act as a unified source of truth for different departments within an organization.

📄️ Data Profiling

From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.

📄️ Files

Files helps you manage the files that contain the actual data for datasets. You can view the files associated with a dataset and their status. Data from uploaded files is copied to tables in the target location.

📄️ Dataset Lifecycle

The Dataset Lifecycle policy is a feature that helps manage objects in Amazon S3 to keep costs low throughout their lifecycle. It allows users to control the transition and expiration of objects in the dataset using a set of rules that define actions for Amazon S3 to take.
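As a rough illustration of what a transition-and-expiration rule looks like, the sketch below builds a rule in the shape used by the S3 lifecycle configuration API. The helper `build_lifecycle_rule`, the prefix, and the day counts are hypothetical values chosen for the example; how Amorphic exposes these settings in its own UI or API is not shown here.

```python
import json


def build_lifecycle_rule(rule_id, prefix, transition_days, storage_class, expiration_days):
    """Return one rule in the shape S3 lifecycle configuration expects.

    Hypothetical helper for illustration only; not part of Amorphic.
    """
    if expiration_days <= transition_days:
        raise ValueError("expiration must come after the transition")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # objects the rule applies to
        "Status": "Enabled",
        # Move objects to a cheaper storage class after N days.
        "Transitions": [{"Days": transition_days, "StorageClass": storage_class}],
        # Delete objects once they reach this age.
        "Expiration": {"Days": expiration_days},
    }


# Assumed example values: transition after 30 days, expire after 365 days.
lifecycle_configuration = {
    "Rules": [
        build_lifecycle_rule("archive-then-expire", "sales/orders/", 30, "STANDARD_IA", 365)
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A dictionary of this shape is what an S3 client would send as the bucket's lifecycle configuration; the sketch only constructs and prints it, so it runs without AWS credentials.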

📄️ Athena Datasets

Athena Datasets allow users to store structured data files in Amazon S3 and run SQL queries on the data through Amazon Athena. Data validation can be enabled to check for corrupt data, and the query engine can be used to run queries on the dataset.

📄️ DynamoDB Datasets

Amorphic DynamoDB Datasets enable structured datasets, serving as a single source of truth across organizational departments.

📄️ External Datasets

External datasets on Amorphic let users consume existing data in their S3 buckets directly, without having to ingest it into a new Amorphic dataset.

📄️ Iceberg Datasets

In Amorphic, users can create Iceberg datasets with S3Athena as the target location, which creates an Iceberg table in the backend to store the data.

📄️ Lakeformation Datasets

Lakeformation datasets extend S3-Athena datasets with added security and support CSV, TSV, XLSX, JSON, and Parquet files. They also check data integrity and offer ACID transactions, data compaction, and time-travel queries.

📄️ Redshift Datasets

Amorphic lets users store CSV, TSV, and XLSX files in S3 with Redshift datasets as the target location. Partial data validation is optional and enabled by default. This validation helps detect and correct corrupt or invalid data files, and supports data types such as strings/varchar, integers, double, boolean, date, timestamp, and complex data structures. However, the CSV parser recommended by Amazon Athena has limitations: it does not support embedded line breaks, or empty fields in columns defined as a numeric data type. In these cases, import the data as string columns and then cast it to the required data type through views.
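The string-then-cast workaround can be sketched as follows. This is a minimal illustration that uses Python's built-in SQLite as a stand-in for the warehouse, so the exact SQL dialect differs from Redshift or Athena; the table `orders_raw`, the view `orders_typed`, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

# Sample CSV with empty fields in columns that should be numeric.
RAW_CSV = "order_id,quantity,price\n1,3,9.50\n2,,4.25\n3,7,\n"

conn = sqlite3.connect(":memory:")

# Step 1: land every column as a string so empty fields do not break the load.
conn.execute("CREATE TABLE orders_raw (order_id TEXT, quantity TEXT, price TEXT)")
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
conn.executemany(
    "INSERT INTO orders_raw VALUES (:order_id, :quantity, :price)", rows
)

# Step 2: expose typed columns through a view; empty strings become NULL.
conn.execute(
    """
    CREATE VIEW orders_typed AS
    SELECT CAST(order_id AS INTEGER)             AS order_id,
           CAST(NULLIF(quantity, '') AS INTEGER) AS quantity,
           CAST(NULLIF(price, '') AS REAL)       AS price
    FROM orders_raw
    """
)

typed_rows = conn.execute(
    "SELECT order_id, quantity, price FROM orders_typed ORDER BY order_id"
).fetchall()
print(typed_rows)
```

Keeping the raw table as strings means the load never rejects a row because of an empty numeric field, and consumers query the view to get typed values with missing entries surfaced as NULL.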