Version: v2.1

Intro

The Amorphic Dataset portal allows you to create unstructured, semi-structured, and structured Datasets, while also providing comprehensive data lake visibility. These Datasets can act as a unified source of truth for different departments within an organization.

Note

Datasets with the same name can now be created throughout the application; that is, more than one dataset can have the same name, but each name must be unique within its domain. The domain name is now visible alongside the dataset name throughout the Amorphic application.

The Amorphic Dataset page lets you search through the Dataset metadata using a Google-like search index. It includes options to list existing Datasets or create a new one. You can filter Datasets using the Domain filters, create new Datasets, or view Dataset details.

How to create a New Dataset?

To create a new dataset, click + New Dataset and fill in the required information, such as the Domain, Connection Type, and File Type described below.

https://media-hub.amorphicdata.io/docs/v2/assets/Catalog_datasets.gif

To create a Dataset in Amorphic, you must first create a Domain using Amorphic Administration. Then you can create a Dataset and upload structured, semi-structured, or unstructured files to it.

  • Dataset Name: 3-70 alphanumeric/underscore characters; must be unique per Domain.

  • Description: Information describing the dataset; text searchable.

  • Domain Name: Groups related datasets; used as 'database' in Glue/Athena, schema in Redshift.

  • Classifications: Categories to protect data (e.g. PCI, PII).

  • Keywords: Meaningful words to index/search within app; helps find related datasets.

  • Connection Type: Amorphic currently supports the connection types below.

    • API: Default connection for manual file upload. Refer to Dataset files docs for more info.
    • JDBC: Ingest data from JDBC connection as source to dataset. Scheduled ingestion.
    • S3: Ingest data from S3 connection. Scheduled ingestion.
    • Ext API: Ingest data from external API. Scheduled ingestion.
  • File Type: The file type should be compatible with the formats supported by ML models. Use auto ML to extract metadata from unstructured datasets; it is integrated with the AWS Transcribe and Comprehend services.

  • Target Location: Amorphic currently supports the target locations below.

    • S3: Files uploaded to the dataset are stored in S3.

    • Redshift: Files are stored in the Redshift data warehouse. The database selected during deployment determines which database is displayed.

    • S3-Athena: Structured data stored in Athena. Refer to Athena Datasets for more info.

    • Lakeformation: Lakeformation datasets extend S3-Athena datasets, providing access control on dataset columns. For more info, see Lakeformation Datasets.

      Note

      If the target location is a data warehouse (Redshift/S3-Athena), the user should upload a file for schema inference and then publish the schema.

  • Update Method: Amorphic currently supports the three update methods below.

    • Append : This will append data to the existing data.

    • Latest Record : This update method allows you to query the latest data using the Record Keys and Latest Record Indicator defined during the schema extraction process.

    • Reload : This update method reloads data to the dataset. There are two exclusive options for a Reload type dataset:

      • Target Table Prep Mode

        • Recreate : Dataset will be dropped and recreated when this option is selected.
        • Truncate : Only the data in the dataset is deleted; the table itself is not dropped.
      • Skip Trash (Optional) : When Skip Trash is set to True, old data is not moved to the Trash bucket during the data reload process. The default is true when not provided.

        Based on the above reload settings, data reload process times can vary.

  • Skip LZ (Validation) Process: This setting applies to dataset file uploads. If Skip LZ is set to True, the entire validation (LZ) process is skipped and the file is uploaded directly to the DLZ bucket. This avoids unnecessary S3 copies and validations, and automatically disables MalwareDetection and IsDataValidationEnabled (for S3Athena and Lake Formation datasets). It is applicable only to append and update types of datasets. If it is set to False, the LZ process with validations is followed.

Note

As of Amorphic 1.14, this is applicable to the dataset file upload process through Amorphic UI (manual file upload), ETL (file write) process streams, and Appflow. It is not applicable to other file upload scenarios like ingestion and Bulkloadv2. The SkipLZ feature will be implemented for these other scenarios in upcoming releases.

  • Enable Malware Detection: Whether to enable or disable the malware detection on the dataset.
  • Unscannable Files Action: When malware is found in a file uploaded to the dataset, this setting determines whether the file is quarantined or passed through.
  • Enable Data Profiling: This is only applicable for datasets targeted to S3Athena or DataWarehouse (Redshift).
  • Enable AI Services: Only applicable for datasets targeted to S3.
  • Enable Data Cleanup: Users can enable or disable automatic dataset cleanup. Data is deleted based on the specified clean-up duration to save storage costs. All files past the expiration date (clean-up duration) are removed permanently and cannot be recovered. Expiration is determined by the upload date, not the time of upload. For example, if a file is uploaded on August 21, 2020 at 4:55 PM and the clean-up duration is 2 days, the clean-up happens on August 23, 2020 at 12:00 AM, not at 4:55 PM.
  • Enable AI Results Translation: Users can enable or disable translation of AI results to English for a dataset. Amorphic runs ML algorithms to analyze PDF/TXT files uploaded to datasets, and this option controls whether the AI results are automatically translated to English so that they can be read and searched in the Amorphic interfaces.

For example, Amorphic can identify the language, run ML algorithms, and analyze sentiment in an Arabic PDF document. Setting the flag to “No” preserves the native language so the user can query the AI results in the original language. AI results default to English unless the user specifies another language.

  • Enable Life Cycle Policy: Only applicable for datasets targeted to S3, S3-Athena & LakeFormation.

  • Enable Data Metrics Collection (API Only): Currently, this feature is available only through the API and applies to datasets regardless of their TargetLocation. If the flag is enabled, the dataset's metrics are collected every day. This collected data helps display the necessary information on the Insights page. Metrics automatically expire one year after they are collected.

    ### API Request Payload Details

    - To create a dataset along with enabling/disabling dataset metrics collection

    **/datasets** & **POST** method

    ```
    {
        (Include all attributes required for creating the dataset)
        (Include the attributes specified in the second point below)
    }
    ```

    - To update dataset metrics collection flag

    **/datasets/{id}/updatemetadata** & **PUT** method

    ```
    {
        "DataMetricsCollectionOptions": {
            "IsMetricsCollectionEnabled": <Boolean> (true|false)
        }
    }
    ```

    - To fetch dataset metrics for up to a given duration in days

    **/metrics/datasets/{id}?duration={days}** & **GET** method

    `Request Body Not Required`

    - Sample Metrics

    ```
    {
        "FileCount": 1,
        "TotalFileSize": 5391,
        "AvgFileSize": 5391,
        "ProcessingFiles": 0,
        "TotalFiles": 2,
        "CorruptedFiles": 0,
        "FailedFiles": 1,
        "CompletedFiles": 1,
        "DeletedFiles": 0,
        "InconsistentFiles": 0,  # Negative value indicates that S3 has more files than DynamoDB metadata and vice versa.
        "PendingFiles": 0,
        "DependentViews": 2,
        "AuthorizedGroups": 1,
        "AuthorizedUsers": 2,
        "DuplicateRows": 0,
        "Columns": 7,
        "DatasetSize": "10 MB",
        "Rows": 100,
        "PercentageOfDuplicateRows": 0.0
    }
    ```

    `Note: Only the dependent views count is listed here; the metrics can also include other dependent resources.`
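
    The endpoints above can be exercised with any HTTP client. Below is a minimal Python sketch using the `requests` library; the base URL, the authentication header, the response field name `DatasetId`, and the placeholder dataset attributes are assumptions for illustration only and should be checked against your deployment's Amorphic API reference.

    ```
    import requests

    # Hypothetical values: the base URL and auth scheme differ per deployment.
    BASE_URL = "https://<your-amorphic-api>"
    HEADERS = {
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
    }

    # 1. Create a dataset with metrics collection enabled. All attributes other
    #    than DataMetricsCollectionOptions are placeholders for the required
    #    dataset attributes (name, domain, connection type, target location, ...).
    create_payload = {
        # ... include all attributes required for creating the dataset ...
        "DataMetricsCollectionOptions": {"IsMetricsCollectionEnabled": True},
    }
    resp = requests.post(f"{BASE_URL}/datasets", json=create_payload, headers=HEADERS)
    resp.raise_for_status()
    dataset_id = resp.json().get("DatasetId")  # response field name is an assumption

    # 2. Update the metrics collection flag on an existing dataset.
    update_payload = {"DataMetricsCollectionOptions": {"IsMetricsCollectionEnabled": False}}
    requests.put(
        f"{BASE_URL}/datasets/{dataset_id}/updatemetadata",
        json=update_payload,
        headers=HEADERS,
    ).raise_for_status()

    # 3. Fetch metrics collected over the last 30 days (no request body required).
    metrics = requests.get(
        f"{BASE_URL}/metrics/datasets/{dataset_id}?duration=30", headers=HEADERS
    ).json()
    print(metrics.get("FileCount"), metrics.get("DatasetSize"))
    ```
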
Note

Amorphic currently supports character recognition in two languages: English and Arabic. Support for additional languages will be added over time.
Note

The Notification settings have been moved to the Settings page. Users can use this page to choose the type of notification settings for their Datasets. Please refer to the Notification settings documentation for more information.

For a multi-tenancy deployment, data uploaded to the Dataset is stored in its respective tenant database. For example, if a User creates a Dataset under the domain "testorg_finance", all the data uploaded will be stored in the tenant database "testorg". Users can connect to "testorg" using any third-party connector to view their tables.

Schema

https://media-hub.amorphicdata.io/docs/v2/assets/Catalog_schema_register.png

The Amorphic Dataset registration feature helps users to extract and register the schema of uploaded files when they choose a data warehouse (Redshift) as the target location.

Schema Definition

https://media-hub.amorphicdata.io/docs/v2/assets/Catalog_define_schema.png

When a user registers a dataset with a data warehouse as the target location, they are directed to the "Define Schema" page. Here, they can upload a schema file, upload a JSON file containing the schema definition, or manually enter the schema fields.
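
For illustration, a schema definition supplied as a JSON file usually amounts to a list of column names and data types. The sketch below writes such a file with Python; the key names ("name", "type") and the overall layout are assumptions made for this example, so check the Define Schema page or the Amorphic API reference for the exact format expected.

```
# A minimal sketch of writing a schema definition out as JSON for upload.
# The key names ("name", "type") and the file layout are assumptions for this
# example only; they may differ from the format Amorphic expects.
import json

schema = [
    {"name": "customer_id", "type": "INTEGER"},
    {"name": "customer_name", "type": "VARCHAR"},
    {"name": "signup_date", "type": "DATE"},
    {"name": "is_active", "type": "BOOLEAN"},
]

with open("schema_definition.json", "w") as fh:
    json.dump(schema, fh, indent=2)
```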

Schema Publishing

https://media-hub.amorphicdata.io/docs/v2/assets/Catalog_publish_schema.png

When a user registers a dataset with a data warehouse (Redshift) as the target location and selects "My data files have headers" as "Yes", they are directed to the "Publish Schema" page, where an inferred schema will be displayed.

On the Schema Extraction page, the user can add new fields, remove fields, or edit the "Column Name" and "Column Type". The user can also change the data type of all columns to varchar by using the 'Set all columns to varchar type' button.

Below are the data types supported for Redshift targeted dataset:

INTEGER, SMALLINT, BIGINT, REAL, DOUBLE PRECISION, DECIMAL (Precision: 38), VARCHAR (Precision: 65535), CHAR (Precision: 4096), DATE, TIMESTAMP, TIMESTAMPTZ, TIME, TIMETZ, BOOLEAN

Note

Column names must be 1-64 alphanumeric characters (a common limit); underscores (_) are allowed, and names must start with an alphabetic character.

Redshift: The maximum length for a column name is 127 bytes; longer names are truncated. You can use UTF-8 multibyte characters up to a maximum of four bytes. A single table can contain up to 1600 columns.

Athena: The maximum length for the column name is 255 characters.
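
The naming rule above can be expressed as a simple check. The sketch below encodes the 1-64 character, alphanumeric-plus-underscore, letter-first constraint as a regular expression; it is only an illustration of the stated rule, not an Amorphic utility.

```
import re

# Column names: 1-64 characters, alphanumeric or underscore, starting with a letter.
COLUMN_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]{0,63}$")

def is_valid_column_name(name: str) -> bool:
    """Return True if the name satisfies the common 1-64 character rule."""
    return bool(COLUMN_NAME_RE.match(name))

print(is_valid_column_name("order_total"))  # True
print(is_valid_column_name("1st_column"))   # False: must start with a letter
```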

You can also custom partition the data. Please check the documentation on Dataset Custom Partitioning.

View Dataset

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details.gif

Upon clicking on View Details under a dataset, the user will be able to see all the following details of the dataset:

  1. Details
  2. Profile
  3. Files
  4. Resources
  5. Authorized Users
  6. Authorized Groups

Details

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details.png

The Details tab contains dataset information such as Dataset Name, Dataset S3 Location, Target Location, Domain, Connection Type, and File Type, along with the Dataset Details entered by the user when the dataset was created.

  • Dataset Status: The status of the dataset will be 'active' if it is in a normal state. If the dataset is undergoing a data reload process, the status will be 'reloading'. If the user initiated a truncate dataset, the status will be 'truncating'.
  • Dataset S3 Location: The AWS S3 folder location of the dataset.
  • Dataset AI/ML Results: Advanced analytics summary of the dataset.
  • Created By: The user name of the user who created the dataset, and the creation date-time in YYYY-MM-DD format.
  • Last Modified By: The user name of the user who last modified the metadata of the dataset, and the date-time in YYYY-MM-DD format when the metadata was last modified.

Profile

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details_profile.png

The Profile tab contains the JDBC connection, Host, and ODBC connection information. This is useful for establishing a connection between different data sources and the Amorphic platform.

Note: When a user connects to the JDBC connection through a data source or a BI tool, all the tables (only the schema, not the actual data) in the specific database will be displayed along with the user-owned tables (datasets).

You can view the schema of the dataset if it is registered with the target location as a Data Warehouse (Redshift). If data profiling is enabled for the dataset, then data profiling details will also be displayed in the Profile tab. Please check the documentation on Data Profiling for more information.
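
For example, the host, port, and database shown in the Profile tab can be used from any SQL client. The sketch below connects to a Redshift-backed dataset with `psycopg2`; all connection values are placeholders to be copied from the Profile tab, and the choice of library is an assumption (any JDBC/ODBC client works just as well).

```
import psycopg2  # any Postgres-compatible driver can connect to Redshift

# Placeholder values: copy the actual host, port, database, and credentials
# from the dataset's Profile tab and your Amorphic user profile.
conn = psycopg2.connect(
    host="<redshift-host-from-profile-tab>",
    port=5439,
    dbname="<database-from-profile-tab>",
    user="<amorphic-username>",
    password="<password>",
)

with conn.cursor() as cur:
    # The dataset is exposed as a table under its domain schema: <domain>.<dataset_name>
    cur.execute('SELECT COUNT(*) FROM "<domain>"."<dataset_name>"')
    print(cur.fetchone()[0])

conn.close()
```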

Files

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details_file.png

You can manually upload files to the dataset only if the connection type is API (default). In the Files tab, you can upload files to the dataset, delete files, download them, apply ML, and view AI/ML results. For more information, please refer to the Dataset Files documentation.

Invocations & Results

When a machine learning model is applied to a dataset, whether structured or unstructured, the output is referred to as an invocation and result. Users can download the logs for each invocation. If no ML model is applied to the file, clicking the "Invocations and Results" button will display a "No invocations" message.

Apply ML

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_ML.gif

You can apply Machine Learning models to files in the dataset. After selecting the "Apply ML" option, the application will prompt for the "Type" (e.g. ML Model or Entity Recognizer).

If the selected type is ML Model, the user needs to provide the following details:

  • ML Model: Dropdown listing the ML models the user has access to.
  • Instance Type: Dropdown listing the machine types the model can be run on.
  • Target Dataset: Dropdown listing the Amorphic datasets the output can be written to.
  • Entity Recognizer (if selected): Dropdown listing the Entity Recognizers the user has access to.

View AI/ML Results

The "View AI/ML Results" option in the File Operation shows you the AI/ML results applied to ML models and Entity Recognizers.

View More Details

When files are added to the dataset via an S3 connection, View More Details provides the location of the file on the source, for example, the source object key.

Note

The source metadata of the file can only be seen for data loaded via an S3 connection of version 1.0 or higher (e.g. new or upgraded connections). Due to S3 limitations, this metadata will not be available for files larger than 5 GB.

Note

The File Search option will be disabled if Search Datasets is disabled by administrators. Please contact the administrators to re-enable it.

For non-analytic file type datasets, you can search file metadata and add or remove tags on the files. The File Search tab contains a search bar to facilitate this process. The search results display the matched files along with the following options:

  • Tags: User can add/remove tags to the files.
  • File Preview: User can preview the file. Supported file formats are txt, json, docx, pdf, mp3, wav, fla, m4a, mp4, webm, flv, ogg, mov, html, htm, jpg, jpeg, bmp, gif, ico, svg, png, csv, tsv.
  • Complete Analytics: User can view the complete analytics of the file.
  • File Download: User can download the file.

Resources

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details_resources.png

View all the resource dependencies of the dataset, such as Jobs, Schedules, Notebooks, and Views.

Note

You must detach or delete all the dependent resources in order to delete the dataset.

Authorized Users

https://media-hub.amorphicdata.io/docs/v2/assets/Dataset_details_authorization.png

View the list of users authorized to perform operations on the dataset. The owner, the user who created or has owner access to the dataset, can provide dataset access to any other user in the system.

There are two types of access:

  • Owner: This user has permissions to edit the dataset and provide access to other users for the dataset.
  • Read-only: This user has limited permissions on the dataset, such as viewing the details and downloading files.

Authorized Groups

This tab shows the list of groups authorized to perform operations on resources such as Datasets, Dashboards, Models, and Schedules. A group is a list of users given access to a resource, in this case a dataset. Groups are created by going to User Profile -> Profile & Settings -> Groups.

There are two access types:

  • Owner: This group of users has permissions to edit the resources and provide access to other users/groups for the resources.
  • Read-only: This group has limited permissions on the resources, such as viewing the details.
Note

A user can have thousands of datasets, but the list is limited to 5000 per user in reverse chronological order to prevent performance issues. To view all datasets, users can use the "Search Dataset" option under the Dataset tab.

Edit Dataset

You can edit datasets in Amorphic. All editable fields relevant to the dataset type can be modified.

Note

For Redshift targeted datasets, Distribution style can be updated by editing the dataset.

Clone Dataset

You can clone a Dataset in Amorphic. The Clone Dataset page auto-populates with the metadata from the existing dataset, so you only need to change the Dataset Name. You can edit any field before clicking Register, and a new dataset will appear in the Datasets page.

Delete Dataset

You can delete a dataset instantly and remove all associated metadata.

Note

When deleting a dataset, the process runs asynchronously in the background. If new data is uploaded during this time, there is a risk of data loss. To avoid this, please ensure no files are visible in the Files tab of the dataset details before initiating a new data load. Parallel delete cannot be triggered while a dataset deletion is in progress; Amorphic will throw an error if attempted.

For bulk deletion of datasets, please check the documentation on How to bulk delete datasets in Amorphic.

Repair Dataset(s)

Datasets can be repaired in two ways:

  • Individually, using the "Dataset Repair" button in the top right corner of the dataset details page.
  • Globally, using the "Global Dataset Repair" button in the top right corner of the dataset listing page.

For more details, please refer to the How to Repair Dataset(s) in Amorphic documentation.