Data Quality Checks
From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.
Amorphic provides data quality checks that help you detect errors in your data before it is utilized by other systems or machine learning algorithms. You can create rules for the columns of your structured datasets, and then run the checks to see if there are any rules that are broken. If one rule is broken, the whole check will fail. In the Amorphic data quality checks page, you can view a list of checks, create new checks, and sort through the list of checks using various criteria such as name, creator, and creation time.
How to Create Data Quality Check?
To create a new data quality check:
- Click on Create Data Quality Check. To create a data quality check, you need access to at least one structured dataset.
- Fill in the following fields:
Property | Description |
---|---|
Data Quality Check Name | The name must be 3-120 characters long and contain only alphanumeric characters and underscores (_). It must be unique across the application. |
Description | Description of the data quality check being created. |
Domain | Logical grouping of datasets. This will shortlist datasets from a particular domain. |
Dataset Name | Structured dataset on which data quality check is to be performed. |
Auto-Constraint Suggestions Enabled | Enables or disables auto-constraint suggestions. Finding suitable constraints can be challenging for large and complex datasets that contain information from multiple sources. Enabling this option helps users find suitable constraints for their data. |
Keywords | Comma-separated keywords used to index and search within the application. Use keywords to flag related datasets so they are easier to find later. |
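The naming rule in the table above can be checked before a name is submitted. The following is a minimal sketch, assuming the rule means "3 to 120 characters, each an ASCII letter, digit, or underscore". The function name `is_valid_check_name` is hypothetical and not part of Amorphic. Uniqueness across the application cannot be verified locally and is not covered here.

```python
import re

# Assumed reading of the rule: 3 to 120 characters,
# each one an ASCII letter, a digit, or an underscore.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,120}")


def is_valid_check_name(name: str) -> bool:
    """Return True if the name satisfies the assumed format rule."""
    return NAME_PATTERN.fullmatch(name) is not None


# Example usage with hypothetical names.
print(is_valid_check_name("customer_email_check"))  # valid format
print(is_valid_check_name("ab"))                    # too short
print(is_valid_check_name("has space"))             # disallowed character
```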
Edit Data Quality Check
You can modify a data quality check's metadata, and add or remove its constraints, using the "Edit Data Quality Check" button.
Execute a Data Quality Check
You can execute data quality checks either on demand or on a schedule. Once a data quality check completes, you will receive an email and a push notification with the execution results.
Stop Data Quality Check execution
A running data quality check execution can be stopped by using the 'Stop Execution' option under the more options icon.
View Data Quality Check executions
You can view the results of a particular execution. The report displays the number of constraints that succeeded and the number that failed.
To view auto constraint suggestions, click on View Auto Suggestions during a data quality check execution.
Clone Data Quality Checks
When you clone a data quality check in Amorphic, the clone page is auto-populated with the original check's metadata. You only need to give the clone a unique name.
Constraint Definitions
Name of the constraint | Definition of the constraint |
---|---|
hasMax | Creates a constraint that asserts on the maximum value of a column. The column must be of a long, int, or float datatype. |
hasMin | Creates a constraint that asserts on the minimum value of a column. The column must be of a long, int, or float datatype. |
hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column. |
hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column. |
hasMean | Creates a constraint that asserts on the mean of the column. |
hasSum | Creates a constraint that asserts on the sum of the column. |
hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column. |
hasApproxCountDistinct | Creates a constraint that asserts on the approximate count distinct of the given column. |
isComplete | Creates a constraint that asserts on a column completion. |
isUnique | Creates a constraint that asserts on a column uniqueness. |
containsCreditCardNumber | Check to run against the compliance of a column against a Credit Card pattern. |
containsEmail | Check to run against the compliance of a column against an e-mail pattern. |
containsURL | Check to run against the compliance of a column against a URL pattern. |
isPositive | Creates a constraint which asserts that every value in a column is greater than 0. |
containsSocialSecurityNumber | Check to run against the compliance of a column against the Social security number pattern for the US. |
isNonNegative | Creates a constraint which asserts that a column contains no negative values. |
hasCompleteness | Creates a constraint that asserts column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
hasEntropy | Creates a constraint that asserts on a column entropy. Entropy is a measure of the level of information contained in a message. |
hasMutualInformation | Creates a constraint that asserts on a mutual information between two columns. Mutual Information describes how much information about one column can be inferred from another. |
hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns. |
isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB. |
isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB. |
isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB. |
isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB. |
hasUniqueness | Creates a constraint that asserts on the uniqueness of a single column or a combined set of key columns. Uniqueness is the fraction of values in the column(s) that occur exactly once. |
hasDistinctness | Creates a constraint on the distinctness in a single or combined set of key columns. Distinctness is the fraction of distinct values of a column(s). |
hasUniqueValueRatio | Creates a constraint on the unique value ratio in a single or combined set of key columns. |
haveCompleteness | Creates a constraint that asserts column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns. |
areComplete | Creates a constraint that asserts completion in combined set of columns. |
areAnyComplete | Creates a constraint that asserts any completion in the combined set of columns. |
isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values. |
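The ratio-based constraints in the table are easy to confuse, so a small worked example may help. The following is a minimal sketch in plain Python, not the Amorphic implementation. It follows the definitions given in the table for completeness, uniqueness, and distinctness. The table does not define the unique value ratio, so the formula used here (values occurring exactly once divided by distinct values) is an assumption. All function names are hypothetical, and each function expects a non-empty list of column values.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def completeness(values: Sequence[Optional[Hashable]]) -> float:
    """Fraction of rows whose value is not None (null)."""
    return sum(v is not None for v in values) / len(values)


def uniqueness(values: Sequence[Hashable]) -> float:
    """Fraction of rows whose value occurs exactly once in the column."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def distinctness(values: Sequence[Hashable]) -> float:
    """Number of distinct values divided by the number of rows."""
    return len(Counter(values)) / len(values)


def unique_value_ratio(values: Sequence[Hashable]) -> float:
    """Assumed definition: values occurring once / distinct values."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Hypothetical column: "a" appears twice, "b" and "c" once each.
column = ["a", "a", "b", "c"]
print(uniqueness(column))          # 2 of 4 rows hold a one-off value
print(distinctness(column))        # 3 distinct values over 4 rows
print(unique_value_ratio(column))  # 2 one-off values over 3 distinct
print(completeness(["x", None, "y", "z"]))  # 3 of 4 rows are non-null
```

Under these definitions, a constraint such as isUnique would plausibly correspond to a uniqueness of 1.0 and isComplete to a completeness of 1.0, although the table does not state the thresholds explicitly.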
Data Quality Check Use Case
A retail company has a large database of customer information, including name, address, email, and purchase history. Before running any data analysis or machine learning algorithms on this data, the company wants to ensure the quality of the data by checking for errors and inconsistencies.
To do this, the company sets up a data quality check in Amorphic, with constraints such as:
- The email column must contain a valid email address format.
- The address column must contain a valid postal code.
- The purchase history column must contain only positive numbers.
The company runs the data quality check, which reads the entire database and evaluates these constraints for each record. If any constraint fails, the data quality check execution is considered a failure, and the report details which constraint failed and for which record.
The company can then use this information to correct the errors in the database and ensure that the data is of high quality before running any further data analysis or machine learning algorithms on it.
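The use case above can be sketched in plain Python to show how per-record constraints roll up into a single pass or fail result. This is an illustrative sketch only and does not use any Amorphic API. The record fields, the constraint names, the simplified email pattern, and the US ZIP code pattern used for "valid postal code" are all assumptions made for the example.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
POSTAL_RE = re.compile(r"\b\d{5}(-\d{4})?\b")  # assumes US ZIP format


def valid_email(record: Record) -> bool:
    email = record.get("email")
    return isinstance(email, str) and EMAIL_RE.fullmatch(email) is not None


def has_postal_code(record: Record) -> bool:
    address = record.get("address")
    return isinstance(address, str) and POSTAL_RE.search(address) is not None


def positive_purchases(record: Record) -> bool:
    purchases = record.get("purchase_history") or []
    return all(isinstance(p, (int, float)) and p > 0 for p in purchases)


# Hypothetical constraint names paired with their predicates.
CONSTRAINTS: List[Tuple[str, Callable[[Record], bool]]] = [
    ("email_format", valid_email),
    ("postal_code", has_postal_code),
    ("positive_purchases", positive_purchases),
]


def run_check(records: List[Record]) -> Tuple[bool, List[Tuple[str, int]]]:
    """Evaluate every constraint on every record.

    Returns (passed, failures), where failures lists the constraint
    name and record index of each violation. One violation is enough
    to fail the whole check.
    """
    failures = []
    for index, record in enumerate(records):
        for name, predicate in CONSTRAINTS:
            if not predicate(record):
                failures.append((name, index))
    return len(failures) == 0, failures


# Made-up sample data: the second record violates all three constraints.
records = [
    {"email": "ana@example.com",
     "address": "12 Oak St, Springfield 62704",
     "purchase_history": [19.99, 5.0]},
    {"email": "not-an-email",
     "address": "98 Pine Ave",
     "purchase_history": [12.5, -3.0]},
]
passed, failures = run_check(records)
print(passed)
print(failures)
```

The list of failures plays the role of the execution report: it identifies which constraint failed and for which record, so the offending rows can be corrected before further analysis.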