Data Profiling
From version 2.2, encryption(in-flight, at-rest) for all jobs and catalog has been enabled. All the existing jobs(User created, and also system created) were updated with encryption related settings, and all the newly created jobs will have encryption enabled automatically.
Data profiling is the process of analyzing existing data source to collect statistics or the summary about the data. It helps to identify anomalies and evaluate data quality.
The image above displays a data profile for a cross-sectional MRI dataset from a sample of people diagnosed with Alzheimer's disease.
Enable Data profiling
You can enable data profiling. It can be enabled or disabled at any time. Data profiling is only available for structured datasets (e.g. datasets hosted on S3-Athena or Redshift).
Following are the fields derived in a data profile:
Property | Description |
---|---|
Files | Number of files in the dataset. |
Dataset Size | Size of the datasets in S3 with target location set as 'S3-Athena', and size of datasets in data warehouse with target location set as 'Redshift'. |
Rows | Number of rows present in the dataset. |
Duplicate Rows | Number of non-unique rows present in the dataset. |
Columns | Number of columns present in the dataset. |
Missing Values | Number of empty cells in the dataset. |
Last Modified | Time when the dataset was last edited by a user. |
Last Profiled | Time when the data profile was extracted. |
Data Type | Data types inferred when a dataset is registered. |
Min Value | Minimum value of each column. |
Max Value | Maximum value of each column. |
Sample Rows | A random sample of 10 rows from the dataset. |
Users with read-only access will not be able to view data profiling details
Update frequency of Data Profiling
Data profiling jobs are run at 12 AM UTC everyday.
When do you update data profile for a dataset?
If data profiling is enabled by the user and there have been additions to the dataset in the last 24 hours, the data profile will be updated accordingly. This process is set up to prevent waste of any resources, as the data profile will remain the same state if no new files are added.
Concurrency of data profiling jobs
Currently all data profiling jobs run with a concurrency factor of 5.
How long does it take for all data profiles to get updated?
If there are 100 datasets which are to be profiled, assuming that each dataset takes approximately 3 minutes (depending on the dataset size), the total time which will be utilized would be equivalent to 20 * 3 minutes or 60 minutes where concurrency factor will be taken as 5 units. All data profiles should be updated by 1:00 AM UTC.
What happens in case of failures?
If a data profile fails to be extracted, an error will be displayed on the profile tab and an email alert will also be sent to the subscribed user.