Version: v2.7

Data Profiling

info

From version 2.2 onward, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.

Data profiling is the process of analyzing an existing data source to collect statistics or summary information about the data. It helps identify anomalies and evaluate data quality.

/img/datasets/datasets/Data_profiling.png

The image above displays a data profile for a cross-sectional MRI dataset from a sample of people diagnosed with Alzheimer's disease.

Enable Data profiling

You can enable or disable data profiling for a dataset at any time. Data profiling is only available for structured datasets (e.g. datasets hosted on S3-Athena or Redshift).

/img/datasets/datasets/enable-data-profiling.gif

The following fields are derived in a data profile:

Property	Description
Files	Number of files in the dataset.
Dataset Size	Size of the dataset in S3 when the target location is 'S3-Athena', or size of the dataset in the data warehouse when the target location is 'Redshift'.
Rows	Number of rows present in the dataset.
Duplicate Rows	Number of non-unique rows present in the dataset.
Columns	Number of columns present in the dataset.
Missing Values	Number of empty cells in the dataset.
Last Modified	Time when the dataset was last edited by a user.
Last Profiled	Time when the data profile was extracted.
Data Type	Data types inferred when a dataset is registered.
Min Value	Minimum value of each column.
Max Value	Maximum value of each column.
Sample Rows	A random sample of 10 rows from the dataset.
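
As a rough illustration of how several of the fields above could be computed, here is a minimal Python sketch over an in-memory list of rows. It is not the platform's implementation. The function name `profile_rows` is hypothetical, and it assumes that a "duplicate row" is any exact repeat of an earlier row and that `None` and empty strings count as missing values.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Compute a few profile fields for a list of row dicts (illustrative only)."""
    # Represent each row as a tuple so identical rows can be counted.
    as_tuples = [tuple(row.get(col) for col in columns) for row in rows]
    counts = Counter(as_tuples)

    # Assumption: a duplicate is any repeat of a row that was already seen.
    duplicate_rows = sum(n - 1 for n in counts.values())

    # Assumption: None and "" both count as missing values.
    def is_missing(value):
        return value is None or value == ""

    missing_values = sum(1 for t in as_tuples for v in t if is_missing(v))

    # Min and max per column, ignoring missing values.
    min_max = {}
    for i, col in enumerate(columns):
        present = [t[i] for t in as_tuples if not is_missing(t[i])]
        min_max[col] = (min(present), max(present)) if present else (None, None)

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "min_max": min_max,
    }


sample = [
    {"id": 1, "age": 70},
    {"id": 2, "age": None},
    {"id": 1, "age": 70},
]
profile = profile_rows(sample, ["id", "age"])
```

For the three sample rows, this counts one duplicate row (the third row repeats the first) and one missing value (the empty `age` cell).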

Note

Users with read-only access cannot view data profiling details.

Update frequency of Data Profiling

Data profiling jobs run at 12 AM UTC every day.

When is the data profile for a dataset updated?

If data profiling is enabled for a dataset and files have been added to it in the last 24 hours, the data profile is updated in the next run. Otherwise the profile is left as it is. This avoids wasting resources, since the data profile stays the same when no new files are added.
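
The update rule described above can be sketched as a small predicate. This is a hedged illustration only: the function name `should_update_profile` and its parameters are hypothetical, and how the platform actually tracks the time of the last file addition is not stated in this page.

```python
from datetime import datetime, timedelta, timezone


def should_update_profile(profiling_enabled, last_file_added_at, now):
    """Return True if the nightly run should refresh this dataset's profile."""
    # Profiling must be switched on, and the dataset must have received files.
    if not profiling_enabled or last_file_added_at is None:
        return False
    # Only refresh when something was added within the last 24 hours.
    return now - last_file_added_at <= timedelta(hours=24)


run_time = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)  # a 12 AM UTC run
recent = datetime(2024, 1, 1, 15, 30, tzinfo=timezone.utc)
stale = datetime(2023, 12, 30, 9, 0, tzinfo=timezone.utc)
```

With these example timestamps, a dataset whose last file arrived at `recent` would be re-profiled, while one last changed at `stale` would be skipped.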

Concurrency of data profiling jobs

Currently all data profiling jobs run with a concurrency factor of 5.

How long does it take for all data profiles to get updated?

If 100 datasets are to be profiled and each takes approximately 3 minutes (the actual time depends on the dataset size), a concurrency factor of 5 means the jobs run in 100 / 5 = 20 batches, for a total of 20 * 3 minutes = 60 minutes. In that case, all data profiles should be updated by 1:00 AM UTC.
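
The estimate above can be reproduced with a few lines of Python. This is a simplified sketch that assumes every dataset takes the same time and that jobs run in full batches of the concurrency factor. The function name `estimated_total_minutes` is hypothetical.

```python
import math


def estimated_total_minutes(dataset_count, minutes_per_dataset, concurrency=5):
    """Estimate the wall-clock minutes needed to profile all datasets."""
    # Number of batches when `concurrency` jobs run at the same time.
    batches = math.ceil(dataset_count / concurrency)
    # Assumption: each batch takes as long as a single dataset.
    return batches * minutes_per_dataset


total = estimated_total_minutes(100, 3)  # 20 batches of 3 minutes each
```

For 100 datasets at 3 minutes each, `total` is 60, which matches the 1:00 AM UTC completion time for a run starting at 12 AM UTC.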

What happens in case of failures?

If a data profile fails to be extracted, an error is displayed on the profile tab and an email alert is sent to the subscribed user.

Data profile failure
