Datasets
S3 Athena Datasets
1. Empty data for numeric or non-string columns:
Error message: Data parsing failed, empty field data found for non-string column
Issue description: When a user tries to load empty/null values into non-string columns, the load process fails with a data validation error.
Explanation: This error is thrown by the file parser, which currently does not support null/empty values in non-string fields. As per the documentation, one workaround is to import these fields as string columns and create views on top of them that cast the values to the required data types, as in the sketch below.
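As an illustration of the cast-based workaround, the sketch below submits an Athena view definition with boto3. The database, table, column, and S3 output names are assumptions for the example; substitute your own.

```python
import boto3

# Hypothetical names for illustration; replace with your own.
DATABASE = "my_dataset_db"
OUTPUT_LOCATION = "s3://my-athena-results/"

# View that casts string-imported columns back to their intended types.
# Empty strings are mapped to NULL via NULLIF before casting.
CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW orders_typed AS
SELECT
    order_id,
    CAST(NULLIF(amount, '') AS DOUBLE)        AS amount,
    CAST(NULLIF(quantity, '') AS INTEGER)     AS quantity,
    CAST(NULLIF(order_date, '') AS TIMESTAMP) AS order_date
FROM orders_raw
"""

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=CREATE_VIEW_SQL,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print("Started query:", response["QueryExecutionId"])
```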
2. File parsing:
Error message: Data validation failed with message, new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Issue description: This is one of the data validation errors that can occur while loading data into a dataset.
Explanation: This error is thrown by the file parser, which currently does not support embedded line breaks in CSV/TSV/XLSX files. Please follow the documentation. A possible solution is to perform a regex replace on the inappropriate newline or carriage return characters in the file, as sketched below.
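A minimal Python sketch of that cleanup step, assuming a comma-delimited file whose embedded line breaks appear only inside quoted fields; the file names are placeholders:

```python
import csv
import re

INPUT_FILE = "raw_data.csv"      # placeholder paths for illustration
OUTPUT_FILE = "clean_data.csv"

with open(INPUT_FILE, newline="", encoding="utf-8") as src, \
        open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Regex-replace embedded line breaks inside each field with a
        # space so every record occupies exactly one physical line.
        writer.writerow([re.sub(r"[\r\n]+", " ", field) for field in row])
```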
3. Field validations:
Error message: N/A
Issue description: Validations not available for all data types
Explanation: Currently, data type validations are limited to the primitive types Strings/Varchar, Integers, Double, Boolean, Date, and Timestamp. Support for complex structures is yet to be added. Moreover, for data types like Date and Timestamp, value formats are not strictly validated because multiple formats are accepted; the sketch below illustrates this.
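To illustrate why strict Date/Timestamp validation is difficult with multi-format values, the hypothetical check below accepts a value if it parses under any one of several formats. The format list is an assumption for the example, not the actual set the parser uses.

```python
from datetime import datetime

# Candidate formats are assumptions for illustration; the actual
# parser may accept a different set.
ACCEPTED_FORMATS = [
    "%Y-%m-%d",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y",
    "%m-%d-%Y",
]

def is_valid_datetime(value: str) -> bool:
    """Return True if the value parses under any accepted format."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False

print(is_valid_datetime("2024-03-15"))   # True
# Passes as %m-%d-%Y here, but a writer may have intended day-first,
# so format correctness cannot be strictly enforced.
print(is_valid_datetime("01-02-2024"))   # True, but ambiguous
```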
4. Data Profiling:
Error message: Data Profiling failing with "Failed to create any executor tasks" error
Issue description: This error occurs when there are insufficient IP addresses available in the subnet for the Data Profiling Glue job execution.
Explanation: The job fails to start if the number of IP addresses it requires exceeds those available in the subnet. Re-triggering the Data Profiling job for that dataset after some time can resolve the issue, since IP addresses used by other job executions are released shortly after those jobs complete. Available capacity can be checked before retrying, as sketched below.
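One way to confirm the condition before re-triggering is to check the subnet's free IP count with boto3. The subnet ID and threshold below are placeholders; the headroom a Glue job needs depends on its worker count.

```python
import boto3

SUBNET_ID = "subnet-0123456789abcdef0"   # placeholder subnet ID
MIN_FREE_IPS = 20                        # assumed headroom for the Glue job

ec2 = boto3.client("ec2")
subnet = ec2.describe_subnets(SubnetIds=[SUBNET_ID])["Subnets"][0]
free_ips = subnet["AvailableIpAddressCount"]

if free_ips >= MIN_FREE_IPS:
    print(f"{free_ips} IPs free; safe to re-trigger the Data Profiling job.")
else:
    print(f"Only {free_ips} IPs free; wait for running jobs to release addresses.")
```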