Datasets
S3 Athena Datasets
1. Empty data for numeric or non-string columns:
Error message: Data parsing failed, empty field data found for non-string column
Issue description: When a user tries to load empty/null values into non-string columns, the load process fails with a data validation error.
Explanation: This error message is thrown by the file parser, which currently does not support null/empty values in non-string fields. Per the documentation, one workaround is to import these columns as strings and create views on top of the table that cast the values to the required data types, as in the sketch below.
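A minimal sketch of that workaround, assuming the raw table was registered with all-string columns and Athena queries are run through boto3; the database, table, view, and column names are hypothetical:

```python
import boto3

# Hypothetical names: replace the database, table, view, and
# column names with those of your own dataset.
athena = boto3.client("athena", region_name="us-east-1")

# The raw table imports every column as string so that empty
# fields pass validation; the view casts them back. TRY_CAST
# returns NULL instead of failing on empty strings.
create_view_sql = """
CREATE OR REPLACE VIEW orders_view AS
SELECT
    order_id,
    TRY_CAST(quantity AS INTEGER)  AS quantity,
    TRY_CAST(unit_price AS DOUBLE) AS unit_price,
    TRY_CAST(order_date AS DATE)   AS order_date
FROM raw_orders
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```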
2. File parsing:
Error message: Data validation failed with message, new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Issue description: This data validation error occurs while loading data into a dataset.
Explanation: This error message is thrown by the file parser, which currently does not support embedded line breaks in csv/tsv/xlsx files. Please follow the documentation. A possible solution is to perform a regex replace on the offending newline or carriage-return characters in the file, as in the sketch below.
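A minimal sketch of such a cleanup for a csv file whose quoted fields contain embedded line breaks; the file names are placeholders:

```python
import csv
import re

# Rewrite a csv so that newline/carriage-return characters embedded
# inside quoted fields are collapsed into a single space.
with open("input.csv", newline="") as src, \
        open("cleaned.csv", "w", newline="") as dst:
    reader = csv.reader(src)   # handles quoted fields with line breaks
    writer = csv.writer(dst)
    for row in reader:
        writer.writerow([re.sub(r"[\r\n]+", " ", field) for field in row])
```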
3. Field validations:
Error message: N/A
Issue description: Validations are not available for all data types
Explanation: Currently, data type validations are limited to the primitive types String/Varchar, Integer, Double, Boolean, Date, and Timestamp. Support for complex structures is yet to be added. Moreover, for data types like Date and Timestamp, value formats are not strictly validated because they accept multiple formats.
4. Batch file uploads:
Error message: Data validation failed with message, Hive bad data exception
Issue description: This occurs when a user uploads a batch of good and bad data files at the same time. Currently the validation fails and the user won't be able to upload any of the files in the batch.
Explanation: This is a limitation of the validation feature for append-type datasets: if even one file in the batch upload is corrupt, none of the individual files' data can be loaded. A temporary workaround is to upload files individually, then correct and re-upload any that fail, as in the sketch below. Our team is working on implementing an update in the next version.
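A minimal sketch of the file-by-file workaround, assuming uploads land in an S3 bucket via boto3; the bucket, key prefix, and file names are hypothetical, and the exact Amorphic upload path may differ:

```python
import boto3

s3 = boto3.client("s3")
files = ["part-001.csv", "part-002.csv", "part-003.csv"]

# Upload one file at a time so a single bad file does not block
# the rest of the batch.
failed = []
for name in files:
    try:
        s3.upload_file(name, "my-dataset-bucket", f"uploads/{name}")
    except Exception as exc:  # e.g. botocore.exceptions.ClientError
        failed.append((name, exc))

# Correct the failed files offline, then re-upload them.
for name, exc in failed:
    print(f"{name} failed: {exc}")
```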
5. Parquet file with special characters:
Error message: Forbidden error with status code 403
Issue description: This occurs during the schema extraction process: uploading a parquet file containing special characters such as <`! is blocked and throws a forbidden error.
Explanation: Currently, AWS WAF appears to block parquet files with a parquet encoding version larger than 10.0.0 that contain certain special characters such as <`!. A temporary workaround is to generate the schema offline, register it manually, and complete the registration process, as in the sketch below. The file upload process itself works normally and is not impacted by these special characters. Our team is working on implementing an update in the next version.
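A minimal sketch of extracting the schema offline with pyarrow so it can be registered manually; the file name is a placeholder:

```python
import pyarrow.parquet as pq

# Read the parquet schema locally, without uploading the file.
schema = pq.read_schema("data.parquet")

# Print each column name and type in a form that can be used to
# register the schema manually.
for field in schema:
    print(f"{field.name}: {field.type}")
```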
Reload Datasets
If a dataset is created with the table update type Reload and the target DWH is AuroraMysql, please be aware of the following: if the data contains headers, only the header from the first file is skipped when Skip file header is set to True; headers from the remaining files are loaded as data into the AuroraMysql table. This is caused by an issue on the AWS side and has nothing to do with Amorphic. Creating the dataset with Skip file header set to False and uploading files that don't contain headers avoids the issue, as in the sketch below. There is no ETA from AWS on when this issue will be resolved.
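A minimal sketch of stripping the header row locally before upload, assuming csv exports in a local directory; paths are placeholders:

```python
import glob

# Remove the first line (the header) from each file so the dataset
# can be created with Skip file header set to False.
for path in glob.glob("exports/*.csv"):
    with open(path) as src:
        lines = src.readlines()
    with open(path.replace(".csv", "_noheader.csv"), "w") as dst:
        dst.writelines(lines[1:])
```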