During consecutive runs of S3 ingestion for append-type datasets, files that already exist in the dataset are erroneously re-ingested.
Affected Versions: 2.3, 2.2, 2.1, 2.0
Fix Version: 2.4
Root cause(s)
The system compared the ETags of S3 files to determine whether a file had already been ingested. For larger files (typically those uploaded to S3 as multipart uploads), the ETag is not the MD5 hash of the file content, so the same file could produce a different ETag. The comparison then failed and the file was ingested again as a duplicate.
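The following is a minimal sketch of why an ETag comparison can fail for larger files. It assumes the commonly observed S3 behavior for multipart uploads without server-side KMS encryption, where the ETag is the MD5 of the concatenated per-part MD5 digests followed by `-<part count>`. The function names and part sizes are hypothetical and are not taken from the Amorphic code base.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag of a non-multipart upload: the plain MD5 hex digest (assumed, no KMS)."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag as commonly observed for multipart uploads (illustrative assumption).

    Each part is hashed separately, the raw digests are concatenated and
    hashed again, and the number of parts is appended after a dash.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    joined_digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(joined_digests).hexdigest()}-{len(parts)}"


MB = 1024 * 1024
payload = b"x" * (20 * MB)  # identical content in every case below

etag_plain = single_part_etag(payload)      # plain MD5 of the content
etag_8mb = multipart_etag(payload, 8 * MB)   # uploaded in 3 parts
etag_16mb = multipart_etag(payload, 16 * MB)  # uploaded in 2 parts

# All three values differ even though the bytes are the same.
print(etag_plain, etag_8mb, etag_16mb, sep="\n")
```

Because the value depends on how the upload was split into parts, an equality check on ETags alone cannot reliably tell whether two objects hold the same content.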
Impact
Previously ingested files are not reliably recognized, so they are ingested again on later runs. The resulting duplicate files can affect data integrity and overall system efficiency.
Mitigation
A fix is available in Amorphic v2.4. Please upgrade to v2.4 or later to resolve this issue.
Timeline
- 2023-09-11: Bug reported/identified (CLOUD-3937)
- 2023-09-11: Bug triaged
- 2023-10-05: Bug fixed
- 2023-10-06: Testing completed and fix is available