Version: v2.5

Custom Script

Users can define custom logic for processing file results with a Python script. The script must follow the provided template to work correctly: it must define a class named 'DocumentHandler', and the primary logic must be implemented in that class's 'execute_custom_script' method. Users are free to add further methods to the class as needed. The script template below can be used as a starting point.

Script Template

As shown in the template below, the class constructor initializes parameters such as the file metadata and the results returned from the business rules (if any). The template includes sample methods for retrieving and updating result values and for flagging a file or an individual result key for review. A couple of methods related to Textract queries are also included for reference.

import logging
import boto3

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

class DocumentHandler:
    def __init__(self, metadata, results):
        """Class constructor

        Args:
            metadata (object): Object containing metadata that can be useful in the script
                Sample -> {
                    "FileKey": <s3_path_of_raw_file>,
                    "TextractOutputFileKey": <s3_path_of_textract_output_file>,
                    "AWSRegion": "",
                    "DataBucketName": <S3_bucket_containing_raw_and_textract_output_files>,
                    "ConfigBucketName": <S3_bucket_containing_config_file_if_present>,
                    "ConfigFileKey": <s3_path_of_config_file_if_present>,
                    "OutputDatasetKeys": <output_keys_accepted_by_the_helper_methods>
                }
            results (object): Object containing the key-value pairs returned from business rules along with some review details: {
                "Results": {
                    <key1>: <value>,
                    <key2>: <value>,
                    ...
                },
                "ReviewStatus": "not-required"/"pending-review" -> Review Status at the file level
                "Message": "" -> Add a message at file level
                "KeyLevelReviewDetails": {
                    <key>: {
                        "FlagForReview": True/False, -> Flag a key for review
                        "Message": "", -> Add a message at a key level
                        "FlaggedBy": ""
                    }
                } - Review Status at an individual key level
            }
        """
        self.metadata = metadata
        self.results = results
        self.output_dataset_keys = metadata['OutputDatasetKeys']
        self.AWS_REGION = metadata['AWSRegion']
        self.DATA_BUCKET_NAME = metadata['DataBucketName']

    def get_results_object(self):
        """Returns all the keys and values in results
        """
        return self.results['Results']

    def get_result_value(self, output_key):
        """Returns the value corresponding to a given key
        Args:
            output_key (string): Key whose value needs to be fetched
        """
        if output_key in self.output_dataset_keys:
            return self.results['Results'][output_key]
        else:
            LOGGER.error("The given key does not exist in the result data")

    def set_result_value(self, output_key, result_value):
        """Update the value corresponding to the given key
        Args:
            output_key (string): Key whose value needs to be updated
            result_value (string): Updated value corresponding to the given key
        """
        if output_key in self.output_dataset_keys:
            self.results['Results'][output_key] = result_value
        else:
            LOGGER.error("The provided key - %s is not a part of the OutputDatasetKeys", output_key)

    def flag_file(self, message = ''):
        """Flag the file for review

        Args:
            message (str, optional): Message stating the reason for flagging. Defaults to ''.
        """
        self.results['ReviewStatus'] = 'pending-review'
        if message:
            self.results['Message'] = message

    def flag_result(self, output_key, message = ''):
        """Flag a particular key for review

        Args:
            output_key (string): Key that needs to be flagged
            message (str, optional): Message stating the reason for flagging. Defaults to ''.
        """
        if output_key in self.output_dataset_keys:
            self.results['KeyLevelReviewDetails'][output_key]['FlagForReview'] = True
            if message:
                self.results['KeyLevelReviewDetails'][output_key]['Message'] += f"\n{message}"
        else:
            LOGGER.error("The provided key - %s is not a part of the OutputDatasetKeys", output_key)

    def get_query_result_by_id(self, response, id):
        """Get the value and confidence score for a QUERY_RESULT block with given Id.
        Args:
            response (json): JSON response returned by textract
            id (string): Id of the query result block

        Returns:
            object: Value & confidence score if found, otherwise None
        """
        for b in response["Blocks"]:
            if b["BlockType"] == "QUERY_RESULT" and b["Id"] == id:
                return {
                            "Value": b.get("Text"),
                            "Confidence": b.get("Confidence")
                        }
        return None

    def get_query_results_for_alias(self, response, q_alias):
        """Get a list of query results (value & confidence score) for a given alias
        Args:
            response (json): JSON response returned by textract
            q_alias (string): alias used in query

        Returns:
            object[]: List of query results for the given alias
            [
                {
                    "Value": <query_result>,
                    "Confidence": <confidence_score_if_present>
                }
            ]
        """
        results = []
        for b in response["Blocks"]:
            # Alias is optional on a query, so read it with .get() to avoid a KeyError
            if b["BlockType"] == "QUERY" and b["Query"].get("Alias") == q_alias:
                if b.get("Relationships"):
                    ref_id = b["Relationships"][0]["Ids"][0]
                    result = self.get_query_result_by_id(response, ref_id)
                    if result:
                        results.append(result)
        return results

    def run_synchronous_textract_queries(self, queries):
        """Run a list of synchronous textract queries for the file and get the response

        Note: If a Textract query fails due to a throughput exception, you can define a Textract client with a custom config that allows more retries

        from botocore.client import Config
        max_attempts = <define_according_to_use_case> (default retries is 3)
        config = Config(retries = dict(max_attempts=max_attempts, mode="standard"))
        TEXTRACT_CLIENT = boto3.client("textract", region_name=self.AWS_REGION, config=config)

        Args:
            queries (object[]): List of queries to run
                [
                    {
                        "Text": "",
                        "Alias": "" (optional),
                        "Pages": [] (optional, defaults to ["*"])
                    }
                ]

        Returns:
            (json): json response from textract queries
        """
        textract_client = boto3.client('textract', region_name=self.AWS_REGION)
        queries_config = []
        for query in queries:
            config = {
                'Text': query['Text'],
                'Pages': query.get('Pages', ["*"])
            }
            if query.get("Alias"):
                config.update({
                    'Alias': query['Alias']
                })
            queries_config.append(config)
        file_key = self.metadata['FileKey']
        response = textract_client.analyze_document(
            Document = {
                'S3Object': {
                    'Bucket': self.DATA_BUCKET_NAME,
                    'Name': file_key
                }
            },
            FeatureTypes=["QUERIES"],
            QueriesConfig = {
                'Queries': queries_config
            }
        )
        return response

    def execute_custom_script(self):
        """Write the custom code for the given script here
        """

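To show how the template methods can fit together, the sketch below fills in execute_custom_script for a hypothetical rule: take the answer to a Textract query, store it under a result key, and flag the key for review when the confidence is low. The key name InvoiceNumber, the alias INVOICE_NO, the MIN_CONFIDENCE threshold, and the sample data are assumptions made for this example. The class is a trimmed-down stand-in for the template, and run_synchronous_textract_queries is replaced by a stub that returns a hand-written response, so the snippet runs without AWS access.

```python
MIN_CONFIDENCE = 90.0  # hypothetical threshold chosen for this example


class DocumentHandler:
    """Trimmed-down stand-in for the template class (illustration only)."""

    def __init__(self, metadata, results):
        self.metadata = metadata
        self.results = results
        self.output_dataset_keys = metadata['OutputDatasetKeys']

    def set_result_value(self, output_key, result_value):
        if output_key in self.output_dataset_keys:
            self.results['Results'][output_key] = result_value

    def flag_result(self, output_key, message=''):
        if output_key in self.output_dataset_keys:
            details = self.results['KeyLevelReviewDetails'][output_key]
            details['FlagForReview'] = True
            if message:
                details['Message'] += f"\n{message}"

    def get_query_result_by_id(self, response, block_id):
        for block in response['Blocks']:
            if block['BlockType'] == 'QUERY_RESULT' and block['Id'] == block_id:
                return {'Value': block.get('Text'),
                        'Confidence': block.get('Confidence')}
        return None

    def get_query_results_for_alias(self, response, q_alias):
        results = []
        for block in response['Blocks']:
            if block['BlockType'] == 'QUERY' and block['Query'].get('Alias') == q_alias:
                # A QUERY block without an answer has no Relationships entry.
                if block.get('Relationships'):
                    ref_id = block['Relationships'][0]['Ids'][0]
                    result = self.get_query_result_by_id(response, ref_id)
                    if result:
                        results.append(result)
        return results

    def run_synchronous_textract_queries(self, queries):
        # Stub: returns a hand-written response instead of calling Textract.
        # Only the fields read by the helper methods are included.
        return {
            'Blocks': [
                {
                    'BlockType': 'QUERY',
                    'Id': 'q-1',
                    'Query': {'Text': queries[0]['Text'],
                              'Alias': queries[0]['Alias']},
                    'Relationships': [{'Type': 'ANSWER', 'Ids': ['a-1']}],
                },
                {'BlockType': 'QUERY_RESULT', 'Id': 'a-1',
                 'Text': 'INV-0042', 'Confidence': 72.5},
            ]
        }

    def execute_custom_script(self):
        # Hypothetical rule: fill "InvoiceNumber" from a Textract query and
        # flag it for review when the confidence is below MIN_CONFIDENCE.
        queries = [{'Text': 'What is the invoice number?', 'Alias': 'INVOICE_NO'}]
        response = self.run_synchronous_textract_queries(queries)
        answers = self.get_query_results_for_alias(response, 'INVOICE_NO')
        if not answers:
            self.flag_result('InvoiceNumber', 'No answer returned by Textract')
            return
        best = max(answers, key=lambda a: a['Confidence'] or 0.0)
        self.set_result_value('InvoiceNumber', best['Value'])
        if (best['Confidence'] or 0.0) < MIN_CONFIDENCE:
            self.flag_result('InvoiceNumber', 'Low Textract confidence')


# Sample inputs shaped like the objects described in the constructor docstring.
metadata = {'OutputDatasetKeys': ['InvoiceNumber']}
results = {
    'Results': {'InvoiceNumber': ''},
    'ReviewStatus': 'not-required',
    'Message': '',
    'KeyLevelReviewDetails': {
        'InvoiceNumber': {'FlagForReview': False, 'Message': '', 'FlaggedBy': ''}
    },
}

handler = DocumentHandler(metadata, results)
handler.execute_custom_script()
print(results['Results']['InvoiceNumber'])
print(results['KeyLevelReviewDetails']['InvoiceNumber']['FlagForReview'])
```

In this sketch the value is stored even when it is flagged, so a reviewer sees the extracted text alongside the review message. Whether that suits a given process flow is a design choice for the script author.
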
Custom Script Run logs

When a custom script is defined for a process flow, users can verify that it behaves as intended by reviewing the run logs for a specific run. The logs can be viewed and downloaded from the details of the corresponding run.
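
Log messages are often the quickest way to see what a script did during a run. The sketch below shows the kind of summary a script might emit through the template's module-level LOGGER. It assumes, without confirmation from this page, that output written through LOGGER ends up in the downloadable run logs. The function name log_result_summary and the sample values are hypothetical.

```python
import logging

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)


def log_result_summary(results):
    """Log each result key with its value and review flag; return the lines."""
    lines = []
    for key, value in results['Results'].items():
        details = results['KeyLevelReviewDetails'].get(key, {})
        flagged = details.get('FlagForReview', False)
        line = f"{key}={value!r} flagged={flagged}"
        LOGGER.info(line)
        lines.append(line)
    return lines


# Sample data shaped like the results object described in the template.
sample_results = {
    'Results': {'InvoiceNumber': 'INV-0042'},
    'KeyLevelReviewDetails': {
        'InvoiceNumber': {'FlagForReview': True, 'Message': '', 'FlaggedBy': ''}
    },
}
summary = log_result_summary(sample_results)
print(summary)
```

Inside the template, a helper like this could be called at the end of execute_custom_script with self.results.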

The image below shows how to download the run logs.

Custom Script Configuration

Users can also supply a JSON configuration for use in their custom script. The configuration is defined while updating the process flow; if provided, it is uploaded to S3 as a file. The script can read this file using the ConfigBucketName and ConfigFileKey properties of the metadata object.

The image below shows how to add a configuration for the custom script.

Sample snippet for accessing the custom configuration in the custom script:

# json and boto3 must be imported at the top of the script
import json
import boto3

def execute_custom_script(self):
    """Write the custom code for the given script
    """
    config_bucket_name = self.metadata['ConfigBucketName']
    config_file_key = self.metadata['ConfigFileKey']

    # Create an S3 client and get the S3 object
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket=config_bucket_name, Key=config_file_key)

    # Read the content of the object
    object_content = response['Body'].read()

    # Parse the JSON content
    config_json_data = json.loads(object_content.decode('utf-8'))

    # Now you can work with the JSON data as required
    print(config_json_data)
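
One way to keep the configuration logic easy to test is to move the S3 read into a small helper that receives the client as an argument. The sketch below does that and exercises the helper with an in-memory stand-in for the S3 client, so it runs without AWS access. The helper name load_config, the FakeS3Client class, the bucket and key names, and the MinConfidence field are all hypothetical; in a real script the first argument would be a client created with boto3.client('s3').

```python
import io
import json


def load_config(s3_client, bucket_name, file_key):
    """Read a JSON configuration file from S3 and return the parsed content."""
    response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
    return json.loads(response['Body'].read().decode('utf-8'))


class FakeS3Client:
    """In-memory stand-in exposing only get_object (for local testing)."""

    def __init__(self, objects):
        self._objects = objects  # maps (bucket, key) -> bytes

    def get_object(self, Bucket, Key):
        return {'Body': io.BytesIO(self._objects[(Bucket, Key)])}


# Metadata shaped like the sample in the template, with made-up values.
metadata = {
    'ConfigBucketName': 'example-config-bucket',
    'ConfigFileKey': 'configs/custom-script.json',
}
fake_client = FakeS3Client({
    ('example-config-bucket', 'configs/custom-script.json'): b'{"MinConfidence": 90}'
})

config = load_config(
    fake_client, metadata['ConfigBucketName'], metadata['ConfigFileKey']
)
print(config['MinConfidence'])
```

Passing the client in, rather than creating it inside the helper, is what allows the stand-in to be swapped for the real client without changing the parsing code.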
