Reading Class
Amorphic Datalake stores data in many formats, backed either by a relational database (Redshift/Aurora, for structured data only) or by S3 object storage. The platform uses a defined structure to store the data for better organization.
The Read class lets you orchestrate reading this data in a more elegant way. You can use either Python or PySpark as the processing backend.
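For example, reading a CSV dataset with the Python backend (documented below) can be as simple as the following sketch; the bucket, domain, and dataset names are placeholders:
>>> from amorphicutils.python.read import Read
>>> reader = Read("dlz_bucket")
>>> df = reader.read_csv_data("testdomain", "testdataset", header=True)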
Reading in python-shell
The following class returns a pandas DataFrame of the data.
Reading from S3
class amorphicutils.python.read.Read(bucket_name, region=None, logger=None)
Class to read data from Amorphic
__init__(bucket_name, region=None, logger=None)
Initialize the class with dataset-specific details
- Parameters:
- bucket_name – name of the bucket
>>> reader = Read("dlz_bucket")
list_object(domain_name, dataset_name)
List the objects for a specific dataset
- Parameters:
- domain_name – domain name of the dataset
- dataset_name – dataset name
- Returns: list of objects from s3
>>> reader = Read("dlz_bucket")
>>> reader.list_object("testdomain", "testdataset")
read_csv_data(domain_name, dataset_name, schema=None, header=False, delimiter=',', upload_date=None, path=None, **kwargs)
Read CSV data from S3 using the pandas read API and return a pandas DataFrame
- Parameters:
- domain_name – domain name of the dataset
- dataset_name – dataset name
- schema – List of column names of the data. Type: list(str)
- header – True if the data files contain a header. Default: False
- delimiter – Delimiter used in the dataset. Default: ","
- upload_date – upload date timestamp.
- path – Path of the file to read from.
- kwargs – Optional arguments supported by the pandas CSV reader
- Returns: pandas DataFrame of the data
>>> reader = Read("dlz_bucket")
>>> df = reader.read_csv_data("testdomain", "testdataset", upload_date="1578305347")
read_excel(domain_name, dataset_name, sheet_name=0, header=False, schema=None, upload_date=None, path=None, **kwargs)
Read data from Excel files and return a pandas DataFrame
- Parameters:
- domain_name – domain name of the dataset
- dataset_name – dataset name
- sheet_name – Sheet name or index to read data from. Default: 0
- header – True if the data files contain a header. Default: False
- schema – List of column names of the data.
- upload_date – upload date timestamp.
- path – Path of the file to read from.
- kwargs – Optional arguments supported by the pandas Excel reader
- Returns: pandas DataFrame of the data
>>> amorphic_reader = Read(bucket_name="dlz_bucket")
>>> result = amorphic_reader.read_excel(domain_name="testdomain", dataset_name="testdataset", header=True)
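To read a specific worksheet, pass sheet_name; the sheet name below is a placeholder:
>>> result = amorphic_reader.read_excel(domain_name="testdomain", dataset_name="testdataset", sheet_name="Sheet1", header=True)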
read_json(domain_name, dataset_name, upload_date=None, path=None, **kwargs)
Read data from JSON files and return a pandas DataFrame
- Parameters:
- domain_name – domain name of the dataset
- dataset_name – dataset name
- upload_date – upload date timestamp.
- path – Path of the file to read from.
- kwargs – Optional arguments supported by the pandas JSON reader
- Returns: pandas DataFrame of the data
>>> amorphic_reader = Read(bucket_name="dlz_bucket")
>>> result = amorphic_reader.read_json(domain_name="testdomain", dataset_name="testdataset")
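Extra keyword arguments are handed to the pandas JSON reader; for example, assuming the files are in JSON Lines format, lines=True could be passed through:
>>> result = amorphic_reader.read_json(domain_name="testdomain", dataset_name="testdataset", lines=True)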
Reading in pyspark
The following class returns a Spark DataFrame of the data.
Reading from S3
class amorphicutils.pyspark.read.Read(bucket_name, spark, region=None, logger=None)
Class to read data from Amorphic
__init__(bucket_name, spark, region=None, logger=None)
Initialize the class with dataset-specific details
- Parameters:
- bucket_name – name of the bucket (dlz)
- spark – SparkContext
>>> reader = Read("dlz_bucket", spark_context)
list_object(domain_name, dataset_name)
List the objects for a specific dataset
- Parameters:
- domain_name – domain name of the dataset
- dataset_name – dataset name
- Returns: list of objects from s3
>>> reader = Read("dlz_bucket", spark=spark_context)
>>> reader.list_object("testdomain", "testdataset")
Reading from Data Warehouse
class amorphicutils.pyspark.read.DwhRead(dwh_type, dwh_host, dwh_port, dwh_db, dwh_user, dwh_pass, tmp_dir)
Class to read data from the data warehouse (Redshift/Aurora)
__init__(dwh_type, dwh_host, dwh_port, dwh_db, dwh_user, dwh_pass, tmp_dir)
Initialize the class with the parameters required to connect to the data warehouse.
- Parameters:
- dwh_type – Type of data warehouse: "redshift" or "aurora"
- dwh_host – Hostname for the DWH
- dwh_port – Port for the DWH
- dwh_db – Database name to connect to, e.g. cdap
- dwh_user – Username to use for the connection
- dwh_pass – Password for the user
- tmp_dir – Temp directory to store intermediate results
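A construction sketch is shown below; the host, credentials, and temp path are placeholders and would typically come from job parameters or a secrets store in practice:
>>> dwh_reader = DwhRead(
...     dwh_type="redshift",
...     dwh_host="example-cluster.us-east-1.redshift.amazonaws.com",
...     dwh_port=5439,
...     dwh_db="cdap",
...     dwh_user="dwh_user",
...     dwh_pass="********",
...     tmp_dir="s3://dlz_bucket/tmp/")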
read_from_redshift(glue_context, domain_name, dataset_name, **kwargs)
Return response with data from Redshift
- Parameters:
- glue_context – GlueContext
- domain_name – Domain name of dataset
- dataset_name – Dataset name
- kwargs – Extra params like: hashfield
- Returns: response with data from Redshift
>>> dwh_reader = DwhRead("redshift", DWH_HOST, DWH_PORT, DWH_DB, dwh_user, dwh_pass, tmp_dir)
>>> response = dwh_reader.read_from_redshift(glue_context, domain_name, dataset_name)
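Extra parameters such as hashfield can be forwarded as keyword arguments; the column name below is purely illustrative:
>>> response = dwh_reader.read_from_redshift(glue_context, domain_name, dataset_name, hashfield="customer_id")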