Module 15

Data engineering patterns

The five Vs of data characteristics are value, veracity, volume, velocity, and variety. Each of these characteristics impacts decision-making with data.

Three-pronged strategy to build data infrastructure

  • Modernize

    • Move to a cloud-based infrastructure and purpose-built services to reduce undifferentiated heavy lifting.

  • Unify

    • Create a single source of truth for data, and make the data available across the organization.

  • Innovate

    • Apply artificial intelligence and machine learning (AI/ML) to find new insights in the data.

Data lake = raw/unstructured data

Data warehouse = structured data

Elements of a data pipeline

A typical pipeline moves data through ingestion, storage, processing, and analysis/visualization stages; the patterns below cover the ingestion and processing stages.

Homogeneous ingestion pattern

Extracts the data and stores it in the same format it was originally created in, with no transformation (see the sketch below).
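As a rough illustration, the sketch below ingests a file into Amazon S3 unchanged. The file name, bucket, and key are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload the source file in its original format; there is no transform
# step, which is the defining trait of the homogeneous pattern.
s3.upload_file(
    "sales_2024.csv",           # hypothetical local source file
    "example-raw-data-bucket",  # hypothetical destination bucket
    "ingest/sales_2024.csv",
)
```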

Heterogeneous ingestion patterns

Extract, transform, and load (ETL)

  • Works well with structured data that is destined for a data warehouse

  • Stores data that is ready to be analyzed, so this pattern can save time for an analyst

Extract, load, and transform (ELT)

  • Works well for unstructured data that is destined for a data lake

  • Offers flexibility to create new queries (analysts can access more raw data)

The sketch after this list walks through a minimal ETL flow.
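A minimal ETL sketch with pandas (pyarrow installed for the Parquet write); the file names, columns, and cleaning rule are hypothetical stand-ins for a real pipeline.

```python
import pandas as pd

# Extract: read raw data from the source system.
orders = pd.read_csv("raw_orders.csv")

# Transform: clean and enrich before loading (the step that ELT defers
# until after the raw data has landed in the target system).
orders = orders.dropna(subset=["order_id"])
orders["total"] = orders["quantity"] * orders["unit_price"]

# Load: write the analysis-ready result to the warehouse staging area.
orders.to_parquet("warehouse_staging/orders.parquet")
```

In an ELT version of the same pipeline, raw_orders.csv would be loaded into the data lake as-is, and the transform would run later inside the target system.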

Batch and streaming processing patterns

Batch processing collects data and processes it in scheduled chunks; streaming processing handles each record continuously, as it arrives.
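For the streaming side, a minimal sketch using Amazon Kinesis Data Streams via boto3; the stream name and event payload are hypothetical.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Send a single event as it occurs (streaming ingestion), rather than
# accumulating events for a scheduled batch job.
kinesis.put_record(
    StreamName="example-clickstream",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)
```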

AWS tools to ingest data

Amazon AppFlow

  • Provides the ability to transfer data between SaaS applications and AWS services

  • Offers reuse of existing service integrations through the available Amazon AppFlow APIs

  • Provides data transformation capabilities
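As a hedged sketch, an existing flow can be started by name with boto3; "example-salesforce-to-s3" is a hypothetical flow that would already be configured in AppFlow.

```python
import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of a flow that was configured beforehand.
response = appflow.start_flow(flowName="example-salesforce-to-s3")
print(response["flowStatus"])
```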

AWS DataSync

  • Is a fully managed data migration service

  • Simplifies, automates, and accelerates copying file and object data to and from AWS storage services

  • Is optimized for speed

  • Includes encryption and integrity validation

  • Preserves metadata when moving data
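A minimal sketch of triggering an existing DataSync task with boto3; the task ARN is a hypothetical placeholder for a task created ahead of time.

```python
import boto3

datasync = boto3.client("datasync")

# Start an execution of a pre-configured transfer task.
response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
)
print(response["TaskExecutionArn"])
```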

AWS Data Exchange

  • Provides customers with a way to find, subscribe to, and use third-party data in the cloud

  • Bridges the gap between providers and subscribers who exchange data by supporting data delivery through files, tables, and APIs

  • Simplifies finding, preparing, and using data in the cloud
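As one small example of the subscriber side, boto3 can list the data sets an account is entitled to; this is a sketch, not a full import workflow.

```python
import boto3

dataexchange = boto3.client("dataexchange")

# List data sets this account has subscribed to (entitled data sets).
for data_set in dataexchange.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(data_set["Name"])
```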

Processing Data in AWS

Batch ingestion and processing

Batch processing is a good fit:

  • For reporting purposes

  • When dealing with large datasets

  • When the analytics use case is focused more on aggregating or transforming data and less on real-time analysis of data
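A toy batch-aggregation sketch with pandas: one scheduled run summarizes a full day of records. The file and column names are hypothetical.

```python
import pandas as pd

# Read the whole day's extract at once; completeness matters more than
# freshness in a batch reporting job.
events = pd.read_csv("events_2024-01-15.csv")

# Aggregate the full dataset and write the report.
daily_report = events.groupby("region")["revenue"].sum()
daily_report.to_csv("reports/revenue_by_region_2024-01-15.csv")
```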

AWS Glue

  • Used for batch processing

  • Is a data integration service that helps automate and perform ETL tasks as part of ingesting data into a pipeline

  • Provides the ability to read and write data from multiple systems and databases

  • Simplifies batch and streaming ingestion
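A hedged sketch of kicking off a pre-defined Glue ETL job with boto3; "example-orders-etl" is a hypothetical job name that would already exist in Glue.

```python
import boto3

glue = boto3.client("glue")

# Start a run of an ETL job that was defined in AWS Glue beforehand.
run = glue.start_job_run(JobName="example-orders-etl")
print(run["JobRunId"])
```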

AWS Glue Components

  • Data Catalog: a central metadata repository for data sources

  • Crawlers: scan data stores and populate the Data Catalog with table definitions

  • ETL jobs: the scripts that transform and move the data

  • Triggers and workflows: schedule and orchestrate jobs

AWS Glue Transformation types

.csv

  • Is the most common format for storing tabular data

  • Isn’t efficient for storing or manipulating large amounts of data (more than 15 GB)

.parquet

  • Stores data in a columnar fashion

  • Is optimized for storage

  • Is suitable for parallel processing

  • Speeds up analytics workloads

Convert .csv to .parquet

  • Over time, saves storage space, cost, and time (see the sketch below)
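A minimal sketch of the conversion itself, using pandas (with pyarrow installed); the file names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")

# Parquet stores the data column by column and compresses it, which is
# what saves storage space and speeds up analytical scans.
df.to_parquet("transactions.parquet")
```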
