Module 15

Data engineering patterns

The five Vs of data characteristics are value, veracity, volume, velocity, and variety. Each of these characteristics impacts decision-making with data.

Three-pronged strategy to build data infrastructure

  • Modernize

    • Move to a cloud-based infrastructure and purpose-built services to reduce undifferentiated heavy lifting.

  • Unify

    • Create a single source of truth for data, and make the data available across the organization.

  • Innovate

    • Apply artificial intelligence and machine learning (AI/ML) to find new insights in the data.

Data lake = raw/unstructured data

Data warehouse = structured data

Elements of a data pipeline

A typical pipeline moves data through ingestion, storage, processing, and analysis/visualization stages; the patterns below cover the ingestion and processing stages.

Homogeneous ingestion pattern

Extracts the data and stores it in the same format it was originally created in, with no transformation (see the sketch below).
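As a rough illustration, the sketch below ingests a file into Amazon S3 unchanged. The file name, bucket, and key are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Upload the source file in its original format; there is no transform
# step, which is the defining trait of the homogeneous pattern.
s3.upload_file(
    "sales_2024.csv",           # hypothetical local source file
    "example-raw-data-bucket",  # hypothetical destination bucket
    "ingest/sales_2024.csv",
)
```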

Heterogeneous ingestion patterns

Extract, transform, and load (ETL)

  • Works well with structured data that is destined for a data warehouse

  • Stores data that is ready to be analyzed, so this pattern can save time for an analyst

Extract, load, and transform (ELT)

  • Works well for unstructured data that is destined for a data lake

  • Offers flexibility to create new queries (analysts can access more raw data)

The sketch after this list walks through a minimal ETL flow.
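A minimal ETL sketch with pandas (pyarrow installed for the Parquet write); the file names, columns, and cleaning rule are hypothetical stand-ins for a real pipeline.

```python
import pandas as pd

# Extract: read raw data from the source system.
orders = pd.read_csv("raw_orders.csv")

# Transform: clean and enrich before loading (the step that ELT defers
# until after the raw data has landed in the target system).
orders = orders.dropna(subset=["order_id"])
orders["total"] = orders["quantity"] * orders["unit_price"]

# Load: write the analysis-ready result to the warehouse staging area.
orders.to_parquet("warehouse_staging/orders.parquet")
```

In an ELT version of the same pipeline, raw_orders.csv would be loaded into the data lake as-is, and the transform would run later inside the target system.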

Batch and streaming processing patterns

Batch processing collects data and processes it in scheduled chunks; streaming processing handles each record continuously, as it arrives.
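For the streaming side, a minimal sketch using Amazon Kinesis Data Streams via boto3; the stream name and event payload are hypothetical.

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Send a single event as it occurs (streaming ingestion), rather than
# accumulating events for a scheduled batch job.
kinesis.put_record(
    StreamName="example-clickstream",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "action": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)
```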

AWS tools to ingest data

Amazon AppFlow

  • Provides the ability to transfer data between SaaS applications and AWS services

  • Offers reuse of existing service integrations through the available Amazon AppFlow APIs

  • Provides data transformation capabilities
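As a hedged sketch, an existing flow can be started by name with boto3; "example-salesforce-to-s3" is a hypothetical flow that would already be configured in AppFlow.

```python
import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of a flow that was configured beforehand.
response = appflow.start_flow(flowName="example-salesforce-to-s3")
print(response["flowStatus"])
```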

AWS DataSync

  • Is a fully managed data migration service

  • Simplifies, automates, and accelerates copying file and object data to and from AWS storage services

  • Is optimized for speed

  • Includes encryption and integrity validation

  • Preserves metadata when moving data
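A minimal sketch of triggering an existing DataSync task with boto3; the task ARN is a hypothetical placeholder for a task created ahead of time.

```python
import boto3

datasync = boto3.client("datasync")

# Start an execution of a pre-configured transfer task.
response = datasync.start_task_execution(
    TaskArn="arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"
)
print(response["TaskExecutionArn"])
```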

AWS Data Exchange

  • Provides customers with a way to find, subscribe to, and use third-party data in the cloud

  • Bridges the gap between providers and subscribers who exchange data by supporting data delivery through files, tables, and APIs

  • Simplifies finding, preparing, and using data in the cloud
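As one small example of the subscriber side, boto3 can list the data sets an account is entitled to; this is a sketch, not a full import workflow.

```python
import boto3

dataexchange = boto3.client("dataexchange")

# List data sets this account has subscribed to (entitled data sets).
for data_set in dataexchange.list_data_sets(Origin="ENTITLED")["DataSets"]:
    print(data_set["Name"])
```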

Processing Data in AWS

Batch ingestion and processing

Batch processing is a good fit:

  • For reporting purposes

  • When dealing with large datasets

  • When the analytics use case is focused more on aggregating or transforming data and less on real-time analysis of data
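A toy batch-aggregation sketch with pandas: one scheduled run summarizes a full day of records. The file and column names are hypothetical.

```python
import pandas as pd

# Read the whole day's extract at once; completeness matters more than
# freshness in a batch reporting job.
events = pd.read_csv("events_2024-01-15.csv")

# Aggregate the full dataset and write the report.
daily_report = events.groupby("region")["revenue"].sum()
daily_report.to_csv("reports/revenue_by_region_2024-01-15.csv")
```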

AWS Glue

  • Used for batch processing

  • Is a data integration service that helps automate and perform ETL tasks as part of ingesting data into a pipeline

  • Provides the ability to read and write data from multiple systems and databases

  • Simplifies batch and streaming ingestion
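A hedged sketch of kicking off a pre-defined Glue ETL job with boto3; "example-orders-etl" is a hypothetical job name that would already exist in Glue.

```python
import boto3

glue = boto3.client("glue")

# Start a run of an ETL job that was defined in AWS Glue beforehand.
run = glue.start_job_run(JobName="example-orders-etl")
print(run["JobRunId"])
```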

AWS Glue Components

  • Data Catalog: a central metadata repository for data sources

  • Crawlers: scan data stores and populate the Data Catalog with table definitions

  • ETL jobs: the scripts that transform and move the data

  • Triggers and workflows: schedule and orchestrate jobs

AWS Glue Transformation types

.csv

  • Is the most common format for storing tabular data

  • Isn’t efficient for storing or manipulating large amounts of data (more than 15 GB)

.parquet

  • Stores data in a columnar fashion

  • Is optimized for storage

  • Is suitable for parallel processing

  • Speeds up analytics workloads

Convert .csv to .parquet

  • Over time, saves storage space, cost, and time (see the sketch below)
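A minimal sketch of the conversion itself, using pandas (with pyarrow installed); the file names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("transactions.csv")

# Parquet stores the data column by column and compresses it, which is
# what saves storage space and speeds up analytical scans.
df.to_parquet("transactions.parquet")
```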
