Module 15
Data engineering patterns
The five Vs of data characteristics are value, veracity, volume, velocity, and variety. Each of these characteristics impacts decision-making with data.
Modernize
Move to a cloud-based infrastructure and purpose-built services to reduce undifferentiated heavy lifting.
Unify
Create a single source of truth for data, and make the data available across the organization.
Innovate
Apply artificial intelligence and machine learning (AI/ML) to find new insights in the data.
Data lake = raw/unstructured data
Data warehouse = structured data
Essentially extracts the data and stores it in the same format in which it was originally created
Works well with structured data that is destined for a data warehouse
Works well for unstructured data that is destined for a data lake
Stores data that is ready to be analyzed, so this pattern can save time for an analyst
Offers flexibility to create new queries (analysts can access more raw data)
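To make the contrast between the two ingestion patterns concrete, here is a minimal Python sketch. The record shapes and in-memory "stores" are hypothetical stand-ins for a real data warehouse and data lake.

```python
import json

raw_events = [
    {"order_id": 1, "amount": "19.99", "ts": "2024-05-01T10:15:00Z"},
    {"order_id": 2, "amount": "5.50", "ts": "2024-05-01T10:16:30Z"},
]

# Transform before loading (warehouse-style): data lands ready to analyze.
def extract_transform_load(events, warehouse_rows):
    for event in events:
        warehouse_rows.append({
            "order_id": event["order_id"],
            "amount": float(event["amount"]),  # cast to a typed column
            "ts": event["ts"],
        })

# Load as-is (lake-style): nothing is lost, so new queries can be written
# against the raw data later.
def extract_load(events, lake_objects):
    for event in events:
        lake_objects.append(json.dumps(event))  # original format preserved

warehouse, lake = [], []
extract_transform_load(raw_events, warehouse)
extract_load(raw_events, lake)
print(warehouse[0])
print(lake[0])
```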
Provides the ability to transfer data between SaaS applications and AWS services
Offers reuse of available service integrations through the Amazon AppFlow APIs
Provides data transformation capabilities
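As a rough illustration of using Amazon AppFlow programmatically, the boto3 client can start a flow that was already configured between a SaaS source and an AWS destination; the flow name below is a hypothetical placeholder, and error handling is omitted.

```python
import boto3

appflow = boto3.client("appflow", region_name="us-east-1")

# "salesforce-to-s3" is a hypothetical flow configured ahead of time.
run = appflow.start_flow(flowName="salesforce-to-s3")
print("Started execution:", run.get("executionId"))

# Review recent runs of the flow to confirm the transfer completed.
records = appflow.describe_flow_execution_records(flowName="salesforce-to-s3")
for execution in records.get("flowExecutions", []):
    print(execution["executionId"], execution["executionStatus"])
```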
Is a fully managed data migration service
Simplifies, automates, and accelerates copying file and object data to and from AWS storage services
Is optimized for speed
Includes encryption and integrity validation
Preserves metadata when moving data
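A hedged boto3 sketch: assuming an AWS DataSync task already exists between a source location and an AWS storage location, starting a transfer and checking its status looks roughly like this (the task ARN is a placeholder).

```python
import boto3

datasync = boto3.client("datasync", region_name="us-east-1")

# Hypothetical ARN of a task that copies file data into Amazon S3.
task_arn = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"

# Start a transfer run; DataSync encrypts data in transit and
# verifies integrity as part of the copy.
execution = datasync.start_task_execution(TaskArn=task_arn)

status = datasync.describe_task_execution(
    TaskExecutionArn=execution["TaskExecutionArn"]
)
print(status["Status"])
```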
Provides customers with a way to find, subscribe to, and use third-party data in the cloud
Bridges the gap between providers and subscribers who exchange data by supporting data delivery through files, tables, and APIs
Simplifies finding, preparing, and using data in the cloud
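For illustration, a subscriber can list the entitled third-party data sets with the boto3 AWS Data Exchange client; this minimal sketch assumes the subscription already exists in the account.

```python
import boto3

dataexchange = boto3.client("dataexchange", region_name="us-east-1")

# List the third-party data sets this account is entitled to use.
response = dataexchange.list_data_sets(Origin="ENTITLED")
for data_set in response["DataSets"]:
    print(data_set["Id"], data_set["Name"], data_set["AssetType"])
```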
Batch processing is used:
For reporting purposes
When dealing with large datasets
When the analytics use case is focused more on aggregating or transforming data and less on real-time analysis of data
Used for batch processing
Is a data integration service that helps automate and perform ETL tasks as part of ingesting data into a pipeline
Provides the ability to read and write data from multiple systems and databases
Simplifies batch and streaming ingestion
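A minimal AWS Glue ETL job script, as a sketch only: it assumes a Glue Data Catalog database and table (the names here are hypothetical) and writes the transformed result to Amazon S3 as Parquet.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a catalog table (hypothetical database and table names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)

# Simple transform: drop a column analysts do not need.
cleaned = orders.drop_fields(["internal_notes"])

# Write to S3 in a columnar format suited to analytics.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)

job.commit()
```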
CSV
Is the most common format to store tabular data
Isn't efficient to store or manipulate large amounts of data (more than 15 GB)
Apache Parquet
Stores data in a columnar fashion
Speeds up analytics workloads
Is optimized for storage
Over time, saves storage space, cost, and time
Is suitable for parallel processing
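To see the CSV-versus-Parquet difference in practice, a small pandas sketch converts a CSV file to Parquet and reads back a single column; the file and column names are placeholders, and pyarrow is assumed to be installed.

```python
import pandas as pd

# Read row-oriented CSV data (placeholder file name).
df = pd.read_csv("orders.csv")

# Write the same data as Parquet: columnar, compressed, and typically
# much smaller and faster to scan for analytics workloads.
df.to_parquet("orders.parquet", engine="pyarrow", index=False)

# With Parquet, reading only the columns a query needs is cheap
# (assumes the hypothetical file has an "amount" column).
amounts = pd.read_parquet("orders.parquet", columns=["amount"])
print(amounts.head())
```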