- It is a fast, powerful, and fully managed, petabyte-scale data warehouse service in AWS
- It is an RDBMS
- built specifically for OLAP
- optimized for high-performance analysis and reporting
- for very large datasets
- provides fast querying capabilities
- Uses standard SQL commands for interactive query
- connectivity via ODBC or JDBC
- It is based on PostgreSQL
- It also automatically monitors nodes and drives to support recovery from any failures.
- Has a massively parallel processing (MPP) architecture to parallelize and distribute SQL operations
- stores three copies of your data — all data written to a node in cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3.
- Snapshots are automated, incremental, and continuous and stored for a user-defined period (1-35 days)
- Manual snapshots can be created and are retained until deleted.
- Continuously monitors health of cluster
- Automatically re-replicates data from failed drives and replaces nodes as necessary.
- has three pricing components:
- data warehouse node hours – total number of hours run across all the compute nodes
- backup storage – storage cost for automated and manual snapshots
- data transfer
- There is no data transfer charge for data transferred to or from Amazon Redshift outside of a VPC
- Data transfer to or from Amazon Redshift in Amazon VPC accrues standard AWS data transfer charges.
- number of nodes can be easily scaled as per demand
Clusters and Nodes
- cluster is composed of a leader node and one or more compute nodes.
- client application interacts directly only with leader node
- compute nodes are transparent to external applications.
- Redshift supports six different node types, each with a different mix of CPU, memory, and storage.
- node types are grouped into categories
- Dense Compute – supports clusters up to 326 TB using fast SSDs
- Dense Storage – supports clusters up to 2 PB using large magnetic disks.
Data Types
- Redshift columns support a wide range of data types
- Supported data types are
- numeric – INTEGER, DECIMAL, and DOUBLE PRECISION
- text – CHAR and VARCHAR
- date – DATE and TIMESTAMP
- Additional columns can be added to a table using the ALTER TABLE command
- existing columns cannot be modified.
Compression Encoding
- Data compression is one of Redshift’s key performance optimizations
- When data is loaded into an empty table, Redshift samples the data and selects the best compression scheme for each column.
- Users can also specify the compression encoding on a per-column basis with the CREATE TABLE command.
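As a sketch, per-column encodings can be declared at table creation; the table and column names here are hypothetical, and the encodings shown (delta, lzo, zstd) are a few of those Redshift supports:

```sql
-- hypothetical table with explicit per-column compression encodings
CREATE TABLE events (
    event_id   INTEGER      ENCODE delta,  -- good for steadily increasing values
    event_type VARCHAR(64)  ENCODE lzo,
    payload    VARCHAR(512) ENCODE zstd
);
```

Columns without an explicit ENCODE clause are left to Redshift's automatic selection.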
Distribution Strategy
- determines how records are distributed across nodes and slices in a cluster
- the distribution style of a table configures how its data is partitioned to match query patterns
- When a query runs, the optimizer redistributes rows as needed to perform joins and aggregations
- a table's distribution style aims
- to minimize the impact of the redistribution step
- by putting data where it needs to be before the query is performed.
- The best distribution strategy for each table balances even data distribution against minimizing data movement
- During table creation, can choose between distribution styles: EVEN, KEY, or ALL.
- EVEN distribution – the default option; data is distributed across slices uniformly, regardless of the data values.
- KEY distribution – rows are distributed according to the values in one column; matching values are stored close together on the same slice, which improves query performance for joins.
- ALL distribution – a full copy of the entire table is distributed to every node; useful for tables that are not updated frequently.
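The styles above are chosen per table at creation time. A minimal sketch (table and column names are hypothetical):

```sql
-- KEY distribution: co-locate rows that join on customer_id on the same slice
CREATE TABLE orders (
    order_id    INTEGER,
    customer_id INTEGER,
    total       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL distribution: replicate a slow-moving lookup table to every node
CREATE TABLE regions (
    region_id INTEGER,
    name      VARCHAR(64)
)
DISTSTYLE ALL;
```

A table with no DISTSTYLE clause defaults to EVEN distribution.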
Loading Data
- For bulk operations, Redshift provides the COPY command
- avoids repeatedly calling INSERT for bulk uploads
- when loading data from S3, the COPY command can read from multiple files at the same time.
- Redshift can distribute the workload across nodes and perform the load process in parallel.
- Enable parallel processing by using a cluster with multiple nodes and multiple input files instead of a single large file
- After a bulk data load, run the VACUUM command to reorganize the data and reclaim space after deletes.
- It is also suggested to run the ANALYZE command to update table statistics.
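The load-then-maintain sequence above can be sketched as follows; the table name, S3 prefix, and IAM role ARN are placeholders:

```sql
-- bulk load: the prefix matches multiple S3 files, which Redshift reads in parallel
COPY orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- reclaim space and re-sort rows after the load, then refresh table statistics
VACUUM orders;
ANALYZE orders;
```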
Querying Data
- Redshift offers standard SQL commands to query tables
- It supports SELECT to query and join tables
- For complex queries, the query plan can be analyzed to optimize the access pattern.
- Cluster and query performance can be monitored using CloudWatch and the Redshift web console.
- To support many users, configure Workload Management (WLM) to queue and prioritize queries.
- WLM lets you define multiple queues and set the concurrency level for each queue.
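A small sketch of standard querying and plan inspection (table and column names are hypothetical):

```sql
-- standard SQL join with aggregation
SELECT r.name, SUM(o.total) AS revenue
FROM orders o
JOIN regions r ON r.region_id = o.region_id
GROUP BY r.name
ORDER BY revenue DESC;

-- inspect the query plan to spot expensive redistribution steps
EXPLAIN
SELECT r.name, SUM(o.total)
FROM orders o
JOIN regions r ON r.region_id = o.region_id
GROUP BY r.name;
```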
Snapshots
- point-in-time snapshots of a Redshift cluster can be created
- they can be used to restore a copy or create a clone of the original Redshift cluster
- They are durably stored internally in S3 by Redshift.
- Redshift supports automated and manual snapshots
- automated snapshots – Redshift periodically takes snapshots of the cluster and keeps copies for a configurable retention period.
- manual snapshots – can be created and shared across regions or with other AWS accounts; they must be explicitly deleted.
Security
- securing a Redshift cluster is similar to securing other databases in the cloud.
- security plan must include controls to protect
- infrastructure resources
- database schema
- records in table
- network access.
- Security needs to be addressed at every level to securely operate a Redshift data warehouse.
- first layer – infrastructure level: use IAM policies to limit actions such as
- to create and manage lifecycle of a cluster
- scaling
- backup
- recovery operations.
- Deploy clusters within the private IP address space of a VPC to restrict network connectivity
- Fine-grained network access can be controlled via security groups and network ACLs at the subnet level.
- At the database level, create a master user account, which can then create additional users and groups.
- Each database user can be granted permission to
- Schemas
- Tables
- other database objects
- These permissions are independent from IAM policies
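A minimal sketch of database-level access control; the user, group, and schema names are hypothetical:

```sql
-- database-level users and groups (independent of IAM policies)
CREATE USER analyst PASSWORD 'Str0ngPassw0rd';
CREATE GROUP reporting;
ALTER GROUP reporting ADD USER analyst;

-- grant read-only access to a schema and the tables in it
GRANT USAGE ON SCHEMA sales_schema TO GROUP reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_schema TO GROUP reporting;
```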