- It is a fast, powerful, and fully managed, petabyte-scale data warehouse service in AWS
- It is an RDBMS
- built specifically for OLAP
- optimized for high-performance analysis and reporting
- for very large datasets
- provides fast querying capabilities
- Uses standard SQL commands for interactive query
- connectivity via ODBC or JDBC
- It is based on PostgreSQL
- It also automatically monitors nodes and drives to support recovery from any failures.
- Has a massively parallel processing (MPP) architecture to parallelize and distribute SQL operations
- stores three copies of your data — all data written to a node in cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3.
- Snapshots are automated, incremental, and continuous and stored for a user-defined period (1-35 days)
- Manual snapshots can be created and are retained until deleted.
- Continuously monitors health of cluster
- Automatically re-replicates data from failed drives and replaces nodes as necessary.
- has three pricing components:
- data warehouse node hours – total number of hours run across all the compute nodes
- backup storage – storage cost for automated and manual snapshots
- data transfer
- There is no data transfer charge for data transferred to or from Amazon Redshift outside of a VPC
- Data transfer to or from Amazon Redshift in Amazon VPC accrues standard AWS data transfer charges.
- number of nodes can be easily scaled as per demand
Clusters and Nodes
- cluster is composed of a leader node and one or more compute nodes.
- client application interacts directly only with leader node
- compute nodes are transparent to external applications.
- Redshift supports six different node types, each with a different mix of CPU, memory, and storage.
- node types are grouped into categories
- Dense Compute – supports clusters up to 326 TB using fast SSDs
- Dense Storage – supports clusters up to 2 PB using large magnetic disks.
Data Types
- Redshift columns support a wide range of data types
- Supported data types are
- numeric – INTEGER, DECIMAL, and DOUBLE PRECISION
- text – CHAR and VARCHAR
- date – DATE and TIMESTAMP
- Additional columns can be added to a table using the ALTER TABLE command
- existing columns cannot be modified.
Compression Encoding
- Data compression is one of Redshift’s key performance optimizations
- When data is loaded into an empty table, Redshift samples the data and selects the best compression scheme for each column.
- Users can also specify the compression encoding on a per-column basis with the CREATE TABLE command.
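As a sketch, per-column encodings can be declared at table creation; the table and column names here are hypothetical, and the encodings shown (delta, lzo, zstd) are a few of those Redshift supports:

```sql
-- hypothetical table with explicit per-column compression encodings
CREATE TABLE events (
    event_id   INTEGER      ENCODE delta,  -- good for steadily increasing values
    event_type VARCHAR(64)  ENCODE lzo,
    payload    VARCHAR(512) ENCODE zstd
);
```

Columns without an explicit ENCODE clause are left to Redshift's automatic selection.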
Distribution Strategy
- determines how records are distributed across nodes and slices in a cluster
- the distribution style of a table configures how its data is partitioned to match query patterns
- When a query runs, the optimizer redistributes rows as needed to perform joins and aggregations
- a table's distribution style aims
- to minimize the impact of the redistribution step
- by putting data where it needs to be before the query is performed.
- The best distribution strategy for each table balances even data distribution against minimizing data movement
- During table creation, can choose between distribution styles: EVEN, KEY, or ALL.
- EVEN distribution – the default option; data is distributed across slices uniformly, regardless of the data values.
- KEY distribution – rows are distributed according to the values in one column; matching values are stored close together on the same slice, which improves query performance for joins.
- ALL distribution – a full copy of the entire table is distributed to every node; useful for tables that are not updated frequently.
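The styles above are chosen per table at creation time. A minimal sketch (table and column names are hypothetical):

```sql
-- KEY distribution: co-locate rows that join on customer_id on the same slice
CREATE TABLE orders (
    order_id    INTEGER,
    customer_id INTEGER,
    total       DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL distribution: replicate a slow-moving lookup table to every node
CREATE TABLE regions (
    region_id INTEGER,
    name      VARCHAR(64)
)
DISTSTYLE ALL;
```

A table with no DISTSTYLE clause defaults to EVEN distribution.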
Loading Data
- For bulk operations, Redshift provides the COPY command
- avoids repeatedly calling INSERT for bulk uploads
- when loading data from S3, the COPY command can read from multiple files at the same time.
- Redshift can distribute the workload across nodes and perform the load process in parallel.
- Enable parallel processing by using a cluster with multiple nodes and multiple input files instead of a single large file
- After a bulk data load, run the VACUUM command to reorganize the data and reclaim space after deletes.
- It is also suggested to run the ANALYZE command to update table statistics.
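The load-then-maintain sequence above can be sketched as follows; the table name, S3 prefix, and IAM role ARN are placeholders:

```sql
-- bulk load: the prefix matches multiple S3 files, which Redshift reads in parallel
COPY orders
FROM 's3://my-bucket/orders/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
GZIP;

-- reclaim space and re-sort rows after the load, then refresh table statistics
VACUUM orders;
ANALYZE orders;
```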
Querying Data
- Redshift offers standard SQL commands to query tables
- It supports SELECT to query and join tables
- For complex queries, the query plan can be analyzed to optimize the access pattern.
- Cluster and query performance can be monitored using CloudWatch and the Redshift web console.
- To support many users, configure Workload Management (WLM) to queue and prioritize queries.
- WLM lets you define multiple queues and set the concurrency level for each queue.
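A small sketch of standard querying and plan inspection (table and column names are hypothetical):

```sql
-- standard SQL join with aggregation
SELECT r.name, SUM(o.total) AS revenue
FROM orders o
JOIN regions r ON r.region_id = o.region_id
GROUP BY r.name
ORDER BY revenue DESC;

-- inspect the query plan to spot expensive redistribution steps
EXPLAIN
SELECT r.name, SUM(o.total)
FROM orders o
JOIN regions r ON r.region_id = o.region_id
GROUP BY r.name;
```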
Snapshots
- point-in-time snapshots of a Redshift cluster can be created
- they can be used to restore a copy or create a clone of the original Redshift cluster
- They are durably stored internally in S3 by Redshift.
- Redshift supports automated and manual snapshots
- automated snapshots – Redshift periodically takes snapshots of the cluster and keeps copies for a configurable retention period.
- manual snapshots – can be created and shared across regions or with other AWS accounts; they must be explicitly deleted.
Security
- securing a Redshift cluster is similar to securing other databases in the cloud.
- security plan must include controls to protect
- infrastructure resources
- database schema
- records in table
- network access.
- Security needs to be addressed at every level to securely operate a Redshift data warehouse.
- first layer – infrastructure level: use IAM policies to limit actions such as
- to create and manage lifecycle of a cluster
- scaling
- backup
- recovery operations.
- Deploy clusters within the private IP address space of a VPC to restrict network connectivity
- Fine-grained network access can be controlled via security groups and network ACLs at the subnet level.
- At the database level, create a master user account, which can then create additional users and groups.
- Each database user can be granted permission to
- Schemas
- Tables
- other database objects
- These permissions are independent from IAM policies
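A minimal sketch of database-level access control; the user, group, and schema names are hypothetical:

```sql
-- database-level users and groups (independent of IAM policies)
CREATE USER analyst PASSWORD 'Str0ngPassw0rd';
CREATE GROUP reporting;
ALTER GROUP reporting ADD USER analyst;

-- grant read-only access to a schema and the tables in it
GRANT USAGE ON SCHEMA sales_schema TO GROUP reporting;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_schema TO GROUP reporting;
```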