Learning CloudWatch Alarms

A CloudWatch alarm watches a single metric over a specified time period and initiates specified actions on your behalf.
Actions are based on the value of the metric relative to a threshold over time.
An action can be a notification sent to SNS or an Auto Scaling policy.
Alarms can be added to dashboards.
Actions are invoked only for sustained state changes.
Always select a period greater than or equal to the frequency of the metric being monitored.
You can create up to 5000 alarms per Region per AWS account.
To create or update an alarm, use the PutMetricAlarm API action.
Alarm names must contain only ASCII characters.
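As a rough illustration of what a PutMetricAlarm request carries, the sketch below only builds the request parameters as a dictionary and checks the ASCII rule for the name. The helper name `build_alarm_params`, the instance ID, the threshold, and the SNS topic ARN are placeholders chosen for this example, not values from these notes.

```python
def build_alarm_params(name, instance_id, threshold, topic_arn):
    """Return keyword arguments shaped like a PutMetricAlarm request."""
    # Alarm names must contain only ASCII characters.
    if not name.isascii():
        raise ValueError("alarm names must contain only ASCII characters")
    return {
        "AlarmName": name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,              # seconds per evaluation period
        "EvaluationPeriods": 2,     # sustained for two periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic to notify
    }


# Placeholder identifiers, for illustration only.
params = build_alarm_params(
    "high-cpu-example",
    "i-0123456789abcdef0",
    80.0,
    "arn:aws:sns:us-east-1:123456789012:example-topic",
)
print(params["AlarmName"], params["Period"] * params["EvaluationPeriods"])
```

With boto3 installed and credentials configured, the dictionary could be sent with `boto3.client("cloudwatch").put_metric_alarm(**params)`.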
List currently configured alarms with DescribeAlarms (mon-describe-alarms).
Disable or enable alarm actions with DisableAlarmActions and EnableAlarmActions.
Test an alarm by setting it to any state with SetAlarmState (mon-set-alarm-state).
View an alarm’s history with DescribeAlarmHistory (mon-describe-alarm-history).
CloudWatch saves alarm history for two weeks.
The number of evaluation periods for an alarm multiplied by the length of each evaluation period can’t exceed one day.
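The two timing rules in these notes (the period should be at least the metric's reporting frequency, and evaluation periods multiplied by period length can't exceed one day) can be checked with a little arithmetic. This is a minimal sketch; the function name `check_alarm_timing` and its arguments are made up for this example.

```python
SECONDS_PER_DAY = 86_400


def check_alarm_timing(period_s, evaluation_periods, metric_interval_s):
    """Return a list of timing problems (empty if the settings look fine)."""
    problems = []
    # The period should be >= how often the metric is reported.
    if period_s < metric_interval_s:
        problems.append("period is shorter than the metric's reporting interval")
    # Evaluation periods x period length must not exceed one day.
    if period_s * evaluation_periods > SECONDS_PER_DAY:
        problems.append("evaluation window exceeds one day")
    return problems


print(check_alarm_timing(300, 3, 300))     # 15-minute window on a 5-minute metric
print(check_alarm_timing(60, 5, 300))      # period shorter than the metric interval
print(check_alarm_timing(3600, 25, 300))   # 25 hours, more than one day
```

For example, 24 evaluation periods of 3600 seconds is exactly one day and passes, while 25 such periods does not.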
Permissions needed in AWS IAM to create or change an alarm:
- iam:CreateServiceLinkedRole, iam:GetPolicy, iam:GetPolicyVersion, and iam:GetRole — For all alarms with Amazon EC2 actions
- ec2:DescribeInstanceStatus and ec2:DescribeInstances — For all alarms on Amazon EC2 instance status metrics
- ec2:StopInstances — For alarms with stop actions
- ec2:TerminateInstances — For alarms with terminate actions
- No specific permissions are needed for alarms with recover actions.
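The permission list above can be read as a lookup from the kind of alarm to the IAM actions it needs. The sketch below encodes only what the list states; the function name `required_permissions` and its two arguments are hypothetical, and the policy document at the end uses `"Resource": "*"` purely as a placeholder.

```python
import json

EC2_ACTIONS = (None, "stop", "terminate", "recover")


def required_permissions(ec2_action=None, on_status_metrics=False):
    """List the IAM actions needed for an alarm, per the notes above."""
    if ec2_action not in EC2_ACTIONS:
        raise ValueError(f"unknown EC2 action: {ec2_action!r}")
    perms = []
    if ec2_action is not None:
        # Needed for all alarms with Amazon EC2 actions.
        perms += [
            "iam:CreateServiceLinkedRole",
            "iam:GetPolicy",
            "iam:GetPolicyVersion",
            "iam:GetRole",
        ]
    if on_status_metrics:
        # Needed for alarms on EC2 instance status metrics.
        perms += ["ec2:DescribeInstanceStatus", "ec2:DescribeInstances"]
    if ec2_action == "stop":
        perms.append("ec2:StopInstances")
    elif ec2_action == "terminate":
        perms.append("ec2:TerminateInstances")
    # "recover" adds no action-specific permission.
    return perms


stop_perms = required_permissions("stop", on_status_metrics=True)
policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": stop_perms, "Resource": "*"}],
}
print(json.dumps(policy, indent=2))
```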

CloudWatch Monitoring

Can monitor EC2 instances, Auto Scaling groups, ELBs, Route 53 health checks, EBS volumes, Storage Gateways, CloudFront, DynamoDB, ElastiCache nodes, RDS instances, EMR job flows, Redshift, SNS topics, SQS queues, OpsWorks, CloudWatch Logs, estimated charges on your AWS bill, and custom metrics and logs generated by your applications and services.
By default, EC2 instances are monitored at 5-minute intervals.
EC2 instances can be monitored at 1-minute intervals if the ‘detailed monitoring’ option is enabled on the instance.
By default, CloudWatch monitors CPU, network, disk, and status checks.
RAM utilization is a custom metric and must be added manually to EC2 instances in order to be tracked.
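One way to add memory as a custom metric is to compute it on the instance and publish it with PutMetricData. The sketch below assumes a Linux host whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`; it parses sample text rather than the real file, and the namespace `Custom/System`, the metric name `MemoryUtilization`, and the instance ID are illustrative choices.

```python
def memory_used_percent(meminfo_text):
    """Compute used-memory percentage from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            # Values look like "8000000 kB"; keep the number only.
            fields[key.strip()] = int(rest.split()[0])
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def build_memory_metric(instance_id, used_percent):
    """Return keyword arguments shaped like a PutMetricData request."""
    return {
        "Namespace": "Custom/System",  # hypothetical custom namespace
        "MetricData": [{
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": used_percent,
            "Unit": "Percent",
        }],
    }


# Sample text standing in for the contents of /proc/meminfo.
sample = (
    "MemTotal:        8000000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:    2000000 kB\n"
)
metric = build_memory_metric("i-0123456789abcdef0", memory_used_percent(sample))
print(metric["MetricData"][0]["Value"])  # 75.0
```

With boto3 available, the result could be published with `boto3.client("cloudwatch").put_metric_data(**metric)`, typically from a scheduled job on the instance.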
2 types of Status Checks:
- System Status Checks (Physical Host):
  - Checks the underlying physical host
  - Checks for loss of network connectivity
  - Checks for loss of system power
  - Checks for software issues on the physical host
  - Checks for hardware issues on the physical host
  - Best way to resolve issues is to stop the instance and start it again (will switch physical hosts)
- Instance Status Checks
  - Checks the VM itself
  - Checks for failed system status checks
  - Checks for mis-configured networking or startup configs
  - Checks for exhausted memory
  - Checks for corrupted file systems
  - Checks for an incompatible kernel
  - Best way to troubleshoot is rebooting the instance or modifying the instance OS
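The remediation advice for the two check types can be sketched as a small decision function. The input dictionary below mimics one entry of an EC2 DescribeInstanceStatus response (`SystemStatus` and `InstanceStatus`, each with a `Status` field); the function name `suggest_remediation` and the sample data are invented for this example.

```python
def suggest_remediation(status):
    """Map one instance-status entry to the remediation from the notes."""
    system = status["SystemStatus"]["Status"]
    instance = status["InstanceStatus"]["Status"]
    if system == "impaired":
        # System status check failed: the problem is on the physical host.
        return "stop and start the instance so it moves to a different physical host"
    if instance == "impaired":
        # Instance status check failed: the problem is inside the VM.
        return "reboot the instance or modify the instance OS"
    return "no action suggested"


# Hand-written sample entries, not real API output.
host_problem = {
    "InstanceId": "i-0123456789abcdef0",
    "SystemStatus": {"Status": "impaired"},
    "InstanceStatus": {"Status": "impaired"},
}
vm_problem = {
    "InstanceId": "i-0123456789abcdef0",
    "SystemStatus": {"Status": "ok"},
    "InstanceStatus": {"Status": "impaired"},
}
print(suggest_remediation(host_problem))
print(suggest_remediation(vm_problem))
```

The system check is tested first because a failed system status check can also cause the instance status check to fail.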
By default, CloudWatch metrics are stored for 2 weeks.
To keep data for longer than 2 weeks, retrieve it with the GetMetricStatistics API or use third-party tools.
Can retrieve data from any terminated EC2 or ELB instance for up to 2 weeks after its termination
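A GetMetricStatistics query covering the two-week window might look like the sketch below, which only assembles the request parameters. The helper name `build_stats_query`, the instance ID, and the choice of hourly `Average` and `Maximum` statistics are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_stats_query(instance_id, end_time, days=14, period_s=3600):
    """Return keyword arguments shaped like a GetMetricStatistics request."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end_time - timedelta(days=days),
        "EndTime": end_time,
        "Period": period_s,  # one data point per hour
        "Statistics": ["Average", "Maximum"],
    }


query = build_stats_query("i-0123456789abcdef0", datetime.now(timezone.utc))
print(query["EndTime"] - query["StartTime"])
```

With boto3 available, the query could be run as `boto3.client("cloudwatch").get_metric_statistics(**query)` and the returned data points written to longer-term storage.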
Many default metrics are reported at 1-minute intervals, but the interval can be 3 to 5 minutes depending on the service.
Custom metrics have a minimum granularity of 1 minute.
Alarms can be created to monitor any CloudWatch metric in your account.
Alarms can cover EC2 CPU utilization, ELB latency, or even changes to your AWS bill.
Within the alarm, actions can be set that trigger things like Lambda functions or SNS notifications when the alarm threshold is reached.

An alarm has three states:

OK – metric is within the threshold.
ALARM – metric is outside the threshold.
INSUFFICIENT_DATA – the alarm has just started or the metric is not available.
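These three state names are also the values accepted when testing an alarm with SetAlarmState. The sketch below validates a state and builds the request parameters; the helper name `build_set_state_params`, the alarm name, and the reason text are placeholders.

```python
ALARM_STATES = ("OK", "ALARM", "INSUFFICIENT_DATA")


def build_set_state_params(alarm_name, state, reason):
    """Return keyword arguments shaped like a SetAlarmState request."""
    if state not in ALARM_STATES:
        raise ValueError(f"state must be one of {ALARM_STATES}, got {state!r}")
    return {
        "AlarmName": alarm_name,
        "StateValue": state,
        "StateReason": reason,  # free-text explanation recorded with the change
    }


test_params = build_set_state_params(
    "high-cpu-example", "ALARM", "Manual test of alarm actions"
)
print(test_params)
```

With boto3 available, this could be sent as `boto3.client("cloudwatch").set_alarm_state(**test_params)`; the alarm returns to its real state at the next evaluation.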

Each data point reported to CloudWatch is classified as one of:

Not breaching (within the threshold)
Breaching (violating the threshold)
Missing

For each alarm, missing data points can be treated as:

notBreaching – Missing data points are treated as “good” and within the threshold.
breaching – Missing data points are treated as “bad” and breaching the threshold.
ignore – The current alarm state is maintained.
missing – The alarm doesn’t consider missing data points when evaluating whether to change state.
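The effect of the four settings can be illustrated with a deliberately simplified model. It assumes a "greater than threshold" comparison, requires every evaluated data point to breach before returning ALARM, and returns INSUFFICIENT_DATA under `missing` when no data is present; CloudWatch's actual evaluation rules are more involved, so treat this as a teaching sketch only. The function names are invented for this example, and `None` stands for a missing data point.

```python
TREATMENTS = ("notBreaching", "breaching", "ignore", "missing")


def classify(value, threshold):
    """Classify one data point; None means the data point is missing."""
    if value is None:
        return "missing"
    return "breaching" if value > threshold else "notBreaching"


def evaluate(datapoints, threshold, treat_missing_data, current_state):
    """Return the next alarm state under the simplified model."""
    if treat_missing_data not in TREATMENTS:
        raise ValueError(f"unknown setting: {treat_missing_data!r}")
    labels = [classify(v, threshold) for v in datapoints]
    if treat_missing_data in ("breaching", "notBreaching"):
        # Missing points are counted as bad or good respectively.
        labels = [treat_missing_data if l == "missing" else l for l in labels]
    else:
        # "ignore" and "missing": look only at the points that exist.
        labels = [l for l in labels if l != "missing"]
        if not labels:
            if treat_missing_data == "ignore":
                return current_state  # state is maintained
            return "INSUFFICIENT_DATA"
    return "ALARM" if all(l == "breaching" for l in labels) else "OK"


points = [90.0, None, 95.0]
for setting in TREATMENTS:
    print(setting, evaluate(points, 80.0, setting, "OK"))
```

With one missing point between two breaching ones, only `notBreaching` keeps this model's alarm in OK; with no data at all, `ignore` keeps the current state and `missing` yields INSUFFICIENT_DATA.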
