Learning CloudWatch Alarms
- Watches a single metric over a specified time period and performs one or more specified actions.
- Initiates those actions on your behalf.
- Actions are based on the value of the metric relative to a threshold over time.
- An action can be a notification sent to an SNS topic or an Auto Scaling policy.
- Alarms can be added to dashboards.
- Actions are invoked for sustained state changes only.
- Always select a period greater than or equal to the frequency of the metric being monitored.
- Can create up to 5000 alarms per Region per AWS account.
- To create or update an alarm, use the PutMetricAlarm API action.
- Alarm names must contain only ASCII characters.
- List currently configured alarms with DescribeAlarms (mon-describe-alarms).
- Disable or enable alarm actions with DisableAlarmActions and EnableAlarmActions.
- Test an alarm by setting it to any state with SetAlarmState (mon-set-alarm-state).
- View an alarm’s history with DescribeAlarmHistory (mon-describe-alarm-history).
- CloudWatch saves alarm history for two weeks.
- The number of evaluation periods for an alarm multiplied by the length of each evaluation period can’t exceed one day.
- IAM permissions needed to create or change an alarm:
  - iam:CreateServiceLinkedRole, iam:GetPolicy, iam:GetPolicyVersion, and iam:GetRole – for all alarms with Amazon EC2 actions
  - ec2:DescribeInstanceStatus and ec2:DescribeInstances – for all alarms on Amazon EC2 instance status metrics
  - ec2:StopInstances – for alarms with stop actions
  - ec2:TerminateInstances – for alarms with terminate actions
  - No specific permissions are needed for alarms with recover actions.
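The naming and timing rules above can be checked before an alarm is created. The sketch below is only an illustration: `build_alarm_params` is a hypothetical helper, the SNS topic ARN is a placeholder, and the statistic and comparison operator are arbitrary choices. The dictionary keys follow the PutMetricAlarm parameter names, so the result could be passed to an SDK call such as boto3's `put_metric_alarm(**params)`.

```python
ONE_DAY_SECONDS = 24 * 60 * 60


def build_alarm_params(name, namespace, metric, threshold, period,
                       evaluation_periods, metric_frequency, sns_topic_arn):
    """Hypothetical helper: validate the rules from the notes, then
    return a dict shaped like the PutMetricAlarm request."""
    if not name.isascii():
        raise ValueError("alarm name must contain only ASCII characters")
    if period < metric_frequency:
        raise ValueError("period must be >= the metric's reporting frequency")
    if period * evaluation_periods > ONE_DAY_SECONDS:
        raise ValueError("evaluation periods x period must not exceed one day")
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Statistic": "Average",                      # assumed for the example
        "Period": period,                            # seconds
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",  # assumed for the example
        "AlarmActions": [sns_topic_arn],
    }


params = build_alarm_params(
    name="high-cpu",
    namespace="AWS/EC2",
    metric="CPUUtilization",
    threshold=80.0,
    period=300,               # matches basic (5-minute) EC2 monitoring
    evaluation_periods=3,
    metric_frequency=300,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:example-topic",  # placeholder
)
print(params["Period"] * params["EvaluationPeriods"])  # 900 seconds evaluated in total
```

Validating locally like this only catches the constraints listed in these notes; the service itself remains the authority on what it accepts.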
CloudWatch Monitoring
- Can monitor EC2 instances, Auto Scaling groups, ELBs, Route 53 health checks, EBS volumes, Storage Gateways, CloudFront, DynamoDB, ElastiCache nodes, RDS instances, EMR job flows, Redshift, SNS topics, SQS queues, OpsWorks, CloudWatch Logs, estimated charges on your AWS bill, and custom metrics and logs generated by your applications and services
- By default, EC2 instances are monitored at 5-minute intervals
- EC2 instances are monitored at 1-minute intervals if the ‘detailed monitoring’ option is enabled on the instance
- By default CloudWatch will monitor CPU, Network, Disk, and Status Checks
- RAM utilization is a custom metric and must be added manually to EC2 instances in order to be tracked.
- 2 types of Status Checks:
  - System Status Checks (Physical Host):
    - Checks the underlying physical host
    - Checks for loss of network connectivity
    - Checks for loss of system power
    - Checks for software issues on the physical host
    - Checks for hardware issues on the physical host
    - Best way to resolve issues is to stop the instance and start it again (will switch physical hosts)
  - Instance Status Checks:
    - Checks the VM itself
    - Checks for failed system status checks
    - Checks for misconfigured networking or startup configs
    - Checks for exhausted memory
    - Checks for corrupted file systems
    - Checks for an incompatible kernel
    - Best way to troubleshoot is rebooting the instance or modifying the instance OS
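- CloudWatch originally stored metrics for 2 weeks; it now retains them for up to 15 months, at coarser granularity as the data ages (1-minute data points for 15 days, 5-minute for 63 days, 1-hour for 455 days)
- To keep data beyond the retention window, export it with the GetMetricStatistics API or with third-party tools
- Metrics from a terminated EC2 instance or a deleted ELB remain retrievable for the same retention period after termination
- Many default metrics are reported at 1-minute intervals, but this can be 3 to 5 minutes depending on the service
- Standard-resolution custom metrics have a minimum granularity of 1 minute; high-resolution custom metrics can go down to 1 second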
- Alarms can be created to monitor any CloudWatch metric in your account
- Alarms can cover metrics such as EC2 CPU utilization, ELB latency, or even charges on your AWS bill
- Within the alarm, actions can be set to trigger things like Lambda functions or SNS notifications when the alarm threshold is reached
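Because RAM utilization is not collected by default, something on the instance has to publish it as a custom metric. The sketch below builds a request body shaped like the PutMetricData API input. The `memory_metric_payload` helper, the `Custom/System` namespace, the `MemoryUtilization` metric name, and the instance ID are all assumptions made for illustration, and how the used and total byte counts are obtained is left out.

```python
from datetime import datetime, timezone


def memory_metric_payload(instance_id, used_bytes, total_bytes):
    """Hypothetical helper: build a PutMetricData-style request
    for a memory utilization percentage."""
    percent = 100.0 * used_bytes / total_bytes
    return {
        "Namespace": "Custom/System",  # assumed namespace, pick your own
        "MetricData": [
            {
                "MetricName": "MemoryUtilization",  # assumed metric name
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": round(percent, 2),
                "Unit": "Percent",
            }
        ],
    }


# Example: 6 GiB used out of 8 GiB on a placeholder instance ID.
payload = memory_metric_payload(
    "i-0123456789abcdef0", used_bytes=6 * 2**30, total_bytes=8 * 2**30
)
print(payload["MetricData"][0]["Value"])  # 75.0
```

With an SDK such as boto3, a dictionary like this could be sent with `put_metric_data(**payload)`, after which the metric can be alarmed on like any other.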
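An alarm has three states
- OK – the metric is within the threshold.
- ALARM – the metric is outside the threshold.
- INSUFFICIENT_DATA – the alarm has just started, the metric is not available, or there is not enough data to determine the alarm state.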
Each data point reported to CloudWatch is classified as
- Not breaching (within the threshold)
- Breaching (violating the threshold)
- Missing
For each alarm, missing data points can be treated as
- notBreaching – Missing data points are treated as “good” and within the threshold
- breaching – Missing data points are treated as “bad” and breaching the threshold
- ignore – The current alarm state is maintained
- missing – Missing data points are not considered when evaluating whether to change state; if all data points in the evaluation range are missing, the alarm goes to INSUFFICIENT_DATA (this is the default)
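To see how the four treatments change the resulting state, here is a deliberately simplified model. It assumes a greater-than threshold and that every point in the window must breach for the alarm to fire. `evaluate_alarm` is a hypothetical function, not CloudWatch's actual algorithm, which also looks at additional data points around the evaluation window.

```python
VALID_TREATMENTS = {"notBreaching", "breaching", "ignore", "missing"}


def evaluate_alarm(datapoints, threshold, treat_missing="missing",
                   current_state="OK"):
    """Simplified model: None marks a missing data point; the alarm
    fires only if every evaluated point is above the threshold."""
    if treat_missing not in VALID_TREATMENTS:
        raise ValueError(f"unknown treatment: {treat_missing}")
    if not datapoints:
        return "INSUFFICIENT_DATA"

    present = [v for v in datapoints if v is not None]
    has_missing = len(present) < len(datapoints)

    if treat_missing == "ignore" and has_missing:
        return current_state              # keep whatever state we had
    if treat_missing == "missing":
        if not present:
            return "INSUFFICIENT_DATA"    # nothing at all to evaluate
        breaching = [v > threshold for v in present]
    else:
        # breaching / notBreaching: substitute a verdict for each gap
        fill = treat_missing == "breaching"
        breaching = [fill if v is None else v > threshold for v in datapoints]

    return "ALARM" if all(breaching) else "OK"


window = [85.0, None, 92.0]  # CPU %, with one missing data point
for treatment in ("notBreaching", "breaching", "ignore", "missing"):
    print(treatment, evaluate_alarm(window, threshold=80.0,
                                    treat_missing=treatment))
# notBreaching OK
# breaching ALARM
# ignore OK
# missing ALARM
```

In this model the same window yields OK or ALARM depending only on the treatment, which is why the setting is worth choosing deliberately for metrics that report sporadically.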