Monitoring and Log Processing

CloudWatch

Amazon CloudWatch monitors

AWS resources
applications running on AWS

CloudWatch

Collects and tracks metrics for AWS resources and applications.
The CloudWatch home page displays metrics for every AWS service in use.
Custom dashboards can be created to display metrics.
Alarms can be configured to watch metrics and send notifications when needed.
Alarms can automatically make changes to monitored resources when a threshold is breached.

Access CloudWatch by

Amazon CloudWatch console – https://console.aws.amazon.com/cloudwatch/
AWS CLI
CloudWatch API
AWS SDKs

CloudWatch Namespaces

A CloudWatch namespace

Is a container for CloudWatch metrics.
Metrics in different namespaces are isolated from each other.
There is no default namespace.
A namespace must be specified for each data point published to CloudWatch.
Provide the namespace name when creating a metric.
Namespace names must contain valid XML characters.
Names must be fewer than 256 characters in length.
Allowed characters are: alphanumeric characters (0-9A-Za-z), period (.), hyphen (-), underscore (_), forward slash (/), hash (#), and colon (:).
AWS namespaces follow the naming convention AWS/service, for example AWS/EC2.
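
The naming rules above can be checked before publishing. The following is a minimal sketch; the function name is_valid_namespace is hypothetical, and it checks only the character set and length listed in these notes.

```python
import re

# Allowed characters as listed above: 0-9 A-Z a-z . - _ / # :
_NAMESPACE_RE = re.compile(r"[0-9A-Za-z.\-_/#:]+")


def is_valid_namespace(name: str) -> bool:
    """Return True if `name` uses only the allowed characters and is
    non-empty and fewer than 256 characters long."""
    return len(name) < 256 and _NAMESPACE_RE.fullmatch(name) is not None


print(is_valid_namespace("MyApp/Orders"))  # custom namespace
print(is_valid_namespace("My App"))        # contains a space
```

A check like this is only a local convenience; CloudWatch itself remains the authority on what it accepts.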

CloudWatch Dimensions

A dimension

Is a name/value pair.
Is part of the identity of a metric.
A metric can be given a maximum of 30 dimensions.
Used to describe a characteristic of a metric.
Also used to filter the results that CloudWatch returns.
For a few AWS services, such as EC2, CloudWatch can aggregate data across dimensions.
Example – Server=Production,Domain=City01
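
As an illustration, dimensions are passed to the CloudWatch API as a list of Name/Value entries. The sketch below converts a plain dict into that shape and shows how dimensions act as a filter. The helper names to_dimensions and matches are hypothetical, and the per-metric cap is passed in by the caller rather than hard-coded.

```python
def to_dimensions(pairs: dict, max_dimensions: int) -> list:
    """Convert {"Server": "Production"} into the Name/Value list shape
    used by the CloudWatch API."""
    if len(pairs) > max_dimensions:
        raise ValueError(f"at most {max_dimensions} dimensions are allowed")
    return [{"Name": name, "Value": value} for name, value in pairs.items()]


def matches(metric_dimensions: list, wanted: dict) -> bool:
    """Return True if every wanted name/value pair is present on the metric."""
    have = {d["Name"]: d["Value"] for d in metric_dimensions}
    return all(have.get(name) == value for name, value in wanted.items())


dims = to_dimensions({"Server": "Production", "Domain": "City01"}, max_dimensions=10)
print(dims)
print(matches(dims, {"Server": "Production"}))
```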

CloudWatch Statistics

Statistics are metric data aggregations over specified periods of time.
Aggregations are computed using the namespace, metric name, dimensions, and the data point unit of measure, within the specified time period.

Available statistics

Minimum – Lowest value observed during the period. Useful for spotting low activity.
Maximum – Highest value observed during the period. Useful for spotting high activity.
Sum – All values submitted for the matching metric added together. Indicates total activity.
Average – The value of Sum / SampleCount during the period.
SampleCount – The count (number) of data points used for the statistical calculation.
pNN.NN – Value of the specified percentile, with up to 2 decimal places (for example p95.45). Not available for metrics that include negative values.
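
To make the relationship between these statistics concrete, here is a small sketch that computes them for the data points falling in one period. The function name summarize is hypothetical; percentiles are left out because these notes do not describe how CloudWatch computes them.

```python
def summarize(values: list) -> dict:
    """Compute the basic statistics for the data points in one period."""
    if not values:
        raise ValueError("at least one data point is required")
    total = sum(values)
    count = len(values)
    return {
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": total,
        "SampleCount": count,
        "Average": total / count,  # Average is Sum / SampleCount
    }


print(summarize([2.0, 4.0, 6.0]))
```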

CloudWatch Metrics

A CloudWatch metric

Is a time-ordered set of data points published to CloudWatch.
Think of it as a variable whose value changes over time and has to be monitored.
AWS services generate data points and send metrics to CloudWatch.
Custom metrics can also be sent to CloudWatch.
Data points can be added in any order and at any rate.
Statistics about the data points are retrieved as an ordered set of time-series data.
Metrics exist only in the Region in which they were created.
Metrics cannot be deleted.
Data points expire automatically after 15 months if no new data is published.
They expire on a rolling basis; as new data points come in, data older than 15 months is dropped.
A metric is uniquely defined by its
- name
- namespace
- zero or more dimensions.
Each data point in a metric has a time stamp and (optionally) a unit of measure.
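
Publishing a custom data point might look like the sketch below. It assumes a client object exposing put_metric_data with the same keyword arguments as the boto3 CloudWatch client; the RecordingClient class, the helper names, and the MyApp/Web namespace are hypothetical stand-ins so that the example runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_datum(name, value, unit="None", dimensions=None, timestamp=None):
    """Build one data point: metric name, value, unit, time stamp, dimensions."""
    datum = {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        # Use a timezone-aware UTC time stamp when none is supplied.
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


def publish(client, namespace, datum):
    """Send one data point; `client` is assumed to behave like
    boto3.client("cloudwatch")."""
    return client.put_metric_data(Namespace=namespace, MetricData=[datum])


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
publish(client, "MyApp/Web", build_datum("PageViews", 3, unit="Count",
                                         dimensions={"Server": "Production"}))
print(client.calls[0]["Namespace"], client.calls[0]["MetricData"][0]["MetricName"])
```

With real credentials, the stand-in would be replaced by a boto3 CloudWatch client; the rest of the sketch stays the same.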

CloudWatch Metrics Time Stamps

Each metric data point must be associated with a time stamp.
The time stamp can be up to two weeks in the past and up to two hours in the future.
If no time stamp is given, CloudWatch creates one based on the time the data point was received.
Time stamps are dateTime objects.
Coordinated Universal Time (UTC) is recommended.
Statistics retrieved from CloudWatch have all times in UTC.
CloudWatch alarms check metrics against the current time in UTC.
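
The accepted time stamp window can be checked locally before publishing. This is a minimal sketch; the function name is_acceptable_timestamp is hypothetical, and it requires timezone-aware datetimes to avoid mixing local time with UTC.

```python
from datetime import datetime, timedelta, timezone

MAX_PAST = timedelta(weeks=2)    # up to two weeks in the past
MAX_FUTURE = timedelta(hours=2)  # up to two hours in the future


def is_acceptable_timestamp(ts: datetime, now: datetime = None) -> bool:
    """Return True if `ts` falls inside the window described above."""
    if ts.tzinfo is None:
        raise ValueError("use a timezone-aware datetime, preferably UTC")
    now = now or datetime.now(timezone.utc)
    return now - MAX_PAST <= ts <= now + MAX_FUTURE


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_acceptable_timestamp(now - timedelta(days=13), now))
print(is_acceptable_timestamp(now + timedelta(hours=3), now))
```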

CloudWatch Metrics Retention

CloudWatch retains metric data as follows:

Data points with a period of less than 60 seconds are available for 3 hours. These are high-resolution custom metrics.
Data points with a period of 60 seconds (1 minute) are available for 15 days.
Data points with a period of 300 seconds (5 minutes) are available for 63 days.
Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months).
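
The retention tiers above can be expressed as a simple lookup. In this sketch the function name retention_for_period is hypothetical, and the handling of periods that fall between the listed values (for example 120 seconds) is an assumption, since the notes list only the four tiers.

```python
from datetime import timedelta


def retention_for_period(period_seconds: int) -> timedelta:
    """Return how long data at the given period stays available."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    if period_seconds < 60:
        return timedelta(hours=3)   # high-resolution data
    if period_seconds < 300:
        return timedelta(days=15)   # 1-minute data
    if period_seconds < 3600:
        return timedelta(days=63)   # 5-minute data
    return timedelta(days=455)      # 1-hour data (15 months)


for period in (1, 60, 300, 3600):
    print(period, retention_for_period(period))
```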

CloudWatch Metrics Units

Each statistic has a unit of measure.
A few example metric units are
- Bytes
- Seconds
- Count
- Percent.
A unit can be specified when creating a custom metric.
If not specified, CloudWatch uses None as the unit.
CloudWatch attaches no significance to a unit internally.
Metric data points that specify a unit of measure are aggregated separately.
When statistics are requested without specifying a unit, CloudWatch aggregates all data points of the same unit together.

CloudWatch Metrics Periods

A period is the length of time associated with a specific CloudWatch statistic.
Periods are defined in seconds; valid values are 1, 5, 10, 30, or any multiple of 60.
For a period of six minutes, use 360 as the period value.
Varying the period changes how the data is aggregated.
Sub-minute periods are supported only for custom metrics with a storage resolution of 1 second.
Retrieving statistics requires
- period
- start time
- end time
The default values for the start time and end time return the last hour’s worth of statistics.
For statistics aggregated over the entire hour, specify a period of 3600.
Aggregated statistics are stamped with the time corresponding to the beginning of the period.
Periods are also important for CloudWatch alarms.
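
A short sketch of the period rules: one helper checks whether a period value is valid, and another rounds a time stamp down to the beginning of its period. The function names are hypothetical, and aligning periods to the Unix epoch is an assumption made for illustration.

```python
from datetime import datetime, timezone


def is_valid_period(seconds: int) -> bool:
    """Valid periods are 1, 5, 10, 30, or any positive multiple of 60."""
    return seconds in (1, 5, 10, 30) or (seconds > 0 and seconds % 60 == 0)


def period_start(ts: datetime, period: int) -> datetime:
    """Round a timezone-aware time stamp down to the start of its period."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % period, tz=timezone.utc)


print(is_valid_period(360))  # six minutes
print(is_valid_period(45))   # not 1, 5, 10, 30, or a multiple of 60
print(period_start(datetime(2024, 1, 1, 10, 7, 30, tzinfo=timezone.utc), 300))
```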

CloudWatch Metrics Aggregation

CloudWatch aggregates statistics according to the period length specified when retrieving them.
Publish as many data points as needed with the same or similar time stamps; CloudWatch aggregates them by period.
CloudWatch does not aggregate data across Regions.
For large datasets, publish a pre-aggregated dataset called a statistic set.
A statistic set gives CloudWatch the Min, Max, Sum, and SampleCount for a number of data points.
CloudWatch does not differentiate metrics by source.
A metric with the same namespace and dimensions is treated as a single metric, even when published from different sources.
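
A statistic set can be built locally before publishing. The sketch below uses the four fields named above; the helper names to_statistic_set and merge are hypothetical. Merging shows why these four values are enough to combine pre-aggregated data from several sources.

```python
def to_statistic_set(values: list) -> dict:
    """Pre-aggregate raw values into SampleCount, Sum, Minimum, Maximum."""
    if not values:
        raise ValueError("at least one value is required")
    return {
        "SampleCount": float(len(values)),
        "Sum": float(sum(values)),
        "Minimum": float(min(values)),
        "Maximum": float(max(values)),
    }


def merge(a: dict, b: dict) -> dict:
    """Combine two statistic sets without access to the raw values."""
    return {
        "SampleCount": a["SampleCount"] + b["SampleCount"],
        "Sum": a["Sum"] + b["Sum"],
        "Minimum": min(a["Minimum"], b["Minimum"]),
        "Maximum": max(a["Maximum"], b["Maximum"]),
    }


print(merge(to_statistic_set([1, 2, 3]), to_statistic_set([10])))
```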

CloudWatch Alarms

An alarm watches a single metric over a specified time period and performs the specified actions.
It initiates actions on your behalf.
Actions are taken based on the value of the metric relative to a threshold over a number of time periods.
An action can be a notification sent to an SNS topic or an Auto Scaling policy.
Alarms can be added to dashboards.
Actions are invoked for sustained state changes only.
Always select a period greater than or equal to the frequency of the metric being monitored.
Up to 5000 alarms can be created per Region in an AWS account.
To create or update an alarm, use the PutMetricAlarm API action.
Alarm names must contain only ASCII characters.
List the currently configured alarms with DescribeAlarms (mon-describe-alarms).
Disable or enable alarm actions with DisableAlarmActions and EnableAlarmActions.
Test an alarm by setting it to any state with SetAlarmState (mon-set-alarm-state).
View an alarm’s history with DescribeAlarmHistory (mon-describe-alarm-history).
CloudWatch saves alarm history for two weeks.
The number of evaluation periods multiplied by the length of each period must not exceed one day.
The following permissions are required to create or change a CloudWatch alarm
- For alarms with EC2 actions
  - iam:CreateServiceLinkedRole
  - iam:GetPolicy
  - iam:GetPolicyVersion
  - iam:GetRole
- For alarms on EC2 instance status metrics
  - ec2:DescribeInstanceStatus
  - ec2:DescribeInstances
- For alarms with stop actions
  - ec2:StopInstances
- For alarms with terminate actions
  - ec2:TerminateInstances
- No specific permissions are needed for alarms with recover actions.
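
As a rough sketch, the parameters for PutMetricAlarm can be assembled and sanity-checked before the call is made. The helper name build_alarm_params and the example values are hypothetical. The check assumes the total evaluation window may equal but not exceed one day (86,400 seconds), and it enforces the ASCII-only rule for alarm names.

```python
ONE_DAY_SECONDS = 86_400


def build_alarm_params(name, namespace, metric_name, threshold,
                       period=300, evaluation_periods=2,
                       statistic="Average",
                       comparison="GreaterThanThreshold",
                       dimensions=None, alarm_actions=None):
    """Build keyword arguments for a PutMetricAlarm request."""
    if not name.isascii():
        raise ValueError("alarm names must contain only ASCII characters")
    if period * evaluation_periods > ONE_DAY_SECONDS:
        raise ValueError("evaluation periods x period must not exceed one day")
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,
        "ComparisonOperator": comparison,
    }
    if dimensions:
        params["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    if alarm_actions:
        params["AlarmActions"] = list(alarm_actions)  # e.g. SNS topic ARNs
    return params


params = build_alarm_params("HighCPU", "AWS/EC2", "CPUUtilization", 80.0,
                            dimensions={"InstanceId": "i-0123456789abcdef0"})
print(params)
# With credentials configured, the request could then be sent with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```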

CloudWatch Monitoring

CloudWatch can be used to monitor
- EC2 instances
- Auto Scaling groups
- ELBs
- Route53 Health Checks
- EBS Volumes
- Storage Gateways
- CloudFront
- DynamoDB
- Other AWS services
- logs generated by applications and services.
By default, EC2 instances are monitored at 5-minute intervals (basic monitoring).
Instances are monitored at 1-minute intervals if the ‘detailed monitoring’ option is enabled on the instance.
For EC2, CloudWatch monitors the following by default
- CPU
- Network
- Disk
- Status Checks
RAM utilization metric
- is a custom metric
- has to be added manually to EC2 instances for tracking.
2 types of Status Checks:
- System Status Checks (Physical Host):
  - Checks the underlying physical host
  - Checks for loss of network connectivity
  - Checks for loss of system power
  - Checks for software issues on the physical host
  - Checks for hardware issues on the physical host
  - To resolve, stop and start the instance (this moves it to a different physical host)
- Instance Status Checks
  - Checks the VM itself
  - Checks for failed system status checks
  - Checks for mis-configured networking or startup configs
  - Checks for exhausted memory
  - Checks for corrupted file systems
  - Checks for an incompatible kernel
  - To troubleshoot, reboot the instance or modify the instance operating system
CloudWatch retains metric data for up to 15 months, depending on the period (see CloudWatch Metrics Retention above).
Use the GetMetricStatistics API action to retrieve stored metric data.
Data from a terminated EC2 instance or a deleted ELB remains available after termination for the same retention periods.
Depending on the service, default metrics are published at 1-minute, 3-minute, or 5-minute intervals.
The minimum granularity for custom metrics is 1 minute at standard resolution, or 1 second at high resolution.
Alarms can be created to monitor any CloudWatch metric in the account.
Alarms can cover metrics such as EC2 CPU utilization, ELB latency, or even charges on the AWS bill.
An alarm can specify
- actions to take
- Lambda functions or SNS notifications to trigger when a threshold is crossed

Alarm states

OK – The metric is within the defined threshold.
ALARM – The metric is outside the defined threshold.
INSUFFICIENT_DATA – The alarm has just started, the metric is not available, or not enough data is available to determine the alarm state.

Each data point reported to CloudWatch is classified as

Not breaching (within the threshold)
Breaching (violating the threshold)
Missing
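
The states and data point classifications above can be tied together in a deliberately simplified simulation. This is not how CloudWatch evaluates alarms in full (it ignores the configurable handling of missing data, for instance); the function names and the greater-than comparison are assumptions made for illustration.

```python
from typing import Optional, Sequence


def classify(value: Optional[float], threshold: float) -> str:
    """Classify one data point; None stands for a missing data point."""
    if value is None:
        return "missing"
    return "breaching" if value > threshold else "not breaching"


def evaluate(datapoints: Sequence[Optional[float]], threshold: float) -> str:
    """Return the alarm state for the most recent evaluation periods.

    Simplification: ALARM only when every data point breaches (a sustained
    change), INSUFFICIENT_DATA when nothing was reported, otherwise OK.
    """
    labels = [classify(v, threshold) for v in datapoints]
    if all(label == "missing" for label in labels):
        return "INSUFFICIENT_DATA"
    if all(label == "breaching" for label in labels):
        return "ALARM"
    return "OK"


print(evaluate([90, 95, 99], threshold=80))
print(evaluate([90, 70, 99], threshold=80))
print(evaluate([None, None], threshold=80))
```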

Logging CloudWatch API Calls with CloudTrail

CloudWatch is integrated with CloudTrail.
CloudTrail provides a record of actions taken by a user, role, or AWS service.
CloudTrail captures API calls made by or on behalf of the AWS account.
The calls captured include
- calls from the CloudWatch console
- code calls to the CloudWatch API operations.
After a trail is created, CloudTrail events are delivered continuously to an S3 bucket.
CloudWatch actions logged in CloudTrail log files are
- DeleteAlarms
- DeleteDashboards
- DescribeAlarmHistory
- DescribeAlarms
- DescribeAlarmsForMetric
- DisableAlarmActions
- EnableAlarmActions
- GetDashboard
- ListDashboards
- PutDashboard
- PutMetricAlarm
- SetAlarmState

CloudTrail

CloudTrail is a web service that records API activity in an AWS account.
It is enabled on an AWS account when the account is created.
Activity occurring in the AWS account is recorded as a CloudTrail event.
Activity from the past 90 days can be viewed, searched, and downloaded from the event history view.
It logs information on
- who made a request
- the services used
- the actions performed
- parameters for the actions
- the response elements returned by the AWS service.
Can deliver logs to a specified CloudWatch Logs log group.
Logs provide specific information on what occurred in the AWS account.
Focuses on AWS API calls made in the AWS account.
Helps in meeting compliance and regulatory standards.
Usually delivers an event within 15 minutes of the API call.
Helps enable governance, compliance, and operational and risk auditing.
Records actions taken by a user, role, or AWS service.
Events cover actions taken in the
- AWS Management Console
- AWS Command Line Interface
- AWS SDKs and APIs.
A trail is a configuration that delivers event details to a specified S3 bucket.
Trails are used to archive and analyze changes to AWS resources.
Create a trail with the
- CloudTrail console
- AWS CLI
- CloudTrail API
Types of trails
- A trail that applies to all Regions – records events in every Region. This is the default when creating a trail in the console.
- A trail that applies to one Region – records the events in that Region only. This is the default when creating a trail with the AWS CLI or CloudTrail API.
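
To illustrate the information CloudTrail logs (who made the request, the service, the action, its parameters, and the response), here is a sketch that pulls those fields out of one event record. The field names follow the CloudTrail record format; the sample record, its ARN and account number, and the helper name summarize_event are made up for illustration.

```python
import json


def summarize_event(record: dict) -> dict:
    """Extract the who/what/when fields from one CloudTrail event record."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn") or identity.get("type"),
        "service": record.get("eventSource"),
        "action": record.get("eventName"),
        "parameters": record.get("requestParameters"),
        "response": record.get("responseElements"),
        "when": record.get("eventTime"),
        "region": record.get("awsRegion"),
    }


# Hypothetical record, trimmed to the fields used above.
SAMPLE = json.loads("""
{
  "eventTime": "2024-01-15T12:00:00Z",
  "eventSource": "monitoring.amazonaws.com",
  "eventName": "PutMetricAlarm",
  "awsRegion": "us-east-1",
  "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"},
  "requestParameters": {"alarmName": "HighCPU"},
  "responseElements": null
}
""")

print(summarize_event(SAMPLE))
```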

CloudWatch Logs

Monitor existing system, application, and custom logs in real time.
Send existing logs to CloudWatch, create patterns to look for in the logs, and alert when those patterns are found.
Free agents are available for Ubuntu, Amazon Linux, and Windows.
Purpose
- Monitor logs from EC2 instances in real time (for example, track the number of errors in application logs and send a notification if it exceeds a threshold).
- Monitor AWS CloudTrail logged events (API activity such as manual EC2 instance termination).
- Archive log data (change the log retention setting to delete old data automatically).
Log Events – A record stored in CloudWatch Logs, consisting of a timestamp and a message.
Log Streams – A sequence of log events that share the same source (for example, Apache access logs). Empty log streams older than 2 months may be deleted automatically.
Log Groups – A group of log streams that share the same settings for
- retention
- monitoring
- access control
Metric Filters – Define how the service extracts metric observations from log events and turns them into data points for a CloudWatch metric.
Retention Settings – Control how long log events are kept. Expired log events are deleted automatically.
Log group retention can be set from 1 day to 10 years.
CloudWatch Log Filters – Filter log data pushed to CloudWatch. They do not apply to existing log data, only to data received after the filter is created, and they return only the first 50 results.
A metric filter contains 1. Filter Pattern 2. Metric Name 3. Metric Namespace 4. Metric Value
Modify the rsyslog configuration (/etc/rsyslog.d/50-default.conf) to remove auth on line number 9, then run sudo service rsyslog restart.
Real-Time Log Processing – Requires subscription filters; supported destinations are AWS Kinesis Streams, AWS Lambda, and AWS Kinesis Firehose.
The aws kinesis command is used to create and describe the stream; the describe output also lists the stream ARN. Then update the permissions.json file with the ARNs of the stream and role.
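
For the Lambda destination, a subscription filter delivers log events as a base64-encoded, gzip-compressed JSON document under the awslogs.data key. The sketch below decodes such a payload and counts matching messages, loosely mirroring what a metric filter does. The helper names and the demo payload are hypothetical, and encode_for_demo exists only so the example can run without AWS.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> dict:
    """Decode the payload CloudWatch Logs hands to a Lambda function."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def count_matching(payload: dict, needle: str) -> int:
    """Count log events whose message contains `needle`."""
    return sum(1 for e in payload["logEvents"] if needle in e["message"])


def encode_for_demo(payload: dict) -> dict:
    """Build a Lambda-style event from a payload (demo helper only)."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


demo = {
    "logGroup": "demo-group",
    "logStream": "demo-stream",
    "logEvents": [
        {"id": "1", "timestamp": 0, "message": "ERROR disk full"},
        {"id": "2", "timestamp": 1, "message": "INFO request served"},
    ],
}
decoded = decode_subscription_event(encode_for_demo(demo))
print(count_matching(decoded, "ERROR"))
```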

Advanced tasks with CloudTrail log files

Create multiple trails per region.
CloudWatch Logs can be used to monitor CloudTrail log files.
Share log files between accounts.
Log processing applications can be developed in Java by using CloudTrail Processing Library.
Validate log files to verify that they have not changed after delivery by CloudTrail.

To receive CloudTrail log files from multiple regions

Sign in to the AWS Management Console and open the CloudTrail console at https://console.aws.amazon.com/cloudtrail/.
Choose “Trails”, and then select a trail name.
Click the pencil icon next to “Apply trail to all regions”, and then select “Yes”.
Choose Save. The trail is then applied in all AWS Regions, and CloudTrail delivers log files from all Regions to the S3 bucket.
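
The same change can be made programmatically. The sketch below assumes a client exposing update_trail with the Name and IsMultiRegionTrail arguments of the boto3 CloudTrail client; the RecordingClient class and the trail name are hypothetical stand-ins so that the example runs without AWS credentials.

```python
def apply_trail_to_all_regions(client, trail_name: str) -> dict:
    """Turn an existing single-Region trail into a trail for all Regions.

    `client` is assumed to behave like boto3.client("cloudtrail").
    """
    return client.update_trail(Name=trail_name, IsMultiRegionTrail=True)


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def update_trail(self, **kwargs):
        self.calls.append(kwargs)
        return {"Name": kwargs["Name"],
                "IsMultiRegionTrail": kwargs["IsMultiRegionTrail"]}


client = RecordingClient()
print(apply_trail_to_all_regions(client, "management-events"))
```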
