Introduction
Alerting notifies you when something is wrong with your cloud applications so that you can fix it promptly. An alerting policy in Cloud Monitoring describes the circumstances under which you want to be notified and how you want to be notified. This page summarizes how alerting policies behave. Metric-based alerting policies monitor the metric data that Cloud Monitoring collects, and most of the Cloud Monitoring documentation on alerting refers to metric-based alerting policies.
This article continues Introduction to Alerting, Part 1. Here we will look at the behavior of metric-based alerting policies and at how to add severity levels to alerting policies. So let's dive into the details of alerting.
The behavior of metric-based alerting policies
This document explains how an alerting policy combines multiple conditions, how it handles missing data points, and how the alignment period and duration settings influence when a condition is met. It also covers the maximum number of incidents a policy can have open at once, the number of notifications sent for each incident, and the reasons notifications can be delayed.
Alignment period and duration settings
When you define a condition for an alerting policy, you set two fields: the alignment period and the duration window. This section briefly defines these fields.
Alignment period
The alignment period is a lookback interval measured from a particular point in time. For instance, if the alignment period is five minutes, then at 1:00 PM it contains the samples received between 12:55 PM and 1:00 PM. At 1:01 PM the alignment period slides forward by one minute and contains the samples received between 12:56 PM and 1:01 PM.
You configure the alignment fields with the Rolling window and Rolling window function menus in the New condition dialog.
To show how the alignment period affects a condition in an alerting policy, consider a condition that monitors a metric with a sampling period of one minute. Assume that the alignment period is five minutes and that the aligner is set to sum. The condition is met, or active, when the aligned value of the time series is greater than two for at least three minutes. For this example, assume the condition is evaluated every minute.
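To make these settings concrete, here is a minimal sketch of this condition using the Cloud Monitoring API's Python client (google-cloud-monitoring). The metric filter is an illustrative assumption, not part of the example above, and exact class paths can vary slightly between client-library versions.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Rolling window (alignment period): 5 minutes; Rolling window function (aligner): sum.
# At each evaluation, the samples received in the trailing five minutes are summed.
aggregation = monitoring_v3.Aggregation(
    alignment_period=duration_pb2.Duration(seconds=300),
    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
)

# A threshold condition that is met when the aligned (summed) value stays above 2
# for at least three minutes.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Aligned value above 2",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Illustrative filter; substitute the metric you actually monitor.
        filter='metric.type="custom.googleapis.com/example_metric" AND resource.type="gce_instance"',
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=2,
        duration=duration_pb2.Duration(seconds=180),  # 3-minute duration window
        aggregations=[aggregation],
    ),
)
```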
Duration window
You use the duration, or duration window, to prevent a condition from being met by a single measurement. You set the duration window with the Retest window field in the Configure trigger step. Each time a measurement doesn't satisfy the condition, the duration window resets. The following example illustrates this behavior:
Example: This policy specifies a five-minute duration window.
If HTTP response latency is higher than two seconds, and the latency stays above that threshold for five minutes, open an incident and email the support team.
The sequence that follows demonstrates how the duration window influences the assessment of the condition:
- HTTP latency is less than two seconds.
- For the next three minutes, HTTP latency exceeds two seconds.
- In the next measurement, latency drops below two seconds, so the condition resets the duration window.
- For the following five minutes, HTTP latency exceeds two seconds, so the condition is met and the policy triggers: an incident is opened and the support team is emailed.
Set the duration window so that it's long enough to reduce false positives but short enough to ensure that incidents are opened promptly.
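The sketch below shows how the latency example above might be expressed in the API, again with the Python client. The duration field carries the five-minute duration (retest) window; the latency metric filter and the one-minute mean aggregation are illustrative assumptions.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Condition for the example above: met only after HTTP latency has stayed
# above 2 seconds for a full 5-minute duration (retest) window.
latency_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="HTTP latency above 2 s for 5 minutes",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Hypothetical latency metric; replace with the metric your service exports.
        filter='metric.type="custom.googleapis.com/http/latency" AND resource.type="gce_instance"',
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=2.0,                          # seconds
        duration=duration_pb2.Duration(seconds=300),  # 5-minute duration window
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
            )
        ],
    ),
)
```

A single measurement below two seconds resets the window, so no incident opens until the threshold has been violated continuously for the full five minutes.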
Select the alignment period and duration window
The conditions of an alerting policy are evaluated at a fixed frequency; the choices you make for the alignment period and duration window don't change how often a condition is evaluated.
As the previous example shows, the alignment period determines how many data samples are combined by the aligner. Choose a long alignment period to combine many samples, or a short one to restrict the interval to a single sample. The duration window, by contrast, specifies how long the aligned values must exceed the threshold before the condition is met. Set the duration window to zero to let the condition be met as soon as a single aligned value is higher than the threshold.
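In API terms, a duration window of zero corresponds to a zero-length duration on the condition. A tiny sketch, reusing the hypothetical latency_condition from the earlier example:

```python
from google.protobuf import duration_pb2

# A zero-length duration window: a single aligned value above the threshold
# is enough to satisfy the condition immediately (no retest window).
latency_condition.condition_threshold.duration = duration_pb2.Duration(seconds=0)
```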
Policies with multiple conditions
Up to six conditions can be included in an alerting policy.
If you use the Cloud Monitoring API, or if your alerting policy contains multiple conditions, you must specify when an incident is opened. In the Google Cloud console, you configure these combiner options in the Multi-condition trigger step.
Each option in the Google Cloud console corresponds to a value of the combiner field in the Cloud Monitoring API: Any condition is met corresponds to OR, All conditions are met even for different resources for each condition corresponds to AND, and All conditions are met corresponds to AND_WITH_MATCHING_RESOURCE.
Example
Consider a Google Cloud project with two VM instances, vm1 and vm2. Assume that you develop an alerting policy with the following two conditions:
- The condition CPU usage is too high monitors the CPU usage of the instances. It is met when the CPU usage of any instance is higher than 100 ms/s for one minute.
- The condition Excessive utilization monitors the CPU utilization of the instances. It is met when the utilization of any instance is higher than 60% for one minute.
Assume at first that neither condition is met. Then suppose that for one minute the CPU usage of vm1 exceeds 100 ms/s. The condition CPU usage is too high is met because its threshold has been exceeded for one minute. If the conditions are combined with Any condition is met, an incident is created. If the conditions are combined with All conditions are met or with All conditions are met even for different resources for each condition, no incident is created, because those combiners require both conditions to be met.
Next, assume that the CPU usage of vm1 stays above 100 ms/s and that the CPU utilization of vm2 exceeds 60% for one minute, so both conditions are met. What happens depends on how the conditions are combined (a code sketch of this two-condition policy follows the list):
- Any condition is met: An incident is created when a resource causes a condition to be met. In this example, vm2 causes the condition Excessive utilization to be met, so an incident is created. If vm2 also causes the condition CPU usage is too high to be met, another incident is created, because the events in which vm1 and vm2 cause that condition to be met are treated as distinct.
- All conditions are met even for different resources for each condition: An incident is created, because both conditions are met, even though they're met by different resources.
- All conditions are met: No incident is created, because this combiner requires the same resource to cause all conditions to be met. Here, vm1 causes CPU usage is too high to be met while vm2 causes Excessive utilization to be met.
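Here is a hedged sketch of how the two conditions from this example might be combined into one policy through the API, using the Python client. The metric filters, threshold values, one-minute aggregation, helper function, and project ID are all illustrative assumptions.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()


def threshold_condition(name, metric_filter, threshold):
    """Build a one-minute metric-threshold condition (illustrative helper)."""
    return monitoring_v3.AlertPolicy.Condition(
        display_name=name,
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=metric_filter,
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=threshold,
            duration=duration_pb2.Duration(seconds=60),
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period=duration_pb2.Duration(seconds=60),
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                )
            ],
        ),
    )


policy = monitoring_v3.AlertPolicy(
    display_name="VM CPU alerts",
    # OR = "Any condition is met"; AND = "All conditions are met even for
    # different resources"; AND_WITH_MATCHING_RESOURCE = "All conditions are met".
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        threshold_condition(
            "CPU usage is too high",
            'metric.type="compute.googleapis.com/instance/cpu/usage_time" AND resource.type="gce_instance"',
            0.1,   # 100 ms/s expressed as a fraction of a second per second
        ),
        threshold_condition(
            "Excessive utilization",
            'metric.type="compute.googleapis.com/instance/cpu/utilization" AND resource.type="gce_instance"',
            0.6,   # 60%
        ),
    ],
)

created = client.create_alert_policy(
    name="projects/my-project",   # hypothetical project ID
    alert_policy=policy,
)
print(created.name)
```

With combiner set to OR, each VM that causes a condition to be met opens its own incident, which matches the "Any condition is met" behavior described above.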
Partial metric data
When time-series data stops arriving or is delayed, Monitoring classifies the data as missing; missing data can prevent policies from opening incidents and from closing them. Data from third-party cloud providers can be delayed by as much as 30 minutes, with delays of 5 to 15 minutes being the most common. A long delay, longer than the duration window, can put conditions into an "unknown" state. By the time the data finally arrives, Monitoring might have lost part of the recent history of the conditions. Later inspection of the time-series data doesn't reveal this problem, because there is no evidence of the delay once the data arrives.
When data stops arriving, Monitoring evaluates metric-threshold conditions by using two configurable fields:
- The Evaluation of missing data field, which you set in the Condition trigger step, determines how Monitoring treats missing data when it evaluates the condition. This field is disabled when the retest window is set to No retest.
- The Incident autoclose duration field specifies how long Monitoring waits after data stops arriving before it closes an open incident. You set the auto-close duration in the Notification step; the default is seven days.
The choices for the missing-data field let you treat missing points as values that violate the condition, treat them as values that don't violate the condition, or evaluate the condition by using only the data that did arrive.
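In recent versions of the API and Python client, these settings map to the evaluation_missing_data field on a metric-threshold condition and to the auto_close field of the policy's alert strategy. A minimal sketch, with an illustrative filter and an assumed one-day auto-close duration:

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Treat missing data as values that do NOT violate the condition ...
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU usage is too high",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter='metric.type="compute.googleapis.com/instance/cpu/usage_time" AND resource.type="gce_instance"',
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.1,
        duration=duration_pb2.Duration(seconds=300),
        evaluation_missing_data=(
            monitoring_v3.AlertPolicy.Condition.EvaluationMissingData.EVALUATION_MISSING_DATA_INACTIVE
        ),
    ),
)

# ... and auto-close incidents after one day without data instead of the
# seven-day default.
policy = monitoring_v3.AlertPolicy(
    display_name="CPU alert with explicit missing-data handling",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(
        auto_close=duration_pb2.Duration(seconds=86400),  # 24 hours
    ),
)
```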
You can reduce problems caused by missing data by doing any of the following:
- Contact your third-party cloud provider to find out how to reduce metric-collection latency.
- Use longer duration windows where applicable. Longer duration windows make your alerting policies less responsive.
- Choose metrics that have a lower collection delay:
  - Monitoring agent metrics, particularly when the agent runs on VM instances in third-party clouds.
  - Custom metrics, when you write their data directly to Cloud Monitoring (see the sketch after this list).
  - Logs-based metrics, if log collection isn't delayed.
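As an illustration of writing custom-metric data directly to Cloud Monitoring, here is a short Python sketch. The metric type and project ID are hypothetical; data written this way skips third-party ingestion, so policies that alert on it can react sooner.

```python
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"   # hypothetical project ID

# One gauge point for a hypothetical custom latency metric.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/http/latency"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 1.7}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```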
Notifications and incidents per policy
When an alerting policy is disabled, it doesn't create incidents or send notifications.
When an alerting policy is enabled, it can create incidents and send notifications. This section describes the limit on the number of incidents a policy can have open at the same time and explains when you might receive several notifications about the same issue.
Number of open incidents per policy
An alerting policy that applies to many resources can open an incident for each resource when a problem affects all of them: an incident is opened for each time series that causes a condition to be met. To avoid overloading the system, a single policy can have at most 5,000 incidents open simultaneously.
For example, consider a policy that applies to 20,000 Compute Engine instances, each of which causes the alerting condition to be met. Monitoring limits the policy to 5,000 open incidents; the remaining time series that meet the condition are ignored until some of the policy's open incidents are resolved.
Note: If an incident is open for an alerting policy and the time series that caused it crosses the condition threshold again while the incident is still open, another incident is not opened. Within an alerting policy, Cloud Monitoring keeps at most one open incident per time series.
Number of notifications per incident
You automatically receive a notification when a time series causes a condition to be met. You might receive multiple notifications when any of the following are true:
- A condition monitors multiple time series.
- A policy contains multiple conditions:
  - All conditions are met: When all conditions are met, the policy sends a notification and creates an incident for each time series that causes a condition to be met. For example, for a policy with two conditions that each monitor a different time series, you get two incidents and two notifications when the policy triggers.
  - Any condition is met: The policy sends a notification each time a new combination of conditions is met. For example, suppose condition A is met, an incident opens, and a notification is sent. If the incident is still open when a subsequent measurement causes both condition A and condition B to be met, another notification is sent.
Note: When a policy has multiple conditions, you can't configure Cloud Monitoring to create a single incident and send a single notification.
Alerting policies created with the Cloud Monitoring API notify you both when a condition is met and when it stops being met. Alerting policies created with the Google Cloud console notify you by default when an incident is opened but not when it is closed; you can enable incident-closure notifications.
Note: You can't change how incident-closure notifications are delivered by using the Monitoring API.
Notifications for disabled alerting policies
You can temporarily suspend an alerting policy by disabling it and re-enabling it later. For instance, you might disable the alerting policies that monitor a virtual machine (VM) before performing maintenance on it.
Disabling an alerting policy prevents it from opening or closing incidents, but Cloud Monitoring continues to evaluate its conditions and record the results. After you disable an alerting policy, silence the associated incidents to resolve any that remain open.
When a disabled policy is re-enabled, Monitoring evaluates the values of all conditions over the most recent duration window, which can include data collected before, during, and after the disabled interval. As a result, policies can trigger immediately after being re-enabled, even with long duration windows.
For example, suppose a monitored process needs to be down for 20 minutes. If the process is restarted and the alerting policy is re-enabled immediately, Monitoring sees that the process wasn't running during the most recent duration window (the past five minutes, in this example) and opens an incident. A sketch of disabling and re-enabling a policy through the API follows.
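Here is a minimal Python sketch of toggling a policy's enabled flag before and after maintenance; the policy name is a hypothetical example, and the helper function is purely illustrative.

```python
from google.cloud import monitoring_v3
from google.protobuf import field_mask_pb2

client = monitoring_v3.AlertPolicyServiceClient()
policy_name = "projects/my-project/alertPolicies/1234567890"  # hypothetical policy


def set_enabled(name: str, enabled: bool) -> None:
    """Update only the 'enabled' field of an existing alerting policy."""
    policy = monitoring_v3.AlertPolicy(name=name)
    policy.enabled = enabled
    client.update_alert_policy(
        alert_policy=policy,
        update_mask=field_mask_pb2.FieldMask(paths=["enabled"]),
    )


set_enabled(policy_name, False)   # disable before maintenance
# ... perform maintenance on the VM ...
set_enabled(policy_name, True)    # re-enable; the most recent window is re-evaluated
```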
Notification latency
Notification latency is the time that elapses between when a problem first starts and when a policy is triggered.
The following activities and configuration choices influence the overall notification latency:
- Metric collection delay: The time Cloud Monitoring needs to collect metric values. Most Google Cloud metric values become visible about 60 seconds after collection, although the delay varies by metric, and alerting-policy calculations add roughly another 60 to 90 seconds. For AWS CloudWatch metrics, the visible delay can be several minutes. For uptime checks, the delay averages two minutes (measured from the end of the duration window).
- Duration window: The window configured for the condition. A condition is met only when it is true for the entire window. For example, with a five-minute duration window, the notification is delayed by at least five minutes from when the problem first occurs.
- Time for notification to arrive: Notification channels such as email and SMS can experience network or other delays (unrelated to what is being delivered), sometimes approaching minutes. On some channels, such as SMS and Slack, there is no guarantee that messages will be delivered.