Introduction
In this article, we discuss how to use the API in Cloud Monitoring. We look in detail at alerting policies in the Monitoring API and how to manage them, and we also see how to manage notification channels through the API.
Using API in Cloud Monitoring
We can use the Monitoring API to access more than 1,500 Cloud Monitoring metrics from Google Cloud and Amazon Web Services. We can also create our own custom metrics and use groups to organize our cloud resources.
Alerting policies in the Monitoring API
Alerting policies are represented by the AlertPolicy object in the Cloud Monitoring API. This object describes a set of conditions that indicate a potentially unhealthy state of the system. This section describes how the Monitoring API represents alerting policies and the types of conditions that the API makes available for them.
Structure of an alerting policy
The AlertPolicy structure describes the components of an alerting policy. When we create a policy, either by using the Google Cloud console or the Monitoring API, we specify values for the following AlertPolicy fields:
- displayName: A descriptive name for the policy.
- documentation: Any information provided here is included in notifications to help responders.
- userLabels: Any user-defined labels attached to the policy. To learn more about using labels, see here.
- conditions[]: An array of Condition structures.
- combiner: A logical operator that determines how multiple conditions are combined.
- notificationChannels[]: An array of notification channel resource names.
- alertStrategy: Describes how quickly Monitoring closes incidents when data stops arriving.
We might use other fields, depending on the conditions we create.
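To make these fields concrete, here is a minimal sketch using the Python client library (the same library used in the samples later in this article). The metric filter, label values, and channel ID are illustrative placeholders rather than values from a real project, and the auto-close setting assumes a reasonably recent version of the library:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Illustrative values only; the filter, labels, and channel ID are placeholders.
policy = monitoring_v3.AlertPolicy(
    display_name="High CPU utilization",                          # displayName
    documentation=monitoring_v3.AlertPolicy.Documentation(        # documentation
        content="CPU has been above 80% for 5 minutes; check recent deployments.",
        mime_type="text/markdown",
    ),
    user_labels={"team": "backend"},                               # userLabels
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,  # combiner
    conditions=[                                                   # conditions[]
        monitoring_v3.AlertPolicy.Condition(
            display_name="CPU above 80%",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='metric.type="compute.googleapis.com/instance/cpu/utilization" '
                       'AND resource.type="gce_instance"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.8,
                duration=duration_pb2.Duration(seconds=300),
            ),
        )
    ],
    notification_channels=[                                        # notificationChannels[]
        "projects/a-gcp-project/notificationChannels/CHANNEL_ID"
    ],
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(        # alertStrategy
        auto_close=duration_pb2.Duration(seconds=1800)
    ),
)

Passing this object to the alertPolicies.create method, shown later in this article, creates the policy in a project.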
Alerting policies created by using the Monitoring API send notifications both when the conditions that trigger the policy are met and when those conditions are no longer met. This is the default behavior, and we can't change it by using the Monitoring API, but we can disable notifications about closed incidents by editing the policy in the Google Cloud console. To disable notifications about closing an incident, in the Notifications section, clear the Notify me when an incident closes checkbox and save the edited policy.
When we create or modify an alerting policy, Monitoring also sets other fields, including the name field. The value of the name field is the resource name of the alerting policy, which uniquely identifies the policy. The resource name has the following format:
projects/PROJECT_ID/alertPolicies/POLICY_ID
The conditions are the most variable part of an alerting policy.
Types of conditions in the API
The Cloud Monitoring API supports a number of condition types in the Condition structure. There are several condition types for metric-based alerting policies and one condition type for log-based alerting policies.
The following sections describe the available condition types.
Conditions for metric-based alerting policies
To create an alerting policy that monitors metric data, including log-based metrics, we can use the following condition types:
Filter-based metric conditions
The MetricAbsence and MetricThreshold conditions use a monitoring filter to select the time series data to monitor. The other fields in the condition structure specify how the data is filtered, grouped, and aggregated. For more information, visit Filtering and aggregation: manipulating time series.
With the MetricAbsence condition type, we can create a condition that triggers only when all time series are absent by using the aggregations field to aggregate the time series into a single time series. See the MetricAbsence reference in the API documentation.
A metric-absence alerting policy requires that some data has been written to the metric previously; the condition triggers when no new data arrives for the specified duration.
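As an illustration, the following Python sketch builds one condition of each filter-based type. The metric filter, thresholds, and durations are hypothetical placeholders that would need to be adapted to real metrics:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Hypothetical filter: the metric type and resource type are placeholders.
FILTER = (
    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
    'AND resource.type="gce_instance"'
)

# MetricThreshold: trigger when the filtered series stays above 0.9 for 5 minutes.
threshold_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU utilization above 90%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=FILTER,
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.9,
        duration=duration_pb2.Duration(seconds=300),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=60),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
            )
        ],
    ),
)

# MetricAbsence: trigger when no data arrives for 15 minutes. Aggregating all
# time series into one (cross_series_reducer) means the condition fires only
# when every time series is absent.
absence_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU metric data is absent",
    condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
        filter=FILTER,
        duration=duration_pb2.Duration(seconds=900),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_MEAN,
            )
        ],
    ),
)

Either condition can then be placed in the conditions[] array of an AlertPolicy.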
MQL-based metric conditions
The MonitoringQueryLanguageCondition condition uses Monitoring Query Language (MQL) to select and manipulate the time series data to be monitored. We can use this condition type to create alerting policies that compare values against thresholds or test for missing values. If we use the MonitoringQueryLanguageCondition condition, it must be the only condition in the alerting policy. For more information, visit Alerting policies with MQL.
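A rough Python sketch of an MQL-based condition is shown below. The MQL query is illustrative only; a real query usually needs the correct table name, units, and grouping for the metric being monitored:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Illustrative MQL; the table, grouping, and unit annotation are placeholders
# and may need adjustment for a real metric.
MQL_QUERY = """
fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| every 1m
| condition value_utilization_mean > 0.9 '10^2.%'
"""

mql_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU utilization above 90% (MQL)",
    condition_monitoring_query_language=(
        monitoring_v3.AlertPolicy.Condition.MonitoringQueryLanguageCondition(
            query=MQL_QUERY,
            duration=duration_pb2.Duration(seconds=300),
        )
    ),
)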
Conditions for alerting on ratios
We can create a metric-threshold alerting policy to monitor the ratio of two metrics, using either the MetricThreshold or the MonitoringQueryLanguageCondition condition type. We can also use MQL directly in the Google Cloud console. However, we cannot create or manage ratio-based conditions by using the graphical interface for creating threshold conditions.
It is recommended to use MQL to create ratio-based alerting policies. MQL lets us build more powerful and flexible queries than we can create by using the MetricThreshold condition type and monitoring filters. For example, with the MonitoringQueryLanguageCondition condition we can compute the ratio of a gauge metric to a delta metric. For more examples, visit MQL alerting-policy examples.
When using the MetricThreshold condition, the numerator and denominator of the ratio must have the same MetricKind.
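For reference, a filter-based ratio condition uses the denominatorFilter and denominatorAggregations fields of MetricThreshold. The following Python sketch is illustrative only; the custom metric type and its status label are hypothetical:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Hypothetical metric: a CUMULATIVE count of RPCs with a "status" label.
ratio_condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Error ratio above 10%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Numerator: only the error RPCs.
        filter=(
            'metric.type="custom.googleapis.com/rpc_count" '
            'AND resource.type="global" '
            'AND metric.labels.status!="OK"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
            )
        ],
        # Denominator: all RPCs.
        denominator_filter=(
            'metric.type="custom.googleapis.com/rpc_count" '
            'AND resource.type="global"'
        ),
        denominator_aggregations=[
            monitoring_v3.Aggregation(
                alignment_period=duration_pb2.Duration(seconds=300),
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
                cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.1,
        duration=duration_pb2.Duration(seconds=0),
    ),
)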
It is generally best to use label values to calculate ratios based on the time series collected for a single metric type. Ratios calculated across two different metric types can be anomalous due to different sampling periods and alignment windows.
Suppose we have two different metric types, total RPC count and error RPC count, and we want to compute the ratio of error RPCs to total RPCs. A failed RPC is counted in the time series of both metric types. Therefore, when the time series are aligned, a failed RPC might not appear in the same alignment interval in both time series.
There are several possible reasons for this difference:
- Because two different time series record the same event, there are two underlying counter values implementing the recording, and they are not updated atomically.
- The sampling rates might differ. When the time series are aligned to a common period, the counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.
Differences in the number of values in corresponding alignment intervals can lead to nonsensical error/total ratio values such as 1/0 or 2/1. Ratios of larger numbers are less likely to produce nonsensical values. We can get larger numbers by aggregating, either by using an alignment window that is longer than the sampling period or by grouping data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval: a two-point discrepancy is more significant when the expected number of points in an interval is 3 than when it is 300.
When using built-in metric types, we might have no choice but to compute ratios across metric types to get the value we need.
If we are designing custom metrics that might count the same thing in two different metrics, such as RPCs that return an error status, consider instead a single metric that includes each count only once. Suppose we are counting RPCs and want to track the ratio of failed RPCs to all RPCs. Rather than creating two metric types, we create a single metric type that counts RPCs and use a label to record the status of the call, including the "OK" status. Each status value, error or "OK", is then recorded by updating a single counter for that case.
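As a sketch of this design, the following Python snippet defines such a single metric type with a status label; the metric name custom.googleapis.com/rpc_count and its label are hypothetical examples, not values from this article:

from google.api import label_pb2 as ga_label
from google.api import metric_pb2 as ga_metric
from google.cloud import monitoring_v3

def create_rpc_count_metric(project_id: str) -> ga_metric.MetricDescriptor:
    """Define a single custom counter for RPCs, labeled by status."""
    client = monitoring_v3.MetricServiceClient()

    descriptor = ga_metric.MetricDescriptor()
    descriptor.type = "custom.googleapis.com/rpc_count"  # hypothetical name
    descriptor.metric_kind = ga_metric.MetricDescriptor.MetricKind.CUMULATIVE
    descriptor.value_type = ga_metric.MetricDescriptor.ValueType.INT64
    descriptor.description = "Count of RPCs, labeled by status."

    status_label = ga_label.LabelDescriptor()
    status_label.key = "status"
    status_label.value_type = ga_label.LabelDescriptor.ValueType.STRING
    status_label.description = "RPC status, e.g. 'OK' or an error code."
    descriptor.labels.append(status_label)

    return client.create_metric_descriptor(
        name=f"projects/{project_id}", metric_descriptor=descriptor
    )

Time series written against this metric type can then carry status="OK" or an error code, so the error ratio can be computed from a single metric type.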
Condition for Log-Based Alerting Policies
Use the LogMatch condition type to create a log-based alerting policy, which notifies us when a message matching the filter appears in a log entry. If we use the LogMatch condition type, it must be the only condition in the alerting policy.
Do not use the LogMatch condition type with log-based metrics; alerting policies that monitor log-based metrics are metric-based policies. The alerting policies used in the examples in this article are metric-based, but the principles are the same for log-based alerting policies. For information specific to log-based alerting policies, see "Create a log-based alert (Monitoring API)" in the Cloud Logging documentation.
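A minimal sketch of a log-based alerting policy built with the Python client library follows. The log filter is a hypothetical example, and the sketch assumes that a notification rate limit is set in the alert strategy, as the API expects for log-match conditions:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

# Hypothetical log filter: match ERROR-severity entries containing "timeout".
log_policy = monitoring_v3.AlertPolicy(
    display_name="Timeout errors in logs",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Log entry matches 'timeout'",
            condition_matched_log=monitoring_v3.AlertPolicy.Condition.LogMatch(
                filter='severity="ERROR" AND textPayload:"timeout"'
            ),
        )
    ],
    # Log-based alerting policies also set a notification rate limit.
    alert_strategy=monitoring_v3.AlertPolicy.AlertStrategy(
        notification_rate_limit=monitoring_v3.AlertPolicy.AlertStrategy.NotificationRateLimit(
            period=duration_pb2.Duration(seconds=300)
        )
    ),
)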
Managing alerting policies by API
In this section, we use the Cloud Monitoring API to create and manage metric-based alerting policies programmatically, and we also illustrate how to manage alerting policies with the Google Cloud CLI. This content does not apply to log-based alerting policies; for information about those, see Monitoring your logs.
Many of these tasks can also be performed by using the Cloud Monitoring console; see Using Alerting Policies for an introduction to creating and managing alerting policies with the Cloud Monitoring console.
Prerequisites
Before writing the code for the API, we need to do the following:
- Be familiar with the general concepts and terminology used in alerting policies; see Introduction to Alerting.
- Make sure that the Cloud Monitoring API is enabled; for more information, see Enabling the API.
- Install the client library for our language; see Client Libraries for more information. Currently, API support for alerting is available only in C#, Go, Java, Node.js, and Python.
- Install the Google Cloud CLI, which can also perform these tasks. If we are using Cloud Shell, we can use it instead of installing the Google Cloud CLI. Examples using the gcloud interface are also provided here. All gcloud examples assume that the current project has already been set as the target (gcloud config set project [PROJECT_ID]), so the explicit --project flag is omitted from the calls. The project ID used in the examples is a-gcp-project.
- Make sure we have the appropriate permissions for our Google Cloud project; see Permissions for more information.
About Alerting Policies
Alerting policies are represented by AlertPolicy objects, which describe a set of conditions that indicate a potentially unhealthy state of the system. An alerting policy refers to notification channels, which specify how we are notified when the alerting policy is triggered.
Each alerting policy belongs to the scoping project of a metrics scope, and each project can contain up to 500 policies. For API calls, we must specify a project ID; use the ID of the scoping project of the metrics scope as the value. In these examples, that project ID is a-gcp-project.
The AlertPolicy resource supports the following operations:
- Create a new policy
- Delete an existing policy
- Retrieve a specific policy or all policies
- Modify an existing policy
Alerting policies can be expressed in either YAML or JSON, so we can write policies to files and use those files to back up and restore the policies. With the Google Cloud CLI, we can create a policy from a file in either format; with the REST API, we can create a policy from a JSON file.
Creating Policies
To create an alerting policy in a project, use the alertPolicies.create method.
You can create policies from JSON or YAML files. The Google Cloud CLI accepts these files as arguments, and you can programmatically read JSON files, convert them to AlertPolicy objects, and create policies from them by using the alertPolicies.create method.
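For instance, a minimal Python sketch of that flow might look like the following; the file name policy.json is a hypothetical placeholder:

from google.cloud import monitoring_v3

def create_policy_from_json(project_id: str, policy_file: str = "policy.json"):
    """Read an AlertPolicy from a JSON file and create it in the project."""
    with open(policy_file, "r") as f:
        policy = monitoring_v3.AlertPolicy.from_json(f.read())

    # Output-only fields must be cleared before creation.
    policy.name = ""
    for condition in policy.conditions:
        condition.name = ""

    client = monitoring_v3.AlertPolicyServiceClient()
    return client.create_alert_policy(
        name=f"projects/{project_id}", alert_policy=policy
    )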
For general information about monitoring ratios of metrics, see Ratios of metrics.
The following examples illustrate the creation of alerting policies.
C#:
static void RestorePolicy(string projectId, string filePath)
{
var policyClient = AlertPolicyServiceClient.Create();
var channelClient = NotificationChannelServiceClient.Create();
List<Exception> exceptions = new List<Exception>();
var backup = JsonConvert.DeserializeObject<BackupRecord>(
File.ReadAllText(filePath), new ProtoMessageConverter());
var projectName = new ProjectName(projectId);
bool isSameProject = projectId == backup.ProjectId;
var channelNameMap = new Dictionary<string, string>();
foreach (NotificationChannel channel in backup.Channels)
{
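// Notification-channel restoration is omitted from this excerpt.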
}
foreach (AlertPolicy policy in backup.Policies)
{
string policyName = policy.Name;
policy.CreationRecord = null;
policy.MutationRecord = null;
for (int ctr = 0; ctr < policy.NotificationChannels.Count; ++ctr)
{
if (channelNameMap.ContainsKey(policy.NotificationChannels[ctr]))
{
policy.NotificationChannels[ctr] =
channelNameMap[policy.NotificationChannels[ctr]];
}
}
try
{
Console.WriteLine("Update the policy.\n{0}",
policy.DisplayName);
bool update = false;
if (isSameProject)
try
{
policyClient.UpdateAlertPolicy(null, policy);
update = true;
}
catch (Grpc.Core.RpcException e)
when (e.Status.StatusCode == StatusCode.NotFound)
{ }
if (!update)
{
// The policy no longer exists. Recreate it.
policy.Name = null;
foreach (var condition in policy.Conditions)
{
condition.Name = null;
}
policyClient.CreateAlertPolicy(projectName, policy);
}
Console.WriteLine("Restored {0}.", policyName);
}
catch (Exception e)
{
// Record the exception and continue with the remaining policies.
exceptions.Add(e);
}
}
if (exceptions.Count > 0)
{
throw new AggregateException(exceptions);
}
}
Java:
private static void restoreRevisedPolicies(
String projectId, boolean isSameProject, List<AlertPolicy> policies) throws IOException {
try (AlertPolicyServiceClient client = AlertPolicyServiceClient.create()) {
for (AlertPolicy policy : policies) {
if (!isSameProject) {
policy = client.createAlertPolicy(ProjectName.of(projectId), policy);
} else {
try {
client.updateAlertPolicy(null, policy);
} catch (Exception e) {
policy =
client.createAlertPolicy(
ProjectName.of(projectId), policy.toBuilder().clearName().build());
}
}
System.out.println(String.format("Restored %s", policy.getName()));
}
}
}
Python:
def restore(project_name, backup_filename):
    print(
        "Loading alert policies and notification channels from {}.".format(
            backup_filename
        )
    )
    record = json.load(open(backup_filename, "rt"))
    is_same_project = project_name == record["project_name"]
    policies_json = [json.dumps(policy) for policy in record["policies"]]
    policies = [
        monitoring_v3.AlertPolicy.from_json(policy_json)
        for policy_json in policies_json
    ]
    channels_json = [json.dumps(channel) for channel in record["channels"]]
    channels = [
        monitoring_v3.NotificationChannel.from_json(channel_json)
        for channel_json in channels_json
    ]

    # Restore the channels.
    channel_client = monitoring_v3.NotificationChannelServiceClient()
    channel_name_map = {}
    for channel in channels:
        updated = False
        print("Updating channel", channel.display_name)
        # The verification status cannot be specified when creating or updating a channel.
        channel.verification_status = (
            monitoring_v3.NotificationChannel.VerificationStatus.VERIFICATION_STATUS_UNSPECIFIED
        )
        if is_same_project:
            try:
                channel_client.update_notification_channel(notification_channel=channel)
                updated = True
            except google.api_core.exceptions.NotFound:
                pass  # The channel was deleted. Create it below.
        if not updated:
            # The channel no longer exists. Recreate it.
            old_name = channel.name
            del channel.name
            new_channel = channel_client.create_notification_channel(
                name=project_name, notification_channel=channel
            )
            channel_name_map[old_name] = new_channel.name

    # Restore the alerts
    alert_client = monitoring_v3.AlertPolicyServiceClient()
    for policy in policies:
        print("Updating policy", policy.display_name)
        # These output-only fields cannot be set externally.
        del policy.creation_record
        del policy.mutation_record
        # Map old channel names to the newly created channels.
        for counter, channel in enumerate(policy.notification_channels):
            new_channel = channel_name_map.get(channel)
            if new_channel:
                policy.notification_channels[counter] = new_channel
        updated = False
        if is_same_project:
            try:
                alert_client.update_alert_policy(alert_policy=policy)
                updated = True
            except google.api_core.exceptions.NotFound:
                pass
            except google.api_core.exceptions.InvalidArgument:
                pass
        if not updated:
            # The policy no longer exists. Recreate it.
            old_name = policy.name
            del policy.name
            for condition in policy.conditions:
                del condition.name
            policy = alert_client.create_alert_policy(
                name=project_name, alert_policy=policy
            )
        print("Updated", policy.name)
The created AlertPolicy object contains additional fields. The policy itself has name, creationRecord, and mutationRecord fields, and each condition in the policy is also given a name. These fields cannot be modified externally and do not need to be set when creating a policy, so the JSON samples used to create policies do not include them; the fields are present, however, when the created policies are retrieved.
To see the code in more languages, click here.
Deleting Policies
Use the alertPolicies.delete method to delete a policy from a project, supplying the name of the alerting policy to delete.
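With the Python client library, a minimal sketch of the API call might look like this; the policy ID in the usage comment is a placeholder:

from google.cloud import monitoring_v3

def delete_policy(policy_name: str) -> None:
    """Delete the alerting policy with the given resource name."""
    client = monitoring_v3.AlertPolicyServiceClient()
    # policy_name has the form:
    #   projects/[PROJECT_ID]/alertPolicies/[POLICY_ID]
    client.delete_alert_policy(name=policy_name)

# Example usage (placeholder ID):
# delete_policy("projects/a-gcp-project/alertPolicies/12669073143329903307")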
gcloud
Use the gcloud alpha monitoring policies delete command to delete an alerting policy, and specify the name of the policy to delete. For example, the following command deletes the “High CPU rate of change” policy:
gcloud alpha monitoring policies delete projects/a-gcp-project/alertPolicies/12669073143329903307
Retrieving Policies
Use the alertPolicies.list method to retrieve a list of the policies in a project. We can also use this method to retrieve policies and apply some action to each of them, for example, backing them up. This method also supports orderBy and filter options to restrict and sort the results; see Sorting and Filtering.
We can use the alertPolicies.get method to retrieve a single policy when we know its name. The name of a policy is the value of the name field in the AlertPolicy object and has the format projects/[PROJECT_ID]/alertPolicies/[POLICY_ID], for example:
projects/a-gcp-project/alertPolicies/12669073143329903307
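For example, a minimal Python sketch that retrieves a single policy by its resource name might look like this:

from google.cloud import monitoring_v3

def get_policy(policy_name: str) -> monitoring_v3.AlertPolicy:
    """Retrieve one alerting policy by its resource name."""
    client = monitoring_v3.AlertPolicyServiceClient()
    policy = client.get_alert_policy(name=policy_name)
    print(policy.display_name)
    return policy

The samples below list all of the alerting policies in a project.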
C#:
static void ListAlertPolicies(string projectId)
{
var client = AlertPolicyServiceClient.Create();
var response = client.ListAlertPolicies(new ProjectName(projectId));
foreach (AlertPolicy policy in response)
{
Console.WriteLine(policy.Name);
if (policy.DisplayName != null)
{
Console.WriteLine(policy.DisplayName);
}
if (policy.Documentation?.Content != null)
{
Console.WriteLine(policy.Documentation.Content);
}
Console.WriteLine();
}
}
Java:
private static void listAlertPolicies(String projectId) throws IOException {
try (AlertPolicyServiceClient client = AlertPolicyServiceClient.create()) {
ListAlertPoliciesPagedResponse response = client.listAlertPolicies(ProjectName.of(projectId));
System.out.println("Alert Policies:");
for (AlertPolicy policy : response.iterateAll()) {
System.out.println(
String.format("\nPolicy %s\nalert-id: %s", policy.getDisplayName(), policy.getName()));
int channels = policy.getNotificationChannelsCount();
if (channels > 0) {
System.out.println("notification-channels:");
for (int i = 0; i < channels; i++) {
System.out.println("\t" + policy.getNotificationChannels(i));
}
}
if (policy.hasDocumentation() && policy.getDocumentation().getContent() != null) {
System.out.println(policy.getDocumentation().getContent());
}
}
}
}
Python:
def list_alert_policies(project_name):
    client = monitoring_v3.AlertPolicyServiceClient()
    policies = client.list_alert_policies(name=project_name)
    print(
        str(
            tabulate.tabulate(
                [(policy.name, policy.display_name) for policy in policies],
                ("name", "display_name"),
            )
        )
    )
To learn more about the other operations, click here.
Managing Notification Channels by the API
When an alerting policy is triggered, we usually want to be notified; these notification mechanisms are called notification channels. Several channel types are available; each type is described in a notification channel descriptor, and a notification channel of a particular type is an instance of that descriptor. Alerting policies contain references to the notification channels to use as notification paths.
A notification channel must exist before it can be used in an alerting policy. The notification channel descriptors are provided for us, but we must create the channels themselves before they can be used.
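As a sketch of creating a channel with the Python client library (the email address is a placeholder, and type_ is the library's name for the descriptor's type field):

from google.cloud import monitoring_v3

def create_email_channel(project_id: str) -> monitoring_v3.NotificationChannel:
    """Create an email notification channel; the address is a placeholder."""
    client = monitoring_v3.NotificationChannelServiceClient()
    channel = monitoring_v3.NotificationChannel(
        type_="email",  # must match the type of an available channel descriptor
        display_name="Primary on-call email",
        labels={"email_address": "oncall@example.com"},
    )
    return client.create_notification_channel(
        name=f"projects/{project_id}", notification_channel=channel
    )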
Channel Descriptors
Monitoring provides several built-in notification channel types, each described by a NotificationChannelDescriptor. These descriptors have a type field, and the value of that field acts as an identifier when creating an instance of that channel type. We can retrieve the available channel types, which are described in more detail in Notification options, with the following command:
$ gcloud beta monitoring channel-descriptors list --format='value(type)'
campfire
email
hipchat
pagerduty
pubsub
slack
sms
webhook_basicauth
webhook_tokenauth
We can use the notificationChannelDescriptors.list method to retrieve all the channel descriptors in a project.
Similarly, if we are looking for a particular descriptor and know its name, we can use the notificationChannelDescriptors.get method to retrieve only that channel descriptor. The name of a channel descriptor has the format projects/[PROJECT_ID]/notificationChannelDescriptors/[CHANNEL_TYPE], where [CHANNEL_TYPE] must be one of the types listed above, for example:
projects/[PROJECT_ID]/notificationChannelDescriptors/email
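A minimal Python sketch of both calls might look like the following; note that the descriptor's type field is exposed as type_ in the Python library:

from google.cloud import monitoring_v3

def show_channel_descriptors(project_id: str) -> None:
    """List all channel descriptors, then fetch the email descriptor."""
    client = monitoring_v3.NotificationChannelServiceClient()
    project_name = f"projects/{project_id}"

    # notificationChannelDescriptors.list
    for descriptor in client.list_notification_channel_descriptors(name=project_name):
        print(descriptor.type_, "-", descriptor.display_name)

    # notificationChannelDescriptors.get
    email_descriptor = client.get_notification_channel_descriptor(
        name=f"{project_name}/notificationChannelDescriptors/email"
    )
    print(email_descriptor.description)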
To list all the notification-channel descriptors in a project, use the gcloud beta monitoring channel-descriptors list command:
gcloud beta monitoring channel-descriptors list
If successful, the list command provides a listing of all the channel descriptors in the specified project. For example, the email channel descriptor appears in the list like this:
---
description: A channel that sends notifications via email.
displayName: Email
labels:
- description: An address to send email.
key: email_address
name: projects/[PROJECT_ID]/notificationChannelDescriptors/email
type: email
---
All channel descriptors include these fields:
- name: The fully qualified resource name of the channel descriptor
- type: The part of the name that indicates the type of channel
- displayName: A description of the type field, for display purposes
- description: A brief description of the channel
- labels: A set of fields specific to a channel type. Each channel type has its own set of labels.
When a channel is created, it also gets an enabled field, with the value true by default.
To list a single channel descriptor, use the gcloud beta monitoring channel-descriptors describe command instead, and specify the name of the channel descriptor. You don't need to specify the fully qualified name; for example, both of these commands return the listing above:
gcloud beta monitoring channel-descriptors describe email
gcloud beta monitoring channel-descriptors describe projects/[PROJECT_ID]/notificationChannelDescriptors/email
See the gcloud beta monitoring channel-descriptors list and describe references for more information. The describe command corresponds to the notificationChannelDescriptors.get method in the API.