Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Incidents in Cloud Monitoring
  2.1. Incidents for Metric-Based Alerts
  2.2. Incidents for Log-Based Alerts
  2.3. Finding Incidents
    2.3.1. Finding Older Incidents
  2.4. Filtering Incidents
  2.5. Investigating Incidents
  2.6. Managing Incidents
  2.7. Acknowledging Incidents
  2.8. Closing Incidents
3. Frequently Asked Questions
  3.1. What is the most important question to focus on when resolving an incident?
  3.2. How do I check GCP logs?
  3.3. How do I create a GCP alert?
4. Conclusion
Last Updated: Mar 27, 2024

Incidents in Cloud Monitoring

Author Shivam Sinha

Introduction

 In this article, we will discuss incidents in Cloud Monitoring. We will also discuss how to find incidents, how to filter incidents, and how we can manage incidents.

Incidents in Cloud Monitoring

Incidents for Metric-Based Alerts

Incidents are records of triggered alert policies. Cloud Monitoring opens an incident when the conditions of the alerting policy are met. Incidents contain information that we can use to investigate the cause of the alert.

This section describes how to view, investigate, and manage incidents for metric-based alerting policies.

Incidents for Log-Based Alerts 

Incidents are records of triggered alerting policies. Cloud Monitoring opens an incident when the conditions of the alerting policy are met. Incidents contain information that you can use to investigate the cause of the alert.

When a matching log entry triggers a log-based alerting policy for the first time, Monitoring opens an incident and sends a notification.

This section describes how to view, investigate, and manage incidents for log-based alerting policies.

Finding Incidents

To view the list of incidents, follow these steps:

  1. On the Google Cloud Console toolbar, click the Navigation menu and select Monitoring.
  2. In the Monitoring navigation pane, select Alerting. The summary on this page shows the count of open incidents.
  3. The Incidents pane shows the latest incidents. To hide closed incidents in the table, click Hide Closed Incidents.

Finding Older Incidents

The Incidents section of the Alerting page shows the latest open incidents. To find older incidents, do one of the following:

  • Click < Newer or > Older to page through the incident table entries.
  • Click See All Incidents to go to the Incidents page. On the Incidents page, you can do the following:
    • Hide Closed Incidents: Click Hide Closed Incidents to list only the open incidents in the table.
    • Filter Incidents: To add a filter, see Filtering Incidents.
    • Acknowledge, Silence, or Close an Incident: To access these options, click More options in the incident row and select from the menu. See Managing Incidents for more information.

Filtering Incidents

When you enter a value in the filter bar, only incidents that match the filter are listed in the incident table. If you add multiple filters, an incident is displayed only if it matches all of them. To add a filter to the incident table, follow these steps:

  1. On the Incidents page, click Filter Table and select a filter property. Filter properties include all of the following:
    • Incident state
    • Alerting policy name
    • When the incident was opened or closed
    • Metric type
    • Resource type
  2. Select a value from the secondary menu or enter a value in the filter bar.

For example, if you select Metric Type and enter usage_time, you might see only the following options in the secondary menu:

agent.googleapis.com/cpu/usage_time
compute.googleapis.com/guest/container/cpu/usage_time
container.googleapis.com/container/cpu/usage_time
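
The filtering behavior described above (every filter must match, and a partial value such as usage_time narrows the metric types) can be sketched in plain Python. This is only an illustration: the Incident class, its field names, and the case-insensitive substring matching are assumptions made for the example, not the Cloud Monitoring data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical, simplified incident record (not the Cloud Monitoring schema)."""
    policy_name: str
    state: str          # "open", "acknowledged", or "closed"
    metric_type: str
    resource_type: str


def filter_incidents(incidents, **filters):
    """Return the incidents that match every filter (logical AND).

    Each filter is a case-insensitive substring match on the named field,
    which is an assumption made for this sketch.
    """
    def matches(incident):
        return all(
            str(value).lower() in getattr(incident, field).lower()
            for field, value in filters.items()
        )

    return [incident for incident in incidents if matches(incident)]


# Illustrative sample data.
incidents = [
    Incident("High CPU", "open",
             "agent.googleapis.com/cpu/usage_time", "gce_instance"),
    Incident("Container CPU", "closed",
             "container.googleapis.com/container/cpu/usage_time", "gke_container"),
    Incident("Disk full", "open",
             "agent.googleapis.com/disk/percent_used", "gce_instance"),
]

# One filter: any incident whose metric type contains "usage_time".
cpu_time = filter_incidents(incidents, metric_type="usage_time")

# Two filters: both must match for an incident to be listed.
open_cpu_time = filter_incidents(incidents, metric_type="usage_time", state="open")

print([i.policy_name for i in cpu_time])
print([i.policy_name for i in open_cpu_time])
```

Adding the second filter removes the closed incident from the result, which mirrors how the incident table shrinks as filters are added.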

Investigating Incidents

The following screenshot shows the details page for an incident:

To view incident details, you must have at least the Identity and Access Management (IAM) role roles/monitoring.viewer. For more information, see Unable to view incident details due to a permission error.

When you find an incident to investigate, go to the incident details page for that incident. To view the details, click the incident summary in the Incidents table on the Alerting page or the Incidents page. Alternatively, if you received a notification with a link to the incident, you can use that link to view the incident details.

The Incident Details page contains the following information:

  • Status information, including:
    • Name: The name of the alerting policy that caused this incident.
    • State: The state of the incident: Open, Acknowledged, or Closed.
    • Duration: The length of time for which the incident was open.

  • Information about the alerting policy that caused the incident:
    • Condition field: Identifies the condition in the alerting policy that caused the incident.
    • Message field: Provides a brief description of the cause, based on how the condition was configured in the alerting policy. This field is always populated.
    • Documentation field: Displays the documentation template for notifications that you specified when creating the alerting policy. This information can include a description of what the alerting policy monitors and hints for mitigation. If you skip this field when creating an alerting policy, this field reports that no documentation is configured.

  • Labels: Report the following:
    • Labels and values for the monitored resource and metric of the time series that triggered the alerting policy. This information helps identify the specific monitored resource that caused the incident.
    • Custom labels and values defined in the alerting policy. You can use these labels to organize and identify your alerting policies. The labels associated with the policy are listed in the Policy Labels section, and the labels defined as part of the condition are listed in the Metric Labels section. See Adding Severity Levels to Alerting Policies for use cases.

 

The incident details page also has tools for investigating incidents:

  • Incident Timeline: Shows two visual representations of the incident.
    • Red bars on the timeline represent incidents. The length and position of each bar reflect the duration of the incident.
    • The chart shows the time series data and the threshold used by the alerting policy that caused the incident. An incident is opened when a time series meets the condition of the alerting policy.

    The time axis marks the duration of the incident with two labeled dots. The position of these dots on the timeline determines the range of data displayed in the chart that accompanies the incident timeline. By default, one dot is placed at the beginning of the incident and the other at the end of the incident, or at the current time if the incident is still open.

    You can change the time range shown in the incident timeline and in the chart:
    • To change the time range displayed on the chart, drag one of the dots along the timeline. You can use this technique to focus on a specific interval, for example, near the beginning or end of the incident. Dragging a dot sets a custom value for the Time Span menu and disables the menu. Click Reset to enable the Time Span menu again.
    • To change the time range displayed on the timeline, select a range from the Time Span menu.

  • Links to other troubleshooting tools: The project configuration, the alerting policy, and the age of the incident determine which links are available.
    • Click View Policy to view the alerting policy details page.
    • Click Edit Policy to edit the alerting policy definition.
    • Click View Resource Details to go to the dashboard that contains the resource's performance information.
    • Click View Logs to view the relevant log entries in the Logs Explorer.
    • Click View in Metrics Explorer to explore the data shown in the chart.

  • Annotations: Provide a log of findings, suggestions, or other comments from an incident investigation.
    • To add a comment, enter the text in the box and click Add Comment.
    • To discard the comment, click Cancel.

You can also acknowledge, silence, or close an incident from the incident details page. See Managing Incidents for more information.

Managing Incidents

Incidents have one of the following states:

  • Open: The conditions of the alerting policy are currently met, or there is no data to indicate that they are no longer met. If the policy contains multiple conditions, incidents are opened based on how those conditions are combined. For more information, see Combinations of conditions.

  • Acknowledged: The incident is open and has been manually marked as acknowledged. This state usually indicates that the incident is under investigation.

  • Closed: The system determined that the condition is no longer met, you closed the incident, or 7 days passed without an observation showing that the condition is still met.

When configuring an alerting policy, ensure that the steady state provides a signal when everything is fine. This signal is needed to identify the healthy state and to close the incident if one is open. If there is no signal that the failure condition has ended, the incident remains open for 7 days after the policy is triggered.

For example, if you want to create a policy that notifies you when the error count is greater than 0, make sure that the metric reports an error count of 0 when there are no errors. If the metric returns null or nothing when the system is healthy, no signal indicates when the errors ended. In some situations, Monitoring Query Language (MQL) supports specifying default values for when measurements are unavailable.
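
To see why an explicit 0 matters, here is a minimal plain-Python sketch of the error-count example. It is not MQL and does not reflect how Monitoring evaluates conditions internally. The fill_missing and condition_met helpers are hypothetical names introduced only for this illustration.

```python
def fill_missing(points, default=0):
    """Replace missing measurements (None) with a default value."""
    return [default if p is None else p for p in points]


def condition_met(value, threshold=0):
    """Condition from the example: error count greater than the threshold."""
    return value > threshold


# One reading per period; None means that no data was reported.
raw_error_counts = [0, 3, 2, None, None]

# Without a default, the last known observation still says "condition met",
# and nothing signals that the errors stopped.
known = [p for p in raw_error_counts if p is not None]
still_failing_without_default = condition_met(known[-1])

# With a default of 0, the healthy state produces an explicit signal.
filled = fill_missing(raw_error_counts)
still_failing_with_default = condition_met(filled[-1])

print(still_failing_without_default)  # True
print(still_failing_with_default)     # False
```

In the first case the incident would stay open until it times out. In the second, the reported 0 shows that the condition is no longer met.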

Acknowledging Incidents

We recommend that you mark an incident as acknowledged when you begin investigating its cause.

To mark an incident as acknowledged, follow these steps:

  • In the Incidents pane of the Alerting page, click See All Incidents.
  • On the Incidents page, find the incident you want to acknowledge and do one of the following:
    • Click More options and select Acknowledge.
    • Open the incident details page and click Acknowledge Incident.

Closing Incidents

You can let Monitoring close the incident or, in some cases, close the incident yourself.

  • Monitoring automatically closes incidents when any of the following occur:
    • An observation arrives that shows the condition is no longer met.
    • For metric-threshold conditions, no observations arrive during the auto-close period of the alerting policy. You can use the Google Cloud Console or the Cloud Monitoring API to set the auto-close period. By default, the auto-close period is 7 days. The minimum auto-close period is 30 minutes.
    • For metric-absence conditions, no observations arrive for 24 hours after the end of the auto-close period. You can use the Google Cloud Console or the Cloud Monitoring API to set the auto-close period. By default, the auto-close period is 7 days.


For example, suppose an alerting policy generated an incident because the HTTP response latency exceeded 2 seconds for 10 consecutive minutes. The incident is closed if the next HTTP response latency measurement is less than or equal to 2 seconds. Similarly, the incident is closed if no data is received for 7 days.
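
The latency example above can be simulated with a short Python sketch. It is a deliberately simplified model under stated assumptions, not Monitoring's actual evaluation logic. The evaluate function, its return values, and the constants are hypothetical names chosen for this illustration.

```python
from datetime import datetime, timedelta

THRESHOLD_SECONDS = 2.0                  # latency threshold from the example
REQUIRED_DURATION = timedelta(minutes=10)  # how long the breach must last
AUTO_CLOSE = timedelta(days=7)           # default auto-close period


def evaluate(samples, now):
    """Return "none", "open", or "closed" for (timestamp, latency) samples.

    samples must be sorted by timestamp. An incident opens once latency has
    stayed above the threshold for REQUIRED_DURATION, and closes when a
    sample at or below the threshold arrives or when no data has arrived
    for AUTO_CLOSE.
    """
    state = "none"
    breach_start = None
    last_seen = None
    for ts, latency in samples:
        # A gap in the data that is as long as the auto-close period
        # closes an open incident.
        if state == "open" and ts - last_seen >= AUTO_CLOSE:
            state = "closed"
            breach_start = None
        last_seen = ts
        if latency > THRESHOLD_SECONDS:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= REQUIRED_DURATION:
                state = "open"
        else:
            breach_start = None
            if state == "open":
                state = "closed"
    # No newer data: check the silence between the last sample and now.
    if state == "open" and now - last_seen >= AUTO_CLOSE:
        state = "closed"
    return state


t0 = datetime(2024, 1, 1, 12, 0)
# One sample per minute for minutes 0 through 10, all above the threshold.
breach = [(t0 + timedelta(minutes=m), 2.5) for m in range(11)]
last = breach[-1][0]

still_open = evaluate(breach, now=last + timedelta(minutes=1))
recovered = evaluate(breach + [(last + timedelta(minutes=1), 1.8)],
                     now=last + timedelta(minutes=2))
timed_out = evaluate(breach, now=last + AUTO_CLOSE)

print(still_open, recovered, timed_out)
```

The three calls correspond to the three outcomes in the example: the incident stays open while the breach continues, it closes on the first measurement at or below 2 seconds, and it closes after 7 days without data.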
 

  • You can close an incident yourself after observations stop arriving.

After an incident is closed, a new incident is created when data arrives indicating that the conditions are met again.

Closing an incident does not close other incidents that are open for the same alerting policy. This behavior differs from silencing an incident, which closes all open incidents for the same condition.
 

To close the incident, do the following:

  • In the Incidents section of the Alerting page, click See All Incidents.
  • On the Incidents page, find the incident you want to close and do one of the following:
    • Click More options and select Close This Incident.
    • Open the incident details page and click Close Incident.


If you see the message "Unable to close incident while active", you cannot close the incident because data was received within the most recent alerting period.

If you see a message that the incident could not be closed and that asks you to try again in a few minutes, the incident could not be closed because of an internal error.

Frequently Asked Questions

What is the most important question to focus on when resolving an incident?

When starting out with incident management, focus on asking the most critical questions first so that the effort to fix the problem can get underway as soon as possible.

How do I check GCP logs?

In the Google Cloud console, go to the Logging page. When in the Logs Explorer, select and filter your resource type from the first drop-down list. From the All logs drop-down list, select compute.googleapis.com/activity_log to see Compute Engine activity logs.

How do I create a GCP alert?

In the Monitoring navigation pane, select Alerting and then click Create policy. To select the time series to monitor, click Select a metric and enter the name of the metric type or resource type of interest in the filter bar.
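
The article notes that the auto-close period can also be set through the Cloud Monitoring API. As a rough sketch, the Python below builds a dictionary shaped like an alerting policy request body. The field names follow the AlertPolicy resource as we understand it, so check them against the API reference before use. The build_alert_policy helper, the example metric filter, and the threshold values are assumptions made for illustration, and the code only prints JSON without calling any API.

```python
import json

SEVEN_DAYS_S = 7 * 24 * 60 * 60   # default auto-close period, in seconds
MIN_AUTO_CLOSE_S = 30 * 60        # minimum auto-close period, in seconds


def build_alert_policy(display_name, metric_filter, threshold,
                       duration_s, auto_close_s=SEVEN_DAYS_S):
    """Build a dict shaped like an alerting policy body (hypothetical helper)."""
    if auto_close_s < MIN_AUTO_CLOSE_S:
        raise ValueError("auto-close period must be at least 30 minutes")
    return {
        "displayName": display_name,
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"{display_name} condition",
                "conditionThreshold": {
                    "filter": metric_filter,
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    # How long the threshold must be exceeded.
                    "duration": f"{duration_s}s",
                },
            }
        ],
        # Incidents close after this long without new observations.
        "alertStrategy": {"autoClose": f"{auto_close_s}s"},
    }


policy = build_alert_policy(
    display_name="High CPU utilization",
    # Illustrative filter; replace it with the metric you want to monitor.
    metric_filter=('metric.type="compute.googleapis.com/instance/cpu/utilization" '
                   'AND resource.type="gce_instance"'),
    threshold=0.8,
    duration_s=600,
    auto_close_s=MIN_AUTO_CLOSE_S,
)

print(json.dumps(policy, indent=2))
```

A body like this would then be sent with a client library or the REST API. That step is left out here because it depends on your project and credentials.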

Conclusion

This article discusses Incidents in Cloud Monitoring, the different kinds of Incidents available, and how we can find incidents or even filter one! We also learned how to investigate and close an Incident.

To learn more, see Cloud Computing, Microsoft Azure, Basics of C++ with Data Structure, DBMS, and Operating System by Coding Ninjas, and keep practicing on our platform Coding Ninjas Studio.

If you think you are ready for the tech giants, check out the mock test series on Code Studio.

You can also refer to our Guided Path on Coding Ninjas Studio to upskill yourself in domains like Data Structures and Algorithms, Competitive Programming, Aptitude, and many more. You can also prepare for tech giants like Amazon, Microsoft, Uber, etc., by looking for the questions they asked in recent interviews. If you want to prepare for placements, refer to the interview bundle. If you are nervous about your interviews, you can read interview experiences to get an idea of the questions these companies have asked.

Do upvote if you find this blog helpful!

Be a Ninja

Happy Coding!

 
