Introduction
Error handling is a critical activity that supports faster deployment as digitalization and the cloud take over the development and deployment of new features in software applications. Any error in the chain, from writing code to deploying it to monitoring its performance, can impair the customer experience, increase costs, or abruptly halt key services. Many systems have been developed to maintain a pipeline for managing faults, only to run up against the limits of the tools and systems involved. Most IT teams still trawl through terabytes of data by hand to find the problem; this is time-consuming and frequently delays resolution, costing companies money.
Amazon DevOps Guru is a machine-learning-powered service that helps us improve our application's operational performance and availability by detecting behavior that deviates from normal operating patterns. Developers and DevOps engineers can use DevOps Guru to automatically detect, diagnose, and correct performance and operational issues that were previously difficult to identify and resolve. The service does not require us to have any ML skills or to build any ML models, because all of this is handled by the service. It is intended to relieve operators of the burden of configuring alarms and thresholds, and of the need to maintain an excessive number of alarms.
DevOps Guru's pre-trained models learn our application's operational trends and provide advice based on known resource usage patterns. When several alarms are triggered simultaneously, for example because of a spike in traffic to our site or application, it can be difficult to tell which alarms point to the source of the problem and which are simply symptoms of it. When an issue is found, DevOps Guru stitches together a chain of correlated events and bundles them into an insight that can guide an engineer to the underlying cause of the problem more quickly.
How It Works
DevOps Guru ingests data from CloudWatch and monitors it against the parameters defined by the ML models. When a potential issue is discovered, DevOps Guru collects log data from CloudTrail for the resources linked to the triggering issue and creates an insight with viable solutions. The advantage of this technique is that the operator or engineer only has to look at the insight, instead of investigating many alerts, analyzing each alarm, and going through the logs of every related resource to figure out the problem. The insight already includes recommendations for resolving the potential issues it identifies. DevOps Guru can monitor CloudFormation stacks, and a stack set can be used to monitor resources across several accounts.
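To make the "look at the insight, not the alarms" workflow concrete, here is a minimal sketch of pulling ongoing insights together with their recommendations. It assumes that `client` behaves like `boto3.client("devops-guru")` and that the `list_insights` and `list_recommendations` responses have the shape used below; the function name `collect_ongoing_insights` and the summary keys are hypothetical, and pagination is left out for brevity.

```python
from typing import Any, Dict, List


def collect_ongoing_insights(client: Any) -> List[Dict[str, Any]]:
    """Summarize ongoing reactive insights and their recommendations.

    `client` is assumed to expose list_insights and list_recommendations
    in the way boto3's "devops-guru" client does (an assumption of this
    sketch, not something the article specifies).
    """
    summaries: List[Dict[str, Any]] = []

    # Ask only for insights that are still open and were raised reactively.
    response = client.list_insights(
        StatusFilter={"Ongoing": {"Type": "REACTIVE"}}
    )

    for insight in response.get("ReactiveInsights", []):
        # Each insight carries its own list of suggested fixes.
        recs = client.list_recommendations(InsightId=insight["Id"])
        summaries.append(
            {
                "id": insight["Id"],
                "name": insight.get("Name", ""),
                "severity": insight.get("Severity", ""),
                "recommendations": [
                    rec.get("Description", "")
                    for rec in recs.get("Recommendations", [])
                ],
            }
        )
    return summaries
```

With real credentials, this would be called as `collect_ongoing_insights(boto3.client("devops-guru"))`. Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a live account.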
Alerts can be set up with Amazon SNS notifications. This means we can create a central notification target in one account, such as an SNS topic, and receive notifications there from DevOps Guru running in several AWS accounts. This is a good way to keep track of client or managed-service accounts from a DevOps or enterprise AWS account. Notifications can also be integrated with popular incident management platforms such as Atlassian Opsgenie, PagerDuty, and AWS Systems Manager OpsCenter. When a notification is received in OpsCenter, a new ticket is created there with the details from the insight filled in.
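As a rough illustration of wiring up such a notification target, the sketch below builds the request parameters for registering an SNS topic as a DevOps Guru notification channel. It assumes the boto3 `devops-guru` client and its `add_notification_channel` operation; the helper name `sns_channel_request` and the example ARN are hypothetical.

```python
from typing import Any, Dict


def sns_channel_request(topic_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for add_notification_channel.

    Performs a light sanity check on the ARN so that an obviously wrong
    value fails before any API call is made.
    """
    # An SNS topic ARN has six colon-separated parts:
    # arn:partition:sns:region:account-id:topic-name
    parts = topic_arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {topic_arn!r}")

    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


# Hypothetical ARN of a topic living in a central operations account.
request = sns_channel_request(
    "arn:aws:sns:us-east-1:111122223333:devops-guru-insights"
)
```

With a real client, the request would be sent as `client.add_notification_channel(**request)`. For the cross-account setup described above, the topic's access policy would also need to allow the other accounts to publish to it; that part is not shown here.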
When we start DevOps Guru (via the CLI or the console), we can either specify the resources we want to monitor or instruct DevOps Guru to monitor everything in our account. The service is charged by the number of resource hours analyzed for each active resource. We can use AWS CloudFormation stacks or AWS tags to confine the service to only the most important resources, such as a production environment. We can also use an IAM policy to restrict access to resources, for example when we want to remove access to resources that are generating insights needlessly. We can provide an IAM role when we set up the service, or use the default DevOps Guru role and update its policy.
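The scoping and billing points above can be sketched in a few lines. The first helper builds the parameters for limiting coverage to named CloudFormation stacks, assuming the boto3 `devops-guru` client's `update_resource_collection` operation. The second does the resource-hour arithmetic. The function names, the stack name, and above all the hourly rate are hypothetical placeholders; actual rates depend on the resource type and should be taken from the AWS pricing page.

```python
from typing import Any, Dict, List


def stack_scope_request(stack_names: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for update_resource_collection that add
    the given CloudFormation stacks to DevOps Guru's coverage."""
    if not stack_names:
        raise ValueError("at least one stack name is required")
    return {
        "Action": "ADD",
        "ResourceCollection": {
            "CloudFormation": {"StackNames": list(stack_names)}
        },
    }


def estimate_cost(active_resources: int, hourly_rate: float,
                  hours: float = 730.0) -> float:
    """Resource-hour billing: resources x hours analyzed x rate.

    `hourly_rate` is a placeholder supplied by the caller; 730 is an
    approximate number of hours in a month.
    """
    return active_resources * hours * hourly_rate


# Hypothetical example: monitor only the production stack.
request = stack_scope_request(["prod-web-stack"])

# Hypothetical example: 50 active resources at a made-up rate.
monthly = estimate_cost(active_resources=50, hourly_rate=0.003)
```

With a real client, the scope would be applied with `client.update_resource_collection(**request)`. Estimating cost before widening the scope makes the trade-off between coverage and spend explicit.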
When setting up DevOps Guru, we can also nominate or create up to two SNS topics to receive notifications. Interested parties can then subscribe to the topic to receive insights as they are generated. Once we enable the service, it takes about 2 hours to build a baseline. It's worth noting that if we are already having issues when we first start DevOps Guru, they may be absorbed into the baseline, making it harder to detect the abnormal application or environment behavior we're looking for. So that's a quick overview of Amazon DevOps Guru and its applications.