The Ops Agent is the primary agent for collecting telemetry from your Compute Engine instances. It combines logging and metrics into a single agent: it uses Fluent Bit for logs, which supports high-throughput logging, and the OpenTelemetry Collector for metrics. It can also parse log files from third-party applications.
Authorize the Ops Agent
Authorization is the process of determining what permissions an authenticated client has for a set of resources.
Before authorizing the Ops Agent, check the access scopes of your Compute Engine instance.
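The scope check can be sketched as follows. This is a minimal example, not an official procedure: it queries the Compute Engine metadata server for the scopes of the default service account and assumes that either the cloud-platform scope or the pair logging.write and monitoring.write is sufficient. The function names fetch_scopes, scope_present, and has_required_scopes are hypothetical helpers.

```shell
# Metadata server endpoint that lists the access scopes of the default service account.
METADATA_URL="http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/scopes"

# Print the instance's access scopes, one per line (only works on a Compute Engine VM).
fetch_scopes() {
  curl --silent --connect-timeout 1 -f -H "Metadata-Flavor: Google" "$METADATA_URL"
}

# $1 = scope list, $2 = short scope name; succeed if the exact scope is listed.
scope_present() {
  printf '%s\n' "$1" | grep -qxF "https://www.googleapis.com/auth/$2"
}

# Read a scope list on stdin; succeed if it allows writing logs and metrics.
# Assumption: cloud-platform, or logging.write plus monitoring.write, is enough.
has_required_scopes() {
  scopes=$(cat)
  scope_present "$scopes" cloud-platform ||
    { scope_present "$scopes" logging.write && scope_present "$scopes" monitoring.write; }
}

if fetch_scopes | has_required_scopes; then
  echo "Access scopes look sufficient for the Ops Agent."
else
  echo "Required scopes are missing, or this is not a Compute Engine VM."
fi
```

If the check fails on a real VM, the instance's scopes probably need to be widened, or the agent needs the service-account key credentials described in the steps that follow.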
Authorizing the Ops Agent on a VM instance involves the following steps:
Create a service account with the required privileges and private-key credentials in the Google Cloud project associated with the VM instance.
Copy the private-key credentials to the VM instance, where they act as Application Default Credentials for software running on the instance.
Install and restart the agent.
Creating a service account
For authentication, the process of determining a client's identity, it is recommended to use a service account: an account associated with your Google Cloud project rather than with a specific user. A service account can be used regardless of whether the code runs on Compute Engine, App Engine, or on-premises.
To create a service account, follow the service account creation procedure with the settings below:
Choose the Google Cloud project in which to create the service account. For a Compute Engine instance, select the project in which the instance was created.
From the Role drop-down menu, choose the following roles:
Monitoring > Monitoring Metric Writer
Logging > Logs Writer
Select JSON for the key type when creating the key
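The same settings can also be applied from the command line. The sketch below assumes the gcloud CLI is installed and authenticated. The project ID, service account name, and key path are hypothetical placeholders, and the run helper only prints each command unless DRY_RUN=0 is set, so the block is safe to try first.

```shell
# Hypothetical names; replace them with your own values.
PROJECT_ID="${PROJECT_ID:-my-project}"
SA_NAME="${SA_NAME:-ops-agent-sa}"
SA_EMAIL="${SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
CREDS="${CREDS:-$HOME/ops-agent-key.json}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Create the service account in the chosen project.
run gcloud iam service-accounts create "$SA_NAME" \
  --project="$PROJECT_ID" --display-name="Ops Agent"

# Grant the Monitoring Metric Writer and Logs Writer roles.
for role in roles/monitoring.metricWriter roles/logging.logWriter; do
  run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:${SA_EMAIL}" --role="$role"
done

# Create a private key for the account (gcloud writes JSON keys by default).
run gcloud iam service-accounts keys create "$CREDS" --iam-account="$SA_EMAIL"
```

Review the printed commands, then rerun with DRY_RUN=0 to execute them.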
Copying the private key to your instance
After creating the service account, copy the private-key file to one of the locations listed below on your VM instance so that the agent can recognize the credentials. You can use any file-copy tool.
For both Linux and Windows: any location whose path you store in the GOOGLE_APPLICATION_CREDENTIALS variable. The variable must be visible to the agent's process.
If both your workstation and your instance run Linux, you can copy the file with the gcloud command-line tool. The instructions assume that when you created the service account, the private-key credentials were stored on your workstation at a location saved in the variable CREDS.
Look up [YOUR-INSTANCE-NAME] and [YOUR-INSTANCE-ZONE] on the VM instances page of the Google Cloud console.
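A sketch of the copy step, assuming the gcloud CLI on the workstation and SSH access to the instance. The destination path is an assumption (as far as we know, it is the location the agent checks by default on Linux); any other path works if GOOGLE_APPLICATION_CREDENTIALS points to it. The run helper is hypothetical and prints commands unless DRY_RUN=0 is set.

```shell
# CREDS is the key file on the workstation; the bracketed values are placeholders.
CREDS="${CREDS:-$HOME/ops-agent-key.json}"
INSTANCE="${INSTANCE:-[YOUR-INSTANCE-NAME]}"
ZONE="${ZONE:-[YOUR-INSTANCE-ZONE]}"
# Assumed destination on the VM.
DEST="/etc/google/auth/application_default_credentials.json"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Copy the key to a temporary file in the remote home directory.
run gcloud compute scp "$CREDS" "${INSTANCE}:~/temp.json" --zone "$ZONE"

# Move it into place and restrict it to root, read-only.
run gcloud compute ssh "$INSTANCE" --zone "$ZONE" --command \
  "sudo mkdir -p /etc/google/auth && sudo mv ~/temp.json $DEST && sudo chown root:root $DEST && sudo chmod 0400 $DEST"
```

Restricting the file to root with mode 0400 keeps the private key unreadable by other users on the instance.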
Use the following command to restart the agent on your VM instance:
sudo service google-cloud-ops-agent restart
Configure the Ops Agent
Configuration Model
The Ops Agent uses a built-in default configuration. You can't modify it directly, but you can override it by creating a file that is merged with the built-in configuration whenever the agent restarts.
The configuration contains the following building blocks:
Receivers: describe what the agent collects.
Processors: describe how the agent can modify the collected information.
Service: links receivers and processors together to create data flows known as pipelines. The service element contains a pipelines element, which can in turn contain multiple pipelines.
User-specified configuration
To override the built-in default configuration, add new configuration elements to the configuration file. The user-specified configuration is merged with the built-in configuration whenever the agent restarts. Put the configuration for the Ops Agent in the below files:
Each logging receiver supports different configuration options depending on the value of its type element:
files receivers:
include_paths: A list of filesystem paths to read by tailing each file. A wildcard (*) can be used in the paths.
exclude_paths: Optional. A list of filesystem path patterns to exclude from the set matched by include_paths.
fluent_forward receivers:
listen_host: The IP address to listen on. The default value is 127.0.0.1.
listen_port: The port to listen on. The default value is 24224.
syslog receivers:
transport_protocol: Supported values are tcp and udp. The default value is tcp.
listen_host: The IP address to listen on. The default value is 0.0.0.0.
listen_port: The port to listen on. The default value is 5140.
tcp receivers:
format: The log format. Required. The supported value is json.
listen_host: The IP address to listen on. The default value is 127.0.0.1.
listen_port: The port to listen on. The default value is 5170.
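To make the options above concrete, the sketch below writes a receivers fragment to a scratch file. The receiver IDs (syslog_in, tcp_in, forward_in) are hypothetical, and the values simply restate the defaults listed above. A receiver does nothing until a pipeline references it, so this fragment alone is not a complete configuration.

```shell
# Write the fragment to a scratch file (a temporary file unless OPS_AGENT_FRAGMENT is set).
FRAGMENT="${OPS_AGENT_FRAGMENT:-$(mktemp)}"

cat > "$FRAGMENT" <<'EOF'
logging:
  receivers:
    syslog_in:               # hypothetical receiver ID
      type: syslog
      transport_protocol: udp
      listen_host: 0.0.0.0
      listen_port: 5140
    tcp_in:                  # hypothetical receiver ID
      type: tcp
      format: json
      listen_host: 127.0.0.1
      listen_port: 5170
    forward_in:              # hypothetical receiver ID
      type: fluent_forward
      listen_host: 127.0.0.1
      listen_port: 24224
EOF

# Show what was written.
cat "$FRAGMENT"
```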
Logging processors
The processors element contains a set of processing directives, each identified by a PROCESSOR_ID. A processor describes how to manipulate the information collected by a receiver.
Each processor must have a unique identifier and include a type element. The valid types are:
parse_json: Parse JSON-formatted structured logs.
parse_multiline: Parse multiline logs.
parse_regex: Parse text-formatted logs with regex patterns to turn them into JSON-formatted structured logs.
exclude_logs: Exclude logs that match specified rules.
modify_fields: Set or transform fields in log entries.
The pipelines element can contain multiple pipeline IDs and definitions. Each pipeline definition consists of:
receivers: Required for new pipelines. A list of receiver IDs, whose order is irrelevant. The pipeline collects data from all of the listed receivers.
processors: A list of processor IDs. The order of the processor IDs matters: each record is run through the processors in the listed order.
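Putting receivers, processors, and pipelines together, a user-specified logging configuration might look like the sketch below. All IDs and log paths are hypothetical, and the install path in the closing comment (/etc/google-cloud-ops-agent/config.yaml) is an assumption about the Linux location, since the file list is not reproduced in this article.

```shell
# Write the example to a scratch file (a temporary file unless OPS_AGENT_CONFIG is set).
CONFIG="${OPS_AGENT_CONFIG:-$(mktemp)}"

cat > "$CONFIG" <<'EOF'
logging:
  receivers:
    app_files:               # hypothetical receiver ID
      type: files
      include_paths:
        - /var/log/myapp/*.log
      exclude_paths:
        - /var/log/myapp/debug.log
  processors:
    app_json:                # hypothetical processor ID
      type: parse_json
  service:
    pipelines:
      app_pipeline:          # hypothetical pipeline ID
        receivers: [app_files]
        processors: [app_json]
EOF

cat "$CONFIG"

# On a Linux VM (assumed path), install the file and restart the agent:
#   sudo cp "$CONFIG" /etc/google-cloud-ops-agent/config.yaml
#   sudo service google-cloud-ops-agent restart
```

The pipeline tails the matching files and parses each record as JSON before it is sent on.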
Metrics configurations
Metrics use the same configuration model described above:
receivers: A list of receiver definitions. A receiver describes the source of the metrics and can be shared among multiple pipelines.
processors: A list of processor definitions. A processor describes how to modify the metrics collected by a receiver.
service: Contains a pipelines section, which in turn contains a list of pipeline definitions. A pipeline connects a list of receivers and a list of processors to form the data flow.
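A metrics configuration follows the same shape. The sketch below is an assumption-laden example rather than a reference: it uses the hostmetrics receiver type and the exclude_metrics processor type as we understand them from the Ops Agent documentation, with a hypothetical 30-second collection interval and a pattern that drops process metrics.

```shell
# Write the example to a scratch file (a temporary file unless OPS_AGENT_METRICS_CONFIG is set).
METRICS_CONFIG="${OPS_AGENT_METRICS_CONFIG:-$(mktemp)}"

cat > "$METRICS_CONFIG" <<'EOF'
metrics:
  receivers:
    hostmetrics:             # receiver ID
      type: hostmetrics
      collection_interval: 30s
  processors:
    metrics_filter:          # processor ID
      type: exclude_metrics
      metrics_pattern:
        - agent.googleapis.com/processes/*
  service:
    pipelines:
      default_pipeline:      # pipeline ID
        receivers: [hostmetrics]
        processors: [metrics_filter]
EOF

cat "$METRICS_CONFIG"
```

Logging and metrics sections can live in the same configuration file, each with its own receivers, processors, and service elements.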
Troubleshooting the Ops Agent
Agent diagnostics tool for Linux VMs
The agent diagnostics tool collects critical local debugging information from your Linux VMs for the Ops Agent, the legacy Logging agent, and the legacy Monitoring agent. The debugging information includes project info, VM info, agent configuration, agent logs, and agent service status: information that typically requires manual work to gather. The tool also checks whether the local VM environment meets the requirements for the agents to function properly.
Before filing a customer case for an agent on a Linux VM, run the agent diagnostics tool and attach the collected information to the case after redacting any sensitive information.
The following command is used to retrieve the agent diagnostics tool
To locate the files with the collected information, follow the script's execution output. They are typically in the /var/tmp/google-agents directory unless you have customized the output directory.
Agent fails to install
The following are common errors that you may encounter when running the installation script:
If the operating system is not supported, the error message will look like this:
https://packages.cloud.google.com/yum/repos/google-cloud-ops-agent-el6-x86_64-all/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404 Not Found"
Trying other mirrors.
To address this issue, please refer to the below wiki article
https://wiki.centos.org/yum-errors
If the above article doesn't help to resolve this issue, please use https://bugs.centos.org/.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: google-cloud-ops-agent. Please verify its path and try again
The VM may already have the Cloud Logging agent or the Cloud Monitoring agent installed, which conflicts with the new agent. In that case, the error message looks like this:
Error:
Problem: problem with installed package stackdriver-agent-6.0.5-1.el8.x86_64 - package google-cloud-ops-agent-0.1.0-1.el8.x86_64 conflicts with stackdriver-agent provided by stackdriver-agent-6.0.5-1.el8.x86_64
To fix this error, do the following:
Save the custom configuration for the Cloud Monitoring agent and the Cloud Logging agent.
Uninstall the old Cloud Monitoring agent and Cloud Logging agent.
Agent is installed but not running
Agent services are not running
If the agent service is not running as expected, you might see the following status:
$ sudo service google-cloud-ops-agent status
● google-cloud-ops-agent.service - Google Cloud Ops Agent
Loaded: loaded (/lib/systemd/system/google-cloud-ops-agent.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Wed 2021-06-30 21:20:43 UTC; 6s ago
To fix this, use the following command:
sudo service google-cloud-ops-agent start
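The check and the fix can be combined in a small helper. This is a sketch under the assumption that an inactive agent reports "Active: inactive" in its status output, as in the example above; is_inactive and start_if_inactive are hypothetical function names.

```shell
# Succeed if the given status output reports an inactive unit.
is_inactive() {
  printf '%s\n' "$1" | grep -q 'Active: inactive'
}

# Start the agent only when its status output reports it as inactive.
start_if_inactive() {
  # The status command exits non-zero for a stopped service, so ignore its exit code.
  status_output=$(sudo service google-cloud-ops-agent status 2>&1) || true
  if is_inactive "$status_output"; then
    sudo service google-cloud-ops-agent start
  fi
}

# Usage on the VM:
#   start_if_inactive
```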
Conflict with currently installed agents
If the VM already has the Cloud Logging agent or the Cloud Monitoring agent installed, its configuration conflicts with the Ops Agent's configuration. To fix this error, you have two options:
Disable the conflicting section of the Ops Agent configuration file.
Disable the conflicting Cloud Logging agent or Cloud Monitoring agent.
Agent is running, but data is not ingested
Use Metrics Explorer to query the agent uptime metric, and verify that the agent component google-cloud-ops-agent-metrics or google-cloud-ops-agent-logging is writing to the metric.
In the Google Cloud console, click Monitoring.
In the navigation pane, click Metrics Explorer.
The steps below require you to SSH into the VM. To check whether the logging module is running, use the following command:
sudo systemctl status google-cloud-ops-agent"*"
Check the logging module log.
Logging module logs can be found at /var/log/google-cloud-ops-agent/subagents/*.log on Linux and at C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log on Windows. If there are no logs, the agent service is not running properly.
You might get a 403 error when writing to the Logging API. For example:
[2020/10/13 18:55:09] [ warn] [output:stackdriver:stackdriver.0] error
{
"error": {
"code": 403,
"message": "Cloud Logging API has not been used in project 147627806769 before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.",
"status": "PERMISSION_DENIED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.Help",
"links": [
{
"description": "Google developers console API activation",
"url": "https://console.developers.google.com/apis/api/logging.googleapis.com/overview?project=147627806769"
}
]
}
]
}
}
To fix this error, enable the Logging API and grant the Logs Writer role.
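From the command line, the fix might look like the sketch below. It assumes the gcloud CLI and sufficient IAM permissions. The project ID and service account email are hypothetical placeholders, and the run helper prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical values; replace them with your project and the agent's service account.
PROJECT_ID="${PROJECT_ID:-my-project}"
SA_EMAIL="${SA_EMAIL:-ops-agent-sa@${PROJECT_ID}.iam.gserviceaccount.com}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Enable the Cloud Logging API in the project.
run gcloud services enable logging.googleapis.com --project="$PROJECT_ID"

# Grant the Logs Writer role to the service account the agent uses.
run gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${SA_EMAIL}" --role="roles/logging.logWriter"
```

As the error message itself notes, a newly enabled API can take a few minutes to propagate before writes succeed.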
There might be a quota issue for the Logging API. You can fix it by raising the quota or reducing the log throughput.
The following error might appear in the module log:
{"error":"invalid_request","error_description":"Service account not enabled on this instance"}
This happens when you have deployed the agent with no service account.
Is the agent sending metrics to Cloud Monitoring?
Check the metrics module log.
The metrics module logs can be found in syslog. If there are no logs, the agent service isn't running properly.
A PermissionDenied error might occur when writing to the Monitoring API. This occurs when the permissions for the Ops Agent are not properly configured. To fix this error, enable the Monitoring API and grant the Monitoring Metric Writer role.
A ResourceExhausted error might occur when writing to the Monitoring API. This occurs when the project hits the limit of a Monitoring API quota. To fix this error, either raise the quota or reduce the metrics throughput.
The following error might appear in the module log:
{"error":"invalid_request","error_description":"Service account not enabled on this instance"}
This indicates that the agent was deployed with no service account.
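The three cases above can be turned into a small triage helper. This is only an illustrative sketch: suggest_fix is a hypothetical function, the matching is a plain substring test on a log line, and the remedy for the missing service account is inferred from the description rather than stated in the text.

```shell
# Map a line from the metrics module log to the remedy described above.
suggest_fix() {
  case "$1" in
    *PermissionDenied*)
      echo "Enable the Monitoring API and grant the Monitoring Metric Writer role." ;;
    *ResourceExhausted*)
      echo "Raise the Monitoring API quota or reduce the metrics throughput." ;;
    *"Service account not enabled on this instance"*)
      # Inferred remedy: the agent was deployed with no service account.
      echo "Attach a service account to the VM or provide service-account key credentials." ;;
    *)
      echo "No known remedy for this line." ;;
  esac
}

# Example with the error shown above.
suggest_fix '{"error":"invalid_request","error_description":"Service account not enabled on this instance"}'
```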
Frequently Asked Questions
Why is a service account preferred for authentication?
A service account is preferred for authentication because it is an account associated with a Google Cloud project rather than with a specific user.
What is a receiver in the configuration model?
A receiver is an element that describes what is collected by the agent.
What is the default value of listen_port for syslog receivers?
The default value of listen_port for syslog receivers is 5140.
Conclusion
In this article, we discussed how to manage the Ops Agent: authorizing it, configuring it, and troubleshooting it.