Observability Start-Up Guide

Introduction

What is Observability?

In the ever-evolving landscape of distributed system operations, ensuring the reliability, performance, and scalability of complex applications has become increasingly difficult. System Observability has emerged as a critical practice that empowers IT organizations to effectively monitor and gain deep insights into the inner workings of their software systems. By systematically collecting and analyzing data about applications, infrastructure, and user interactions, Observability enables teams to proactively identify, diagnose, and resolve issues, ultimately leading to enhanced user experiences and operational efficiency.

What is OpenTelemetry?

OpenTelemetry is an open-source project that standardizes the collection of telemetry data from software systems, making it easier for organizations to gain holistic visibility into their environments. By seamlessly integrating with various programming languages, frameworks, and cloud platforms, OpenTelemetry simplifies the instrumentation of applications, allowing developers and operators to collect rich, actionable data about their systems' behavior.  The adoption of OpenTelemetry by software vendors and Application Performance Monitoring (APM) tools represents a significant shift in the observability landscape. OpenTelemetry has gained substantial traction across the industry due to its open-source, vendor-neutral approach and its ability to standardize telemetry data collection.

Many software vendors have started incorporating OpenTelemetry into their frameworks and libraries. Major cloud service providers like AWS, Azure, and Google Cloud have also embraced OpenTelemetry. In addition, many APM tools have integrated OpenTelemetry into their offerings. This integration allows users of these APM solutions to easily collect and visualize telemetry data from their applications instrumented with OpenTelemetry. It enhances the compatibility and flexibility of APM tools, making them more versatile in heterogeneous technology stacks.

Solution Architecture (Component Description)

Getting Started

Introduction

The following provides an example setup to get started with Observability using Grafana for Universal Automation Center.

This set-up allows collecting Metrics, Trace, and Log data from the Universal Automation Center. The collected data is stored in your Grafana Cloud stack for analysis.

Grafana is selected for this Getting Started Guide as an example. Any other data store or analysis tool could also be used.  

Metrics

Metrics data can be collected from Universal Controller, Universal Agent, OMS and Universal Tasks of type Extension.

Metrics data is pulled through the Prometheus metrics Web Service endpoint (Metrics API) and via user-defined Universal Event OpenTelemetry metrics, which are exported to the Grafana Alloy metrics collector.
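
As a quick sanity check, the Metrics API endpoint can be queried directly. A minimal sketch, assuming the UCADDR, UCUSER, and UCPW environment variables that are configured later in this guide (UCADDR is assumed to contain host and port):

curl -s -u "$UCUSER:$UCPW" "http://$UCADDR/resources/metrics" | head -n 20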

The collected Metrics data is exported to your Grafana Cloud Stack for analysis.

To enable OpenTelemetry metrics, the Prometheus exporter block needs to be configured in the Grafana Alloy config file.
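
A minimal sketch of such a block, using the component names from the full example configuration later in this guide:

otelcol.exporter.prometheus "default" {
  // Forward converted OTel metrics to the Prometheus remote_write sender
  forward_to = [prometheus.remote_write.uc.receiver]
}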

Trace

Universal Controller manually instruments OpenTelemetry traces on Universal Controller (UC), OMS, Universal Agent (UA), and Universal Task Extension interactions associated with task instance executions, agent registration, and deployment of Universal Tasks of type Extension.

The collected Trace data is stored in your Grafana Cloud Stack for analysis. 

To enable tracing, an OpenTelemetry span exporter must be configured in Grafana Alloy.
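
A minimal sketch of a span exporter block, matching the full example configuration later in this guide (the endpoint placeholder must be replaced with the Tempo endpoint from your Grafana Cloud stack):

otelcol.exporter.otlp "default" {
  client {
    endpoint = "Tempo endpoint given by Grafana"
    auth     = otelcol.auth.basic.tempo.handler
  }
}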




Prerequisites

The sample setup will be done on a single on-premise Linux server and Grafana Cloud.

Server Requirements

  • Linux Server 
    • Memory: 16GB RAM
    • Storage: 70GB Net storage 
    • CPU: 4 CPU
    • Distribution: Any major Linux distribution 
    • Administrative privileges are required for the installation and configuration of the required Observability tools
  • Ports

The following default ports will be used.

Application                      Port
Grafana Alloy "Prometheus"       9090 (http)
Grafana Alloy "OTEL Collector"   4317 (grpc), 4318 (http)
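
Once the stack is running, a quick check that these ports are listening can be done from a shell (a sketch; ss and the port list above are assumptions about your environment):

ss -ltn | grep -E '9090|4317|4318'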


Pre-Installed Software Components

It is assumed that the following components are installed and configured properly:

  • Universal Agent 7.5.0.0 or higher
  • Universal Controller 7.5.0.0 or higher

Please refer to the Installation and Applying Maintenance documentation and the Universal Agent UNIX Quick Start Guide for further information on how to install Universal Agent and Universal Controller.

Required Software for Observability  

The following Opensource Software needs to be installed and configured for use with Universal Automation Center.

Note: This Start-Up Guide has been tested with the software versions provided in the table below.

Software        Version   Linux Archive
Grafana Alloy   1.43      https://github.com/grafana/alloy/releases

Configuration

Grafana Alloy setup

Set up a Grafana Cloud Account

To start using Observability with Grafana, a Grafana Cloud account is needed. This can be created on Grafana's website. There is an unlimited free trial that allows for data retention of up to 14 days.

Once an account is created, Grafana will set up an instance to which all Observability data is sent.

Setup Grafana Alloy

To get Observability data into your Grafana Cloud stack, Grafana Alloy is needed.

For a detailed installation guide and instructions on how to set up Grafana Alloy on your own, see the Grafana Alloy documentation.

Grafana Alloy is installed using the package repository on your system.
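
For example, on Debian-based systems the Grafana repository can be added and Alloy installed as follows (a sketch following the Grafana Alloy documentation; adjust for your distribution):

# Add the Grafana package repository and install Alloy (Debian/Ubuntu example)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install alloy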

Grafana Alloy is installed as a system service on your system, in which the environment variables for the Grafana Cloud stack need to be configured.

To edit the environment file for the service head to either:

  • Debian-based systems: /etc/default/alloy
  • RedHat or SUSE-based systems: /etc/sysconfig/alloy

It is also recommended to change the user running Alloy to the same user that runs the Universal Controller so that the log data can be collected (see the sketch after the file below). The following shows an example of the system config file:

## Path:
## Description: Grafana Alloy settings
## Type: string
## Default: ""
## ServiceRestart: alloy
#
# Command line options for Alloy.
#
# The configuration file holding the Alloy config.
CONFIG_FILE="/etc/alloy/config.alloy"

# User-defined arguments to pass to the run command.
CUSTOM_ARGS=
Tempo_user="userid of Grafana Tempo from your Cloud Instance"
Loki_user="userid of Grafana Loki from your Cloud Instance"
Prometheus_user="userid of Grafana Prometheus from your Cloud Instance"
Tempo_PW="Password of Grafana Tempo from your Cloud Instance"
Loki_PW="Password of Grafana Loki from your Cloud Instance"
Prometheus_PW="Password of Grafana Prometheus from your Cloud Instance"
UCUSER="user for login"
UCPW="password for login"
UCADDR="address of your Universal Controller"
# Restart on system upgrade. Defaults to true.
RESTART_ON_UPGRADE=true
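
To change the user that runs Alloy, as recommended above, a systemd override can be used. A sketch, where uc_user is a placeholder for the account that runs the Universal Controller:

# Create an override for the alloy service and set the user
sudo systemctl edit alloy
# In the editor, add:
# [Service]
# User=uc_user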

Once the changes are made, run the following commands to restart Grafana Alloy with the new environment variables:

sudo systemctl daemon-reload
sudo systemctl restart alloy

Configuring Grafana Alloy

To configure Grafana Alloy, the configuration file at /etc/alloy/config.alloy needs to be changed.

It follows a pipeline format in which data sent to Grafana Alloy is received, processed, and forwarded to the Grafana Cloud stack.

There are multiple additional configurations that can make ingesting your Observability data easier and more efficient.

To find more information about the different configuration blocks, head to the official Grafana documentation for Alloy.

The following configuration is an example set-up for collecting Metric, Trace, and Log data.


config.alloy
//Metric pipeline

//UC metrics
prometheus.scrape "controller" {
  targets = [{
    __address__ = env("UCADDR"),
  }]
  forward_to = [prometheus.remote_write.uc.receiver]
  job_name = "uc"
  scrape_interval = "60s"
  metrics_path = "/resources/metrics"

  basic_auth {
    username = env("UCUSER")
    password = env("UCPW")
  }
  extra_metrics = true
}

//Alloy metrics
prometheus.scrape "alloy" {
  targets = [{
    __address__ = "0.0.0.0:12345",
  }]
  forward_to = [prometheus.remote_write.uc.receiver]
  extra_metrics = true
}

//Sender -> Grafana
prometheus.remote_write "uc" {
  endpoint {
    url = "Prometheus endpoint given by Grafana"

    basic_auth {
      username = env("Prometheus_user")
      password = env("Prometheus_PW")
    }
  }
}

// Tracing pipeline

//Sender -> Grafana
otelcol.exporter.otlp "default" {
  client {
    endpoint = "Tempo endpoint given by Grafana"
    auth = otelcol.auth.basic.tempo.handler
  }
}

//Authblock
otelcol.auth.basic "tempo" {
  username = env("Tempo_user")
  password = env("Tempo_PW")
}

//Sender -> Prometheus [Extension metrics]
otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.uc.receiver]
  resource_to_telemetry_conversion = true
}

otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs = []
    traces = [otelcol.exporter.otlp.default.input]
  }
}

otelcol.processor.memory_limiter "default" {
  check_interval = "60s"
  limit = "5GiB"

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs = []
    traces = [otelcol.processor.batch.default.input]
  }
}

//Trace, extension metrics receiver
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "localhost:4317"
  }

  http {
    endpoint = "localhost:4318"
  }

  output {
    metrics = [otelcol.processor.batch.default.input]
    logs = []
    traces = [otelcol.processor.batch.default.input]
  }
}

// Log file pipeline

loki.source.file "uc_logs" {
  targets = [
    {__path__ = "/path/to/your/uc_log/file"},
  ]
  forward_to = [loki.process.uclogs.receiver]
}

loki.process "uclogs" {
  forward_to = [loki.write.pointer.receiver]

  stage.json {
    expressions = {
      log = "",
      ts = "timestamp",
      service_name = "service_name",
      color = "color",
      detected_level = "detected_level",
      filename = "filename",
    }
  }

  stage.timestamp {
    source = "ts"
    format = "RFC3339"
  }

  stage.json {
    source = "log"
    expressions = {
      is_secret = "",
      level = "",
      log_line = "message",
    }
  }

  stage.drop {
    source = "is_secret"
    value = "true"
  }

  stage.labels {
    values = {
      level = "",
      service_name = "service_name",
      service_namespace = "service_namespace",
      color = "",
      detected_level = "",
      filename = "",
    }
  }

  stage.template {
    source = "service_name"
    template = "{{ if eq .Value \"unknown_service\" }}controller{{ else }}controller{{ end }}"
  }
  stage.template {
    source = "service_namespace"
    template = "{{ if eq .Value \"unknown_service\" }}Stonebranch.UAC{{ else }}Stonebranch.UAC{{ end }}"
  }
  stage.labels {
    values = {
      level = "",
      service_name = "service_name",
      service_namespace = "",
      color = "",
      detected_level = "",
      filename = "",
    }
  }

  stage.output {
    source = "log_line"
  }
}

loki.write "pointer" {
  endpoint {
    url = "Loki endpoint given by Grafana"
    basic_auth {
      username = env("Loki_user")
      password = env("Loki_PW")
    }
  }
}

Once restarted, Grafana Alloy will collect Observability data from your Universal Controller. The Grafana Alloy log can be checked with the command: sudo journalctl -u alloy.service

Universal Controller

Description:

The following describes the steps to enable metrics and traces from Universal Controller.

Installation Steps:

Update Universal Controller Properties

The following uc.properties need to be set in order to enable metrics and traces from Universal Controller:

uc.otel.exporter.otlp.metrics.endpoint
    The OTLP metrics endpoint to connect to. Must be a URL with a scheme of either http or https based on the use of TLS. Default is http://localhost:4317 when the protocol is grpc, and http://localhost:4318/v1/metrics when the protocol is http/protobuf.

uc.otel.exporter.otlp.traces.endpoint
    The OTLP traces endpoint to connect to. Must be a URL with a scheme of either http or https based on the use of TLS. Default is http://localhost:4317 when the protocol is grpc, and http://localhost:4318/v1/traces when the protocol is http/protobuf.

Please refer to the uc.properties documentation for a list of all configuration options.

Sample Configuration Files

The following provides a minimum uc.properties file: 

uc.properties

# Enable metrics and trace from UC Controller

# The OTLP traces endpoint to connect to (grpc):
uc.otel.exporter.otlp.traces.endpoint http://localhost:4317

# The OTLP metrics endpoint to connect to (grpc): 
uc.otel.exporter.otlp.metrics.endpoint http://localhost:4317
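
If the http/protobuf protocol is used instead of grpc, the signal-specific default paths from the table above apply. A sketch based on those defaults (the property that selects the protocol is not shown here; see the uc.properties documentation):

# The OTLP traces endpoint to connect to (http/protobuf):
uc.otel.exporter.otlp.traces.endpoint http://localhost:4318/v1/traces

# The OTLP metrics endpoint to connect to (http/protobuf):
uc.otel.exporter.otlp.metrics.endpoint http://localhost:4318/v1/metrics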

Universal Agent

Description:

The following describes the steps to enable tracing and metrics for UAG and OMS Server. 

The set-up described here uses the HTTP protocol. In addition to HTTP (the default), HTTPS is also supported.

Refer to the documentation on how to Enable and Configure SSL/TLS for OMS Server and UAG.

Installation Steps:

Enabling Metrics/Traces

Metrics and traces are turned off by default in both UAG and OMS Server. Two new options must be configured to enable metrics and traces.

Metrics:

Component     Configuration File Option
UAG           otel_export_metrics YES
OMS Server    otel_export_metrics YES

Traces:

Component     Configuration File Option
UAG           otel_enable_tracing YES
OMS Server    otel_enable_tracing YES

Configure Service Name

All applications using OpenTelemetry must register a service.name, including UAG and OMS Server.

Component     Configuration File Option
UAG           otel_service_name <agent_name>
OMS Server    otel_service_name <oms_agent_name>

Configuring OTLP Endpoint

Both the metrics and tracing engines push the relevant data to the OpenTelemetry collector using the HTTP(S) protocol (the gRPC protocol is NOT supported in this release). In most scenarios, the traces and metrics will be sent to the same collector, but this is not strictly necessary. To account for this, two new options are available in both UAG and OMS Server.

Metrics:

Component     Configuration File Option
UAG           otel_metrics_endpoint http://localhost:4318
OMS Server    otel_metrics_endpoint http://localhost:4318

Traces: 

Component     Configuration File Option
UAG           otel_trace_endpoint http://localhost:4318
OMS Server    otel_trace_endpoint http://localhost:4318

Configure how often to export the metrics from UAG and OMS Server

Component     Configuration File Option
UAG           otel_metrics_export_interval 60
OMS Server    otel_metrics_export_interval 60

The value:

  • defaults to the OpenTelemetry default of 60 seconds
  • is specified in seconds
  • must be positive (i.e., > 0)
  • cannot exceed 2147483647


Sample Configuration Files

The following provides the sample set-up for UAG and OMS Server.

The otel_metrics_export_interval is not set; in that case the default value of 60 seconds applies.

UAG
# /etc/universal/uags.conf:

otel_export_metrics YES
otel_enable_tracing YES
otel_service_name agt_lx_wiesloch_uag
otel_metrics_endpoint http://localhost:4318
otel_trace_endpoint http://localhost:4318


OMS
# /etc/universal/omss.conf:

otel_export_metrics YES
otel_enable_tracing YES
otel_service_name agt_lx_wiesloch
otel_metrics_endpoint http://localhost:4318
otel_trace_endpoint http://localhost:4318

Note: After adjusting uags.conf and omss.conf, restart the Universal Agent.

Universal Agent Restart
sudo /opt/universal/ubroker/ubrokerd restart

Official Documentation: Links to OMS and UAG open telemetry configuration options.


Universal Automation Center Observability Tutorials

Tutorial 1: Metric Data Collection and Analysis using Grafana

Configure a sample Dashboard in Grafana (add prometheus datasource, create visualization)

In the following example, a Grafana Dashboard with one visualization showing the OMS Server Status will be configured. The datasource is automatically configured by Grafana Alloy.

The following Steps need to be performed:

  1. Log-in to Grafana
  2. Create a new Dashboard and add a new visualization to it
  3. Configure visualization
  4. Display Dashboard
Log-in to your Grafana Instance

Create a new Dashboard and add a new visualization to it

Configure Visualization
  1. Select Prometheus as Data Source
  2. Select the Metric uc_oms_server_status
  3. Enter a Title and Description e.g. OMS Server Status
  4. In the Legend Options, enter {{instance}}
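
For reference, the underlying query with a per-instance breakdown corresponds to the following PromQL (the same expression is used by the OMS Server Status widget later in this guide):

sum by(instance) (uc_oms_server_status)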

Display Dashboard

Tutorial 2: Traces Data Collection and Analysis using Grafana

This tutorial will show how to collect and visualize traces from the different UAC components in Grafana.

This tutorial requires that all configuration steps from the Observability Start-Up Guide have already been performed. 

After finishing this tutorial, you will be able to collect and display Metrics and Traces in Grafana.

Universal Controller manually instruments OpenTelemetry traces on Universal Controller (UC), OMS, Universal Agent (UA), and Universal Task Extension interactions associated with task instance executions, agent registration, and deployment of Universal Tasks of type Extension.

To enable tracing, an OpenTelemetry span exporter must be configured in Grafana Alloy.

The collected Trace data is used for analysis in Grafana. 

The following outlines the architecture:

Installation

Prerequisites

This tutorial requires that all configuration steps from the Observability Start-Up Guide have been already performed. 

Server Requirements

The same Linux server as in the first part of the tutorial will be used.

Pre-Installed Software Components

This tutorial requires the following software, installed during the first tutorial.

Software               Version             Linux Archive
Universal Controller   7.5.0.0 or higher   Download via Stonebranch Support Portal
Universal Agent        7.5.0.0 or higher   Download via Stonebranch Support Portal
Grafana Alloy          1.43 or higher      https://github.com/grafana/alloy/releases


1. Enable Tracing in Universal Controller

Update uc.properties
Enable Tracing in uc.properties
# stop tomcat ( adjust according to your environment )
sudo /usr/share/apache-tomcat-9.0.80/bin/shutdown.sh

# enable traces (grpc protocol) for Universal Controller
sudo vi /usr/share/apache-tomcat-9.0.80/conf/uc.properties
uc.otel.exporter.otlp.traces.endpoint http://localhost:4317

# start tomcat ( adjust according to your environment )
sudo /usr/share/apache-tomcat-9.0.80/bin/startup.sh

Official Documentation: link to uc.properties open telemetry properties.

Start/Stop Tomcat
# stop/start Tomcat ( adjust according to your environment )
sudo /usr/share/apache-tomcat-9.0.80/bin/shutdown.sh
sudo /usr/share/apache-tomcat-9.0.80/bin/startup.sh

Optionally:
sudo /etc/init.d/uac_tomcat stop
sudo /etc/init.d/uac_tomcat start
Checks Universal Controller
Log Files:
# Adjust to the location of your Tomcat directory
sudo cat /usr/share/apache-tomcat-9.0.80/uc_logs/uc.log

2. Enable Tracing in Universal Agents

For the endpoint, give the IP address of the system on which Grafana Alloy is running.

Configure uags.conf and omss.conf of the Universal Agent
Enable Tracing in uags.conf and omss.conf
# Add the following to:
# uags.conf and omss.conf

sudo vi /etc/universal/uags.conf
sudo vi /etc/universal/omss.conf

otel_enable_tracing YES
otel_trace_endpoint http://192.168.88.17:4318

# Restart the Agent
sudo /opt/universal/ubroker/ubrokerd restart
Start/Stop Universal Agent
# Start/Stop Universal Agent
sudo /opt/universal/ubroker/ubrokerd start
sudo /opt/universal/ubroker/ubrokerd stop
Checks Universal Agents
Log Files:
# UAG 
sudo vi /var/opt/universal/uag/logs/agent.log
# Broker
sudo vi /var/opt/universal/log/unv.log
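
To quickly verify that the OpenTelemetry options were picked up after the restart, the logs can be searched for related messages (a sketch; "otel" as a search term and the exact log text are assumptions that vary by version):

# UAG
sudo grep -i otel /var/opt/universal/uag/logs/agent.log
# Broker
sudo grep -i otel /var/opt/universal/log/unv.log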


3. Configure a sample tracing dashboard in Grafana (create visualization, view trace)

In this example, a Grafana dashboard with one tracing visualization showing incoming traces from the controller will be configured. The datasource is automatically configured by Grafana Alloy.
The following steps need to be performed:

  1. Log-in to Grafana
  2. Create or access a dashboard and add a visualization
  3. Configure the visualization with the Grafana Tempo data source set
  4. Click on a trace to open detailed information about the trace


Log-in to Grafana

Head to the Dashboards tab, create or access a dashboard, and add a visualization to it

Configure the visualization: (Examples)
  1. Select the Tempo data source
  2. Select the "search" tab for a general view of traces or the "traceid" tab for a specific trace
  3. Choose the controller as the service name and choose "all" operation names
  4. Add the "sort by" transformation set to start time
  5. Select that all tooltips should be shown
  6. Save the visualization


Click on a given trace and open the link in a new tab


The trace will now be shown in a detailed view




Example Widgets inside of Grafana

Grafana has sharing options for Dashboards and Widgets. Copying the JSON model of one of the widgets and pasting it will result in the Widget being present in the dashboard. Make sure beforehand that your datasource is set up correctly, and double-check the widget if there are any problems.


List of Business Services with detailed view (advanced)

Making a List of Business services

Description:

This Widget shows a list of Business Services available on the Universal Controller. Clicking one of the Business Services opens a nested dashboard with more detailed information about the Business Service.

This Widget requires the optional metric label "security_business_service" in order to function properly.

Configuration:

The Widget uses the "uc_history_total" metric and the "Time series to table" transformation, which is available in Grafana.

The metric is filtered by the current instance (if you have multiple) and summed by the "security_business_services" label.
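
As a sketch, the underlying query could look like the following PromQL; the $query1 instance variable and the 24h window are assumptions based on the template dashboard shown below:

sum by(security_business_services) (delta(uc_history_total{instance="$query1"}[24h]))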

Under the Transformation tab choose the "Time series to table" transformation and set it to the following setting:

Changing the Widget type to "Table" will now result in the Widget seen above, with the number of launched tasks per task.

To hide the number of launched tasks, it is possible to set an override in the Widget editor.

Now we set up a new dashboard for the detailed view of the business services.

It is recommended to use the template dashboard given below and paste it as a new dashboard for a quick out-of-the-box experience. This dashboard is configured to use the default Prometheus data source provided by Grafana and comes with several pre-configured variables.

JSON Model
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
          "uid": "-- Grafana --"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 1,
  "id": 45,
  "links": [],
  "panels": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "grafanacloud-prom"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "cellOptions": {
              "type": "auto"
            },
            "inspect": false
          },
          "links": [
            {
              "targetBlank": true,
              "title": "Task details",
              "url": "/d/de2xd2cic64g0e/?var-query0=${__data.fields.task_name}&var-query1=${query1}&var-query2=${query0}"
            }
          ],
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          }
        },
        "overrides": [
          {
            "matcher": {
              "id": "byName",
              "options": "Trend #B"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "Launches"
              }
            ]
          },
          {
            "matcher": {
              "id": "byName",
              "options": "task_name"
            },
            "properties": [
              {
                "id": "displayName",
                "value": "Task Name"
              }
            ]
          }
        ]
      },
      "gridPos": {
        "h": 26,
        "w": 10,
        "x": 0,
        "y": 0
      },
      "id": 1,
      "options": {
        "cellHeight": "sm",
        "footer": {
          "countRows": false,
          "fields": "",
          "reducer": [
            "sum"
          ],
          "show": false
        },
        "showHeader": true,
        "sortBy": []
      },
      "pluginVersion": "11.4.0-77868",
      "targets": [
        {
          "datasource": {
            "type": "prometheus",
            "uid": "grafanacloud-prom"
          },
          "disableTextWrap": false,
          "editorMode": "builder",
          "expr": "sum by(task_name) (delta(uc_history_total{security_business_services=\"$query0\"}[24h]))",
          "fullMetaSearch": false,
          "hide": false,
          "includeNullMetadata": true,
          "instant": false,
          "legendFormat": "__auto",
          "range": true,
          "refId": "B",
          "useBackend": false
        }
      ],
      "title": "Successful Tasks",
      "transformations": [
        {
          "id": "timeSeriesTable",
          "options": {
            "B": {
              "timeField": "Time"
            }
          }
        }
      ],
      "type": "table"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "grafanacloud-prom"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {
              "options": {
                "match": "null",
                "result": {
                  "index": 0,
                  "text": "0"
                }
              },
              "type": "special"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 8,
        "w": 14,
        "x": 10,
        "y": 0
      },
      "id": 5,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "percentChangeColorMode": "standard",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "11.4.0-77868",
      "targets": [
        {
          "editorMode": "builder",
          "expr": "sum by(task_name) (delta(uc_history_total{security_business_services=\"$query0\", task_instance_status=\"Success\"}[24h]))",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Successful",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "grafanacloud-prom"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {
              "options": {
                "match": "null",
                "result": {
                  "index": 0,
                  "text": "0"
                }
              },
              "type": "special"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 14,
        "x": 10,
        "y": 8
      },
      "id": 2,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "percentChangeColorMode": "standard",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "11.4.0-77868",
      "targets": [
        {
          "editorMode": "builder",
          "expr": "sum by(task_name) (delta(uc_history_total{security_business_services=\"$query0\", task_instance_status=\"Failed\"}[24h]))",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Failed",
      "type": "stat"
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "grafanacloud-prom"
      },
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "mappings": [
            {
              "options": {
                "match": "null",
                "result": {
                  "index": 0,
                  "text": "0"
                }
              },
              "type": "special"
            }
          ],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {
                "color": "green",
                "value": null
              },
              {
                "color": "red",
                "value": 80
              }
            ]
          }
        },
        "overrides": []
      },
      "gridPos": {
        "h": 9,
        "w": 14,
        "x": 10,
        "y": 17
      },
      "id": 4,
      "options": {
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "percentChangeColorMode": "standard",
        "reduceOptions": {
          "calcs": [
            "lastNotNull"
          ],
          "fields": "",
          "values": false
        },
        "showPercentChange": false,
        "textMode": "auto",
        "wideLayout": true
      },
      "pluginVersion": "11.4.0-77868",
      "targets": [
        {
          "editorMode": "code",
          "expr": "sum by(task_name) (delta(uc_history_total{security_business_services=\"$query0\", task_instance_status=\"Start Failure\"}[24h]))",
          "legendFormat": "__auto",
          "range": true,
          "refId": "A"
        }
      ],
      "title": "Start Failure",
      "type": "stat"
    }
  ],
  "preload": false,
  "schemaVersion": 40,
  "tags": [],
  "templating": {
    "list": [
      {
        "current": {
          "text": "",
          "value": ""
        },
        "definition": "label_values(security_business_services)",
        "label": "Business Service",
        "name": "query0",
        "options": [],
        "query": {
          "qryType": 1,
          "query": "label_values(security_business_services)",
          "refId": "PrometheusVariableQueryEditor-VariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "type": "query"
      },
      {
        "current": {
          "text": "",
          "value": ""
        },
        "definition": "label_values(instance)",
        "label": "Instance",
        "name": "query1",
        "options": [],
        "query": {
          "qryType": 1,
          "query": "label_values(instance)",
          "refId": "PrometheusVariableQueryEditor-VariableQuery"
        },
        "refresh": 1,
        "regex": "",
        "type": "query"
      }
    ]
  },
  "time": {
    "from": "now-24h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "browser",
  "title": "Business Group Information",
  "uid": "ce2xb7h1w8hs0e",
  "version": 21,
  "weekStart": ""
}

To connect the two dashboards, head back to the Widget and add a "Data Link" with the following link:

Use the UID of the pasted dashboard after the "d/" in order to connect the dashboards.
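
For example, a Data Link URL could look like the following; the UID placeholder and the passed variable names are assumptions based on the template above:

/d/<uid-of-pasted-dashboard>/?var-query0=${__data.fields.security_business_services}&var-query1=${query1}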

This concludes the setup of the Business service widget.

Widgets for System data (Agents, OMS)

Number of Agents connected

Description:

This Widget shows the number of Agents connected and has an indicator for the upper limit of how many Agents can connect to the controller.

The upper limit depends on the number of licenses the controller owns.

This Widget uses the “Time series” configuration to give a real time update on the Agent status.

Configuration:

The Widget is constructed using two metrics derived from the controller:

The first query uses the "uc_license_agents_distributed_max" metric, which shows the maximum number of agent licenses available and is used to draw the upper-limit graph.

The second query uses the "uc_license_agents_distributed_used" metric, which shows the number of Agents currently connected to the controller.

Below is an example of the configuration used to draw a filled graph that runs through the time series.


Below are the two PromQL queries used to configure the widget:

uc_license_agents_distributed_max
uc_license_agents_distributed_used
JSON Model
{
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "This Widget shows the number of Agents connected and has an indicator for the upper limit of how many Agents can connect to the controller. The upper limit depends on the number of licenses the controller owns.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "drawStyle": "line",
        "lineInterpolation": "linear",
        "barAlignment": 0,
        "lineWidth": 1,
        "fillOpacity": 58,
        "gradientMode": "none",
        "spanNulls": true,
        "insertNulls": false,
        "showPoints": "never",
        "pointSize": 5,
        "stacking": {
          "mode": "none",
          "group": "A"
        },
        "axisPlacement": "auto",
        "axisLabel": "",
        "axisColorMode": "series",
        "scaleDistribution": {
          "type": "linear"
        },
        "axisCenteredZero": false,
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        },
        "thresholdsStyle": {
          "mode": "off"
        },
        "axisGridShow": false,
        "axisSoftMin": 0,
        "lineStyle": {
          "fill": "solid"
        }
      },
      "color": {
        "mode": "palette-classic"
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          },
          {
            "color": "red",
            "value": 80
          }
        ]
      },
      "decimals": 0
    },
    "overrides": [
      {
        "matcher": {
          "id": "byFrameRefID",
          "options": "Maximum Agents"
        },
        "properties": [
          {
            "id": "displayName",
            "value": "Max Number of Agents"
          },
          {
            "id": "color",
            "value": {
              "fixedColor": "dark-red",
              "mode": "fixed",
              "seriesBy": "max"
            }
          },
          {
            "id": "custom.fillOpacity",
            "value": 0
          }
        ]
      },
      {
        "matcher": {
          "id": "byFrameRefID",
          "options": "#Agents connected"
        },
        "properties": [
          {
            "id": "displayName",
            "value": "Total Number of used Agents"
          },
          {
            "id": "color",
            "value": {
              "fixedColor": "green",
              "mode": "shades"
            }
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 18,
    "y": 29
  },
  "id": 11,
  "options": {
    "tooltip": {
      "mode": "single",
      "sort": "none"
    },
    "legend": {
      "showLegend": true,
      "displayMode": "list",
      "placement": "bottom",
      "calcs": []
    }
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "uc_license_agents_distributed_max",
      "fullMetaSearch": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "Maximum Agents",
      "useBackend": false
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "uc_license_agents_distributed_used",
      "fullMetaSearch": false,
      "hide": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "#Agents connected",
      "useBackend": false
    }
  ],
  "title": "#Agents connected",
  "type": "timeseries"
}

OMS Server Status

Description:

OMS Server Status shown in a "Status History" graph. Depending on the number of OMS servers connected, the graph changes to represent them.

The graph will also show the different states an OMS server can be in.

Configuration:

This query is made from the "uc_oms_server_status" metric using the following PromQL:

sum by(instance) (uc_oms_server_status)

The metric can report three different values depending on the OMS status: 1 for "running", 0 for "not running", and -1 for "in doubt".

To ensure the Widget shows this information, we add value mappings for the different states the server can be in.


JSON model
{
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "OMS Server Status shown in a \"Status History\" graph. Depending on the number of OMS server connected the graph changes to represent them. Every addition of an OMS server needs a new query for overriding the name.",
  "fieldConfig": {
    "defaults": {
      "mappings": [
        {
          "options": {
            "0": {
              "color": "red",
              "index": 0,
              "text": "Offline"
            },
            "1": {
              "color": "green",
              "index": 1,
              "text": "Online"
            },
            "-1": {
              "color": "yellow",
              "index": 2,
              "text": "In doubt"
            }
          },
          "type": "value"
        }
      ],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          }
        ]
      },
      "unit": "short"
    },
    "overrides": []
  },
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 0,
    "y": 29
  },
  "id": 1,
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "lastNotNull"
      ],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "background",
    "graphMode": "none",
    "justifyMode": "auto"
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(instance) (uc_oms_server_status)",
      "fullMetaSearch": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "OMS Server Status",
      "useBackend": false
    }
  ],
  "title": "OMS Server Status",
  "type": "stat"
}



Active OMS Server Client connections

Description:

Widget that shows how many Clients are connecting to an OMS server. It counts the connections from Agents and the Controller that connect to the OMS server.


Configuration:

This query uses the "ua_active_connections" metric to read out the number of active connections to all OMS servers and shows them using the "Stat" graph.

To set up the Widget, select the metric using the metrics browser or paste the following line in the code builder of Grafana:

sum by(instance) (ua_active_connections)

As more OMS servers are sending metrics to the OTelCollector, the stats graph will update to represent them.

Furthermore, in the settings on the right side under the "Value mappings" tab, we add a mapping that says "0 → No active connections".

JSON Model
{
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "Widget that shows how many Clients are connecting to an OMS server. It will count the connections from agents and controller that connect to the OMS server",
  "fieldConfig": {
    "defaults": {
      "mappings": [
        {
          "options": {
            "0": {
              "color": "light-red",
              "index": 0,
              "text": "No active connections"
            }
          },
          "type": "value"
        }
      ],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          }
        ]
      },
      "color": {
        "mode": "thresholds"
      }
    },
    "overrides": []
  },
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 12,
    "y": 37
  },
  "id": 4,
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "lastNotNull"
      ],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value",
    "graphMode": "area",
    "justifyMode": "auto"
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(instance) (ua_active_connections)",
      "fullMetaSearch": false,
      "hide": false,
      "includeNullMetadata": false,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "A",
      "useBackend": false
    }
  ],
  "title": "OMS Client connections",
  "type": "stat"
}


Widgets for observing Tasks and Task statuses

Tasks started in a set time period

Description:

This "Stat" graph shows how many tasks have been created in a time period that can be specified. This example shows the tasks from a 24h time period.

Configuration:

To create this Widget, we use the "uc_history_total" metric to receive all the data from tasks of the Universal Controller and Universal Agent.

When creating the query, use the metric browser to find the "uc_history_total" metric and choose the "Increase" operation from the "Range functions" tab and "Sum" from the "Aggregations" tab.

Set the "Increase" range to the time period you wish to observe (in the example, 24h) and set the "Sum by" label to "task_type".

The code builder should now look like this:

sum by(task_type) (increase(uc_history_total[24h]))




JSON Model
{
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 18,
    "y": 20
  },
  "id": 20,
  "title": "Tasks started in the last 24h",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(task_type) (increase(uc_history_total[24h]))",
      "fullMetaSearch": false,
      "hide": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "Task history 24h",
      "useBackend": false
    }
  ],
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "lastNotNull"
      ],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "value_and_name",
    "colorMode": "background",
    "graphMode": "area",
    "justifyMode": "center",
    "text": {
      "titleSize": 16,
      "valueSize": 16
    }
  },
  "fieldConfig": {
    "defaults": {
      "mappings": [
        {
          "options": {
            "0": {
              "color": "text",
              "index": 0,
              "text": "No Tasks yet"
            }
          },
          "type": "value"
        }
      ],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          }
        ]
      },
      "decimals": 0,
      "unit": "short"
    },
    "overrides": []
  },
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "How many tasks started in the last 24h\n",
  "pluginVersion": "10.1.4",
  "type": "stat"
}



Task duration split of tasks launched in a time period

Description:

This pie chart shows the duration split of tasks launched in a time period that can be specified. This example shows the tasks from a 24h time period.

Configuration:

To create this Widget, we use the "uc_task_instance_duration_seconds_bucket" metric, which provides the duration data for tasks of the Universal Controller and Universal Agent.

You can add this line of code directly into the code tab to receive the settings:

sum by(task_type) (changes(uc_task_instance_duration_seconds_bucket[24h]))

It is important that, in the standard options tab of the general settings, the unit is set to "duration (s)" and the decimals are set to at least 1 for more accuracy.

JSON Model
{
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "Shows the duration of tasks in the last given time period.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        }
      },
      "color": {
        "mode": "palette-classic"
      },
      "mappings": [],
      "decimals": 1,
      "unit": "dtdurations"
    },
    "overrides": []
  },
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 12,
    "y": 12
  },
  "id": 23,
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "lastNotNull"
      ],
      "fields": ""
    },
    "pieType": "pie",
    "tooltip": {
      "mode": "single",
      "sort": "none"
    },
    "legend": {
      "showLegend": true,
      "displayMode": "list",
      "placement": "right",
      "values": [
        "percent"
      ]
    }
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(task_type) (changes(uc_task_instance_duration_seconds_bucket[24h]))",
      "fullMetaSearch": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "Task durations",
      "useBackend": false
    }
  ],
  "title": "Task duration split",
  "type": "piechart"
}




Successful/Late Finish ratio shown in a Pie chart of a given Task type


Description:

Pie chart which shows the percentage of "Late Tasks" in reference to the total amount of tasks (Last 1h in this example; Linux Tasks in this example).

Configuration:

This pie chart is made up of two queries that represent the ratio of "Late Finish" tasks to successful tasks.

The first query uses the "uc_task_instance_late_finish_total" metric with a label filter on the specific Task type we want to observe.

Using the "Delta" operator gives the query a time period over which to observe the metric data (in this example, 1h).

The "Sum by" is set to "task_type" to ensure all metric data of the specified task is displayed.

Using an "Override", we name the query for the pie chart and set a color.

The second query is made up of the "uc_history_total" and the "uc_task_instance_late_finish_total" metrics, subtracting the "Late Finish" tasks from the total.

Similar to the first query, we specify a time period using the "Delta" operator and the "Sum by" operator, and set the label filter to the tasks we observe.

Using a "Binary operations with query" operator allows the second metric to be set to the "uc_task_instance_late_finish_total" metric, configured the same as the first query.

Using "-" as the operation results in each task being shown once and not counted a second time in the pie chart.

Using the "Override", we set a color and, optionally, a name for the pie chart.

The code for the queries is below:

first query

sum by(task_type) (delta(uc_task_instance_late_finish_total[1h]))

second query

sum by(task_type) (delta(uc_history_total[1h])) - sum by(task_type) (delta(uc_task_instance_late_finish_total[1h]))


JSON Model
{
  "datasource": {
    "type": "prometheus",
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
  },
  "description": "Pie chart which shows the percentage of \"Late Tasks\" in reference to the total amount of tasks (Last 1h in this example; Linux Tasks in this example)",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        }
      },
      "color": {
        "mode": "palette-classic"
      },
      "mappings": [
        {
          "options": {
            "0": {
              "index": 0,
              "text": "None"
            }
          },
          "type": "value"
        }
      ],
      "decimals": 0,
      "noValue": "-"
    },
    "overrides": [
      {
        "matcher": {
          "id": "byFrameRefID",
          "options": "Late_Finished_Tasks"
        },
        "properties": [
          {
            "id": "color",
            "value": {
              "fixedColor": "yellow",
              "mode": "fixed"
            }
          },
          {
            "id": "displayName",
            "value": "Late Finish Tasks"
          }
        ]
      },
      {
        "matcher": {
          "id": "byFrameRefID",
          "options": "Total_Linux_Tasks"
        },
        "properties": [
          {
            "id": "color",
            "value": {
              "fixedColor": "green",
              "mode": "fixed"
            }
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 8,
    "w": 6,
    "x": 12,
    "y": 20
  },
  "id": 22,
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "lastNotNull"
      ],
      "fields": ""
    },
    "pieType": "pie",
    "tooltip": {
      "mode": "single",
      "sort": "none"
    },
    "legend": {
      "showLegend": true,
      "displayMode": "list",
      "placement": "right",
      "values": [
        "percent"
      ]
    },
    "displayLabels": [
      "percent",
      "name"
    ]
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(task_type) (delta(uc_task_instance_late_finish_total[1h]))",
      "fullMetaSearch": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "Late_Finished_Tasks",
      "useBackend": false
    },
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "builder",
      "expr": "sum by(task_type) (delta(uc_history_total[1h])) - sum by(task_type) (delta(uc_task_instance_late_finish_total[1h]))",
      "fullMetaSearch": false,
      "hide": false,
      "includeNullMetadata": true,
      "instant": false,
      "legendFormat": "__auto",
      "range": true,
      "refId": "Total_Linux_Tasks",
      "useBackend": false
    }
  ],
  "title": "Task Instance Late Finish",
  "type": "piechart"
}


Widgets for Traces

Bar Chart of incoming traces


Description:

This Widget displays all the traces coming from the Universal Controller along with their duration. Hovering over a trace will give more information about the trace.

Configuration:

The bar chart uses the "Jaeger" data source and accesses all the traces that come from the Universal Controller. The query is configured as follows:

Once the query is set up, we need to add a transformation for the graph. Going to the "Transform" tab, choosing "Sort by", and sorting by the start time will result in the trace links matching the correct traces.

Going to the general settings tab, changing the X-axis to the start time, and setting the Y-axis to a log10 scale will allow for more visibility.

Changing the Tooltip to show all information allows the user to hover over a trace and inspect it more closely using Grafana's trace tools.

Clicking on the trace link will result in a new tab opening up for detailed views of the trace:


JSON Model
{
  "datasource": {
    "type": "jaeger",
    "uid": "ba9176e4-0b3b-437c-ab29-045d734b5b63"
  },
  "description": "This Widget displays all the traces coming from the controller and agent and displaying their duration. Hovering over a trace will give more information about the trace.",
  "fieldConfig": {
    "defaults": {
      "custom": {
        "lineWidth": 1,
        "fillOpacity": 80,
        "gradientMode": "hue",
        "axisPlacement": "left",
        "axisLabel": "",
        "axisColorMode": "series",
        "scaleDistribution": {
          "type": "log",
          "log": 10
        },
        "axisCenteredZero": false,
        "hideFrom": {
          "tooltip": false,
          "viz": false,
          "legend": false
        },
        "thresholdsStyle": {
          "mode": "off"
        }
      },
      "color": {
        "mode": "thresholds"
      },
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          }
        ]
      },
      "unit": "s"
    },
    "overrides": []
  },
  "gridPos": {
    "h": 10,
    "w": 12,
    "x": 0,
    "y": 17
  },
  "id": 37,
  "links": [],
  "options": {
    "orientation": "auto",
    "xTickLabelRotation": 0,
    "xTickLabelSpacing": 300,
    "showValue": "auto",
    "stacking": "none",
    "groupWidth": 0.7,
    "barWidth": 0.97,
    "barRadius": 0,
    "fullHighlight": false,
    "tooltip": {
      "mode": "multi",
      "sort": "none"
    },
    "legend": {
      "showLegend": false,
      "displayMode": "list",
      "placement": "bottom",
      "calcs": []
    },
    "xField": "Start time"
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "jaeger",
        "uid": "ba9176e4-0b3b-437c-ab29-045d734b5b63"
      },
      "queryType": "search",
      "refId": "Traces",
      "service": "controller"
    }
  ],
  "title": "Trace Log",
  "transformations": [
    {
      "id": "sortBy",
      "options": {
        "fields": {},
        "sort": [
          {
            "field": "Start time",
            "desc": false
          }
        ]
      }
    }
  ],
  "type": "barchart"
}



Example widget for universal extensions: Cloud Data Transfer

Max Avg. duration of file transfers

A stat graph showing the maximum average time for a Cloud Data Transfer task.



To configure the query, we use the "sum by" and "increase" operators with two metrics that are divided by each other. For more clarity, an override is added to change the color of the widget.

The query code is pasted here:

sum(increase(ue_cdt_rclone_duration_sum{universal_extension_name="ue-cloud-dt"}[24h])) / sum(increase(ue_cdt_rclone_duration_count{universal_extension_name="ue-cloud-dt"}[24h]))

The time interval can be changed to determine the observed time period. This example shows the max average over 24h.

It is important that, under the general settings, the calculation is set to "Max". This makes the query return only the maximum of the calculated average.

If the value on the stat graph is not shown in seconds, it can help to set the unit to "seconds"; this will force the stat graph to show the given value in seconds.


JSON Model
{
  "datasource": {
    "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949",
    "type": "prometheus"
  },
  "description": "Average Duration is computed within the Time Period selected. The Max value of it is displayed",
  "fieldConfig": {
    "defaults": {
      "mappings": [],
      "thresholds": {
        "mode": "absolute",
        "steps": [
          {
            "color": "green",
            "value": null
          }
        ]
      },
      "color": {
        "fixedColor": "super-light-yellow",
        "mode": "fixed"
      },
      "unit": "s"
    },
    "overrides": [
      {
        "matcher": {
          "id": "byName",
          "options": "Value"
        },
        "properties": [
          {
            "id": "color",
            "value": {
              "fixedColor": "super-light-yellow",
              "mode": "fixed"
            }
          }
        ]
      }
    ]
  },
  "gridPos": {
    "h": 4,
    "w": 4,
    "x": 0,
    "y": 1
  },
  "id": 20,
  "interval": "30",
  "options": {
    "reduceOptions": {
      "values": false,
      "calcs": [
        "max"
      ],
      "fields": ""
    },
    "orientation": "auto",
    "textMode": "auto",
    "colorMode": "value",
    "graphMode": "area",
    "justifyMode": "auto"
  },
  "pluginVersion": "10.1.4",
  "targets": [
    {
      "datasource": {
        "type": "prometheus",
        "uid": "a65085b5-82cf-490b-a6cb-c01306f4a949"
      },
      "disableTextWrap": false,
      "editorMode": "code",
      "expr": "sum(increase(ue_cdt_rclone_duration_sum{universal_extension_name=\"ue-cloud-dt\"}[24h])) / sum(increase(ue_cdt_rclone_duration_count{universal_extension_name=\"ue-cloud-dt\"}[24h]))",
      "fullMetaSearch": false,
      "includeNullMetadata": true,
      "instant": false,
      "interval": "",
      "legendFormat": "{{label_name}}",
      "range": true,
      "refId": "Average Speed Over Time (MB/s)",
      "useBackend": false
    }
  ],
  "title": "Max of Average duration",
  "type": "stat"
}



Demo Dashboard

The following section is dedicated to an example dashboard created for Observability use cases.

The dashboard is split into different parts. It has variables set for the current instance that should be watched, as well as for the datasource currently used.

The visualization setup uses nested dashboards and variables to ensure a better overview of the ingested data. The main dashboard shows information for the Business Services and Agents, as well as tasks/traces and logs.

Under the Traces/Tasks row there are multiple Widgets showing information about the tasks, for example the number of Successful or Failed tasks, or how long tasks run.

There are also some Widgets that use Grafana's machine learning tool to predict future task usage.


Using the nested dashboards feature, more information for different Business Services or Universal Agents can be viewed. The dashboard for the Business Services looks as follows and is also featured in the Example Widgets section.

The same can be done for inspecting the tasks associated with the Business Service.


This results in a better view of the different tasks and statuses without cluttering a single dashboard with endless rows.