UAC Utility: System Monitor

Disclaimer

Your use of this download is governed by Stonebranch’s Terms of Use.

Version Information

Template Name: System Monitor

Extension Name: ue-system-monitor

Version: 1 (Current 1.0.0)

Status: Fixes and new Features are introduced.

Refer to Changelog for version history information.

Overview

The System Monitor integration lets users track system metrics, such as CPU usage, memory consumption, disk activity, and network performance, on both Linux and Windows hosts. By leveraging OpenTelemetry, these metrics can be published to observability platforms, enabling real-time infrastructure monitoring. This integration not only facilitates the detection of performance bottlenecks and potential system failures but also allows for proactive management of resources. By making infrastructure metrics observable, System Monitor makes it easier to correlate system behavior with application performance, leading to better overall visibility and system reliability.

Key Features

Feature

Description

Observe and Publish Metrics

Observe and publish metrics from Linux and Windows hosts. The following categories are supported:

  • CPU utilization metrics

  • CPU load metrics

  • Memory utilization metrics

  • Paging/Swap space utilization metrics

  • Disk I/O metrics

  • Filesystem utilization metrics

  • Network Interface metrics

  • Process count metrics

  • Miscellaneous system metrics

Filtering

Filtering abilities for Disk/Filesystem and Network Interface metrics

Other Configuration Options

  • Individually switch on/off specific metric categories

  • Configurable metrics collection interval

Requirements

This integration requires a Universal Agent and a Python runtime to execute the Universal Task.

Area

Details

Python Version

Requires Python 3.11

Universal Agent Compatibility

  • Compatible with Universal Agent for Windows x64, version 7.6.0.0 or later.

  • Compatible with Universal Agent for Linux, version 7.6.0.0 or later.

Universal Controller Compatibility

Universal Controller version 7.6.0.0 or later.

OpenTelemetry

Universal Agent should be configured to send OpenTelemetry data.

Never run two task instances simultaneously on the same system, as this can lead to inconsistent metric values and unreliable data. The default Grafana dashboard provided displays a warning in this situation, but the software does not automatically prevent multiple task instances from running simultaneously on the same system, so this must be managed operationally.

The provided Grafana dashboard relies on metric attributes that Universal Agent attaches when using the Agent default configuration. If any of these options are changed, such as otel_uip_service_name, which can be configured in the uags.conf file, the queries used on the dashboard must be changed accordingly.

Input Fields

Name

Type

Description

Version Information

Action

Choice

Possible values are:

  • System Monitor

Introduced in 1.0.0

Provide Configuration As

Choice

Specifies how System Monitor configuration is provided.

Available options are:

  • As YAML Text (default)

  • As YAML UAC Script

Available if Action is “System Monitor”

Introduced in 1.0.0

Collection Interval (sec)

Int

How often metrics are retrieved. Default value is 15 seconds.

Note

The Collection Interval determines how often metrics are collected and therefore how often they are sent to the OTEL Collector. This data can be pulled (scraped) by the intended time-series database (e.g. Prometheus) at configurable intervals. To optimize resource utilization and ensure granular metrics retrieval, it is recommended to align the Collection Interval with the scrape interval.

Introduced in 1.0.0

Configuration

Large Text

System Monitor configuration as Text

Default value:

metrics:
  system:
    cpu: # Enable this Metric.
    memory:
    load_average:
    paging:
    disk:
    filesystem:
    network:
    processes:

For more information on the System Monitor configuration options, see YAML Configuration Options.

Introduced in 1.0.0

Configuration

Script

System Monitor configuration as UC Script. This allows the configuration to be shared across multiple task definitions. For more information on the System Monitor configuration options, see YAML Configuration Options

Introduced in 1.0.0
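The note on Collection Interval (sec) above recommends aligning the collection interval with the scrape interval of the time-series database. As a rough illustration of why, the sketch below computes how many collections happen within one scrape interval. The function name is hypothetical and the arithmetic is only an illustration, not part of the integration.

```python
def samples_per_scrape(collection_interval_s: int, scrape_interval_s: int) -> float:
    """Return how many metric collections occur during one scrape interval."""
    if collection_interval_s <= 0 or scrape_interval_s <= 0:
        raise ValueError("intervals must be positive")
    return scrape_interval_s / collection_interval_s


# Aligned intervals: one collection per scrape.
print(samples_per_scrape(15, 15))  # 1.0
# Collecting four times per scrape: extra collection work between scrapes.
print(samples_per_scrape(15, 60))  # 4.0
# Collecting less often than scraping: the same value may be scraped repeatedly.
print(samples_per_scrape(60, 15))  # 0.25
```

A result close to 1 indicates that the two intervals are aligned.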

Supported Actions

Action: System Monitor

Configuration examples

Provide configuration as YAML text. Collection interval is set to 15 seconds, and the default YAML configuration is used, activating all metrics and applying no filters.

Provide configuration using the "System Monitor - Full Configuration" UAC Script, setting a collection interval of 10 seconds.

System Monitor Configuration Options

The configuration, provided as either plain text or a UC Script, defines the System Monitor's behavior, specifying which metrics are published and any desired filtering options. Configurations are written in YAML and must adhere to a defined hierarchical structure.

The metrics and system settings must always be present in configuration files. The activation or filtering of any other metrics is optional.
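The required-key rule can be sketched as a small check on the parsed configuration. This is a minimal illustration, assuming the YAML has already been loaded into a Python dictionary (keys with no value become None); the function name and error messages are hypothetical and do not reproduce the extension's own validation.

```python
def validate_required_keys(config: dict) -> list[str]:
    """Return a list of problems; an empty list means both required keys exist."""
    errors = []
    if not isinstance(config, dict) or "metrics" not in config:
        errors.append("missing required top-level key: metrics")
        return errors
    metrics = config["metrics"]
    # "system" must be nested directly under "metrics".
    if not isinstance(metrics, dict) or "system" not in metrics:
        errors.append("missing required key: metrics.system")
    return errors


# "system:" with no nested metrics is still a valid minimal configuration.
print(validate_required_keys({"metrics": {"system": None}}))
# "cpu" placed directly under "metrics" leaves "system" missing.
print(validate_required_keys({"metrics": {"cpu": None}}))
```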

The configuration allows you to:

  1. Enable or disable metric categories: Choose which system metrics categories to collect.

  2. Filter specific resources: Apply include/exclude filters on specific attribute values using strict mode or regex. If any "include" filters are activated for a specific attribute, no "exclude" filters can be activated for the same attribute (they are mutually exclusive).
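The mutual-exclusivity rule for include and exclude filters can be checked with a few lines of code. The sketch below assumes one metric section (for example the content under filesystem) has been parsed into a dictionary; the function name is hypothetical and this is not the extension's actual validation logic.

```python
from typing import Optional


def find_conflicting_filters(section: Optional[dict]) -> list[str]:
    """Return attributes that have both an include_* and an exclude_* filter."""
    section = section or {}  # a metric enabled without filters parses to None
    conflicts = []
    for key in section:
        if key.startswith("include_"):
            attribute = key[len("include_"):]
            if f"exclude_{attribute}" in section:
                conflicts.append(attribute)
    return conflicts


# Include and exclude on different attributes is allowed.
ok = {"include_types": {"types": ["xfs"]}, "exclude_mountpoints": {"mountpoints": ["/dev"]}}
# Include and exclude on the same attribute is a conflict.
bad = {"include_devices": {"devices": ["sda"]}, "exclude_devices": {"devices": ["sdb"]}}

print(find_conflicting_filters(ok))   # []
print(find_conflicting_filters(bad))  # ['devices']
```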

The following configuration example demonstrates all the applicable options:

metrics:
  system:
    cpu: # Enables CPU metrics to be published
    memory: # Enables Memory metrics to be published
    load_average: # Enables Load Average metrics to be published
    paging: # Enables Swap/Paging metrics to be published
    disk: # Enables Disk metrics to be published
      exclude_devices: # Filters exported metrics, excluding a list of devices. "include_devices" is also available.
        devices: ["loop.*"] # Devices excluded using a list provided in brackets
        match_type: regex # Value matching is with Regex
    filesystem: # Enables File System metrics to be published
      include_devices: # Filters exported metrics, including only a list of devices. "exclude_devices" is also available.
        devices: ["sda", "sdb"] # Devices included using a list provided in brackets
        match_type: strict # Value matching is strict. The exact names from the above list are used.
      include_types: # Filters exported metrics, including only a list of filesystem types. "exclude_types" is also available.
        types: ["xfs"] # Filesystem types included using a list provided in brackets
        match_type: strict # Value matching is strict. The exact names from the above list are used
      include_mountpoints: # Filters exported metrics, including only a list of mountpoints. "exclude_mountpoints" is also available
        mountpoints: ["/dev"] # Mountpoints included using a list provided in brackets
        match_type: strict # Value matching is strict. The exact names from the above list are used
    network: # Enables Network metrics to be published
      include_devices: # Filters exported metrics, including only a list of network interfaces. "exclude_devices" is also available.
        devices: ["lo"]
        match_type: strict
    processes:

YAML Field

Description

metrics

Necessary as top-level key of the YAML configuration.

Required

system

Enables monitoring of host uptime and acts as the root key for any additional metrics provided.

All following metrics (such as cpu) are marked for activation with the inclusion of the relevant key in the configuration file.

Required

cpu

Enables CPU related metrics.

load_average

Enables Load Average (1, 5 and 15 minute) metrics.

memory

Enables Memory metrics.

paging

Enables Paging/Swap metrics.

disk

Enables Disk metrics. Filtering options are available:

  • include_devices: Includes only specific devices.

  • exclude_devices: Excludes specific devices.

The above filtering options are mutually exclusive (both should not be set)

filesystem

Enables Filesystem metrics. Filtering options are available:

  • include_devices: Includes only specific devices

  • exclude_devices: Excludes specific devices.

The above filtering options are mutually exclusive (both should not be set)

  • include_types: Includes only specific filesystem types.

  • exclude_types: Excludes specific filesystem types.

The above filtering options are mutually exclusive (both should not be set)

  • include_mountpoints: Includes only specific mountpoints

  • exclude_mountpoints: Excludes specific mountpoints.

The above filtering options are mutually exclusive (both should not be set)

If filtering options for devices/types and mountpoints are used at the same time, a logical AND is applied.

network

Enables Network metrics. Filtering options are available:

  • include_devices: Includes only specific network interfaces

  • exclude_devices: Excludes specific network interfaces

The above filtering options are mutually exclusive (both should not be set)

processes

Enables Process count metric.
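The filesystem row above states that a logical AND is applied when device, type, and mountpoint filters are combined. The sketch below illustrates that rule with hypothetical names (Filesystem, matches, passes_all). It assumes that regex patterns are matched from the start of the value with re.match; the actual matching semantics of the extension are not specified here.

```python
import re
from dataclasses import dataclass


@dataclass
class Filesystem:
    device: str
    fstype: str
    mountpoint: str


def matches(value: str, patterns: list[str], match_type: str) -> bool:
    """Strict compares exact names; regex uses re.match (an assumption)."""
    if match_type == "strict":
        return value in patterns
    return any(re.match(pattern, value) for pattern in patterns)


def passes_all(fs: Filesystem, filters: list[tuple]) -> bool:
    """Each filter is (attribute, mode, patterns, match_type); all must pass."""
    for attribute, mode, patterns, match_type in filters:
        hit = matches(getattr(fs, attribute), patterns, match_type)
        if mode == "include" and not hit:
            return False
        if mode == "exclude" and hit:
            return False
    return True


FILTERS = [
    ("device", "exclude", ["/dev/loop.*"], "regex"),
    ("fstype", "include", ["xfs"], "strict"),
    ("mountpoint", "exclude", ["/var/lib/snapd/.*"], "regex"),
]

print(passes_all(Filesystem("/dev/sda1", "xfs", "/"), FILTERS))       # True
print(passes_all(Filesystem("/dev/sda2", "ext4", "/boot"), FILTERS))  # False
print(passes_all(Filesystem("/dev/loop3", "xfs", "/mnt"), FILTERS))   # False
```

A filesystem is reported only if it satisfies every configured filter, which is why the ext4 and loop entries are dropped even though each passes two of the three filters.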

System Monitor Configuration Examples

#

Configuration

Description

1

Default Configuration
metrics:
  system:
    cpu:
    memory:
    load_average:
    paging:
    disk:
    filesystem:
    network:
    processes:

Default configuration that enables all available metrics without applying any filters.

2

Configuration Excluding specific Disks
metrics:
  system:
    cpu:
    memory:
    load_average:
    paging:
    disk:
      exclude_devices:
        devices: ["sda", "sdb"]
        match_type: strict
    filesystem:
    network:
    processes:

This configuration filters disk metrics so as not to report for the disks named 'sda' and 'sdb'.

3

Configuration Excluding a set of Filesystems
metrics:
  system:
    cpu:
    memory:
    load_average:
    paging:
    disk:
    filesystem:
      exclude_devices:
        devices: ["dev/loop.*"]
        match_type: regex
    network:
    processes:

This configuration filters filesystem metrics so as to exclude reporting for any filesystems whose names start with ‘dev/loop’.

4

Configuration Including only specific Network Interfaces
metrics:
  system:
    cpu:
    memory:
    load_average:
    paging:
    disk:
    filesystem:
    network:
      include_devices:
        devices: ["Ethernet", "Wireless"]
        match_type: strict
    processes:

This configuration filters network metrics to include reports originating only from the 'Ethernet' and 'Wireless' network interfaces.

5

Configuration Including numerous filters
metrics:
  system:
    cpu:
    memory:
    load_average:
    paging:
    disk:
      exclude_devices:
        devices: ["loop.*"]
        match_type: regex
    filesystem:
      exclude_devices:
        devices: ["/dev/loop.*"]
        match_type: regex
      include_types:
        types: ["xfs"]
        match_type: strict
      exclude_mountpoints:
        mountpoints: ["/var/lib/snapd/.*"]
        match_type: regex
    network:
      exclude_devices:
        devices: ["lo"]
        match_type: strict
    processes:

This configuration applies several filters to tailor the collected metrics as follows:

  • Disk Filters: Exclude disks with names starting with loop.

  • Filesystem Filters:

    • Exclude filesystems with names starting with /dev/loop.

    • Include only filesystems of type xfs.

    • Exclude filesystems with mountpoints starting with /var/lib/snapd/.

  • Network Filters: Exclude the 'lo' network interface.

Action Output

Output Type

Description

Examples

EXTENSION

The extension output provides the following information:

  • exit_code, status_description: General info regarding the task execution. For more information, see the exit code table.

  • invocation: The task configuration used for this task execution.

  • result: Any errors that have been raised in case of Failure.

Successful scenario
{
  "exit_code": 0,
  "status_description": "Task cancelled successfully",
  "invocation": {
    "extension": "ue-system-monitor",
    "version": "1.0.0",
    "fields": { ... }
  }
}
Failing scenario
{
  "exit_code": 20,
  "status_description": "Data Validation Error: Duplicate key detected in configuration file",
  "invocation": {
    "extension": "ue-system-monitor",
    "version": "1.0.0",
    "fields": { ... }
  },
  "result": {
    "errors": [
      "Data Validation Error: Duplicate key detected in configuration file"
    ]
  }
}

STDERR

Universal Extension Task log information

Exit Codes

Exit Code

Status

Status Description

Meaning

0

Success

“Success: <<Task cancelled successfully.>>”

Successful execution and subsequent cancellation.

1

Failure

“Execution Failed: <<Error Description>>”

Raised in case of an unexpected error during execution.

20

Failure

“Data Validation Error: <<Error Description>>”

Validation error related to input fields or the YAML Configuration provided.

* See STDERR for more detailed error descriptions.
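To illustrate how the extension output and the exit codes fit together, the following sketch parses an output document and summarizes it. The function name and the summary format are hypothetical; the sample strings follow the scenarios shown under Action Output, with the elided fields object replaced by an empty placeholder.

```python
import json

# Meanings taken from the Exit Codes table.
EXIT_CODE_MEANINGS = {
    0: "Success",
    1: "Failure (unexpected error during execution)",
    20: "Failure (data validation error)",
}


def summarize_extension_output(raw: str) -> str:
    """Build a one-line summary from the extension output JSON."""
    output = json.loads(raw)
    code = output["exit_code"]
    meaning = EXIT_CODE_MEANINGS.get(code, "Unknown exit code")
    summary = f"{code}: {meaning} - {output['status_description']}"
    # "result" is only present when errors were raised.
    errors = output.get("result", {}).get("errors", [])
    if errors:
        summary += " | errors: " + "; ".join(errors)
    return summary


INVOCATION = {"extension": "ue-system-monitor", "version": "1.0.0", "fields": {}}  # fields: placeholder
SUCCESS_OUTPUT = json.dumps({
    "exit_code": 0,
    "status_description": "Task cancelled successfully",
    "invocation": INVOCATION,
})
FAILURE_OUTPUT = json.dumps({
    "exit_code": 20,
    "status_description": "Data Validation Error: Duplicate key detected in configuration file",
    "invocation": INVOCATION,
    "result": {"errors": ["Data Validation Error: Duplicate key detected in configuration file"]},
})

print(summarize_extension_output(SUCCESS_OUTPUT))
print(summarize_extension_output(FAILURE_OUTPUT))
```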

Observability

System CPU metrics

Metric: system.cpu.time

Name

Instrument Type

Unit (UCUM)

Attributes

Description

system.cpu.time

Counter

s

As defined in the Metric Attributes List

Observes the CPU time spent on the system.

Metric Attributes List:

Attribute Name

Description

state

The CPU mode on which time was spent. Possible values are:

  • user: time spent by normal processes executing in user mode; on Linux this also includes guest time

  • system: time spent by processes executing in kernel mode

  • idle: time spent doing nothing

Platform-specific fields:

  • nice (Linux): time spent by niced (prioritized) processes executing in user mode; on Linux this also includes guest_nice time

  • iowait (Linux): time spent waiting for I/O to complete. This is not counted in the idle time counter.

  • irq (Linux): time spent for servicing hardware interrupts

  • softirq (Linux): time spent for servicing software interrupts

  • steal (Linux 2.6.11+): time spent by other operating systems running in a virtualized environment

  • guest (Linux 2.6.24+): time spent running a virtual CPU for guest operating systems under the control of the Linux kernel

  • guest_nice (Linux 3.2.0+): time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel)

  • interrupt (Windows): time spent for servicing hardware interrupts (similar to “irq” on UNIX)

  • dpc (Windows): time spent servicing deferred procedure calls (DPCs); DPCs are interrupts that run at a lower priority than standard interrupts.

Note: Not all attributes may be available; availability depends on the platform and operating system version.

cpu

The logical CPU number [cpu0, cpu1, ..., cpu(n-1)].

Metric: system.cpu.utilization

Name

Instrument Type

Unit (UCUM)

Attributes

Description

system.cpu.utilization

Gauge

1

As defined in the Metric Attributes List

Observes the CPU utilization on the system.

Metric Attributes List:

Attribute Name

Description

state

The CPU mode on which time was spent. Possible values are:

  • user: time spent by normal processes executing in user mode; on Linux this also includes guest time

  • system: time spent by processes executing in kernel mode

  • idle: time spent doing nothing

Platform-specific fields:

  • nice (Linux): time spent by niced (prioritized) processes executing in user mode; on Linux this also includes guest_nice time

  • iowait (Linux): time spent waiting for I/O to complete. This is not counted in the idle time counter.

  • irq (Linux): time spent for servicing hardware interrupts

  • softirq (Linux): time spent for servicing software interrupts

  • steal (Linux 2.6.11+): time spent by other operating systems running in a virtualized environment

  • guest (Linux 2.6.24+): time spent running a virtual CPU for guest operating systems under the control of the Linux kernel

  • guest_nice (Linux 3.2.0+): time spent running a niced guest (virtual CPU for guest operating systems under the control of the Linux kernel)

  • interrupt (Windows): time spent for servicing hardware interrupts (similar to “irq” on UNIX)

  • dpc (Windows): time spent servicing deferred procedure calls (DPCs); DPCs are interrupts that run at a lower priority than standard interrupts.

Note: Not all attributes may be available; availability depends on the platform and operating system version.

cpu

The logical CPU number [cpu0, cpu1, ..., cpu(n-1)].
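system.cpu.time is a Counter in seconds, while system.cpu.utilization is a Gauge with unit 1 (a ratio). The sketch below shows one common way to relate the two: taking the difference between two cumulative per-state samples and dividing by the total elapsed CPU time. This is an illustration only; how System Monitor computes utilization internally is not stated here, and the function name and sample values are hypothetical.

```python
def utilization_from_cpu_time(previous: dict[str, float],
                              current: dict[str, float]) -> dict[str, float]:
    """Convert two cumulative per-state CPU time samples (seconds) to ratios."""
    deltas = {state: current[state] - previous.get(state, 0.0) for state in current}
    total = sum(deltas.values())
    if total <= 0:
        # No CPU time elapsed between the samples; report zero utilization.
        return {state: 0.0 for state in deltas}
    return {state: delta / total for state, delta in deltas.items()}


# Two hypothetical samples for one logical CPU, taken 15 seconds apart.
earlier = {"user": 100.0, "system": 50.0, "idle": 850.0}
later = {"user": 103.0, "system": 51.0, "idle": 861.0}

for state, ratio in utilization_from_cpu_time(earlier, later).items():
    print(f"{state}: {ratio:.3f}")
```

With these samples, 3 of the 15 elapsed seconds were spent in user mode, giving a user ratio of 0.2; the ratios across all states sum to 1.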

Metric: system.cpu.physical.count

Name

Instrument Type

Unit (UCUM)

Attributes

Description
