High Availability


Introduction

High Availability (HA) of Universal Automation Center means that the system has been set up with redundancy: in addition to the components that are processing work, there are back-up components available to continue processing through a hardware or software failure.

This page describes a High Availability environment, how High Availability components recover in the event of such a failure, and what actions, if any, the user must take.

High Availability System

The following illustration is a typical, although simplified, Universal Automation Center system in a High Availability environment.

In this environment, there are:

The components in blue are active and operating. The components in gray are available for operations but currently are inactive (passive).
 
 

See High Availability Components for a detailed description of how each component type functions in a High Availability environment.

High Availability Components

This section provides detailed information on the cluster nodes, Agents, and Universal Message Service (OMS) in a High Availability environment.

Cluster Nodes

Each Universal Automation Center installation consists of one or more instances of Universal Controller; each instance is a cluster node. A Universal Automation Center system requires only one node; however, a High Availability configuration requires at least two nodes.

At any given time under High Availability, one node operates in Active mode and the remaining nodes operate in Passive mode (see Determining Mode of a Cluster Node at Start-up).

An Active node performs all system processing functions; Passive nodes can perform limited processing functions.
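The Active/Passive rule above can be expressed as a simple invariant. The following Python sketch checks whether a set of nodes forms a High Availability configuration: at least two nodes, exactly one of them Active. The `ClusterNode` and `Mode` types are hypothetical illustrations, not part of any Universal Controller API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Mode(Enum):
    ACTIVE = "Active"
    PASSIVE = "Passive"


@dataclass
class ClusterNode:
    name: str  # hypothetical node identifier
    mode: Mode


def is_valid_ha_cluster(nodes: List[ClusterNode]) -> bool:
    """Return True if the nodes form an HA configuration:
    at least two nodes, exactly one of them in Active mode."""
    if len(nodes) < 2:
        return False  # a single node can run work but is not High Availability
    active = [n for n in nodes if n.mode is Mode.ACTIVE]
    return len(active) == 1  # every other node must be Passive
```

For example, a cluster of `node-a` (Active) and `node-b` (Passive) passes the check, while `node-a` alone does not.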

Passive Cluster Node Restrictions

Passive cluster nodes cannot execute any automated or scheduled work.

Also, from a Passive node you cannot:

However, Passive nodes do let you perform a limited number of processing functions, such as:

  • Launch tasks.

  • Monitor and display data.

  • Access the database.

  • Generate reports.
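A minimal sketch of how the Passive-node restrictions might be modeled as a permission check. The `Function` names and the `is_permitted` helper are hypothetical; the allowed set contains only the functions listed above, so it should not be read as a complete list of what the product permits.

```python
from enum import Enum, auto


class Function(Enum):
    RUN_SCHEDULED_WORK = auto()  # automated or scheduled work
    LAUNCH_TASK = auto()
    MONITOR_DATA = auto()
    ACCESS_DATABASE = auto()
    GENERATE_REPORT = auto()


# Functions this page names as available on a Passive node (illustrative subset).
PASSIVE_ALLOWED = frozenset({
    Function.LAUNCH_TASK,
    Function.MONITOR_DATA,
    Function.ACCESS_DATABASE,
    Function.GENERATE_REPORT,
})


def is_permitted(mode: str, function: Function) -> bool:
    """An Active node performs all functions; a Passive node only the allowed subset."""
    if mode == "Active":
        return True
    return function in PASSIVE_ALLOWED
```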

Agent

The Agent runs as a Windows service or Linux/Unix daemon. A cluster node sends a request to the Agent to perform a function. The Agent processes the request, gathers data about the operation of the client machine, and sends status and results back to the node. It performs these functions by exchanging messages with the node.

Once an Agent has registered with a node, you can view it by selecting that Agent type from the Agents & Connections navigation pane of the user interface. A list displays showing all the registered Agents of that type. See Agents Overview for more information.

If an Agent fails, Universal Broker restarts it. The Agent then attempts to determine what tasks or functions were in process at the time of failure.

Warm Start processing is the review that the Agent (UAG) performs at start-up: every task instance that was active at the time of the last shutdown (intentional or otherwise) is examined, and the appropriate action is taken based on the task instance's state and platform.

  • Task instances running on Windows and z/OS platforms are resumed when a Warm Start is attempted.

  • Task instances running on Unix and Linux platforms are set to IN-DOUBT status when a Warm Start is attempted.  
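The platform rules for Warm Start can be summarized as a lookup. This Python sketch is illustrative only: the platform strings and the returned action labels are assumptions, and it covers only the platforms named on this page.

```python
def warm_start_action(platform: str) -> str:
    """Return the action taken at Warm Start for a task instance that was
    active at the last shutdown, following the platform rules on this page.

    Platform strings and return values are illustrative assumptions.
    """
    key = platform.strip().lower()
    if key in ("windows", "z/os"):
        return "resume"
    if key in ("unix", "linux"):
        return "set IN-DOUBT"
    raise ValueError(f"no Warm Start rule on this page for platform {platform!r}")
```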

To support the Agent's determination of which tasks or functions were in process at the time of failure, Agent task processing includes the following steps:

Step 1

Each time the Agent receives a task, it writes a record called [guid]_job to its cache, where [guid] is a unique tracking number assigned to the task instance.

Step 2

As the task runs, the Agent updates the [guid]_job record with status information.

Step 3

When the task run completes, the Agent deletes the [guid]_job record.

Step 4

If an Agent is restarted, it looks in the cache for [guid]_job records and examines the status in each one it finds. If a record indicates that the job is supposed to be running, the Agent searches the system to locate it. If the Agent locates the task and resumes tracking it, the Agent marks the task as resumed. If the Agent cannot resume tracking the task, it sends a message to the cluster node that sets the status of the task instance to IN-DOUBT; the state of the process must then be determined by manual follow-up.
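The four steps above can be sketched as a small cache manager. This Python example is a hypothetical model, not the Agent's code: the JSON record format, the `running` status value, and the `locate` callback (standing in for the Agent's search of the system for the running process) are all assumptions. Only the `[guid]_job` naming follows the text.

```python
import json
import uuid
from pathlib import Path
from typing import Callable, Dict


class JobCache:
    """Hypothetical sketch of the Agent's [guid]_job cache records (Steps 1-4)."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, guid: str) -> Path:
        return self.cache_dir / f"{guid}_job"

    def start(self) -> str:
        # Step 1: assign a unique tracking number and write the record.
        guid = uuid.uuid4().hex
        self.update(guid, "running")
        return guid

    def update(self, guid: str, status: str) -> None:
        # Step 2: overwrite the record with the latest status.
        self._path(guid).write_text(json.dumps({"guid": guid, "status": status}))

    def complete(self, guid: str) -> None:
        # Step 3: the task run finished, so its record is deleted.
        self._path(guid).unlink()

    def recover(self, locate: Callable[[str], bool]) -> Dict[str, str]:
        # Step 4: after a restart, examine every leftover record.
        results: Dict[str, str] = {}
        for path in self.cache_dir.glob("*_job"):
            record = json.loads(path.read_text())
            if record["status"] != "running":
                continue  # only jobs that should be running are searched for
            guid = record["guid"]
            results[guid] = "resumed" if locate(guid) else "IN-DOUBT"
        return results
```

For example, if an Agent restarts with records for two tasks that were running and only one can be located, `recover` reports that one as `resumed` and the other as `IN-DOUBT`, the case that requires manual follow-up. A completed task leaves no record, so it is never examined.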

As illustrated below, the Agent reads and writes a record in its agent/cache directory for each task instance that it manages.
 

Universal Message Service (OMS)

Universal Message Service (OMS) sends and receives messages between the cluster nodes and Agents.

OMS consists of an OMS Server and an OMS Administration Utility. The OMS clients (cluster nodes and Agents) establish persistent TCP/IP socket connections with the OMS Server.

OMS provides reliable message communication by persisting all queued messages: the OMS Server maintains its queues in an OMS message database that resides on persistent storage.
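To illustrate why persisting queued messages makes delivery reliable, here is a minimal Python sketch of a disk-backed FIFO queue built on the standard `sqlite3` module. It is not how OMS is implemented; SQLite simply stands in for an OMS message database, and the class, table, and queue names are hypothetical.

```python
import sqlite3
from typing import Optional


class PersistentQueue:
    """Illustrative disk-backed FIFO queue; NOT the OMS implementation."""

    def __init__(self, db_path: str) -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, queue TEXT, body TEXT)"
        )
        self.conn.commit()

    def put(self, queue: str, body: str) -> None:
        # Commit before returning, so the message is on disk once put() returns.
        self.conn.execute(
            "INSERT INTO messages (queue, body) VALUES (?, ?)", (queue, body)
        )
        self.conn.commit()

    def get(self, queue: str) -> Optional[str]:
        # Read the oldest message for this queue, then delete it.
        row = self.conn.execute(
            "SELECT id, body FROM messages WHERE queue = ? ORDER BY id LIMIT 1",
            (queue,),
        ).fetchone()
        if row is None:
            return None
        self.conn.execute("DELETE FROM messages WHERE id = ?", (row[0],))
        self.conn.commit()
        return row[1]

    def close(self) -> None:
        self.conn.close()
```

Because each message is committed before `put` returns, closing the queue and reopening the same database file (a simulated server restart) still yields the messages in their original order.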

See Universal Message Service (OMS) for detailed information on OMS.

How High Availability Works

In a High Availability environment, Passive cluster nodes act as standby servers for the Active (primary) cluster node. All running cluster nodes issue heartbeats and check the mode (status) of the other running cluster nodes, both at start-up and continuously during operation. If the cluster node that is currently processing work can no longer do so, one of the other cluster nodes takes over and continues processing.
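The heartbeat-and-takeover behavior can be sketched as follows. This Python example is a simplified, hypothetical model: the timeout value, the timestamp representation, and the rule of promoting the live Passive node with the lowest name are assumptions made for illustration, not the actual Universal Controller selection logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    name: str
    mode: str              # "Active" or "Passive"
    last_heartbeat: float  # time of the node's most recent heartbeat, in seconds


def check_failover(nodes: List[NodeState], now: float, timeout: float) -> Optional[str]:
    """Return the name of the node that should be Active after this check.

    A node counts as alive if its last heartbeat is no older than `timeout`.
    If no live node is Active, a live Passive node is promoted; picking the
    lowest name is a hypothetical tie-break rule.
    """
    alive = [n for n in nodes if now - n.last_heartbeat <= timeout]
    for n in alive:
        if n.mode == "Active":
            return n.name  # the Active node is healthy: nothing to do
    candidates = sorted((n for n in alive if n.mode == "Passive"), key=lambda n: n.name)
    if not candidates:
        return None  # no live node can take over
    for n in nodes:
        if n.mode == "Active":
            n.mode = "Passive"  # demote the stale Active node in this local view
    candidates[0].mode = "Active"
    return candidates[0].name
```

A real cluster must also prevent two nodes from promoting themselves at once; this sketch evaluates a single shared view and does not address that.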