High Availability Behavior and Processing

Introduction

High Availability (HA) of Universal Data Mover Gateway means that it has been set up as a redundant system: in addition to the components that are processing work, there are back-up components available to continue processing through a hardware or software failure.

This page describes a High Availability environment, how High Availability components recover in the event of such a failure, and what actions, if any, the user must take.

High Availability System

The following illustration shows a typical, although simplified, Universal Data Mover Gateway system in a High Availability environment.

In this environment, there are:

  • Two UDMG Server instances (cluster nodes)

  • Two UDMG Authentication Proxies

  • Two UDMG Admin UIs

High Availability for the UDMG Admin UI

Each pair of UDMG Authentication Proxy and UDMG Admin UI (forming a UDMG UI Group) can access both of the UDMG Server instances in either of two ways:

  • Different login services can be defined for each UDMG Server instance on the UDMG Authentication Proxy for the UDMG Admin UI. For example, Service A to access UDMG Server instance A and Service B to access UDMG Server instance B.
    This allows the user to select the UDMG Server instance to use.

    [service]
      [service.Service-A]
        protocol = "http"
        [[service.Service-A.targets]]
          hostname = "Server-A"
          port = 8080
          
      [service.Service-B]
        protocol = "http"
        [[service.Service-B.targets]]
          hostname = "Server-B"
          port = 8080



  • Each login service can have multiple UDMG Server targets that are tried one after the other during the connection request. For example, Service-A is set up with Server-A and Server-B as targets.
    This allows the user to continue using the Service-A session even if Server-A becomes unavailable.

    [service]
      [service.Service-A]
        protocol = "http"
    
        [[service.Service-A.targets]]
          hostname = "Server-A"
          port = 8080
        [[service.Service-A.targets]]
          hostname = "Server-B"
          port = 8080
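
The "tried one after the other" behavior can be pictured with the short Python sketch below. It is only an illustration of ordered failover across targets, not the UDMG Authentication Proxy implementation; the `first_reachable` helper and the `fake_connect` connector are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Optional, Tuple

Target = Tuple[str, int]  # (hostname, port), as in the [[service.Service-A.targets]] entries


def first_reachable(
    targets: Iterable[Target],
    connect: Callable[[str, int], bool],
) -> Optional[Target]:
    """Return the first target for which connect() succeeds, trying them in order."""
    for hostname, port in targets:
        try:
            if connect(hostname, port):
                return (hostname, port)
        except OSError:
            # A connection error means "target unavailable": try the next one.
            continue
    return None


# Hypothetical connector used only for this illustration: Server-A is down.
def fake_connect(hostname: str, port: int) -> bool:
    if hostname == "Server-A":
        raise ConnectionRefusedError("Server-A is down")
    return True


service_a_targets = [("Server-A", 8080), ("Server-B", 8080)]
chosen = first_reachable(service_a_targets, fake_connect)
print(chosen)  # ('Server-B', 8080)
```

With both servers listed as targets of Service-A, the session falls through to Server-B when Server-A refuses the connection.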

High Availability Components

This section provides detailed information on the cluster nodes in a High Availability environment.

Cluster Nodes

Each UDMG installation consists of one or more instances of UDMG Server; each instance is a cluster node. Only one node is required in a UDMG system; however, in order to run a High Availability configuration, you must run at least two nodes.

At any given time under High Availability, one node operates in Active mode and the remaining nodes operate in Passive mode (see Determining Mode of a Cluster Node at Start-up).

An Active node performs all system processing functions; Passive nodes can perform limited processing functions.

Passive Cluster Node Restrictions

Passive cluster nodes cannot execute any file transfer work.


However, Passive nodes do let you perform a limited number of processing functions, such as:

  • Monitor and display data.

  • Access the database.

  • Generate reports.

Note

Connecting to the active instance is recommended for performing transfer-related actions like add, pause, resume, and cancel.


How High Availability Works

In a High Availability environment, Passive cluster nodes play the role of standby servers to the Active (primary) cluster node. All running cluster nodes issue heartbeats and check the mode (status) of the other running cluster nodes, both when they start up and continuously during operations. If the cluster node that is currently processing work can no longer do so, one of the other cluster nodes takes over and continues processing.

Each cluster node connects to the same UDMG database; however, only the Active cluster node performs updates for file transfer activity.

Cluster Node Mode

The mode (status) of a cluster node indicates whether or not it is the cluster node that is currently processing work:

Active

Cluster node currently is performing all file transfers and also the other system processing functions (administration, reporting, …).

It supports the local services and the processing of transfers in server and client modes. 

Passive

Cluster node is not performing any file transfers but is available to perform the other system processing functions.

Offline

Cluster node is not running or is inoperable and needs to be restarted.

High Availability Start-Up

The following steps describe how a High Availability environment starts up:

Step 1

The user starts the cluster nodes.

Step 2

Each cluster node reads its server.ini file.

Step 3

Each cluster node locates and connects to the database and retrieves information about the UDMG environment.

Step 4

One of the nodes becomes the Active node.
It starts the local services (local FTP, SFTP, ... servers) and starts processing transfer requests, both in server mode and client mode.

Note

Cluster nodes in Passive mode can perform limited system processing functions.

Determining Mode of a Cluster Node at Start-up

A cluster node starts in Passive mode. It then determines if it should remain in Passive mode or switch to Active mode.

The following flow chart describes how a cluster node determines its mode at start-up:

Note

A cluster node is considered "healthy" or "stale" based on its heartbeat timestamp.
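
As a rough illustration of the healthy/stale distinction, the Python sketch below classifies a heartbeat timestamp and derives a start-up mode from it. It assumes the 5-minute staleness threshold described later on this page and a simplified rule (stay Passive while a healthy Active node exists, otherwise become Active); the function names are hypothetical and the actual decision logic may include checks not shown here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

DEADLINE = timedelta(minutes=5)  # default staleness threshold described on this page


def is_stale(last_heartbeat: datetime, now: datetime,
             deadline: timedelta = DEADLINE) -> bool:
    """A heartbeat is stale when its timestamp is older than the deadline."""
    return now - last_heartbeat > deadline


def startup_mode(active_heartbeats: Dict[str, datetime], now: datetime,
                 deadline: timedelta = DEADLINE) -> str:
    """Assumed rule: remain PASSIVE if a node marked Active is healthy."""
    for timestamp in active_heartbeats.values():
        if not is_stale(timestamp, now, deadline):
            return "PASSIVE"
    return "ACTIVE"


now = datetime(2023, 11, 15, 19, 0, 0, tzinfo=timezone.utc)
print(startup_mode({"node-a": now - timedelta(seconds=15)}, now))  # PASSIVE
print(startup_mode({"node-a": now - timedelta(minutes=6)}, now))   # ACTIVE
```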


Checking the Active Cluster Node During Operations

When all cluster nodes have started, each one continuously monitors the heartbeats of the other running cluster nodes.

If a Passive cluster node determines that the Active cluster node is no longer running, the Passive cluster node automatically takes over as the Active cluster node based upon the same criteria described above.

This determination is made as follows:

Step 1

The Active cluster node sends a heartbeat by updating a timestamp in the database.

The heartbeat interval is 10 seconds. It can be adjusted with the 'Heartbeat' parameter:

[controller]
; The frequency at which the heartbeat will be updated
Heartbeat = 10s

Step 2

All Passive cluster nodes check the Active cluster node's timestamp to determine if it is current.

This check runs every 20 seconds. It can be adjusted with the 'HeartbeatCheck' parameter:

[controller]
; The heartbeat to determine if this instance will be probed
HeartbeatCheck = 20s

Step 3

If a Passive cluster node determines that the Active cluster node's timestamp is stale, failover occurs: the Passive cluster node changes the mode of the Active cluster node to Offline and takes over as the Active cluster node. If more than one cluster node is operating in Passive mode, the first cluster node eligible to become Active that determines that the Active cluster node is not running becomes the Active cluster node. A stale cluster node is one whose timestamp is older than 5 minutes.

The staleness threshold can be adjusted with the 'Deadline' parameter:

[controller]
; The deadline to determine if this instance will be active
Deadline = 5m
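
The three parameters together bound how long a failure can go unnoticed. The Python sketch below estimates that window under stated assumptions: the Active node stops abruptly, the checks are evenly spaced, and only single-unit values such as 10s or 5m are parsed. The `parse_duration` and `detection_window` helpers are hypothetical and are not part of UDMG.

```python
import re
from datetime import timedelta
from typing import Tuple

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse simple single-unit values such as '10s', '20s' or '5m'."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def detection_window(heartbeat: str, heartbeat_check: str,
                     deadline: str) -> Tuple[timedelta, timedelta]:
    """Estimate the earliest and latest delay between a failure and its detection."""
    hb, chk, dl = (parse_duration(v) for v in (heartbeat, heartbeat_check, deadline))
    earliest = dl - hb   # node failed just before its next heartbeat was due
    latest = dl + chk    # node failed right after a heartbeat; check just missed it
    return earliest, latest


earliest, latest = detection_window("10s", "20s", "5m")
print(int(earliest.total_seconds()), int(latest.total_seconds()))  # 290 320
```

With the default values, a Passive node would therefore take over roughly five minutes after the Active node stops. Lowering Deadline shortens this window.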

What To Do If a Failover Occurs

A Passive cluster node taking over as an Active cluster node is referred to as failover. If failover occurs, the event is invisible unless you are using the Active cluster node in a browser.

If you are using the Active cluster node in a browser and the cluster node fails, you will receive a browser error. In this case, take the following steps to continue working:

Step 1

Access the new Active cluster node in your browser. 

This can be achieved by selecting the appropriate service on the UDMG Admin UI login page, provided that each UDMG Server instance is defined as a dedicated login service.

Step 2

If you were adding, deleting, or updating records at the time of the failure, check the record you were working on. Any data you had not yet saved will be lost.

Viewing Cluster Node Status

The cluster node status is displayed by the background color of the Server Status button.

Node Status    Background Color
Active         Transparent
Passive        Yellow
Offline        Red

It is also indicated by the "Controller" service Information string:

High Availability Configuration

To achieve High Availability for your Universal Data Mover Gateway, you must configure the UDMG Server cluster nodes and the UDMG Authentication Proxy.

Configuring Cluster Nodes

All cluster nodes in a High Availability environment must point to the same database. Make sure that the following entries in their server.ini files are identical.

For example:

[database]
; Type of the RDBMS used for the gateway database. Possible values: sqlite, mysql (default), postgresql, oracle, mssql
Type = postgresql

; Address (URL:port) of the database. The default port depends on the type of database used (PostgreSQL: 5432, MySQL: 3306, MS SQL: 1433, Oracle: 1521, SQLite: none).
Address = localhost:5432

; The name of the database
Name = udmg

; The name of the gateway database user
User = udmg_user

; The password of the gateway database user
Password = udmg_password

; Path of the database TLS certificate file (only supported for mysql, postgresql).
; TLSCert =

; Path of the key of the TLS certificate file (only supported for mysql, postgresql).
; TLSKey =

; The path to the file containing the passphrase used to encrypt account passwords using AES. Recommended to be a full absolute path, if the file does not exist, a new passphrase is generated the first time.
AESPassphrase = /opt/udmg/etc/udmg-server/passphrase.aes

The heartbeat frequency and failover deadline can also be tuned with the following parameters: 

[controller]
; The frequency at which the heartbeat will be updated
Heartbeat = 10s

; The deadline to determine if this instance will be active
Deadline = 5m

; The heartbeat to determine if this instance will be probed
HeartbeatCheck = 20s
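
The requirement that every node uses the same [database] entries can be checked before start-up. The following Python sketch compares that section across two server.ini contents using the standard configparser module. It assumes the files parse as standard INI; the `database_mismatches` helper and the sample values are illustrative only.

```python
import configparser
from typing import Dict, Optional, Tuple


def database_settings(ini_text: str) -> Dict[str, str]:
    """Return the [database] section of a server.ini as a plain dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case (Type, Address, ...) as written
    parser.read_string(ini_text)
    return dict(parser["database"])


def database_mismatches(
    ini_a: str, ini_b: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """List the [database] keys whose values differ between two nodes."""
    a, b = database_settings(ini_a), database_settings(ini_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Sample content for illustration; real files hold the full set of entries.
node_a = """
[database]
; Type of the RDBMS used for the gateway database
Type = postgresql
Address = db-host:5432
Name = udmg
"""
node_b = node_a.replace("db-host:5432", "localhost:5432")

print(database_mismatches(node_a, node_b))
# {'Address': ('db-host:5432', 'localhost:5432')}
```

An empty result means both nodes describe the same database connection.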

Load Balancer

REST API connections

If you are using a load balancer in your High Availability environment, it can utilize the following HTTP requests to direct the UDMG Admin UI and REST API requests to the active instance:

http(s)://serverhost:[Port]/ping



If a cluster node is active, this URL returns the status 200 (OK) and a one-word response body: ACTIVE.

If a cluster node is not active, this URL returns the status 403 (Forbidden, cluster node is not active) and the actual mode of the cluster node in the response body: PASSIVE or OFFLINE.

$ curl -w "http_code=%{http_code}\n" -s http://server-A:9181/ping
ACTIVE
http_code=200
$ curl -w "http_code=%{http_code}\n" -s http://server-B:9182/ping
PASSIVE
http_code=403

This API is provided without authentication.
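
A monitoring script can interpret the /ping response as sketched below in Python, using only the standard library. The status codes and body values come from the description above; the `interpret_ping` and `ping_node` helper names, the "UNKNOWN" fallback, and the timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def interpret_ping(status: int, body: str) -> str:
    """Map a /ping status code and body to a node mode."""
    mode = body.strip().upper()
    if status == 200 and mode == "ACTIVE":
        return "ACTIVE"
    if status == 403 and mode in ("PASSIVE", "OFFLINE"):
        return mode
    return "UNKNOWN"  # unexpected answer, treated as not usable


def ping_node(base_url: str, timeout: float = 5.0) -> str:
    """Query <base_url>/ping and return the node mode, or UNKNOWN on error."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return interpret_ping(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urllib raises for the 403 answer; its body still carries the mode.
        return interpret_ping(err.code, err.read().decode("utf-8", "replace"))
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return "UNKNOWN"


print(interpret_ping(200, "ACTIVE\n"))   # ACTIVE
print(interpret_ping(403, "PASSIVE\n"))  # PASSIVE
# Example call against a real node: ping_node("http://server-A:9181")
```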

http(s)://serverhost:[Port]/api/sb_healthcheck

This URL returns information about a cluster node:

{
    "status": "operational",
    "nodeId": "gateway_1:8080-mft-gw-0",
    "nodeHostname": "gateway_1",
    "nodeIPAddress": "172.99.0.101",
    "nodePort": "8080",
    "nodeStatus": "PASSIVE",
    "nodeUptime": "50h1m56.09356413s",
    "nodeLastUpdate": "2023-11-15T19:03:30.481154Z",
    "nodeLastActiveDate": "2023-11-13T16:57:57.026091Z"
}

This API requires authentication but no specific permissions.

http(s)://serverhost:[Port]/api/sb_mgmt_nodes

This URL returns information about all the cluster nodes:

{
    "nodes": [
        {
            "nodeId": "gateway:8080-mft-gw-0",
            "nodeHostname": "gateway",
            "nodeIPAddress": "172.99.0.100",
            "nodePort": "8080",
            "nodeStatus": "ACTIVE",
            "nodeUptime": "16m15.413255244s",
            "nodeLastUpdate": "2023-11-09T15:24:20.562225Z",
            "nodeLastActiveDate": "2023-11-09T15:08:40.105002Z" 
        },
        {
            "nodeId": "gateway_1:8080-mft-gw-0",
            "nodeHostname": "gateway_1",
            "nodeIPAddress": "172.99.0.101",
            "nodePort": "8080",
            "nodeStatus": "PASSIVE",
            "nodeUptime": "16m0.631810433s",
            "nodeLastUpdate": "2023-11-09T15:24:35.289412Z",
            "nodeLastActiveDate": "2023-11-09T14:40:28.491424Z" 
        }
    ]
}

This API requires authentication and the 'administration read' permission.

The nodeStatus value returned by these endpoints can be used to direct the UDMG Admin UI and REST API requests to the active instance.
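
For example, the response from the sb_mgmt_nodes endpoint can be filtered for the node whose nodeStatus is ACTIVE. The Python sketch below does this with a shortened copy of the sample response above; the `active_nodes` helper is a hypothetical name, and the authenticated request itself is left out.

```python
import json
from typing import List


def active_nodes(payload: str) -> List[str]:
    """Return 'hostname:port' for every node reported as ACTIVE."""
    data = json.loads(payload)
    return [
        f'{node["nodeHostname"]}:{node["nodePort"]}'
        for node in data.get("nodes", [])
        if node.get("nodeStatus") == "ACTIVE"
    ]


# Shortened version of the sample response shown above.
sample = """
{
    "nodes": [
        {"nodeHostname": "gateway", "nodePort": "8080", "nodeStatus": "ACTIVE"},
        {"nodeHostname": "gateway_1", "nodePort": "8080", "nodeStatus": "PASSIVE"}
    ]
}
"""

print(active_nodes(sample))  # ['gateway:8080']
```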

Local MFT server connections

Until the node is marked as 'ACTIVE', the local server services that are controlled by UDMG Server are not started and there is no listener on their defined ports.

As there can be some delay between the node status change and the actual start and stop of each local MFT server, it is recommended to use a TCP connection test to determine which node currently owns the listener.

If you have the possibility of performing a double health check, you can choose to:

  • Validate the node status endpoint
  • Validate that the port is available
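
The double health check can be sketched in Python as follows: a node is considered ready only when its reported mode is ACTIVE and a TCP connection to the local server port succeeds. The function names and the 2-second timeout are assumptions for this example; the node mode would come from the status endpoint described above.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """TCP connection test: True if something accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: no usable listener.
        return False


def node_is_serving(node_mode: str, host: str, port: int) -> bool:
    """Combine both checks: the node reports ACTIVE and the port accepts connections."""
    return node_mode == "ACTIVE" and port_is_listening(host, port)


# Example call (hypothetical host and port):
# node_is_serving("ACTIVE", "server-A", 2222)
```

The port test covers the interval in which a node already reports ACTIVE but its local MFT servers have not finished starting.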


Example of a configuration with HAProxy:

# --------------------------------------------------------------------------- #
# Configuration for: <service-name> 
# --------------------------------------------------------------------------- #

frontend frontend-<service-name>
  bind *:<service-port> transparent
  mode tcp
  log             global
  option tcplog
  use_backend backend-<service-name>

backend backend-<service-name>
  mode tcp
  stick-table type ip size 10k expire 300s
  stick on src

  # Healthcheck
  option httpchk
  http-check send meth GET uri /ping
  http-check expect status 200
  
  # Pool Servers
  server node-0 <hostname-node-0> check port <node-0-server-api-port> cookie S01  check inter 300s
  server node-1 <hostname-node-1> check port <node-1-server-api-port> cookie S02  check inter 300s