Network Fault Tolerance - Universal Data Mover

Overview

For Universal Data Mover, fault tolerance is the capability of its components to recover or restart from an array of error conditions that occur in any large IT organization.

Errors occur as a result of human, software, or hardware conditions. The more resilient a product is to errors, the greater value it offers.

Currently, network fault tolerance is implemented in one Universal Data Mover component:

  • Universal Data Mover

Network Fault Tolerance

UDM uses the TCP/IP protocol for communications over a data network. TCP/IP is a mature, robust protocol capable of resending and rerouting packets when network errors occur. However, data networks can have problems severe enough to prevent TCP/IP from recovering. When that happens, TCP/IP terminates the connection between the application programs. Like any application using TCP/IP, UDM is subject to these network errors. Should they occur, a product can no longer communicate and must shut down or restart. These types of errors normally show themselves as premature closes, connection resets, time-outs, or broken pipe errors.

UDM provides the ability to circumvent these types of errors with its Network Fault Tolerant protocol. By using the Network Fault Tolerant protocol, UDM traps the connection termination caused by the network error and reestablishes the network connections. Once the connections are reestablished, processing automatically resumes from the location of the last successful message exchange. No program restarts are required and no data are lost.

The Network Fault Tolerant protocol acknowledges each successfully received message and checkpoints each successfully sent message. This extra work reduces data throughput. Consequently, the use of network fault tolerance should be carefully weighed in terms of increased execution time versus the probability of network errors and the cost of such errors. For example, it may be easier to restart a program than to incur increased execution time.
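The acknowledge-and-checkpoint idea can be illustrated with a minimal Python sketch. It is not UDM's actual wire protocol: the Receiver class, the send_all function, and the lose_ack_for parameter are hypothetical names, used only to show how a sender that remembers the last acknowledged message can resume after a connection loss without losing or duplicating data.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiving side: stores messages and acknowledges them."""
    received: list = field(default_factory=list)

    def deliver(self, seq: int, payload: str) -> int:
        # Accept only the next expected message; a resend after a
        # reconnect carries an old sequence number and is ignored.
        if seq == len(self.received):
            self.received.append(payload)
        # The acknowledgement is the count of messages safely received.
        return len(self.received)


def send_all(messages, receiver, lose_ack_for=None):
    """Send messages in order, resuming from the last acknowledged one.

    lose_ack_for simulates a single network error in which the message
    with that index is delivered but its acknowledgement is lost.
    Returns the number of reconnects that were needed.
    """
    checkpoint = 0   # index of the next message not yet acknowledged
    reconnects = 0
    while checkpoint < len(messages):
        ack = receiver.deliver(checkpoint, messages[checkpoint])
        if lose_ack_for == checkpoint:
            lose_ack_for = None   # the simulated error happens only once
            reconnects += 1       # reconnect, then resend from the checkpoint
            continue
        checkpoint = ack          # advance only on a received acknowledgement
    return reconnects
```

In this sketch the per-message acknowledgement is also where the throughput cost comes from: the sender cannot advance its checkpoint until the receiver has answered.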

When a network connection terminates, the UDM Manager will enter a network reconnect phase. In the reconnect phase, the Manager attempts to connect to the UDM Server and reestablish its network connections. The condition that caused the network error can persist for seconds or days. The Manager will attempt Server reconnection for a limited amount of time (configured with the RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL configuration options). These two options specify, respectively, how many reconnect attempts are made and how often they are made. After all attempts have failed, the Manager ends with an error.
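The reconnect phase amounts to a bounded retry loop. The following Python sketch is an illustration under stated assumptions, not UDM code: the reconnect function and its connect callback are hypothetical, retry_count and retry_interval stand in for RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL, and the interval is assumed to be in seconds and to apply between attempts.

```python
import time


def reconnect(connect, retry_count, retry_interval, sleep=time.sleep):
    """Call connect() up to retry_count times.

    Waits retry_interval seconds between failed attempts. Returns the
    number of the attempt that succeeded, or raises ConnectionError
    once every attempt has failed.
    """
    for attempt in range(1, retry_count + 1):
        try:
            connect()
            return attempt
        except OSError:
            # Connection resets, broken pipes, and time-outs are all
            # subclasses of OSError in Python 3.
            if attempt < retry_count:
                sleep(retry_interval)
    raise ConnectionError(f"reconnect failed after {retry_count} attempts")
```

Under these assumptions, the longest outage the Manager can ride out is roughly the retry count multiplied by the retry interval, which is the trade-off to consider when choosing the two values.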

When a network connection terminates, the Server enters a disconnected state and waits for the Manager to reconnect. The user process continues running; however, if the user process attempts any I/O on the standard files, it will block. The Server waits for the Manager to reconnect for a period of time defined by the Manager's RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL configuration options. Once that time has expired, the Server terminates the user process and exits.

UDM can request the use of the Network Fault Tolerant protocol. If the Server does not support the protocol or is not configured to accept the protocol, the Manager continues without using the protocol.

The NETWORK_FAULT_TOLERANT and RECONNECT_RETRY_INTERVAL options are used to request the protocol.

Open Retry

Open Retry is a type of fault tolerance used at the session-establishment level.

(Network fault tolerance is used from the time that a session has been fully established until the session has terminated.)

UDM tries to establish a session when the open command is issued. If the OPEN_RETRY configuration option is set to yes and UDM fails to establish the session because of a network error, a timeout, or the inability to start a transfer server, it retries the open command according to the settings of the OPEN_RETRY_COUNT and OPEN_RETRY_INTERVAL configuration options.
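The Open Retry behavior can be sketched in Python as follows. This is an illustration only: open_session and open_once are hypothetical names, the three parameters stand in for OPEN_RETRY, OPEN_RETRY_COUNT, and OPEN_RETRY_INTERVAL, and it is an assumption that the count refers to retries made after the initial attempt and that the interval is in seconds.

```python
import time


def open_session(open_once, open_retry, open_retry_count,
                 open_retry_interval, sleep=time.sleep):
    """Call open_once(); retry on failure only if open_retry is enabled.

    Returns whatever open_once() returns. If no retries remain, the
    last error is re-raised to the caller.
    """
    retries_left = open_retry_count if open_retry else 0
    while True:
        try:
            return open_once()
        except OSError:
            # Network errors and time-outs are OSError subclasses.
            if retries_left == 0:
                raise
            retries_left -= 1
            sleep(open_retry_interval)
```

With open_retry disabled, the first failure is reported immediately, which matches the case where the option is not set to yes.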

Component Management

In order to fully understand UDM fault tolerant features, some understanding of how the Universal Broker manages components is necessary.

Universal Broker manages component start-up, execution, and termination. Universal Broker and its components have the ability to communicate service requests and status information between each other.

Universal Broker maintains a database of components that are active or that have completed and are waiting for restart or reconnection. The component information maintained by Universal Broker determines the current state of the component. Universal Broker requires this state information to determine whether a restart or reconnect request from a Manager is acceptable. The Universal Broker component information can be viewed with the Universal Query utility.

One piece of component information maintained by Universal Broker is the component's communication state. The communication state primarily describes the state of the Universal Data Mover Server with regard to its network connection with a Manager and the completion of the user process and its associated spooled data.

Communication State Values

The following table describes the communication state values.

  • Reconnect column indicates whether or not a network reconnect request is valid.
  • Restart column indicates whether or not a restart request is valid.

State         Reconnect  Restart  Description
------------  ---------  -------  -----------
COMPLETED     NO         NO       Server and Manager have completed. All standard output and standard error files, along with the user process's exit status, have been sent to the Manager.
DISCONNECTED  YES        YES      Server is not connected to the Manager. This occurs when a network error has occurred, the Manager halted, or the Manager host halted. The Server is either executing with the Network Fault Tolerant protocol, is restartable, or both. Note: The Server cannot tell whether the Manager is still executing, since it cannot communicate with the Manager.
ESTABLISHED   NO         NO       Server and Manager are connected and processing normally. This is the most common state when all is well.
RECONNECTING  NO         NO       Server has received a reconnect request from the Manager to recover a lost network connection. This state should not last long, only for the time it takes to re-establish the network connections.
STARTED       NO         NO       Server has started. If the Server is restartable, it is receiving the standard input file from the Manager and spooling it.
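
The Reconnect and Restart values listed above can be read as a simple lookup keyed by communication state. The Python sketch below is a hypothetical illustration of how such a check might look; CommState, VALID_REQUESTS, can_reconnect, and can_restart are invented names, not part of Universal Broker, and only the state names and YES/NO values are taken from this section.

```python
from enum import Enum


class CommState(Enum):
    COMPLETED = "COMPLETED"
    DISCONNECTED = "DISCONNECTED"
    ESTABLISHED = "ESTABLISHED"
    RECONNECTING = "RECONNECTING"
    STARTED = "STARTED"


# (reconnect request valid, restart request valid) for each state,
# copied from the communication state values in this section.
VALID_REQUESTS = {
    CommState.COMPLETED: (False, False),
    CommState.DISCONNECTED: (True, True),
    CommState.ESTABLISHED: (False, False),
    CommState.RECONNECTING: (False, False),
    CommState.STARTED: (False, False),
}


def can_reconnect(state: CommState) -> bool:
    """Return True if a network reconnect request is valid in this state."""
    return VALID_REQUESTS[state][0]


def can_restart(state: CommState) -> bool:
    """Return True if a restart request is valid in this state."""
    return VALID_REQUESTS[state][1]
```

Only the DISCONNECTED state accepts either kind of request, so a Manager request that arrives in any other state would be rejected under this model.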