Network Fault Tolerance - Universal Data Mover
Overview
For Universal Data Mover, fault tolerance is the capability of its components to recover or restart from an array of error conditions that occur in any large IT organization.
Errors occur as a result of human, software, or hardware conditions. The more resilient a product is to errors, the greater value it offers.
Currently, network fault tolerance is implemented in one Universal Data Mover component:
- Universal Data Mover
Network Fault Tolerance
UDM uses the TCP/IP protocol for communications over a data network. The TCP/IP protocol is a mature, robust protocol capable of resending and rerouting packets when network errors occur. However, data networks can have problems significant enough to prevent the TCP/IP protocol from recovering. When that happens, the TCP/IP protocol terminates the connection between the application programs. Like any application using TCP/IP, UDM is subject to these network errors. Should they occur, a product can no longer communicate and must shut down or restart. These types of errors normally show themselves as premature closes, connection resets, time-outs, or broken pipe errors.
UDM provides the ability to circumvent these types of errors with its Network Fault Tolerant protocol. By using the Network Fault Tolerant protocol, UDM traps the connection termination caused by the network error and reestablishes the network connections. Once connections are reestablished, processing automatically resumes from the location of the last successful message exchange. No program restarts are required and no data are lost.
The Network Fault Tolerant protocol acknowledges successfully received messages and checkpoints successfully sent messages. This extra work reduces data throughput. Consequently, the use of network fault tolerance should be carefully weighed in terms of increased execution time versus the probability of network errors and the cost of such errors. For example, it may be easier to restart a program than to incur increased execution time.
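The acknowledge-and-checkpoint idea can be illustrated with a small simulation. This is a minimal sketch of the general technique, not UDM's actual wire protocol: the `Receiver` class, the `send_all` function, and the `drop_before` and `drop_after` parameters are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Receiving side: keeps every message it has accepted, in order."""
    received: list = field(default_factory=list)

    def deliver(self, seq: int, payload: str) -> int:
        # Accept only the next expected message; a resent duplicate is ignored.
        if seq == len(self.received):
            self.received.append(payload)
        # The acknowledgement is the count of messages received so far.
        return len(self.received)


def send_all(messages, receiver, drop_before=frozenset(), drop_after=frozenset()):
    """Send messages in order, resuming from the last acknowledged message.

    drop_before: send attempts where the connection fails before delivery.
    drop_after:  send attempts where the message arrives but the
                 acknowledgement is lost.
    Returns the number of simulated reconnects.
    """
    acked = 0        # checkpoint: messages known to have been received
    attempt = 0
    reconnects = 0
    while acked < len(messages):
        attempt += 1
        if attempt in drop_before:
            reconnects += 1          # reconnect, then resend from checkpoint
            continue
        ack = receiver.deliver(acked, messages[acked])
        if attempt in drop_after:
            reconnects += 1          # ack lost: checkpoint is not advanced
            continue
        acked = ack
    return reconnects
```

Because the sender only advances its checkpoint on an acknowledgement and the receiver discards duplicates, a failure at any point costs at most one resent message. The per-message acknowledgement is also where the throughput cost described above comes from.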
When a network connection terminates, the UDM Manager will enter a network reconnect phase. In the reconnect phase, the Manager attempts to connect to the UDM Server and reestablish its network connections. The condition that caused the network error can persist for seconds or days. The Manager will attempt Server reconnection for a limited amount of time (configured with the RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL configuration options). These two options specify, respectively, how many reconnect attempts are made and how often they are made. After all attempts have failed, the Manager ends with an error.
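The bounded reconnect behavior can be sketched as a simple retry loop. This is an illustration only, assuming the count is the total number of attempts and the interval is the wait in seconds between them; the function name `reconnect` and its parameters are hypothetical and do not describe the Manager's internal implementation.

```python
import time
from typing import Callable


def reconnect(connect: Callable[[], bool],
              retry_count: int,
              retry_interval: float,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Attempt to reconnect up to retry_count times.

    Waits retry_interval seconds between attempts. Returns the number of
    the attempt that succeeded, or raises ConnectionError if none did.
    """
    for attempt in range(1, retry_count + 1):
        if connect():
            return attempt
        if attempt < retry_count:
            sleep(retry_interval)   # wait before the next attempt
    raise ConnectionError(f"reconnect failed after {retry_count} attempts")
```

Under these assumptions, a count of 20 with an interval of 60 seconds would keep trying for roughly 19 minutes before ending with an error, so the two values together should cover the longest outage that is worth waiting out.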
When a network connection terminates, the Server enters a disconnected state and waits for the Manager to reconnect. The user process continues running; however, if the user process attempts any I/O on the standard files, it will block. The Server waits for the Manager to reconnect for a period of time defined by the Manager's RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL configuration options. Once that time has expired, the Server terminates the user process and exits.
The UDM Manager can request the use of the Network Fault Tolerant protocol. If the Server does not support the protocol or is not configured to accept the protocol, the Manager continues without using the protocol.
The NETWORK_FAULT_TOLERANT option is used to request the protocol.
Open Retry
Open Retry is a type of fault tolerance used at the session-establishment level.
(Network fault tolerance is used from the time that a session has been fully established until the session has terminated.)
Open Retry is used during the establishment phase of a session. UDM tries to establish a session when the open command is issued. If the OPEN_RETRY configuration option value is yes, and UDM fails to establish the session due to a network error, timeout, or the inability to start a transfer server, it will retry the open command based on the settings of the OPEN_RETRY_COUNT and OPEN_RETRY_INTERVAL configuration options.
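The Open Retry decision can be sketched as follows. This is a minimal illustration that assumes the count is the number of retries after the first attempt; `open_with_retry`, `ServerStartError`, and the `open_once` callable are hypothetical names, not part of UDM.

```python
import time


class ServerStartError(Exception):
    """Hypothetical error: the transfer server could not be started."""


# Failure types that the text above says trigger a retry of the open command.
RETRYABLE = (ConnectionError, TimeoutError, ServerStartError)


def open_with_retry(open_once, open_retry, retry_count, retry_interval,
                    sleep=time.sleep):
    """Call open_once(), retrying retryable failures when open_retry is set.

    Assumes retry_count is the number of retries after the first attempt.
    With open_retry set to False, the first failure is raised immediately.
    """
    retries_left = retry_count if open_retry else 0
    while True:
        try:
            return open_once()
        except RETRYABLE:
            if retries_left == 0:
                raise               # out of retries: report the failure
            retries_left -= 1
            sleep(retry_interval)   # wait before reissuing the open
```

Failures outside the three listed categories are not caught and propagate at once, which matches the text's restriction of Open Retry to network errors, timeouts, and server start failures.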
Component Management
In order to fully understand UDM fault tolerant features, some understanding of how the Universal Broker manages components is necessary.
Universal Broker manages component start-up, execution, and termination. Universal Broker and its components have the ability to communicate service requests and status information between each other.
Universal Broker maintains a database of components that are active or that have completed and are waiting for restart or reconnection. The component information maintained by Universal Broker determines the current state of the component. This state information is required by Universal Broker to determine whether or not a restart or reconnect request from a Manager is acceptable. The Universal Broker component information can be viewed with the Universal Query utility.
One piece of component information maintained by Universal Broker is the component's communication state. The communication state primarily describes the state of the Universal Data Mover Server with regard to its network connection with a Manager and the completion of the user process and its associated spooled data.
Communication State Values
The following table describes the communication state values.
- Reconnect column indicates whether or not a network reconnect request is valid.
- Restart column indicates whether or not a restart request is valid.
State | Reconnect | Restart | Description |
---|---|---|---|
COMPLETED | NO | NO | Server and Manager have completed. All standard output and standard error files, along with the user process's exit status, have been sent to the Manager. |
DISCONNECTED | YES | YES | Server is not connected to the Manager. This occurs when a network error has occurred, the Manager halted, or the Manager host halted. Note: The Server cannot tell whether the Manager is still executing, since it cannot communicate with the Manager. |
ESTABLISHED | NO | NO | Server and Manager are connected and processing normally. This is the most common state when all is well. |
RECONNECTING | NO | NO | Server has received a reconnect request from the Manager to recover a lost network connection. |
STARTED | NO | NO | Server has started. |
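The table can be read as a lookup from communication state to the requests that are valid in that state. The sketch below encodes it that way; the `CommState` enum, the `ALLOWED` mapping, and the `request_is_valid` function are hypothetical names used for illustration and are not a Universal Broker API.

```python
from enum import Enum


class CommState(Enum):
    COMPLETED = "COMPLETED"
    DISCONNECTED = "DISCONNECTED"
    ESTABLISHED = "ESTABLISHED"
    RECONNECTING = "RECONNECTING"
    STARTED = "STARTED"


# (reconnect valid, restart valid), copied from the table above.
ALLOWED = {
    CommState.COMPLETED: (False, False),
    CommState.DISCONNECTED: (True, True),
    CommState.ESTABLISHED: (False, False),
    CommState.RECONNECTING: (False, False),
    CommState.STARTED: (False, False),
}


def request_is_valid(state: CommState, request: str) -> bool:
    """Return True if a 'reconnect' or 'restart' request is valid in state."""
    reconnect_ok, restart_ok = ALLOWED[state]
    if request == "reconnect":
        return reconnect_ok
    if request == "restart":
        return restart_ok
    raise ValueError(f"unknown request type: {request!r}")
```

For example, `request_is_valid(CommState.DISCONNECTED, "reconnect")` is true, while the same request against an ESTABLISHED component is rejected because a Manager is already connected.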