Network Fault Tolerance - Universal Command
Overview
Universal Command uses the TCP/IP protocol for communications over a data network. TCP/IP is a mature, robust protocol capable of resending and rerouting packets when network errors occur. However, some network problems are severe enough that TCP/IP cannot recover from them. When that happens, TCP/IP terminates the connection between the application programs.
As with any application using TCP/IP, Universal Command is subject to these network errors. Should they occur, a product can no longer communicate and must shut down or restart. These types of errors normally show themselves as premature closes, connection resets, time-outs, or broken pipe errors.
Network Fault Tolerant Protocol
Universal Command provides the ability to circumvent network errors with its Network Fault Tolerant protocol. By using this protocol, Universal Command traps the connection termination caused by the network error and reestablishes the network connections. When connections have been reestablished, processing resumes automatically from the location of the last successful message exchange. No program restarts are required and no data is lost.
The Network Fault Tolerant protocol acknowledges successfully received messages and checkpoints successfully sent messages. This reduces data throughput. Consequently, the use of network fault tolerance should be weighed carefully: increased execution time on one side, the probability and cost of network errors on the other. For example, it may be easier to restart a program than to incur increased execution time.
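The acknowledge-and-checkpoint idea can be illustrated with a small sketch. This is not Universal Command's actual wire protocol, which this document does not describe. The class name, the method names, and the use of sequence numbers are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import List, Tuple


class CheckpointedSender:
    """Hypothetical sender that keeps every message until the peer acknowledges it."""

    def __init__(self) -> None:
        self._next_seq = 1
        # Checkpoint store: sequence number -> payload, in send order.
        self._unacked: "OrderedDict[int, bytes]" = OrderedDict()

    def send(self, payload: bytes) -> int:
        """Checkpoint the message and return its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq

    def acknowledge(self, seq: int) -> None:
        """Discard every checkpointed message up to and including seq."""
        for acked in [s for s in self._unacked if s <= seq]:
            del self._unacked[acked]

    def pending_after_reconnect(self) -> List[Tuple[int, bytes]]:
        """Messages to resend once the connection is reestablished."""
        return list(self._unacked.items())
```

If three messages are sent and only the first two are acknowledged before the connection drops, `pending_after_reconnect()` returns just the third message. The bookkeeping on every message is the kind of overhead that reduces throughput.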
When a network connection terminates, the Universal Command Manager enters a network reconnect phase. In this phase, the Manager attempts to connect to the Universal Command Server and reestablish its network connections. The condition that caused the network error may last only seconds, or it may persist for days.
The Manager attempts Server reconnection for a limited amount of time, as specified by the following configuration options:
- RECONNECT_RETRY_COUNT (number of retry attempts)
- RECONNECT_RETRY_INTERVAL (time between retry attempts)
If all attempts fail, the Manager ends with an error.
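The bounded reconnect behavior can be sketched as a simple retry loop. This is an illustration only: the function `reconnect`, its parameters, and the choice to make the first attempt immediately (waiting only between attempts) are assumptions, not documented Manager behavior. The interval is assumed to be in seconds.

```python
import time
from typing import Callable


def reconnect(
    connect: Callable[[], bool],
    retry_count: int,
    retry_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try connect() up to retry_count times, pausing retry_interval seconds between tries.

    Returns True on the first successful attempt, False if every attempt fails.
    """
    for attempt in range(1, retry_count + 1):
        if connect():
            return True
        # No pause is needed after the final failed attempt.
        if attempt < retry_count:
            sleep(retry_interval)
    return False
```

Under these assumptions, the Manager gives up after roughly `retry_count * retry_interval` seconds, so the two options together should cover the longest outage worth waiting out.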
When a network connection terminates, the Server's action depends on whether or not it is executing with Manager Fault Tolerance.
Without Manager Fault Tolerance, the Server enters a disconnected state and waits for the Manager to reconnect. The user process continues running. However, if the user process attempts any I/O on the standard files, it will block. The Server waits for the Manager to reconnect for a period of time defined by the Manager's RECONNECT_RETRY_COUNT and RECONNECT_RETRY_INTERVAL configuration options. When that time has expired, the Server terminates the user process and exits.
With Manager Fault Tolerance, the Server continues executing in a disconnected state. The Server satisfies all user process standard I/O requests. The user process does not block. It continues to execute normally. When the user process ends, the Server waits for a Manager reconnect for a period of time defined by the JOB_RETENTION configuration option.
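The two Server behaviors can be summarized in a short sketch. The names `DisconnectPolicy` and `server_disconnect_policy` are hypothetical. The sketch also assumes that the wait without Manager Fault Tolerance equals the retry count multiplied by the retry interval, and that all durations are in seconds. The text above says only that the two options define the period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisconnectPolicy:
    user_io_blocks: bool            # does user process standard I/O block while disconnected?
    wait_starts_when: str           # event that starts the Server's reconnect wait
    wait_seconds: float             # how long the Server waits for the Manager
    terminates_user_process: bool   # is the user process terminated when the wait expires?


def server_disconnect_policy(
    manager_fault_tolerant: bool,
    retry_count: int,
    retry_interval: float,
    job_retention: float,
) -> DisconnectPolicy:
    """Describe how the Server reacts to a terminated connection (illustrative only)."""
    if manager_fault_tolerant:
        # The Server keeps servicing standard I/O; the wait begins after the job ends.
        return DisconnectPolicy(False, "user process ends", job_retention, False)
    # Assumed: total wait is the product of the Manager's two reconnect options.
    return DisconnectPolicy(
        True, "connection terminates", retry_count * retry_interval, True
    )
```

The key difference is when the clock starts: without Manager Fault Tolerance it starts at the disconnect, while with it the user process runs to completion first.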
Universal Command Manager
You can configure Universal Command Manager to request the use of the Network Fault Tolerant protocol via its NETWORK_FAULT_TOLERANT configuration option.
If the Server does not support the protocol or is not configured to accept the protocol, the Manager continues without using the protocol.
Universal Command Server
You can configure Universal Command Server with or without the Network Fault Tolerant protocol via its NETWORK_FAULT_TOLERANT configuration option.
If the Server is configured with the protocol off, the Manager cannot override it. If the Server is configured with the protocol on, the Manager NETWORK_FAULT_TOLERANT configuration option specifies whether or not the protocol is actually used.
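The interaction between the Manager and Server settings reduces to a logical AND. The sketch below uses a hypothetical function, `protocol_in_use`, with boolean stand-ins for each side's NETWORK_FAULT_TOLERANT option. It does not reflect actual configuration syntax.

```python
def protocol_in_use(manager_requests: bool, server_accepts: bool) -> bool:
    """Return True only when the Manager requests the protocol and the Server accepts it.

    server_accepts is False when the Server does not support the protocol
    or is configured with it off; the Manager cannot override that.
    """
    return manager_requests and server_accepts


# Print the full truth table for the two settings.
for manager in (False, True):
    for server in (False, True):
        print(f"manager={manager!s:5} server={server!s:5} -> {protocol_in_use(manager, server)}")
```

Only the combination where both sides have the option on results in a fault tolerant session. In every other case, the Manager proceeds without the protocol.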