z/OS Installation - Sysplex
Introduction
IBM provides the ability to cluster z/OS systems together using the IBM Sysplex technology, which is a combination of IBM hardware and software components. The individual z/OS systems are referred to as Sysplex members.
The IBM JES subsystem supports a Multi-Access Spool (MAS) configuration that allows for batch jobs to be distributed among participating JES subsystems. A JES MAS configuration may be used independently of a Sysplex environment or in combination with a Sysplex environment. When used in combination with a Sysplex environment, IBM recommends the JES MAS configuration match the Sysplex configuration.
The Universal Agent for z/OS Sysplex feature provides for the management of workload across all Sysplex members. This page describes the general architecture and design of the Universal Agent for z/OS Sysplex feature.
Sysplex Solution
From a workload management perspective, a z/OS Sysplex can be represented as a single z/OS image. This single-system view of the Sysplex is provided by one Agent, called the Primary agent, which can run on any Sysplex member. Other agents, called Secondary agents, run on the other Sysplex members.
Note
The sysplex_role Universal Broker configuration option is used to select the Sysplex role for an agent.
Neither the Universal Controller nor the Universal Agent for z/OS participates in the distribution of workload across the Sysplex images. The Controller simply executes z/OS tasks on the Primary z/OS agent that represents the Sysplex.
A batch job submitted to JES on one z/OS system may be routed by JES or by IBM Workload Manager (WLM) to any one of the Sysplex z/OS members. The routing or distribution of batch workload is based on JCL specifications, system configuration and the state of the Sysplex members.
Universal Controller starts a z/OS task by sending a task start request to the Primary Universal Agent for z/OS. The Agent submits the requested job to JES. The job can potentially execute on any one of the Sysplex members. The Agents installed in the z/OS Sysplex cooperate with each other to manage the execution of the job.
Each Agent in the Sysplex can provide complete job management capabilities regardless of which Agent in the Sysplex submitted the job to JES.
Job management capabilities include:
- Automatic data set cleanup prior to job execution.
- Tracking the execution of the job and job steps.
- Collecting and retrieving the job's JES sysout data sets.
The z/OS Agents use the IBM Cross-System Coupling Facility (XCF) for Agent-to-Agent communication within the Sysplex. The Agents utilize the XCF data sharing capabilities for message passing and sharing of common data structures.
UAG Sysplex
UAG for z/OS will create and join a Sysplex group if it is running as Sysplex aware.
The XCF group name will consist of the characters UAG, followed by the first 4 (upper-case) characters of the system ID (system_id from UBRCFG00).
Each UAG will have a member name, which will be the group name appended with @ and followed by the MVS system name.
For example, UAG with a system ID of mndv, running on DVZOS202, would have a group name of UAGMNDV and a member name of UAGMNDV@DVZOS202.
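The naming rule above can be sketched in a few lines of Python. This only illustrates the rule as stated on this page; the function name `xcf_names` is hypothetical and is not part of UAG.

```python
from typing import Tuple


def xcf_names(system_id: str, mvs_system_name: str) -> Tuple[str, str]:
    """Derive the XCF group and member names from the rule described above."""
    # Group name: the characters UAG plus the first 4 characters of the
    # system ID, converted to upper case.
    group = "UAG" + system_id[:4].upper()
    # Member name: the group name, then @, then the MVS system name.
    member = group + "@" + mvs_system_name
    return group, member


print(xcf_names("mndv", "DVZOS202"))  # ('UAGMNDV', 'UAGMNDV@DVZOS202')
```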
UAG Sysplex System View
The Sysplex System View below illustrates the UAG deployment in a sample Sysplex environment. The Sysplex environment consists of two z/OS images, SYS1 and SYS2, and the Sysplex shared resources, JES, DASD, and XCF.
The following diagram illustrates a job submitted to one of the two Sysplex members and the SMF exits that are called. The SMF exits reference the JME in ECSA and send events to the local UAGSRV via the event queue in z/OS High Common Storage.
- A Launch message is received by the Primary UAG.
- The Primary UAG writes a record to the Job Submission Checkpoint dataset, processes the JCL and submits it to the z/OS Internal Reader.
- The JCL passes through JCL conversion and interpretation. The UAGUJV exit is invoked and sends an Event message to UAG to prompt it to look for JCL errors that might have prevented the job from entering the execution phase. (JCL conversion and interpretation can happen on different processors in the system depending on the system configuration.)
- If a JCL error preventing the job from running is detected, a status message is sent to the Universal Controller and processing ends.
- Once the job starts execution (on whatever system), program UAGRERUN gets control as the first step in the job. UAGRERUN performs pre-processing necessary to run and track the job on the local z/OS system. It creates the JME in ECSA to allow the SMF exits to track the job through Step Initiation, Step End and Job End processing.
- As the job passes through Step End and Job End, Events are created on the Event Queue tracking the job's progress.
- Events are removed from the Event Queue by the UAGSRV instance running on the system and processed.
- In the case of a Primary UAG, Status messages are sent to the Universal Controller. In the case of a Secondary UAG, messages are queued to the XCF Message Queue. These messages are removed from the queue by the Primary UAG and Status messages are sent to the Universal Controller.
- In both cases, any requested output is written to the UNVSPOOL directory for retrieval by the Universal Controller.
- Any information required for eventual rerun processing is also sent to the Universal Controller.
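The status routing in the last steps can be pictured with a small Python model. This is a toy sketch, not the UAG implementation: the class `AgentModel`, its method names, and the use of an in-memory `deque` in place of the XCF Message Queue are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


class AgentModel:
    """Toy stand-in for one UAGSRV instance (hypothetical, for illustration)."""

    def __init__(self, role: str, xcf_queue: Deque[str], controller: List[str]):
        self.role = role              # "primary" or "secondary"
        self.xcf_queue = xcf_queue    # stands in for the XCF Message Queue
        self.controller = controller  # stands in for the Universal Controller

    def process_event(self, event: str) -> None:
        # A Primary sends status straight to the Controller; a Secondary
        # queues the message for the Primary to pick up.
        if self.role == "primary":
            self.controller.append(event)
        else:
            self.xcf_queue.append(event)

    def drain_xcf_queue(self) -> None:
        # Only the Primary removes messages from the queue and forwards them.
        if self.role != "primary":
            raise RuntimeError("only the Primary agent drains the queue")
        while self.xcf_queue:
            self.controller.append(self.xcf_queue.popleft())


xcf_queue: Deque[str] = deque()
controller: List[str] = []
primary = AgentModel("primary", xcf_queue, controller)
secondary = AgentModel("secondary", xcf_queue, controller)

secondary.process_event("JOB1 step end on SYS2")
primary.process_event("JOB2 job end on SYS1")
primary.drain_xcf_queue()
print(controller)  # ['JOB2 job end on SYS1', 'JOB1 step end on SYS2']
```

In this model the Controller sees events from both systems, but only ever through the Primary, which matches the single-system view described earlier.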
Each of the components is described in the following table:
Component | Description |
---|---|
JES MAS | A JES MAS (Multi-Access Spool) environment, also known as a JESPLEX, provides for sharing JES resources between multiple z/OS images. The JES MAS environment can be implemented independently of a Sysplex; however, IBM recommends that the JES MAS environment match the Sysplex environment. |
Job Submission Checkpoint | The purpose of the JSC is to determine if a job should be tracked by UAG and to provide information to UAG to do so. JES may route the job to a different system for conversion/interpretation and execution. The system on which these steps take place cannot be determined prior to job submission. Note Use the JSC_DATASET UAG configuration option to specify the name of a VSAM Job Submission Checkpoint cluster. The VSAM cluster must be defined on a DASD volume that is available to all members in the Sysplex. |
UAG SPOOL Directory | The UAG SPOOL directory is used by UAG to store spooled output for a job. At a minimum, the UAGRERUN report is stored here. Other datasets that might be stored here are the JESMSGLG, JESJCL and JESYSMSG datasets. Note Use the SHARED_MOUNT_POINT and SHARED_MOUNT_POINT_MODE configuration options to specify the name and mode (access permissions) of the directory where the UNVSPOOL directory should be mounted. Note The JES_SYSOUT_RETENTION configuration option can be used to specify how long spooled output will be retained by UAG. Note The UNVSPOOL directory should be mounted as Sysplex aware by specifying the RWSHARE parameter on the MOUNT command. Failure to do so will result in a UNV3333E error during broker startup. |
XCF Message Queue | XCF (Cross-System Coupling Facility) is a Sysplex component that provides services for communications and data sharing between Sysplex members. Note Use the CF_STRUCT_NAME UAG configuration option to specify the name of a Coupling Facility structure that will be used to communicate from the Secondary Agents to the Primary Agent. |
UAGRERUN | Program UAGRERUN is a component of UAG. A job step is inserted at the start of every job submitted by UAG which executes this program. UAGRERUN performs pre-processing necessary to run and track a job on a system. This includes, but is not limited to, creating the JME control block structure in ECSA to allow the UAG SMF exits to track the job. Note UAGRERUN needs to be available to every job submitted by UAG on every Sysplex member. This can be accomplished by adding the load library containing UAGRERUN to the z/OS linklist. Alternatively, the RERUN_LOAD_LIBRARY configuration option can be used to specify the name of the APF authorized load library which contains the UAGRERUN program. This library would need to be made available to all members of the Sysplex. |
UAG SMF Exits | The UAG SMF exits are used to track a job submitted by UAG through its life cycle. |
Event Queue | The Event Queue is an area in shared z/OS high common storage allocated for each agent. UAGRERUN and the SMF exits can queue event messages to be consumed by the agent that owns the queue. Note The HIGH_COMMON_STORAGE configuration option can be used to limit the amount of high common storage used by an agent. When the limit is reached, further tracking events will be lost. |
UAGWMDBX | UAGWMDBX is a z/OS WTO Message Data Block exit which looks for certain WTOs related to jobs submitted by UAG. For example: JCL errors. |
CF List Type Structure
UAG uses a CF List type structure with the following values:
Setting | Value |
---|---|
List headers | 1 |
Lock table entry count | 1 |
Adjunct data | No |
Alterable | Yes |
Max number of list entries | 20 |
Max number of data elements | 160 |
Max number of data elements per entry | 128 |
Reference option | None |
Data element size descriptor | ElemIncrNum |
Data element size value | 2 |
Note
Users can alter only the Max number of list entries and Max number of data elements settings.
IBM provides a CFSIZER web tool (structure type OEM List) that can be used to calculate the structure size. (Given the input above, this tool returned INITSIZE and SIZE values of 9M.)
The structure name can be chosen by users and must be coded on the CF_STRUCT_NAME configuration option.
UAG uses this structure to communicate job tracking information from the Secondary agents to the Primary agent. List entries indicate events such as job start, step end and job end. List entries remain on the list until the primary agent has resources to process them. When the list structure is full, the secondary agents will wait until sufficient space is available before writing more tracking information. The required size of the structure is therefore dependent on the number of jobs being tracked, the number of job steps in those jobs and the resources available to the Primary agent to process the data.
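The capacity behaviour described above can be explored with a simplified Python model of the two limits users can alter. This is a sketch under stated assumptions, not a description of how the Coupling Facility accounts for storage: the class `ListStructureModel` and its methods are hypothetical, and the model only counts list entries and data elements using the default values from the table.

```python
from collections import deque
from typing import Optional


class ListStructureModel:
    """Simplified capacity model of the list structure (illustrative only)."""

    def __init__(self, max_entries: int = 20, max_elements: int = 160,
                 max_elements_per_entry: int = 128):
        self.max_entries = max_entries
        self.max_elements = max_elements
        self.max_elements_per_entry = max_elements_per_entry
        self.entries = deque()  # each item is the element count of one entry

    def elements_in_use(self) -> int:
        return sum(self.entries)

    def try_write(self, elements_needed: int) -> bool:
        """Return True if the entry fits; False means the writer must wait."""
        if not 0 <= elements_needed <= self.max_elements_per_entry:
            raise ValueError("entry exceeds the per-entry element limit")
        if len(self.entries) >= self.max_entries:
            return False
        if self.elements_in_use() + elements_needed > self.max_elements:
            return False
        self.entries.append(elements_needed)
        return True

    def read_oldest(self) -> Optional[int]:
        """Model the Primary agent removing the oldest entry from the list."""
        return self.entries.popleft() if self.entries else None


structure = ListStructureModel()
accepted = [structure.try_write(1) for _ in range(21)]
print(accepted.count(True))    # 20: the 21st entry does not fit
structure.read_oldest()        # the Primary processes one entry
print(structure.try_write(1))  # True: space is available again
```

A Secondary agent in this model would retry `try_write` until it returns `True`, which corresponds to waiting for the Primary agent to free space.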
File Monitor Support for Secondary Agents
File Monitors now function across a Sysplex. A File Monitor can be set, and the dataset Create, Change, or Delete will be detected on any system in the Sysplex where a UAG with the same system ID is running.
File Monitor events that occur while UAG is down will still be detected, as long as UAG was up when the File Monitor was set.
Exists and Missing File Monitors are resolved on the system where the Primary UAG is running. If a dataset is available only on a Secondary system, it will not be considered.
Configuration Parameters Used for Sysplex Configuration
Parameters in UBRCFG00
Name | Description |
---|---|
system_id | All Primary and Secondary Brokers that belong to the same group must have the same system_id. |
sysplex_role | Select a value of primary for the primary agent; select a value of secondary for all others. |
unix_spool_data_set | All Brokers that belong to the same group must reference the same dataset name. The dataset must reside on shared DASD that is available to all Sysplex members. This file system should not be shared among Brokers that are not part of the same Sysplex group. |
mount_point | zFS mount point for non-shared UNIX file systems (currently only UNVDB). This mount point should not be shared between systems. |
shared_mount_point | zFS mount point for shared UNIX file systems (currently only UNVSPOOL). All Brokers that belong to the same group must use the same directory. This mount point should be available to all members in the Sysplex. In non-Sysplex situations, this parameter can be omitted; it will default to the value specified for mount_point. |
mount_point_mode | Mode (access permissions) to use during mount_point initialization. |
shared_mount_point_mode | Mode (access permissions) to use during shared_mount_point initialization. In non-Sysplex situations, this parameter can be omitted; it will default to the value specified for mount_point_mode. |
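The defaulting rule for the two shared parameters can be written out as a short Python sketch. The function name `effective_shared_settings`, the dictionary representation of UBRCFG00, and the sample path and mode values are all hypothetical; the sketch also assumes that mount_point and mount_point_mode are always present.

```python
from typing import Dict


def effective_shared_settings(ubrcfg: Dict[str, str]) -> Dict[str, str]:
    """Resolve the shared settings, falling back to the non-shared ones."""
    return {
        # Omitted shared_mount_point defaults to mount_point.
        "shared_mount_point": ubrcfg.get(
            "shared_mount_point", ubrcfg["mount_point"]),
        # Omitted shared_mount_point_mode defaults to mount_point_mode.
        "shared_mount_point_mode": ubrcfg.get(
            "shared_mount_point_mode", ubrcfg["mount_point_mode"]),
    }


# Hypothetical non-Sysplex configuration: shared parameters omitted.
standalone = {"mount_point": "/u/ubroker", "mount_point_mode": "750"}
print(effective_shared_settings(standalone))
```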
Parameters in UAGCFG00
Name | Description |
---|---|
jsc_dataset | All Primary and Secondary agents that belong to the same group must use the same UNVJSC VSAM cluster. This cluster must be allocated on shared DASD that is available to all members in the Sysplex. Agents that are not part of the Sysplex group must use a different VSAM cluster. |
cf_struct_name | Name of the Coupling Facility structure which will be used to store the XCF Message Queue. All Primary and Secondary agents that belong to the same group must use the same structure. The structure should not be shared between agents that are not part of the same Sysplex group. |
automatic_failover | If the value of the sysplex_role parameter in UBRCFG00 is primary or secondary, this parameter can be used to control automatic failover. Automatic failover allows a Secondary agent to become the Primary agent when the original Primary agent ends. Valid values:
Default is never. (Also see the AUTOMATIC_FAILOVER UAG configuration option.) |
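Because several parameters must be identical across all members of a group, a simple consistency check can catch mistakes before the agents are started. The following Python sketch is one possible way to do that; the function `check_group`, the dictionary layout, and the sample values are hypothetical, and the "exactly one primary" rule is an interpretation of the sysplex_role description rather than a documented validation.

```python
from typing import Dict, List

# Parameters that, per the tables above, must match across the whole group.
SHARED_KEYS = ("system_id", "unix_spool_data_set", "shared_mount_point",
               "jsc_dataset", "cf_struct_name")


def check_group(members: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found in the intended group configuration."""
    problems = []
    for key in SHARED_KEYS:
        values = sorted({m[key] for m in members})
        if len(values) > 1:
            problems.append(f"{key} differs across members: {values}")
    primaries = [m for m in members if m["sysplex_role"] == "primary"]
    if len(primaries) != 1:
        problems.append(f"expected exactly one primary, found {len(primaries)}")
    return problems


# Hypothetical values for a two-member group.
base = {"system_id": "mndv", "unix_spool_data_set": "UNV.UNVSPOOL",
        "shared_mount_point": "/u/shared", "jsc_dataset": "UNV.UNVJSC",
        "cf_struct_name": "UAGSTRUCT1"}
sys1 = dict(base, sysplex_role="primary")
sys2 = dict(base, sysplex_role="secondary", jsc_dataset="UNV.OTHER.JSC")
print(check_group([sys1, sys2]))
```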
z/OS Console Commands
F <ubroker>,APPL=UAG,PRIMARY
This command causes an agent that is running in Sysplex Secondary mode to become a Primary agent until it is restarted or otherwise caused to become a Secondary agent.
If the agent is not running in Secondary mode, or a Primary agent is already active with the same system ID, the command will fail.
F <ubroker>,APPL=UAG,SECONDARY
This command causes an agent that is running in Sysplex Primary mode to become a Secondary agent until it is restarted or otherwise caused to become a Primary agent.
If the agent is not running in Primary mode, the command will fail.
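The success and failure conditions of the PRIMARY and SECONDARY commands can be summarised as a small state transition function. This Python sketch only restates the rules above; the function name `apply_role_command` and its parameters are hypothetical.

```python
def apply_role_command(current_role: str, command: str,
                       primary_already_active: bool) -> str:
    """Return the new role, or raise ValueError where the command would fail."""
    if command == "PRIMARY":
        # Fails unless the agent is a Secondary and no Primary with the
        # same system ID is already active.
        if current_role != "secondary":
            raise ValueError("agent is not running in Secondary mode")
        if primary_already_active:
            raise ValueError("a Primary agent is already active")
        return "primary"
    if command == "SECONDARY":
        # Fails unless the agent is currently the Primary.
        if current_role != "primary":
            raise ValueError("agent is not running in Primary mode")
        return "secondary"
    raise ValueError(f"unknown command: {command}")


print(apply_role_command("secondary", "PRIMARY", primary_already_active=False))
print(apply_role_command("primary", "SECONDARY", primary_already_active=False))
```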
F <ubroker>,APPL=SHUTDOWN, [ FAILOVER [ ,<sysname> ] | NOFAILOVER ]
When issued against a Secondary agent | This command behaves like the z/OS STOP command (P <ubroker>). |
When issued against a Primary agent | This command shuts down the Broker (and agent) while controlling the Sysplex failover behaviour: |
When issued without the FAILOVER or NOFAILOVER parameter | Failover will behave as configured by the automatic_failover parameter in UAGCFG00. |
When FAILOVER is specified | An available Secondary agent will take over as Primary, regardless of how failover is configured. When the optional < |
When NOFAILOVER is specified | No Secondary agent will take over as Primary, regardless of how failover is configured. |
Note
Behaviour of the z/OS STOP console command with failover is identical to the F <ubroker>,APPL=SHUTDOWN command with no other parameters.
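The interaction between the SHUTDOWN operands and the configured failover behaviour can be expressed as a small decision function. This is a sketch only: the function `takeover_on_shutdown` is hypothetical, the boolean `configured_failover` stands in for whatever the automatic_failover setting resolves to at shutdown time, and the optional `<sysname>` operand is not modelled.

```python
from typing import Optional


def takeover_on_shutdown(role: str, option: Optional[str],
                         configured_failover: bool,
                         secondary_available: bool) -> bool:
    """Return True if a Secondary is expected to take over as Primary."""
    if role != "primary":
        # Against a Secondary the command behaves like STOP; the Primary
        # role does not change hands.
        return False
    if option == "FAILOVER":
        return secondary_available       # overrides the configuration
    if option == "NOFAILOVER":
        return False                     # overrides the configuration
    if option is None:
        # No operand (or the z/OS STOP command): follow the configuration.
        return configured_failover and secondary_available
    raise ValueError(f"unknown option: {option}")


print(takeover_on_shutdown("primary", "FAILOVER", False, True))   # True
print(takeover_on_shutdown("primary", "NOFAILOVER", True, True))  # False
print(takeover_on_shutdown("primary", None, True, True))          # True
```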