Operator Messages Manual

Chapter 14 CMN (Cluster Connectivity Monitor Process) Messages

The messages in this chapter are generated by the CCMON subsystem. They are further classified into IB Cluster Connectivity and CCMON subsystem events, and CCMON process events. The subsystem ID displayed by these messages includes CMN as the subsystem name.

NOTE: Negative-numbered messages are common to most subsystems. If you receive a negative-numbered message that is not described in this chapter, see Chapter 15.


–3

The CCMON subsystem state has changed from old-state to state due to state-change.

old-state

identifies the previous summary state of the subsystem.

state

identifies the current summary state of the subsystem.

state-change

indicates the reason for the state change.

Cause  The cause is given in the state-change token and is normally an operator command.

Effect  None.

Recovery  Informational message only; no corrective action is needed.



–1

Process process: switched from processor old-cpu to processor new-cpu due to cpu-switch-cause.

process

is the name of the Cluster Connectivity monitor process.

old-cpu

is the previous primary processor.

new-cpu

is the current primary processor.

cpu-switch-cause

indicates the cause of the switch. The possible values for this token are operator-value-required (operator-initiated command) and value-tookover (the backup took over).

Cause  The previous backup monitor process has become the primary, due either to an operator PRIMARY (SWITCH) command or to takeover in response to the failure of the primary process or its processor.

Effect  The monitor primary process is running in the processor specified in new-cpu.

Recovery  Informational message only; no corrective action is needed.



1001

The Cluster Connectivity subsystem monitor process, process, has started in processor cpunum.

Program file: required-programfile

Priority: priority

VPROC: vproc

Autorestart count: restart-count

Processor list: (cpunum00{, cpunum01, cpunum02,... cpunum15})

process

is the name of the Cluster Connectivity monitor process.

cpunum

is the processor number where the CCMON primary is running.

required-programfile

is the CCMON program file.

vproc

is the CCMON VPROC.

priority

is the priority at which the CCMON process is running.

restart-count

is the autorestart count configured for the process.

cpunumnn

is the processor list configured for the process. These are optional tokens. Only tokens corresponding to processors in CCMON’s processor list are included in the event.

Cause  The CCMON process was started either by the operator or automatically by the Persistence Manager ($ZPM) after a system load. After a failure of both the primary and backup CCMON processes, the Persistence Manager will restart CCMON. CCMON has no means of distinguishing these two cases.

Effect  The Cluster Connectivity monitor process is running.

Recovery  Informational message only; no corrective action is needed.



1002

The Cluster Connectivity subsystem monitor process, process, has terminated.

Reason: term-reason.

process

is the name of the Cluster Connectivity monitor process.

term-reason

is an enumeration of the cause of the termination.

Cause  The Cluster Connectivity monitor process terminated voluntarily. Either CCMON was terminated by an operator command, or an environmental problem caused CCMON to self-terminate. If this event is due to self-termination, there will be a ccmon-information (1010) event reporting the environmental problem found by CCMON.

Effect  The Cluster Connectivity monitor process is no longer running.

Recovery  If this event is due to self-termination, follow recovery instructions for the ccmon-information (1010) event. After correcting any environmental problems found by CCMON, the process must be restarted with an operator command.



1003

Process process: Primary processor: cpunum.

process

is the name of the Cluster Connectivity monitor process.

cpunum

is the number of the processor in which the primary process is running.

Cause  The Cluster Connectivity monitor process terminated voluntarily. Either CCMON was terminated by an operator command, or an environmental problem caused CCMON to self-terminate. If this event is due to self-termination, there will be a ccmon-information (1010) event reporting the environmental problem found by CCMON.

Effect  The CCMON primary is now running in the indicated processor.

Recovery  Informational message only; no corrective action is needed.



1004

Process process: backup process is created in processor cpunum.

process

is the name of the Cluster Connectivity monitor process.

cpunum

is the number of the processor in which the backup process was created.

Cause  CCMON has successfully created a backup process.

Effect  CCMON is no longer vulnerable to a single failure.

Recovery  Informational message only; no corrective action is needed.



1005

Process process: Unable to create backup in processor cpunum.

Process creation error: create-error

Error detail: error-detail

process

is the name of the Cluster Connectivity monitor process.

cpunum

is the processor in which the process creation attempt was made.

create-error

is the main process creation error number.

error-detail

is the process creation error detail value.

Cause  An attempt to create the backup process has failed. create-error and error-detail are the standard NSK process creation error values that describe the failure.

Effect  Until a backup process has been started, the Cluster Connectivity monitor process will be vulnerable to a single failure. CCMON will attempt to start a backup process immediately if any processor in CCMON’s processor list other than that used by the primary process is running. CCMON makes two restart attempts in each processor eligible to contain the CCMON backup process; each failed attempt results in a cannot-start-backup (1005) event. If all of these attempts fail, the no-backup (1007) event is generated.

Recovery  Informational message only; no corrective action is needed, but the data in this message provides information for recovery in the event of no-backup (1007).
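The retry behavior described in the Effect text above can be pictured with a small simulation. This is a minimal sketch in Python, not CCMON's implementation: the function `start_backup`, the `try_create_backup` hook, the event tuples, and the order in which processors are tried are all assumptions made for illustration. Only the rules themselves come from this chapter: two attempts in each eligible processor, one 1005 event per failed attempt, a 1004 event on success, and a 1007 event when every attempt fails or no processor is eligible.

```python
from typing import Callable, List, Set, Tuple

ATTEMPTS_PER_CPU = 2  # from the Effect text: two attempts per eligible processor


def start_backup(
    cpu_list: List[int],
    primary_cpu: int,
    running: Set[int],
    try_create_backup: Callable[[int], bool],  # hypothetical creation hook
) -> List[Tuple[int, int]]:
    """Return the (event_number, cpu) pairs that one round of attempts would log.

    cpu is -1 for the 1007 event, which names no processor.
    """
    events: List[Tuple[int, int]] = []
    # Eligible: in the processor list, running, and not the primary's processor.
    eligible = [c for c in cpu_list if c != primary_cpu and c in running]
    for cpu in eligible:
        for _ in range(ATTEMPTS_PER_CPU):
            if try_create_backup(cpu):
                events.append((1004, cpu))  # backup created
                return events
            events.append((1005, cpu))      # cannot-start-backup
    events.append((1007, -1))               # no-backup
    return events
```

With a processor list of 0 through 3, the primary in processor 0, and a creation hook that always fails, the sketch yields six 1005 events followed by one 1007 event.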



1006

Process process: backup process in processor cpunum failed.

process

is the name of the Cluster Connectivity monitor process.

cpunum

is the processor in which the backup process failed.

Cause  The CCMON backup process in the indicated processor has failed.

Effect  Until a backup process has been started, the Cluster Connectivity monitor process will be vulnerable to a single failure. CCMON will attempt to start a backup process immediately if any processor in CCMON’s processor list other than that used by the primary process is running. CCMON makes two restart attempts in each processor eligible to contain the CCMON backup process; each failed attempt results in a cannot-start-backup (1005) event. If all of these attempts fail, the no-backup (1007) event is generated.

Recovery  Informational message only; no corrective action is needed, but the data in this message provides information for recovery in the event of no-backup (1007).



1007

The Cluster Connectivity subsystem monitor, process, is running without a backup.

Reason: no-backup-reason.

process

is the name of the Cluster Connectivity monitor process.

no-backup-reason

is an enumeration that gives the reason for running without a backup as being one of no backup processor or excessive backup process failures.

Cause  Either there is no processor available for running the backup process, or there have been multiple failures of the backup process or of attempts to create a backup. If this event is due to backup process creation failures, there will have been cannot-start-backup (1005) events. Other possible precursors are ccmon-backupfail (1006) and backup-termination (1008).

Effect  CCMON will run without a backup, and the Cluster Connectivity subsystem will be vulnerable to a single failure. Whenever a processor in CCMON’s processor list is reloaded CCMON will attempt to create a backup there. If this event is caused by repeated backup failures or backup process creation failures, CCMON will periodically attempt to create a backup.

Recovery  Either reload processors on CCMON’s processor list, or use the information in the associated cannot-start-backup (1005) events to determine the cause of the backup process creation failures.



1008

The Cluster Connectivity subsystem, process backup process has terminated.

Reason: backup-terminate-reason.

process

is the name of the Cluster Connectivity monitor process.

backup-terminate-reason

is an enumeration that defines the reason for the termination as being one of kill by primary or processor failure.

Cause  The backup's processor failed, or the backup process was terminated by the primary (for example, if the primary found a fatal error when checkpointing to the backup).

Effect  Until a new backup process has been started, the Cluster Connectivity monitor process will be vulnerable to a single failure. CCMON will attempt to start a new backup process immediately if any processor in CCMON’s processor list other than that used by the primary process is running. CCMON makes two restart attempts in each processor eligible to contain the CCMON backup process; each failed attempt results in a cannot-start-backup (1005) event. If all of these attempts fail, the no-backup (1007) event is generated.

Recovery  Informational message only; no corrective action is needed, but the data in this message provides information for recovery in the event of no-backup (1007).



1009

Cluster Connectivity subsystem or Message System Monitor trace entry trace-entry.

trace-entry

is a 64-character string containing an internally defined trace record. This token is displayed in hexadecimal format in event viewers.

Cause  These records are generated when an internal trace is enabled.

Effect  This is an informational message. The Cluster Connectivity subsystem state is unchanged.

Recovery  Informational message only; no corrective action is needed.



1010

The Cluster Connectivity subsystem monitor process, process, reports ccmon-info.

process

is the name of the Cluster Connectivity monitor process.

ccmon-info

is an enumeration containing the cause of the environmental problem.

Cause  CCMON found an environmental problem, such as an invalid processor list in the startup message or a wrong process name (not $ZZCMN). This event will usually be followed by a CCMON termination event.

Effect  This is an informational message, but for certain environmental problems the Cluster Connectivity monitor process will terminate.

Recovery  Corrective action may be required to fix one or more environmental problems:

  • wrong-processor-name — alter CCMON's startup parameters (generic process configuration under SCF.)

  • bad-cpu-list — alter CCMON's startup parameters (generic process configuration under SCF.)

  • wrong-cpu — this error is possible only if CCMON is started manually out of a TACL prompt on a CPU that is not in CCMON's CPU list. In this case, the corrective action is to correct the startup parameters via the regular means (SCF).



1011

The Cluster Connectivity subsystem encountered $ZCNF error error, error detail error-detail, in operation zcnf-action.

error

is the error code returned from the NSK Configuration Services routine.

error-detail

is the error detail code returned from the NSK Configuration Services routine.

zcnf-action

is the action that was performed when the error occurred.

Cause  CCMON encountered an error using an NSK Configuration Services API.

Effect  If this occurs during process startup, CCMON will use STARTSTATE STOPPED (default) instead of using the data stored in the private configuration record. If this occurs at a later stage, the action is prompted by an SCF [ALTER | START | STOP] SUBSYS command. In this case the command fails with an error.

Recovery  Investigate the cause of the error. Restart the process or reissue the failed SCF command.



1012

The Cluster Connectivity subsystem encountered DSM trace error error, error detail error-detail, in operation trace-action.

error

is the error code returned from the DSM trace routine.

error-detail

is the error detail code returned from the DSM trace routine.

trace-action

is the action that was performed when the error occurred.

Cause  CCMON encountered an error using a DSM trace routine.

Effect  Any pending trace is terminated.

Recovery  Investigate the cause of the error. Reissue the SCF TRACE command.



1015

Message System Monitor Process, process, reports msgmon-information.

process

is the name of the Message System monitor process.

msgmon-information

is an enumeration containing the cause of the environmental problem.

Cause  A MSGMON process found an environmental problem, such as a wrong process name (not $ZIMnn). This event will usually be followed by a MSGMON termination event.

Effect  For certain environmental problems the MSGMON process will terminate. The Persistence Manager ($ZPM) will perform periodic attempts to restart the MSGMON process in this case.

Recovery  Corrective action may be required to fix one or more environmental problems:

  • wrong-processor-name — alter MSGMON’s startup parameters (generic process configuration under SCF.)



1016

Process process is not compatible with the current version of the Kernel message system.

process

is the name of the Cluster Connectivity monitor process or the Message System monitor process.

Cause  The process (CCMON or MSGMON) made a comparison of its own version and that of the Kernel Message System. The versions were determined to be incompatible and this event will be followed by a process termination event.

Effect  The process (CCMON or MSGMON) terminates.

Recovery  Check SPR requisites for NSK (T9050) and CCMON, MSGMON (T0942). The operator needs to ensure that the versions of T0942 and T9050 are compatible.



1100

The direct connection between a local and a remote processor has become unusable. The direct connection from processor local-cpu to processor remote-cpu in node remote-node, system name sysname has become unusable.

local-cpu

is the number of the local processor for which connectivity is lost.

remote-cpu

is the number of the remote processor for which connectivity is lost.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

Cause  Both the X and Y paths from the local processor to the indicated remote processor are down. This event is preceded by one or more path-down events generated by the local processor. If the remote processor has failed, remote-cpu-down (1102) is generated, and direct-connection-down (1100) is suppressed in that case if CCMON receives the remote processor down information in time to do so.

Effect  All inter-system IB traffic between the indicated local and remote processors is routed via Expand-over-IB line-handler processes if the direct-connection-down (1100) event is not due to a failure of the remote processor. Consequently, transmission is slower and consumes additional processor cycles when Expand-over-IB line-handler processes must relay messages between the local and remote processors.

Recovery  Informational message only. Recovery information is provided by the path-down or remote-cpu-down (1102) events.
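The relationship between path states and the direct-connection events can be summarized in a few lines of Python. This is an illustrative sketch only: the function name, its boolean parameters, and the idea of computing the event from a previous and a current state are assumptions, not a description of CCMON internals. The rules encoded are the ones stated in this chapter: the connection is unusable only when both the X and Y paths are down, 1100 is suppressed when the remote processor is already known to have failed, and 1101 reports restoration of one or both paths. The sketch ignores CCMON initialization, during which 1101 is not emitted.

```python
from typing import Optional


def direct_connection_event(
    prev_up: bool,
    x_up: bool,
    y_up: bool,
    remote_cpu_down: bool = False,
) -> Optional[int]:
    """Return 1100, 1101, or None for one local/remote processor pair."""
    now_up = x_up or y_up  # usable while at least one path is up
    if prev_up and not now_up:
        # Suppressed when the remote processor is known to have failed;
        # remote-cpu-down (1102) covers that case.
        return None if remote_cpu_down else 1100
    if not prev_up and now_up:
        return 1101
    return None  # no change in direct connectivity
```

Losing only one of the two paths produces no direct-connection event in this sketch, because direct traffic can continue on the remaining path.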



1101

The direct connection from processor local-cpu to processor remote-cpu in node remote-node, system name sysname has been restored.

NOTE: This event is not emitted during CCMON initialization.

local-cpu

is the number of the local processor for which connectivity has been restored.

remote-cpu

is the number of the remote processor for which connectivity has been restored.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

Cause  One or both of the paths between the indicated processors have been restored, as documented by the path-up event generated by the local processor.

Effect  Direct IB communication between the processors becomes possible.

Recovery  Informational message only; no corrective action is needed.



1102

Processor remote-cpu in node remote-node, system name sysname has failed.

remote-cpu

is the number of the remote processor that has failed.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

Cause  A remote processor failed.

Effect  IB paths between all local processors and the remote processor are taken down. The local processors suppress path-down event logging in this case. It is possible that local processors have already detected path failures and logged path-down events before being informed of the remote processor’s failure. The particular sequence of events depends upon the speed with which CCMON is informed of the remote processor’s failure and the levels of message traffic from the local system to the failed remote processor.

Recovery  Reload the remote processor.



1103

Processor remote-cpu in node remote-node, system name sysname has been reloaded.

Direct connectivity to that processor has been restored.

remote-cpu

is the number of the remote processor that has been reloaded.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

Cause  A remote processor has been reloaded and its connections with this system have been restored.

Effect  Direct IB message traffic with the indicated processor can resume.

Recovery  Informational message only; no corrective action is needed.



1104

Local processor local-cpu was reloaded.

Direct connections to remote systems have been restored.

local-cpu

is the number of the local processor that was reloaded.

Cause  A local processor has been reloaded and its connections have been restored.

Effect  Direct IB communications between the indicated processor and all other nodes in the IB Cluster are once again possible.

Recovery  Informational message only; no corrective action is needed.



1105

Processor remote-cpu in node remote-node, system name sysname, has lost connectivity to the fabric-id.

remote-cpu

is the number of the remote processor that has lost its IB connectivity.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

fabric-id

is the fabric to which the indicated processor has lost connectivity.

Cause  A remote processor has detected that its connection to the indicated IB fabric has failed. The indicated processor has logged an NCSL fabric down event on its local system.

Effect  IB paths on the indicated fabric between all local processors and the remote processor are taken down, although direct IB message traffic is still possible by using the other fabric. When the local processors take down the paths to the remote processor on the indicated fabric, they suppress path-down logging. It is possible that local processors have already detected path failures and logged path-down events before being informed of the remote processor’s fabric failure. The particular sequence of events depends upon the speed with which CCMON is informed of the remote processor’s fabric failure and the levels of message traffic from the local system to the remote processor.

Recovery  Informational message only; NCSL fabric down event provides the recovery information.



1106

Processor remote-cpu in node remote-node, system name sysname has regained fabric-id connectivity.

remote-cpu

is the number of the remote processor that has had its fabric connection restored.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

fabric-id

is the fabric to which the indicated processor has regained connectivity.

Cause  A remote processor has regained connectivity to the indicated IB fabric.

Effect  Paths between the local system and the indicated remote processor on the indicated fabric have been restored.

Recovery  Informational message only; no corrective action is needed.


 
1107

Connectivity over the fabric-id to node remote-node, system name sysname has been lost.

fabric-id

indicates which fabric’s connectivity has been lost.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

Cause  All individual processor-processor paths over the indicated fabric between the reporting node and the indicated remote node have failed. These failures have been documented by path-down events generated by the individual processors. If distinct fabric-disconnect (1107) events were logged for the X and Y fabrics for the same remote node, then it is possible that the remote node itself has failed or lost power.

Effect  All communication with the remote node over the given fabric ceases.

Recovery  If only one fabric is involved, the condition of the intervening IB cables and IB switches between the local and remote nodes along the specified fabric should be investigated. If both fabrics are involved, the condition of the remote system itself should be investigated. In either case, the associated path-down events may provide further recovery information.
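The one-fabric versus both-fabrics distinction in the Recovery text above lends itself to a simple triage over collected 1107 events. The following Python sketch is hypothetical: the function name, the input shape (pairs of remote node number and fabric ID), and the literal fabric IDs "X" and "Y" are assumptions, since this chapter does not specify how the fabric-id token is rendered.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def triage_fabric_disconnects(
    events: Iterable[Tuple[int, str]],
) -> Dict[int, str]:
    """Map each remote node to a first place to look.

    events holds (remote_node, fabric_id) pairs taken from 1107 events.
    """
    lost: Dict[int, Set[str]] = defaultdict(set)
    for node, fabric in events:
        lost[node].add(fabric)

    advice: Dict[int, str] = {}
    for node, fabrics in lost.items():
        if {"X", "Y"} <= fabrics:
            # Both fabrics lost: the remote node may have failed or lost power.
            advice[node] = "investigate the remote system"
        else:
            advice[node] = (
                "investigate IB cables and switches on fabric "
                + ", ".join(sorted(fabrics))
            )
    return advice
```

In either case the associated path-down events remain the primary source of recovery detail; the sketch only decides where to start.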



1108

Direct connectivity to node remote-node, system name sysname has been lost due to cause.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

cause

indicates the probable cause of the disconnection.

Cause  This is indicated by the cause token. The CCMON subsystem may have been stopped by an operator in the remote system, the remote system could have failed or lost power, there could be cable and/or switch failures on both IB fabrics, or a duplicate Expand node number may have been detected.

Effect  All direct IB connections with the remote system are shut down.

Recovery  Depending upon the reason indicated by cause, bring up the failed processors, start the CCMON subsystem, repair IB fabric failures, or reconfigure the Expand node number of one of the conflicting nodes if a duplicate node number was reported.



1109

A direct connection with node remote-node, system name sysname has been initialized. {If connect-warning is present} WITH WARNINGS.

remote-node

is the Expand node number of the remote system.

sysname

is the name of the remote system.

connect-warning

is a conditional token that when present informs that there were warnings on connection. This token will have a value of true-value if included in the event. The STATUS SUBNET command can be used to verify which specific processors may not have had their connections to the remote system initialized, in case there are warnings.

Cause  A connection with a remote system has been established. This could be the initial connection due to the starting of Cluster Connectivity services on either of the systems, or it could be the recovery of a failed connection.

Effect  Direct Message System traffic over IB between the two systems becomes possible.

Recovery  Informational message only; no corrective action is needed.



1111

No systems were discovered for direct connectivity.

Cause  This system is the first in the IB cluster to be started, or a connectivity failure between this system and the IB cluster fabrics occurred.

Effect  The CCMON subsystem attains the STARTED state, but there is no IB connectivity with other systems.

Recovery  If this system is the first in the IB cluster to be started, no corrective action is required. Otherwise, repair the IB connectivity problems.



1112

The Cluster Connectivity monitor process received error api-error1, error detail api-error-detail, from NCSL or RDMA Services routine api-error2.

Optionally displayed when value1 is present is: Remote node: nodenum.

Optionally displayed when value2 is present is: Remote processor: cpunum.

api-error1

is the error from the NCSL or RDMA Services routine which returned an error.

api-error-detail

is the detail of the error from the NCSL or RDMA Services routine which returned an error.

api-error2

is the NCSL or RDMA Services routine which returned an error.

nodenum

when displayed is the Expand node number that encountered the error.

cpunum

when displayed is the CPU number that encountered the error on the aforementioned Expand node number.

Cause  The CCMON process in the reporting processor received an error from an NCSL or RDMA Services routine.

Effect  The CCMON process will perform periodic attempts to recover from this error.

Recovery  Informational message only; no corrective action is needed.



1113

Discovery of remote node remote-node failed due to fail-reason.

Details of the failed discovery attempt:

IF TOKEN PRESENT: Protocol stage: protocol-stage

IF TOKEN PRESENT: Selected target CPUs: processor-mask

IF TOKEN PRESENT: Sender protocol version: sender-protocol

IF TOKEN PRESENT: Target protocol version: target-protocol

IF TOKEN PRESENT: Sender minimum protocol version: sender-minimum-protocol

IF TOKEN PRESENT: Target minimum protocol version: target-minimum-protocol

IF TOKEN PRESENT: Target processor number: cpunum

IF TOKEN PRESENT: Node instantiation error: instantiation-error

remote-node

the target node to which discovery failed.

fail-reason

the reason why discovery failed in the sender node. The possible reasons are: protocol version error, expected discovery response packet not received, duplicate node number, failure to instantiate remote node, etc.

protocol-stage

is the stage of the discovery protocol.

processor-mask

a 16-bit binary mask depicting the processors selected by the sender CCMON as discovery targets.

sender-protocol

the discovery protocol version of the sender CCMON.

sender-minimum-protocol

the minimum discovery protocol interpretation version of the sender CCMON.

target-protocol

the discovery protocol version of the target CCMON.

target-minimum-protocol

the minimum discovery protocol interpretation version of the target CCMON.

cpunum

is the CPU number.

instantiation-error

the type of error encountered by the sender CCMON processor when instantiating the target node. The possible error types are memory allocation, etc.

send-error

the most recent IB error detected on a discovery packet transfer to processor nn. This token is emitted if discovery failed due to IB send errors and processor nn was selected by CCMON as a discovery target.

Cause  The sender CCMON could not discover the target node.

Effect  An IB Cluster connection to the target node will not be established. The nodes will perform periodic retries to discover each other.

Recovery  The recovery procedure varies depending on the reason given in the discovery-sender-fail (1113) event. If the reason is IB send errors, verify IB connectivity. If the reason is duplicate node number, verify the Expand node number configuration. If the reason is protocol version error, verify the CCMON versions. If the reason is failure to instantiate remote node, a shortage of resources such as memory in the sender CCMON processor could be the cause of the discovery failure. If the reason is expected discovery response packet not received, perform recovery as outlined above, except that the reason is provided by a matching discovery-target-fail event in the target node.
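When reading the processor-mask token of this event, it can help to expand the 16-bit mask into a list of processor numbers. The Python sketch below is an illustration under a stated assumption: this chapter says only that the mask is a 16-bit binary mask, so the bit-to-processor ordering is not documented here. The hypothetical parameter `msb_is_cpu0` lets the reader pick either ordering, and its default (most significant bit is processor 0) is a guess that should be checked against a known event.

```python
from typing import List


def decode_processor_mask(mask: int, msb_is_cpu0: bool = True) -> List[int]:
    """Expand a 16-bit processor mask into a sorted list of processor numbers.

    msb_is_cpu0 is an assumption about bit ordering, not documented behavior.
    """
    if not 0 <= mask <= 0xFFFF:
        raise ValueError("processor-mask is a 16-bit value")
    cpus = []
    for cpu in range(16):
        bit = 15 - cpu if msb_is_cpu0 else cpu
        if mask & (1 << bit):
            cpus.append(cpu)
    return cpus
```

For example, under the default ordering a mask of binary 1010000000000000 decodes to processors 0 and 2; with `msb_is_cpu0=False` the same processors correspond to binary 0000000000000101.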



1114

Discovery started by remote node remote-node failed due to fail-reason

Details of the failed discovery attempt:

IF TOKEN PRESENT: Protocol stage: protocol-stage

IF TOKEN PRESENT: Sender protocol version: sender-protocol

IF TOKEN PRESENT: Target protocol version: target-protocol

IF TOKEN PRESENT: Sender minimum protocol version: sender-minimum-protocol

IF TOKEN PRESENT: Target minimum protocol version: target-minimum-protocol

IF TOKEN PRESENT: Sender CPU number: cpunum

remote-node

the node which started the failed discovery attempt.

fail-reason

the reason why discovery failed in the target node. The possible reasons are: protocol version error, expected CR packet not received, IB send errors, duplicate node number, failure to instantiate remote node, and non-existing CCMON process.

sender-protocol

the discovery protocol version of the sender CCMON.

sender-minimum-protocol

the minimum discovery protocol interpretation version of the sender CCMON.

target-protocol

the discovery protocol version of the target CCMON.

target-minimum-protocol

the minimum discovery protocol interpretation version of the target CCMON.

cpunum

the sender CCMON processor number. This is conveyed in the discovery packet payload.

send-error

the most recent IB error detected on a discovery response packet transfer to the sender CCMON processor. This token is emitted if discovery failed on the target node due to IB send errors.

Cause  The target node could not be discovered by the sender CCMON.

Effect  An IB Cluster connection to the sender node will not be established. The nodes will perform periodic retries to discover each other.

Recovery  The recovery procedure varies depending on the reason given in the discovery-target-fail (1114) event. If the reason is “IB send errors”, verify IB connectivity. If the reason is “duplicate node number”, verify the Expand node number configuration. If the reason is “protocol version error”, verify the CCMON versions. If the reason is “non-existing CCMON process”, start CCMON in the target node. If the reason is “CCMON subsystem is in STOPPED state”, start the CCMON subsystem. If the reason is “failure to instantiate remote node”, a shortage of resources such as memory in the target CCMON processor could be the cause of the discovery failure. If the reason is “expected CR packet not received”, verify IB connectivity and the presence of a running CCMON process in the sender node.

A matching discovery-sender-fail event will often be available in the logs of the sender node. This matching event may provide additional information that will assist with recovery.

The nodes will perform periodic retries to discover each other, regardless of the reason given in the discovery-sender-fail and discovery-target-fail events. Discovery should succeed automatically if the cause of the failure (for example, an IB connectivity failure or an absent CCMON process) is corrected.
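For operators who script their event handling, the reason-to-action pairs in the Recovery text above can be held in a lookup table. The Python sketch below is hypothetical: the names `RECOVERY_BY_REASON` and `recovery_for` are invented for illustration, and the dictionary keys are the reason strings quoted in this entry. How the fail-reason token is actually rendered in an event viewer is an assumption.

```python
# Keys are the reason strings quoted in the Recovery text; the exact rendering
# of the fail-reason token in an event viewer is an assumption.
RECOVERY_BY_REASON = {
    "IB send errors": "verify IB connectivity",
    "duplicate node number": "verify the Expand node number configuration",
    "protocol version error": "verify the CCMON versions",
    "non-existing CCMON process": "start CCMON in the target node",
    "CCMON subsystem is in STOPPED state": "start the CCMON subsystem",
    "failure to instantiate remote node": (
        "check for a shortage of resources such as memory "
        "in the target CCMON processor"
    ),
    "expected CR packet not received": (
        "verify IB connectivity and a running CCMON process in the sender node"
    ),
}

# Fallback for reasons not listed above.
DEFAULT_ACTION = "look for a matching discovery-sender-fail event on the sender node"


def recovery_for(fail_reason: str) -> str:
    """Return the suggested first action for a 1114 fail-reason string."""
    return RECOVERY_BY_REASON.get(fail_reason.strip(), DEFAULT_ACTION)
```

Because the nodes retry discovery periodically, any action returned here only needs to remove the underlying cause; no explicit reconnect step is modeled.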



1115

The Cluster Connectivity subsystem monitor process, process, has terminated. Reason: term-reason.

process

is the name of the MSGMON monitor process.

term-reason

is an enumeration of the cause of the termination.

Cause  The Message Monitor (MSGMON) process terminated voluntarily. Either MSGMON was terminated by an operator command or an environmental problem caused MSGMON to self-terminate. If this event is due to self-termination, there will have been a msgmon-information (1015) event reporting the environmental problem found by MSGMON.

Effect  The Message Monitor (MSGMON) process is no longer running.

Recovery  If this event is due to self-termination, follow recovery instructions for the msgmon-information (1015) event. After correcting any environmental problems found by MSGMON, the process must be restarted with an operator command.



1116

Connectivity over the fabric-id to node remote-node, system name sysname has been established.

fabric-id

indicates which fabric’s connectivity has been established.

remote-node

is the Expand node number of the remote system.

sysname

is the Expand name of the remote system.

Cause  There is at least one working path on the specified fabric from the local node to the specified remote system now, when previously there were none. The path state changes are documented by IPC 111 events generated by the individual processors.

Effect  There is at least partial connectivity with the remote system over the given fabric.

Recovery  Informational message only; no corrective action is needed.