Chapter 57 NCSL (NonStop Connection Services Library) Messages

The messages in this chapter are generated by the NCSL subsystem, the NonStop Connection Services Library. The subsystem ID displayed by these messages includes NCSL as the subsystem name.

1001

NCSL failed to allocate resource; CPU:cpunum; PIN:pin

Resource Type: restype [threshold exceeded]

`cpunum`	The processor that is reporting this event.
`pin`	The PIN of the process that reported this event.
`restype`	This specifies the type of resource that NCSL was unable to allocate. The resource types and the corresponding Cause, Effect, and Recovery are listed below.
threshold exceeded	When the Severity of the event is Critical, the string “threshold exceeded” appears in the indicated location in the event text. If the Severity is not Critical, this text is not displayed.

restype	Cause
zncs-enm-restype-kcalloc-mem zncs-enm-restype-fpool-mem zncs-enm-restype-cli-mem	NCSL was unable to allocate memory.
zncs-enm-restype-qp	NCSL was unable to allocate an InfiniBand Queue Pair.
zncs-enm-restype-pd	NCSL was unable to allocate an InfiniBand Protection Domain.
zncs-enm-restype-cq	NCSL was unable to allocate an InfiniBand Completion Queue.
zncs-enm-restype-mr	NCSL was unable to allocate an InfiniBand Memory Region.
zncs-enm-restype-port zncs-enm-restype-dev zncs-enm-restype-hca	NCSL was unable to access the InfiniBand Host Channel Adapter.

Cause See the cause for the corresponding restype in the above table.

Effect One or more subsystems within the NonStop OS may hang.

Recovery If the Severity is Warning, NCSL will retry automatically and no operator-initiated recovery action is required. If the Severity is Critical, the recovery action is to reload the processor and contact HP Support.
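
The restype table and the Recovery text above amount to a lookup plus a check of the Severity. The following Python sketch is illustrative only: `RESTYPE_CAUSE` and `triage_1001` are hypothetical names, not part of NCSL or of any HP tool, and the Severity is assumed to arrive as the plain word Warning or Critical.

```python
# Hypothetical triage helper for event 1001; all names are illustrative only.
_MEMORY = "NCSL was unable to allocate memory."
_HCA = "NCSL was unable to access the InfiniBand Host Channel Adapter."

# restype values and causes copied from the table for event 1001.
RESTYPE_CAUSE = {
    "zncs-enm-restype-kcalloc-mem": _MEMORY,
    "zncs-enm-restype-fpool-mem": _MEMORY,
    "zncs-enm-restype-cli-mem": _MEMORY,
    "zncs-enm-restype-qp": "NCSL was unable to allocate an InfiniBand Queue Pair.",
    "zncs-enm-restype-pd": "NCSL was unable to allocate an InfiniBand Protection Domain.",
    "zncs-enm-restype-cq": "NCSL was unable to allocate an InfiniBand Completion Queue.",
    "zncs-enm-restype-mr": "NCSL was unable to allocate an InfiniBand Memory Region.",
    "zncs-enm-restype-port": _HCA,
    "zncs-enm-restype-dev": _HCA,
    "zncs-enm-restype-hca": _HCA,
}


def triage_1001(restype, severity):
    """Return (cause, recovery) for an event 1001 with the given fields."""
    cause = RESTYPE_CAUSE.get(restype, "Unknown resource type.")
    if severity == "Critical":
        recovery = "Reload the processor and contact HP Support."
    elif severity == "Warning":
        recovery = "None; NCSL retries automatically."
    else:
        # Other severities are not described for this event.
        recovery = "Not described in this chapter."
    return cause, recovery
```

For example, `triage_1001("zncs-enm-restype-qp", "Critical")` pairs the Queue Pair cause with the reload-and-contact-support action.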

1002

Bad NCSL protocol message received; CPU:cpunum

Fabric ID:fabid; Message Type: msgtype

Bad Field Name: fname

Event Detail: text

`cpunum`	The processor that is reporting this event.
`fabid`	This identifies the InfiniBand Fabric ID of the source of the message. The format is an 8-digit hexadecimal number, for example, 0x0A000201.
`msgtype`	This specifies the type of NCSL protocol message that was received. The valid message types and the corresponding display strings are:
zncs-enm-msgtype-control-req - Control connection request
zncs-enm-msgtype-control-rep - Control connection reply
zncs-enm-msgtype-conn-req - Virtual QP Group connection request
zncs-enm-msgtype-conn-rep - Virtual QP Group connection reply
zncs-enm-msgtype-disc-req - Virtual QP Group disconnect request
zncs-enm-msgtype-disc-rep - Virtual QP Group disconnect reply
zncs-enm-msgtype-switch-req - Path switch request
zncs-enm-msgtype-switch-rep - Path switch reply
zncs-enm-msgtype-unknown - Unknown
`fname`	This specifies which field within the received NCSL protocol message was invalid. The field identifiers and their corresponding display strings are:
zncs-enm-fname-size - Message size
zncs-enm-fname-reserved - Reserved
zncs-enm-fname-pdata-len - Private Data length
zncs-enm-fname-connid - Connection ID
zncs-enm-fname-version - Protocol version
zncs-enm-fname-num-vqps - Number of Virtual Queue Pairs
zncs-enm-fname-priority - Virtual Queue Pair Priority
zncs-enm-fname-fabric - Virtual Queue Pair Fabric Affinity
zncs-enm-fname-conn-status - Connection reply status
zncs-enm-fname-vqp-port - Virtual Queue Pair port
zncs-enm-fname-disc-reason - Disconnect reason
zncs-enm-fname-switch-reason - Switch reason
zncs-enm-fname-tid - Path switch transaction identifier
zncs-enm-fname-unknown - Unknown
`text`	More detailed information describing the problem with the protocol message.

Cause NCSL received a bad or unexpected NCSL protocol message from the remote source identified in the event. The most likely cause is a mismatch between the NCSL protocol in use by the remote source and the local NCSL implementation. Other possible causes are unexpected (but legal) protocol interactions, and possibly unreported data corruption.

Effect The invalid protocol message is discarded. Depending on which protocol message was discarded, the effect can range from no effect at all, to a pause in the connection establishment or Path switching process, to an OS subsystem outage or a processor failing to come up after a cold load or reload. If an outage or failure of this kind occurs, one or more events of greater Severity will likely be emitted by the affected subsystem.

Recovery Ensure that the source and destination are running the same NCSL protocol version. If they are, and the system is repeatedly experiencing these errors, contact HP Support. If there is a subsystem outage or the processor is failing to come up, reload the processor and contact HP Support.
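
When events of this kind are collected for analysis, the fields of event 1002 can be pulled out of the event text with a regular expression. The following Python sketch assumes that the text is rendered exactly as in the template above, with the display strings in place of `msgtype` and `fname`; the actual layout produced by EMS may differ. `PATTERN_1002`, `parse_1002`, and the sample text are hypothetical.

```python
import re

# Assumed layout: the four template lines shown for event 1002.
PATTERN_1002 = re.compile(
    r"Bad NCSL protocol message received; CPU:\s*(?P<cpunum>\d+)\s*"
    r"Fabric ID:\s*(?P<fabid>0x[0-9A-Fa-f]{8}); Message Type:\s*(?P<msgtype>[^\n]+?)\s*"
    r"Bad Field Name:\s*(?P<fname>[^\n]+?)\s*"
    r"Event Detail:\s*(?P<text>.*)",
    re.DOTALL,
)


def parse_1002(event_text):
    """Return the fields of an event 1002 as a dict, or None if no match."""
    match = PATTERN_1002.search(event_text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["cpunum"] = int(fields["cpunum"])
    # The Fabric ID is documented as an 8-digit hexadecimal number.
    fields["fabid"] = int(fields["fabid"], 16)
    fields["text"] = fields["text"].strip()
    return fields


# Hypothetical sample text, invented for illustration only.
SAMPLE = (
    "Bad NCSL protocol message received; CPU:2\n"
    "Fabric ID:0x0A000201; Message Type: Path switch request\n"
    "Bad Field Name: Protocol version\n"
    "Event Detail: hypothetical detail text\n"
)
```

Calling `parse_1002(SAMPLE)` yields the processor number, the Fabric ID as an integer, and the two display strings, which makes it easy to group repeated events by remote source when checking for a protocol version mismatch.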

1003

RDMA Connection Manager failure; CPU: cpunum

Fabric ID:fabid; Failure status: status

`cpunum`	The processor that is reporting this event.
`fabid`	This identifies the InfiniBand Fabric ID of the remote destination that the processor failed to establish a connection with. The format is an 8-digit hexadecimal number, for example, 0x0A000201.
`status`	This is the failure status reported by the NSK RDMA Connection Manager code for the attempt to establish a Control connection. The possible values along with the cause, effect, and recovery action for each value are given below.

zncs-enm-cm-status-timeout (IB_CM_REJ_TIMEOUT)

Cause The destination did not respond to a connection management request.

Effect A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery

Check to make sure that the CPU or CLIM at the remote destination is powered on with no warning indicators.
Check the ME Configuration database to ensure that the InfiniBand Fabric ID of the remote destination appears within that database.
Check the cabling connections and switches (if any) between the local processor and the remote destination.

If none of this solves the problem, contact HP support.

zncs-enm-cm-status-stale (IB_CM_REJ_STALE_CONN)

Cause The RDMA Connection Manager code detected a Queue Pair that was still in the time wait state.

Effect The process of a processor joining the local node or a CLIM reaching the STARTED state is momentarily delayed.

Recovery The system should recover from this condition automatically. If this problem persists, reload the remote destination and contact HP Support.

zncs-enm-cm-status-sid (IB_CM_REJ_INVALID_SERVICE_ID)

Cause There was no server listening for an incoming connection establishment attempt at the remote destination.

Effect A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery

Check to make sure that the CPU or CLIM at the remote destination is powered on with no warning indicators.
Check the ME Configuration database to ensure that the InfiniBand Fabric ID of the remote destination appears within that database.
Check the cabling connections and switches (if any) between the local processor and the remote destination.

If none of this solves the problem, contact HP support.

zncs-enm-cm-status-gid (IB_CM_REJ_INVALID_GID)

Cause A Global Identifier could not be found in the local Global Identifier cache table.

Effect A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery Reload the processor and contact HP Support.
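
The four status values for event 1003 can be summarized as a lookup from status to reject code and recovery steps. Note that the timeout and sid statuses share the same checklist. The following Python sketch is illustrative only; `CM_STATUS`, `CHECK_REMOTE`, and `recovery_steps` are hypothetical names, and the step wording is condensed from the text above.

```python
# Hypothetical lookup for event 1003; the data layout is an assumption.
CHECK_REMOTE = (
    "Check that the CPU or CLIM at the remote destination is powered on "
    "with no warning indicators.",
    "Check that the InfiniBand Fabric ID of the remote destination appears "
    "in the ME Configuration database.",
    "Check the cabling connections and switches (if any) between the local "
    "processor and the remote destination.",
    "If none of this solves the problem, contact HP Support.",
)

# status -> (reject code shown in parentheses in this chapter, recovery steps)
CM_STATUS = {
    "zncs-enm-cm-status-timeout": ("IB_CM_REJ_TIMEOUT", CHECK_REMOTE),
    "zncs-enm-cm-status-stale": (
        "IB_CM_REJ_STALE_CONN",
        (
            "None; the system should recover automatically.",
            "If the problem persists, reload the remote destination and "
            "contact HP Support.",
        ),
    ),
    "zncs-enm-cm-status-sid": ("IB_CM_REJ_INVALID_SERVICE_ID", CHECK_REMOTE),
    "zncs-enm-cm-status-gid": (
        "IB_CM_REJ_INVALID_GID",
        ("Reload the processor and contact HP Support.",),
    ),
}


def recovery_steps(status):
    """Return (reject_code, list of recovery steps) for an event 1003 status."""
    reject_code, steps = CM_STATUS[status]
    return reject_code, list(steps)
```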

1004

Data transfer operation failure; CPU:cpunum

Fabric ID:fabid; Failure status: status

`cpunum`	The processor that is reporting this event.
`fabid`	This identifies the InfiniBand Fabric ID that the data transfer operation was accessing. The format is an 8-digit hexadecimal number, for example, 0x0A000201.
`status`	This is the data transfer operation failure status reported by the NSK RDMA Services code. The possible values along with the cause, effect, and recovery action for each value are given below.

zncs-enm-dto-status-flushed (IB_WC_WR_FLUSH_ERR)

Cause The connection to the remote destination broke. This can happen as a side effect of the system operator taking a NonStop processor or CLIM out of operation. It can also happen if there are transient or permanent software or hardware failures in the communications path between the local processor and the destination.

Effect If both InfiniBand fabrics between the local processor and the destination are operational, there will be no noticeable effect, but a Path switch will occur and EMS events to that effect will be logged. If this failure occurs when only one InfiniBand fabric is operating and the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM might transition out of the STARTED state.

Recovery If this occurs because the system operator took a processor or CLIM out of operation, no recovery is required. If this happens transiently, the system will recover automatically. If this failure keeps recurring and there are further indications of resource exhaustion failures in the EMS log on NSK or syslog on the CLIM, configure more resources if that is possible. If the failure still keeps recurring, contact HP Support.
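
The Effect described for the flushed status depends on two inputs: how many InfiniBand fabrics are operational between the local processor and the destination, and whether the destination is a NonStop processor or a CLIM. The following Python sketch restates that decision; `flushed_effect` and its parameters are hypothetical names introduced for illustration.

```python
def flushed_effect(operational_fabrics, destination):
    """Return the documented Effect for zncs-enm-dto-status-flushed.

    operational_fabrics: number of InfiniBand fabrics (1 or 2) operating
        between the local processor and the destination.
    destination: "processor" for a NonStop processor, or "CLIM".
    """
    if operational_fabrics not in (1, 2):
        # The chapter only describes the one-fabric and two-fabric cases.
        raise ValueError("operational_fabrics must be 1 or 2")
    if operational_fabrics == 2:
        return ("No noticeable effect; a Path switch occurs and EMS events "
                "to that effect are logged.")
    if destination == "processor":
        return "One or more processors will likely go down."
    if destination == "CLIM":
        return "The CLIM might transition out of the STARTED state."
    raise ValueError("destination must be 'processor' or 'CLIM'")
```

The sketch makes the key point explicit: with both fabrics operational, the failure is absorbed by a Path switch regardless of the destination type.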

zncs-enm-dto-status-bad-resp (IB_WC_WR_BAD_RESP_ERR)

Cause The local InfiniBand Host Channel Adapter hardware received an InfiniBand protocol message with an invalid opcode.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-inv_req (IB_WC_WR_REM_INV_REQ_ERR)

Cause The local InfiniBand Host Channel Adapter hardware received an InfiniBand protocol response message indicating that the local processor sent it an invalid InfiniBand protocol packet.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-access (IB_WC_WR_REM_ACCESS_ERR)

Cause An attempt to probe the connection with the remote destination failed because the destination was present but inaccessible. This can happen if the remote destination has halted.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently, the system will recover automatically. If the remote destination is halted, restart the remote destination. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-op (IB_WC_WR_REM_OP_ERR)

Cause A data transfer sent to the remote destination could not be successfully received.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-timeout (IB_WC_RESP_TIMEOUT_ERR)

zncs-enm-dto-status-retry (IB_WC_RETRY_EXC_ERR)

Cause The local InfiniBand Host Channel Adapter hardware timed out waiting for an acknowledgment for a DTO request that it sent out.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-rnr-retry (IB_WC_RNR_RETRY_EXC_ERR)

Cause The local InfiniBand Host Channel Adapter hardware timed out waiting for the receiving InfiniBand Host Channel Adapter hardware to become ready.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-inv-rd (IB_WC_REM_INV_RD_REQ_ERR)

Cause The remote destination reported an invalid incoming message.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-status-abort (IB_WC_REM_ABORT_ERR)

Cause The current DTO operation was aborted by the InfiniBand Host Channel Adapter hardware. Either the local InfiniBand Host Channel Adapter or the destination InfiniBand Host Channel Adapter might have caused the abort.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

zncs-enm-dto-gen-err (IB_WC_GENERAL_ERROR)

Cause The current DTO operation was aborted by the InfiniBand Host Channel Adapter hardware, for unknown reasons.

Effect If the destination is a NonStop processor one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

1005

NCSL message received out of order; CPU:cpunum

Fabric ID:fabid; Event Detail: text

`cpunum`	The processor that is reporting this event.
`fabid`	This identifies the InfiniBand Fabric ID of the source of the message. The format is an 8-digit hexadecimal number, for example, 0x0A000201.
`text`	Additional text that provides more detailed information for the event.

Cause NCSL received a protocol message out of order.

Effect One or more subsystems within the NonStop OS may hang, or a CPU may go down, or a CLIM may leave the STARTED state.

Recovery Reload the processor if it is hung and contact HP Support.

1006

CPU:cpunum; fabric fabric Subnet Management change

Change type:type; Event Detail:text

`cpunum`	The processor that is reporting this event.
`fabric`	The InfiniBand fabric whose Subnet Management has changed. This will be displayed as “X” for the X-fabric, and “Y” for the Y-fabric.
`type`	The type of subnet management change that was made. Valid types and their corresponding display strings are as follows:
zncs-enm-sm-event-lid-change (IB_EVENT_LID_CHANGE) - New Local Identifier assigned
zncs-enm-sm-event-sm-change (IB_EVENT_SM_CHANGE) - New Subnet Manager assigned
zncs-enm-sm-event-client-rereg (IB_EVENT_CLIENT_REREGISTER) - Subnet management information change
zncs-enm-sm-event-pkey-change (IB_EVENT_PKEY_CHANGE) - Port P-Key change
`text`	Additional text that provides more detailed information for the event.

Cause These events are usually caused by a new InfiniBand Subnet Manager taking control of the InfiniBand fabric, or an existing InfiniBand Subnet Manager restarting itself.

Effect None

Recovery None; this event is informational only.

1007

CPU:cpunum; fabric fabric up

Cause: cause

`cpunum`	The processor that is reporting this event.
`fabric`	The InfiniBand fabric that came up. This will be displayed as “X” for the X-fabric, and “Y” for the Y-fabric.
`cause`	This specifies the cause of the InfiniBand fabric coming up. The valid causes and their corresponding display strings are as follows:
zncs-enm-fab-reason-operator - System operator-initiated change
zncs-enm-fab-reason-port - InfiniBand port status change

Cause See the description for the cause variable associated with this event.

Effect The system continues to operate, but with reduced fault tolerance.

Recovery This event is informational only; there is no recovery.

1008

CPU:cpunum; fabric fabric down

Cause: cause

`cpunum`	The processor that is reporting this event.
`fabric`	The InfiniBand fabric that went down. This is displayed as “X” for the X-fabric, and “Y” for the Y-fabric.
`cause`	This specifies the cause of the InfiniBand fabric going down. The valid causes and their corresponding display strings are as follows:
zncs-enm-fab-reason-operator - System operator-initiated change
zncs-enm-fab-reason-port - InfiniBand port status change

Cause See the description for the cause variable associated with this event.

Effect The system continues to operate, but with reduced fault tolerance.

Recovery Use the administrative interface to bring the InfiniBand fabric back up.
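
Events 1007 and 1008 together describe the state of the X and Y fabrics as seen by one processor. The following Python sketch tracks that state from a sequence of events. It assumes that fault tolerance is reduced whenever fewer than two fabrics are operational, which is an interpretation rather than a statement from this chapter; `apply_fabric_event` and `reduced_fault_tolerance` are hypothetical names.

```python
# Event numbers taken from this chapter; everything else is illustrative.
FABRIC_UP = 1007
FABRIC_DOWN = 1008


def apply_fabric_event(up_fabrics, event_number, fabric):
    """Return the set of operational fabrics after one 1007 or 1008 event."""
    if fabric not in ("X", "Y"):
        raise ValueError("fabric must be 'X' or 'Y'")
    updated = set(up_fabrics)
    if event_number == FABRIC_UP:
        updated.add(fabric)
    elif event_number == FABRIC_DOWN:
        updated.discard(fabric)
    else:
        raise ValueError("expected event 1007 or 1008")
    return updated


def reduced_fault_tolerance(up_fabrics):
    """Assumption: tolerance is reduced when fewer than two fabrics are up."""
    return len(up_fabrics) < 2
```

For example, starting from both fabrics up, a 1008 event for fabric Y leaves only X operational, and a later 1007 event for fabric Y restores both.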

1009

CPU:cpunum; Invalid operation received

Error type: type; Event detail: text

`cpunum`	The processor that is reporting this event.
`type`	The type of invalid operation that was received. The valid types and their corresponding display strings are as follows:
zncs-enm-qp-req-err (IB_EVENT_QP_REQ_ERR) - Invalid transport opcode received
zncs-enm-qp-access-err (IB_EVENT_QP_ACCESS_ERR) - Invalid remote QP access
`text`	Additional text that provides more detailed information for the event.

Cause These events are caused by a remote access that violates the InfiniBand protocol.

Effect The connection to the CPU or CLIM that was the source of the transaction that violated the InfiniBand protocol is broken.

Recovery If this happens transiently the system will recover automatically. If this failure keeps recurring, contact HP support.

2002

Connection request from untrusted GUID; CPU:cpunum

Fabric ID: fabid; Port: port; Family: family

Requesting GUID: guid

`cpunum`	The processor that is reporting this event.
`fabid`	This identifies the InfiniBand Fabric ID of the remote destination that sent the connection request. The format is an 8-digit hexadecimal number, for example, 0x0A000201.
`port`	This identifies the port that was targeted by the connection request. The format is a decimal number of up to 5 digits, for example, 5001.
`family`	This identifies the address family being used by the connection request. The format is a decimal number of up to 5 digits, for example, 2.
`guid`	This is the untrusted GUID associated with the HCA that the connection request was sent from.

Cause A connection establishment attempt was received from a source that is deemed to be untrusted.

Effect The system continues to operate normally. The connection request is ignored.

Recovery This event is informational only; there is no recovery required. If the fabid is associated with an NSK CPU or CLIM, this indicates that one of the cables between fabid and the CPU reporting the error is plugged into the wrong switch port, so the IB switch cabling should be checked. If the fabid is not associated with an NSK CPU or CLIM, this event is an indication of misbehavior on the part of an entity on the IB subnet. The fabid and/or guid should be used to identify the source of the connection request and physically disconnect that entity from the IB subnet.
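
The Recovery text above turns on one question: whether the Fabric ID in the event belongs to an NSK CPU or CLIM. The following Python sketch restates that decision. The set of known Fabric IDs must be supplied by the caller (for example, from site records); `format_fabid`, `triage_2002`, and `known_fabids` are hypothetical names introduced for illustration.

```python
def format_fabid(value):
    """Render a Fabric ID as the 8-digit hexadecimal form used in this chapter."""
    return f"0x{value:08X}"


def triage_2002(fabid, known_fabids):
    """Return the suggested follow-up for an event 2002.

    fabid: Fabric ID from the event, as an integer.
    known_fabids: Fabric IDs associated with NSK CPUs and CLIMs.
    """
    if fabid in known_fabids:
        # Known source: the event points to a cabling mistake.
        return ("Check the IB switch cabling: a cable between "
                f"{format_fabid(fabid)} and the reporting CPU is plugged "
                "into the wrong switch port.")
    # Unknown source: an entity on the IB subnet is misbehaving.
    return ("Identify the source from the Fabric ID and/or GUID and "
            "physically disconnect it from the IB subnet.")
```

For example, `format_fabid(0x0A000201)` returns `0x0A000201`, and a Fabric ID that is absent from `known_fabids` leads to the disconnect recommendation.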