Operator Messages Manual

Chapter 57 NCSL (NonStop Connection Services Library) Messages

The messages in this chapter are generated by the NCSL subsystem, the NonStop Connection Services Library. The subsystem ID displayed by these messages includes NCSL as the subsystem name.



1001

NCSL failed to allocate resource; CPU:cpunum; PIN:pin

Resource Type: restype [threshold exceeded]

cpunum

The processor that is reporting this event.

pin

The PIN of the process that reported this event.

restype

This specifies the type of resource that NCSL was unable to allocate. The resource types and the corresponding Cause, Effect, and Recovery are listed below.

threshold exceeded

When the Severity of the event is Critical, the string “threshold exceeded” appears in the indicated location in the event text. If the Severity is not Critical, this text is not displayed.

restype  Cause

zncs-enm-restype-kcalloc-mem

zncs-enm-restype-fpool-mem

zncs-enm-restype-cli-mem

NCSL was unable to allocate memory.

zncs-enm-restype-qp

NCSL was unable to allocate an InfiniBand Queue Pair.

zncs-enm-restype-pd

NCSL was unable to allocate an InfiniBand Protection Domain.

zncs-enm-restype-cq

NCSL was unable to allocate an InfiniBand Completion Queue.

zncs-enm-restype-mr

NCSL was unable to allocate an InfiniBand Memory Region.

zncs-enm-restype-port

zncs-enm-restype-dev

zncs-enm-restype-hca

NCSL was unable to access the InfiniBand Host Channel Adapter.

Cause  See the cause for the corresponding restype in the above table.

Effect  One or more subsystems within the NonStop OS may hang.

Recovery  If the Severity is Warning, NCSL will retry automatically and no operator-initiated recovery action is required. If the Severity is Critical, the recovery action is to reload the processor and contact HP Support.
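The Recovery rule above can be sketched as a small triage routine. This is an illustrative sketch only: the function name triage_1001 is hypothetical, the regular expression assumes the event text is laid out exactly as shown for this event, and it assumes that restype is displayed as a single token without spaces. The presence of the “threshold exceeded” string is used as the indicator of Critical Severity, as described above.

```python
import re

# Assumed layout of the 1001 event text; real EMS output may space or wrap it differently.
EVENT_1001 = re.compile(
    r"NCSL failed to allocate resource;\s*CPU:\s*(?P<cpu>\d+);\s*PIN:\s*(?P<pin>\d+)\s+"
    r"Resource Type:\s*(?P<restype>\S+)(?P<critical>\s+threshold exceeded)?"
)


def triage_1001(text):
    """Return (cpu, pin, restype, action) for a 1001 event text, or None if it does not match."""
    m = EVENT_1001.search(text)
    if m is None:
        return None
    # "threshold exceeded" appears only when the Severity is Critical.
    critical = m.group("critical") is not None
    if critical:
        action = "reload the processor and contact HP Support"
    else:
        action = "none; NCSL retries automatically"
    return int(m.group("cpu")), int(m.group("pin")), m.group("restype"), action
```

A monitoring script could call triage_1001 on each 1001 event and page an operator only when the returned action is not the automatic-retry case.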



1002

Bad NCSL protocol message received; CPU:cpunum

Fabric ID:fabid; Message Type: msgtype

Bad Field Name: fname

Event Detail: text

cpunum

The processor that is reporting this event.

fabid

This identifies the InfiniBand Fabric ID of the source of the message. The format is an 8-digit hexadecimal number, for example, 0x0A000201.

msgtype

This specifies the type of NCSL protocol message that was received. The valid message types and the corresponding display strings are:

zncs-enm-msgtype-control-req - Control connection request

zncs-enm-msgtype-control-rep - Control connection reply

zncs-enm-msgtype-conn-req - Virtual QP Group connection request

zncs-enm-msgtype-conn-rep - Virtual QP Group connection reply

zncs-enm-msgtype-disc-req - Virtual QP Group disconnect request

zncs-enm-msgtype-disc-rep - Virtual QP Group disconnect reply

zncs-enm-msgtype-switch-req - Path switch request

zncs-enm-msgtype-switch-rep - Path switch reply

zncs-enm-msgtype-unknown - Unknown

fname

This specifies which field within the received NCSL protocol message was invalid. The field identifiers and their corresponding display strings are:

zncs-enm-fname-size - Message size

zncs-enm-fname-reserved - Reserved

zncs-enm-fname-pdata-len - Private Data length

zncs-enm-fname-connid - Connection ID

zncs-enm-fname-version - Protocol version

zncs-enm-fname-num-vqps - Number of Virtual Queue Pairs

zncs-enm-fname-priority - Virtual Queue Pair Priority

zncs-enm-fname-fabric - Virtual Queue Pair Fabric Affinity

zncs-enm-fname-conn-status - Connection reply status

zncs-enm-fname-vqp-port - Virtual Queue Pair port

zncs-enm-fname-disc-reason - Disconnect reason

zncs-enm-fname-switch-reason - Switch reason

zncs-enm-fname-tid - Path switch transaction identifier

zncs-enm-fname-unknown - Unknown

text

More detailed information describing the problem with the protocol message.

Cause  NCSL received a bad or unexpected NCSL protocol message from the remote source identified in the event. The most likely cause is a mismatch between the NCSL protocol in use by the remote source and the local NCSL implementation. Other possible causes are unexpected (but legal) protocol interactions, and possibly unreported data corruption.

Effect  The invalid protocol message is discarded. Depending on which protocol message was discarded, the effect can range from none at all, through a pause in the connection establishment or Path switching process, to an OS subsystem outage or a processor failing to come up after a cold load or reload. In the more serious cases, the affected subsystem will likely emit one or more events of greater Severity.

Recovery  Ensure that the source and destination are running the same NCSL protocol version. If they are, and the system is repeatedly experiencing these errors, contact HP Support. If there is a subsystem outage or the processor is failing to come up, reload the processor and contact HP Support.
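The msgtype and fname lists for this event amount to two lookup tables. The following sketch shows one way a log-processing tool might render them. The function name describe_1002 is hypothetical, and falling back to "Unknown" for an identifier that is not listed is an assumption rather than documented NCSL behavior.

```python
# Display strings copied from the msgtype list for event 1002.
MSGTYPE = {
    "zncs-enm-msgtype-control-req": "Control connection request",
    "zncs-enm-msgtype-control-rep": "Control connection reply",
    "zncs-enm-msgtype-conn-req": "Virtual QP Group connection request",
    "zncs-enm-msgtype-conn-rep": "Virtual QP Group connection reply",
    "zncs-enm-msgtype-disc-req": "Virtual QP Group disconnect request",
    "zncs-enm-msgtype-disc-rep": "Virtual QP Group disconnect reply",
    "zncs-enm-msgtype-switch-req": "Path switch request",
    "zncs-enm-msgtype-switch-rep": "Path switch reply",
    "zncs-enm-msgtype-unknown": "Unknown",
}

# Display strings copied from the fname list for event 1002.
FNAME = {
    "zncs-enm-fname-size": "Message size",
    "zncs-enm-fname-reserved": "Reserved",
    "zncs-enm-fname-pdata-len": "Private Data length",
    "zncs-enm-fname-connid": "Connection ID",
    "zncs-enm-fname-version": "Protocol version",
    "zncs-enm-fname-num-vqps": "Number of Virtual Queue Pairs",
    "zncs-enm-fname-priority": "Virtual Queue Pair Priority",
    "zncs-enm-fname-fabric": "Virtual Queue Pair Fabric Affinity",
    "zncs-enm-fname-conn-status": "Connection reply status",
    "zncs-enm-fname-vqp-port": "Virtual Queue Pair port",
    "zncs-enm-fname-disc-reason": "Disconnect reason",
    "zncs-enm-fname-switch-reason": "Switch reason",
    "zncs-enm-fname-tid": "Path switch transaction identifier",
    "zncs-enm-fname-unknown": "Unknown",
}


def describe_1002(msgtype_id, fname_id):
    """Render the Message Type and Bad Field Name parts of a 1002 event."""
    # Assumption: identifiers missing from the tables are shown as "Unknown".
    msgtype = MSGTYPE.get(msgtype_id, "Unknown")
    fname = FNAME.get(fname_id, "Unknown")
    return f"Message Type: {msgtype}; Bad Field Name: {fname}"
```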



1003

RDMA Connection Manager failure; CPU: cpunum

Fabric ID:fabid; Failure status: status

cpunum

The processor that is reporting this event.

fabid

This identifies the InfiniBand Fabric ID of the remote destination that the processor failed to establish a connection with. The format is an 8-digit hexadecimal number, for example, 0x0A000201.

status

This is the failure status reported by the NSK RDMA Connection Manager code for the attempt to establish a Control connection. The possible values along with the cause, effect, and recovery action for each value are given below.

zncs-enm-cm-status-timeout (IB_CM_REJ_TIMEOUT)

Cause  The destination did not respond to a connection management request.

Effect  A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery  

  1. Check to make sure that the CPU or CLIM at the remote destination is powered on with no warning indicators.

  2. Check the ME Configuration database to ensure that the InfiniBand Fabric ID of the remote destination appears within that database.

  3. Check the cabling connections and switches (if any) between the local processor and the remote destination.

If none of this solves the problem, contact HP Support.

zncs-enm-cm-status-stale (IB_CM_REJ_STALE_CONN)

Cause  The RDMA Connection Manager code detected a Queue Pair that was still in the time wait state.

Effect  The process of a processor joining the local node or a CLIM reaching the STARTED state is momentarily delayed.

Recovery  The system should recover from this condition automatically. If this problem persists, reload the remote destination and contact HP Support.

zncs-enm-cm-status-sid (IB_CM_REJ_INVALID_SERVICE_ID)

Cause  There was no server listening for an incoming connection establishment attempt at the remote destination.

Effect  A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery  

  1. Check to make sure that the CPU or CLIM at the remote destination is powered on with no warning indicators.

  2. Check the ME Configuration database to ensure that the InfiniBand Fabric ID of the remote destination appears within that database.

  3. Check the cabling connections and switches (if any) between the local processor and the remote destination.

If none of this solves the problem, contact HP Support.

zncs-enm-cm-status-gid (IB_CM_REJ_INVALID_GID)

Cause  A Global Identifier could not be found in the local Global Identifier cache table.

Effect  A processor fails to join the local node or a CLIM fails to reach the STARTED state.

Recovery  Reload the processor and contact HP Support.
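The status values for this event map onto a small set of recovery procedures, two of which share the same connectivity checklist. The sketch below captures that mapping for use in an operator runbook tool. The names CONNECTIVITY_CHECKS, RECOVERY_1003, and recovery_steps are hypothetical, and escalating an unlisted status directly to HP Support is an assumption.

```python
# Checklist shared by the timeout and invalid-service-ID statuses.
CONNECTIVITY_CHECKS = (
    "Check that the CPU or CLIM at the remote destination is powered on with no warning indicators.",
    "Check that the InfiniBand Fabric ID of the remote destination appears in the ME Configuration database.",
    "Check the cabling connections and switches (if any) between the local processor and the remote destination.",
    "If none of this solves the problem, contact HP Support.",
)

RECOVERY_1003 = {
    "IB_CM_REJ_TIMEOUT": CONNECTIVITY_CHECKS,
    "IB_CM_REJ_STALE_CONN": (
        "Wait; the system should recover from this condition automatically.",
        "If the problem persists, reload the remote destination and contact HP Support.",
    ),
    "IB_CM_REJ_INVALID_SERVICE_ID": CONNECTIVITY_CHECKS,
    "IB_CM_REJ_INVALID_GID": ("Reload the processor and contact HP Support.",),
}


def recovery_steps(status):
    """Return the ordered recovery steps for a 1003 failure status."""
    # Assumption: a status not listed in this chapter is escalated directly.
    return RECOVERY_1003.get(status, ("Contact HP Support.",))
```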



1004

Data transfer operation failure; CPU:cpunum

Fabric ID:fabid; Failure status: status

cpunum

The processor that is reporting this event.

fabid

This identifies the InfiniBand Fabric ID that the data transfer operation was accessing. The format is an 8-digit hexadecimal number, for example, 0x0A000201.

status

This is the data transfer operation failure status reported by the NSK RDMA Services code. The possible values along with the cause, effect, and recovery action for each value are given below.

zncs-enm-dto-status-flushed (IB_WC_WR_FLUSH_ERR)

Cause  The connection to the remote destination broke. This can happen as a side effect of the system operator taking a NonStop processor or CLIM out of operation. It can also happen if there are transient or permanent software or hardware failures in the communications path between the local processor and the destination.

Effect  If both InfiniBand fabrics between the local processor and the destination are operational, there will be no noticeable effect, but a Path switch will occur and EMS events to that effect will be logged. If this failure occurs when only one InfiniBand fabric is operating and the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM might transition out of the STARTED state.

Recovery  If this occurs because the system operator took a processor or CLIM out of operation, no recovery is required. If this happens transiently, the system will recover automatically. If this failure keeps recurring and there are further indications of resource exhaustion failures in the EMS log on NSK or syslog on the CLIM, configure more resources if that is possible. If the failure still keeps recurring, contact HP Support.

zncs-enm-dto-status-bad-resp (IB_WC_WR_BAD_RESP_ERR)

Cause  The local InfiniBand Host Channel Adapter hardware received an InfiniBand protocol message with an invalid opcode.

Effect  If both InfiniBand fabrics between the local processor and the destination are operational, there will be no noticeable effect, but a Path switch will occur and EMS events to that effect will be logged. If this failure occurs when only one InfiniBand fabric is operating and the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-inv-req (IB_WC_WR_REM_INV_REQ_ERR)

Cause  The local InfiniBand Host Channel Adapter hardware received an InfiniBand protocol response message indicating that the local processor sent it an invalid InfiniBand protocol packet.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-access (IB_WC_WR_REM_ACCESS_ERR)

Cause  An attempt to probe the connection with the remote destination failed because the destination was present but inaccessible. This can happen if the remote destination has halted.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If the remote destination is halted, restart it. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-op (IB_WC_WR_REM_OP_ERR)

Cause  A data transfer sent to the remote destination could not be successfully received.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-timeout (IB_WC_RESP_TIMEOUT_ERR)

zncs-enm-dto-status-retry (IB_WC_RETRY_EXC_ERR)

Cause  The local InfiniBand Host Channel Adapter hardware timed out waiting for an acknowledgment for a DTO request that it sent out.

Effect  If both InfiniBand fabrics between the local processor and the destination are operational, there will be no noticeable effect, but a Path switch will occur and EMS events to that effect will be logged. If this failure occurs when only one InfiniBand fabric is operating and the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-rnr-retry (IB_WC_RNR_RETRY_EXC_ERR)

Cause  The local InfiniBand Host Channel Adapter hardware timed out waiting for the receiving InfiniBand Host Channel Adapter hardware to become ready.

Effect  If both InfiniBand fabrics between the local processor and the destination are operational, there will be no noticeable effect, but a Path switch will occur and EMS events to that effect will be logged. If this failure occurs when only one InfiniBand fabric is operating and the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-inv-rd (IB_WC_REM_INV_RD_REQ_ERR)

Cause  The remote destination reported an invalid incoming message.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-status-abort (IB_WC_REM_ABORT_ERR)

Cause  The current DTO operation was aborted by the InfiniBand Host Channel Adapter hardware. Either the local InfiniBand Host Channel Adapter or the destination InfiniBand Host Channel Adapter might have caused the abort.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.

zncs-enm-dto-gen-err (IB_WC_GENERAL_ERROR)

Cause  The current DTO operation was aborted by the InfiniBand Host Channel Adapter hardware, for unknown reasons.

Effect  If the destination is a NonStop processor, one or more processors will likely go down. If the destination is a CLIM, the CLIM will transition out of the STARTED state.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.
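The Effect descriptions for event 1004 follow a common pattern: for some statuses a Path switch hides the failure while both fabrics are operational, and otherwise the outcome depends on whether the destination is a NonStop processor or a CLIM. The sketch below encodes that pattern. The function name expected_effect and the set name PATH_SWITCH_STATUSES are hypothetical, and the sketch reads the CLIM sentence in the Path switch cases as applying only when a Path switch is not available, which is an interpretation of the text rather than a documented rule.

```python
# Statuses whose Effect text mentions a Path switch when both fabrics are operational.
PATH_SWITCH_STATUSES = frozenset({
    "IB_WC_WR_FLUSH_ERR",
    "IB_WC_WR_BAD_RESP_ERR",
    "IB_WC_RESP_TIMEOUT_ERR",
    "IB_WC_RETRY_EXC_ERR",
    "IB_WC_RNR_RETRY_EXC_ERR",
})


def expected_effect(status, both_fabrics_up, destination_is_clim):
    """Summarize the documented Effect of a 1004 failure status."""
    if status in PATH_SWITCH_STATUSES and both_fabrics_up:
        return "no noticeable effect; a Path switch occurs and EMS events are logged"
    if destination_is_clim:
        # Only the flushed status is documented with "might"; the others say "will".
        verb = "might" if status == "IB_WC_WR_FLUSH_ERR" else "will"
        return f"the CLIM {verb} transition out of the STARTED state"
    return "one or more processors will likely go down"
```

For example, expected_effect("IB_WC_REM_ABORT_ERR", True, False) reports the processor-down outcome even with both fabrics operational, because the text for that status does not mention a Path switch.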



1005

NCSL message received out of order; CPU:cpunum

Fabric ID:fabid; Event Detail: text

cpunum

The processor that is reporting this event.

fabid

This identifies the InfiniBand Fabric ID of the source of the message. The format is an 8-digit hexadecimal number, for example, 0x0A000201.

text

Additional text that provides more detailed information for the event.

Cause  NCSL received a protocol message out of order.

Effect  One or more subsystems within the NonStop OS may hang, or a CPU may go down, or a CLIM may leave the STARTED state.

Recovery  Reload the processor if it is hung and contact HP Support.



1006

CPU:cpunum; fabric fabric Subnet Management change

Change type:type; Event Detail:text

cpunum

The processor that is reporting this event.

fabric

The InfiniBand fabric whose Subnet Management has changed. This will be displayed as “X” for the X-fabric, and “Y” for the Y-fabric.

type

The type of subnet management change that was made. Valid types and their corresponding display strings are as follows:

zncs-enm-sm-event-lid-change (IB_EVENT_LID_CHANGE) - New Local Identifier assigned

zncs-enm-sm-event-sm-change (IB_EVENT_SM_CHANGE) - New Subnet Manager assigned

zncs-enm-sm-event-client-rereg (IB_EVENT_CLIENT_REREGISTER) - Subnet management information change

zncs-enm-sm-event-pkey-change (IB_EVENT_PKEY_CHANGE) - Port P-Key change

text

Additional text that provides more detailed information for the event.

Cause  These events are usually caused by a new InfiniBand Subnet Manager taking control of the InfiniBand fabric, or an existing InfiniBand Subnet Manager restarting itself.

Effect  None

Recovery  None; this event is informational only.



1007

CPU:cpunum; fabric fabric up

Cause: cause

cpunum

The processor that is reporting this event.

fabric

The InfiniBand fabric that came up. This will be displayed as “X” for the X-fabric, and “Y” for the Y-fabric.

cause

This specifies the cause of the InfiniBand fabric coming up. The valid causes and their corresponding display strings are as follows:

zncs-enm-fab-reason-operator - System operator-initiated change

zncs-enm-fab-reason-port - InfiniBand port status change

Cause  See the description for the cause variable associated with this event.

Effect  The system continues to operate, but with reduced fault tolerance.

Recovery  This event is informational only; there is no recovery.



1008

CPU:cpunum; fabric fabric down

Cause: cause

cpunum

The processor that is reporting this event.

fabric

The InfiniBand fabric that went down. This is displayed as “X” for the X-fabric, and “Y” for the Y-fabric.

cause

This specifies the cause of the InfiniBand fabric going down. The valid causes and their corresponding display strings are as follows:

zncs-enm-fab-reason-operator - System operator-initiated change

zncs-enm-fab-reason-port - InfiniBand port status change

Cause  See the description for the cause variable associated with this event.

Effect  The system continues to operate, but with reduced fault tolerance.

Recovery  Use the administrative interface to bring the InfiniBand fabric back up.
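Events 1007 and 1008 together describe the current state of each fabric as seen by each processor. The sketch below folds a sequence of such event texts into a state table and lists the processors that last reported a fabric down. The function names are hypothetical, and the regular expression assumes the first line of the event text appears exactly as shown for these events.

```python
import re

# Assumed first-line layout shared by events 1007 (up) and 1008 (down).
FABRIC_EVENT = re.compile(
    r"CPU:\s*(?P<cpu>\d+);\s*fabric\s+(?P<fabric>[XY])\s+(?P<state>up|down)\b"
)


def apply_fabric_events(lines):
    """Fold event texts into {(cpu, fabric): "up" or "down"}; the last event wins."""
    state = {}
    for line in lines:
        m = FABRIC_EVENT.search(line)
        if m:
            state[(int(m.group("cpu")), m.group("fabric"))] = m.group("state")
    return state


def cpus_with_fabric_down(state):
    """Return the sorted CPU numbers that last reported at least one fabric down."""
    return sorted({cpu for (cpu, _fabric), s in state.items() if s == "down"})
```

Processors returned by cpus_with_fabric_down are candidates for the Recovery action of event 1008.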



1009

CPU:cpunum; Invalid operation received

Error type: type; Event detail: text

cpunum

The processor that is reporting this event.

type

The type of invalid operation that was received. The valid types and their corresponding display strings are as follows:

zncs-enm-qp-req-err (IB_EVENT_QP_REQ_ERR) - Invalid transport opcode received

zncs-enm-qp-access-err (IB_EVENT_QP_ACCESS_ERR) - Invalid remote QP access

text

Additional text that provides more detailed information for the event.

Cause  These events are caused by a remote access that violates the InfiniBand protocol.

Effect  The connection to the CPU or CLIM that was the source of the transaction that violated the InfiniBand protocol is broken.

Recovery  If this happens transiently, the system will recover automatically. If this failure keeps recurring, contact HP Support.



2002

Connection request from untrusted GUID; CPU:cpunum

Fabric ID: fabid; Port: port; Family: family

Requesting GUID: guid

cpunum

The processor that is reporting this event.

fabid

This identifies the InfiniBand Fabric ID of the remote destination that sent the connection request. The format is an 8-digit hexadecimal number, for example, 0x0A000201.

port

This identifies the port that was targeted by the connection request. The format is a decimal number of up to 5 digits, for example, 5001.

family

This identifies the address family being used by the connection request. The format is a decimal number of up to 5 digits, for example, 2.

guid

This is the untrusted GUID associated with the HCA that the connection request was sent from.

Cause  A connection establishment attempt was received from a source that is deemed to be untrusted.

Effect  The system continues to operate normally. The connection request is ignored.

Recovery  This event is informational only; the system itself requires no recovery, but the source of the request should be investigated. If the fabid is associated with an NSK CPU or CLIM, this indicates that one of the cables between fabid and the CPU reporting the error is plugged into the wrong switch port, so check the IB switch cabling. If the fabid is not associated with an NSK CPU or CLIM, this event is an indication of misbehavior on the part of an entity on the IB subnet. Use the fabid and/or guid to identify the source of the connection request and physically disconnect that entity from the IB subnet.
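The Recovery decision for event 2002 depends on whether the reported fabid belongs to a known NSK CPU or CLIM. The sketch below parses the event text and applies that decision. The function name triage_2002 and the known_fabids argument (a set of Fabric IDs the operator knows to belong to NSK CPUs or CLIMs) are hypothetical, the regular expression assumes the layout shown for this event, and the GUID is captured as an opaque token because its display format is not specified here.

```python
import re

# Assumed layout of the 2002 event text; field widths follow the variable descriptions.
EVENT_2002 = re.compile(
    r"Connection request from untrusted GUID;\s*CPU:\s*(?P<cpu>\d+)\s+"
    r"Fabric ID:\s*(?P<fabid>0x[0-9A-Fa-f]{8});\s*Port:\s*(?P<port>\d{1,5});\s*"
    r"Family:\s*(?P<family>\d{1,5})\s+Requesting GUID:\s*(?P<guid>\S+)"
)


def triage_2002(text, known_fabids):
    """Parse a 2002 event text and suggest a follow-up; return None if it does not match."""
    m = EVENT_2002.search(text)
    if m is None:
        return None
    fabid = int(m.group("fabid"), 16)  # base 16 accepts the leading "0x"
    if fabid in known_fabids:
        advice = "check the IB switch cabling between this Fabric ID and the reporting CPU"
    else:
        advice = "identify the source by fabid and/or guid and disconnect it from the IB subnet"
    return {
        "cpu": int(m.group("cpu")),
        "fabid": fabid,
        "port": int(m.group("port")),
        "family": int(m.group("family")),
        "guid": m.group("guid"),
        "advice": advice,
    }
```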