Operator Messages Manual

Chapter 110 SYSH (Syshealth) Messages

The messages in this chapter are sent to $0 by the Syshealth subsystem. You can view operator messages using either the Viewpoint application or the Syshealth event viewing screen. To view messages on the event viewing screen, you must first select the $0 message log using the File menu Use Log command. Syshealth operator messages are listed as subsystem SYSH.

NOTE: Negative-numbered messages are common to most subsystems. If you receive a negative-numbered message that is not described in this chapter, see Chapter 15.


1000

module-title (process-name) - Module launched [with undefined externals].

module-title

is the name of the Syshealth module that was launched. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.

process-name

is the process name or the processor number and process identification number (PIN) of the launched process.

Cause  The Persistence Monitor is launching a Syshealth module.

Effect  The Syshealth module is started.

Recovery  Informational message only; no corrective action is needed.



1001

module-title (process-name) - Too many module relaunch attempts.

module-title

is the name of the Syshealth module that failed to start. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.

process-name

is the process name of the failed module. It is the process name or the processor number and process identification number (PIN) of the launched process.

Cause  The monitored module has failed to start.

Effect  The module is not running.

Recovery  Analyze the previous Syshealth MODULE-FAILED error (#1002) and the error buffer to determine the cause of the problem. Correct the problem and use the Syshealth Management screen to start the module.



1002

module-title (process-name) - Failed due to { CPU failure | ABEND }

module-title

is the name of the Syshealth module that failed. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.

process-name

is the process name or the processor number and process identification number (PIN) of the module that failed.

Cause  The monitored module failed.

Effect  The module is not running.

Recovery  Informational message only; no corrective action is needed. The Persistence Monitor tries to relaunch the module.



1003

module-title (process-name) - Internal Error. Error Information:

module-title

is the name of the module that had an error. It is derived from the unique module identification number. This ID is globally defined for all modules in Syshealth.

process-name

is the process name of the module in error. It is the process name or the processor number and process identification number (PIN) of the launched process.

Cause  The Persistence Monitor failed because of an internal error.

Effect  The Persistence Monitor is not running.

Recovery  Analyze the error information in this error message to determine the cause of the problem. Correct the problem and use the Syshealth Management screen to start the Persistence Monitor.



2000

Unable to access Syshealth Command Database file. File error file-error while attempting proc-failing.

file

is the name of the Syshealth command database file on which the error was detected.

file-error

is the file-system error associated with the failure.

proc-failing

is the file-system procedure associated with the failure.

Cause  When TMDSAUTO is run, it first tries to open the Syshealth command database. Then it performs a read on the database’s header record. If the open or read fails, this error is generated.

Effect  The Syshealth monitoring processes (Persistence Monitor, System Health Monitor, and Notification system) and the Syshealth user interface depend on the command database to implement command security. Syshealth runs, but it uses its default command security.

Recovery  If the file was not found, it either was not installed or has been purged. Reinstall the file using the INSTALL program. If the error was other than “File Not Found,” see Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.



2001

Unable to access Syshealth Management Database file during init-phase. File error file-error while attempting proc-failing. Syshealth was not started.

file

is the name of the Syshealth management database file on which the error was detected.

init-phase

indicates what phase of the Syshealth initialization failed. If message 2001 is generated, this parameter indicates how the management database was being used when the error occurred.

file-error

is the file-system error associated with the failure.

proc-failing

is the file-system procedure associated with the failure.

Cause  When TMDSAUTO is run, the management database is used to bring up Syshealth. First the database file is opened. Then the file description records are read and used in checking the accessibility of all Syshealth files. The Persistence Monitor process startup information is read and used to start the monitor. Finally, other processes’ startup information is read and sent to the Persistence Monitor to start them.

If any of these steps fails due to an error when accessing the management database, message 2001 is generated.
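The startup sequence described above can be pictured as an ordered list of phases that stops at the first file-system error. The following is only a minimal sketch in Python, not TMDSAUTO's actual logic: the function name, the phase labels, and the convention that each step returns a file-system error number (0 for success) are assumptions made for illustration. The manual does not list the real init-phase values.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each phase is (init_phase_label, step). The labels are hypothetical;
# each step returns a file-system error number, where 0 means success.
Phase = Tuple[str, Callable[[], int]]


def run_startup(phases: Sequence[Phase]) -> Optional[Tuple[str, int]]:
    """Run the phases in order and stop at the first failure.

    Returns (init_phase, file_error) for the phase that would be reported
    in message 2001, or None if every phase succeeded.
    """
    for init_phase, step in phases:
        file_error = step()
        if file_error != 0:
            # Later phases are not attempted; Syshealth is not started.
            return (init_phase, file_error)
    return None
```

Because the loop returns at the first nonzero error, the init-phase value in message 2001 identifies the earliest step that could not use the management database.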

Effect  Syshealth depends on the management database being accessible. If it is not accessible, Syshealth is not started, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) collector ($ZLOG).

Recovery  See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.



2002

Unable to access Syshealth Configuration Database file. File error file-error while attempting proc-failing.

file

is the name of the Syshealth configuration database file on which the error was detected.

file-error

is the file-system error associated with the failure.

proc-failing

is the file-system procedure associated with the failure.

Cause  When TMDSAUTO is run, it attempts to open the Syshealth configuration database and perform a read on the database file’s header record. If the database file is not found, TMDSAUTO attempts to create the file. If the create, open, or read fails, then message 2002 is generated.

Effect  The Syshealth user interface depends on the configuration database being accessible. If it is not, the Syshealth user interface cannot alter any of the configurable options of Syshealth.

Recovery  See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.



2003

Syshealth not properly installed. n Syshealth files missing

n

is the number of Syshealth files that were not found on the system.

Cause  When TMDSAUTO is run, it checks that all of the appropriate Syshealth files are present. It does this by reading the file-verification records in the management database, then by opening each file that should be on the current system. If any files are missing, this error is generated. If the open fails for any other reason, this error also is generated.

NOTE: This error message is generated only if one of the Syshealth monitor files (Health Monitor, Persistence Monitor, or Notification module) is inaccessible.

Effect  If a monitor file is missing (that is, one critical to the operation of Syshealth automatic monitoring), then Syshealth does not run. The effects of other files missing depend on the file: some screens might not be available, help might be missing, and so on. Note that even if Syshealth does not run, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).

Recovery  Examine the error using the TMDS FIND command (or, if sufficient portions of Syshealth are available, using the Syshealth Event Viewing screen) to see exactly which files are missing. The missing files either were not installed or have been purged. Reinstall the files using the INSTALL program.



2004

Syshealth not properly installed. n files were inaccessible.

n

is the number of Syshealth files that were not accessible to TMDSAUTO.

NOTE: This error message is generated only if one of the Syshealth monitor files (Health Monitor, Persistence Monitor, or Notification module) is inaccessible.

Cause  When TMDSAUTO is run, it checks that all of the appropriate Syshealth files are present. It does this by reading the file-verification records in the management database, then by opening each file that should be on the current system. If the open fails for any reason other than “File Not Found,” then message 2004 is generated.
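The distinction between missing files (message 2003) and inaccessible files (message 2004) can be sketched as a simple count over the results of the open attempts. This Python fragment is an illustration under stated assumptions, not the TMDSAUTO implementation: the function name is hypothetical, and the value 11 for “File Not Found” is an assumption that should be confirmed against the Guardian Procedure Errors and Messages Manual.

```python
from typing import Dict

# Assumed file-system error number for "File Not Found"; confirm in the
# Guardian Procedure Errors and Messages Manual before relying on it.
FILE_NOT_FOUND = 11


def classify_open_results(open_errors: Dict[str, int]) -> Dict[int, int]:
    """Map a message number (2003 or 2004) to the count n it would report.

    open_errors maps each file name to the error number returned by its
    open attempt, where 0 means the open succeeded.
    """
    missing = sum(1 for err in open_errors.values() if err == FILE_NOT_FOUND)
    inaccessible = sum(
        1 for err in open_errors.values() if err not in (0, FILE_NOT_FOUND)
    )
    report: Dict[int, int] = {}
    if missing:
        report[2003] = missing       # n Syshealth files missing
    if inaccessible:
        report[2004] = inaccessible  # n files were inaccessible
    return report
```

The sketch ignores the NOTE about monitor files; it only shows how the two counts differ.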

Effect  Syshealth may not be installed or secured correctly. If Syshealth cannot access a monitor file (a file critical to the operation of Syshealth automatic monitoring), it does not run. If other files are inaccessible, some Syshealth functions may be lost, depending on which files are not available. Note that even if Syshealth does not run, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) collector ($ZLOG).

Recovery  See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual.



2005

Syshealth Persistence Monitor could not be started due to [ No definition of the process in the Syshealth Management Database.] [ Newprocess error newprocess-error, with program file filename. ] [ File error file-error when performing proc-failing. ] [ SPI error spi-error during proc-failing. ] [ TMDSAUTO internal error during proc-failing. ] Syshealth was not started.

newprocess-error

is the error associated with the failure of a NEWPROCESS procedure call.

filename

is the name of the program file that experienced the NEWPROCESS error.

file-error

is the file-system error associated with the failure.

proc-failing

is the procedure associated with the failure.

spi-error

is the Subsystem Programmatic Interface (SPI) error code associated with the failure.

Cause  When TMDSAUTO is run, it performs initial checks and then starts the Syshealth Persistence Monitor. If any of the functions (memory allocation, NEWPROCESS, OPEN, or WRITEREAD) fails, then Syshealth does not start and message 2005 is generated. Also, if the Syshealth management database was improperly built or corrupted, the Persistence Monitor definition may not be accessible.

Effect  The Syshealth Persistence Monitor is a process pair that starts the remaining Syshealth processes (which are not process pairs) and monitors them to ensure that they remain running. If the Persistence Monitor cannot be started, the remaining Syshealth processes are not started and Syshealth automatic fault monitoring and reporting is not operational, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG) and TMDS fault analysis ($ZMOM).

Recovery  If a file-system error is specified, see Appendix B for a definition of the error. See the Guardian Procedure Errors and Messages Manual for a description of the file-system and NEWPROCESS errors and their recovery actions.

If the problem resulted from an internal error, contact the Global NonStop Solution Center (GNSC) and provide all relevant information as follows:

  • Descriptions of the problem and accompanying symptoms

  • Details from the message or messages generated

  • Supporting documentation such as Event Management Service (EMS) logs, trace files, and a processor dump, if applicable

If your local operating procedures require contacting the Global Mission Critical Solution Center (GMCSC), supply your system number and the numbers and versions of all related products as well.



2006

Syshealth Process Startup Error. target-process Not Started. { File Error | SPI Error } error when attempting proc‑failing while communicating with the Persistence Monitor.

target-process

is the name of a Syshealth process that was not started due to the failure to communicate with the Persistence Monitor.

error

is the file-system error or the Subsystem Programmatic Interface (SPI) error code associated with the failure, if any.

proc-failing

is the procedure associated with the failure.

Cause  This message reports a failure when an SPI ADD MODULE or SPI LAUNCH command is sent to the Syshealth Persistence Monitor. The failure can be either an SPI error or a file-system OPEN or WRITEREAD error. When Syshealth is started, each Syshealth process is added to and then launched by the Persistence Monitor. If any of the SPI commands fails, the startup terminates.

Effect  If any portion of Syshealth fails to start, all of Syshealth is shut down. In this case, Syshealth is not running. Nevertheless, TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).

Recovery  See Appendix B for a definition of the specified error. For more detailed information, including recovery actions, see the Guardian Procedure Errors and Messages Manual. If the error is an SPI error, contact your service provider.



2007

Syshealth Startup Verify Error. target-process failed verification. [{SPI error | File Error} error when attempting proc‑failing.] [Verify code = verify-code]

target-process

is the name of a Syshealth process that was not started due to the failure to communicate with the Persistence Monitor.

error

is the file-system error or the Subsystem Programmatic Interface (SPI) error code associated with the failure, if any.

proc-failing

is the procedure associated with the failure.

verify-code

is the error returned by the Syshealth target process, describing its internal verification failure. Verify codes are defined under “Recovery.”

Cause  Syshealth processes are started one at a time and are verified (using an SPI VERIFY command) prior to starting the next process. If the verify fails because the process is down, does not respond within a 60-second time-out, or returns a verify code indicating that it is not functioning correctly, then message 2007 is generated.
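The start-then-verify sequence described above can be sketched as a loop that stops at the first process whose verification does not return OK. This is a minimal Python illustration, not the actual startup code: the function name and the `start` and `verify` callables are hypothetical stand-ins for the launch step and the SPI VERIFY command. The 60-second time-out and verify code 0 (OK) come from this entry.

```python
from typing import Callable, Iterable, Optional, Tuple

VERIFY_TIMEOUT_SECONDS = 60  # time-out stated in the Cause text
VERIFY_OK = 0                # verify code 0 means OK


def start_and_verify(
    processes: Iterable[str],
    start: Callable[[str], None],
    verify: Callable[[str, int], int],
) -> Optional[Tuple[str, int]]:
    """Start each process in turn and verify it before starting the next.

    `verify` is a hypothetical callable that returns a verify code for the
    named process, given the time-out in seconds.
    Returns (target_process, verify_code) for the first failure, which
    corresponds to message 2007, or None if every process verified.
    """
    for name in processes:
        start(name)
        code = verify(name, VERIFY_TIMEOUT_SECONDS)
        if code != VERIFY_OK:
            return (name, code)  # remaining processes are not started
    return None
```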

Effect  If Syshealth processes are down or did not verify, Syshealth is not started, although TMDSAUTO starts the Tandem Maintenance and Diagnostic Subsystem (TMDS) message collector ($ZLOG).

Recovery  If a Syshealth process is down, check to see whether a NEWPROCESS error occurred, and examine the NEWPROCESS error codes in the Guardian Procedure Errors and Messages Manual to determine the fault and corrective action.

If the process failed its verification, check the codes below for the fault and corrective actions:

Verify Code   Description   Action
0             OK            None required.
1             Timeout       Restart the process that did not start.

If the process was restarted the maximum number of times and failed, check to see whether there is some reason (such as processors failing) that the process stopped. If not, contact the Global NonStop Solution Center (GNSC) and provide all relevant information as follows:

  • Descriptions of the problem and accompanying symptoms

  • Details from the message or messages generated

  • Supporting documentation such as Event Management Service (EMS) logs, trace files, and a processor dump, if applicable

If your local operating procedures require contacting the Global Mission Critical Solution Center (GMCSC), supply your system number and the numbers and versions of all related products as well.



2008

Syshealth started OK.

Cause  When TMDSAUTO is run, it performs the initial verification of Syshealth to check that:

  • The command and management databases are accessible.

  • All Syshealth files are present and accessible.

Both Tandem Maintenance and Diagnostic Subsystem (TMDS) and the Syshealth user interface can start the Persistence Monitor. Then the remainder of the Syshealth processes are registered with the Persistence Monitor using Subsystem Programmatic Interface (SPI) ADD MODULE commands and are launched by the Persistence Monitor. TMDSAUTO (or the user interface) sends SPI VERIFY commands to all Syshealth processes.

If everything is working and there were no errors, message 2008 is generated.

Effect  Syshealth is operational on the system.

Recovery  Informational message only; no corrective action is needed. This message indicates that Syshealth was started correctly.



2009

A Syshealth Shutdown command was issued for target-system.

target-system

is the name of the system to which the Syshealth user interface commands are directed.

Cause  The Syshealth user interface provides commands to shut down and start up the Syshealth processes. Syshealth, $ZLOG, and $ZMOM have been shut down.

Effect  Syshealth has been shut down. If the Tandem Maintenance and Diagnostic Subsystem (TMDS) alternate collector was shut down, no TMDS errors are logged on the system, so information about hardware faults is lost. With Syshealth shut down, no hardware fault coverage is provided and there are no dial-outs caused by system faults.

Recovery  None, if the shutdown was anticipated. Otherwise, the operators should check to ensure that Syshealth has not been left inoperative, which would compromise system fault detection and reporting.



3000

Coldload report of report-severity severity for system‑serial‑number system. resource-name: specific-problem. There are related-alarms related alarms.

report-severity

defines the perceived severity of the system-load report. The levels of severity correspond to the Open Systems Interconnection (OSI) perceived severity definitions for managed objects. This value is determined by using the highest (most critical) perceived severity of all the alarms contained in the system-load report. This variable can have one of the following values:

CRITICAL-SEVERITY

indicates that a service-affecting condition has occurred and an immediate corrective action is needed. Such a severity can be reported, for example, when a resource becomes totally out of service and its capability must be restored.

MAJOR-SEVERITY

indicates that a service-affecting condition has developed and urgent corrective action is required. Such a severity can be reported, for example, when there is the potential for a single point of failure (fault tolerance lost) or severe degradation in resource capability, and full capability must be restored.

MINOR-SEVERITY

indicates that a non-service-affecting fault condition exists and that corrective action should be taken in order to prevent a more serious (for example, service-affecting) fault. Such a severity can be reported, for example, when the dedicated alarm condition is not currently degrading the capacity of the resource.

WARNING-SEVERITY

indicates the detection of a potential or impending service-affecting fault before any significant effects have been felt. Action should be taken to further diagnose (if necessary) and correct the problem in order to prevent it from becoming a more serious service-affecting fault.
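The rule that report-severity is the highest (most critical) perceived severity of all alarms in the report can be sketched in Python as follows. The numeric ranks and the function name are assumptions made for illustration; the manual gives only the four severity names, and their relative order here is inferred from the definitions above.

```python
from typing import Iterable

# Assumed ranking, from least to most critical, inferred from the
# severity definitions; the numeric values themselves are arbitrary.
SEVERITY_RANK = {
    "WARNING-SEVERITY": 1,
    "MINOR-SEVERITY": 2,
    "MAJOR-SEVERITY": 3,
    "CRITICAL-SEVERITY": 4,
}


def report_severity(alarm_severities: Iterable[str]) -> str:
    """Return the most critical severity among the alarms in a report.

    Raises ValueError if the report contains no alarms and KeyError if a
    severity name is not one of the four defined values.
    """
    return max(alarm_severities, key=lambda name: SEVERITY_RANK[name])
```

For example, a report whose alarms are MINOR-SEVERITY, MAJOR-SEVERITY, and WARNING-SEVERITY would be reported with MAJOR-SEVERITY.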

report-severity

is the highest (most critical) perceived severity of all the alarms contained in the system-load report. The values for this variable are defined for message 3001.

system-serial-number

is the serial number of the NonStop Kernel from which the system-load report originated.

resource-name

is the name of the system resource involved in the most critical alarm in the system‑load report.

specific-problem

is the specific problem of the most critical alarm in the system-load report.

related-alarms

is a count of the number of alarms in the system-load report.

Cause  The NonStop Kernel identified by system-serial-number has completed a system load. The Syshealth Health Monitor has generated a summary of the outstanding alarms after the system load finished.

Effect  If the system-load report contains alarms, then one or more system resources have been compromised. The system-load report contains a detailed status for each problem resource.

Recovery  On a remote system, use Syshealth to examine this system-load report on the Syshealth Main screen and the Alarm Viewing screen. If corrective action is needed, dial in to the system that generated the report. On a local system or while dialed in remotely, use Syshealth and other Tandem Maintenance and Diagnostic Subsystem (TMDS) diagnostic tools to locate and repair the problem.



3001

Problem report of report-severity severity for system‑serial-number system. resource-name:specific-problem. There are related-alarms related alarms.

system-serial-number

is the serial number of the NonStop Kernel from which the problem report originated.

report-severity

defines the perceived severity of the problem report. The levels of severity correspond to the Open Systems Interconnection (OSI) perceived severity definitions for managed objects. This value is determined by using the highest (most critical) perceived severity of all the alarms contained in the problem report.

This variable can have one of the following values:

CRITICAL-SEVERITY

indicates that a service-affecting condition has occurred and an immediate corrective action is required. Such a severity can be reported, for example, when a resource becomes totally out of service and its capability must be restored.

MAJOR-SEVERITY

indicates that a service-affecting condition has developed and urgent corrective action is required. Such a severity can be reported, for example, when there is the potential for a single point of failure (fault tolerance lost) or severe degradation in resource capability, and full capability must be restored.

MINOR-SEVERITY

indicates that a non-service-affecting fault condition exists and that corrective action should be taken in order to prevent a more serious (for example, service-affecting) fault. Such a severity can be reported, for example, when the dedicated alarm condition is not currently degrading the capacity of the resource.

WARNING-SEVERITY

indicates the detection of a potential or impending service-affecting fault, before any significant effects have been felt. Action should be taken to further diagnose (if necessary) and correct the problem in order to prevent it from becoming a more serious service-affecting fault.

resource-name

is the name of the system resource involved in the most critical alarm in the problem report.

specific-problem

is the specific problem of the most critical alarm in the problem report.

related-alarms

is a count of the number of alarms in the problem report.

Cause  The Health Monitor on the indicated system has detected faults or error conditions in one or more system resources. The severity field defines the urgency of the problem report.

Effect  System resources have been compromised. The problem report contains a detailed status for each problem resource.

Recovery  On a remote system, use Syshealth to decode this problem report on the Syshealth Main screen and the Alarm Viewing screen. If corrective action is needed, dial in to the system that generated the report. On a local system or while dialed in remotely, use Syshealth and other Tandem Maintenance and Diagnostic Subsystem (TMDS) diagnostic tools to locate and repair the problem.



3002

System summary report for system-serial-number system.

system-serial-number

is the serial number of the NonStop Kernel from which the system report originated.

Cause  The Syshealth Health Monitor is reporting a summary of system activity since the last system summary report.

Effect  None

Recovery  Informational message only; no corrective action is needed.



3100

System Health Monitor Internal Error. Error Level: error‑level Error Type: error-type Error Code: error-code Error Tag: error-tag

error-level

is the severity level of the error.

error-type

is the type of internal error that occurred; for example, Event Dispatcher Error.

error-code

further defines the cause of the error. This value is different for each error type.

error-tag

is the location in code at which the internal error occurred. The value has the format FILENAME_LINE-NUMBER. For example, if an error occurred in source file DISPC at line 25, the value for error-tag is DISPC_25.
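When these messages are processed by a script, the four labeled fields and the FILENAME_LINE-NUMBER structure of error-tag can be pulled apart with a regular expression. The following Python sketch assumes the message text appears on one line exactly as shown in the template above; the function name is hypothetical, and real log text may be laid out differently.

```python
import re
from typing import Dict

# Matches the labeled fields in the order shown in the message template.
_FIELDS = re.compile(
    r"Error Level:\s*(?P<level>.+?)\s+"
    r"Error Type:\s*(?P<type>.+?)\s+"
    r"Error Code:\s*(?P<code>.+?)\s+"
    r"Error Tag:\s*(?P<tag>\S+)"
)


def parse_health_monitor_error(text: str) -> Dict[str, object]:
    """Split a Health Monitor error message into its fields.

    The error-tag is further split at its last underscore into the source
    file name and the line number (for example, DISPC_25).
    """
    match = _FIELDS.search(text)
    if match is None:
        raise ValueError("text does not match the expected message layout")
    fields: Dict[str, object] = dict(match.groupdict())
    source_file, _, line = match.group("tag").rpartition("_")
    fields["source_file"] = source_file
    fields["line_number"] = int(line)  # ValueError if the tag has no line number
    return fields
```

Splitting at the last underscore is a design choice that tolerates source file names that themselves contain underscores.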

Cause  The Syshealth Health Monitor encountered an unexpected internal error during execution.

Effect  The Syshealth Health Monitor abends.

Recovery  The Persistence Monitor restarts the Health Monitor a certain number of times in case the condition is due to an intermittent problem. This message should be reported to Tandem Development.



3101

System Health Monitor External Error. Error Level: error‑level Error Type: error-type Error Code: error-code Error Tag: error-tag

error-level

is the severity level of the error.

error-type

is the type of error that occurred. The following error types are defined:

  • Subsystem Programmatic Interface (SPI) error
  • Operating system procedure call failed
  • Library function failed
  • Variable overflowed
  • Event Management Service (EMS) Distributor failed
  • Dynamic System Configuration (DSC) NEWPROCESS call failed
  • DSC SPI open failed
  • DSC communication error occurred
  • DSC returned unexpected error
  • Subsystem Control Point (SCP) NEWPROCESS call failed
  • SCP SPI open failed
  • SCP communication error occurred
  • SCP returned unexpected error
  • Memory error occurred
  • String list error occurred
  • Task Queue error occurred
  • Event missing required tokens
  • Required resource object missing
  • Open of $0 collector failed
  • Open of Tandem Maintenance and Diagnostic Subsystem (TMDS) collector failed
  • Unable to load required filter file
  • Alarm missing required tokens
  • Error accessing help file
  • Alarm database operation error
  • Error accessing configuration file
  • $RECEIVE input/output (I/O) error
  • Scripting error
  • Program abnormally terminated
  • Event Dispatcher error
  • Program terminated because of too many takeovers

error-code

further defines the cause of the error. This value is different for each error type.

error-tag

is the location in code at which the internal error occurred. The variable has the format FILENAME_LINE-NUMBER. For example, if an error occurred in source file DISPC at line 25, the value for error-tag is DISPC_25.

Cause  The Syshealth Health Monitor encountered an external error during execution.

Effect  The Syshealth Health Monitor stops.

NOTE: The Persistence Monitor does not restart the Health Monitor for an external error condition, because the Health Monitor cannot continue to execute until the external problem is resolved.

Recovery  Fix the external error condition and then start the Health Monitor from the Syshealth Management screen.



4001

Authorization to Deliver Remote Notification Needed. Notification ID = action-id

action-id

is the ID (a 16-bit integer) of the notification that needs to be authorized or denied authorization. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause  An occurrence on the indicated system has triggered a remote notification. What occurrences result in a notification depends on the configuration of the notification module on that system. Typically, notification triggers include problem reports (arising from system resources that have encountered a problem), system-load summary reports, or periodic system reports.

Effect  A system resource on the specified system may be unavailable.

Recovery  Run Syshealth on the indicated system, and examine unauthorized notifications through the notification screen. Either authorize or deny authorization to pending notifications. Authorized notifications are forwarded through the configured notification ports (typically to the NonStop Support Center).



4002

Authorization to Deliver Remote Notification disposition. Notification ID = action-id.

disposition

indicates whether authorization has been granted. It can have the value GRANTED or DENIED.

action-id

is the ID (a 16-bit integer) of the notification that was authorized or denied authorization. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause  A Syshealth user has granted or denied authorization for delivery of a pending notification.

Effect  If authorization was granted, the indicated notification is delivered to all appropriate destinations.

Recovery  Informational message only; no corrective action is needed.



4003

Dial-out test performed

Cause  A Syshealth user has issued a Test Dial-out command.

Effect  This message causes a dial-out to occur unless dial-out is disabled. Use the Syshealth Remote Notification screen to examine the test results.

Recovery  Informational message only; no corrective action is needed.



4004

Remote Notification Port Failing. error-text

error-text

is the text of the last port error message.

Cause  A notification path is failing. Examine the error-text portion of the message to determine the exact cause.

Effect  Remote notification for system resource problems is not occurring on the specified system.

Recovery  Examine the error message and take corrective action. If no corrective action is possible, disable the notification port through the Syshealth user interface. Notify the Global NonStop Solution Center (GNSC) of system failures manually until the port is repaired.



4005

REMOTE NOTIFICATION PROCESS notif-process ENCOUNTERED EXCEPTION. CODE = module-error:text

notif-process

is the name of the notification port process that encountered the exception.

module-error

is the fatal error that was encountered by the notification module.

text

is the text of the last notification module error message.

Cause  The notification module encountered an exception that it recovered from. Typical exceptions include the inability of the module to find or read the notification database, a filter file, or other file.

Effect  The effect of this error depends on the nature of the exception. If the notification module does not find the database or cannot read it, the notification module creates a new one. Filter files that cannot be found or read are ignored, leading to an increased number of dial-outs. In any case, the notification function continues.

Recovery  Recovery action depends on the type of exception.



4006

REMOTE NOTIFICATION PROCESS notif-process FAILED. ERROR = module-error:text

notif-process

contains the name of the notification port process that failed.

module-error

is the fatal error that was encountered by the notification module.

text

is the text of the last notification module error message.

Cause  The notification module encountered an unrecoverable internal error.

Effect  Remote notification is unavailable for a brief period, until the Syshealth Persistence Monitor restarts it.

Recovery  Informational message only; no corrective action is needed.



4007

Remote notification redelivery started. Notification ID = action-id.

action-id

is the ID (a 16-bit integer) of the notification that was redelivered. This value is the same as the notification identifier used in notification Subsystem Programmatic Interface (SPI) commands.

Cause  A Syshealth user has requested redelivery of a remote notification that could not be delivered earlier.

Effect  The indicated notification will be delivered to all appropriate destinations.

Recovery  Informational message only; no corrective action is needed.