DEADMAN
Deadman attempts to detect when a system is not operating properly and to reboot the system when this occurs.
The program is distributed as a ZIP package: download it to a temporary directory and unpack it to the destination folder. See below for download link(s).
The following download links are for manual installation:
Deadman v. 0.7 (11/7/2022, Steven Levine) | Readme/What's new |
deadman user guide v0.7
2017-12-05 SHL Version 0.1
2022-04-03 SHL Version 0.3
2022-06-21 SHL Version 0.5
2022-06-23 SHL Version 0.6
2022-06-23 SHL Version 0.7
== Introduction ==
Deadman attempts to detect when one or more of a known set of problems
that might occur on a running system and to take appropriate recovery
actions when one of these problems is detected.
Deadman is an evolving application. New features are added when new
failure modes and/or new recovery modes are discovered.
Deadman was originally written to keep apache httpd servers that I
maintain up and running with minimal human interaction, so some of deadman's
features are specific to httpd servers. Other features are more generic and
may be useful with other applications.
Deadman logs its actions to the deadman.log log file. The log file will be
written to the %LOGFILES% directory if defined. Otherwise it will be
written to the %TEMP% directory.
Deadman also logs its actions to STDOUT, unless it is running detached.
Deadman writes its PID to deadman.pid in the %TEMP% directory. This allows
other processes to check and/or control deadman.
== Usage ==
Deadman is a VIO command line application which is typically run detached.
Output is written to the standard output, if deadman is not running
detached, and to the log file (%LOGFILES%\deadman.log). The log file
entries are timestamped so that they can be correlated with information
from other timestamped logs.
Each log file entry includes a message id of the form (#number). The id
number can be used to locate the code that generated the message, if
needed.
To display the help screen, enter
deadman.exe -?
at the command line. The help screen currently displays as:
The deadman daemon checks system health based on the configuration file settings.
See deadman.txt for a detailed description of operation and options.
deadman [-c] [-h] [-s] [-t] [-v] [-V] [-?] [cfgfile]
-c Check daemon status
-h -? Display this message
-s Stop daemon
-t Run in TEST mode
-v Display verbose status
-V Display version
cfgfile Configuration file to process
Copyright (c) 2008-2022 Steven Levine and Associates, Inc.
All rights reserved.
== Theory of operation ==
Deadman attempts to monitor system health by watching the state of a set of
user-selected files, as defined in the configuration file. Deadman can:
- monitor one or more transaction log files for activity
- monitor one or more error log files for certain errors
- reboot the system on request
When monitoring a transaction log file for activity, deadman expects the
file size to increase over time. If the file size fails to increase for
longer than the configured interval, deadman will attempt to reboot the
system after the reboot delay expires. The check interval and the reboot
delay interval are both configurable. Deadman contains logic to handle log
rotation which will cause the log file size to be reduced.
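
The transaction log check described above can be sketched in Python. This is only an illustration, not deadman's actual code: the TransLogState class and the check_translog function are hypothetical names, and treating any size change (including a decrease caused by log rotation) as activity is an assumption about how rotation might be handled.

```python
from dataclasses import dataclass


@dataclass
class TransLogState:
    last_size: int       # size seen at the previous check, in bytes
    last_change: float   # time of the last observed size change, in seconds


def check_translog(state, size_now, now, interval_sec):
    """Return True when the log has shown no activity for longer than interval_sec."""
    if size_now != state.last_size:
        # Growth counts as activity. A smaller size is assumed to mean the
        # log was rotated, so the baseline is reset instead of flagging a stall.
        state.last_size = size_now
        state.last_change = now
        return False
    return (now - state.last_change) > interval_sec
```

A caller would invoke check_translog once per check interval and schedule a reboot when it returns True.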
When monitoring an error log file for errors, deadman will check the log
file for known errors at configurable intervals. The set of known errors
is currently:
- httpd cannot create child process
As deadman evolves additional checks may be implemented.
When one of the known errors is detected, deadman will perform error
specific recovery actions. If the recovery actions fail, deadman will
attempt to reboot the system after the reboot delay expires. The check
interval and the reboot delay interval are both configurable.
When monitoring for reboot requests, deadman checks if the reboot request
file has been created. When the file is created, deadman will attempt to
reboot the system. If the reboot request file is not empty, deadman will
write the first line of the file to the deadman log file to record the
reason for the reboot.
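
Another process can request a reboot simply by creating the reboot request file. The Python sketch below shows one way to do this and to read the reason back. The function names request_reboot and read_reboot_reason are hypothetical, and the file path is whatever the rebootfile keyword names in your configuration.

```python
from pathlib import Path


def request_reboot(reboot_file, reason=""):
    """Create the reboot request file; the optional reason goes on the first line."""
    text = reason + "\n" if reason else ""
    Path(reboot_file).write_text(text, encoding="utf-8")


def read_reboot_reason(reboot_file):
    """Return the first line of the request file, '' if the file is empty,
    or None if no request file exists."""
    path = Path(reboot_file)
    if not path.exists():
        return None
    with path.open(encoding="utf-8", errors="replace") as f:
        return f.readline().rstrip("\r\n")
```

For example, request_reboot(r"d:\apps\apache24\reboot-me-now", "httpd upgrade") would create the file named in the sample configuration file with a one line reason.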
== Sample configuration file ==
The Configuration File section describes the configuration file in more
detail.
; hostname: steven, domain: www.scoug.com
; checks error log for child process start failures
; checks transaction log for lack of activity
; checks transaction log for lack of activity
; 2022-04-03 SHL Baseline - steven
translogfile = d:\logs\apache\scoug-combined_log
processname = httpd
TransLogCheckIntervalSec = 60 ; 1 minute
errlogfile = d:\logs\apache\scoug-error_log
ErrorLogCheckIntervalSec = 30
rebootfile = d:\apps\apache24\reboot-me-now
SleepSec = 10
RebootDelaySec = 3600 ; 1 hour, 0 suppresses reboots
ForceStatusSec = 21600 ; 6 hours
== Sample command lines ==
To start deadman in VIO mode:
start "deadman" deadman d:\apps\bin\deadman.cfg
To start deadman detached:
detach "deadman" deadman d:\apps\bin\deadman.cfg
To check if deadman daemon is running:
deadman -c
To stop the running instance of deadman:
deadman -s
== The Configuration File ==
Deadman is controlled by the settings provided in the configuration file.
The configuration file contains one statement per line. Each statement
consists of a keyword and a value. The configuration file may contain
comments and blank lines.
All keywords are optional. If a keyword enables a feature, the feature
will not be enabled if the keyword is omitted. If the keyword sets a time
interval, a default interval will be set if the keyword is omitted.
The translogfile keyword names a transaction log file and enables the
transaction log file monitor feature. Deadman monitors this file for
growth. If the file stops growing for longer than the configured
interval, deadman will schedule a reboot. There is no default for the
transaction log file. If this keyword is omitted, transaction log
monitoring will not be enabled. To monitor multiple transaction log
files, specify each log file in a separate translogfile statement.
The translogcheckintervalsec keyword defines how often the
transaction log monitor feature will check the transaction log file. If
this keyword is omitted, the default check interval is 600 seconds (i.e. 10
minutes).
The processname keyword names the process that is responsible for writing
to the configured transaction log file. If a process name is defined,
deadman monitors the processes with this name. If there are no instances
with this process name running, deadman assumes that the user has stopped
the processes for maintenance and suspends transaction log file monitoring
until one or more instances of the process are restarted. This prevents
deadman from rebooting during planned shutdowns of these processes.
There is no default for the process name. If this keyword is omitted,
process monitoring will not be enabled.
The errlogfile keyword names an apache httpd error log file and enables the
httpd error log monitoring feature. Deadman will monitor the error log
file for httpd child create failures. There is no default for the error
log file. If this keyword is omitted, error log monitoring will not be
enabled. To monitor multiple error log files, specify each log file in a
separate errlogfile statement.
The errorlogcheckintervalsec keyword defines how often deadman will check
the configured error log file. If this keyword is omitted, the default
check interval is 600 seconds (i.e. 10 minutes).
The rebootfile keyword names the reboot request file and enables the
reboot request feature. If this file exists, deadman will reboot the
system. If this file exists when deadman is started, it will be deleted to
prevent a stale reboot request file from triggering a reboot. There is no
default for the reboot request file. If the keyword is omitted, reboot
request monitoring will not be enabled.
The sleepsec keyword defines how long deadman sleeps between check cycles.
If this keyword is omitted, the default interval is 30 seconds.
The rebootdelaysec keyword defines how long deadman waits after scheduling
a reboot to perform the reboot. This allows for intermittent errors to be
reported without forcing an unneeded reboot. If this keyword is
omitted, the default delay interval is 30 seconds.
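
The reboot delay can be pictured as a small state machine. The Python sketch below is an illustration under stated assumptions, not deadman's implementation: the RebootScheduler class is a hypothetical name, and cancelling a pending reboot when the problem clears is inferred from the description above. The rule that a delay of 0 suppresses reboots comes from the sample configuration file.

```python
class RebootScheduler:
    def __init__(self, delay_sec):
        self.delay_sec = delay_sec
        self.scheduled_at = None   # time the pending reboot was scheduled

    def update(self, problem_present, now):
        """Return True when the reboot should be performed now."""
        if self.delay_sec == 0:
            return False               # 0 suppresses reboots
        if not problem_present:
            self.scheduled_at = None   # assumed: an intermittent error cleared
            return False
        if self.scheduled_at is None:
            self.scheduled_at = now    # first sighting: start the delay
        return (now - self.scheduled_at) >= self.delay_sec
```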
The forcestatussec keyword defines how long deadman will wait before
writing a proof of life message to the deadman log file. If this keyword
is omitted, the default reporting interval is 21,600 seconds (i.e. 6 hours).
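
To make the file format concrete, here is a minimal Python sketch of a parser for configuration files like the sample shown earlier. It is not deadman's own parser. It assumes that ";" starts a comment anywhere on a line, that keywords are case-insensitive (the sample file uses mixed case), and that the defaults are the values in seconds stated in this section. The names DEFAULTS, MULTI and parse_config are hypothetical.

```python
DEFAULTS = {
    "translogcheckintervalsec": 600,
    "errorlogcheckintervalsec": 600,
    "sleepsec": 30,
    "rebootdelaysec": 30,
    "forcestatussec": 21600,
}
# Keywords that may appear more than once, one file per statement.
MULTI = ("translogfile", "errlogfile")


def parse_config(text):
    """Parse 'keyword = value' statements into a dict with defaults applied."""
    cfg = dict(DEFAULTS)
    for key in MULTI:
        cfg[key] = []
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()   # drop comments (assumed syntax)
        if not line:
            continue                          # blank or comment-only line
        keyword, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'keyword = value': {raw!r}")
        keyword = keyword.strip().lower()
        value = value.strip()
        if keyword in MULTI:
            cfg[keyword].append(value)
        elif keyword in DEFAULTS:
            cfg[keyword] = int(value)         # time intervals are in seconds
        else:
            cfg[keyword] = value              # e.g. processname, rebootfile
    return cfg
```

Features that have no default, such as rebootfile, are simply absent from the result when the keyword is omitted, which mirrors the rule that an omitted keyword leaves the feature disabled.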
== Tuning deadman ==
Every system is different. The goal of tuning the deadman timing
parameters is to check often enough so that problems can be detected and
effectively handled, while at the same time minimizing false positives and
not checking so often as to waste system resources that could be better
used elsewhere.
When tuning deadman, it is recommended that deadman be run in test mode
(i.e. -t). Test mode suppresses reboots and reduces the forcestatussec
check interval, which makes the tuning process more efficient.
When tuning deadman, the deadman log file can be helpful. Look for
spurious reports that can be avoided by optimizing the timing parameters.
Sleepsec defines the minimum reasonable value for all the other checking
intervals.
Translogcheckintervalsec should be set large enough to avoid most false
positives, but small enough so that any reboot attempt occurs before the
system has become so unstable that the reboot attempt will fail.
Errorlogcheckintervalsec should be set large enough to avoid wasting system
resources, but small enough so that the recovery attempt has a high
probability of success.
Rebootdelaysec should be set large enough to allow intermittent reboot requests to clear,
but small enough so that the reboot attempt occurs before the system has
become so unstable that the reboot request will fail.
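
The rule that sleepsec sets the floor for the check intervals can be tested before starting deadman. The Python sketch below is a hypothetical helper, not part of deadman. It assumes the settings are available as a dict keyed by lowercase keyword names, and it only looks at the two check interval keywords.

```python
CHECK_KEYS = ("translogcheckintervalsec", "errorlogcheckintervalsec")


def tuning_warnings(settings):
    """Return a warning for each check interval that is shorter than sleepsec."""
    sleep = settings["sleepsec"]
    warnings = []
    for key in CHECK_KEYS:
        value = settings.get(key)
        if value is not None and value < sleep:
            warnings.append(f"{key}={value} is shorter than sleepsec={sleep}")
    return warnings
```

With the values from the sample configuration file (sleepsec 10, check intervals 60 and 30), the function returns an empty list.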
== Running multiple deadman instances ==
If needed, you can run multiple instances of deadman.
To do this, make a copy of deadman.exe giving it a unique name (e.g.
deadman2.exe) and run the copy with a unique configuration file.
The deadman log file name, the deadman pid file name and the default
configuration file name are determined by the deadman executable's name so
there will be no conflict with other running deadman instances.
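
The naming rule can be illustrated with a short Python sketch. It is an assumption-laden illustration, not deadman's code: the instance_files function is a hypothetical name, and the directories follow the rules given in the Introduction (log file in %LOGFILES% if defined, otherwise %TEMP%; pid file in %TEMP%). The default configuration file name is left out because its exact form is not stated here.

```python
from pathlib import PureWindowsPath


def instance_files(exe_path, env):
    """Return the log and pid file paths for a given deadman executable name."""
    stem = PureWindowsPath(exe_path).stem        # "deadman2" for deadman2.exe
    temp_dir = PureWindowsPath(env["TEMP"])
    log_dir = PureWindowsPath(env["LOGFILES"]) if env.get("LOGFILES") else temp_dir
    return {
        "log": log_dir / f"{stem}.log",
        "pid": temp_dir / f"{stem}.pid",
    }
```

Because each copy of the executable has a different stem, two instances started from deadman.exe and deadman2.exe end up with different log and pid files.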
== Requirements ==
The dos.sys driver must be installed. This driver provides application
level access to the DosReboot DevHlp API.
== Known issues ==
None
== Ideas for the future ==
- Enhance the error log monitor feature to detect more types of errors and
provide recovery support.
- Support units of measure for numeric values
- Support deadmanlogfile keyword.
- Support deadmanpidfile keyword.
== Copyright and License ==
COVERED CODE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS,
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
WITHOUT LIMITATION, WARRANTIES THAT THE COVERED CODE IS FREE OF
DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR NON-INFRINGING.
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE COVERED CODE
IS WITH YOU. SHOULD ANY COVERED CODE PROVE DEFECTIVE IN ANY RESPECT,
YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE
COST OF ANY NECESSARY SERVICING, REPAIR OR CORRECTION. THIS DISCLAIMER
OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. NO USE OF
ANY COVERED CODE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS DISCLAIMER.
Copyright (c) 2008-2022 Steven Levine and Associates, Inc.
All rights reserved.
Deadman is provided AS-IS, WITHOUT ANY WARRANTY OF ANY KIND, EITHER
EXPRESS, IMPLIED OR STATUTORY, not even any implied warranty of
MERCHANTABILITY.
YOUR USE OF THIS PRODUCT IS CONDITIONED UPON YOUR ACCEPTANCE OF THIS
LICENSE AGREEMENT. INSTALLING AND/OR USING THE PRODUCT INDICATES YOUR
ACCEPTANCE OF THESE TERMS AND CONDITIONS. IF YOU DO NOT AGREE TO THESE
TERMS AND CONDITIONS PROMPTLY DELETE THIS PRODUCT.
You are granted a non-exclusive, non-assignable, non-transferable
right to use deadman.exe.
== eof ==
www.warpcave.com/betas/deadman-0.7-20220711.zip
This work is licensed under a Creative Commons Attribution 4.0 International License.