Skip to content

Health Checks

Health checks monitor the condition of your hosts through automated verifications. You define checks once and assign them to any number of hosts.

Concept

A health check consists of:

  1. Check Definition - What is being verified? (Name, category, scripts)
  2. Severity Mapping - How are exit codes interpreted?
  3. Auto-Remediation - Optional automatic remediation
  4. Assignments - Which hosts are checked and when?

Managing Checks

Access check management via the Checks navigation item.

Creating a Check

  1. Click Create Check
  2. Configure the check:
FieldDescriptionRequired
NameName of the checkYes
DescriptionWhat does this check verify?No
CategoryOrganizational category (freely selectable)No
ModuleWhich modules the check is available forYes
ScriptsOne or more scripts from the libraryYes
EnabledWhether the check can be executedYes

Categories

Categories are used to organize your checks. The category name is freely selectable, common examples:

CategoryTypical Checks
diskDisk usage, SMART status
memoryRAM utilization, swap usage
servicesService availability, process checks
securitySecurity policies, certificate expiration
customUser-defined checks (default)

Severity Mapping

The severity mapping assigns a severity level to each exit code:

Exit CodeSeverityMeaning
0OKCheck passed
1WarningWarning
2CriticalCritical
3+UnknownUnknown

Nagios-Compatible

The mapping follows the Nagios plugin convention. Existing monitoring scripts can be reused directly.

Multi-Script Checks

A check can contain multiple scripts that are executed sequentially:

  • Scripts are executed in the defined order
  • When Stop on Failure is enabled (default: Yes), execution is aborted on the first error
  • The overall status is determined by the worst individual result

Auto-Remediation

Automatic remediation executes a remediation script when a check fails.

Configuration

FieldDescriptionDefault
Remediation enabledEnable/disable auto-remediationNo
Remediation scriptScript to execute on failure-
Max. attemptsMaximum number of remediation attempts per failure series (1-10)3
CooldownMinimum time between remediation attempts (in seconds, min. 60)300 (5 min.)

Workflow

  1. Check is executed and fails (Status: Warning or Critical)
  2. The consecutive failures counter is incremented
  3. When the failure count reaches the Grace Failures limit (default: 1):
    • And the maximum attempt count has not been exceeded
    • And the cooldown since the last attempt has elapsed → The remediation script is executed
  4. The remediation script runs with elevated priority
  5. On the next successful check, all counters are reset

Example

Check: "Nginx is running"
Script: Checks if nginx is active (exit 0 = ok, exit 2 = critical)
Remediation script: systemctl restart nginx
Max. attempts: 3
Cooldown: 300 seconds

Workflow:
1. Check returns "Critical" → Counter: 1/1 (Grace reached)
2. → Remediation: nginx is restarted (attempt 1/3)
3. Next check: Still "Critical" → Counter: 2
4. → Check cooldown: 300s elapsed? → Remediation (attempt 2/3)
5. Next check: "OK" → All counters reset

Assigning Checks

Creating an Assignment

  1. Navigate to Checks > Assignments
  2. Click Create Assignment
  3. Configure:
FieldDescription
CheckWhich check should be executed
ModuleTarget module
TargetSingle host, group, or customer
ScheduleWhen the check should be executed

Schedule Options

TypeDescription
One-timeSingle execution at the scheduled time
DailyEvery day at the specified time
WeeklyOn the selected day of the week at the specified time
MonthlyOn the selected day of the month
CronFree cron expression (e.g., */5 * * * * for every 5 minutes)

Single Assignment

  1. Open a host in the detail modal
  2. Select the Checks tab
  3. Click Assign Check
  4. Select the check from the list
  5. Save

Custom Field Mappings

Checks can write script outputs to custom fields. To do this, the check script must output special lines:

Script Output Format

bash
#!/bin/bash
USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
echo "FIELD:disk_usage=$USAGE"

if [ "$USAGE" -gt 90 ]; then
  echo "CRITICAL: Disk usage at $USAGE%"
  exit 2
else
  echo "OK: Disk usage at $USAGE%"
  exit 0
fi

Lines with the prefix FIELD: are parsed by the backend:

  • Format: FIELD:fieldname=value
  • The fieldname is mapped to a custom field via the check's Custom Field Mappings
  • The value is automatically saved in the corresponding custom field of the host

Results

In the Host Detail

In the Checks tab of the detail modal, you can see:

  • All assigned checks
  • Latest results with severity color
  • Timestamp of the last check
  • Output of the check script
  • Whether a remediation was triggered

Color Coding

ColorSeverityMeaning
GreenOKEverything is fine
YellowWarningAttention required
RedCriticalImmediate action needed
GrayUnknownCould not be verified

Result Details

Each check result contains:

  • Status (ok/warning/critical/unknown)
  • Exit code of the script
  • Stdout - Standard output
  • Stderr - Error output
  • Custom Field Updates - Which fields were updated
  • Remediation triggered - Whether auto-remediation was triggered
  • Start/End time and duration

Retention

Check results are automatically deleted after 30 days.

Execution

Checks are executed:

  • Scheduled according to the configured schedule of the assignment
  • Manually via the "Run now" button in the assignment
  • After updates as a health check in update schedules

Example Checks

Check Disk Usage

bash
#!/bin/bash
USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
if [ "$USAGE" -gt 90 ]; then
  echo "CRITICAL: Disk usage at $USAGE%"
  exit 2
elif [ "$USAGE" -gt 80 ]; then
  echo "WARNING: Disk usage at $USAGE%"
  exit 1
else
  echo "OK: Disk usage at $USAGE%"
  exit 0
fi

Check Service

bash
#!/bin/bash
SERVICE="nginx"
if systemctl is-active --quiet $SERVICE; then
  echo "OK: $SERVICE is running"
  exit 0
else
  echo "CRITICAL: $SERVICE is not active"
  exit 2
fi

Check Certificate

bash
#!/bin/bash
DOMAIN="example.com"
DAYS=$(echo | openssl s_client -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
REMAINING=$(( ($(date -d "$DAYS" +%s) - $(date +%s)) / 86400 ))

if [ "$REMAINING" -lt 7 ]; then
  echo "CRITICAL: Certificate expires in $REMAINING days"
  exit 2
elif [ "$REMAINING" -lt 30 ]; then
  echo "WARNING: Certificate expires in $REMAINING days"
  exit 1
else
  echo "OK: Certificate valid for $REMAINING days"
  exit 0
fi

DATAZONE Control Documentation