Health Checks
Health checks monitor the condition of your hosts through automated verifications. You define checks once and assign them to any number of hosts.
Concept
A health check consists of:
- Check Definition - What is being verified? (Name, category, scripts)
- Severity Mapping - How are exit codes interpreted?
- Auto-Remediation - Optional automatic remediation
- Assignments - Which hosts are checked and when?
Managing Checks
Access check management via the Checks navigation item.
Creating a Check
- Click Create Check
- Configure the check:
| Field | Description | Required |
|---|---|---|
| Name | Name of the check | Yes |
| Description | What does this check verify? | No |
| Category | Organizational category (freely selectable) | No |
| Module | Which modules the check is available for | Yes |
| Scripts | One or more scripts from the library | Yes |
| Enabled | Whether the check can be executed | Yes |
Categories
Categories are used to organize your checks. The category name is freely selectable, common examples:
| Category | Typical Checks |
|---|---|
disk | Disk usage, SMART status |
memory | RAM utilization, swap usage |
services | Service availability, process checks |
security | Security policies, certificate expiration |
custom | User-defined checks (default) |
Severity Mapping
The severity mapping assigns a severity level to each exit code:
| Exit Code | Severity | Meaning |
|---|---|---|
| 0 | OK | Check passed |
| 1 | Warning | Warning |
| 2 | Critical | Critical |
| 3+ | Unknown | Unknown |
Nagios-Compatible
The mapping follows the Nagios plugin convention. Existing monitoring scripts can be reused directly.
Multi-Script Checks
A check can contain multiple scripts that are executed sequentially:
- Scripts are executed in the defined order
- When Stop on Failure is enabled (default: Yes), execution is aborted on the first error
- The overall status is determined by the worst individual result
Auto-Remediation
Automatic remediation executes a remediation script when a check fails.
Configuration
| Field | Description | Default |
|---|---|---|
| Remediation enabled | Enable/disable auto-remediation | No |
| Remediation script | Script to execute on failure | - |
| Max. attempts | Maximum number of remediation attempts per failure series (1-10) | 3 |
| Cooldown | Minimum time between remediation attempts (in seconds, min. 60) | 300 (5 min.) |
Workflow
- Check is executed and fails (Status: Warning or Critical)
- The consecutive failures counter is incremented
- When the failure count reaches the Grace Failures limit (default: 1):
- And the maximum attempt count has not been exceeded
- And the cooldown since the last attempt has elapsed → The remediation script is executed
- The remediation script runs with elevated priority
- On the next successful check, all counters are reset
Example
Check: "Nginx is running"
Script: Checks if nginx is active (exit 0 = ok, exit 2 = critical)
Remediation script: systemctl restart nginx
Max. attempts: 3
Cooldown: 300 seconds
Workflow:
1. Check returns "Critical" → Counter: 1/1 (Grace reached)
2. → Remediation: nginx is restarted (attempt 1/3)
3. Next check: Still "Critical" → Counter: 2
4. → Check cooldown: 300s elapsed? → Remediation (attempt 2/3)
5. Next check: "OK" → All counters resetAssigning Checks
Creating an Assignment
- Navigate to Checks > Assignments
- Click Create Assignment
- Configure:
| Field | Description |
|---|---|
| Check | Which check should be executed |
| Module | Target module |
| Target | Single host, group, or customer |
| Schedule | When the check should be executed |
Schedule Options
| Type | Description |
|---|---|
| One-time | Single execution at the scheduled time |
| Daily | Every day at the specified time |
| Weekly | On the selected day of the week at the specified time |
| Monthly | On the selected day of the month |
| Cron | Free cron expression (e.g., */5 * * * * for every 5 minutes) |
Single Assignment
- Open a host in the detail modal
- Select the Checks tab
- Click Assign Check
- Select the check from the list
- Save
Custom Field Mappings
Checks can write script outputs to custom fields. To do this, the check script must output special lines:
Script Output Format
#!/bin/bash
USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
echo "FIELD:disk_usage=$USAGE"
if [ "$USAGE" -gt 90 ]; then
echo "CRITICAL: Disk usage at $USAGE%"
exit 2
else
echo "OK: Disk usage at $USAGE%"
exit 0
fiLines with the prefix FIELD: are parsed by the backend:
- Format:
FIELD:fieldname=value - The
fieldnameis mapped to a custom field via the check's Custom Field Mappings - The value is automatically saved in the corresponding custom field of the host
Results
In the Host Detail
In the Checks tab of the detail modal, you can see:
- All assigned checks
- Latest results with severity color
- Timestamp of the last check
- Output of the check script
- Whether a remediation was triggered
Color Coding
| Color | Severity | Meaning |
|---|---|---|
| Green | OK | Everything is fine |
| Yellow | Warning | Attention required |
| Red | Critical | Immediate action needed |
| Gray | Unknown | Could not be verified |
Result Details
Each check result contains:
- Status (ok/warning/critical/unknown)
- Exit code of the script
- Stdout - Standard output
- Stderr - Error output
- Custom Field Updates - Which fields were updated
- Remediation triggered - Whether auto-remediation was triggered
- Start/End time and duration
Retention
Check results are automatically deleted after 30 days.
Execution
Checks are executed:
- Scheduled according to the configured schedule of the assignment
- Manually via the "Run now" button in the assignment
- After updates as a health check in update schedules
Example Checks
Check Disk Usage
#!/bin/bash
USAGE=$(df / | tail -1 | awk '{print $5}' | tr -d '%')
if [ "$USAGE" -gt 90 ]; then
echo "CRITICAL: Disk usage at $USAGE%"
exit 2
elif [ "$USAGE" -gt 80 ]; then
echo "WARNING: Disk usage at $USAGE%"
exit 1
else
echo "OK: Disk usage at $USAGE%"
exit 0
fiCheck Service
#!/bin/bash
SERVICE="nginx"
if systemctl is-active --quiet $SERVICE; then
echo "OK: $SERVICE is running"
exit 0
else
echo "CRITICAL: $SERVICE is not active"
exit 2
fiCheck Certificate
#!/bin/bash
DOMAIN="example.com"
DAYS=$(echo | openssl s_client -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
REMAINING=$(( ($(date -d "$DAYS" +%s) - $(date +%s)) / 86400 ))
if [ "$REMAINING" -lt 7 ]; then
echo "CRITICAL: Certificate expires in $REMAINING days"
exit 2
elif [ "$REMAINING" -lt 30 ]; then
echo "WARNING: Certificate expires in $REMAINING days"
exit 1
else
echo "OK: Certificate valid for $REMAINING days"
exit 0
fi