Alertmanager Integration
Native Prometheus Alertmanager support for battle-tested alert routing and notification delivery.
LogChef provides a powerful alerting system that continuously evaluates your log data against custom conditions. When thresholds are exceeded, alerts are automatically sent to Alertmanager, which routes notifications to your preferred channels like Slack, PagerDuty, email, or webhooks.
The alerting system is designed for production use with built-in reliability features including retry logic, delivery failure tracking, and comprehensive error handling.
What you do in LogChef: define alert conditions (LogChefQL or SQL), thresholds, evaluation frequency, and severity.

What Alertmanager does: groups, routes, and delivers notifications to channels such as Slack, PagerDuty, email, or webhooks.

Important to understand: LogChef evaluates alerts and fires them to Alertmanager; it does not deliver notifications itself. All notification routing lives in your Alertmanager configuration.
This separation means you can change notification channels (add Slack, remove email, etc.) without touching your alert definitions in LogChef.
Scenario: You want to get notified on Slack when your API has more than 100 errors in 5 minutes.
In LogChef (using LogChefQL - the simple way):
severity = "ERROR"The generated SQL will be: SELECT count(*) as value FROM logs WHERE (severity = 'ERROR') AND timestamp >= now() - toIntervalSecond(300)
In Alertmanager config:
```yaml
receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#api-alerts'
```

Result: Every minute, LogChef checks the error count. If it exceeds 100, it fires the alert to Alertmanager. Alertmanager sends a message to your Slack #api-alerts channel. You can see all firings in LogChef's alert history.
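The receiver above still needs a route pointing at it. A minimal sketch of the top-level routing tree, assuming you want every LogChef alert delivered to that Slack receiver:

```yaml
# Minimal routing tree: send every incoming alert to the Slack receiver defined above
route:
  group_by: ['alertname']
  receiver: 'slack-alerts'
```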
Alertmanager Integration
Native Prometheus Alertmanager support for battle-tested alert routing and notification delivery.
Simple & Advanced Modes
Use LogChefQL for simple filter conditions, or write raw ClickHouse SQL for complex queries.
Delivery Guarantees
Automatic retry logic with exponential backoff ensures alerts reach Alertmanager even during network issues.
Rich Metadata
Alerts include team and source names, custom labels, annotations, and direct links to the web UI.
LogChef’s alert manager runs in the background, continuously evaluating active alerts at the configured interval (default: 1 minute). When an alert’s SQL query returns a value that exceeds the threshold, the alert fires and is sent to Alertmanager with full context including labels, annotations, and metadata.
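Because each alert arrives with that context attached, Alertmanager notification templates can reference it directly. A small sketch of a Slack receiver that surfaces some of this metadata (the receiver name and channel are illustrative; the available label and annotation names are listed later on this page):

```yaml
receivers:
  - name: 'logchef-slack'   # illustrative receiver name
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#alerts'
        title: '{{ .CommonLabels.alertname }} ({{ .CommonLabels.severity }})'
        text: 'Team {{ .CommonLabels.team }}, source {{ .CommonLabels.source }}: {{ .CommonAnnotations.description }}'
```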
LogChef supports two ways to define alert conditions:
The simplest way to create alerts. Write a filter condition and LogChef generates the SQL automatically:
severity = "ERROR" or status_code >= 500Example conditions:
severity = "ERROR"severity = "ERROR" and service = "api"status_code >= 500response_time > 1000message ~ "timeout"For complex queries, switch to SQL mode and write raw ClickHouse SQL:
Your SQL query must return a single numeric column named `value` (see the SQL examples below).

Every alert requires the following components:

- Condition: a LogChefQL filter or a raw SQL query
- Threshold: a comparison operator (`>`, `>=`, `<`, `<=`, `==`, `!=`) and the value to compare against
- Frequency: how often the alert is evaluated
- Severity: `info`, `warning`, or `critical`

Monitor when error log count exceeds acceptable levels.
LogChefQL:
severity_text = "ERROR"SQL equivalent:
SELECT count(*) as valueFROM logsWHERE severity_text = 'ERROR' AND timestamp >= now() - toIntervalSecond(300)Threshold: Greater than 100 | Frequency: 60 seconds | Severity: Critical
Alert on HTTP 5xx errors for a specific service.
LogChefQL:
```
status_code >= 500 and service = "api-gateway"
```

Threshold: Greater than 10 | Frequency: 60 seconds | Severity: Warning
Alert on suspicious authentication activity.
LogChefQL:
body ~ "authentication failed"SQL equivalent:
SELECT count(*) as valueFROM logsWHERE body LIKE '%authentication failed%' AND timestamp >= now() - toIntervalSecond(900)Threshold: Greater than 10 | Frequency: 300 seconds | Severity: Warning
For complex aggregations, use SQL mode:
```sql
SELECT avg(JSONExtractFloat(log_attributes, 'response_time_ms')) as value
FROM logs
WHERE service = 'api-gateway' AND timestamp >= now() - toIntervalSecond(600)
```

Threshold: Greater than 500.0 (ms) | Frequency: 120 seconds | Severity: Warning
Monitor when connection pool usage is critically high:
```sql
SELECT max(JSONExtractInt(log_attributes, 'pool_active_connections')) as value
FROM logs
WHERE service = 'database-proxy' AND timestamp >= now() - toIntervalSecond(300)
```

Threshold: Greater than or equal to 95 | Frequency: 60 seconds | Severity: Critical
Alerting is configured through the Administration → System Settings → Alerts tab in the web interface. This provides a user-friendly way to manage all alert settings without editing configuration files.
Available Settings:

- Enable or disable alert evaluation and delivery
- Evaluation interval and default lookback window
- Alert history limit per alert
- Alertmanager URL, including optional HTTP Basic Auth credentials (format: `https://username:password@alertmanager.example.com`)
- Backend and frontend URLs used for API access and web UI links
- Request timeout and TLS verification options

On first boot, you can optionally seed alert settings from `config.toml`. After first boot, all changes must be made via the Admin Settings UI:
```toml
[alerts]
# Enable alert evaluation and delivery
enabled = true

# How often to evaluate all alerts (default: 1 minute)
evaluation_interval = "1m"

# Default lookback window if not specified in alert (default: 5 minutes)
default_lookback = "5m"

# Maximum alert history entries to keep per alert (default: 100)
history_limit = 100

# Alertmanager API endpoint
alertmanager_url = "http://alertmanager:9093"

# Backend URL for API access (used for fallback)
external_url = "http://localhost:8125"

# Frontend URL for web UI generator links
frontend_url = "http://localhost:5173"

# HTTP request timeout for Alertmanager communication
request_timeout = "5s"

# Skip TLS certificate verification (for development only)
tls_insecure_skip_verify = false
```

Note: After first boot, changes to `[alerts]` in `config.toml` are ignored. Use the Admin Settings UI to modify alert configuration.
If your Alertmanager requires authentication, include credentials in the URL using HTTP Basic Auth:
Format:
```
https://username:password@alertmanager.example.com
```

Examples:

```
# Basic authentication
https://admin:secretpass@alertmanager.internal:9093

# URL-encoded special characters in password
https://admin:my%40pass%3A123@alertmanager.internal:9093
```

Special character encoding:
- `@` → `%40`
- `:` → `%3A`
- `#` → `%23`
- `%` → `%25`
- space → `%20`

Before saving the Alertmanager configuration, use the Test Connection button in the Admin Settings UI to verify that LogChef can reach your Alertmanager instance.
The health check calls Alertmanager’s /api/v2/status endpoint and reports success or detailed error messages.
Configure Alertmanager to route LogChef alerts to your notification channels. Example configuration:
```yaml
global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity', 'team', 'source']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 12h

  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: 'slack-oncall'

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://webhook-receiver:8080/alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'

  - name: 'slack-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'LogChef Alert: {{ .GroupLabels.alertname }}'
```

Every alert includes these labels automatically:
- `alertname`: Name of the alert
- `alert_id`: Unique alert identifier
- `severity`: Alert severity level
- `status`: Current status (triggered or resolved)
- `team`: Human-readable team name
- `team_id`: Numeric team identifier
- `source`: Human-readable source name
- `source_id`: Numeric source identifier

Add custom labels to categorize and route alerts:
{ "env": "production", "service": "payment-api", "region": "us-east-1", "component": "database"}These labels can be used in Alertmanager routing rules to send alerts to appropriate teams or channels.
Annotations provide additional context that doesn’t affect routing:
- `description`: Alert description text
- `query`: The SQL query used for evaluation
- `threshold`: Threshold value and operator
- `value`: Actual value that triggered the alert
- `frequency_seconds`: Evaluation frequency
- `lookback_seconds`: Query lookback window

Custom annotations can be added for runbooks, dashboards, or documentation links:
{ "runbook": "https://wiki.example.com/runbooks/high-error-rate", "dashboard": "https://grafana.example.com/d/logs-overview", "playbook": "Check database connection pool and recent deployments"}The alerts dashboard provides a quick overview of all your alert rules with real-time status.
| Column | Description |
|---|---|
| Active | Toggle switch to quickly enable/disable alerts |
| Alert | Name, severity badge, and description |
| Condition | Threshold value and evaluation frequency |
| Status | Live indicator: 🔴 firing (pulsing) or 🟢 resolved |
| Last Triggered | When the alert last fired |
| Actions | Edit, view history, duplicate, delete |
Alert List View:
Alert Detail Page:
LogChef maintains a complete history of all alert evaluations.

What you can see in history: the evaluated value, whether the alert triggered or resolved, and any query or delivery errors.

Use alert history to verify that alerts fire as expected, tune thresholds, and investigate delivery failures.
If Alertmanager is temporarily unavailable, LogChef automatically retries alert delivery with exponential backoff.

Failed deliveries are recorded in alert history along with the delivery error.

Query failures and database issues are likewise captured as an error status in history.
Alert evaluations produce structured logs for observability:
```
// Successful evaluation
{"level":"DEBUG","msg":"alert evaluation complete","alert_id":1,"value":42.5,"triggered":true}

// Alert triggered
{"level":"INFO","msg":"alert triggered","alert_name":"High Error Rate","value":150,"threshold":100}

// Successful delivery
{"level":"INFO","msg":"alert successfully sent to Alertmanager","alert_id":1}

// Alert resolved
{"level":"INFO","msg":"alert resolved","alert_name":"High Error Rate","value":45}
```

Access the Alertmanager web interface to:
- Check that Alertmanager is healthy (`curl http://alertmanager:9093/-/healthy`)

Using LogChefQL (Recommended):
severity_text = "ERROR"Using SQL Mode:
SELECT count(*) as value FROM logs WHERE severity_text = 'ERROR' AND timestamp >= now() - toIntervalSecond(300)To create a similar alert quickly:
`severity: critical and team: your-team`