Real-time Alerting

LogChef provides a powerful alerting system that continuously evaluates your log data against custom conditions. When a threshold is exceeded, notifications are delivered directly via email (SMTP) and the webhook URLs configured on each alert.

The alerting system is designed for production use with built-in reliability features including retry logic, delivery failure tracking, and comprehensive error handling.

Understanding the Alert Flow

What you do in LogChef:

  1. Create an alert with a condition or SQL query and threshold (e.g., “error count > 100”)
  2. Select team members to notify and add webhook URLs in the alert form
  3. Configure SMTP settings in Administration → System Settings → Alerts

What LogChef does:

  1. Evaluates your query every X seconds (you set the frequency)
  2. Builds an alert notification payload with labels, annotations, and context
  3. Sends email notifications to the selected recipients
  4. Posts JSON payloads to each configured webhook URL
  5. Records delivery outcomes in alert history

Important to understand:

  • LogChef creates, evaluates, and delivers notifications
  • Email delivery uses SMTP settings configured by admins
  • Webhooks are defined per alert
  • Alert history shows when alerts fired and their delivery status

Quick Example

Scenario: You want to get notified on Slack when your API has more than 100 errors in 5 minutes.

In LogChef (using LogChefQL - the simple way):

  1. Navigate to your team/source → Alerts
  2. Click “New Alert”
  3. Name: “High API Errors”
  4. Filter condition: severity = "ERROR"
  5. Threshold: Greater than 100
  6. Lookback: 5 minutes (time filter is auto-applied)
  7. Frequency: Every 60 seconds
  8. Save

The generated SQL will be: SELECT count(*) as value FROM logs WHERE (severity = 'ERROR') AND timestamp >= now() - toIntervalSecond(300)

In the alert form:

  • Team members: select the on-call recipients
  • Webhook URLs: https://hooks.slack.com/services/YOUR/WEBHOOK

Result: Every minute, LogChef checks the error count. If it exceeds 100, LogChef sends emails to the selected team members and posts the webhook payload. You can see all firings in LogChef's alert history.

Email & Webhook Delivery

Send notifications directly via SMTP email and webhook endpoints.

Simple & Advanced Modes

Use LogChefQL for simple filter conditions, or write raw ClickHouse SQL for complex queries.

Delivery Guarantees

Automatic retry logic with exponential backoff ensures notifications are delivered reliably.

Rich Metadata

Alerts include team and source names, custom labels, annotations, and direct links to the web UI.

How it Works

LogChef’s alert manager runs in the background, continuously evaluating active alerts at the configured interval (default: 1 minute). When an alert’s SQL query returns a value that exceeds the threshold, the alert fires and LogChef sends email and webhook notifications with full context including labels, annotations, and metadata.

Alert Lifecycle

  1. Evaluation: Alert SQL query runs against ClickHouse
  2. Threshold Check: Result is compared against configured threshold
  3. Triggered: If threshold is met, alert fires and notification payload is built
  4. Delivered: Email notifications are sent to selected recipients
  5. Webhooked: JSON payloads are posted to configured webhook URLs
  6. Resolved: When conditions clear, a resolution notification is sent

Creating Alerts

Query Modes

LogChef supports two ways to define alert conditions:

LogChefQL Mode (Simple)

The simplest way to create alerts: write a filter condition and LogChef generates the SQL automatically:

  • Filter Condition: Simple expressions like severity = "ERROR" or status_code >= 500
  • Aggregate Function: Choose count, sum, avg, min, or max
  • Auto Time Filter: Lookback period is automatically applied—no manual time filters needed
  • Live Preview: See the generated SQL before saving

Example conditions:

severity = "ERROR"
severity = "ERROR" and service = "api"
status_code >= 500
response_time > 1000
message ~ "timeout"
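
Conditions combined with and follow the same generation pattern shown in the Quick Example above. As a sketch (the exact formatting LogChef produces may differ), the second condition generates SQL along these lines:

SELECT count(*) as value
FROM logs
WHERE (severity = 'ERROR' AND service = 'api')
AND timestamp >= now() - toIntervalSecond(300)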

SQL Mode (Advanced)

For complex queries, switch to SQL mode and write raw ClickHouse SQL:

  • Full control over the query
  • Access to all ClickHouse functions
  • Must return a single numeric value as value
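
For instance, SQL mode lets you alert on ratios that a single LogChefQL filter cannot express. A minimal sketch, assuming the logs table and severity_text column used in the examples below (the result must still be aliased as value):

-- Percentage of ERROR logs over the last 5 minutes
-- greatest(count(*), 1) avoids division by zero when no logs arrive
SELECT countIf(severity_text = 'ERROR') * 100.0 / greatest(count(*), 1) as value
FROM logs
WHERE timestamp >= now() - toIntervalSecond(300)

Paired with a threshold such as Greater than 5, this fires when more than 5% of recent logs are errors.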

Basic Alert Configuration

Every alert requires the following components:

  • Name: Human-readable identifier for the alert
  • Description: Optional context about what the alert monitors
  • Severity: info, warning, or critical
  • Query: LogChefQL condition or ClickHouse SQL
  • Threshold: Value and operator (>, >=, <, <=, ==, !=)
  • Frequency: How often to evaluate the alert (in seconds)
  • Lookback Window: Time range for the query to analyze

Example Alerts

High Error Rate

Monitor when error log count exceeds acceptable levels.

LogChefQL:

severity_text = "ERROR"
  • Aggregate: count(*)
  • Lookback: 5 minutes

SQL equivalent:

SELECT count(*) as value
FROM logs
WHERE severity_text = 'ERROR'
AND timestamp >= now() - toIntervalSecond(300)

Threshold: Greater than 100 | Frequency: 60 seconds | Severity: Critical


Server Errors by Service

Alert on HTTP 5xx errors for a specific service.

LogChefQL:

status_code >= 500 and service = "api-gateway"
  • Aggregate: count(*)
  • Lookback: 5 minutes

Threshold: Greater than 10 | Frequency: 60 seconds | Severity: Warning


Failed Authentication Attempts

Alert on suspicious authentication activity.

LogChefQL:

body ~ "authentication failed"
  • Aggregate: count(*)
  • Lookback: 15 minutes

SQL equivalent:

SELECT count(*) as value
FROM logs
WHERE body LIKE '%authentication failed%'
AND timestamp >= now() - toIntervalSecond(900)

Threshold: Greater than 10 | Frequency: 300 seconds | Severity: Warning


API Response Time Degradation (SQL Mode)

For complex aggregations, use SQL mode:

SELECT avg(JSONExtractFloat(log_attributes, 'response_time_ms')) as value
FROM logs
WHERE service = 'api-gateway'
AND timestamp >= now() - toIntervalSecond(600)

Threshold: Greater than 500.0 (ms) | Frequency: 120 seconds | Severity: Warning


Database Connection Pool Exhaustion (SQL Mode)

Monitor when connection pool usage is critically high:

SELECT max(JSONExtractInt(log_attributes, 'pool_active_connections')) as value
FROM logs
WHERE service = 'database-proxy'
AND timestamp >= now() - toIntervalSecond(300)

Threshold: Greater than or equal to 95 | Frequency: 60 seconds | Severity: Critical
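

Detecting Missing Logs (SQL Mode)

The less-than operators work in the opposite direction: they can catch a service that has stopped logging entirely. A sketch following the same pattern, assuming a hypothetical payment-api service that normally logs continuously:

-- Fires when the service has produced no logs in the last 10 minutes
SELECT count(*) as value
FROM logs
WHERE service = 'payment-api'
AND timestamp >= now() - toIntervalSecond(600)

Threshold: Less than 1 | Frequency: 120 seconds | Severity: Critical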

Configuration

Admin Settings UI

Alerting is configured through the Administration → System Settings → Alerts tab in the web interface. This provides a user-friendly way to manage all alert settings without editing configuration files.

Available Settings:

  • Enabled: Toggle alert evaluation on/off
  • SMTP Host: Email server hostname
  • SMTP Port: Email server port
  • SMTP Username: SMTP auth username (optional)
  • SMTP Password: SMTP auth password (optional)
  • SMTP From: From address for alert emails
  • SMTP Reply-To: Reply-To address (optional)
  • SMTP Security: none, starttls, or tls
  • Evaluation Interval: How often to evaluate all alerts
  • Default Lookback: Default time range for alert queries
  • History Limit: Number of historical events to keep per alert
  • External URL: Backend URL for API access
  • Frontend URL: Frontend URL for web UI links in alert notifications
  • Request Timeout: HTTP timeout for alert delivery
  • TLS Insecure Skip Verify: Skip TLS certificate verification (development only)

Alert Recipients & Webhooks

Recipients and webhook URLs are configured per alert. In the alert form, select team members to notify and add one or more webhook URLs for integrations like Slack, PagerDuty, or custom endpoints.

Initial Configuration (First Boot)

On first boot, you can optionally seed alert settings from config.toml. After first boot, all changes must be made via the Admin Settings UI:

[alerts]
# Enable alert evaluation and delivery
enabled = true
# How often to evaluate all alerts (default: 1 minute)
evaluation_interval = "1m"
# Default lookback window if not specified in alert (default: 5 minutes)
default_lookback = "5m"
# Maximum alert history entries to keep per alert (default: 100)
history_limit = 100
# SMTP configuration for email delivery
smtp_host = "smtp.example.com"
smtp_port = 587
smtp_username = ""
smtp_password = ""
smtp_from = "alerts@example.com"
smtp_reply_to = ""
smtp_security = "starttls"
# Backend URL for API access (used for fallback)
external_url = "http://localhost:8125"
# Frontend URL for web UI generator links
frontend_url = "http://localhost:5173"
# HTTP request timeout for alert delivery
request_timeout = "5s"
# Skip TLS certificate verification (for development only)
tls_insecure_skip_verify = false

Note: After first boot, changes to [alerts] in config.toml are ignored. Use the Admin Settings UI to modify alert configuration.

Alert Labels and Annotations

Default Labels

Every alert includes these labels automatically:

  • alertname: Name of the alert
  • alert_id: Unique alert identifier
  • severity: Alert severity level
  • status: Current status (triggered or resolved)
  • team: Human-readable team name
  • team_id: Numeric team identifier
  • source: Human-readable source name
  • source_id: Numeric source identifier

Custom Labels

Add custom labels to categorize and route alerts:

{
  "env": "production",
  "service": "payment-api",
  "region": "us-east-1",
  "component": "database"
}

These labels are included in webhook payloads and email notifications for filtering and routing on the receiver side.

Annotations

Annotations provide additional context that doesn’t affect routing:

  • description: Alert description text
  • query: The SQL query used for evaluation
  • threshold: Threshold value and operator
  • value: Actual value that triggered the alert
  • frequency_seconds: Evaluation frequency
  • lookback_seconds: Query lookback window

Custom annotations can be added for runbooks, dashboards, or documentation links:

{
  "runbook": "https://wiki.example.com/runbooks/high-error-rate",
  "dashboard": "https://grafana.example.com/d/logs-overview",
  "playbook": "Check database connection pool and recent deployments"
}

Alerts Dashboard

The alerts dashboard provides a quick overview of all your alert rules with real-time status.

Dashboard Features

Each alert in the list shows the following columns:

  • Active: Toggle switch to quickly enable/disable alerts
  • Alert: Name, severity badge, and description
  • Condition: Threshold value and evaluation frequency
  • Status: Live indicator, 🔴 firing (pulsing) or 🟢 resolved
  • Last Triggered: When the alert last fired
  • Actions: Edit, view history, duplicate, delete

Quick Actions

  • Toggle Switch: Enable or disable alerts directly from the list without opening the edit form
  • Duplicate: Create a new alert based on an existing one—great for similar conditions across services
  • Status Indicator: Red pulsing dot for firing alerts, green dot for resolved

Where to Find Your Alerts

Alert List View:

  • Navigate to your team → source → Alerts tab
  • See all alerts for that source with live status
  • Toggle alerts on/off with one click
  • Quick access to edit, duplicate, delete, or view history

Alert Detail Page:

  • Click on any alert to see full details
  • Edit tab: Modify query, threshold, frequency
  • History tab: See all past evaluations and firings
  • Test query before saving changes

Alert History

LogChef maintains a complete history of all alert evaluations including:

  • Triggered Events: When alerts fire with the metric value
  • Resolved Events: When conditions clear
  • Error Events: Query failures or evaluation errors
  • Delivery Status: Whether alerts successfully reached email/webhook endpoints

What you can see in history:

  • Timestamp: When the alert was evaluated
  • Status: triggered, resolved, or error
  • Value: The actual metric value from your query
  • Delivery Status: Success or failure with error details
  • Duration: How long the alert was active (for triggered alerts)

Use alert history to:

  • Investigate why an alert fired
  • See the actual query result that triggered the alert
  • Check how long conditions persisted
  • Review previous occurrences and patterns
  • Debug delivery failures to email/webhook endpoints
  • Verify alerts are evaluating correctly

Reliability Features

Retry Logic with Exponential Backoff

If email or webhook delivery fails, LogChef automatically retries alert delivery:

  • Default: 2 retry attempts
  • Initial Delay: 500ms
  • Backoff: Exponential (500ms → 1s → 2s)
  • Retry On: Network errors and 5xx HTTP status codes

Delivery Failure Tracking

Failed deliveries are recorded in alert history with:

  • Error message and timestamp
  • Retry attempts counter
  • Automatic retry on next evaluation cycle

Evaluation Error Handling

Query failures or database issues are captured as error status in history:

  • Error message preserved for debugging
  • Query and configuration included in error payload
  • Alert evaluation continues on next cycle

Monitoring Alerts

Log Messages

Alert evaluations produce structured logs for observability:

// Successful evaluation
{"level":"DEBUG","msg":"alert evaluation complete","alert_id":1,"value":42.5,"triggered":true}
// Alert triggered
{"level":"INFO","msg":"alert triggered","alert_name":"High Error Rate","value":150,"threshold":100}
// Successful delivery
{"level":"INFO","msg":"alert notifications sent","alert_id":1}
// Alert resolved
{"level":"INFO","msg":"alert resolved","alert_name":"High Error Rate","value":45}

Best Practices

Query Design

  • Start with LogChefQL: Use simple filter conditions when possible—time filters are auto-applied
  • Return Single Value: Queries must return exactly one numeric value
  • Use Lookback Windows: In SQL mode, always include a time filter for performance
  • Test First: Use “Test Query” in the UI to validate before saving
  • Keep it Simple: Complex aggregations may time out or return unexpected results
  • Preview SQL: In LogChefQL mode, review the generated SQL to understand what will run

Threshold Selection

  • Avoid Flapping: Set thresholds with enough buffer to prevent constant triggering
  • Consider Baselines: Analyze normal metrics before setting thresholds
  • Use Percentiles: For variable metrics, consider 95th or 99th percentile instead of max/avg
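
As an illustration, a 95th-percentile latency alert can be written in SQL mode with ClickHouse's quantile function. A sketch reusing the response_time_ms attribute from the earlier response-time example:

-- 95th percentile API response time (ms) over the last 10 minutes
SELECT quantile(0.95)(JSONExtractFloat(log_attributes, 'response_time_ms')) as value
FROM logs
WHERE service = 'api-gateway'
AND timestamp >= now() - toIntervalSecond(600)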

Frequency Configuration

  • Match Urgency: Critical alerts can evaluate every 30-60 seconds
  • Resource Aware: Frequent evaluation increases database load
  • Align with Lookback: The evaluation frequency should be shorter than the lookback window so consecutive evaluations don't leave gaps in coverage

Organization

  • Descriptive Names: Use clear, searchable alert names
  • Team Ownership: Assign alerts to appropriate teams
  • Runbook Links: Add runbook URLs in annotations for quick response
  • Review Regularly: Audit and tune alerts based on actual incidents

Production Deployment

  • Use TLS: Enable TLS for SMTP (smtp_security set to tls or starttls)
  • Set URLs: Configure external_url (API access) and frontend_url (web UI links) so links in alert notifications resolve correctly
  • Configure Recipients: Set per-alert email recipients and webhook URLs
  • Test Notifications: Verify alert delivery to all configured channels
  • Monitor Delivery: Watch logs for email/webhook delivery failures

Troubleshooting

Alerts Not Firing

  1. Check Alert Status: Verify alert is enabled and active
  2. Test Query: Run the SQL query manually to verify it returns a numeric value
  3. Check Logs: Look for evaluation errors in LogChef logs
  4. Verify Frequency: Ensure enough time has passed since last evaluation

Alerts Not Delivered

  1. Check SMTP Settings: Verify smtp_host, smtp_port, and smtp_from are correct
  2. Verify Recipients/Webhooks: Ensure the alert has recipients or webhook URLs configured
  3. Review Logs: Check for delivery errors in LogChef logs
  4. Inspect History: Check alert history for delivery failure details

False Positives

  1. Adjust Threshold: Increase threshold to reduce noise
  2. Extend Lookback: Longer windows smooth out temporary spikes
  3. Use Aggregation: Consider avg() instead of max() for less sensitive alerts
  4. Add Filters: Narrow down query to specific services or conditions

Performance Issues

  1. Optimize Queries: Add appropriate indexes in ClickHouse (see the sketch after this list)
  2. Reduce Frequency: Increase evaluation interval for non-critical alerts
  3. Limit Lookback: Shorter time windows query less data
  4. Monitor Database: Watch ClickHouse query performance
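
For example, if most alert queries filter on severity_text, a ClickHouse data-skipping index can cut the amount of data scanned. A sketch, assuming a MergeTree table named logs; adapt the column and index type to your schema:

-- Add a set-based skip index on the severity column, then backfill it for existing parts
ALTER TABLE logs ADD INDEX idx_severity_text severity_text TYPE set(100) GRANULARITY 4;
ALTER TABLE logs MATERIALIZE INDEX idx_severity_text;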

Example Workflows

Setting Up Your First Alert

Using LogChefQL (Recommended):

  1. Navigate to Alerts in your team/source
  2. Click “New Alert”
  3. Name: “High Error Count”
  4. Keep “LogChefQL” mode selected (default)
  5. Filter condition: severity_text = "ERROR"
  6. Aggregate: count(*) (default)
  7. Lookback: 5 minutes
  8. Threshold: Greater than 50
  9. Frequency: 60 seconds
  10. Severity: Warning
  11. Click “Test Query” to verify the generated SQL
  12. Save and monitor alert history

Using SQL Mode:

  1. Navigate to Alerts in your team/source
  2. Click “New Alert”
  3. Name: “High Error Count”
  4. Switch to “SQL” mode
  5. Query: SELECT count(*) as value FROM logs WHERE severity_text = 'ERROR' AND timestamp >= now() - toIntervalSecond(300)
  6. Threshold: Greater than 50
  7. Frequency: 60 seconds
  8. Severity: Warning
  9. Click “Test Query” to verify
  10. Save and monitor alert history

Duplicating an Alert

To create a similar alert quickly:

  1. Find the alert in the dashboard
  2. Click the ”…” menu → “Duplicate”
  3. A new alert form opens pre-filled with the original settings
  4. Modify the name and any conditions as needed
  5. Save the new alert

Creating an On-Call Rotation

  1. Add a PagerDuty webhook URL to the alert notification settings
  2. Create critical alerts for your services
  3. Tag alerts with severity: critical and team: your-team
  4. Use your webhook receiver to route based on labels if needed
  5. Test with a temporary threshold adjustment

Building a Comprehensive Monitoring Suite

  1. Error Alerts: Monitor error rates across all services
  2. Performance Alerts: Track response times and latency
  3. Availability Alerts: Watch for service health check failures
  4. Resource Alerts: Monitor memory, CPU, and connection pools
  5. Business Metrics: Alert on transaction failures or conversion drops