Real-time Alerting

LogChef provides a powerful alerting system that continuously evaluates your log data against custom conditions. When thresholds are exceeded, alerts are automatically sent to Alertmanager, which routes notifications to your preferred channels like Slack, PagerDuty, email, or webhooks.

The alerting system is designed for production use with built-in reliability features including retry logic, delivery failure tracking, and comprehensive error handling.

Understanding the Alert Flow

What you do in LogChef:

  1. Create an alert with a condition or SQL query and threshold (e.g., “error count > 100”)
  2. LogChef evaluates your query every X seconds (you set the frequency)
  3. When the threshold is exceeded, LogChef sends the alert to Alertmanager

What Alertmanager does:

  1. Receives alerts from LogChef
  2. Groups similar alerts together (e.g., all critical alerts from your team)
  3. Routes alerts to notification channels based on your Alertmanager configuration
  4. Sends notifications to Slack, PagerDuty, email, webhooks, etc.

Important to understand:

  • LogChef creates and evaluates alerts - You define what to monitor and when to alert
  • Alertmanager handles notifications - It decides where notifications go and how they’re grouped
  • You configure Slack/PagerDuty in Alertmanager - Not in LogChef
  • LogChef shows alert history - You can see when alerts fired and their values in the UI

This separation means you can change notification channels (add Slack, remove email, etc.) without touching your alert definitions in LogChef.

Quick Example

Scenario: You want to get notified on Slack when your API has more than 100 errors in 5 minutes.

In LogChef (using LogChefQL - the simple way):

  1. Navigate to your team/source → Alerts
  2. Click “New Alert”
  3. Name: “High API Errors”
  4. Filter condition: severity = "ERROR"
  5. Threshold: Greater than 100
  6. Lookback: 5 minutes (time filter is auto-applied)
  7. Frequency: Every 60 seconds
  8. Save

The generated SQL will be: SELECT count(*) as value FROM logs WHERE (severity = 'ERROR') AND timestamp >= now() - toIntervalSecond(300)

In Alertmanager config:

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
        channel: '#api-alerts'

Result: Every minute, LogChef checks error count. If > 100, it fires the alert to Alertmanager. Alertmanager sends a message to your Slack #api-alerts channel. You can see all firings in LogChef’s alert history.

Alertmanager Integration

Native Prometheus Alertmanager support for battle-tested alert routing and notification delivery.

Simple & Advanced Modes

Use LogChefQL for simple filter conditions, or write raw ClickHouse SQL for complex queries.

Delivery Guarantees

Automatic retry logic with exponential backoff ensures alerts reach Alertmanager even during network issues.

Rich Metadata

Alerts include team and source names, custom labels, annotations, and direct links to the web UI.

How it Works

LogChef’s alert manager runs in the background, continuously evaluating active alerts at the configured interval (default: 1 minute). When an alert’s SQL query returns a value that exceeds the threshold, the alert fires and is sent to Alertmanager with full context including labels, annotations, and metadata.

Alert Lifecycle

  1. Evaluation: Alert SQL query runs against ClickHouse
  2. Threshold Check: Result is compared against configured threshold
  3. Triggered: If threshold is met, alert fires and is sent to Alertmanager
  4. Grouped: Alertmanager groups similar alerts by team, source, and severity
  5. Routed: Notifications are sent to configured receivers (Slack, PagerDuty, etc.)
  6. Resolved: When conditions clear, a resolution notification is sent

Creating Alerts

Query Modes

LogChef supports two ways to define alert conditions:

LogChefQL Mode (Simple)

The simplest way to create alerts. Write a filter condition and LogChef generates the SQL automatically:

  • Filter Condition: Simple expressions like severity = "ERROR" or status_code >= 500
  • Aggregate Function: Choose count, sum, avg, min, or max
  • Auto Time Filter: Lookback period is automatically applied—no manual time filters needed
  • Live Preview: See the generated SQL before saving

Example conditions:

severity = "ERROR"
severity = "ERROR" and service = "api"
status_code >= 500
response_time > 1000
message ~ "timeout"
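
As in the Quick Example above, LogChef wraps the condition in the chosen aggregate and appends the lookback time filter automatically. With the default count(*) aggregate and a 5-minute lookback, the second condition above would generate SQL roughly like:

SELECT count(*) as value
FROM logs
WHERE (severity = 'ERROR' AND service = 'api')
AND timestamp >= now() - toIntervalSecond(300)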

SQL Mode (Advanced)

For complex queries, switch to SQL mode and write raw ClickHouse SQL:

  • Full control over the query
  • Access to all ClickHouse functions
  • Must return a single numeric value aliased as value (see the minimal sketch below)
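
A minimal SQL-mode query has this shape; the duration_ms attribute here is hypothetical, so substitute the table and columns from your own schema:

SELECT avg(JSONExtractFloat(log_attributes, 'duration_ms')) as value
FROM logs
WHERE timestamp >= now() - toIntervalSecond(300)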

Basic Alert Configuration

Every alert requires the following components:

  • Name: Human-readable identifier for the alert
  • Description: Optional context about what the alert monitors
  • Severity: info, warning, or critical
  • Query: LogChefQL condition or ClickHouse SQL
  • Threshold: Value and operator (>, >=, <, <=, ==, !=)
  • Frequency: How often to evaluate the alert (in seconds)
  • Lookback Window: Time range for the query to analyze

Example Alerts

High Error Rate

Monitor when error log count exceeds acceptable levels.

LogChefQL:

severity_text = "ERROR"
  • Aggregate: count(*)
  • Lookback: 5 minutes

SQL equivalent:

SELECT count(*) as value
FROM logs
WHERE severity_text = 'ERROR'
AND timestamp >= now() - toIntervalSecond(300)

Threshold: Greater than 100 | Frequency: 60 seconds | Severity: Critical


Server Errors by Service

Alert on HTTP 5xx errors for a specific service.

LogChefQL:

status_code >= 500 and service = "api-gateway"
  • Aggregate: count(*)
  • Lookback: 5 minutes
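
SQL equivalent (a sketch following the same generation pattern as the example above):

SELECT count(*) as value
FROM logs
WHERE status_code >= 500 AND service = 'api-gateway'
AND timestamp >= now() - toIntervalSecond(300)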

Threshold: Greater than 10 | Frequency: 60 seconds | Severity: Warning


Failed Authentication Attempts

Alert on suspicious authentication activity.

LogChefQL:

body ~ "authentication failed"
  • Aggregate: count(*)
  • Lookback: 15 minutes

SQL equivalent:

SELECT count(*) as value
FROM logs
WHERE body LIKE '%authentication failed%'
AND timestamp >= now() - toIntervalSecond(900)

Threshold: Greater than 10 | Frequency: 300 seconds | Severity: Warning


API Response Time Degradation (SQL Mode)

For complex aggregations, use SQL mode:

SELECT avg(JSONExtractFloat(log_attributes, 'response_time_ms')) as value
FROM logs
WHERE service = 'api-gateway'
AND timestamp >= now() - toIntervalSecond(600)

Threshold: Greater than 500.0 (ms) | Frequency: 120 seconds | Severity: Warning


Database Connection Pool Exhaustion (SQL Mode)

Monitor when connection pool usage is critically high:

SELECT max(JSONExtractInt(log_attributes, 'pool_active_connections')) as value
FROM logs
WHERE service = 'database-proxy'
AND timestamp >= now() - toIntervalSecond(300)

Threshold: Greater than or equal to 95 | Frequency: 60 seconds | Severity: Critical

Configuration

Admin Settings UI

Alerting is configured through the Administration → System Settings → Alerts tab in the web interface. This provides a user-friendly way to manage all alert settings without editing configuration files.

Available Settings:

  • Enabled: Toggle alert evaluation on/off
  • Alertmanager URL: Endpoint for sending alerts
    • Test connectivity with the built-in health check button
    • Supports HTTP Basic Auth: https://username:password@alertmanager.example.com
  • Evaluation Interval: How often to evaluate all alerts
  • Default Lookback: Default time range for alert queries
  • History Limit: Number of historical events to keep per alert
  • External URL: Backend URL for API access
  • Frontend URL: Frontend URL for web UI links in alert notifications
  • Request Timeout: HTTP timeout for Alertmanager requests
  • TLS Insecure Skip Verify: Skip TLS certificate verification (development only)

Initial Configuration (First Boot)

On first boot, you can optionally seed alert settings from config.toml. After first boot, all changes must be made via the Admin Settings UI:

[alerts]
# Enable alert evaluation and delivery
enabled = true
# How often to evaluate all alerts (default: 1 minute)
evaluation_interval = "1m"
# Default lookback window if not specified in alert (default: 5 minutes)
default_lookback = "5m"
# Maximum alert history entries to keep per alert (default: 100)
history_limit = 100
# Alertmanager API endpoint
alertmanager_url = "http://alertmanager:9093"
# Backend URL for API access (used for fallback)
external_url = "http://localhost:8125"
# Frontend URL for web UI generator links
frontend_url = "http://localhost:5173"
# HTTP request timeout for Alertmanager communication
request_timeout = "5s"
# Skip TLS certificate verification (for development only)
tls_insecure_skip_verify = false

Note: After first boot, changes to [alerts] in config.toml are ignored. Use the Admin Settings UI to modify alert configuration.

Alertmanager Authentication

If your Alertmanager requires authentication, include credentials in the URL using HTTP Basic Auth:

Format:

https://username:password@alertmanager.example.com

Examples:

# Basic authentication
https://admin:secretpass@alertmanager.internal:9093
# URL-encoded special characters in password
https://admin:my%40pass%3A123@alertmanager.internal:9093

Special character encoding:

  • @ → %40
  • : → %3A
  • # → %23
  • % → %25
  • Space → %20

Testing Connectivity

Before saving Alertmanager configuration, use the Test Connection button in the Admin Settings UI to verify:

  • Alertmanager is reachable
  • Authentication credentials are correct
  • Network connectivity is working
  • TLS certificates are valid (or properly skipped)

The health check calls Alertmanager’s /api/v2/status endpoint and reports success or detailed error messages.
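
You can also check the same endpoint manually from the LogChef host to rule out network or credential issues, for example:

curl -u username:password https://alertmanager.example.com/api/v2/status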

Alertmanager Configuration

Configure Alertmanager to route LogChef alerts to your notification channels. Example configuration:

global:
  resolve_timeout: 5m

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'severity', 'team', 'source']
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 12h
  routes:
    # Critical alerts to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    # Warning alerts to Slack
    - match:
        severity: warning
      receiver: 'slack-oncall'

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://webhook-receiver:8080/alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
  - name: 'slack-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: 'LogChef Alert: {{ .GroupLabels.alertname }}'

Alert Labels and Annotations

Default Labels

Every alert includes these labels automatically:

  • alertname: Name of the alert
  • alert_id: Unique alert identifier
  • severity: Alert severity level
  • status: Current status (triggered or resolved)
  • team: Human-readable team name
  • team_id: Numeric team identifier
  • source: Human-readable source name
  • source_id: Numeric source identifier

Custom Labels

Add custom labels to categorize and route alerts:

{
  "env": "production",
  "service": "payment-api",
  "region": "us-east-1",
  "component": "database"
}

These labels can be used in Alertmanager routing rules to send alerts to appropriate teams or channels.
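
For example, an Alertmanager route that sends production payment-api alerts to a dedicated receiver might look like this (the receiver name is illustrative):

routes:
  - match:
      env: production
      service: payment-api
    receiver: 'payments-oncall'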

Annotations

Annotations provide additional context that doesn’t affect routing:

  • description: Alert description text
  • query: The SQL query used for evaluation
  • threshold: Threshold value and operator
  • value: Actual value that triggered the alert
  • frequency_seconds: Evaluation frequency
  • lookback_seconds: Query lookback window

Custom annotations can be added for runbooks, dashboards, or documentation links:

{
  "runbook": "https://wiki.example.com/runbooks/high-error-rate",
  "dashboard": "https://grafana.example.com/d/logs-overview",
  "playbook": "Check database connection pool and recent deployments"
}

Alerts Dashboard

The alerts dashboard provides a quick overview of all your alert rules with real-time status.

Dashboard Features

Each alert row shows the following columns:

  • Active: Toggle switch to quickly enable/disable alerts
  • Alert: Name, severity badge, and description
  • Condition: Threshold value and evaluation frequency
  • Status: Live indicator: 🔴 firing (pulsing) or 🟢 resolved
  • Last Triggered: When the alert last fired
  • Actions: Edit, view history, duplicate, delete

Quick Actions

  • Toggle Switch: Enable or disable alerts directly from the list without opening the edit form
  • Duplicate: Create a new alert based on an existing one—great for similar conditions across services
  • Status Indicator: Red pulsing dot for firing alerts, green dot for resolved

Where to Find Your Alerts

Alert List View:

  • Navigate to your team → source → Alerts tab
  • See all alerts for that source with live status
  • Toggle alerts on/off with one click
  • Quick access to edit, duplicate, delete, or view history

Alert Detail Page:

  • Click on any alert to see full details
  • Edit tab: Modify query, threshold, frequency
  • History tab: See all past evaluations and firings
  • Test query before saving changes

Alert History

LogChef maintains a complete history of all alert evaluations including:

  • Triggered Events: When alerts fire with the metric value
  • Resolved Events: When conditions clear
  • Error Events: Query failures or evaluation errors
  • Delivery Status: Whether alerts successfully reached Alertmanager

What you can see in history:

  • Timestamp: When the alert was evaluated
  • Status: triggered, resolved, or error
  • Value: The actual metric value from your query
  • Delivery Status: Success or failure with error details
  • Duration: How long the alert was active (for triggered alerts)

Use alert history to:

  • Investigate why an alert fired
  • See the actual query result that triggered the alert
  • Check how long conditions persisted
  • Review previous occurrences and patterns
  • Debug delivery failures to Alertmanager
  • Verify alerts are evaluating correctly

Reliability Features

Retry Logic with Exponential Backoff

If Alertmanager is temporarily unavailable, LogChef automatically retries alert delivery:

  • Default: 2 retry attempts
  • Initial Delay: 500ms
  • Backoff: Exponential (500ms → 1s → 2s)
  • Retry On: Network errors and 5xx HTTP status codes

Delivery Failure Tracking

Failed deliveries are recorded in alert history with:

  • Error message and timestamp
  • Retry attempts counter
  • Automatic retry on next evaluation cycle

Evaluation Error Handling

Query failures or database issues are captured as error status in history:

  • Error message preserved for debugging
  • Query and configuration included in error payload
  • Alert evaluation continues on next cycle

Monitoring Alerts

Log Messages

Alert evaluations produce structured logs for observability:

// Successful evaluation
{"level":"DEBUG","msg":"alert evaluation complete","alert_id":1,"value":42.5,"triggered":true}
// Alert triggered
{"level":"INFO","msg":"alert triggered","alert_name":"High Error Rate","value":150,"threshold":100}
// Successful delivery
{"level":"INFO","msg":"alert successfully sent to Alertmanager","alert_id":1}
// Alert resolved
{"level":"INFO","msg":"alert resolved","alert_name":"High Error Rate","value":45}

Alertmanager UI

Access the Alertmanager web interface to:

  • View active alerts and their current status
  • See alert grouping and routing decisions
  • Silence alerts temporarily
  • Inspect alert payloads and labels

Best Practices

Query Design

  • Start with LogChefQL: Use simple filter conditions when possible—time filters are auto-applied
  • Return Single Value: Queries must return exactly one numeric value
  • Use Lookback Windows: In SQL mode, always include a time filter for performance
  • Test First: Use “Test Query” in the UI to validate before saving
  • Keep it Simple: Complex aggregations may timeout or return unexpected results
  • Preview SQL: In LogChefQL mode, review the generated SQL to understand what will run

Threshold Selection

  • Avoid Flapping: Set thresholds with enough buffer to prevent constant triggering
  • Consider Baselines: Analyze normal metrics before setting thresholds
  • Use Percentiles: For variable metrics, consider the 95th or 99th percentile instead of max/avg (see the sketch below)
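
A sketch of a p95 latency alert in SQL mode, reusing the response_time_ms attribute from the earlier example (adjust names to your schema):

SELECT quantile(0.95)(JSONExtractFloat(log_attributes, 'response_time_ms')) as value
FROM logs
WHERE service = 'api-gateway'
AND timestamp >= now() - toIntervalSecond(600)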

Frequency Configuration

  • Match Urgency: Critical alerts can evaluate every 30-60 seconds
  • Resource Aware: Frequent evaluation increases database load
  • Align with Lookback: Keep the evaluation frequency shorter than the lookback window so no events fall between consecutive evaluation windows

Organization

  • Descriptive Names: Use clear, searchable alert names
  • Team Ownership: Assign alerts to appropriate teams
  • Runbook Links: Add runbook URLs in annotations for quick response
  • Review Regularly: Audit and tune alerts based on actual incidents

Production Deployment

  • Use TLS: Always enable TLS for Alertmanager communication in production
  • Set Frontend URL: Configure frontend_url so generator links in notifications point to the correct web UI
  • Configure Receivers: Set up PagerDuty, Slack, or email for critical alerts
  • Test Notifications: Verify alert delivery to all configured channels
  • Monitor Alertmanager: Ensure Alertmanager itself is monitored and has high availability

Troubleshooting

Alerts Not Firing

  1. Check Alert Status: Verify alert is enabled and active
  2. Test Query: Run the SQL query manually to verify it returns a numeric value
  3. Check Logs: Look for evaluation errors in LogChef logs
  4. Verify Frequency: Ensure enough time has passed since last evaluation

Alerts Not Delivered

  1. Check Alertmanager URL: Verify alertmanager_url in config is correct
  2. Test Connectivity: Ensure LogChef can reach Alertmanager (curl http://alertmanager:9093/-/healthy)
  3. Review Logs: Check for delivery errors in LogChef logs
  4. Inspect History: Check alert history for delivery failure details

False Positives

  1. Adjust Threshold: Increase threshold to reduce noise
  2. Extend Lookback: Longer windows smooth out temporary spikes
  3. Use Aggregation: Consider avg() instead of max() for less sensitive alerts
  4. Add Filters: Narrow down query to specific services or conditions

Performance Issues

  1. Optimize Queries: Add appropriate indexes in ClickHouse (see the sketch after this list)
  2. Reduce Frequency: Increase evaluation interval for non-critical alerts
  3. Limit Lookback: Shorter time windows query less data
  4. Monitor Database: Watch ClickHouse query performance
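
For step 1, one option is a ClickHouse data-skipping index on a frequently filtered column. This is only a sketch; the column, index type, and granularity depend on your table and query patterns:

ALTER TABLE logs ADD INDEX idx_severity severity_text TYPE set(10) GRANULARITY 4;
ALTER TABLE logs MATERIALIZE INDEX idx_severity;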

Example Workflows

Setting Up Your First Alert

Using LogChefQL (Recommended):

  1. Navigate to Alerts in your team/source
  2. Click “New Alert”
  3. Name: “High Error Count”
  4. Keep “LogChefQL” mode selected (default)
  5. Filter condition: severity_text = "ERROR"
  6. Aggregate: count(*) (default)
  7. Lookback: 5 minutes
  8. Threshold: Greater than 50
  9. Frequency: 60 seconds
  10. Severity: Warning
  11. Click “Test Query” to verify the generated SQL
  12. Save and monitor alert history

Using SQL Mode:

  1. Navigate to Alerts in your team/source
  2. Click “New Alert”
  3. Name: “High Error Count”
  4. Switch to “SQL” mode
  5. Query: SELECT count(*) as value FROM logs WHERE severity_text = 'ERROR' AND timestamp >= now() - toIntervalSecond(300)
  6. Threshold: Greater than 50
  7. Frequency: 60 seconds
  8. Severity: Warning
  9. Click “Test Query” to verify
  10. Save and monitor alert history

Duplicating an Alert

To create a similar alert quickly:

  1. Find the alert in the dashboard
  2. Click the “…” menu → “Duplicate”
  3. A new alert form opens pre-filled with the original settings
  4. Modify the name and any conditions as needed
  5. Save the new alert

Creating an On-Call Rotation

  1. Set up PagerDuty integration in Alertmanager
  2. Create critical alerts for your services
  3. Tag alerts with severity: critical and team: your-team
  4. Configure an Alertmanager route that matches on the team label (see the sketch below)
  5. Test with a temporary threshold adjustment
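
A minimal route for step 4 might look like this (the receiver name is illustrative and must match a receiver you have defined):

routes:
  - match:
      severity: critical
      team: your-team
    receiver: 'pagerduty-your-team'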

Building a Comprehensive Monitoring Suite

  1. Error Alerts: Monitor error rates across all services
  2. Performance Alerts: Track response times and latency
  3. Availability Alerts: Watch for service health check failures
  4. Resource Alerts: Monitor memory, CPU, and connection pools
  5. Business Metrics: Alert on transaction failures or conversion drops