Purpose
Set up a metrics pipeline that collects container stats, system metrics, and HTTP endpoint health — then visualizes everything in Grafana dashboards. This is the TIG stack (Telegraf, InfluxDB, Grafana), the self-hosted equivalent of Datadog for your home lab.
Architecture
Containers/System → Telegraf (collector) → InfluxDB (storage) → Grafana (visualization)
Telegraf scrapes metrics on an interval (10s default), writes them to InfluxDB, and Grafana queries InfluxDB to render dashboards. Each component runs in its own container.
Docker Compose
services:
influxdb:
image: influxdb:2
restart: unless-stopped
ports:
- "8086:8086"
volumes:
- ./influxdb/data:/var/lib/influxdb2
- ./influxdb/config:/etc/influxdb2
environment:
- TZ=America/Chicago
- DOCKER_INFLUXDB_INIT_MODE=setup
- DOCKER_INFLUXDB_INIT_USERNAME=admin
- DOCKER_INFLUXDB_INIT_PASSWORD=<your-password>
- DOCKER_INFLUXDB_INIT_ORG=homelab
- DOCKER_INFLUXDB_INIT_BUCKET=telegraf
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=<your-token>
telegraf:
image: telegraf:latest
restart: unless-stopped
volumes:
- ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
environment:
- TZ=America/Chicago
depends_on:
- influxdb
# On macOS, Docker socket path may differ — check Docker Desktop settings
grafana:
image: grafana/grafana:latest
restart: unless-stopped
ports:
- "3000:3000"
volumes:
- ./grafana/data:/var/lib/grafana
environment:
- TZ=America/Chicago
- GF_SECURITY_ADMIN_PASSWORD=<your-password>
Telegraf Configuration
The Telegraf config (telegraf.conf) defines what to collect and where to send it.
[global_tags]
lab = "homelab"
[agent]
interval = "10s"
round_interval = true
flush_interval = "10s"
# ── Output: InfluxDB v2 ──
[[outputs.influxdb_v2]]
urls = ["http://influxdb:8086"]
token = "<your-influxdb-token>"
organization = "homelab"
bucket = "telegraf"
# ── System metrics ──
[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.mem]]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay"]
[[inputs.net]]
# ── Docker container metrics ──
[[inputs.docker]]
endpoint = "unix:///var/run/docker.sock"
gather_services = false
container_names = [] # Empty = all containers
perdevice = false
total = true
# ── HTTP endpoint checks ──
[[inputs.http_response]]
urls = [
"http://host.docker.internal:8100",
"http://host.docker.internal:8101",
"http://host.docker.internal:9010",
]
response_timeout = "5s"
method = "GET"
follow_redirects = true
Key Inputs Explained
- docker — CPU, memory, network, and block I/O per container. Requires Docker socket access.
- http_response — Checks if services respond and measures latency. Use
host.docker.internalon macOS since Telegraf runs in a container. - cpu/mem/disk — Host-level system metrics.
Grafana Setup
1. Add InfluxDB Data Source
- Grafana > Configuration > Data Sources > Add
- Type: InfluxDB
- Query language: Flux (InfluxDB v2 uses Flux, not InfluxQL)
- URL:
http://influxdb:8086(container-to-container) - Organization:
homelab - Token: your InfluxDB admin token
- Default bucket:
telegraf - Click "Save & Test"
2. Essential Dashboard Panels
Container CPU Usage:
from(bucket: "telegraf")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "docker_container_cpu")
|> filter(fn: (r) => r._field == "usage_percent")
|> aggregateWindow(every: 1m, fn: mean)
Container Memory:
from(bucket: "telegraf")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "docker_container_mem")
|> filter(fn: (r) => r._field == "usage_percent")
|> aggregateWindow(every: 1m, fn: mean)
HTTP Endpoint Latency:
from(bucket: "telegraf")
|> range(start: -6h)
|> filter(fn: (r) => r._measurement == "http_response")
|> filter(fn: (r) => r._field == "response_time")
|> aggregateWindow(every: 5m, fn: mean)
Retention and Storage
InfluxDB v2 uses retention policies per bucket. For a home lab:
- telegraf bucket: 30-day retention (detailed metrics)
- telegraf-longterm bucket: 365-day retention (downsampled to 1h aggregates)
Create a downsampling task in InfluxDB to roll up data:
option task = {name: "downsample-hourly", every: 1h}
from(bucket: "telegraf")
|> range(start: -2h)
|> aggregateWindow(every: 1h, fn: mean)
|> to(bucket: "telegraf-longterm")
Troubleshooting
Telegraf can't connect to Docker socket: On macOS with Docker Desktop, the socket is at /var/run/docker.sock but permissions can be tricky. Check that the volume mount is correct and the Telegraf container can read it.
No data in Grafana: Verify Telegraf is writing data: check docker logs telegraf for write errors. Common cause: wrong token or organization name in the output config.
High disk usage from InfluxDB: Set retention policies. Without them, metrics accumulate forever. A 50-container lab generating metrics every 10s can produce several GB per month.
What This Gets You
After setup, you'll have visibility into:
- Which containers are consuming the most CPU and memory
- Container restart patterns (sudden drops in uptime metrics)
- HTTP endpoint response times and availability
- Disk fill rates (critical for planning storage expansion)
- Network throughput per container
This is the foundation for alerting — Grafana can send notifications when metrics cross thresholds, giving you an early warning system for your entire lab.