
Grafana Stack & Observability#

This section details the monitoring, logging, and observability stack managed via Flux. It is built primarily around Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir), with Alloy as the telemetry collector.

Grafana Stack#

The core observability platform is defined in grafana-stack.

Overview#

  • Namespace: monitoring
  • Components: Loki (Logs), Mimir (Metrics), Alloy (Collector), k8s-monitoring (Meta-chart).
  • Source: Grafana Helm Charts

Components & Architecture#

1. Alloy (Collector)#

Alloy is deployed as a DaemonSet (and singleton Deployment via k8s-monitoring) to collect telemetry.

  • Configuration: Defined in apps/production/grafana-stack/config.alloy and injected via a ConfigMapGenerator named alloy-config (see the kustomization sketch after this list).
  • Pipelines:
    • Logs: Discovers Pods/Nodes, relabels Kubernetes metadata (namespace, pod, container), and pushes to loki-distributor.
    • Metrics: Scrapes Pods, Nodes (cAdvisor), OS metrics, and standard Kubernetes ServiceMonitors/PodMonitors.
    • Remote Write: Pushes metrics to http://mimir-nginx/api/v1/push.
    • Static Targets: Explicitly scrapes Docker (192.168.1.32) and Graphite Exporter.
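
The ConfigMap wiring is typically done in the overlay's kustomization. The sketch below assumes the generator sits alongside config.alloy in apps/production/grafana-stack and that the name-suffix hash is disabled so the chart can reference the ConfigMap by a fixed name; only the alloy-config name and the file path are taken from the repository.

```yaml
# apps/production/grafana-stack/kustomization.yaml (abridged, illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
configMapGenerator:
  - name: alloy-config
    files:
      - config.alloy
generatorOptions:
  # Assumed: keep a stable ConfigMap name so the Alloy chart can mount it
  # without tracking the generated hash suffix.
  disableNameSuffixHash: true
```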

2. K8s Monitoring#

A meta-chart (k8s-monitoring) orchestrates the monitoring configuration.

  • Version: 3.5.3
  • Cluster Name: k3s-prod
  • Integrations:
    • Cert-Manager: Auto-discovered.
    • Node Logs: Scrapes kubelet.service and containerd.service from systemd journal.
    • PrusaLink: Static scrape configuration for 3D printer metrics (192.168.1.32:10009).
  • Features Enabled: Annotation Autodiscovery (prometheus.io/scrape), Cluster Events, Cluster Metrics.
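
Putting the above together, the production HelmRelease for the meta-chart roughly takes the shape below. The HelmRepository name and reconciliation interval are assumptions, and the feature/integration values are elided because their exact keys depend on the chart version.

```yaml
# Illustrative sketch of the k8s-monitoring HelmRelease (values abridged)
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: k8s-monitoring
  namespace: monitoring
spec:
  interval: 30m             # assumed reconciliation interval
  chart:
    spec:
      chart: k8s-monitoring
      version: 3.5.3
      sourceRef:
        kind: HelmRepository
        name: grafana       # assumed HelmRepository name
  values:
    cluster:
      name: k3s-prod
    # Feature toggles (annotation autodiscovery, cluster events/metrics),
    # the cert-manager / node-logs / PrusaLink integrations and the Loki &
    # Mimir destinations are configured here as well; keys omitted for brevity.
```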

3. Loki (Logs)#

  • Deployment: Distributed microservices mode (implied by the loki chart).
  • Version: 6.x.x

4. Mimir (Metrics)#

  • Deployment: Distributed mode (mimir-distributed).
  • Version: 6.x.x

GitOps Strategy#

  • Base: apps/base/grafana-stack defines the HelmRepository and base HelmReleases.
  • Production: apps/production/grafana-stack applies overlays:
    • Loki/Mimir: Updates chart versions.
    • Alloy: Enables clustering and binds the custom config.alloy.
    • K8s-Monitoring: Defines the cluster-specific monitoring rules and destinations.
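
For example, pinning a chart version in the production overlay can be done with a small patch on the base HelmRelease, referenced from the overlay's patches list. The patch below is illustrative; the release name and the exact version string may differ from the repository's contents.

```yaml
# Illustrative Kustomize patch overriding the Loki chart version from the base
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki                # assumed HelmRelease name
  namespace: monitoring
spec:
  chart:
    spec:
      version: 6.x.x
```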

Observability Extras (o11y)#

The apps/production/o11y directory contains additional observability configurations.

Inactive Status

The files in apps/production/o11y are currently not included in the production kustomization.yaml resource list. These components are defined but not deployed.

Components#

1. API Server SLO (apiserver-slo.yaml)#

  • Kind: PrometheusServiceLevel (Sloth)
  • Objectives:
    • Availability: 99.9% (errors counted as HTTP 5xx and 429 responses).
    • Latency: 99% (requests completed in under 0.4s).
  • Alerts: Generates K8sApiserverAvailabilityAlert and K8sApiserverLatencyAlert.
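
A condensed sketch of what the Sloth resource likely looks like is shown below. The PromQL queries, labels, and the omission of the latency SLO are assumptions; the objective and alert name come from the manifest described above.

```yaml
# Illustrative Sloth SLO for API server availability (latency SLO omitted)
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: apiserver
  namespace: monitoring
spec:
  service: kubernetes-apiserver
  slos:
    - name: requests-availability
      objective: 99.9
      description: "Availability of the Kubernetes API server"
      sli:
        events:
          # Assumed queries: errors are 5xx and 429 responses
          errorQuery: sum(rate(apiserver_request_total{code=~"(5..|429)"}[{{.window}}]))
          totalQuery: sum(rate(apiserver_request_total[{{.window}}]))
      alerting:
        name: K8sApiserverAvailabilityAlert
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
```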

2. SNMP Probe (snmp-probe.yaml)#

  • Kind: Probe (Prometheus Operator)
  • Module: ubiquiti_unifi
  • Targets: Unifi devices (192.168.1.152, .181, .193).
  • Endpoint: docker.local.isabelsoler.es:9116
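
The Probe manifest for these devices presumably resembles the sketch below; the resource name is an assumption, while the module, targets, and prober endpoint are taken from the file.

```yaml
# Illustrative Probe for the Unifi devices via the SNMP exporter
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: unifi-snmp          # assumed name
  namespace: monitoring
spec:
  module: ubiquiti_unifi
  prober:
    url: docker.local.isabelsoler.es:9116
  targets:
    staticConfig:
      static:
        - 192.168.1.152
        - 192.168.1.181
        - 192.168.1.193
```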

3. Uptime Kuma (uptime-kuma.yaml)#

  • Kind: Probe
  • Target: status.local.isabelsoler.es
  • Auth: Uses Basic Auth from uptime-kuma-auth secret.
  • Relabeling: Rewrites monitor_url to mask credentials in PostgreSQL connection strings.
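
A fragment of the Probe spec illustrating the auth and masking pieces is sketched below; the secret keys and the regular expression are assumptions, not the repository's exact rule.

```yaml
# Illustrative fragment of the uptime-kuma Probe spec
spec:
  basicAuth:
    username:
      name: uptime-kuma-auth
      key: username           # assumed key
    password:
      name: uptime-kuma-auth
      key: password           # assumed key
  metricRelabelings:
    # Assumed regex: strip the password from postgres:// connection strings
    - action: replace
      sourceLabels: [monitor_url]
      regex: '(postgres[^:]*://[^:]+):[^@]+@(.*)'
      targetLabel: monitor_url
      replacement: '${1}:***@${2}'
```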

Recommendations#

Activate o11y

To deploy these probes and SLOs, add either a base reference (- ../base/o11y, if such a base exists) or direct file references (e.g., - o11y/apiserver-slo.yaml) to apps/production/kustomization.yaml, as sketched below.
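
A minimal fragment of that change, assuming the files keep their current paths:

```yaml
# apps/production/kustomization.yaml (fragment)
resources:
  # ...existing entries...
  - o11y/apiserver-slo.yaml
  - o11y/snmp-probe.yaml
  - o11y/uptime-kuma.yaml
```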

Sloth Integration

The root kustomization.yaml installs the Sloth CRDs. Enabling apiserver-slo.yaml would leverage this to automatically generate recording rules and burn-rate alerts for the Kubernetes API server.

Prometheus Blackbox Exporter#

The Blackbox Exporter allows probing of endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. It is used to monitor the external and internal availability of services.

Overview#

Configuration#

Modules#

The exporter is configured with a custom http_2xx module for HTTP checks:

  • Prober: http
  • Timeout: 5 seconds
  • Features: Follows redirects, prefers IPv4, and skips TLS verification (useful for internal self-signed certs or avoiding expiry noise).
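
In the exporter's own configuration format (typically supplied through the chart's config values), this module corresponds to roughly the following; the values mirror the bullets above, while the exact key layout in the release may differ.

```yaml
# Illustrative blackbox-exporter module matching the description above
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: true
```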

Service Monitor & Targets#

The deployment includes a ServiceMonitor that probes the following targets through the exporter, using the http_2xx module.

External Targets (Public Endpoints):

  • Docs: https://docs.igresc.com/
  • Personal Site: https://sergicastro.com/healthcheck
  • Authentik: https://auth.igresc.com/-/health/ready/
  • Immich: https://photos.igresc.com/
  • Jellyfin: https://jellyfin.igresc.com/
  • Mealie: https://mealie.igresc.com/
  • Navidrome: https://music.igresc.com/

Internal Targets (Cluster Internal):

  • Authentik Service: https://authentik-server.auth.svc.cluster.local/-/health/ready/
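
In the prometheus-blackbox-exporter chart these targets are usually declared under serviceMonitor.targets; an abridged sketch (only two of the endpoints listed above, target names assumed) follows:

```yaml
# Abridged sketch of the chart's ServiceMonitor target list
serviceMonitor:
  enabled: true
  targets:
    - name: docs                # assumed target name
      url: https://docs.igresc.com/
      module: http_2xx
    - name: authentik-internal  # assumed target name
      url: https://authentik-server.auth.svc.cluster.local/-/health/ready/
      module: http_2xx
```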

GitOps Strategy#

  • Base: apps/base/prometheus-blackbox defines the HelmRepository and the base HelmRelease.
  • Production: apps/production/prometheus-blackbox/blackbox-exporter-release.yaml patches the configuration to add the specific probe targets and module definitions.