Monitoring Home Network With Grafana, Loki, Prometheus, and Alloy in Docker (part 1)

Having gotten to three whole nodes in my home network, I’ve decided it’s about time to get some centralized monitoring going. To that end, I’ve gone with what I’m (vaguely) familiar with: Grafana’s open source stack for displaying logs and its accompanying pieces. Here, I’ll lay out approximately how I got working home monitoring set up after a couple of missteps.

Part 1 will focus on everything except Prometheus; we’ll come back to that in part 2.

The Grafana dashboard, showing details on my Mastodon node: free space left on the device and queries per second

It ain’t much, but it’s home.

Worth noting: this is not a very secure way to set up monitoring. I’ve left everything exposed on my intranet to make it easier to debug, and no authentication is done when sending log data. A more robust solution should take advantage of authentication features and lock the UIs behind an authenticating proxy.

Pieces used

Broadly speaking, this system goes collection → storage → display. For this setup, I’ve used the following:

  • Alloy (collection): Alloy is a Swiss army knife of a data collector that can output to multiple storage systems. Why Alloy? No particular reason other than that Promtail is deprecated (sidebar: it would be nice if Grafana’s own Getting Started documentation mentioned that before walking us through setting everything up with Promtail). Anyway, Alloy has an acquire → transform → output flow that’s pretty easy to get into.

  • Loki (storage): Loki takes lines of logs plus some associated metadata as a key-value list of labels. The labels go into a database index; the log lines themselves are compressed and stored. As a result, searches against Loki are fast when they filter on the indexed labels and slower when they have to scan individual log lines (compared to having all your logs in text and just ripgrep’ing across them). This is important to know when using Loki: if you can dice your data by those key-value labels, it’ll be fast (see the query sketch after this list).

  • Prometheus (storage): Prometheus is a time-series datastore; it stores values associated with specific points in time and lets you aggregate those values into statistical analyses. This differs from log lines mostly in data density: time-series datapoints are “pre-digested,” so the statistical analysis happens fast with no additional transformation of the data itself. (Grafana can do statistical analysis on data extracted from log lines, but since it has to run a transform on each line to go from text to numbers, that’s slow.) Alloy can emit to Prometheus in addition to Loki. I set up Prometheus in part 2.

  • Grafana (display): Grafana is capable of rendering data stored in multiple datastores (including Loki and Prometheus), building dashboards, and setting up alerts and notifications for unexpected events. This is my top-level.
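
To make the Loki point concrete, here is a sketch of what a query (Loki’s LogQL) can look like once logs carry labels; the node value and the search text are purely illustrative:

{node="bighost"} |= "error"

The {node="bighost"} selector is answered from the label index, so it is fast; the |= "error" part is a line filter that scans whatever lines survive the selector, which is where the grep-style cost lives.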

The setup

Now that all the actors are in place, here’s how they’re set up on my system.

  • One node hosts Loki, Prometheus, and Grafana. For real robustness you want to spread that kind of thing out so the loss of one node doesn’t wreck your ability to get logs, but this gets me started.

  • Each node hosts an Alloy instance that does local collection and pushes everything off to the host node.

For each of these, I set up a small Docker Compose and a set of config files to get it going.

The host node

The host node has one compose file configuring two containers, Loki and Grafana, living on a bridged network named loki.

services:
  loki:
    restart: always
    image: grafana/loki:3.4.1
    ports:
      - 3100:3100
    # The stock grafana/loki image defaults to -config.file=/etc/loki/local-config.yaml,
    # so point it at the mounted file explicitly or the config below gets ignored.
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - type: bind
        source: ./loki-config.yaml
        target: /etc/loki/loki-config.yml
    networks:
      - loki

  grafana:
    environment:
      - GF_PATHS_PROVISIONING=/etc/grafana/provisioning
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_FEATURE_TOGGLES_ENABLE=alertingSimplifiedRouting,alertingQueryAndExpressionsStepMode
    entrypoint:
      - sh
      - -euc
      - |
        mkdir -p /etc/grafana/provisioning/datasources
        cat <<EOF > /etc/grafana/provisioning/datasources/ds.yaml
        apiVersion: 1
        datasources:
        - name: Loki
          type: loki
          access: proxy
          orgId: 1
          url: http://loki:3100
          basicAuth: false
          isDefault: true
          version: 1
          editable: false
        EOF
        /run.sh        
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    networks:
      - loki

networks:
  loki:
    driver: bridge

All of this can be started with a simple docker compose up --detach once the config files are in place. The Grafana entrypoint executes a script that dumps a ds.yaml file describing my Loki datasource into the container. That’s a little silly; I could pull this out into a separate file mounted at a bound volume location (sketched below), but this works okay. Worth noting: Grafana is a bit stubborn about changing its datasource configuration after it boots up; it stores datasources in a local DB, and changing ds.yaml doesn’t automatically update them even if you restart the container. If there’s a better way to address this, I’d like to know what it is.
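
For reference, here is a minimal sketch of that alternative: the heredoc’s contents move into a file next to the compose file (grafana-datasources.yaml is just a name I picked) and get bind-mounted into Grafana’s default provisioning path, which lets the custom entrypoint go away. The other environment variables from the compose above carry over unchanged.

  grafana:
    image: grafana/grafana:latest
    volumes:
      # grafana-datasources.yaml holds the same apiVersion/datasources block
      # that the entrypoint script above writes out at startup
      - type: bind
        source: ./grafana-datasources.yaml
        target: /etc/grafana/provisioning/datasources/ds.yaml
    ports:
      - "3000:3000"
    networks:
      - loki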

The config file for Loki is basically pulled from the Grafana tutorial:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: debug
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

pattern_ingester:
  enabled: true
  metric_aggregation:
    loki_address: localhost:3100

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

Although these containers are running on a Docker network named loki, that network is bridged and the ports are published to the host, so we can confirm Grafana is up by going to (in my case) bighost.home:3000 for the Grafana UI and bighost.home:3100/ready to health-check Loki.
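
If you would rather check from a terminal, the same endpoints answer to curl (the hostname is mine; substitute your own):

# Loki answers "ready" once it has finished starting up
curl http://bighost.home:3100/ready

# Grafana's health endpoint returns a small JSON status blob
curl http://bighost.home:3000/api/health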

We look good!

Alloy on all nodes

Every node that I want to monitor has its own Alloy instance running, which gathers data from various services. These are also running in Docker; I have a separate docker-compose-alloy.yml file for those configs.

services:
  alloy:
    restart: always
    user: root
    image: grafana/alloy:latest
    volumes:
      - type: bind
        source: ./config.alloy
        target: /etc/alloy/config.alloy
      - type: bind
        source: /var/log
        target: /var/log
      - type: bind
        source: /proc
        target: /host/proc
        read_only: true
      - type: bind
        source: /
        target: /host/rootfs
        read_only: true
      - type: bind
        source: /sys
        target: /host/sys
        read_only: true
    ports:
      - 12345:12345
    environment:
      - NODE_NAME=$HOSTNAME
    entrypoint:
      - /bin/alloy
      - run
      - --server.http.listen-addr=0.0.0.0:12345
      - --storage.path=/var/lib/alloy/data
      - /etc/alloy/config.alloy
    networks:
      - loki

networks:
  loki:
    driver: bridge

Two things worth noting here:

  1. It took me entirely too long to remember that if I’m going to harvest data from logs, the Docker container needs to be able to see the host logs. Took me two hours to realize I hadn’t put the /var/log:/var/log binding rule in the volumes stanza.

  2. I want logs sent to Loki to be tagged with the hostname of the source. By default, this doesn’t happen; the NODE_NAME=$HOSTNAME environment variable binds to a rule in the Alloy config that does the job. This requires me to run the compose as HOSTNAME=$(hostname) docker compose -f docker-compose-alloy.yml up --detach (see the small wrapper sketch after this list).
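
Since I will inevitably forget that prefix, a trivial wrapper script keeps it in one place. This is purely a convenience sketch, not anything Alloy or Compose requires:

#!/bin/sh
# start-alloy.sh: launch the Alloy compose with the host's name exported
# so the NODE_NAME environment variable picks it up for labeling
HOSTNAME=$(hostname) exec docker compose -f docker-compose-alloy.yml up --detach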

I actually have a heterogeneous setup at home; most of my nodes are Ubuntu, but one is actually a Debian instance, for reasons. So I have two different config.alloy files based on what kind of node I’m on. For the Debian machine:

/* Config for Debian */
local.file_match "system" {
	path_targets = [{
		__address__ = "localhost",
		__path__    = "/var/log/*log",
	}]
}

loki.source.file "system" {
	targets               = local.file_match.system.targets
	forward_to            = [loki.write.default.receiver]
	legacy_positions_file = "/tmp/positions.yaml"
}

loki.write "default" {
	endpoint {
		url = "http://bighost.home:3100/loki/api/v1/push"
	}
	external_labels = {
		node        = sys.env("NODE_NAME"),
	}
}

These three parts define the rules for identifying files to scrape data from, scraping that data, and then writing that data out. local.file_match collects all the filenames matching the __path__ pattern; the identified files show up in local.file_match.<name of the rule>.targets. loki.source.file picks those up and generates individual log lines (with attached metadata); this rule specifies both which files to read and which rule to send the log entries on to (forward_to). The legacy_positions_file is a holdover from the fact that I created a Promtail config first; there is a migration tool that builds an Alloy config from a Promtail config, and it records a starting position for log harvesting so that Alloy won’t re-harvest anything Promtail already harvested.

The final rule, loki.write, specifies what URL to emit the logs to. We can attach additional key-value labels with external_labels; the one I add here exposes the hostname as node by pulling the value we set in the environment when we spun up the Docker container.
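
As a sketch of how that extends, any other static label can ride along in the same map; the env value below is invented purely for illustration:

loki.write "default" {
	endpoint {
		url = "http://bighost.home:3100/loki/api/v1/push"
	}
	external_labels = {
		node = sys.env("NODE_NAME"),
		env  = "home",
	}
}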

Once Alloy is spun up, it creates an endpoint at port 12345. I can look at bighost.home:12345 to confirm Alloy is online. The UI provides some pretty diagrams of what is running.

The Alloy UI, showing a graph of local.file_match “system” sending data to loki.source_file “system” and then on to loki.write “default”

With all the pieces spun up, we can switch over to Grafana and see the logs coming in.

The Grafana UI showing logs from one of the nodes

For the nodes running Ubuntu, I can additionally scrape the logs from systemd. There are just a couple more rules to do that.

loki.relabel "journal" {
  forward_to = []

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label = "unit"
  }

  rule {
    source_labels = ["__journal_errno"]
    target_label = "errno"
  }

  rule {
    source_labels = ["__journal_priority_keyword"]
    target_label = "severity"
  }
}

loki.source.journal "systemd" {
  forward_to    = [loki.write.default.receiver]
  relabel_rules = loki.relabel.journal.rules
  path          = "/var/log/journal"
  labels        = { component = "loki.source.journal" }
}

The systemd scraper’s flow runs a little backwards compared to the local file scraper: it pulls its relabeling rules in from a loki.relabel block, which remaps the journal’s field names to more common label names. The loki.source.journal block then reads the data from /var/log/journal, applies those relabel rules, and forwards the result to the same receiver as the rest of the logs.
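
With those labels in place, a query in Grafana’s Explore view can slice the journal data directly; the node, unit, and severity values here are just examples:

{node="bighost", unit="docker.service", severity="err"}

Because all three are indexed labels, that selection stays in the fast path before any line filtering happens.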

That’s enough to get us some basic capture of log data. In a subsequent post, we’ll talk about capturing time series data as well using Prometheus.
