Monitoring Windows system metrics with Grafana

[Image: grafana home] Beautiful dashboards for your home components

Update: 2017-10-30: Updated with my new preferred graphite docker image.

Update: 2017-11-01: I’ve open-sourced a Windows utility that will export Open Hardware sensor data into Graphite. It adds GPU, power, temperature, and more, which won’t be covered in this post. Check it out! The follow-up post contains a little more context.

At work we’re using Grafana for realtime visualization dashboards. We recently started this work and have had a blast creating dashboards and teasing new insights from data. Last month, I attended Grafanacon 2016, and one of the themes was how prevalent Grafana has become, with screenshots of Microsoft, NASA, and Intel using Grafana dashboards in advertisements. Anyway, a coworker linked to a Grafana dashboard of someone’s Plex setup running on top of ESXi.

[Image: esxi]

I run Hyper-V (I’ve written Linux Virtualization with a Mounted Window’s Share on Client Hyper-V) and don’t use Plex, but I wanted to know if I could create a dashboard for my machine as well. Since Windows already publishes metrics via Performance Counters, I wanted to visualize them with Grafana. Here’s how we’ll do it:

  • The host machine already exposes performance counters.
  • Install telegraf, a single-binary time series data collector, onto the host.
  • Run Graphite within a docker container on a virtual machine.
  • Install Grafana onto the virtual machine.

This may be overcomplicating things, as both docker and Grafana can be installed on Windows. Indeed, docker provides thorough information for Windows users. But I like to keep my host machine as clean as it can be, and I still feel like docker and Grafana treat Windows as a second-class citizen [citation needed].

Telegraf

After downloading the Telegraf Windows bundle, the default config is pretty good, but could use some updating for my use case. Below is the config that I’m currently using:

# Telegraf configuration

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "5s"

  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "5s"

  ## Run telegraf in quiet mode
  quiet = false
  ## Specify the log file name. The empty string means to log to stdout.
  logfile = "D:/logs/telegraf.log"

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for Graphite server to send metrics to
[[outputs.graphite]]
  servers = ["192.168.137.201:2003"]
  prefix = ""
  template = "measurement.tags.field"
  timeout = 2

###############################################################################
#                                  INPUTS                                     #
###############################################################################

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Processor usage, alternative to native, reports on a per core.
    ObjectName = "Processor"
    Instances = ["*"]
    Counters = [
      "% Idle Time",
      "% Interrupt Time",
      "% Privileged Time",
      "% User Time",
      "% Processor Time",
    ]
    Measurement = "win_cpu"
    # Set to true to include _Total instance when querying for all (*).
    IncludeTotal=true

  [[inputs.win_perf_counters.object]]
    # Disk times and queues
    ObjectName = "LogicalDisk"
    Instances = ["*"]
    Counters = [
      "Free Megabytes",
      "Disk Reads/sec",
      "Disk Writes/sec",
      "Disk Bytes/sec",
      "% Idle Time",
      "Current Disk Queue Length",
    ]
    Measurement = "win_disk"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

  [[inputs.win_perf_counters.object]]
    ObjectName = "Network Interface"
    Counters = [
      "Bytes Sent/sec",
      "Bytes Received/sec",
    ]
    Instances = ["*"]
    Measurement = "win_network"

  [[inputs.win_perf_counters.object]]
    # Example query where the Instance portion must be removed to get data back,
    # such as from the Memory object.
    ObjectName = "Memory"
    Counters = [
      "Available Bytes",
      "Committed Bytes",
      "Cache Bytes",
      "Pages/sec",
      "Pool Nonpaged Bytes",
      "Pool Paged Bytes",
    ]
    # Use 6 x - to remove the Instance bit from the query.
    Instances = ["------"]
    Measurement = "win_mem"
    # Set to true to include _Total instance when querying for all (*).
    #IncludeTotal=false

Notes:

  • Collection interval is set to 5s, as the default of 10s seems too slow and values smaller than 5s tend to time out while gathering the metrics.
  • Moved the logfile to my D: drive; my C: drive is a fast SSD, and it’s not worth storing logs there.
  • The one output is our graphite instance running on our VM. There may be something wrong with domain name resolution in Go (the language telegraf is written in), as only the IP address worked.
  • The graphite template removes the host segment and swaps measurement with tags, as measurement will be something like win_disk and one of the tags will be C:. The one thing I dislike about telegraf’s data output is that I can’t choose to filter out tags, so even useless tags get written to graphite.
  • I set IncludeTotal for processors, as it gives a good overview of usage across all processors, which is convenient for calculations and also helpful if the number of processors ever changes.
  • For LogicalDisk, I prefer seeing disk throughput instead of percentages.
  • Removed the System performance metrics. I don’t think these metrics are useful for the average computer user.
  • For Memory, I prefer to see committed bytes as well as available bytes.
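
To make the template note concrete, here is a rough Python sketch of how a `measurement.tags.field` template could turn one metric into a Graphite path. This is not telegraf's serializer: the function name `graphite_path`, the example tag names (`instance`, `objectname`), the example hostname, the choice to drop `host`, and the replacement of dots and spaces with underscores are all assumptions for illustration, and real output may order or sanitize tags differently.

```python
def graphite_path(measurement, tags, field, drop=("host",)):
    """Join measurement, tag values, and field into a dotted Graphite path.

    Simplified illustration only; telegraf's real serializer has more rules.
    """
    # Take tag values in sorted key order, skipping the tags we chose to drop.
    tag_values = [tags[key] for key in sorted(tags) if key not in drop]
    parts = [measurement, *tag_values, field]
    # Dots separate path segments and spaces end the path in Graphite's
    # plaintext format, so this sketch replaces both inside each segment.
    cleaned = [part.replace(".", "_").replace(" ", "_") for part in parts]
    return ".".join(cleaned)


# Hypothetical metric for the C: drive.
example = graphite_path(
    "win_disk",
    {"host": "DESKTOP", "instance": "C:", "objectname": "LogicalDisk"},
    "Free Megabytes",
)
print(example)
```

With measurement first, every disk ends up grouped under `win_disk` in the Graphite tree, which is why extra tags that cannot be filtered out show up as additional path segments.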

Service

Running it at the command line is fine for testing, but having telegraf start on boot is ideal. Telegraf provides a way to install itself as a service. Since I installed telegraf into C:\Apps\telegraf, I ran the service command as follows:

.\telegraf.exe --config C:\Apps\telegraf\telegraf.conf --service install

Graphite

Update: 2017-10-30: My new favorite graphite docker image is the official one, graphiteapp/graphite-statsd, so installation may be a little different. I also removed the configuration that started graphite on boot, as docker handles that natively through the --restart command-line option.

Next comes Graphite and its bad rap for being hard to install. Yes, a lot of it is rooted in truth, and work needs to be done to make installation easier, but people tend to make mountains out of molehills. Anyone who makes technical decisions solely on the difficulty of installation would be missing out. Instead, we’re just going to use docker to set up our instance. My favorite graphite installation is praekeltfoundation/graphite because it doesn’t come with a significant amount of unneeded programs. The only thing we need to change is the data retention policy: we want to store data at five-second resolution, not one-minute resolution.

# Options go before the image name; -d keeps the container running in the background
docker run -d --name init_graphite praekeltfoundation/graphite
docker exec -it init_graphite sed -i -e 's/1m:1d/5s:1d/' conf/storage-schemas.conf
docker commit init_graphite my_graphite
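
As a rough check on what the `5s:1d` retention costs, the arithmetic below estimates how many data points and how much disk each metric needs. It assumes Whisper's usual on-disk layout (16 bytes of file metadata, 12 bytes per archive header, 12 bytes per point); the helper names `to_seconds` and `whisper_size` are made up for this sketch.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(spec):
    """Convert a duration such as '5s' or '1d' into seconds."""
    return int(spec[:-1]) * UNITS[spec[-1]]


def whisper_size(retention):
    """Return (points, bytes) for a single-archive retention like '5s:1d'."""
    precision, duration = retention.split(":")
    points = to_seconds(duration) // to_seconds(precision)
    # Assumed layout: 16-byte metadata, one 12-byte archive header,
    # then 12 bytes (timestamp + value) for every stored point.
    return points, 16 + 12 + points * 12


for retention in ("1m:1d", "5s:1d"):
    points, size = whisper_size(retention)
    print(f"{retention}: {points} points, about {size / 1024:.0f} KiB per metric")
```

Five-second resolution stores twelve times as many points per metric as one-minute resolution, which is still a small amount of disk for a single machine's counters.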

Now that we have our image, let’s boot it up!

docker run -p 2003:2003 -p 8000:8000 \
  -v /home/nick/docker/graphite:/opt/graphite/storage \
  --restart "unless-stopped" \
  --name graphite_server my_graphite

  • We expose port 2003, which is the inbound port for metrics data.
  • We also expose port 8000 so Grafana can talk to the Graphite render API.
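
To check both ports by hand, the sketch below formats a metric in Graphite's plaintext protocol (`path value timestamp`, one metric per line) for port 2003 and builds a render API URL for port 8000. The IP address is the one from the telegraf config above; the metric name `test.manual.value` and the helper names are placeholders, and nothing is sent over the network unless you call `send_metric` yourself.

```python
import socket
import time
from urllib.parse import urlencode

GRAPHITE_HOST = "192.168.137.201"  # VM address from the telegraf config


def plaintext_line(path, value, timestamp=None):
    """Format one metric for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host=GRAPHITE_HOST, port=2003):
    """Send a single metric over TCP (needs a reachable Graphite instance)."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(plaintext_line(path, value).encode("ascii"))


def render_url(target, host=GRAPHITE_HOST, port=8000, since="-5min"):
    """Build a render API URL that returns the series as JSON."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"http://{host}:{port}/render?{query}"


print(plaintext_line("test.manual.value", 42, 1500000000), end="")
print(render_url("test.manual.value"))
```

Calling `send_metric("test.manual.value", 42)` and then opening the printed URL in a browser is a quick way to confirm that data flows in on 2003 and back out on 8000 before involving telegraf or Grafana.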

Grafana

Grafana is the easiest step.

  • Follow the install instructions.
  • Log in with the default credentials (admin, admin).
  • Add Graphite as a data source.
  • Start visualizing.
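
The data source can also be added without clicking through the UI. The sketch below builds (but does not send) a request to Grafana's HTTP API endpoint `POST /api/datasources`. It assumes Grafana listens on its default port 3000 on the same VM, that the default admin/admin login is still in place, and that Graphite is reachable on port 8000 as configured above; the data source name `graphite` is arbitrary.

```python
import base64
import json
from urllib.request import Request, urlopen

GRAFANA_URL = "http://192.168.137.201:3000"   # assumed: Grafana on the VM, default port
GRAPHITE_URL = "http://192.168.137.201:8000"  # Graphite web port exposed by docker


def datasource_request(user="admin", password="admin"):
    """Build the HTTP request that registers Graphite as a Grafana data source."""
    payload = {
        "name": "graphite",
        "type": "graphite",
        "url": GRAPHITE_URL,
        "access": "proxy",  # Grafana's backend queries Graphite on our behalf
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


req = datasource_request()
print(req.get_method(), req.full_url)
# To actually create the data source, uncomment the next line:
# urlopen(req)
```

Change the admin password after the first login; the default credentials are only there to get you started.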

Result

[Image: grafana home]

Future

  • Hardware temperature
  • GPU metrics

These two will probably require information outside of telegraf. I’m hoping to hitch a ride off Open Hardware Monitor.

See the follow-up post for how to include GPU, temperature, and power metrics.
