Introduction

In a previous post, I demonstrated how to use Cython to wrap C functions in Slurm's sdiag utility and expose the metrics to Python.

In this guide, we will use PySlurm to graph the output from sdiag.


Prerequisites

This guide assumes you have a working installation of Slurm, an HPC workload manager, along with PySlurm and the Graphite suite.


Getting the Statistics

The sdiag utility is a diagnostic tool that reports statistics kept by the Slurm controller, including those for the Main scheduler and the Backfill scheduler. You can run it periodically, or as you make changes to the scheduler configuration in slurm.conf. However, if you want a historical view of these statistics, you can save them in a time-series database and graph them over time.
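To see exactly which statistics are available on your system, you can inspect the dictionary PySlurm returns in an interactive session; the exact keys depend on your Slurm and PySlurm versions:

import pyslurm

# Print every statistic reported by the controller.
# The available keys depend on the Slurm and PySlurm versions in use.
sdiag = pyslurm.statistics().get()
for key in sorted(sdiag):
    print("%s = %s" % (key, sdiag[key]))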

Let's start by writing a function to get the various stats and store them in a dictionary:

import pyslurm

def get_sched_stats():
    stats = {}

    try:
        sdiag = pyslurm.statistics().get()
    except Exception:
        # Couldn't retrieve the stats (e.g. slurmctld unreachable); skip this round
        return None
    else:
        # Slurmctld Stats
        stats["server_thread_count"] = sdiag.get("server_thread_count")
        stats["agent_queue_size"] = sdiag.get("agent_queue_size")

        # Jobs Stats
        stats["jobs_submitted"] = sdiag.get("jobs_submitted")
        stats["jobs_started"] = sdiag.get("jobs_started")
        stats["jobs_completed"] = sdiag.get("jobs_completed")
        stats["jobs_canceled"] = sdiag.get("jobs_canceled")
        stats["jobs_failed"] = sdiag.get("jobs_failed")

        # Main Scheduler Stats
        stats["main_last_cycle"] = sdiag.get("schedule_cycle_last")
        stats["main_max_cycle"] = sdiag.get("schedule_cycle_max")
        stats["main_total_cycles"] = sdiag.get("schedule_cycle_counter")

        if sdiag.get("schedule_cycle_counter") > 0:
            stats["main_mean_cycle"] = (
                sdiag.get("schedule_cycle_sum") / sdiag.get("schedule_cycle_counter")
            )
            stats["main_mean_depth_cycle"] = (
                sdiag.get("schedule_cycle_depth") / sdiag.get("schedule_cycle_counter")
            )

        if (sdiag.get("req_time") - sdiag.get("req_time_start")) > 60:
            stats["main_cycles_per_minute"] = (
                sdiag.get("schedule_cycle_counter") /
                ((sdiag.get("req_time") - sdiag.get("req_time_start")) / 60)
            )

        stats["main_last_queue_length"] = sdiag.get("schedule_queue_len")

        # Backfilling stats
        stats["bf_total_jobs_since_slurm_start"] = sdiag.get("bf_backfilled_jobs")
        stats["bf_total_jobs_since_cycle_start"] = sdiag.get("bf_last_backfilled_jobs")
        stats["bf_total_cycles"] = sdiag.get("bf_cycle_counter")
        stats["bf_last_cycle"] = sdiag.get("bf_cycle_last")
        stats["bf_max_cycle"] = sdiag.get("bf_cycle_max")
        stats["bf_queue_length"] = sdiag.get("bf_queue_len")

        if sdiag.get("bf_cycle_counter") > 0:
            stats["bf_mean_cycle"] = (
                sdiag.get("bf_cycle_sum") / sdiag.get("bf_cycle_counter")
            )
            stats["bf_depth_mean"] = (
                sdiag.get("bf_depth_sum") / sdiag.get("bf_cycle_counter")
            )
            stats["bf_depth_mean_try"] = (
                sdiag.get("bf_depth_try_sum") / sdiag.get("bf_cycle_counter")
            )
            stats["bf_queue_length_mean"] = (
                sdiag.get("bf_queue_len_sum") / sdiag.get("bf_cycle_counter")
            )

        stats["bf_last_depth_cycle"] = sdiag.get("bf_last_depth")
        stats["bf_last_depth_cycle_try"] = sdiag.get("bf_last_depth_try")

        return stats


Sending to Graphite

Graphite is a suite of three components: Carbon, Whisper, and Graphite-web. Carbon is the metric collector. Our script needs to send the statistics to Carbon, which stores the values in Whisper time-series databases. The graphs can then be visualized with Graphite-web.

The next part of our script defines a function that sends the metrics to Carbon using the pickle protocol.

import pickle
import socket
import struct
import time

def run(sock, delay):
    """Collect scheduler stats every `delay` seconds and ship them to Carbon."""
    while True:
        now = int(time.time())
        tuples = []

        stats = get_sched_stats()

        if stats is not None:
            prefix = "cluster.slurm_sched_stats.gauge-"
            for key in stats:
                tuples.append((prefix + key, (now, stats[key])))
            package = pickle.dumps(tuples, 1)
            size = struct.pack('!L', len(package))
            try:
                sock.sendall(size)
                sock.sendall(package)
            except socket.error:
                # Carbon may be temporarily unreachable; drop this batch and retry next cycle
                pass

        time.sleep(delay)

The run function takes two arguments: the socket connection to the Carbon port and a delay value in seconds. It packs the statistics into a list of tuples, sends them to Carbon, sleeps for a while, and then starts the whole process again. The delay value determines your graphs' resolution.
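Carbon's pickle listener expects a list of (path, (timestamp, value)) tuples, so a single batch built by the loop above looks roughly like this (timestamps and values are illustrative):

# Shape of one batch sent over Carbon's pickle protocol
tuples = [
    ("cluster.slurm_sched_stats.gauge-jobs_submitted", (1451606400, 125)),
    ("cluster.slurm_sched_stats.gauge-server_thread_count", (1451606400, 3)),
]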

The prefix value determines where each metric is stored; Carbon will create the Whisper databases in /path/to/graphite/storage/whisper/cluster/slurm_sched_stats/. Each metric name is also prefixed with gauge- to signify that the value is a gauge.

Some of the values are actually counters, but we will use Graphite's function API to take the derivative of each series and graph its rate of change.
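For example, assuming the metric path used above, a Graphite target like the following turns the cumulative jobs_submitted counter into per-interval deltas (perSecond() gives a true per-second rate instead):

nonNegativeDerivative(cluster.slurm_sched_stats.gauge-jobs_submitted)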


Bringing It Together

Now, we can wrap up these two functions in a main function to bring everything together.

import sys

CARBON_SERVER = "127.0.0.1"
CARBON_PICKLE_PORT = 2004
DELAY = 30

def main():
    sock = socket.socket()
    try:
        sock.connect((CARBON_SERVER, CARBON_PICKLE_PORT))
    except socket.error:
        raise SystemExit("Couldn't connect to %s on port %d. Is carbon-cache "
                         "running?" % (CARBON_SERVER, CARBON_PICKLE_PORT))

    try:
        run(sock, DELAY)
    except KeyboardInterrupt:
        sys.stderr.write("\nExiting on CTRL-c\n")
        sys.exit(0)


if __name__ == "__main__":
    main()

You will need to make sure that Carbon is configured to listen on the pickle receiver port. This is set in carbon.conf:

PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004

Additionally, and preferably before you create any graphs, you should set the storage schema for these metrics so that the retention matches your desired resolution. This is done in the storage-schemas.conf file. In this example, since we set DELAY to 30 seconds, we need to set the finest retention to 30s; otherwise the default of 10s will take effect and we will end up with gaps in the graphs.

Since the prefix is cluster.slurm_sched_stats.gauge-, we create the following schema definition in storage-schemas.conf:

[cluster]
pattern = ^cluster\.
retentions = 30s:1d, 5m:1y

The pattern line is a regex that matches any metric whose name begins with cluster. The retentions line tells Carbon to create a Whisper database for each matching metric in which 30-second values are kept for 1 day and then aggregated into 5-minute averages that are kept for 1 year. Adjust accordingly.

Carbon re-reads this file once a minute, so there's no need to restart anything here.
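After the first datapoints arrive, you can confirm the retention a Whisper file was created with using the whisper-info script that ships with Whisper; the script name and storage path below are examples and may differ on your installation:

whisper-info.py /path/to/graphite/storage/whisper/cluster/slurm_sched_stats/gauge-jobs_submitted.wsp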


Daemonizing the Script with Supervisor

The script runs in an infinite while loop, so it is well suited to running as a daemon. We will daemonize it with Supervisor.

The easiest way to install Supervisor is with pip:

pip install supervisor

If you are not already using Supervisor, you will have to generate a configuration file first.
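One way to do this is with the echo_supervisord_conf helper that ships with Supervisor (the destination path here is just an example; adjust it for your system):

echo_supervisord_conf > /etc/supervisord.conf

When the configuration file is ready, append the following definition for our script: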

[program:slurm-sched-stats]
command=/usr/bin/python /usr/local/libexec/slurm_sched_stats.py
directory=/usr/local/libexec/
stdout_logfile=/var/log/slurm-sched-stats.log
stdout_logfile_maxbytes=1MB
stdout_logfile_backups=3
redirect_stderr=true
loglevel=warn
user=somebody

Lastly, grab one of the user-contributed initscripts so that you can start and stop Supervisor easily. Once you have an initscript installed and added to chkconfig, you can then start Supervisor, which will in turn start the script:

service supervisord start
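Once Supervisor is up, you can check that the program defined above was started, using the program name from the config:

supervisorctl status slurm-sched-stats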


Creating a Dashboard in Grafana

If you are using Graphite, chances are you use a different frontend for creating graphs and dashboards. Grafana is an excellent frontend for visualizing Graphite metrics, among other data sources.

I mentioned earlier that all the metrics would be created as gauges and not counters. Let's take a look at one metric, Jobs Submitted. If we graph it as-is, the value grows continuously until Slurm resets its counters, typically at midnight UTC, and the graph takes on a sawtooth shape as the values drop back to zero and start growing again.

To get a rate-of-change view of Jobs Submitted, add the same metric to the graph, but apply the NonNegativeDerivative function to that series in order to calculate the deltas between datapoints.

You may want to do this for the following graphs:

  • Jobs Submitted
  • Jobs Started
  • Jobs Completed
  • Jobs Canceled
  • Jobs Failed
  • Total Cycles
  • Total Backfilled Jobs

Here is what the resulting graphs look like:

[Graph screenshots: Slurm Core States]


Conclusion

Using PySlurm with Graphite and Grafana, we can keep historical data for the statistics that sdiag reports. The backfill scheduler in particular has many options for tweaking the behavior of the algorithm, and these graphs can provide feedback after changing those values.

This gist contains the full slurm_sched_stats.py script.

This gist contains the full Grafana dashboard in JSON format, which can be imported into Grafana.

