Introduction

Collectd comes with a long list of plugins. In a previous article, we showed how to install the necessary dependencies to enable most of the collectd plugins when compiling collectd from source.

One of these plugins is the Python plugin, which is a binding that allows you to write Python scripts that behave as custom read and write plugins for collectd. With this plugin enabled, the collectd daemon will embed a Python interpreter and can directly run Python scripts without having to launch a new interpreter every time the scripts need to run.

In this article, we will demonstrate how to write a Python script for collectd that will graph CPU cores by state from Slurm, an HPC resource manager.


Prerequisites

This article assumes you have some familiarity with collectd and have an existing Graphite installation. For instructions on how to install Graphite, read the installation docs.

Furthermore, the details of Slurm are outside the scope of this article; it is only used as a real-world example.

Lastly, of the Python modules used in the example below, the PyParsing module is not a core Python module. You can, however, install it easily with pip:

sudo pip install pyparsing
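
You can quickly confirm that the module is importable before wiring it into collectd; pyparsing exposes a __version__ attribute:

python -c "import pyparsing; print(pyparsing.__version__)"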


Enabling the Python Plugin

The default collectd.conf comes with the Python plugin disabled:

## <LoadPlugin python>
##   Globals true
## </LoadPlugin>

Simply uncomment the stanza to have collectd load this plugin at run time:

<LoadPlugin python>
  Globals true
</LoadPlugin>
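
After editing the configuration, you can ask collectd to check it without starting the daemon; the -t flag performs a configuration test:

sudo collectd -t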


Writing a Python Read Plugin

The collectd-python manpage documents how to use the Python plugin structure to create your own plugins. We will walk through writing a script, section by section, to understand how to gather and dispatch metrics to collectd.


Importing Modules

The first part of our script will import the necessary modules.

import collectd
import signal
import subprocess
from pyparsing import Word, alphanums, nums

The first import statement loads the collectd module, which provides the interface for registering read and write functions with collectd and for dispatching values to the daemon. The remaining modules are used elsewhere in the script to gather information.

Gathering Metrics

The purpose of the plugin is to gather metrics and feed them to collectd. From there, collectd sends them wherever it is configured to, for example to RRD files, to Graphite, or to another time-series database.

Our example uses the subprocess module to execute a Slurm command, sinfo, that returns the number of cores by state per partition in a parsable format. It then uses the pyparsing module to parse the output, iterates over the parsed results, and returns a dictionary, which will be consumed by the read_callback function.
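
For reference, sinfo -o '%P %C' --noheader prints one line per partition: the partition name (a trailing * marks the default partition) followed by core counts in allocated/idle/other/total form. Hypothetical output for a small two-partition cluster might look like this:

debug* 12/52/0/64
compute 1024/512/64/1600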

def get_cpus_by_state():
    """Returns a dictionary of CPU cores and their states, per Slurm
    partition."""

    cores = {}

    cores['all-allocated'] = 0
    cores['all-idle'] = 0
    cores['all-other'] = 0
    cores['all-total'] = 0

    sinfo = "/path/to/sinfo"

    try:
        # `sinfo -o '%P %C' --noheader` prints one line per partition,
        # e.g. "debug* 12/52/0/64" (allocated/idle/other/total cores).
        sinfo_output = subprocess.Popen(
            [sinfo, '-o', '%P %C', '--noheader'],
            stdout=subprocess.PIPE).communicate()[0]
        stdout = sinfo_output.decode().strip()

        if not stdout:
            return

        # A partition name (with an optional trailing '*' marking the
        # default partition) followed by the slash-separated core counts.
        part_parse = Word(alphanums + '*') + Word(nums + '/')
        output = part_parse.scanString(stdout)

        for tokens, _, _ in output:
            queue = tokens[0].strip('*')
            allocated, idle, other, total = tokens[1].split('/')

            cores[queue + '-allocated'] = int(allocated)
            cores[queue + '-idle'] = int(idle)
            cores[queue + '-other'] = int(other)
            cores[queue + '-total'] = int(total)

            cores['all-allocated'] += int(allocated)
            cores['all-idle'] += int(idle)
            cores['all-other'] += int(other)
            cores['all-total'] += int(total)

        return cores

    except OSError:
        # sinfo could not be executed; skip this read cycle
        return

Taking care of SIGCHLD

This particular example uses the subprocess module to fork a child process that executes the sinfo command. The Python plugin is the parent of this forked process, and upon completion the child sends a SIGCHLD signal to its parent. Collectd, however, ignores SIGCHLD by default, so the child is reaped before subprocess can collect its exit status. As a result, the Python plugin throws an OSError exception when it tries to wait for the terminated child, and collectd logs the following error:

Unhandled python exception in read callback: OSError: [Errno 10] No child processes

This function will restore the default SIGCHLD behavior so that the plugin can create new processes without throwing exceptions.

def restore_sigchld():
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)

Defining a Read Callback

The purpose of the read callback is to set up the Values object with the parameters of the metrics you are dispatching to collectd. It sets:

  • host (slurm, which is not a real host)
  • plugin name (core_states)
  • type of metric (gauge)
  • submit interval (30 seconds)

This is required to abide by collectd's naming schema. The resulting metric name will therefore be slurm.core_states.gauge-<metric_name>, where metric_name refers to each of the keys in the dictionary returned by the get_cpus_by_state() function. Finally, the metric is dispatched to collectd.
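
For example, given a partition named debug (matching the hypothetical sinfo output earlier), the dispatched metrics would include names such as:

slurm.core_states.gauge-all-allocated
slurm.core_states.gauge-all-total
slurm.core_states.gauge-debug-allocated
slurm.core_states.gauge-debug-idle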

def read_callback(data=None):
    """ Callback function for dispatching data into collectd"""

    cores = get_cpus_by_state()

    if not cores:
        return

    # dispatch one gauge value per key in the partition output
    for key in cores:
        metric = collectd.Values()
        metric.plugin = 'core_states'
        metric.host = 'slurm'
        metric.interval = 30
        metric.type = 'gauge'
        metric.type_instance = key
        metric.values = [cores[key]]
        metric.dispatch()

Registering Functions

The final lines of the script register two callbacks: an init function that restores the default SIGCHLD behavior, and a read function that periodically runs read_callback, which in turn calls get_cpus_by_state() to gather metrics and dispatch the values to the daemon.

collectd.register_init(restore_sigchld)
collectd.register_read(read_callback)
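
As an aside, collectd.register_read also accepts an optional interval argument (see collectd-python(5)), in case you prefer to set the polling interval at registration time rather than on each Values object; in that case the registration line becomes:

collectd.register_read(read_callback, 30)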

Save all the bits in a file, such as slurm_core_states.py.
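
Because the collectd module only exists inside the daemon's embedded Python interpreter, importing this script from a regular shell raises an ImportError. If you want to exercise the parsing logic on its own, one option is to stub out the module first. The snippet below is a hypothetical test harness, not part of the plugin:

import sys
import types

# Stub the collectd module so the plugin can be imported outside the daemon.
fake_collectd = types.ModuleType('collectd')
fake_collectd.register_init = lambda func: None
fake_collectd.register_read = lambda func, *args: None
sys.modules['collectd'] = fake_collectd

import slurm_core_states
print(slurm_core_states.get_cpus_by_state())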


Configuring the Python Plugin Section

Now that we have enabled the Python plugin in collectd and written a read plugin, the last thing needed is to configure the plugin in collectd's configuration file. Add the following configuration stanza to collectd.conf:

<Plugin python>
  ModulePath "/path/to/python/plugin"
  LogTraces true
  Interactive false
  Import "slurm_core_states"
</Plugin>

The ModulePath option prepends the path to sys.path, thus making your plugin available to Python when importing it. Otherwise, you would get an ImportError exception.
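
Conceptually, the effect is similar to the following sketch (assuming the ModulePath shown above):

import sys
sys.path.insert(0, "/path/to/python/plugin")
import slurm_core_states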

The LogTraces option will log any exception thrown by the plugin along with its full stack trace. This is useful for debugging.

The Interactive option, when enabled, launches an interactive Python interpreter. This script does not run interactively, so the option is set to false.

The Import option imports our plugin into collectd's embedded Python interpreter.


Verifying

After restarting the collectd service, check the logs to make sure there are no issues with the plugin. If there are no warnings or errors in the logs, verify that the plugin is working by querying the metrics.

This step will vary depending on where your metrics are stored. For example, if the metrics are written to a Graphite instance, you could query the metrics using the Graphite URL API:

curl -s "http://graphite.example.com/render?target=slurm.core_states.gauge-all-allocated&from=-1mins&format=json" | python -m json.tool

This fetches the cumulative number of allocated cores across all Slurm partitions, in JSON format. The last two datapoints are shown below:

[
    {
        "datapoints": [
            [
                19488.0,
                1435969380
            ],
            [
                19488.0,
                1435969410
            ]
        ],
        "target": "slurm.core_states.gauge-all-allocated"
    }
]

And finally, here is what a Graphite graph would look like:

[Figure: Slurm Core States]


Conclusion

This is just one example of how to use Python and collectd to gather metrics and visualize them. You could write a plugin with a similar structure to dispatch custom metrics to your metric collector, such as the Graphite suite or any other time-series database.

