Bergamot

Bergamot is a scalable, distributed monitoring system which offers an easy migration path from Nagios. It is written in Java and makes use of RabbitMQ to distribute checks. It can read the Nagios object configuration and execute Nagios checks.

Scripted HTTP Checks

Bergamot Monitoring V2.0.0 (Yellow Sun) introduces a scripted HTTP check engine, allowing you to control HTTP checks via Javascript. So... what is so cool about this?

Well, it allows you to implement checks which call an HTTP-based API using nothing but configuration in Bergamot Monitoring. If an application, product, or service has an API, you can implement a customised check really easily, without the need to deploy anything.

As an example of this, a number of RabbitMQ checks are provided in the default Bergamot Monitoring site configuration templates. These checks use a Javascript snippet within the command definition to define the logic of the check. The check makes a call to the RabbitMQ HTTP REST API, which returns a JSON response. This JSON is then parsed (just as you would in the browser) and the check logic applied.

Obviously this technique can be used to implement a whole raft of checks, especially with the growing number of things which provide an HTTP API.

Implementing A Check

So, let's look at how we implement a check. The following is the definition of a command to check the number of active connections to a RabbitMQ server.

    <command name="rabbitmq_active_connections" extends="http_script_check">
        <summary>RabbitMQ Active Connections</summary>
        <parameter name="host">#{host.address}</parameter>
        <parameter name="port">15672</parameter>
        <parameter name="username">monitor</parameter>
        <parameter name="password">monitor</parameter>
        <parameter description="Warning threshold" name="warning">20</parameter>
        <parameter description="Critical threshold" name="critical">50</parameter>
        <script>
        <![CDATA[
            /* Validate parameters */
            bergamot.require('host');
            bergamot.require('port');
            bergamot.require('username');
            bergamot.require('password');
            bergamot.require('warning');
            bergamot.require('critical');

            /* Call the RabbitMQ HTTP API */
            http.check()
            .connect(check.getParameter('host'))
            .port(check.getIntParameter('port'))
            .get('/api/overview')
            .basicAuth(check.getParameter('username'), check.getParameter('password'))
            .execute(
                function(r) {
                    if (r.status() == 200)
                    { 
                        var res = JSON.parse(r.content());
                        bergamot.publish(
                            bergamot.createResult().applyGreaterThanThreshold(
                                res.object_totals.connections,
                                check.getIntParameter('warning'),
                                check.getIntParameter('critical'),
                                'Active connections: ' + res.object_totals.connections
                            )
                        );
                        bergamot.publishReadings(
                            bergamot.createLongGaugeReading('connections', null, res.object_totals.connections, check.getLongParameter('warning'), check.getLongParameter('critical'), null, null)
                        );
                    }
                    else
                    {
                        bergamot.error('RabbitMQ API returned: ' + r.status());
                    }
                }, 
                function(e) { 
                    bergamot.error(e); 
                }
            );
        ]]>
        </script>
        <description>Check RabbitMQ active connections</description>
    </command>

No doubt the above block of XML configuration is somewhat bewildering at first glance, so let's break it down. For the purpose of this article, we will only look at the code defined in the script element.

First off, the script starts with some basic validation: the following lines simply require that a value has been specified for each parameter.

bergamot.require('host');
bergamot.require('port');
bergamot.require('username');
bergamot.require('password');
bergamot.require('warning');
bergamot.require('critical');

Next, we construct the HTTP call to make; a fluent-style interface is used to build the HTTP request.

http.check()
.connect(check.getParameter('host'))
.port(check.getIntParameter('port'))
.get('/api/overview')
.basicAuth(check.getParameter('username'), check.getParameter('password'))

When constructing the HTTP request, parameters are fetched using:

check.getParameter('host')
check.getIntParameter('port')

Once the HTTP request is defined, it is executed asynchronously. One of two functions will be called back when the request is complete: the first function is the on-success callback, the second the on-error callback.

.execute(
    function(r) {
        if (r.status() == 200)
        { 
            var res = JSON.parse(r.content());
            bergamot.publish(
                bergamot.createResult().applyGreaterThanThreshold(
                    res.object_totals.connections,
                    check.getIntParameter('warning'),
                    check.getIntParameter('critical'),
                    'Active connections: ' + res.object_totals.connections
                )
            );
            bergamot.publishReadings(
                bergamot.createLongGaugeReading('connections', null, res.object_totals.connections, check.getLongParameter('warning'), check.getLongParameter('critical'), null, null)
            );
        }
        else
        {
            bergamot.error('RabbitMQ API returned: ' + r.status());
        }
    }, 
    function(e) { 
        bergamot.error(e); 
    }
);

In the event the HTTP call returns 200 (OK), we publish a result based on the data returned. First we need to parse the JSON response, using the normal JSON.parse method; here r.content() returns the content of the response as a string. Once we've parsed the response, we apply a threshold decision based on the object_totals.connections property of the response. The warning and critical threshold parameters are used to decide the state of the check: if the value is greater than the critical threshold a critical result is published; if the value is greater than the warning threshold a warning result is published; otherwise an OK result is published.

var res = JSON.parse(r.content());
bergamot.publish(
    bergamot.createResult().applyGreaterThanThreshold(
        res.object_totals.connections,
        check.getIntParameter('warning'),
        check.getIntParameter('critical'),
        'Active connections: ' + res.object_totals.connections
    )
);

After the result has been published, a metric reading is published; this is used to build a graph of the active connections into RabbitMQ. The function bergamot.publishReadings publishes a set of readings, and a long gauge reading is created using bergamot.createLongGaugeReading. This takes a few arguments: the name, the unit of measure, the value, the warning threshold, the critical threshold, the minimum and the maximum. In this instance the reading name is connections and there is no unit of measure. The value is taken from the object_totals.connections property, the warning and critical thresholds are taken from the defined parameters, and min and max are null as they are not applicable in this use case. Note that all value arguments to create a long gauge must be of type long (or null). Note also that the reading name must be unique within a command definition: you can't publish two readings with the same name, and you cannot change the type of a reading.

bergamot.publishReadings(
    bergamot.createLongGaugeReading('connections', null, res.object_totals.connections, check.getLongParameter('warning'), check.getLongParameter('critical'), null, null)
);

In the event the HTTP API does not return a 200 (OK) response, an error result is published.

bergamot.error('RabbitMQ API returned: ' + r.status());

In the event we hit any other exception, for example not being able to connect to the host or an error in the Javascript, the on error callback function is invoked. This callback simply publishes an error result, using the exception as the error message.

bergamot.error(e);

The great thing about this approach is that new checks can be defined using nothing but configuration. Nothing needs to be deployed to worker servers or target hosts.

Developing A Monitoring System

I currently spend most of my spare time developing Bergamot Monitoring. Developing a monitoring system throws up some interesting challenges, and I want to discuss some of the things that I, as a web developer, have realised during the course of this project.

Caching

Caching is a technique often used by web applications to improve performance. It is often applied at multiple levels within a web application: data layer caching, view caching, etc. For most web applications, the caching of a rendered page provides a massive performance gain. However, for a monitoring system, caching is next to useless. The key issue with monitoring systems is that everything changes, and changes often.

In the worst case (with defaults) a check could be executed every minute by Bergamot. Oh, and users need to know the second that something changes; after all, that is the point of a monitoring system. This means a view is guaranteed to change within one minute, so there is little point in caching it.

This problem is compounded by group views, where the results of multiple checks are displayed. Even on a modest-sized system, these views could change every 10 seconds. On larger deployments, the state of a group can change multiple times a second.

The core issue with monitoring is that stuff is changing all the time.

Coherency

Following on from the caching problem and the constant-change problem, users need to see consistent results. To scale, the resources of multiple servers are needed; but unlike simpler web applications, coherency needs to be managed across these machines.

When the result of a check is processed, all data caches need to be updated and invalidated coherently across the cluster. This is further complicated by the result transition logic being transactional.

Recursive SQL

Groups form a hierarchical tree, where child groups exist under a parent; fairly normal stuff really. However, the state of the child groups needs to be encompassed by the parent group. In other words, to compute the state of a group at any moment in time, we need to compute the state of the whole tree beneath it.

If you naively attempt to solve that problem in your application, be prepared for the performance and scalability hit. The latency of round-tripping to the database (even on localhost) completely kills performance, due to the sheer volume of queries that need to be executed (even for a trivially small system).

Enter the joys of recursive SQL, where we can compute the state of the entire graph with a single query (albeit about 30 lines of SQL). SQL is an often underused powerhouse of data querying and manipulation: fuck that ORM and spend the time learning proper SQL, you'll thank yourself that you did.
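
As a rough sketch of the idea (this is not Bergamot's actual query; the site_group and check_state tables and the numeric status column are hypothetical names used purely for illustration), a recursive common table expression can walk a group and all of its descendants and roll the worst check status up in one statement:

    -- Hypothetical schema: roll the worst check status up from a group
    -- and all of its descendant groups in a single round trip.
    WITH RECURSIVE group_tree AS (
        -- anchor: the group we want the state of
        SELECT g.id
          FROM site_group g
         WHERE g.id = $1
        UNION ALL
        -- recursive step: walk down to every descendant group
        SELECT c.id
          FROM site_group c
          JOIN group_tree t ON c.parent_id = t.id
    )
    SELECT max(s.status) AS worst_status   -- e.g. 0 = OK, 1 = WARNING, 2 = CRITICAL
      FROM group_tree t
      JOIN check_state s ON s.group_id = t.id;

One round trip to the database, however deep the tree happens to be.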

Message Queues

Message queues are awesome. Bergamot makes use of RabbitMQ to pass messages between nodes, and this is how it distributes work across multiple servers.

We push a lot of routing logic down into RabbitMQ, using features such as exchange-to-exchange bindings, alternate exchanges, per-message time to live, and dead lettering. This gives Bergamot a really flexible routing model without having to implement any of the mechanics itself.
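
As a rough sketch of what pushing routing into RabbitMQ looks like (not Bergamot's actual topology; the exchange and queue names are made up, and the Node amqplib client is used purely for brevity, whereas Bergamot itself is Java):

    // Sketch of the RabbitMQ routing features mentioned above, using amqplib.
    // Exchange and queue names are hypothetical, not Bergamot's real topology.
    const amqp = require('amqplib');

    async function setupRouting() {
        const conn = await amqp.connect('amqp://localhost');
        const ch = await conn.createChannel();

        // Alternate exchange: anything the main exchange cannot route ends up here.
        await ch.assertExchange('checks.unroutable', 'fanout');
        await ch.assertExchange('checks', 'topic', { alternateExchange: 'checks.unroutable' });

        // Exchange-to-exchange binding: route one site's checks on to a per-site exchange.
        await ch.assertExchange('checks.site1', 'topic');
        await ch.bindExchange('checks.site1', 'checks', 'site1.#');

        // Dead lettering plus TTL: stale work is diverted rather than silently dropped.
        await ch.assertExchange('checks.dead', 'fanout');
        await ch.assertQueue('checks.site1.worker', {
            messageTtl: 60000,                 // queue-level TTL; a per-message TTL can be set on publish
            deadLetterExchange: 'checks.dead'
        });
        await ch.bindQueue('checks.site1.worker', 'checks.site1', '#');
    }

    setupRouting();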

A word of advice: get a piece of paper and sketch out your routing before you code it up.

Websockets

Websockets are seriously cool: they allow Bergamot to update checks in real time. Websockets bring true push messaging to the web, and the technology should not be overlooked; it is super easy to use from the browser.
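
For illustration, the browser side needs nothing more than the standard WebSocket API; the endpoint URL and message shape below are hypothetical, not Bergamot's actual protocol:

    // Minimal sketch of browser-side push updates over a WebSocket.
    // The URL and message format are hypothetical, not Bergamot's actual protocol.
    var socket = new WebSocket('wss://bergamot.example.com/updates');

    socket.onopen = function () {
        // ask the server to stream state changes for a particular check
        socket.send(JSON.stringify({ action: 'register', check: 'rabbitmq_active_connections' }));
    };

    socket.onmessage = function (event) {
        // update the UI the moment the check changes state
        var update = JSON.parse(event.data);
        document.getElementById(update.check).textContent = update.status;
    };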

The server side, however, is a little more complex. Websockets rely upon a long-running TCP / HTTP connection, so you need to ensure that the backend server is non-blocking / event based (likewise for all servers in the connection path).

Programming for non-blocking / event-based servers is very different from programming for threaded servers. Bergamot makes use of Netty, an event-based networking library for Java with websocket support, to bridge between websockets and message queues. When a check changes state, the change is published to a message queue; Netty simply listens for these messages and transmits them to the browsers.

This allows for less than 200ms of latency between asking Bergamot to execute a check in the UI and Bergamot executing the check and publishing the result to the browser. I deliberately had to add a slow animation effect in the UI so that users could see that a check had actually updated!

HOT PostgreSQL

A design goal of my monitoring system, Bergamot Monitoring, was to ensure that the monitoring state was persisted. As a long-time PostgreSQL user, I found PostgreSQL the obvious choice, and it hasn't been a bad decision.

An interesting aspect of monitoring systems is that they are constantly busy. Even a small-scale deployment is likely to be executing one check every second, which translates to around two database updates a second.

At the outset I was concerned about table bloat. A facet of the MVCC concurrency system used in PostgreSQL (and many other databases) is that updating a row is essentially a delete and an insert of that row. As such, for tables which are constantly updated, a large number of dead tuples will build up. In PostgreSQL, cleaning up these dead tuples is the job of vacuum, which runs automatically via the autovacuum processes.

PostgreSQL has an update-specific optimisation called Heap-Only Tuples (HOT). Normally an update leaves a dead tuple in both the index and the table, both of which need to be cleaned up by vacuum. However, when updating only columns which are not part of an index, a HOT update can hopefully be used. A HOT update attempts to place the new tuple version within the same page and points the old tuple to the new one. This means that the index does not need to be updated, reducing the clean-up which needs to be performed by vacuum.

Looking at the statistics from my demo system, the check_state and check_stats tables, which get updated every time a result is processed, are almost exclusively using HOT updates. My statistics show that 99.4% of updates to these tables are HOT updates.
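
If you want to check this on your own system, PostgreSQL's standard pg_stat_user_tables view keeps the relevant counters; a query along these lines reports the HOT update percentage (the table names here are the ones from this article):

    -- What fraction of updates are HOT updates for the busiest tables?
    SELECT relname,
           n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_update_pct
      FROM pg_stat_user_tables
     WHERE relname IN ('check_state', 'check_stats');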

Again looking at the statistics, autovacuum is being invoked roughly every two to three minutes. I suppose this is unsurprising considering that every row in the check_state and check_stats tables is being updated every five minutes.

About Bergamot Monitoring

Bergamot Monitoring is an Open Source, distributed monitoring system with an easy migration path from Nagios. The project was founded by me (Chris Ellis), partly by accident and partly out of frustration with Nagios.

Bergamot Monitoring has a dedicated website, bergamot-monitoring.org, should you want to find out more or give it a try.

The project started after I had written a Nagios config parser and thought to myself: how much harder could it be to just execute the checks? That turned out to be fairly easy, as executing a Nagios check is just forking a process. My frustrations born of dealing with Nagios took over, and I've detailed some of the gripes which led me to take Bergamot Monitoring in the direction it is heading.

Whilst the project started off utilising the Nagios config format, this quickly changed so as to address some of its limitations. However, as an easy migration path is considered a critical aspect of the project, it is possible to convert a Nagios configuration to the Bergamot Monitoring format.

What Is Wrong With Nagios

Bergamot Monitoring was born out of frustration with some fundamental flaws in Nagios. Nagios has become the de facto infrastructure monitoring system, yet it has a significant number of failings. There are a number of third-party extensions which aim to address some of these issues; however, let's be honest: if a solution is rotten at the core, nothing wrapped around it can fix that.

That being said, what is wrong with Nagios:

Configuration

Nagios' configuration is brain dead. Whilst it has inheritance as a top-level concept, which solves some problems, it does not address a number of key issues. This has resulted in a number of third-party configuration systems, which simply add another layer of complexity rather than fixing the underlying issues.

The configuration cannot separate grouping checks for display purposes from grouping them to apply configuration. To apply the same service to multiple hosts, you need to set up a host group, so you either need to muddy your display or have configuration which is difficult to change.

Every infrastructure I've ever dealt with has 'classes' of servers, so why not inherit services from templates? This approach results in reusable templates which can simply be applied to hosts, making it really easy to add a host to monitoring.

When configuring a check, you either need to provide all arguments or use a limited number of global variables (32), rather than being able to define parameters where they logically belong. For example, imagine a situation where each data centre or each device has a different SNMP community string.

Logically you would want to define this on a location or host and then reference it from the check; well, you can't.

Reloads

Nagios has no ability to apply configuration changes in a live, real-time manner. This problem is exacerbated when using scaling / distributed extensions, which then require a 'synchronised' restart of all monitoring servers. This is a brutal approach to changing configuration and it doesn't really fit into the modern cloud world of provisioning machines whenever you want.

Distribution

Modern infrastructures are complex beasts, often spread across multiple sites, possibly in multiple time zones, and a monitoring system must be able to cope with this. To distribute checks in Nagios, you need to resort to third-party extensions, rather than this being handled in the core. As far as I'm aware, Nagios cannot cope with scheduling checks, or sending notifications, across multiple time zones.

Scaling

The core architecture of Nagios opposes scaling: it is inherently a single process at its core, retaining all information in memory. Nagios 4 and Naemon are starting to address this by splitting check execution off into separate processes. However, even with that, Nagios pays little attention to being a scalable solution.

This has led to third-party extensions which attempt to scale Nagios, but they are far from perfect, especially in how they handle configuration changes.

Stability

Political instability in the Nagios project over the last few years has led to significant forks of the project, most notably Naemon. With these forks pursuing different directions, third-party extensions have to pick a particular project to support or incur the overhead of supporting multiple diverging projects. This ends up backing users into a corner.