Apache

Sling Health Check Tools

Based on simple HealthCheck OSGi services, the Sling Health Check Tools ("hc" in short form) are used to check the health of live Sling systems, based on inputs like JMX MBean attribute values, OSGi framework information, Sling requests status, etc.

Health checks are easily extensible either by configuring the supplied default HealthCheck services, or by implementing your own HealthCheck services to cater for project specific requirements.

However for simple setups, the out of the box health checks are often sufficient. Executing Health Checks is a good starting point to run existing checks and to get familiar with how health checks work.

See also:

Use cases

Generally health checks have two high level use cases:

  • Operations teams checking sling instances for their internal state manually
  • Load balancers can query the health of a sling instance and decide to take it out or back into the list of used backends automatically

The strength of Health Checks are to surface internal Sling state for external use:

  • Verify that performance counters are in range
  • Run smoke tests at system startup
  • Check that all OSGi bundles are up and running
  • Check that demo content has been removed from a production system
  • Check that demo accounts are disabled
  • Ping external systems and raise alarms if they are down

The health check subsystem uses tags to select which health checks to execute so you can for example execute just the performance or security health checks once they are configured with the corresponding tags.

The out of the box health check services also allow for using them as JMX aggregators and processors, which take JMX attribute values as input and make the results accessible via JMX MBeans.

What's a HealthCheck ?

A HealthCheck is just an OSGi service that returns a Result.

public interface HealthCheck {

    /** Execute this health check and return a {@link Result} 
     *  This is meant to execute quickly, access to external
     *  systems, for example, should be managed asynchronously.
     */
    public Result execute();
}

Where Result is a simple immutable class that provides a Status (OK, WARN, CRITICAL etc.) and one or more log-like messages that can provide more info about what, if anything, went wrong.

public class Result implements Iterable <ResultLog.Entry> {

    public boolean isOk() {
        return getStatus().equals(Status.OK);
    }

    public Status getStatus() {
        return resultLog.getAggregateStatus();
    }

    @Override
    public Iterator<ResultLog.Entry> iterator() {
        return resultLog.iterator();
    }

    ... details omitted
} 

SlingHealthCheck annotation

The SlingHealthCheck annotation makes it easier to specify the required HealthCheck service properties.

Here's an example from the samples module - see the annotations module for more details.

@SlingHealthCheck(
    name="Annotated Health Check Sample", 
    mbeanName="annotatedHC",
    description="Sample Health Check defined by a java annotation",
    tags={"sample","annotation"})

public class AnnotatedHealthCheckSample implements HealthCheck {

    @Override
    public Result execute() {
        ...health check code
    }
}

Executing Health Checks

Health Checks can be executed via a webconsole plugin, the health check servlet or via JMX. HealthCheck services can be selected for execution based on their hc.tags multi-value service property.

The HealthCheckFilter utility accepts positive and negative tag parameters, so that -security,sling selects all HealthCheck having the sling tag but not the security tag, for example.

For advanced use cases it is also possible to use the API directly by using the interface org.apache.sling.hc.api.execution.HealthCheckExecutor.

Health Check bundles

The Health Check subsystem consists of the following bundles:

  • The only required bundles are org.apache.sling.hc.api which provides the API and org.apache.sling.hc.core which provides some utility classes and some generally useful HealthCheck services (e.g. the health check executor)
  • org.apache.sling.hc.support provides more Sling-specific HealthCheck services.
  • org.apache.sling.hc.webconsole provides the Webconsole plugin described below.
  • org.apache.sling.junit.healthcheck provides a HealthCheck service that executes JUnit tests in the server-side OSGi context.
  • org.apache.sling.hc.samples provides sample OSGi configurations and HealthCheck services. The sample configurations are provided as Sling content, so the Sling Installer is required to activate them.
  • org.apache.sling.hc.junit.bridge makes selected Health Checks available as server-side JUnit tests. See below for more info.

Out-of-the-box HealthCheck services

The following default HealthCheck services are provided by the org.apache.sling.hc.core bundle:

The org.apache.sling.hc.samples bundle provides OSGi configurations that demonstrate them.

  • JmxAttributeHealthCheck checks the value of a single JMX attribute and supports ranges like between 12 and 42.
  • ScriptableHealthCheck evaluates an expression written in any scripting language that Sling supports, and provides bindings to access JMX attributes.
  • CompositeHealthCheck executes a set of HealthCheck selected by tags, useful for creating higher-level checks.

A few more Sling-specific ones are provided by the org.apache.sling.hc.support bundle:

  • SlingRequestStatusHealthCheck checks the HTTP status of Sling requests.
  • DefaultLoginsHealthCheck can be used to verify that the default Sling logins fail.
  • ThreadUsageHealthCheck can be used to monitor for deadlocks using JRE ThreadMXBean (see SLING-6698 )

A bridge to server-side OSGi-aware JUnit tests is provided by the JUnitHealthCheck, from the org.apache.sling.junit.healthcheck bundle.

The org.apache.sling.hc.samples bundle provides an example OsgiScriptBindingsProvider for the default ScriptableHealthCheck, which provides OSGi-related information to health check script expressions.

Configuring Health Checks

HealthCheck services are created via OSGi configurations. Generic health check service properties are interpreted by the health check executor service. Custom health check service properties can be used by the health check implementation itself to configure its behaviour.

The following generic Health Check properties may be used for all checks:

Property Type Description
hc.name String The name of the health check as shown in UI
hc.tags String[] List of tags: Both Felix Console Plugin and Health Check servlet support selecting relevant checks by providing a list of tags
hc.mbean.name String Makes the HC result available via given MBean name. If not provided no MBean is created for that HealthCheck
hc.async.cronExpression String Used to schedule the execution of a HealthCheck at regular intervals, using a cron expression as specified by the Sling Scheduler module.
hc.resultCacheTtlInMs Long Overrides the global default TTL as configured in health check executor for health check responses (since v1.2.6 of core)
hc.warningsStickForMinutes Long This property will make WARN/CRITICAL results stay visible for future executions, even if the current state has returned to status OK. It is useful to keep attention on issues that might still require action after the state went back to OK, e.g. if an event pool has overflown and some events might have been lost (since v1.2.10 of core)

All service properties are optional.

As an example, here's a ScriptableHealthCheck configuration provided by the org.apache.sling.hc.samples bundle:

Factory PID = org.apache.sling.hc.ScriptableHealthCheck
"hc.name" : "LoadedClassCount and ManagementSpecVersion are in range" 
"hc.mbean.name" : "LoadedClassCount and ManagementSpecVersion"
"hc.tags" : [jvm, script]
"expression" : "jmx.attribute('java.lang:type=ClassLoading', 'LoadedClassCount') > 10 &&  jmx.attribute('java.lang:type=Runtime', 'ManagementSpecVersion') > 1" 
"language.extension" : "ecma" 

The service properties starting with the hc. prefix in this example should be provided by all HealthCheck services.

Configuring the Health Check Executor

The health check executor can optionally be configured via service PID org.apache.sling.hc.core.impl.executor.HealthCheckExecutorImpl:

Property Type Default Description
timeoutInMs Long 2000ms Timeout in ms until a check is marked as timed out
longRunningFutureThresholdForCriticalMs Long 300000ms = 5min Threshold in ms until a check is marked as 'exceedingly' timed out and will marked CRITICAL instead of WARN only
resultCacheTtlInMs Long 2000ms Result Cache time to live - results will be cached for the given time

Webconsole plugin

If the org.apache.sling.hc.webconsole bundle is active, a webconsole plugin at /system/console/healthcheck allows for executing health checks, optionally selected based on their tags (positive and negative selection, see the HealthCheckFilter mention above).

The DEBUG logs of health checks can optionally be displayed, and an option allows for showing only health checks that have a non-OK status.

The screenshot below shows an example.

Health Check Webconsole Plugin

JMX access to health checks

If the org.apache.sling.hc.jmx bundle is active, a JMX MBean is created for each HealthCheck which has the service property hc.mbean.name service property set. All health check MBeans are registered in the domain org.apache.sling.healthcheck with a type of HealthCheck.

The MBean gives access to the Result and the log, as shown on the screenshot below.

See the example configurations of the org.apache.sling.hc.samples for more details.

JConsole showing Sling Health Check MBeans

Health Check Servlet

Starting with version 1.2.4 of the org.apache.sling.hc.core bundle, a flexible Health Checks execution servlet is available. It provides similar features to the Web Console plugin described above, with output in HTML, JSON (plain or jsonp) and TXT (concise or verbose) formats (see HTML format rendering page for more documentation).

The Health Checks Servlet is disabled by default, to enable it create an OSGi configuration like

PID = org.apache.sling.hc.core.impl.servlet.HealthCheckExecutorServlet
servletPath = /system/health

which specifies the servlet's base path. That URL then returns an HTML page, by default with the results of all active health checks and with instructions at the end of the page about URL parameters which can be used to select specific Health Checks and control their execution and output format.

Note that by design the Health Checks Servlet doesn't do any access control by itself to ensure it can detect unhealthy states of the authentication itself. Make sure the configured path is only accessible to relevant infrastructure and operations people. Usually all /system/* paths are only accessible from a local network and not routed to the Internet.

By default the HC servlet sends the CORS header Access-Control-Allow-Origin: * to allow for client-side browser integrations. The behaviour can be configured using the OSGi config property cors.accessControlAllowOrigin (a blank value disables the header).

Health Checks as server-side JUnit tests

The org.apache.sling.hc.junit.bridge bundle makes selected Health Checks available as server-side JUnit tests.

It requires the org.apache.sling.junit.core bundle which provides the server-side JUnit tests infrastructure.

The idea is to implement the smoke tests of your system, for example, as health checks. You can then run them as part of integration testing, using the Sling Testing Tools
remote testing utilities, and also as plain Health Checks for monitoring or troubleshooting Sling instances.

To use this module, configure sets of tags at /system/console/configMgr/org.apache.sling.hc.junitbridge.HealthCheckTestsProvider using the standard includeThisTag,-omitThatTag syntax, and JUnit tests will be available at /system/sling/junit/HealthChecks.html to run the corresponding Health Checks.