Grant Street Group

Grant Street Group HealthCheck Standard

This standard defines the format for the results of an application HealthCheck. It is loosely based on the FT Health Check Standard, much simplified, with the major addition of a “results” field that allows results to be nested. Nested results let you see a summary of the overall state and still dig deeper into why something has failed. Tag support lets you query for only the features you are interested in, without waiting for other results to be calculated.

Checker Implementations

Perl HealthCheck Module

Other useful code

HealthCheck::Diagnostic - Base class for writing HealthCheck Diagnostic Checks
- As well as some documentation on writing one.
Plack::Middleware::HealthCheck - HealthChecks over the web
- Note this implements the correct HTTP response statuses
More HealthCheck::Diagnostic checks on MetaCPAN

Checker Implementation Best Practices

The checker SHOULD summarize the “status” field, picking the worst severity of all responses.
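A minimal sketch of this summarizing step follows. The standard does not say how the four statuses rank against each other, so the `SEVERITY` ordering here (CRITICAL worst, then WARNING, then UNKNOWN, then OK) is an assumption, as is the function name `summarize_status`. The sample result in the Results section, where a CRITICAL parent contains an UNKNOWN child, is at least consistent with CRITICAL outranking UNKNOWN.

```python
# Assumed ranking, not defined by the standard: a higher number is worse.
SEVERITY = {"OK": 0, "UNKNOWN": 1, "WARNING": 2, "CRITICAL": 3}


def summarize_status(results):
    """Return the worst status found in a list of result dicts."""
    worst = None
    for result in results:
        # A missing or unrecognized status is read as UNKNOWN,
        # which is the documented default for the "status" field.
        status = str(result.get("status", "UNKNOWN")).upper()
        if status not in SEVERITY:
            status = "UNKNOWN"
        if worst is None or SEVERITY[status] > SEVERITY[worst]:
            worst = status
    # With no sub-results there is nothing to summarize.
    return worst or "UNKNOWN"
```

If your monitoring treats UNKNOWN as more serious than CRITICAL, only the `SEVERITY` table needs to change.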

A checker SHOULD return the list of tags it checked. An individual check MAY return all the tags it has associated.

A checker MUST time out in a reasonable amount of time and return an appropriate response when it could not get results. For example, some endpoints may be hit every ten seconds, so the expected response time is shorter than that.
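One way to enforce such a limit is to run the check in a worker thread and stop waiting once a deadline passes. The sketch below is illustrative only: `run_with_timeout` is a hypothetical helper, and reporting a timeout or an exception as UNKNOWN is an assumption (the sample result in the Results section does show a timed-out check reported as UNKNOWN).

```python
import threading
import time


def run_with_timeout(check, timeout_seconds):
    """Run check() and return its result dict, or an UNKNOWN result on timeout."""
    outcome = {}

    def worker():
        try:
            outcome["result"] = check()
        except Exception as exc:
            # Assumption: a check that raises is reported as UNKNOWN.
            outcome["result"] = {"status": "UNKNOWN", "info": str(exc)}

    started = time.monotonic()
    # A daemon thread will not keep the process alive if the check hangs.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    runtime = time.monotonic() - started

    if thread.is_alive():
        # The check is still running; stop waiting and report that.
        return {
            "status": "UNKNOWN",
            "info": f"Timed out after {timeout_seconds} seconds",
            "runtime": runtime,
        }

    result = dict(outcome["result"])
    result.setdefault("runtime", runtime)
    return result
```

This stops waiting for the check but does not cancel it. A check that holds resources while hanging needs its own cancellation, such as a socket or query timeout.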

For HTTP(S) health checks, the HTTP response code MUST be 200 if the app is in good health and MUST be 503 if it is in bad health. See also Google’s standard.
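The mapping from a result to a response code can be sketched as below. The standard does not say which statuses count as "good health", so treating only OK as healthy is an assumption, exposed through the hypothetical `healthy_statuses` parameter. The function name `http_response` is also illustrative.

```python
import json


def http_response(result, healthy_statuses=("OK",)):
    """Return (http_status_code, json_body) for a health check result."""
    # A missing status is read as UNKNOWN, the documented default.
    status = result.get("status", "UNKNOWN")
    code = 200 if status in healthy_statuses else 503
    return code, json.dumps(result)
```

A deployment that wants WARNING to keep an instance in rotation could pass `healthy_statuses=("OK", "WARNING")`.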

Results

A checker implementation should return a data structure that can be round-tripped through JSON. Fields in the result are expected to comply with the definitions given here. Results may include fields not described here, and these additional fields should be preserved if the result is stored or forwarded.

 {
   "id" : "my_app",
   "status" : "CRITICAL",
   "label" : "My App's Health Check",
   "info" : "Something has gone terribly wrong",
   "timestamp" : "2001-02-03 04:05:06Z",
   "runbook" : "https://grantstreetgroup.github.io/HealthCheck.html",
   "results" : [
      {
         "id" : "simple_check",
         "status" : "OK",
         "runtime" : 0.012
      },
      {
         "id" : "named_check",
         "status" : "OK",
         "label" : "Pretty Label",
         "runtime" : 0.027
      },
      {
         "id" : "timed_out",
         "status" : "UNKNOWN",
         "info" : "Timed out after too many seconds",
         "runtime" : 15.009
      },
      {
         "id" : "aggregate_check",
         "status" : "CRITICAL",
         "label" : "A check that aggregates different problems",
         "runbook" : "https://grantstreetgroup.github.io/HealthCheck.html",
         "results" : [
            {
               "id" : "subcheck_1",
               "status" : "OK"
            },
            {
               "id" : "subcheck_2",
               "status" : "OK",
               "label" : "Before the dawn of time",
               "timestamp" : "1969-12-31T12:59:59+00:00"
            },
            {
               "id" : "subcheck_3",
               "status" : "CRITICAL",
               "info" : "Check failed to do the thing it set out to do"
            }
         ],
         "runtime" : 2.087
      }
   ],
   "runtime" : 17.377,
   "tags" : [
      "my_app",
      "multi_check"
   ]
}
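Because results nest, a reader can walk the structure to find the specific checks behind a bad summary status. The sketch below is one possible traversal, not part of the standard: `failing_leaves` is a hypothetical helper that yields a slash-separated path of ids for every result that is not OK and has no sub-results. A result without an "id" falls back to its index in the parent list, following the documented default.

```python
def failing_leaves(result, default_id=0, path=()):
    """Yield (path, result) for each non-OK result that has no sub-results."""
    here = path + (str(result.get("id", default_id)),)
    children = result.get("results") or []
    if not children:
        # A leaf: report it unless it is healthy.
        if result.get("status", "UNKNOWN") != "OK":
            yield "/".join(here), result
        return
    # An aggregate: descend, using the list index as the fallback id.
    for index, child in enumerate(children):
        yield from failing_leaves(child, index, here)
```

Applied to the sample above, this yields the paths `my_app/timed_out` and `my_app/aggregate_check/subcheck_3`.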

Standard Result Fields

Key	Constraints	Default	Notes
id	Lowercase letters, numbers and underscores	index into the results list, if an aggregate, 0 otherwise.	Unique identifier for this check. Must be unique within a group.
status	OK (0), WARNING (1), CRITICAL (2), UNKNOWN (3)	UNKNOWN	A string, the definition from the Nagios Plugin Return Code Service State.
label	String	“id” field	Displayed to a human describing the check.
info	String	“status” field	Displayed to a human describing the status. May include more detail than the status.
timestamp	RFC3339 timestamp	timestamp of a “parent” result or a top-level result will default to the current time.	The top-level result should fill this out when it is run. Checks that do caching of results should use the timestamp of the last time the result was updated. It is highly recommended that you stick with GMT, or the “Z”/+00:00 timezone for consistency. RFC3339 is a subset of ISO8601.
results	List of additional check results	None	If this is an aggregate check that combines sub-checks, this can be a list of the results for each sub-check. Each result must comply with the constraints defined here.
runtime	Float (any precision)	None; may be auto-calculated at top-level	The amount of time it took to run the check, in seconds. Individual results can have their own individual runtimes. Runtimes SHOULD NOT be taken from cached values and should reflect only the real amount of time it took to run the check. Checks may be run in parallel, so summing child runtimes for the parent check may not be appropriate; parent checks should use their own timers.
runbook	A troubleshooting runbook link	None	A URL string that links to a troubleshooting runbook for when the healthcheck is not in OK status.
tags	Array of strings	None	A set of tags that can be used to classify the result. These can generally be filtered with a separate “tags” query from the check implementation.
data	A freeform structure	None	A freeform machine-readable set of data, providing additional details to the test. The structure and keys SHOULD be consistent if the same type of test is run multiple times.

Default values are what the result reader is expected to assume, not necessarily provided by the check implementation.
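A reader that wants every result to carry explicit values could fill in the defaults from the table like this. The sketch is illustrative: `with_defaults` is a hypothetical name, rendering the index default as a string is an assumption, and fields with a default of None are left absent.

```python
from datetime import datetime, timezone


def with_defaults(result, index=0, parent_timestamp=None):
    """Return a copy of result with the documented reader defaults applied."""
    filled = dict(result)
    # id: index into the parent's results list, or 0 at the top level.
    filled.setdefault("id", str(index))
    filled.setdefault("status", "UNKNOWN")
    # label falls back to the id, info falls back to the status.
    filled.setdefault("label", filled["id"])
    filled.setdefault("info", filled["status"])
    if "timestamp" not in filled:
        # Inherit from the parent, or use the current UTC time at the top level.
        filled["timestamp"] = parent_timestamp or datetime.now(
            timezone.utc
        ).isoformat(timespec="seconds")
    if "results" in filled:
        filled["results"] = [
            with_defaults(child, i, filled["timestamp"])
            for i, child in enumerate(filled["results"])
        ]
    return filled
```

The input is copied rather than modified, so any additional fields the check provided are carried through unchanged.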