MARRIED: Monitor, Alert, Recover, Repair, Integrate, Explain, Document

As software developers, for better or for worse, we are married to the applications we build and the systems we run. In good times and in bad, we are responsible for ensuring that our systems are healthy.

One indicator of health is tests: unit tests, integration tests, regression tests. But tests generally check the static health of an application, not the dynamic state of the running system. By this I mean that we test an application before we deploy it; only afterwards does it become part of a dynamic running system.

Since it is not practical to continually test a running system, we take measurements as a proxy for these tests and thereby obtain a measure of system health. Health is defined by limits: if every measurement is within its respective limits, the system is considered healthy.
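As a minimal sketch of "health is defined by limits", the check could look something like the following. The metric names and the limit values are hypothetical and only there for illustration; real limits depend entirely on the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Inclusive lower and upper bound for one measurement."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Hypothetical limits, chosen only for illustration.
LIMITS = {
    "cpu_percent": Limit(0.0, 85.0),
    "disk_free_gb": Limit(10.0, float("inf")),
    "response_time_ms": Limit(0.0, 500.0),
}


def unhealthy_measurements(measurements: dict[str, float]) -> list[str]:
    """Return the names of all measurements that are outside their limits."""
    return [
        name
        for name, value in measurements.items()
        if name in LIMITS and not LIMITS[name].contains(value)
    ]


def is_healthy(measurements: dict[str, float]) -> bool:
    """The system is healthy if no measurement violates its limit."""
    return not unhealthy_measurements(measurements)
```

Measurements without a defined limit are ignored here, which mirrors the idea that a measurement only says something about health once a limit exists for it.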

There are general measurements, such as CPU and memory usage, disk space and response times. However, a larger pool of measurements is system-specific: semantically oriented measurements that are unique to each application and to the system in which the application is running.

An example of a semantically oriented measurement is counting clicks on banner ads for an advertising application. Zero clicks on a banner might indicate an unhealthy system, but it might also indicate that an ad campaign was stopped. Which of the two scenarios applies depends on the state of the running system, and that state defines system healthiness.
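The banner example can be sketched in a few lines. The function name and its two parameters are assumptions made for this illustration; the point is only that the same measurement (zero clicks) is judged differently depending on the state of the system.

```python
def banner_is_healthy(clicks_last_hour: int, campaign_active: bool) -> bool:
    """Judge banner clicks in the context of the campaign state."""
    if not campaign_active:
        # A stopped campaign is expected to receive no clicks,
        # so zero clicks says nothing about system health.
        return True
    # With a running campaign, zero clicks is suspicious.
    return clicks_last_hour > 0
```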

In combination, all measurements provide an insight into the state of the system. Defining limits for these measurements defines system healthiness. In defining these limits, we begin to understand the dynamic system in its entirety, not only as a single static application.

Unhealthy Systems

In turn, having a strategy for handling unhealthy systems allows for the adoption of an on-call duty roster and 24/7 support of running systems. Whether that is relevant depends on the costs involved. The cost of waking a developer in the middle of the night might well be far less than the cost of a broken system; there aren’t any hard and fast rules.

Also, if there is no measurement, then there can’t be an alert. This means MARRIED is only as effective as its weakest (or missing) measurement. Without measurements, the MARRIED strategy is not possible.

MARRIED is one possible strategy for handling system failure, not the only one. Even ignoring failure is a strategy. If the only take-away from this article is that you start to think about how to deal with your system when it becomes unhealthy or fails, that’s great. If you don’t think that having a healthy system is important for you, that’s also fine. Eventually, there will be situations where every second of downtime has a cost, and defining a strategy to reduce and prevent downtime will then be important.

Embrace Failure

Avoid the false sense of security that comes with writing lots of tests. Applications will always behave exotically when real users in real environments are let loose on a dynamic running system. Know the state of the system at all times.

Off-the-Shelf Solutions

If your company is unwilling to invest in third-party solutions, then the company has either not experienced system failures or has no vested interest in recovering from failure. Make the case: what are the costs of downtime? Here cost can be non-monetary: PR costs, lost users, angry users, delays, and so on. What is the cost of recovery? Waking developers, extra SaaS tools, and so on.

I’ll now explain the steps of MARRIED: Monitor, Alert, Recover, Repair, Integrate, Explain, Document. Roughly ordered by occurrence, each step also has a non-negligible association with its previous step.

Monitoring: based on Instrumentation

For me, instrumentation is all the measurements of the running system that third-party tools provide out of the box. Custom Instrumentation (CI13N), on the other hand, covers the semantically oriented measurements of a dynamic system, defined by what the application actually does.

An example of CI13N: Pingdom can ping your top-level domain, but it can also ping specific URLs in your system. Pinging the top-level domain, e.g. https://example.com, comes out of the box. Pinging https://example.com/is_the_database_alive targets a specific URL of the system, so it’s a CI13N that you need to set up.
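What might sit behind such a URL? The following is a rough sketch using only the Python standard library, assuming an SQLite database purely for illustration; the helper names (database_is_alive, make_handler, serve) and the port are made up for this example. A real application would run the check against its own database and web framework.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_is_alive(connection: sqlite3.Connection) -> bool:
    """Run a trivial query; any database error counts as 'not alive'."""
    try:
        connection.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False


def make_handler(connection: sqlite3.Connection):
    """Build a request handler that answers the health-check URL."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/is_the_database_alive":
                self.send_error(404)
                return
            alive = database_is_alive(connection)
            body = b"ok" if alive else b"database unreachable"
            # 200 tells the external pinger all is well, 503 triggers its alert.
            self.send_response(200 if alive else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler


def serve(connection: sqlite3.Connection, port: int = 8000) -> None:
    """Start the (blocking) health-check server on the given port."""
    HTTPServer(("", port), make_handler(connection)).serve_forever()
```

The external service then only needs to distinguish a 200 from anything else; all knowledge about what "alive" means stays inside the application.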

CI13N: Custom Instrumentation

A metric is defined as a measurement of a system attribute at a specific point in time, e.g. banner clicks.

If collecting metrics is too complex, then no one will do it. This includes the setup of a new metric: metrics should be created dynamically by the third-party collection service.

Collecting metrics should be done as soon as possible. They can be a good indication of incorrect system assumptions. Also starting early will make it easier to know what to measure. Usually once you start, you will realise that there is a lot more that can be measured.

Hence, when collecting metrics, the more the merrier! Err on the side of more instrumentation rather than less. Basically, if in doubt, create a metric and record the value. You never know when something will go wrong or what will have caused the failure, and in those cases it is great to have lots of insight into the running system.
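To show how low the barrier can be, here is a small in-memory sketch in which a metric comes into existence the first time a value is recorded. The Metrics class and the metric name are hypothetical; a third-party collection service would take the place of the in-memory store.

```python
import time
from collections import defaultdict


class Metrics:
    """In-memory stand-in for a metrics collection service."""

    def __init__(self):
        # Metric name -> list of (timestamp, value) samples.
        self._samples = defaultdict(list)

    def record(self, name, value, timestamp=None):
        """Record a value; the metric is created on first use."""
        when = time.time() if timestamp is None else timestamp
        self._samples[name].append((when, value))

    def values(self, name):
        """All recorded values for a metric, oldest first."""
        return [value for _, value in self._samples.get(name, [])]

    def latest(self, name):
        """Most recent value, or None if nothing was recorded yet."""
        samples = self._samples.get(name)
        return samples[-1][1] if samples else None
```

Recording then becomes a one-liner at the place where the event happens, for example metrics.record("banner.clicks", 1), with no separate registration step.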

The second part of monitoring is creating limits which define system health. It is these limits that define when alerts are generated. A zero-based limit, i.e. one where a single occurrence is too many, would be an exception: if an exception is raised, that is an alert. Likewise, if Pingdom fails to reach your application, that is an alert.

More difficult are limits for CI13N. Since these are specific to the system, no one has experience of what a healthy limit for a CI13N is. Therefore it’s always a good idea, when defining a new metric, to wait a week before defining a limit for that metric. Otherwise you will generate too many false-positive alerts. This is another reason to start collecting metrics early.
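Once a week of values has been collected, a first limit can be derived from the data instead of being guessed. The sketch below uses the mean plus or minus three standard deviations; that rule and the tolerance of 3.0 are assumptions made for illustration, not a recommendation, and the resulting limits should still be reviewed by someone who knows the system.

```python
import statistics


def suggest_limits(observations, tolerance=3.0):
    """Suggest (low, high) limits from values observed over a trial period."""
    if len(observations) < 2:
        raise ValueError("need at least two observations to estimate a spread")
    mean = statistics.fmean(observations)
    spread = statistics.stdev(observations)  # sample standard deviation
    return mean - tolerance * spread, mean + tolerance * spread
```

For a metric whose distribution is heavily skewed, such as response times, percentiles are likely a better starting point than mean and standard deviation.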

Informational Exceptions

Of course, system failure can be caused by unexpected behaviour, but unexpected behaviour should not cause exceptions.

Monitoring can be seen as the process of regularly checking that the metric values remain within the bounds of specific limits. Hence monitoring can be seen as the enforcement of limits on the current state of the running system.

When these limits aren’t being met, alerts are triggered and become the initial indicator of an unhealthy system.

Alerts: when monitoring detects unhealthiness

Alerts are the visible part of a good monitoring strategy. Generally they will end up in Slack channels, email inboxes and, in the worst cases, as SMSs on people’s phones at three in the morning!

Generally alerts have a time-based upwards percolation: if the first developer assigned to an alert doesn’t fix it, then the next in line gets notified, all the way up to the C-levels.
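This upwards percolation can be sketched as a simple lookup. The escalation chain and the 15-minute interval are hypothetical values; alerting tools usually let you configure both.

```python
# Hypothetical escalation chain, ordered from first responder upwards.
ESCALATION_CHAIN = ["on-call developer", "team lead", "head of engineering", "CTO"]

# Hypothetical time each level gets before the next one is notified.
ESCALATION_INTERVAL_MINUTES = 15


def who_to_notify(minutes_unresolved: float) -> str:
    """Return who should be notified after the alert has been open this long."""
    level = int(minutes_unresolved // ESCALATION_INTERVAL_MINUTES)
    # Clamp to the chain: never below the first level, never past the last.
    level = max(0, min(level, len(ESCALATION_CHAIN) - 1))
    return ESCALATION_CHAIN[level]
```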

Call-to-Action

Alerts are triggered by the monitoring setup. Monitoring is based on metrics and limits. If a limit is reached, an alert is triggered. Alerts are re-triggered as long as the metric isn’t within the limit. Alerts can also be set up to be time-based: an alert is triggered only if the metric is above a certain limit for a certain time period.
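A time-based trigger of this kind could be sketched as follows. The class name and its parameters are made up for this example; the caller passes in the current time so that the logic stays easy to test.

```python
class SustainedLimitTrigger:
    """Fire only once a metric has stayed above its limit for hold_seconds."""

    def __init__(self, limit, hold_seconds):
        self.limit = limit
        self.hold_seconds = hold_seconds
        self._breach_started = None  # time at which the current breach began

    def observe(self, value, now):
        """Feed one observation; return True if an alert should be triggered."""
        if value <= self.limit:
            # Back within the limit: forget the breach and stop alerting.
            self._breach_started = None
            return False
        if self._breach_started is None:
            self._breach_started = now
        # Keeps returning True (re-triggering) for as long as the breach lasts.
        return now - self._breach_started >= self.hold_seconds
```

A short spike above the limit therefore never produces an alert, while a sustained breach keeps producing one until the metric is back within its limit.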

This is the reason why it’s very important to set sensible limits for metrics. If there are too few alerts while the system is failing, i.e. no one is noticing failure, then management will see no value in having monitoring. If there are too many false-positive alerts, then no one will notice a system failure, since alerts will start to be ignored. If there are just the right number of alerts, developers will be paid bonuses for doing a good job!

Having said that, in the beginning there will always be false-positive alerts, since limits are changing or not yet understood; that’s fine. However, it should not become the norm.

Alerts should be triggered not when the system has failed but when it is in a critical state, i.e. the system is unhealthy. This is a fine judgement call, since it makes the assumption that the system won’t self-heal and is instead facing certain doom.

This call is made when setting up the limits. Err on the side of caution when setting limits for metrics. There aren’t any hard and fast rules for getting this right; experience is the only saving grace here.

Handling Alerts

Alerts should be linked to runbooks. Runbooks provide details on what the alert means, how it can be diagnosed, what tools might be useful, what metric caused the alert to be triggered and how to recover from the alert (if known).
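One way to make that link hard to forget is to carry the runbook URL in the alert itself. The field names and the URL below are hypothetical; the idea is simply that an alert cannot be created without pointing at its runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """An alert that always carries a link to its runbook."""
    metric: str
    limit: float
    value: float
    runbook_url: str

    def message(self) -> str:
        """Text for Slack, email or SMS: what happened and where to look."""
        return (
            f"{self.metric} is {self.value}, limit is {self.limit}. "
            f"Runbook: {self.runbook_url}"
        )
```

Whoever receives alert.message() at three in the morning then has the metric, the violated limit and the runbook in a single notification.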

Alerts always happen at 3 in the morning

Of course, unknown alerts will not have runbooks. In those cases it is important to create missing runbooks or update existing ones. Always link new runbooks to their corresponding alert.

Recovery: returning to a healthy system

Recovery addresses the symptoms but doesn’t fix the underlying cause. It is taking painkillers instead of getting a cure.

Why is recovery like this? Because this is what you do at three in the morning. The system needs to be put back into a healthy state. The aim of recovery is to get the system back up and running in the shortest possible timeframe. This also tends to be something done by a single developer, late at night.

When an alert happens, there should be a clear path to recovery. Recovery should be as smooth as possible. When that is not the case, it’s a matter of improving descriptions of how recovery can be achieved. So the runbook should be updated if it’s not clear how to recover.

Remember: Runbooks are the recipes to recovery.

Everything around the process of recovery will eventually lead to fixing the cause permanently. A fix may, however, also be that the system self-heals by applying the path to recovery automagically, by itself.

Of course, self-healing systems are one thing; a working system is always better. It’s a fine line between self-healing and working: sometimes there isn’t a permanent fix (or the cost of a fix is too high in comparison) and perhaps the best option is getting the system to fix the symptoms automagically, i.e. self-healing.

Repair: fix the underlying problems

Repairing addresses the cause; it is a permanent fix for something that is broken. As such, it takes time and shouldn’t be done in a hurry. Repairing the code so that the alert doesn’t get triggered is something that happens during office hours, not late at night. As such, everyone is involved.

Roughly speaking, there are three approaches to repairing:

  1. Adjust the limits that triggered the alert. Perhaps the limit was not set correctly,
  2. Take the recovery steps outlined in the runbook and apply them programmagically when the alert would happen, thus making the system self-healing, or
  3. Fix the underlying cause, having understood what the cause is.

These are ordered roughly by the amount of effort required. It all depends on what is expected; either way, do the right thing.
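The second approach, applying the runbook's recovery step in code, might be sketched like this. The function and its four parameters are assumptions for illustration: read_metric returns the current value, recover is the automated runbook step, and alert notifies a human.

```python
def check_and_heal(read_metric, limit, recover, alert):
    """Try the runbook's recovery step before waking anybody up."""
    value = read_metric()
    if value <= limit:
        return "healthy"
    # The limit is violated: apply the recovery step from the runbook.
    recover()
    value = read_metric()
    if value <= limit:
        return "recovered"
    # Recovery did not help, so a human has to take a look after all.
    alert(f"automatic recovery failed: value {value} exceeds limit {limit}")
    return "alerted"
```

Recording each "recovered" outcome as a metric of its own would be a sensible addition, so that a system which heals itself every few minutes does not go unnoticed.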

There is also a fourth possibility, and that is to remove the alert and metric from the code and ignore the underlying cause. This might well be the case if the code has become redundant. However, this should be rare, since effort was made to collect the metric, set up the trigger and potentially write a runbook.

Having found the underlying cause, integrate the fix into the codebase.

Integrate: apply fixes permanently in the codebase

Having a fix for an alert will generally require more monitoring, i.e. the definition of new metrics. So integration is also concerned with adding monitoring and/or extending existing monitoring for the repaired code.

Avoid removing existing alert triggers even if the fix has been integrated. Triggers and alerts that are defined but remain unused are a good safeguard against regression. If something does go wrong with the fix, then existing alerts might well pick that up.

Only remove alert triggers if the code is removed, refactored or redundant.

Explain: why did it happen

Explaining an alert and its underlying cause is something, of course, that does not start here. It begins with the alert and ends with the documentation. But then MARRIED would be something like MARERID!

Being able to explain exactly what happened and why it happened can sometimes be very difficult. Code can be confusing, and systems are a combination of code, assumptions, reality and users. Fixing issues based on wrong assumptions in combination with confusing code is one of the more interesting experiences in a developer’s professional career.

Edge cases in code. Incorrect assumptions about external services. Users using systems for purposes other than those they were designed for. These, and many more strange and impossible things, are the causes of alerts and system failures.

This is another good reason for alerts to happen as early as possible: it provides more time to understand failure. Alerts should happen early and fast!

Document: learn and prevent reoccurrence

Runbooks are the documentation for alerts. And as such, they are close to the codebase. My preference, when using GitHub, is to use GitHub Wikis for runbooks. This isn’t a perfect solution since these can be renamed and then links become stale. But they can be quickly and easily updated.

Every alert should be documented. Documentation includes why there is an alert (that is, the limit), what to do when the alert happens, including links to logs and to further documentation, and how to recover from the alert.

Other things to include are links to tools that provide more insight, links to the code that collects the underlying metrics, and links to metric graphs that show how the values have developed over time.

The aim of documentation is to prevent incident handling of live systems from becoming the tacit knowledge of a few developers. Spreading the knowledge allows the workload to be spread amongst the developers.

Runbooks

Also embrace failure and don’t be shy to document it: there is no such thing as 100%, or if it’s 100%, it’s not being used!

Rarity

Conclusion

MARRIED is one such strategy. MARRIED isn’t about specifics; it’s about how to structure the handling of live incidents in running systems, covering the entire lifecycle from identifying potential incidents quickly to preventing the repetition of live incidents.

Each company and system is different, and it might well be the case that the cost of having a 24/7 incident response strategy is financially not comparable to the cost of failure and downtime. Not having one is then also a valid strategy. It is questionable whether this strategy is future-proof, though.

If this article made you reflect on your incident response strategy then that’s a good thing. Thanks for reading!

Further Reading

  • Custom Instrumentation (CI13N): measure the application-specific semantics and generate application-specific metrics. There are no golden rules for how or what to measure; it all depends on the system. For example, for a classic blog application, page views would be a good metric. If that hits zero, you know something is not working.
  • Fault tolerance, high availability and disaster recovery: what does this cover? MARRIED is focussed on disaster recovery, with the aim of having a high-availability system that is also fault tolerant. MARRIED does this by documenting alerts: the more you know about your system, the better you can maintain it.

Software Developer & Architect.