Outages are inevitable. In this post, we show you how to stop arguing about uptime and start defining the RTO and RPO targets that drive resilience and fast recovery.
In the introduction, we looked at recent outages at major providers such as AWS, Azure and Cloudflare. The pattern isn’t going away.
Most of the world is now perched on top of a relatively small number of platforms and networks. At the same time, many organisations have shifted from on-premises to hybrid, cloud and SaaS for critical services.
While the combination is powerful, it concentrates risk.
But it’s not just about major providers going down and leaving you unable to reach their platforms. What if your own data and systems are affected?
To recap, outages can happen for many different reasons:
Put it all together and one conclusion is unavoidable:
You cannot prevent every outage. Your differentiator is resilience and recovery speed.
The catch, of course, is that resilience and fast recovery come with a price tag. Which is why you need numbers, not just adjectives, when working on a strategy.
Before you argue about tools, buy more licences or design a complex DR architecture, you need to know what you’re protecting.
Start with 5 to 10 critical business processes that determine whether you can keep your business running if the worst happens, for example:
These are the processes where sustained downtime becomes existential rather than merely annoying.
For each critical process, map out what it sits on. That will usually include:
This is where nasty surprises turn up. You may discover:
You cannot fix what you don’t know about.
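One way to make the mapping concrete is to write it down as data and look for dependencies that several critical processes share. The sketch below is only an illustration: the process names, the dependency names and the `shared_dependencies` helper are hypothetical placeholders, not an inventory of any real environment.

```python
from collections import defaultdict

# Hypothetical example data: each critical process and what it sits on.
DEPENDENCIES = {
    "order processing": ["ERP database", "identity provider", "payment gateway"],
    "payroll": ["HR SaaS", "identity provider"],
    "customer support": ["ticketing SaaS", "identity provider", "ERP database"],
}


def shared_dependencies(dependencies, min_processes=2):
    """Return {dependency: sorted process names} for every dependency
    that at least `min_processes` critical processes rely on."""
    used_by = defaultdict(set)
    for process, deps in dependencies.items():
        for dep in deps:
            used_by[dep].add(process)
    return {
        dep: sorted(processes)
        for dep, processes in used_by.items()
        if len(processes) >= min_processes
    }


for dep, processes in sorted(shared_dependencies(DEPENDENCIES).items()):
    print(f"{dep}: needed by {', '.join(processes)}")
```

In this made-up data set, the identity provider sits underneath all three processes, which is exactly the kind of concentration that is easy to miss until it is written down.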
Recovery Time Objective (RTO) is the maximum tolerable time a service can be unavailable before serious damage occurs.
In other words: How long can this process be down before things break in a way that really hurts?
Examples:
The key point is that RTO is a business decision, not a technical aspiration. Only the people running the company can say whether 4 hours or 24 hours is acceptable for a given process.
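Once the business has set an RTO, it becomes something you can test against. As a minimal sketch, assuming you record how long each recovery drill actually took, the comparison is a one-liner. The 4-hour target echoes the example above; the drill durations and the `meets_rto` helper are invented for illustration.

```python
from datetime import timedelta


def meets_rto(measured_recovery, rto):
    """True if the measured recovery time is within the agreed RTO."""
    return measured_recovery <= rto


# Hypothetical values: an agreed RTO and two recovery drill results.
rto = timedelta(hours=4)
drill_results = [
    timedelta(hours=3, minutes=20),
    timedelta(hours=5, minutes=5),
]

for measured in drill_results:
    verdict = "within" if meets_rto(measured, rto) else "exceeds"
    print(f"Recovery took {measured}, which {verdict} the RTO of {rto}")
```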
Recovery Point Objective (RPO) is the maximum tolerable amount of data loss, measured in time.
In practice, it answers: If we had to roll back to an earlier copy of the data, how far back could we go without making a mess?
Examples:
Again, this isn’t a number IT should invent based on what the current backup solution happens to achieve. It’s a business risk decision.
Set RPO too high and you may end up explaining to customers and regulators why a sizeable chunk of critical data has evaporated.
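To see how RPO relates to a backup schedule, it helps to compute the data loss window for a concrete failure: the time between the last usable backup and the moment things went wrong. The sketch below assumes that every listed backup completed successfully and can be restored. The timestamps, the 4-hour RPO and the `data_loss_window` helper are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the most recent backup taken before the failure
    and the failure itself, i.e. the work that would be lost."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


# Hypothetical scenario: nightly backups at 01:00, failure at 17:30.
backups = [datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 2, 1, 0)]
failure = datetime(2024, 1, 2, 17, 30)
rpo = timedelta(hours=4)

loss = data_loss_window(backups, failure)
print(f"Data loss window: {loss}; within RPO: {loss <= rpo}")
```

In this scenario the nightly backup leaves a window of 16.5 hours, so a 4-hour RPO would not be met. The schedule has to follow from the RPO, not the other way round.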
You hear a lot of vague statements such as “we need five nines” (99.999% uptime) or “we cannot afford downtime”. They may sound strong, but without context they carry no usable information.
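To put a number on what an availability percentage allows, you can convert it into permitted downtime. This is a minimal sketch, assuming a 365-day year and counting all downtime equally; the `allowed_downtime_minutes` helper is a name made up for this example.

```python
def allowed_downtime_minutes(availability_percent, period_hours=365 * 24):
    """Maximum downtime in minutes over the period (default: one 365-day
    year) that still satisfies the given availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_hours * 60


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime a year. Even so, the figure says nothing about how quickly a particular process must be back after an incident or how much data you can afford to lose.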
RTO and RPO do three vital things:
Once you have agreed RTO and RPO for your critical services, you can design a resilience and DR strategy that’s grounded in reality rather than guesswork.
In the next blog post, we’ll dig into exactly that: how to get from merely moving backups around to a true resilience strategy.