You Can’t Prevent Every Outage: The Risk Landscape, RTO and RPO

Written by Innovate | Jan 20, 2026 10:15:00 AM

Outages are inevitable. In this post, we show you how to stop arguing about uptime and start defining the RTO and RPO targets that drive resilience and fast recovery.

The Uncomfortable Truth: Outages Are Inevitable

In the introduction, we looked at recent outages at major providers such as AWS, Azure and Cloudflare. The pattern isn’t going away.

Most of the world is now perched on top of a relatively small number of platforms and networks. At the same time, many organisations have shifted from on-premise to hybrid, cloud and SaaS for critical services.

While the combination is powerful, it concentrates risk.

But it’s not just about major providers going down and leaving you unable to reach their platforms. What if your own data and systems are affected?

To recap, outages can happen for many different reasons:

  • Cyberattacks, especially ransomware
  • Cloud and SaaS incidents, from configuration errors to full platform outages
  • Human error, such as failed deployments or misconfigurations (think of the CrowdStrike outage in July 2024)
  • Infrastructure failures, including network, power and storage. A good example is the Heathrow outage in March 2025: losing one of three power connections was enough to bring the whole airport down, and resuming full operations took many hours, resulting in hundreds of cancelled flights and tens of thousands of inconvenienced passengers.

Put it all together and one conclusion is unavoidable:

You cannot prevent every outage. Your differentiator is resilience and recovery speed.

The catch, of course, is that resilience and fast recovery come with a price tag. Which is why you need numbers, not just adjectives, when working on a strategy.

Step 1: Identify the Processes That Keep You Alive

Before you argue about tools, buy more licences or design a complex DR architecture, you need to know what you’re protecting.

Start with the 5 to 10 critical business processes that determine whether you can keep the business running if the worst happens, for example:

  • Order capture and fulfilment
  • Payment processing
  • Manufacturing control systems
  • Clinical or patient record systems
  • Shipping and logistics
  • Contact centre operations

These are the processes where sustained downtime becomes existential rather than merely annoying.

Step 2: Map the Technology and Suppliers Underneath

For each critical process, map out what it sits on. That will usually include:

  • Applications and platforms
  • Databases and data stores
  • Integrations and interfaces
  • Cloud, SaaS and managed service providers

This is where nasty surprises turn up. You may discover:

  • Multiple critical processes sitting on a single legacy database
  • A key SaaS provider with weak or opaque disaster recovery
  • An integration that can only be fixed by one person who happens to be on holiday a lot

You cannot fix what you don’t know about.
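A spreadsheet is enough for this mapping, but even a few lines of code can surface shared dependencies. The following is a minimal sketch rather than a finished tool: the process names come from the list in Step 1, while the component names (`orders-db`, `payment-gateway` and so on) and the `shared_components` helper are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical inventory: critical process -> components it sits on.
DEPENDENCIES = {
    "Order capture and fulfilment": ["web-shop", "orders-db", "payment-gateway"],
    "Payment processing": ["payment-gateway", "orders-db"],
    "Shipping and logistics": ["orders-db", "carrier-api"],
}


def shared_components(dependencies, min_processes=2):
    """Return components that at least `min_processes` processes depend on."""
    users = defaultdict(set)
    for process, components in dependencies.items():
        for component in components:
            users[component].add(process)
    return {
        component: sorted(processes)
        for component, processes in users.items()
        if len(processes) >= min_processes
    }


# List every component that more than one critical process relies on.
for component, processes in sorted(shared_components(DEPENDENCIES).items()):
    print(f"{component}: {', '.join(processes)}")
```

Any component that appears under several critical processes is a candidate single point of failure and deserves a closer look.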

Step 3: RTO – How Long Can You Really Be Down?

What Is RTO?

Recovery Time Objective (RTO) is the maximum tolerable time a service can be unavailable before serious damage occurs.

In other words: How long can this process be down before things break in a way that really hurts?

Examples:

  • If online ordering is down for 2 hours, you might lose some transactions, but customers will probably retry.
  • If a clinical system is down for 2 hours, you could be putting patient safety at risk.
  • If payroll is down for a week, staff will remember.

The key point is that RTO is a business decision, not a technical aspiration. Only the people running the company can say whether 4 hours or 24 hours is acceptable for a given process.
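Once the business has set the number, IT can test against it. As a rough sketch, assume a recovery consists of steps that run one after another and that you have timed each one in a DR exercise. The step durations and the `meets_rto` helper below are illustrative assumptions, not measurements.

```python
from datetime import timedelta


def meets_rto(step_durations, rto):
    """True if the recovery steps, run one after another, fit within the RTO."""
    total = sum(step_durations, timedelta())
    return total <= rto


# Hypothetical timings from a DR exercise: detect, restore, verify.
steps = [timedelta(minutes=30), timedelta(hours=2), timedelta(minutes=45)]

print(meets_rto(steps, rto=timedelta(hours=4)))  # total is 3h15m
print(meets_rto(steps, rto=timedelta(hours=2)))
```

If the measured total exceeds the agreed RTO, either the recovery design or the target has to change. That conversation belongs with the business, not only with IT.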

Step 4: RPO – How Much Data Are You Willing To Lose?

What Is RPO?

Recovery Point Objective (RPO) is the maximum tolerable amount of data loss, measured in time.

In practice, it answers: If we had to roll back to an earlier copy of the data, how far back could we go without making a mess?

Examples:

  • An RPO of 24 hours means you’re willing to lose a day’s worth of transactions.
  • An RPO of 15 minutes means you’re willing to lose, at most, 15 minutes of data.

Again, this isn’t a number IT should invent based on what the current backup solution happens to achieve. It’s a business risk decision.

Set RPO too high and you may end up explaining to customers and regulators why a sizeable chunk of critical data has evaporated.
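To make the definition concrete: the data you lose in an incident is whatever was written after the last usable backup. The sketch below assumes simple point-in-time backups with no replication or transaction log shipping; the timestamps and the `data_loss_window` helper are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the most recent backup before the failure and the failure."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


# Hypothetical schedule: a backup every six hours.
backups = [
    datetime(2026, 1, 20, 0, 0),
    datetime(2026, 1, 20, 6, 0),
    datetime(2026, 1, 20, 12, 0),
]
failure = datetime(2026, 1, 20, 11, 45)

loss = data_loss_window(backups, failure)  # 5 hours 45 minutes
print(loss <= timedelta(hours=24))    # within a 24-hour RPO
print(loss <= timedelta(minutes=15))  # far outside a 15-minute RPO
```

With periodic backups, the worst case is a failure just before the next backup runs, so the backup interval needs to be no longer than the RPO.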

Why These Metrics Matter More Than Slogans

You hear a lot of vague statements such as “we need five nines” (99.999% uptime) or “we cannot afford downtime”. They may sound strong, but without context they say nothing about which services matter, how long each one can be down or how much data it can afford to lose.
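It helps to translate an availability percentage into actual minutes. This small sketch does the arithmetic, assuming a 365-day year and counting every minute of downtime equally; the `allowed_downtime_minutes` function is our own illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Downtime per year that a given availability percentage permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

Five nines works out to a little over five minutes of downtime per year. Whether that is necessary, or affordable, depends entirely on the process behind the service.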

RTO and RPO do three vital things:

  1. They force decision-makers to put a price on downtime and data loss.
  2. They give IT concrete targets to design against.
  3. They expose conflicts, such as labelling everything “critical” while working with a very limited budget.

Once you have agreed RTO and RPO for your critical services, you can design a resilience and DR strategy that’s grounded in reality rather than guesswork.

In the next blog post, we'll dig into exactly that: how to get from merely taking backups to a true resilience strategy.