Beyond Backups: Designing a Working IT Resilience Strategy
In this blog post, we're addressing why an effective disaster recovery strategy requires more than just backups.
Why “We Have Backups” Is Not a Strategy
Ask most organisations how prepared they are and you will hear the same line: “We have backups.”
Backups are important, but they’re only part of the picture. Backups give you a safety net. Resilience is your ability to keep critical services running or restore them quickly enough that the business is not badly damaged.
You can have flawless backups and still:
- Spend days restoring systems
- Leave customers angry and in the dark
- Fail regulatory expectations
All because recovery was slow, untested or only covered part of the estate.
A serious resilience strategy starts with business impact and works backwards. A backup policy usually starts with whatever the existing tools can do.
From “We Have Backups” to “We Can Keep Operating”
Backups answer one question: “Is our data stored somewhere else?”
Resilience answers a tougher one: “Can we continue to deliver our most important services when something fails?”
You should be asking:
- How long would it actually take to restore this system from backup in a real incident, at scale?
- What would staff and customers do while we are restoring?
- Which dependencies such as identity, networking or integrations would still block us even if the data is back?
A credible IT resilience strategy treats backups as one component among many, not as the entire solution.
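The first of those questions lends itself to a back-of-the-envelope estimate. The Python sketch below assumes restore time is dominated by data transfer at a sustained throughput, plus a fixed overhead for tasks such as rebuilding servers and validating data. The function name, parameters and figures are illustrative assumptions, not measurements; only a real restore test gives you a number you can trust.

```python
def estimate_restore_hours(data_gb: float,
                           throughput_mb_per_s: float,
                           overhead_hours: float = 0.0) -> float:
    """Rough restore duration: transfer time plus a fixed overhead.

    All inputs are assumptions to be replaced with figures from
    your own restore tests.
    """
    if data_gb < 0 or overhead_hours < 0:
        raise ValueError("data size and overhead must not be negative")
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = (data_gb * 1024) / throughput_mb_per_s  # GB -> MB
    return transfer_seconds / 3600 + overhead_hours


# Hypothetical estate: 10,000 GB restored at a sustained 100 MB/s,
# plus 4 hours to rebuild servers and validate the data.
hours = estimate_restore_hours(data_gb=10_000,
                               throughput_mb_per_s=100,
                               overhead_hours=4)
print(f"Estimated restore time: {hours:.1f} hours")
```

With these assumed figures the estimate comes out at more than 32 hours, which is the kind of number that should prompt the second question: what do staff and customers do in the meantime?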
The Core Building Blocks of Resilience
1. Redundancy and High Availability
Redundancy means not relying on a single component that can bring everything down if it fails. High availability means designing systems so that if one part fails, another takes over and users hardly notice.
Examples:
- Multiple servers behind a load balancer
- Two data centres rather than one
- Cloud services spread across multiple regions
We have seen what happens when this is ignored. One of our customers had services hosted in a single Azure region. When that region went down, so did they. That is the classic “all eggs in one basket” problem.
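The effect of a second region can be put in rough numbers. As a minimal sketch, assume each copy of a service fails independently and failover is instant; the combined availability of n copies is then 1 - (1 - a)^n. Both assumptions are optimistic, since real outages are often correlated and failover takes time, and the function names and the 99% figure are purely illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent replicas, any one of which suffices."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figure: each region is available 99% of the time.
for copies in (1, 2):
    a = combined_availability(0.99, copies)
    print(f"{copies} region(s): {a:.4%} available, "
          f"about {downtime_hours_per_year(a):.1f} hours down per year")
```

Under these assumptions a single region means roughly 88 hours of downtime a year, while two independent regions bring that below one hour. Real-world gains are smaller, but the direction is the point.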
2. Data Protection: Backups, Replicas, Snapshots, Off-Site and Cross-Region
You already understand backups. The detail matters:
- Backups
Copies of your data stored separately, often on slower, cheaper storage. Good for recovery from deletion, corruption or ransomware, but usually slow to restore.
- Replicas
Live, continuously updated copies of data elsewhere, designed for fast failover. Great for uptime, but if you replicate too aggressively you can also replicate corruption or malicious changes.
- Snapshots
Point-in-time images of systems or volumes, useful for quick rollbacks.
- Off-site / cross-region
Storage or replicas in other physical locations or cloud regions, to protect against site-level issues such as fires, floods or regional cloud failures.
A mature strategy combines these deliberately, based on the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) you agreed as a business, not on the default settings in your backup software.
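One way to make this concrete is to compare what your protection actually delivers with the agreed objectives. The sketch below assumes periodic backups, where worst-case data loss is roughly the interval between backups, and takes the restore duration from your own tests. The class names, fields and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Objectives:
    rto_hours: float  # maximum acceptable time to restore service
    rpo_hours: float  # maximum acceptable window of lost data


@dataclass
class Protection:
    backup_interval_hours: float  # time between successive backups
    tested_restore_hours: float   # restore duration measured in a test


def find_gaps(objectives: Objectives, protection: Protection) -> list[str]:
    """Return a plain-language list of ways the protection misses the objectives."""
    gaps = []
    # Simplification: worst-case data loss is about one backup interval.
    if protection.backup_interval_hours > objectives.rpo_hours:
        gaps.append(
            f"RPO missed: up to {protection.backup_interval_hours}h of data "
            f"could be lost, target is {objectives.rpo_hours}h"
        )
    if protection.tested_restore_hours > objectives.rto_hours:
        gaps.append(
            f"RTO missed: restore took {protection.tested_restore_hours}h "
            f"in testing, target is {objectives.rto_hours}h"
        )
    return gaps


# Hypothetical system: the business agreed a 4h RTO and 1h RPO,
# but the tool's default is a nightly backup and a restore test took 10h.
for gap in find_gaps(Objectives(rto_hours=4, rpo_hours=1),
                     Protection(backup_interval_hours=24, tested_restore_hours=10)):
    print(gap)
```

In this example both objectives are missed, which is the typical result of accepting backup software defaults without checking them against the business targets.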
3. Network Resilience: Multiple Links, Routes and VPNs
Your systems are useless if people and other systems cannot reach them.
Network resilience is about avoiding a single cable, router or ISP becoming your Achilles heel.
What does network resilience planning involve?
- Multiple internet providers
- Redundant firewalls and core network devices
- Diverse routes between sites
- Tested VPNs for remote access
If you’re heavily cloud-based, you also need to consider what happens if a key region or interconnect is degraded.
4. Identity and Access: IAM, AD and SSO Recovery
After a serious incident, one of the most common blockers is depressingly simple: no one can log in.
If your identity provider, such as Active Directory (AD), your Single Sign-On (SSO) platform or your Identity and Access Management (IAM) service (think Microsoft or Google), is down, your recovered applications are effectively bricks.
An IT resilience strategy must treat identity services as tier-one services, with their own DR and high availability design.
If you're an executive, you should be asking yourself: “In an incident, how will administrators and key users authenticate if our primary directory or SSO is offline?”
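Dependencies like these can be written down and turned into a recovery order. The sketch below uses Python's standard-library graphlib module (Python 3.9 or later) to sort a hypothetical dependency map so that networking and identity come back before the applications that rely on them. The service names and the map itself are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each service maps to the services it needs before it can work.
# Hypothetical estate, for illustration only.
DEPENDENCIES = {
    "network": set(),
    "identity": {"network"},
    "database": {"network", "identity"},
    "crm_app": {"identity", "database"},
}


def recovery_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return services in an order where every dependency is restored first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of services that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


print(" -> ".join(recovery_order(DEPENDENCIES)))
```

For this map the only valid order is network, identity, database, then the application. A circular dependency, for example an identity service that can only be restored by someone who has already logged in through it, raises an error, which is exactly the kind of trap worth finding before an incident.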
Trade-Offs: Cost vs Resilience vs Complexity
Resilience isn’t free and it isn’t linear. Doubling spend does not magically halve risk.
Each extra layer of protection adds cost and usually adds complexity, which can itself create new failure modes.
For example:
- Active/active architectures reduce downtime but are harder to operate and test
- Aggressive replication improves your Recovery Point Objective (RPO) but increases the risk of replicating corruption or ransomware
- Extra vendors or regions reduce concentration risk but increase integration and monitoring effort
Leadership’s job is to decide, explicitly, where high resilience is essential and where slower recovery is acceptable. Labelling everything as “critical” results in bloated, fragile designs, and wasted money.
Architecture Standards, Technical Debt and the Drag of Legacy
Resilience goes beyond DR tools. It’s heavily influenced by the quality of your architecture and the amount of technical debt you’re carrying.
Common warning signs:
- Legacy systems that can’t be clustered or replicated
- Applications only understood by one engineer
- Point-to-point integrations and hard-coded dependencies
- “Temporary” workarounds that quietly became permanent
These make recovery slow, unpredictable and dependent on individual heroics.
The alternative is to invest in architecture standards, for example:
- Common patterns for how critical services are built and protected
- Clear “gold / silver / bronze” resilience tiers and what each means
- Approved, well-understood technologies for backup, replication, monitoring and identity, rather than a mess of bespoke setups
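Tier definitions are most useful when they are written down as data that tooling can check. A minimal sketch follows; the RTO and RPO numbers, the service names and the function names are placeholders for illustration, not recommendations.

```python
from collections import Counter

# Hypothetical tier catalogue: what each label commits you to.
TIERS = {
    "gold": {"rto_hours": 1, "rpo_hours": 0.25},
    "silver": {"rto_hours": 8, "rpo_hours": 4},
    "bronze": {"rto_hours": 72, "rpo_hours": 24},
}

# Hypothetical assignment of services to tiers.
SERVICE_TIERS = {
    "payments": "gold",
    "crm": "silver",
    "intranet": "bronze",
    "wiki": "bronze",
}


def targets_for(service: str, assignments: dict[str, str]) -> dict[str, float]:
    """Look up the recovery targets a service inherits from its tier."""
    tier = assignments[service]
    if tier not in TIERS:
        raise ValueError(f"{service!r} uses undefined tier {tier!r}")
    return TIERS[tier]


def tier_counts(assignments: dict[str, str]) -> Counter:
    """Count services per tier, to spot an estate where everything is 'gold'."""
    return Counter(assignments.values())


print(targets_for("payments", SERVICE_TIERS))
print(dict(tier_counts(SERVICE_TIERS)))
```

Keeping the catalogue this small is deliberate: a tier is only meaningful if everyone can say what it promises, and a count per tier quickly shows whether too many services have been labelled critical.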
If you refuse to confront technical debt and inconsistent architecture, you’re effectively betting your continuity on a few exhausted people in a crisis. And if there’s anything you should avoid in an emergency, it’s panic.
A serious IT resilience strategy treats debt reduction and standardisation as first-class resilience activities rather than side projects.
In the next blog post, we'll look beyond your own estate and talk about third-party risk, testing and the human side of recovery.