How to scale and mitigate risk in the cloud?

Technical Viewpoint by Costa Christodoulou, Head of IT Systems, ZetaSafe

The platform underlying ZetaSafe now runs on more than 100 different components, including Managed Storage Accounts, Virtual Machines, Application Instances, Network Security Groups, Traffic Managers and Backup Services, to name but a few.

This is all achievable when you adopt a cloud-first approach as a business. It gives you the power to scale your services both horizontally for capacity and vertically for performance, with little or no downtime and normally with a few clicks of a button.

The power to scale your application goes hand in hand with mitigating risk. While many cloud services offer SLAs, some are as low as 99.9% (that is roughly 43 minutes of possible downtime a month, compared with roughly 4 minutes at 99.99%), and at that level you need to mitigate the risk as much as possible. One quick answer is to scale out your service from 1 to 2 instances: a single instance failure no longer takes the service down, so you have lowered the risk while also increasing capacity.
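The downtime figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 30-day month, and the function names are made up for this example. The combined figure for several instances further assumes that instances fail independently of one another, which real deployments only approximate (instances in the same data centre can fail together).

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def downtime_minutes_per_month(availability: float) -> float:
    """Possible downtime per month, in minutes, for an availability such as 0.999."""
    return (1.0 - availability) * MINUTES_PER_MONTH


def combined_availability(availability: float, instances: int) -> float:
    """Availability of N redundant instances, assuming independent failures.

    The service is only down when every instance is down at the same time.
    """
    return 1.0 - (1.0 - availability) ** instances


if __name__ == "__main__":
    for sla in (0.999, 0.9999):
        minutes = downtime_minutes_per_month(sla)
        print(f"{sla:.2%} SLA: about {minutes:.1f} minutes a month")

    two = combined_availability(0.999, 2)
    print(f"Two 99.9% instances: about {downtime_minutes_per_month(two):.3f} minutes a month")
```

Under these assumptions, 99.9% works out at 43.2 minutes a month and 99.99% at 4.32 minutes, which matches the rounded figures quoted above. The two-instance result is a theoretical best case rather than a promise, which is why the next section looks at whole data centre outages.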

But what about a data centre outage?

While the cloud is great, you rarely see one cloud in the sky. Data centres have outages; this is a fact of life, and it is why you need to design your solution to be robust. Tools like Microsoft Azure Traffic Manager allow us not only to route traffic to any online data centre but also to route our clients to the nearest data centre geography, giving them the best performance possible. Furthermore, rather than only operating multiple instances in one region, we operate multiple instances in different geographical regions, which mitigates this risk and improves the capacity and performance of our service.
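The routing decision described here (send each client to the nearest data centre that is currently online) can be sketched in a few lines. This is not the Azure Traffic Manager API and not ZetaSafe's configuration; the `Endpoint` type, the region names and the latency figures are hypothetical and exist only to illustrate the logic.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    region: str        # hypothetical region name
    online: bool       # result of the latest health check
    latency_ms: float  # latency measured from the client's geography


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the lowest-latency endpoint that is online, or None if all are down."""
    healthy = [e for e in endpoints if e.online]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


if __name__ == "__main__":
    endpoints = [
        Endpoint("region-a", online=False, latency_ms=12.0),  # nearest, but down
        Endpoint("region-b", online=True, latency_ms=35.0),
        Endpoint("region-c", online=True, latency_ms=90.0),
    ]
    chosen = pick_endpoint(endpoints)
    print(chosen.region if chosen else "no endpoint available")
```

In this example the nearest region is offline, so the client is sent to the next nearest one instead of receiving an error. A managed traffic manager makes this kind of decision for you; the sketch only shows why health and proximity both matter.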

Below is a simple example architecture diagram:

Automatic failover

The key to risk mitigation is to automate your failover points where possible. Having multiple instances is great, but waiting 10 minutes for a person to initiate a failover undermines some of the effort you put into building resilience into your service.

This is where cloud services such as traffic managers and internal load balancers, configured with failover in mind, come into play and provide that level of fast failover. Nevertheless, there are rare occasions when a controlled failover is still preferred, especially where data consistency and integrity are concerned. For these rare events, monitoring, alerting, escalation, procedures and processes are important to minimise the impact. Even so, remember: automate what and where you can.
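As a rough illustration of the difference between automatic and controlled failover, the sketch below counts consecutive failed health probes and then either switches endpoints itself or raises an alert for an operator. The class name, the threshold of three probes and the region names are all assumptions made for this example; managed services implement this logic for you with their own settings.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    primary: str
    secondary: str
    failure_threshold: int = 3  # assumed: consecutive failed probes before acting
    automatic: bool = True      # False = controlled failover, wait for an operator
    active: str = ""
    consecutive_failures: int = 0

    def __post_init__(self) -> None:
        # Traffic starts on the primary endpoint.
        self.active = self.primary

    def record_probe(self, healthy: bool) -> str:
        """Feed in one health-probe result for the active endpoint; return the action taken."""
        if healthy:
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "degraded"
        if not self.automatic:
            # Controlled mode: leave traffic where it is and escalate.
            return "alert-operator"
        # Automatic mode: switch to the other endpoint and start counting again.
        self.active = self.secondary if self.active == self.primary else self.primary
        self.consecutive_failures = 0
        return "failed-over"


if __name__ == "__main__":
    monitor = FailoverMonitor("region-a", "region-b")
    for result in (True, False, False, False):
        print(monitor.record_probe(result), "->", monitor.active)
```

Requiring several consecutive failures avoids failing over on a single dropped probe, while the `automatic` flag models the rarer case where a person should decide, for example when data consistency is at stake.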

