What it really takes to prevent your site crashing
Were you one of the hundreds of thousands of people stuck in an online queue for Ocado, Boots or B&Q at the start of lockdown? Did you hang on in there or did you end up shopping somewhere else?
Either way, the experience of being unable to access sites and services we normally take for granted - especially if we’re loyal customers - was supremely frustrating.
While nobody could have predicted the pressure and importance COVID would place on digital services, the sad fact is that disaster recovery is often the last thought when organisations are launching new sites.
When the worst does happen – such as a pandemic – it really shows.
But it’s not just in times of crisis that business continuity is important.
No online business can afford downtime
As my colleague Steve covered in a recent post on Four Things That Are Harming Your Conversion Rate, if your pages don’t load instantly, you’ll lose customers.
Amazon famously reported that one second of page load delay can cost them over $1.6 billion in annual sales.
Site crashes and delays also harm trust – 79% of customers “dissatisfied” with a site’s performance are less likely to buy from them again.
This is one of the many reasons business continuity and disaster recovery are right there at the top of the checklist when we work on a new project.
In this month’s article, we thought we’d share everything we do here at Tangent to keep our clients’ online services available and guarantee at least 99.95% uptime on enterprise systems.
Achieving 99.95% uptime
We work with many different clients who frequently experience big surges in traffic.
UK Power Networks look after the power infrastructure for the South East. If there’s a sudden storm in the area there are likely to be more power cuts and they get a significant amount of traffic as a result.
Likewise, the Labour party, another one of our clients, gets a significant amount of traffic during a general election or a leadership election.
Our service level agreements guarantee we’ll keep all of our clients’ sites available, up and running 99.95% of the time, even during these peak periods. This is known as an ‘uptime’ of 99.95%.
To put it another way, the maximum downtime they can expect is about 43 seconds a day, which is roughly 5 minutes a week and just under 22 minutes a month.
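The arithmetic behind a downtime allowance like this is straightforward. As a minimal sketch (the function name is ours, and a month is assumed to be 30 days):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(uptime_percent: float, days: float) -> float:
    """Maximum downtime, in seconds, allowed over `days` at the given uptime."""
    return (1 - uptime_percent / 100) * days * SECONDS_PER_DAY


# Assumption: a "month" is taken to be 30 days for this illustration.
for label, days in [("day", 1), ("week", 7), ("30-day month", 30)]:
    seconds = downtime_budget_seconds(99.95, days)
    print(f"per {label}: {seconds:.1f} s ({seconds / 60:.1f} min)")
```

For 99.95% this works out at roughly 43 seconds a day, 5 minutes a week and a little under 22 minutes over a 30-day month.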
This is the industry standard that most sites should be aiming for, although there are some critical services, such as HMRC or the NHS, which cannot afford to be down at all and will need even higher uptimes.
While there are several different measures we implement, proactively and reactively, to mitigate the impact of traffic spikes, it starts with designing a strong technical foundation in the first place.
1. Zero downtime deployment
Traditionally, when you launched a new site or service, there'd be downtime with it. But now with CI/CD, which stands for continuous integration and continuous deployment, we can achieve zero-downtime deploys using approaches such as blue-green deployments or deployment slots.
We can also deploy faster than ever and run automated tests on our applications. We aim for 70-80% code coverage on enterprise projects, which means 70-80% of our code is exercised by automated tests every time we deploy.
All of which means we know we’re deploying robust sites in the first place.
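To illustrate the idea behind a blue-green (or slot-based) deploy, here is a minimal sketch in Python. It is a toy model rather than how any particular cloud platform implements it: the `BlueGreenRouter` class, the slot names and the `health_check` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which slot serves live traffic and which version each slot runs."""

    live: str = "blue"
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"}
    )

    @property
    def idle(self) -> str:
        """The slot that is not currently serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, health_check: Callable[[str], bool]) -> bool:
        """Deploy to the idle slot and swap only if it passes its health check."""
        target = self.idle
        self.versions[target] = version  # the live slot keeps serving meanwhile
        if not health_check(version):
            return False  # live traffic never saw the new version
        self.live = target  # the swap is a single pointer change
        return True


router = BlueGreenRouter()
router.deploy("v2", health_check=lambda version: True)
print(router.live, router.versions[router.live])
```

Because the new version is only ever installed on the idle slot, a failed health check leaves visitors on the previous version, and rolling back is just a matter of swapping the pointer back.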
2. Release management and testing
Many times a site has been accidentally broken by someone going in and tinkering with code. Not on our watch! As part of our quality assurance (QA) efforts, we meticulously test new versions of the site after we’ve updated them.
Firstly, we do a manual regression test, to ensure that the site still performs just as well after the change as it did before. Then we run User Acceptance Testing (UAT).
We also do ‘smoke tests’ once in production: a quick check of the key pages and functions to make sure everything is set up and working.
All of which is super important to ensure these sites are well-built and remain available.
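A smoke test can be as simple as requesting a handful of key pages and checking that each one responds successfully. The sketch below assumes a hypothetical `fetch_status` function that returns the HTTP status code for a path; in practice that would wrap a real HTTP client, and the paths shown are made up.

```python
from typing import Callable, Dict, List


def run_smoke_tests(
    paths: List[str], fetch_status: Callable[[str], int]
) -> Dict[str, bool]:
    """Return pass/fail per path; a path passes if it answers with HTTP 200."""
    results: Dict[str, bool] = {}
    for path in paths:
        try:
            results[path] = fetch_status(path) == 200
        except Exception:
            # A connection error or timeout counts as a failed check.
            results[path] = False
    return results


# Stand-in for a real HTTP call, for illustration only.
def fake_fetch_status(path: str) -> int:
    return {"/": 200, "/contact": 200, "/search": 500}.get(path, 404)


results = run_smoke_tests(["/", "/contact", "/search"], fake_fetch_status)
failed = [path for path, ok in results.items() if not ok]
print("failed:", failed)
```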
3. Scalable infrastructure
Along with building applications so that they can scale, we also make sure that the infrastructure of the site can deal with spikes in traffic.
With auto-scaling, we can scale up or scale out PaaS infrastructure, adding either more resources to existing instances or more instances of the infrastructure respectively. These increases are triggered by agreed thresholds for CPU or memory usage.
This means clients only pay for this additional resource when it is required.
However, this is something we need to be careful with. We don’t want to be paying out a fortune for an increase in server space that turns out to be a DDoS attack rather than a legitimate increase in traffic.
We manage this with careful monitoring...
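The threshold-based scaling described in this section can be sketched as a simple decision function. This is an illustration only: the threshold values, the instance limits and the function name are assumptions, and real platforms expose this as configuration rather than code. Note the hard upper limit, which caps the cost of a traffic spike that turns out not to be legitimate.

```python
def desired_instances(
    current: int,
    cpu_percent: float,
    memory_percent: float,
    *,
    scale_out_at: float = 75.0,  # assumed threshold, agreed per client
    scale_in_at: float = 30.0,   # assumed threshold, agreed per client
    min_instances: int = 2,
    max_instances: int = 10,     # hard cap so a spike cannot run up unlimited cost
) -> int:
    """Scale out by one instance under load, scale in by one when idle."""
    load = max(cpu_percent, memory_percent)
    if load >= scale_out_at:
        target = current + 1
    elif load <= scale_in_at:
        target = current - 1
    else:
        target = current
    return max(min_instances, min(max_instances, target))


print(desired_instances(current=3, cpu_percent=82.0, memory_percent=40.0))
```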
4. Monitoring & Alerting
Proactive and continuous monitoring allows us to act quickly when something unexpected happens - such as a hacker attack or other kind of emergency, including pandemic-induced demand!
Some of the ways we actively monitor the way our clients’ sites are performing include:
- Custom monitoring and alerting: for example, on abnormal traffic spikes, items in queues, load on individual services, business-related logic and response times.
- Uptime monitoring: this means we have services constantly pinging our servers, and if we don't get a response within a certain threshold (a few seconds), we know the site is down and can instantly alert everyone. As a result, we should know there is an issue before you do.
- Application monitoring: if there's an error with the application, we receive a notification immediately detailing the stack trace and source of the issue.
- Infrastructure monitoring: this allows us to look at CPU and memory usage. We receive an alert if usage reaches an agreed threshold.
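The uptime check in the list above can be sketched in a few lines. The names here are hypothetical: `probe` stands in for whatever actually pings the site and returns the response time in seconds, and requiring two consecutive failures before alerting is an assumption made to avoid false alarms from a single dropped request.

```python
from typing import Callable, List


def check_once(probe: Callable[[], float], timeout_seconds: float = 3.0) -> bool:
    """True if the site answered within the threshold, False otherwise."""
    try:
        return probe() <= timeout_seconds
    except Exception:
        # No response at all is treated the same as a slow response.
        return False


def should_alert(results: List[bool], failures_needed: int = 2) -> bool:
    """Alert when the most recent `failures_needed` checks all failed."""
    recent = results[-failures_needed:]
    return len(recent) == failures_needed and not any(recent)


history = [check_once(lambda: 0.4), check_once(lambda: 7.5), check_once(lambda: 9.0)]
print(history, should_alert(history))
```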
5. Regional redundancy & failover
As you may or may not know, cloud infrastructure is distributed globally across regions. For example, AWS has a region called eu-west-1, hosted in Ireland.
It’s rare, but potentially an entire region could go down, particularly in the event of a disaster. In order to prevent your site going down with it, we can build in regional redundancy or failover to another region.
There are two types of failover: hot failover and cold failover. Hot means we’d maintain and update all the data from the primary to the failover environment. Cold means we’d only have the failover infrastructure and wouldn’t be able to recover or replicate data.
While failover does give a higher level of uptime, it obviously brings extra costs. To keep costs down, an alternative option is to define the infrastructure as code (IaC) using something like Terraform.
This means that in the event of a disaster, we can recreate the infrastructure programmatically and get the application up and running much quicker than if we had to do it manually.
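At its simplest, the failover decision is "send traffic to the first healthy region in order of preference". The sketch below shows that logic only; in practice it would usually sit in DNS or a global load balancer rather than application code. The function name and the health map are hypothetical, and eu-west-2 is used purely as an example of a second region.

```python
from typing import Dict, List, Optional


def pick_region(preferred: List[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy region in preference order, or None if all are down."""
    for region in preferred:
        if healthy.get(region, False):
            return region
    return None


regions = ["eu-west-1", "eu-west-2"]  # primary first, failover second
print(pick_region(regions, {"eu-west-1": True, "eu-west-2": True}))
print(pick_region(regions, {"eu-west-1": False, "eu-west-2": True}))
```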
6. Data handling and queuing
Another key element of keeping a site up and running is how we design the application to process and handle data, especially when it involves passing on data to third party APIs.
The queuing of requests is particularly important.
So, for example, if we have a simple contact form that sends data to an external email service like SendGrid, we ensure those emails go into a queue before being processed. This means we can handle huge amounts of traffic hitting that form.
If we just sent it all straight to the email service and there was an outage, important data might never get there.
Instead, we process data from a queue and only clear items from that queue once we receive a successful response from the service. This offers a level of ‘eventual consistency’ by design.
Without building this in, you could potentially lose valuable emails or data, which makes it an important part of handling large volumes of traffic.
We also monitor these queues at all times, with alerts to make sure that we can quickly deal with any issues that may occur.
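The key detail is that an item only leaves the queue after the external service confirms it was received. Here is a minimal in-memory sketch of that pattern. A production system would use a durable queue service rather than a Python `deque`, and the `send` callback is a hypothetical stand-in for the call to the email provider.

```python
from collections import deque
from typing import Callable, Deque, Dict


def drain_queue(queue: Deque[Dict], send: Callable[[Dict], bool]) -> int:
    """Send queued messages in order, stopping at the first failure.

    Returns the number of messages delivered. Anything not delivered
    stays in the queue to be retried on the next run.
    """
    delivered = 0
    while queue:
        message = queue[0]  # peek, but do not remove yet
        try:
            ok = send(message)
        except Exception:
            ok = False
        if not ok:
            break  # leave the message queued for a later retry
        queue.popleft()  # remove only after a confirmed success
        delivered += 1
    return delivered


# Illustration: the service is down for the second message on the first run.
outage = {"active": True}


def flaky_send(message: Dict) -> bool:
    if message["id"] == 2 and outage["active"]:
        return False
    return True


pending = deque([{"id": 1}, {"id": 2}, {"id": 3}])
print(drain_queue(pending, flaky_send), "delivered,", len(pending), "still queued")
outage["active"] = False
print(drain_queue(pending, flaky_send), "delivered,", len(pending), "still queued")
```

Nothing is lost during the outage: the undelivered messages simply wait in the queue until the service is back.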
7. Caching policies
Another important failsafe to have in place, in the event of extreme, unexpected increases in traffic, is a set of ‘caching’ policies.
We obviously have caching at an application level, but in some cases we also consider it at an infrastructure level, using services such as Varnish Cache or a CDN service like Cloudflare, Amazon CloudFront or Azure CDN.
Some of these services offer features such as “dynamic site acceleration”, and they allow us to cache the entire front end/rendered markup of a site. Rather than requests hitting the actual application’s infrastructure, we ensure that they hit a cache service instead.
This helps ensure you can handle incredibly high volumes of traffic without the site crashing.
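The effect of caching rendered pages can be shown with a small sketch. This is a toy in-process cache, not how Varnish or a CDN is configured; the `PageCache` class, the 60-second lifetime and the `render` function are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Serves rendered pages from memory until they are older than the TTL."""

    def __init__(
        self,
        render: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._store.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: the application is not touched
        html = self._render(path)  # cache miss: render once and remember it
        self._store[path] = (now, html)
        return html


render_calls = []


def render(path: str) -> str:
    render_calls.append(path)
    return f"<html>page for {path}</html>"


cache = PageCache(render)
for _ in range(1000):
    cache.get("/")
print("requests: 1000, renders:", len(render_calls))
```

A thousand requests for the same page reach the application only once, which is why a cache in front of the site absorbs spikes so well.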
8. Vulnerability scanning
Keeping a site up and running and coping with traffic spikes is very much a proactive activity.
We need to keep everything that makes up a system, from content management systems to third-party libraries, updated to their latest stable versions. We do this by periodically scanning all of the application's dependencies, normally as part of the CI/CD pipeline and our release process.
To perform these ‘vulnerability scans’ we use npm audit for the front-end libraries. For the back-end libraries, we use GitLab's built-in vulnerability scanner. Both reference authorities such as the NVD (National Vulnerability Database) and CVE (Common Vulnerabilities and Exposures).
These scans mean we get to see if there are any weaknesses or security threats in the libraries we're using and we can proactively update them before they become an issue.
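In a pipeline, the result of a scan is typically turned into a pass/fail decision. The sketch below assumes the scanner's report has already been reduced to a count of findings per severity level. That dictionary shape, the function names and the choice to block only on "high" and "critical" are our assumptions for illustration, not the actual output format of npm audit or GitLab.

```python
from typing import Dict

# Assumption: only these severities should stop a release.
BLOCKING_SEVERITIES = ("high", "critical")


def blocking_vulnerabilities(severity_counts: Dict[str, int]) -> int:
    """Number of findings at a severity that should stop the release."""
    return sum(severity_counts.get(level, 0) for level in BLOCKING_SEVERITIES)


def should_fail_build(severity_counts: Dict[str, int]) -> bool:
    """True if the pipeline should stop so the dependency can be updated first."""
    return blocking_vulnerabilities(severity_counts) > 0


# Hypothetical scan summary, for illustration only.
summary = {"low": 4, "moderate": 1, "high": 2, "critical": 0}
print("fail build:", should_fail_build(summary))
```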
The pandemic took everyone by surprise, and it exposed the fact that many organisations’ continuity planning wasn’t up to scratch.
Most of us take it for granted that our go-to sites and apps will work when we want them to, but there’s a considerable amount of effort that goes on behind the scenes to make that happen.
And while it may seem excessive to some, it costs a lot more to scramble to get something into place in the event of an emergency than it does to have it as part of your foundation and processes from the beginning.
Not to mention the costs of losing potential sales and trust when a loyal customer can’t access a crucial service when they need it.
Doing it right from the start always pays its way in the long run.