Cloud Computing - 7 Fundamentals
In a previous post, I touched very briefly on the benefits of cloud. In this one, though, I'm going to give a quick overview of some of the key concepts of cloud. These address the foundational theory of why we use cloud and why it's the go-to choice, superseding on-premises setups, independent of which provider you favour.
I'll introduce each concept using the technical term (because it's always good to get the terminology right for clear communication), but I'll also explain what it means, hopefully in an accessible way. But first...
What is Cloud Computing?
You've probably already figured that cloud computing has nothing to do with the fluffy things in the sky. Really, it gets its name because of the illusion (in my mind at least). Things are no longer right in front of you, taking up space in the office or a special room just for servers. It's all just "out there". In reality of course, everything is still on servers with actual hardware, just away from where you can see them.
Cloud computing covers 3 areas, all of which are important for an MLOps setup: compute, network and storage. Or to put that a different way, cloud covers the virtual computers that run your models, the storage that holds your data, and the ability for these to communicate with each other and the wider world.
Compared to the equivalent on-premises setup, cloud offers a low-cost solution which is easy to adopt and has "unlimited" resources. Why? Well, let's delve into that as we explore the different fundamentals of cloud computing. NB "unlimited" is in inverted commas because there is a limit due to the hardware, but to the end user it feels limitless.
High Availability
This is the key thing in my mind that gives us that "unlimited" feel to cloud computing. What we need is there whenever we need it, and we can change it if those needs change. Being liberated from only what is available on local hardware gives us a whole new ability to add or remove services with just a few clicks or lines of code. We don't have to find space for a new server, or figure out what to do with ones we don't need anymore (better still, we don't have to predict an uncertain future when setting up!). The benefits of high availability also show when something breaks: most of the time you won't even know that a server you're using is down, because everything has seamlessly been moved across to another one. It is the use of clusters that makes this so doable.
Reliability
This is also known as fault tolerance or disaster recovery by some. Those names are less snappy, but I think more descriptive of what this actually means. In an ideal world, and in most scenarios, this is the aspect that ensures the end user experiences no down time. It's a way to describe the resilience of a system. You can have highly available resources, but if they aren't reliable you might still have a lot of down time when there are multiple failures.
There are two strategies for ensuring reliable systems. Firstly, deploy in multiple locations: that way, if there is a local failure at the data centre (e.g. a power cut), your systems can still run elsewhere. The other strategy is to ensure that there is no single point of failure, so that no one server can bring down the entire service, i.e. have a backup plan! Needless to say, employing a bit of both of those strategies is the ideal way to cover all bases when it comes to having a reliable system.
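To make those two strategies a little more concrete, here is a minimal Python sketch of failing over between locations. Everything in it is hypothetical: the region names, the `healthy` and `down` handlers and the `call_with_failover` function are made up for illustration, and real providers handle this for you with load balancers and health checks rather than a loop like this.

```python
from typing import Callable, Dict, Optional


def call_with_failover(replicas: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Try each replica in order and return the first successful response."""
    for region, handler in replicas.items():
        try:
            return handler()
        except ConnectionError:
            # This location is unavailable, so move on to the next one.
            continue
    return None  # every replica failed


def healthy() -> str:
    # Hypothetical replica that is up and running.
    return "ok from backup region"


def down() -> str:
    # Hypothetical replica whose data centre has had a power cut.
    raise ConnectionError("data centre power cut")


# Two locations (hypothetical names), so neither is a single point of failure.
replicas = {"region-a": down, "region-b": healthy}
print(call_with_failover(replicas))
```

The point is only that the caller never sees the failure in `region-a`, because a second location is there to pick up the request.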
Scalability
Scalability is the ability to change the amount of resources you have on an as-needed basis. There are two types of scaling that you should be familiar with. The most common is horizontal scaling, or scaling out: this is where you add more of the same resource, e.g. another VM that is a duplicate of existing ones. It's probably most common because it's quick. The other way to scale is vertically, or scaling up, which is where you replace existing resources with something that's a different size (smaller or bigger depending on needs).
Now, you can do this manually when you need to, but what if you just don't have the time? Or what if the situation changes so quickly that you can't respond? This is where autoscaling comes in: the changes, whether scaling up or out, increasing or reducing resources, happen automatically within the set of rules that you have created.
A key advantage of autoscaling is that it prevents you from overpaying for services. For instance, if your APIs usually get around 100 calls at a time, then most of the time you will pay for the resources that handle that. However, if you get the occasional spike of 2000 calls, then what? Well, with a system that doesn't scale you might need to deny requests, which can be a big issue for your users and therefore for you as a business. Or you would have to have all the hardware in place for 2000 requests all the time, even for a one-off peak, which could be pricey and wasteful the rest of the time. With cloud computing, your systems will expand and contract based on need, and you only pay for what you use!
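As a rough illustration of what "a set of rules" might look like, here is a minimal Python sketch of a horizontal (scale out) rule. The function name, the capacity per instance and the minimum and maximum limits are all assumptions of mine for the example, not any provider's actual API.

```python
import math


def desired_instances(current_load: int, capacity_per_instance: int,
                      min_instances: int = 1, max_instances: int = 20) -> int:
    """Return how many identical instances the rule asks for at this load."""
    # Round up so that a partly used instance still counts as a whole one.
    needed = math.ceil(current_load / capacity_per_instance)
    # Keep the answer inside the limits set in the rule.
    return max(min_instances, min(max_instances, needed))


# Hypothetical numbers: each instance copes with 50 calls at a time.
print(desired_instances(100, 50))    # a quiet period
print(desired_instances(2000, 50))   # a spike, capped at max_instances
```

The minimum keeps the service alive when traffic drops to nothing, and the maximum stops a runaway spike from scaling (and billing) without limit.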
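To put some numbers on that, here is a small Python sketch comparing the two options over a month with a single one-hour spike. The capacity of 100 calls per instance, the price of 10 cents per instance-hour and the 730-hour month are made-up figures purely for the arithmetic; real pricing varies by provider and service.

```python
import math
from typing import List

CALLS_PER_INSTANCE = 100     # hypothetical capacity of one instance
PRICE_CENTS_PER_HOUR = 10    # hypothetical price per instance-hour
HOURS_IN_MONTH = 730         # rough average number of hours in a month


def instances_for(calls: int) -> int:
    """Instances needed to serve this many calls (always at least one)."""
    return max(1, math.ceil(calls / CALLS_PER_INSTANCE))


def fixed_cost_cents(peak_calls: int) -> int:
    """Cost of provisioning for the peak all month long."""
    return instances_for(peak_calls) * HOURS_IN_MONTH * PRICE_CENTS_PER_HOUR


def autoscaled_cost_cents(hourly_calls: List[int]) -> int:
    """Cost when each hour only pays for the instances it needs."""
    return sum(instances_for(calls) * PRICE_CENTS_PER_HOUR for calls in hourly_calls)


# 729 ordinary hours at 100 calls, plus a one-hour spike of 2000 calls.
hourly_calls = [100] * (HOURS_IN_MONTH - 1) + [2000]
print(fixed_cost_cents(2000))               # 146000 cents
print(autoscaled_cost_cents(hourly_calls))  # 7490 cents
```

Under these assumed figures, provisioning for the peak costs roughly twenty times as much as scaling with demand, and that gap is the "waste" described above.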
Predictability
Within predictability, there are two areas: performance and cost. Both are important, but they are different.
Performance
This is about giving the end users a consistent experience. You need to be able to predict performance so you know you are delivering. This relies heavily on a few of the previous points, e.g. autoscaling and high availability, along with some others.
Cost
Here we are talking about tracking and forecasting those pennies. This is to help with budgeting to make sure you have the funds to meet your requirements: there should be no surprise invoices at the end of the month! There are also analytics tools built into the cloud which can help with long-term predictability and planning.
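The simplest kind of forecast is a straight-line projection from what you've spent so far. Here is a tiny Python sketch of that idea. The function names and figures are hypothetical, and the built-in cost tools from cloud providers do something far more sophisticated than this.

```python
def projected_month_end_spend(spend_so_far: float, day_of_month: int,
                              days_in_month: int) -> float:
    """Project month-end spend, assuming the daily rate so far continues."""
    daily_rate = spend_so_far / day_of_month
    return daily_rate * days_in_month


def is_over_budget(projected_spend: float, budget: float) -> bool:
    """True if the projection would exceed the budget."""
    return projected_spend > budget


# Hypothetical figures: 300 spent by day 10 of a 30-day month, budget of 800.
projection = projected_month_end_spend(300.0, 10, 30)
print(projection)
print(is_over_budget(projection, 800.0))
```

A check like this, run part-way through the month, is what turns a surprise invoice into an early warning.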
Cloud Management
There are three concepts that can be grouped under this umbrella term. Each is important to understand, so I've counted each one as a separate fundamental.
Security
Cloud services allow you to have full control of the security in your cloud environment, but depending on the level of control you want, you could opt for different models of provision from your cloud provider (you might have heard of Infrastructure or Platform as a Service, i.e. IaaS or PaaS). This covers aspects including patches, maintenance and network control, amongst many others.
Governance
Governance makes creating standardised environments that meet common regulatory requirements pretty straightforward. You don't have to figure all of it out for yourself, because other people have probably done something for similar or identical regulations before you. Bonus (even if it doesn't sound like it): you can audit your environments against these regulations, which helps to monitor compliance.
Manageability
This covers both management of the cloud and management in the cloud. A very subtle but important distinction. When we talk about management of the cloud, we're thinking about how we can manage the resources that we have on the cloud, e.g. autoscaling, monitoring and even template-based deployment. Whereas when we talk about management in the cloud, we're looking at how we manage our interaction with cloud resources, e.g. through the portal, CLI or APIs.
That's it, that's 7 of the fundamental principles of cloud computing! There are more, of course there are more. But I think I've covered the main ones, the ones that will start us off in the world of cloud computing and from which we can build further knowledge.