The real cost of hesitating to migrate to the cloud
Regulations, compliance, security, long-term contracts, and even politics are common answers to “Why are you not deployed on a public cloud?” But the most common answer I keep hearing is “It’s just too expensive”. Is that so?
The words below are written with AWS terminology for convenience but are very much relevant to any public cloud provider.
“On-prem” resources is the term usually used for physical machines hosted outside a public cloud: on a company’s storage floor, in its basement, or in a remote server farm leased for long periods under unbreakable contracts. These machines require physical maintenance; they need both the space they live in and dedicated personnel to attend to them. Regardless of location, they also require management of resource allocation, monitoring, and error handling.
Migrating deployments to a public cloud can speed up processes, enhance workflows, improve productivity, and shorten product lead time. It will increase security, simplify infrastructure changes, and generally make engineers very, very happy.
Yes, but
“Running on a public cloud is not safe!”
While in the old days we were used to being physically close to our servers, taking care of them in person, and enjoying the general sense of safety that comes from seeing them with our own eyes, these days it’s quite the opposite: servers are safer living in a remote cloud provider’s farm, both physically and virtually. Not to mention the environmental implications of running your own infrastructure, which is much like fifty people driving fifty cars to work instead of sharing one bus.
“Sorry, it won’t work with compliance, there’s nothing to do.”
I won’t take you down the dark road of reading AWS’s risk and compliance whitepaper (which I have read, for certification reasons), but for most companies, matching AWS’s level of compliance would be very difficult. Unsure? Read the documents and contact their staff for legal documentation and standardization certifications to be on the safe side.
“I can’t take the risk of networking latency; it’s insanely different.”
Laziness can cost some $$; the amount of networking issues, routing mistakes, proxy component overloads, and general shit storms I’ve encountered working with on-prem resources is endless. I have never experienced noticeable latency working within a single availability zone on AWS, and noticeable cross-AZ problems are so rare they are negligible. Moreover, if you’re anxious about sub-ten-millisecond latency, you can use dedicated hardware, which also addresses compliance and specific certification requirements.
Enough excuses, let’s talk business
Resource caps, limitations, and scale
“a sunk cost is a cost that has already been incurred and cannot be recovered”
- [Wikipedia](https://en.wikipedia.org/wiki/Sunk_cost)
When working with a locally managed resource pool, scaling, both vertically and horizontally, has to be planned and is naturally capped. Bounded by that cap, companies purchase roughly a 90% resource buffer to handle the expected load, usually planned on a yearly basis; this means paying for and getting 100 servers where you only need 10 to do the work. **The overspend is huge!** It is also fed by this thing called a “budget”: you have a budget, you have to spend it… so you get MORE servers and drive utilization lower and lower each year.
The immediate repercussion is ~90% of compute power continually wasted, something operations teams notice easily. Usually, the buffer quickly fills up with all kinds of testing servers and exaggerated allocations of space and resources, because “it’s already there”. It doesn’t take a rocket scientist to see where this is going: the filled-up buffer no longer looks like waste, which creates yet another requirement for a buffer and a virtual “scale” that is expressed in further purchases.
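To make the overspend concrete, here is a back-of-the-envelope sketch in Python. The server counts and the yearly per-server cost are made-up numbers for illustration only; substitute your own figures from procurement and your provider’s pricing page.

```python
# A rough sketch of the overspend described above.
# All numbers are hypothetical placeholders; plug in your own.

servers_purchased = 100        # capacity bought up front to cover the yearly "buffer"
servers_needed_avg = 10        # servers actually doing useful work on average
cost_per_server_year = 3_000   # assumed yearly cost of one on-prem server
                               # (hardware, power, space, people)

on_prem_spend = servers_purchased * cost_per_server_year
useful_spend = servers_needed_avg * cost_per_server_year

print(f"Yearly on-prem spend: ${on_prem_spend:,}")
print(f"Spend on capacity actually used: ${useful_spend:,}")
print(f"Money parked in idle buffer: ${on_prem_spend - useful_spend:,} "
      f"({1 - useful_spend / on_prem_spend:.0%} of the bill)")
```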
Time waste
Usually, asking for an operational change with on-prem resources requires filing a ticket with the team in charge. The ticket takes time to process and handle, usually for a change that could have been made immediately with a single API call. Let’s take a minute and say I’m a developer working on a locally deployed k8s cluster in a local environment; I’ve added a new service and registered everything within the cluster, but I also need a DNS entry registered on the local servers so other developers and applications can interact with it. In an automated infrastructure where permissions are granted in a granular way, allowing an engineer to handle every aspect of their environment, this change can be made in milliseconds with a single API call to the DNS server, or better yet, done by the system itself without any manual intervention (e.g., External-DNS in k8s).
On a public cloud, such a setup is effortless to implement: the engineer would be part of a group that’s allowed to access the lab’s infrastructure and make changes to its internal DNS with no external intervention.
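As an illustration only, here is roughly what that single call could look like with boto3 against a Route53 private hosted zone. The zone ID and record names are placeholders, and the caller is assumed to hold an IAM role that allows route53:ChangeResourceRecordSets on that zone.

```python
# Minimal sketch: registering a DNS record for a new service with one API call.
# Assumes AWS Route53, boto3, and suitable IAM permissions; all names are placeholders.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE12345",  # placeholder private zone for the lab
    ChangeBatch={
        "Comment": "Expose the new service to other developers",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "my-new-service.lab.internal.",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "my-cluster-ingress.lab.internal."}
                ],
            },
        }],
    },
)
```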
Having to file a request with a local team for the change will usually take anywhere from an hour to two days, depending on the backlog of tickets and the availability of dedicated engineers. During that time, the developer has switched contexts multiple times, leaving unfinished work in progress.
It’s not that user permissions and access policies can’t be created on local resources; it’s just more complicated and almost never implemented. I don’t claim that no one does it, but I’ve never witnessed it, nor have I ever heard of any of our clients achieving it.
Let’s talk about scale
I assume I don’t have to elaborate on scalability or why it’s so important, but if you are into it and would like to read some stories, you can find me rambling about it here.
When discussing production systems serving clients in real time, scaling is key to survival, but it has to go both ways, and so does survival; when traffic peaks and systems are required to respond to 2x, 3x, or 4x the usual load, they should be designed to scale automatically while reporting the change. Failing to handle the load will most likely lead to frustrated customers, loss of revenue, and most importantly, loss of faith and confidence in the product. The solution is called *scaling out*, as in adding more of the same components to handle more load.
Peaks in demand are not a once-a-year or seasonal phenomenon; they can be a daily routine, changing across seasons and occasions. Manual handling quickly stops being a reasonable option when the product grows and a substantial load of customers starts using the system. If such a change in load isn’t regularly *scaled in* (i.e., reducing the number of service replicas to save resources, increase efficiency, and reflect the real system state), unused resources will keep living until scaled down, either manually by engineers responding to alerts or when the change is noticed during an occasional scan.
On a public cloud, implementing automatic scaling policies is easily accomplished and considered best practice in both directions: scaling out is usually limitless and not blocked by low resource availability, while scaling in will usually improve the cost and utilization of the system. On local infrastructure, however, adding extra resources to handle load is usually capped by the initial calculation of required infrastructure. That can work for planned events and traffic growth, but when traffic doubles or triples with little or no notice, production may be in danger.
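For a feel of what such a policy looks like, here is a minimal sketch of a target-tracking policy on an EC2 Auto Scaling group using boto3. It assumes the group already exists; the group name and target value are placeholders.

```python
# Minimal sketch of a scaling policy that handles both directions automatically.
# Assumes an existing EC2 Auto Scaling group; names and numbers are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-frontend",   # placeholder group name
    PolicyName="keep-cpu-around-60-percent",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        # Scales out when average CPU rises above the target and
        # scales back in when load drops, with no tickets involved.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```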
As a service
Using a public cloud has a lot more to it than just on-demand virtual machines; almost every cloud provider offers a wide range of managed services. Take, for example, the need to maintain a Kubernetes cluster, scale a MySQL cluster, or deploy a robust Redis cache. Working on-prem would most likely result in operations work for each of these. Allow me to assume that the operations team doesn’t have the expertise to build all three of the above with best practices and scale in mind. Not because they are careless or unskilled, but merely because they already have too much on their plate, have no automation around any of these, and cannot be world-class pros in every field. Moreover, these might be new technologies that the engineers need to learn first. Sure, a Chef cookbook or an Ansible playbook may be there to help, but what about robustness, scale, monitoring, security, etc.?
Having these as managed services makes the technology available to developers at the click of a button. Scalability will most likely be provided as a feature of the system, DNS and routing will be integrated with the rest of your deployed resources, and security will be managed like everything else, through IAM policies and roles.
Suddenly, clustering and scale are a no-brainer. Third-party technologies are consumed as a commodity, leaving the operations team to deal with what they’re actually interested in, keeping developers happy and the work easy.
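To give a feel for what “as a service” means in practice, here is a minimal, hypothetical sketch of requesting a managed MySQL instance from RDS with boto3. Every identifier and size below is a placeholder, and in real life the password would come from a secrets store rather than source code.

```python
# Minimal sketch: spinning up a managed MySQL instance instead of hand-building a cluster.
# Assumes AWS RDS and boto3; identifiers and sizes are placeholders.
import boto3

rds = boto3.client("rds")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",       # placeholder name
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=100,                   # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me-please",  # fetch from a secrets manager in real life
    MultiAZ=True,                           # managed failover comes as a flag
    BackupRetentionPeriod=7,                # managed backups, no cron jobs
)
```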
The common miscalculation of cloud costs
I get to see quite a lot of tech companies, varying in size, character, and stack, but those using on-premise resources tend to give the same answer when asked about the cost of a cloud migration. Putting aside answers like customers’ demands or legal regulations, which are usually relevant only to specific parts of the system, I keep hearing the same well-known story. Here’s a typical conversation:
Me: I keep hearing your complaints about lead time, operational tickets, and eternal waiting for resources, so why are you not considering a migration to the cloud, even a partial one?

Client: We had X-cloud’s representative here Y months ago. By the calculations we made after his visit, we’d be spending three times the current bill.

Me: Can you explain the calculation to me? Just generally, what was counted in? And before you answer: how did you count in wasted engineering time, context switches, and time spent on manual changes and on persuading operations to make a random change or add a random resource? Allow me to assume you didn’t?

Client: Well, it’s more complicated than you may think, but generally no, we did not address any of these.
As one can guess, no one calculates the above into their bills. Whether because it’s hard to measure or simply ignored, companies leave the fluency of work, the quality of production systems, and the pace of change out of the equation.
The reasons vary, sometimes defending their own past decision and sometimes simply forgetting about it, but the last time I asked I was shocked to hear “we can’t just get rid of our operations team”. I know, WOW. The thought of letting someone go, or, more easily, changing their responsibilities, made a company miscalculate cloud costs and make a decision affecting its entire 600-person R&D organization. Sounds crazy? It’s more common than you imagine.
What about happiness?
Another property that’s often ignored is developer happiness, and it doesn’t mean just keeping your code-generating machines well fed. Developers are professionals; they want to innovate and make great products. Making them wait in line for infrastructure to deploy their code on, or switch contexts every couple of hours while waiting for a networking change, or sometimes just for an unanswered question, can frustrate them, and frustration in the development process will most likely lead to deterioration in quality.
Try to locate those bottlenecks and the time wasted by developers. See where they feel pain in the process, and how you can make their experience of developing new features and ideas as fluent as possible. Help them eliminate any manual intervention or button clicking needed to get a new change deployed to development environments. Make sure they use an automated process and receive automated feedback from the system on whether their new code failed tests, deployed well, changed metrics for better or worse, or is simply deployed and ready for integration testing. Getting to that sweet spot will show you the true nature of your team and let their real potential out.
Do you know this joke where a tech lead tries to convince their R&D manager they should move to a public cloud?
The manager then thinks about it and replies, “Great idea, I’ll have a managers’ roundtable scheduled and we’ll discuss it”. They finish the roundtable agreeing to hold another meeting to discuss the technicalities of the migration; at the next session they set another one for the financial aspects, then the legal ones, and so it goes on and on. Adding it up, seven meetings of six managers, each two hours long, turn out to cost more than the planned POC together with six months of production cloud usage.
Hidden costs will always stay hidden to those who don’t wish to see them.
My name is Omer, and I am an engineer at ProdOps — a global consultancy that delivers software in a Reliable, Secure and Simple way by adopting the DevOps culture. Let me know your thoughts in the comments below, or connect with me directly on Twitter @0merxx. Clap if you liked it, it helps me focus my future writings.