Knowing what you want, having a vision, a guiding light if you will, isn’t always enough. Matters can quickly get out of hand in a team, especially one that is growing fast.
This is one of the reasons I decided to compile a list of “DevOps engineering principles”. It was important for me to create a basic structure, something that could be shared, discussed, and reinforced periodically.
Most of it came from mentors I had or glorious failures I experienced over the years. This is the idea:
- Take a stand and make a decision
- Every point can be discussed and debated
- Upon discussion and agreements, we use it as our guiding light
- Observe, assess, make adjustments, keep moving
Internal teams ask a high number of “DevOps questions” every day, whether from newly hired engineers or about instructions that have long been forgotten. By documenting things, you can both link to the document and gradually ingrain the habit of searching the docs before asking a question.
Our team tries to document everything it can:
- Whenever we experience a production failure or something goes wrong, such as a breaking change in an external library hitting us, or a provider introducing new limitations (cough DOCKER cough), we write it up as a post-mortem. Anyone reading it can understand what went wrong, even if they were not involved, and maybe learn something. To prevent a recurrence, a post-mortem should conclude with action items.
- There’s nothing like a good ol’ instruction page to explain something once and for all, whether it’s logging into a Docker registry or running localstack to mimic a serverless environment.
- Research - Being a DevOps team in a DevOps company involves a lot of research. Documented findings and lessons learned are easy to follow, understand, and revisit, for yourself and for others; equally important, documentation prevents rework and re-examination.
- How-tos - Oftentimes, we find ourselves explaining something very ‘simple’ in one or two sentences. Even then, a document can be very helpful in providing additional context. Having a “search-the-docs” culture and a habit of documenting everything has a compounding effect.
In particular, we use Notion for all things documentation. There’s no way to say whether it’s objectively “the best”. As readers and writers alike, we find it inviting, easy to use, and feature-rich.
Ask questions / always open a discussion
Experts working in their own bubbles, sharing information only sporadically, are missing the point. Creating a habit of discussion benefits everyone: the one asking may learn something they did not know or consider, the responder shares their expertise, and passive readers gain enriched knowledge “for free”.
A culture of “there are no stupid questions” is easy to declare and hard to practice, but the benefits are incredible. It’s remarkable how something so unmeasurable can be so valuable.
- Don’t be afraid to ask questions
- Don’t be afraid to challenge anything
- Are you bringing an innovative idea to the table? Come prepared to be challenged
- Create a discussion everywhere - Slack groups (not private chats), Notion comments, pull-request remarks. Is the discussion of interest to the team, but happening far from their eyes (e.g. in a GitHub PR)? Invite others to read and participate by linking it in the team’s channel
Think lean, speed and quality
The point here is not to do things quickly, but to create things that work quickly. Spend more time setting up processes, so that you can not only save time down the road, but also help others reap the rewards of your labor.
In the world of DevOps, this takes the form of:
- Lean packages - Zip packages uploaded to Lambda, for instance, can sometimes be compressed further, or unnecessarily packed with non-production content
- Lean containers - Can be based on slim base images (e.g. `:slim` tags), use fewer Docker layers, and utilize multi-stage builds to only carry essential production components
- Lean CI - The list of examples is endless, but steps can use cache from other steps, use lighter images (see lean containers), or run in parallel to save time and speed up the delivery process. Consequently, other jobs and applications get more resources, the build queue shortens, and less compute is wasted. The effect compounds.
- Utilizing cache - Whether it be between CI steps, Docker layers, query results stored in Redis, pre-built images and, of course, using a CDN. Cache is a driver of speed and efficiency; apply it wherever possible
- Before hacking your way through, do things in the native way. “Read the docs instead of writing another bash command” or “Search the knowledge base before spending a day creating your own environment” are two quotes we used recently. System features often include solutions for what most engineers need. There are too many reasons to list why “hacking as a standard” is counterproductive in large teams.
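The “lean containers” point above can be sketched as a multi-stage Dockerfile. This is a minimal illustration, not our actual build; the Node.js app, file names, and base images are assumptions:

```dockerfile
# Stage 1: build with the full toolchain (discarded from the final image)
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
# Build, then drop dev dependencies so only production deps are copied on
RUN npm run build && npm prune --omit=dev

# Stage 2: ship only what production needs, on a slim base
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/index.js"]
```

The final image carries only the built output and production dependencies; compilers, dev dependencies, and build tooling stay behind in the first stage.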
Security is a big one, but here’s a short list:
- MFA - For everything. The effects of this security layer are mind-blowing yet it is surprisingly easy to implement
- Secrets management - Too often overlooked when it shouldn’t be. Secrets belong in a dedicated secrets manager (e.g. AWS Secrets Manager or Vault): not in a Git repository, not in an encrypted text file, and not on the CI server (although that one is less severe).
- Run resources on private networks - In a DevOps team, this goes without saying. With developers constantly testing against the cloud (did I mention we’re a DevOps company?), it’s hard to educate, let alone enforce. Still, all resources should run privately and be accessed through a zero-trust proxy.
- Segregate environments - Good: separate VPCs. Better: unpeered networks. Best: separate accounts with different access groups
- Use whatever deployable tool you have in your toolkit to make attackers’ lives difficult, WAF being one example. Utilize zero trust, limit network access, create complex password policies, and enforce multi-factor authentication. Maintain your devs’ sanity while limiting attack vectors
- HA - Ensuring high availability is a principle in and of itself. When designing an architecture, building production (and staging) environments, or running log aggregation systems, HA should be considered. You are guaranteed to spend time on a failure if you let something slip under the radar. A recent real example: a self-managed ELK cluster serving the entire R&D department, hanging by a thread on a single AWS spot instance. Fortunately, we spotted the problem in time
- Cost efficiency - Terminology first: cost-efficient does not mean cost-reduction. Efficiency can be pursued in a variety of ways, such as utilizing VPC endpoints for private, quick, and cost-effective connectivity, or using spot instances for non-critical operations. A long-running job may be cheaper in a container than in a serverless function, or the opposite when the job runs in a few milliseconds and a container/instance isn’t a good option
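As a concrete companion to the MFA point above, here is a sketch of a well-known AWS IAM policy pattern that denies actions for sessions authenticated without MFA. It’s an illustration, not a drop-in policy; AWS’s documented version exempts a longer list of self-service MFA actions than shown here:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllWithoutMFA",
      "Effect": "Deny",
      "NotAction": [
        "iam:ChangePassword",
        "iam:CreateVirtualMFADevice",
        "iam:EnableMFADevice",
        "iam:ListMFADevices",
        "sts:GetSessionToken"
      ],
      "Resource": "*",
      "Condition": {
        "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" }
      }
    }
  ]
}
```

Attached to a group of human users, it effectively makes MFA mandatory: without it, nothing works except enrolling an MFA device.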
Everything as code
- Infrastructure as code. There are no words to describe how important this is. Just do it, as the saying goes.
- Meaningful commit messages - It is incredibly useful to have meaningful messages, for history searching, formatting, debugging, compiling change-logs, and so on.
- Use a Git system whenever possible. Updating a WordPress plugin? Upgrading the infrastructure? (It’s all code now, isn’t it?) Changing the permissions of a group? Updating a secret? USE GIT. The number of cases I’ve solved by comparing Git histories is beyond my ability to count. It’s not for nothing that Git was once compared to a blockchain. The Git repository holds (almost) the entire history of the project, so use it wisely!
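To make the “USE GIT” point concrete, here is a tiny, self-contained demo (the repo, file, and setting names are invented) of using `git log -S` to find the commit that changed a value:

```shell
set -e
# Build a throwaway repo with two commits
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

echo "max_retries: 3" > app.yaml
git add app.yaml && git commit -qm "chore: initial config"

echo "max_retries: 10" > app.yaml
git add app.yaml && git commit -qm "fix: raise retry limit after incident"

# -S finds commits that changed the number of occurrences of a string,
# pointing straight at the commit (and message) that introduced the value
git log -S "max_retries: 10" --oneline
```

With meaningful commit messages, the single line this prints answers both “when did this change?” and “why?” in one command.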
It’s kind of a “no-s**t-sherlock” thing, but still, it’s amazing how often backups are not available when they’re most needed. As such, I consider it relevant to list it as a “principle” even though it has been a common best practice since ancient times.
By automating backups, hours of work, pain, and sorrow can be saved. Most cloud platforms offer backups as part of their products, so you just need to turn them on. As an example, let’s look at AWS:
- RDS will enable backups and require a timeframe and a retention period
- S3 can have versioning turned on
- EC2 (unless used as cluster nodes) can be snapshotted
- EBS snapshots are just as straightforward

It goes on and on with ECS task-definition versions, Lambda function versions, DynamoDB tables, etc. All provide some form of data backup and restoration for point-in-time recovery. Don’t file this as “backlog” or “tech debt”, because by the time it’s needed, the Jira ticket will be of no use. In the army, there is a saying that “these instructions were written in blood” (to soften the analogy: systems experienced downtime). In other words, they teach a lesson learned from others’ unfortunate experiences, and for good reason.
Another AWS-specific tip, with counterparts on other platforms: AWS Backup can do the heavy lifting for most backup services.
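For the AWS examples above, switching backups on is often a one-liner each. The identifiers below are hypothetical; the flags are from the standard AWS CLI:

```shell
# Enable automated RDS backups with 7-day retention and a backup window
aws rds modify-db-instance \
  --db-instance-identifier my-db \
  --backup-retention-period 7 \
  --preferred-backup-window 03:00-04:00 \
  --apply-immediately

# Turn on S3 object versioning
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# Snapshot an EBS volume
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "nightly snapshot"
```

In practice you’d schedule these (or let AWS Backup plans do it) rather than run them by hand.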
One extremely important point to note here is that backups need to be regularly tested! It’s never enough to “know they’re there”. When the fit hits the shan 😉 they may not actually work, or, just as bad, nobody knows how to apply them.
Build to last, self-heal, and think of future maintainers
For one minute, imagine fighting yet another fire caused by a hacky script or a poor choice of technology, grunting quietly and cursing the engineer who left the legacy code behind. With that in mind, try being the source of quality and standards rather than someone else’s future headache. I honestly believe this has a compounding effect: others learn and apply, and engineers’ names are kept in a positive light long after they’ve left.
Change management and reviews for everything
Keep changes managed. Those of you who have read The Phoenix Project will likely picture sticky notes on walls describing a practically nonexistent change management system. With a change management process in place, preferably with a chain of approval and a quality review process, things don’t get missed. Even more importantly, the engineers evolve, and production tends not to burst into flames. There’s nothing like a review for sharing ideas, concepts, conventions, and standards! It can be applied to almost anything: from code to infrastructure to blog posts to landing pages.
Staging (and other pre-production) environments are a well-known way to handle and test changes, yet they are too often ignored when it comes to operations-related systems. Exactly these systems should begin with a staging environment. Yes, they are usually not customer-facing, but they serve everyone else and ensure they are safe, healthy, and functional. CI servers, alert systems, and container orchestrators are all essential to the production environment. Even when consumed as a service, managed by a company you trust, they still suffer from many of the known pain points: versions get upgraded, changes break, retention runs out, and capacities get reduced. “Staging first” argues that every change should be staged in a separate environment before it gets to production. These environments should also be separated by accounts and user logins, for security, but also to contain human error and manage blast radius.
This is by no means a comprehensive list. These are principles we developed as a team, agreed on, and adopted. You are invited to challenge them, add or suggest changes, and create a fruitful discussion around them. Let’s face it, we’re all here to learn, otherwise what’s the point? ;)
Thank you for reading 🖤
Feel free to reach out with questions or comments.