The Agile Operations methodology
Note: This is the culmination of years of work managing and optimizing the practice of technical operations groups/DevOps at scale. These are proven tactics and techniques that can be applied across any technical value delivery organization of any size to increase efficiency, satisfaction, and enterprise agility. While I had hoped to write a book on this eventually (see the outline for what that would have looked like), I do not have the time to do so, and yet these topics are extremely relevant, especially as the cloud native revolution takes hold. This is not a replacement for DevOps, but rather the overarching framework that DevOps is a part of.
Just Enough Design Initially (JEDI) -- the architectural planning and spikes necessary to complete the iteration (not the feature) with minimal rework
Tackle the easy problems first
Assign a feature lead who will assemble the pairs necessary per iteration to estimate and complete the stories (may include customer developers!)
Runs like Scrum-Ban for the 2-week iterations
Prizes Agile principles like small batches, frequent delivery, and active participation of customers in the process; the PO manages this relationship
Teams are cross-functional and purpose-driven
Teams can become customers as well, meaning an IDD/FDD team may need an infrastructure feature in order to deliver, and this may constitute work for another team. We don't necessarily care who the customer is, only that we treat them all the same and with the same level of care
Vector analysis:
| VECTOR | DESCRIPTION |
| --- | --- |
| Supportability | The support model in place around the infrastructure, including escalations, off-hours support, and the level of continual operations support; this also includes instrumentation and monitoring. |
| Operability | A highly operable infrastructure means that supporting it during normal operations (managing services, for example) is extremely straightforward. Troubleshooting is easy, configurations are completely managed, and operational tasks can be handled by almost anyone with the requisite permissions. |
| Recoverability | The ease with which you can recover from a failure or complete loss of the infrastructure. A high level of recoverability means zero downtime and auto-healing. |
| Documentation | This doesn't necessarily mean documents in the traditional sense. It could be comments in code, README files in GitHub, spreadsheets, whatever. A high level of documentation means it's easy to parse, relevant, and maintained. |
| Architectural design maturity | Architectural maturity means the underlying components and design are cohesive and maintained by either architecture or people acting as architects. Good architecture encompasses related systems and means that downstream/upstream providers are also stable and mature. |
| Currency | This represents how up to date the system is at the patch, security, and revision level. In the case of home-grown software, it would have an active defect management process. |
| Usability | From the developer and implementer perspective, usability means you can interface with it easily and reliably. It doesn't have to be rocket science, and the results of use are generally well understood. Databases like Postgres usually have high usability. |
| Automation and self-service | This represents the programmatic solution to operations work. This is the domain typically thought of as DevOps, where deployments, provisioning, and new service generation are all done programmatically. Developer teams and others can do things like restart services or upgrade their own stacks without any intervention. |
| Deployability | The speed and ease with which the infrastructure can be deployed from the development environment to production, including all requisite testing (security, regression, compatibility, performance, etc.). |
| Stability | How often does the infrastructure fail? High stability means almost never, and ideally without causing customer-facing service interruptions. |
| Community support/adoption | How much support and adoption is there in the world outside of the company for this infrastructure? If it is open source, is the community actively developing and patching it? If it's a paid product, does the vendor have a wide customer base? Are they still going to be in business in 2 years? |
| Maturity | How long has this infrastructure been around? If it is developed in-house, has it been in place for days? Months? Years? |
| Ease of use (inverse of complexity) | How hard is it to work with and understand? Nagios is an example of a highly complex solution, whereas Redis has high ease of use. |
| Ubiquity | How many installations or different uses of this technology are there in the infrastructure? MongoDB is an example of an infrastructure component with high ubiquity. |
| Happiness | This is the human factor of the infrastructure. Do people like it? Do they enjoy working with it? Docker is typically a high-happiness piece of infrastructure. |
Aggregate analysis:
| RATING | DESCRIPTION |
| --- | --- |
| 0 | No one knows anything about it, except maybe one person who no longer works at the company. Maybe this is technology that is no longer supported, or that was purchased and mothballed. It could be internally developed and poorly designed or not maintained. This is the classic thing no one will touch with a 10-foot pole. Or, it is something completely brand-new to the company and is being championed by an individual or occasionally a team. There's little or no recovery, fault tolerance, backup, security, or documentation. In the case of new technologies, these are probably not in production yet. |
| 1 | This is something that has gotten past the initial research phase, and there's some intent that it could go into production. If it's already in production, chances are one or two people know something about it and are constantly getting derailed from their normal work to deal with it, especially when things go wrong or someone needs to use it. Recoverability is largely manual and sketchy. Security is low and probably not compliant. Patch levels may be out of date, and people may be afraid to do too much with it because it is finicky and/or unstable. Many people have been "burned" by this. Maybe a replacement is on the way, maybe not. There is some documentation, but not much; it's mostly tribal knowledge or what you can dig up on forums or vendor sites. Chances are there is no SLA, or if there is, it gets violated. |
| 2 | This is something a few people are experts in, and it has the feel of general adoption. There's some documentation/recovery capability, but most of what happens with it is done manually. There is little or no self-service capability and minimal automation. Some parts of this may be under configuration management control, but it may not be 100% up to date. It may have dependencies on other barely supported technologies and is frequently error-prone. This may be a frequent headache for on-call support, and may trigger customer-facing outages. It may be more current with patch levels, but there's not a real process around that. Security holes exist in it, and most people know it. It may have more than one installation in the company, but they would likely be inconsistently done. |
| 3 | Most production technologies fall into this rating. This is something that can largely be recovered and has reasonable fault tolerance. A lot of it is manual work, but there are at least scripts, and a good amount of it is under configuration management control. Patch levels may be only a few releases behind, and there are people who patch it with some regularity. SLAs exist for this and are usually met. People are not eager to use this infrastructure, but do so because it is considered part of the core stack. Troubleshooting and supporting it is usually okay, but can occasionally be extremely challenging due to unusual use cases or user error. It is not super straightforward to upgrade or manage, so these activities are usually risky and prone to rework. |
| 4 | This infrastructure has been around for a while. People know how to use it, and frequently do. It's extremely stable and reliable and kept as current as is reasonable without triggering outages. When it fails, 90% of the time or more it is either fault-tolerant or not customer-facing. This infrastructure requires very little after-hours support, is integrated into build pipelines, and probably has extremely high test coverage. There are plenty of people throughout the organization who are experts on it. It's considered a core piece of technology and is prized for its ease of use, ubiquity, and lack of operational/interfacing complexity. Community support is wide, and it is relatively easy to hire people who know how to work with it and support it. It may have a group of people who are dedicated to it in operations, and possibly engineering. It's on the architectural roadmap. |
| 5 | This is something most people know how to use, upgrade, support, and recover. 90%+ of all use cases are automated or have easy-to-use self-service tooling. It is able to be spun up and destroyed on demand for testing and validation purposes. It has most likely been around for a year or more in the infrastructure. There may be specific infrastructure engineers (e.g. DBAs) associated with it. It is secure, compliant, and scalable. It is a low-incident generator. It operates so smoothly that people don't have to think about it. |
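The vector ratings can be rolled up into a rough aggregate band on the same 0-5 scale. The text does not prescribe a formula for combining them, so the sketch below is just one illustrative approach: score each vector 0-5 and take the floor of the average. The vector names and example scores here are assumptions for demonstration, not part of the methodology.

```python
from statistics import mean

# The fifteen vectors from the table above, as identifiers.
VECTORS = [
    "supportability", "operability", "recoverability", "documentation",
    "architectural_maturity", "currency", "usability", "automation",
    "deployability", "stability", "community_support", "maturity",
    "ease_of_use", "ubiquity", "happiness",
]

def aggregate_rating(scores: dict) -> int:
    """Collapse per-vector 0-5 scores into a 0-5 aggregate band.

    Uses a plain average, rounded down -- an assumed formula;
    the methodology leaves the roll-up to the assessor's judgment.
    """
    missing = [v for v in VECTORS if v not in scores]
    if missing:
        raise ValueError(f"unrated vectors: {missing}")
    return int(mean(scores[v] for v in VECTORS))

# Hypothetical example: a serviceable core system with weak
# automation and documentation drags the aggregate down to a 2.
scores = {v: 3 for v in VECTORS}
scores["automation"] = 1
scores["documentation"] = 2
print(aggregate_rating(scores))
```

A weighted average (e.g. weighting stability and recoverability more heavily for customer-facing systems) would be an easy variation if some vectors matter more to your organization.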