"DevOps" is more about Customer Feedback and Quick Learning than Culture/Process/Tools

Are you really "DevOps"?

You have spent years going through an amazing transformation. You claim that you are now officially "DevOps". But, what now? Your customers don't really care if you are a DevOps shop/factory; they care about how your products --built from that shop/factory-- add value to their lives. Common sense stuff, right? Well, unfortunately, no. You'd be surprised how many people still don't quite get what the true spirit of DevOps is. You often hear things like:

  • We are a team of developers doing operations - we are DevOps!
  • We are a Dev team continually deploying to production - we are DevOps!
  • We are a team where Ops and Dev love each other - we are DevOps!
  • We are a Dev team with no QA support; we are responsible for both manual and automated tests (unit, functional, perf, integrations, etc) - we are DevOps!
  • We are a team of operations engineers doing software development - we are DevOps!

Sound familiar? While none of the above are false claims, they fail to draw a complete picture of what DevOps is truly about. The first thing that comes to my mind when I hear this is, "Well, all this is great, but now what? Who are you doing all this for? All these changes to your culture, process, and tools are for whom? Where do your customers fit into all this? How are all these changes impacting your customers? Most importantly, what are you learning about your customers?"

Customer-focused Software Engineers - the missing piece of the "DevOps" puzzle

Most people don't realize that software engineering is so much more than just writing clean, quality code using TDD, with close to 100% test coverage so you can do CD; it is so much more than moving from a monolith to microservices or containerizing your applications or orchestrating Docker containers with Kubernetes or moving to a serverless architecture (well, all the shiny things out there).
The part that is often not talked about is how all this technical excellence improves the lives of your customers. This is where the customer-focused software engineers come into the picture. They exhibit the below traits and behaviors:
  • They are intrapreneurs; they are relentlessly focused on making the lives of their customers (both internal and external) better through the products they build and the services they offer.
  • They don't indulge in mindless tasks; they are mindful and self-aware (always have the customer in mind). They have inner peace. They are inherently mission-driven and find meaning in everything they do --even mundane work. They are all about the problem they are trying to solve.
  • They aren't afraid to admit their failures; they are always learning from their mistakes. They take steps --without being asked-- to prevent a defect from ever reaching production again.
  • They aren't afraid to take ownership of their products and operate them; they aren't afraid to be on call for the code they wrote; they don't shy away from learning new skills and are always looking to build new tools to better understand how customers are experiencing the products they built.
  • They strive to get customer feedback quickly and use it in an iterative manner to constantly improve their products. They love staying close to the customer; they vehemently oppose any process or a change that distances them from their customers; they dislike layers of bureaucracy that don't add value.

This customer-focused mindset --and the behaviors that follow-- is essential for a true DevOps transformation. Customers come first; DevOps comes after. DevOps is a means to an end; not an end in itself. I have been fortunate to work with such engineers during my career. It is a true joy. You learn so much from them.

DevOps Patterns & Anti-Patterns - From a customer's perspective.

This is a good segue to talk about some of the misconceptions about reliability and velocity, and also to highlight some of the DevOps trends/patterns and anti-patterns. Most of them are insights I gathered from operating some of the world's most popular web destinations and games.

#1: Uptime is overrated

Customers don't care about five-nines reliability; they care about five-nines customer service.

I tweeted about this some time back. I can't help but talk about how people just mindlessly think they --and their customers-- need/expect extreme availability/uptime, like five-nines uptime (which equates to roughly 5 minutes of downtime per year), without taking into account what it would actually take to get there or what the customers really want.
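To put numbers on the "nines", here is a minimal sketch of the arithmetic. The function name is made up for illustration, and it assumes a 365-day year and that every minute of the year counts toward availability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime permitted at an availability of `nines` nines.

    Hypothetical helper: 3 nines means 99.9%, 5 nines means 99.999%.
    """
    # The unavailable fraction is 10^-nines, so divide the year by 10^nines.
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Each extra nine cuts the allowance by a factor of ten: roughly 8.8 hours a year at three nines, under an hour at four, and about five minutes at five.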
There is a reason why customers don't expect or care about extreme reliability: most of the devices that are used to access a service or product over the internet don't have five-nines availability themselves. I mean, how often do you have issues with your ISP at home? Your router? Your wifi? Your cell carrier? Your smartphone? A lot, right? Given this fact, there is no point in trying to make your product more available than any of those devices. As customers and users, we are always accustomed to a certain amount of downtime with almost everything we use. We just live with it (or have learned to live with it).
Of course, too much downtime is not good. To know what the right level is, you need to truly understand what your tolerance for risk is; what it is that your customers really want; how they react to downtime; etc. You need the right tools to get accurate and quick feedback. Based on that data, you need to decide how much you can trade off new features against reliability. A good way to do this is to establish budgets: error budgets; uptime budgets. The formula is simple: go over? slow down. under? ship....ship...ship.
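The budget rule can be sketched in a few lines. This is only an illustration: the class and method names are hypothetical, and the 99.9% target and 30-day window are assumed values, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over one window."""

    slo: float           # target availability, e.g. 0.999 (assumed value)
    window_minutes: int  # length of the measurement window in minutes

    @property
    def budget_minutes(self) -> float:
        # Downtime the target tolerates within the window.
        return self.window_minutes * (1 - self.slo)

    def remaining(self, downtime_minutes: float) -> float:
        # Negative once the budget has been overspent.
        return self.budget_minutes - downtime_minutes

    def decision(self, downtime_minutes: float) -> str:
        # Over budget: slow down. Under budget: keep shipping.
        return "ship" if self.remaining(downtime_minutes) > 0 else "slow down"


if __name__ == "__main__":
    budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)
    print(f"budget: {budget.budget_minutes:.1f} minutes")
    print(budget.decision(downtime_minutes=10))
    print(budget.decision(downtime_minutes=60))
```

The useful part is less the arithmetic than the agreement behind it: product and engineering settle on the target number up front, so the ship-or-slow-down call is not renegotiated during every incident.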
The reality these days is that you can get to a few nines of availability without having to do much. And, maybe, that is all you need to do (at least initially). With the democratization of distributed systems and the rise in adoption of container orchestration systems like Kubernetes (K8s), which --among many other things-- give you auto-remediation, auto-scaling, load balancing, service discovery, etc, natively, you get to a few "nines" right out of the box. Every "nine" after that is not cheap (certainly not free; can get very expensive).
You really need to take a hard look and try to truly understand if reliability and performance are your strategic differentiators and make the right strategic choices. Remember, sometimes, too much uptime (yes, you heard it right) can actually work against you.
Finally, while on this topic, uptime is everyone's responsibility; not just Ops'. So is incident/outage management; so are postmortems. If you are an ops engineer and you ask product what they care about and get the answer, "make sure product is up all the time", it is your responsibility to ask why; ask why it is that they need your product to be up "all" the time; ask what they mean by "all"; ask for a number. Compartmentalizing ops as the sole owners of uptime is so wrong.

#2: Velocity is overrated; customer feedback is underrated.

Your customers don't care about how many times a day you ship; they care about how quickly you can listen to them and learn from the feedback.
Like uptime, most people assume that if they have fast release cycles, their job is done. You are shipping multiple times a day; you have reduced your deployment time from days to hours; you have increased your release frequency from weeks to days or even hours. This is all great. I am not saying this is bad. The point is that fast release cycles are, again, table stakes (these days). The problem is, most are not as focused on what happens after they have released something. How are the customers reacting? What is the feedback? Are you getting that feedback quickly? Directly? Do you have the right tools to measure feedback accurately? What actions are you taking based on the feedback? In other words, what have you learned?
As you know, most startups fail; most features are never used by customers; most tools are never adopted. This is not because of tech, it is because they failed to implement fast feedback cycles; they failed to learn from past mistakes and failures. Look, it is not enough to just keep shipping mindlessly; it is important that you are also focusing on what you are learning after you have shipped.
We often forget why we even deploy multiple times to prod and the real value it provides; we often forget why we even did that massive initiative that took months/years to complete, so we are able to continuously deploy to production. If you --as an engineer-- are not interested in understanding and learning how your change impacts your customers, it doesn't really matter if that change goes live in a day or a month or a year (after you commit, that is).
It is your responsibility as a software developer to be curious and inquisitive and get rapid feedback about a feature or a change you shipped to production. Velocity --how fast you ship-- is not enough; it is critical to have fast feedback loops. Have you ever wondered what kind of an impact that important change you made months ago has had on your customer engagement? Do you care about that? Do you think about that stuff?
It is important to not build something your customers don't want. Fast feedback loops are an effective way to ensure that outcome. Make sure to invalidate your hypothesis asap ("fail fast").

#3: Democratization of Operations - Why everyone is Ops & everyone is a Dev

If you want to develop a customer-focused mindset, this cultural shift (everyone is ops; everyone is a dev) is critical. Most of these trends we have been seeing have been the direct result of the need to get feedback quickly and directly from the customers, so engineers can iterate on that feedback quickly and ship better products. They are also great trends to try and emulate so you can get engineers closer to your customers.

Ops is moving up the Stack

Basically, with the rise of Cloud Computing, (Docker) Containers, Immutable Deployments, Kubernetes, Serverless, Istio, Envoy (for service mesh), Microservices, and the general democratization of distributed systems, the infrastructure space is undergoing a major transformation and a (paradigm) shift. Business logic --by itself-- is no longer the differentiator when it comes to creating value and innovating.
Given the shift, the core software teams building the apps and the business logic truly need the whole ecosystem to be successful and to be able to rapidly innovate at scale. So, it is even more critical that they also learn how to operate their complex distributed system at scale; otherwise, they will become irrelevant sooner rather than later. If you are a dev team and have to call a different team each time you are tasked with analyzing a distributed trace, or running a packet capture, while debugging a complex production defect, that strategy will not only slow you down, it will also dramatically increase your TTR, and thereby frustrate customers and ultimately hurt your business.
With all the abstractions (most of which leak), the operational complexity in running these distributed applications is also increasing proportionally (especially at scale). This is where, typically, the ops experts come in and add value (they are not needed up until then). They make the economics of scale work. They are truly moving up the stack; they have to; otherwise, they will become irrelevant sooner rather than later. The trend is inevitable given the need to democratize operations.
Ops --as a profession/career-- has come a long way in the past decade: from racking/stacking servers (back in the late nineties) to tweaking the kernel to running packet traces to config management. These days, operations engineers are focused on building infrastructure, platforms, and container orchestration; they are focused on building observability stacks. Some are also focused on how apps are built, tested, deployed, provisioned (test infrastructure and frameworks). Some are more focused on middleware and the edge/ingress. Some are focused on network orchestration. Some on providing database as a service. The list goes on.

Dev is moving down the Stack

The corollary of the previous trend. No one really hires ops folks early on (at least not until series B or C); you just don't need them because --as a business-- you just don't need the expertise to solve operational problems at scale (reliability, perf, CD, etc). Your software developers (with an eye for Ops) do most of the initial scaling and make the product more operable. With easy access to on-demand, scalable, billable compute clouds; with the rise of container orchestration systems and the democratization of distributed systems and other abstractions that hide the complexity of the underlying systems, it is not hard to scale to tens (if not hundreds) of millions of users, using brute-force, off-the-shelf solutions. In fact, hiring ops folks (at this early a stage) may just be a distraction and a self-inflicted problem.
This trend is real and happening not just at smaller companies, but at larger enterprises as well. In fact, at Yahoo --which has over a billion users-- I was the primary driver behind launching Yahoo's Daily Fantasy back in 2015 in a "DevOps" model (you wrote it; you own/run it). This was a radical departure from how software was traditionally operated at scale. The model was simple: the dev team was responsible for almost all aspects of the development lifecycle: from design to writing code to writing tests to going through multiple architecture senior tech council discussions to paranoids' security reviews. They were also responsible for testing and deploying the code they wrote (using CD). And, finally, for operating it and scaling it: things like monitoring, on call, capacity, postmortems, etc. Basically, they are their own ops.
The ops teams, on the other hand, helped with the product launch with areas like documentation; understanding of things like DNS, CDN, infrastructure provisioning, monitoring, load balancing. Also, in the understanding of how all the components are interconnected in a complex distributed system. With writing playbooks, how software is deployed in a continuous fashion multiple times a day, how to roll back, etc. But the ops team was mostly hands-off after that. They still provide consulting and expert services periodically and are the go-to POC for things like tracing, monitoring, on call, capacity, infrastructure, etc, but that's pretty much the extent of the arrangement or the contract.
The beauty of such a model is that it encourages ownership, holds people accountable and has led to great positive outcomes. Most importantly, it brings engineers closer to their customers: has gotten software engineers closer to external customers/users; has gotten operations engineers closer to internal customers (i.e. developers).
Engineers on that team really care about customers. Just to give an example, an underage customer once tried to register to play daily fantasy; immediately, an alert triggered (an HTTP 4XX) and went directly to the dev on call, who responded right away, verified that the system did what it was supposed to do (i.e. did not allow the person to register), created a follow-up ticket to make the alert more actionable (no need to wake someone up at 3 am if the alert is not actionable), and went back to bed. Awesome, right? Can't get better than that? I have in fact been talking about this transformation at several conferences with some good feedback.

#4: DevOps is table stakes

I don't think DevOps is a strategic differentiator anymore. DevOps is table stakes these days; it has been so for a while now. A lot of enterprises have been doing things like CI/CD (a big part of "DevOps") since 2007 or earlier (that is more than a decade ago). If you are one of those still behind, it's time to do some serious catching up, so you can focus on the bigger, better problems.
The strategic differentiator will be how obsessed you are with your customers (both internal and external). The obsession would mean that, as an engineer, you would focus on things that matter the most to your customer. That mindset will lead you to develop the skills and the tools you need to find innovative ways to quickly get feedback from your customers and see what you can learn from it (quickly). Ultimately, you will become a well-rounded engineer and an intrapreneur; someone who not only takes pride in writing clean code of high quality but is also in love with his/her customers.

DevOps Anti-Patterns

Can't end without listing some anti-patterns. When I hear someone mouth them, it drives me nuts.
  • "I want my software developers to focus on writing features; I don't want them to be doing low-level work like responding to alerts, running incidents, etc. Let me insert 3 layers in between. Those layers will shield all feedback from production; they will prevent the devs from waking up at 3 am"
  • "I want my software developers to be productive; I don't want them to be working on things --like writing tests-- that will delay my launch. I want them to focus on writing the business logic and shipping features. Let me hire a team who will go and add test coverage later. Let me hire a team who will test the heck out of every release; who will be the manual gates to every release"
  • "I don't want to waste my software developers' time by having them focus on failed tests. I don't want them to be dealing with failed deployments. Heck, I don't want them to be dealing with deployments at all. Let me hire a team that will do the build, test, deploy, rollback, etc"
  • "Can we get that team to handle lower-level, routine tasks for us? So, we can focus on things that are important."
  • "I don't want my software developers dealing with customers, or have them sit in user feedback sessions. That is the responsibility of the product. The sprint planning is done; they have enough on their plate".
  • "I don't want my software developers to ship anything (even run an experiment) without going through 3-layers of bureaucracy."

If you come across such anti-patterns --most of which move your developers farther away from your customers and result in long feedback loops-- you know what to do. It's important though, to be honest with yourself; be selfless; do some reflection, and then, do what you think is right.
Good luck in your journey building Intrapreneurs within your enterprise!
Would love to hear your feedback and thoughts!

