We often hear about what DevOps is, or about which tools will help you achieve DevOps in your organization (whatever that means). We know of Terraform and CloudFormation, but we rarely see a definition of the principles behind how our teams organize their work.
At Curve, I was hired specifically to create and structure the company’s SRE/DevOps team. In this article, instead of the usual technical deep dive, I’d like to share some of the inspiring principles of the DevOps culture and how they were adapted to define the work in a startup.
As engineers, we constantly battle the internal need to “build stuff” against the hard necessities of our days, where we spend most of our time distracted by colleagues who need our help or by notifications from a plethora of monitoring tools (OpsGenie, CloudWatch). At Curve, to handle this flow of events, we use the role of the floating engineer: one of us handles all the interruptions for a week, which allows the rest of the team to focus on productive work.
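A weekly rotation like this is easy to compute rather than maintain by hand. The following is only a minimal sketch, assuming a fixed team list and a rotation that advances every seven days from a chosen start date; the function name, team members and dates are hypothetical and not taken from our actual tooling.

```python
from datetime import date
from typing import Sequence


def floating_engineer(team: Sequence[str], rotation_start: date, day: date) -> str:
    """Return who handles interruptions during the week containing `day`."""
    if not team:
        raise ValueError("team must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    # Whole weeks elapsed since the rotation began; each week moves to the next person.
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


# Hypothetical team and start date (a Monday), for illustration only.
team = ["ana", "ben", "chloe"]
start = date(2024, 1, 1)
# 10 January falls in the second week, so the second person is on duty.
print(floating_engineer(team, start, date(2024, 1, 10)))
```

Counting weeks from a fixed start date, rather than using calendar week numbers, keeps the rotation even across year boundaries.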
We also use multiple “log levels” (info, warning and critical) for most of the notifications coming from our system. Info is very noisy and goes unnoticed most of the time, but we found it particularly useful when, after an alert, we can review the preceding low-importance events. Critical events, instead, follow the strict rule that “every alert must be actionable”: we don’t want to be notified unless absolutely necessary. As a result, every alarm automatically creates a task in JIRA that then needs to be worked on, either by fixing the bug or by modifying the alert.
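To make the routing idea concrete, here is a rough sketch of how events could be dispatched by level. It is an illustration under assumptions, not our production setup: `AlertRouter`, `Event` and the `create_ticket` hook are hypothetical names, the hook stands in for whatever integration opens the JIRA task, and the handling of warnings (notify without a ticket) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Level(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class Event:
    level: Level
    source: str
    message: str


@dataclass
class AlertRouter:
    # Hypothetical hook: in practice this would call the ticketing system.
    create_ticket: Callable[[Event], None]
    history: List[Event] = field(default_factory=list)
    notified: List[Event] = field(default_factory=list)

    def handle(self, event: Event) -> None:
        # Every event is recorded so low-importance ones can be reviewed later.
        self.history.append(event)
        if event.level is Level.INFO:
            return  # recorded only, nobody is interrupted
        # Assumption: warnings notify the team but do not open a ticket.
        self.notified.append(event)
        if event.level is Level.CRITICAL:
            # "Every alert must be actionable": a critical event always yields a task.
            self.create_ticket(event)

    def recent_info(self, count: int) -> List[Event]:
        """Return up to `count` of the most recent info events, oldest first."""
        infos = [e for e in self.history if e.level is Level.INFO]
        return infos[-count:] if count > 0 else []


# Usage with an in-memory stand-in for the ticketing system.
tickets: List[str] = []
router = AlertRouter(create_ticket=lambda e: tickets.append(e.message))
router.handle(Event(Level.INFO, "api", "deploy started"))
router.handle(Event(Level.CRITICAL, "api", "error rate above threshold"))
print(tickets)
print([e.message for e in router.recent_info(5)])
```

Keeping the ticket creation behind a single hook also makes the second half of the rule easy to follow: if a ticket turns out not to be actionable, the fix is to change the alert, not to ignore the ticket.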
- Minimize work in progress
This is, basically, the other side of the same problem. Many traditional operations teams have a lot of projects in flight at the same time. Often they are even working for multiple dev teams, which results in competition for a scarce resource. We use Kanban boards with an upper bound on “work in progress” to keep it low and to ensure priorities are clear for everyone. These boards also clearly show how long a task has been in progress. Any work beyond the limit is pushed back or reprioritized.
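The two properties described above, a hard cap on tasks in progress and a visible age for each of them, can be sketched in a few lines. This is a simplified model for illustration only; the `Board` class and its method names are hypothetical and do not correspond to any particular Kanban tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


class WipLimitReached(Exception):
    """Raised when starting a task would exceed the work-in-progress limit."""


@dataclass
class Board:
    wip_limit: int
    backlog: List[str] = field(default_factory=list)
    # Maps each in-progress task to the day it was started.
    in_progress: Dict[str, date] = field(default_factory=dict)

    def start(self, task: str, today: date) -> None:
        # Refuse new work once the limit is hit; the task stays in the backlog.
        if len(self.in_progress) >= self.wip_limit:
            raise WipLimitReached(f"cannot start {task!r}: limit is {self.wip_limit}")
        self.backlog.remove(task)
        self.in_progress[task] = today

    def finish(self, task: str) -> None:
        del self.in_progress[task]

    def age_in_days(self, task: str, today: date) -> int:
        """How long a task has been in progress."""
        return (today - self.in_progress[task]).days


# Hypothetical tasks, for illustration only.
board = Board(wip_limit=2, backlog=["rotate keys", "upgrade cluster", "tune alerts"])
board.start("rotate keys", date(2024, 1, 1))
board.start("upgrade cluster", date(2024, 1, 2))
try:
    board.start("tune alerts", date(2024, 1, 3))
except WipLimitReached as err:
    print(err)  # the third task is pushed back until something is finished
```

Raising an error instead of silently queueing the task forces an explicit decision: finish something first, or reprioritize.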
- Remove invisible work
An operations team is usually “pestered” by a lot of “small” work requests: “Can you check this configuration parameter? It will only take you 5 seconds.” The result is a lot of small tasks that not only force us to lose context but also end up taking far longer than expected, with a measurable impact on productivity. A lot of people will be skeptical about this, but have you ever tried this simple test? “Keep Slack closed for a day, open it at 5 pm and take a look at how much time you’ve just spent replying to all the messages.”
We should defend ourselves and our jobs from this type of interruption as much as possible: always ask for a ticket to be created, even for small tasks. This will be difficult at the beginning, but it will slowly become part of the company’s process.
- Create short feedback loops
In the “traditional world”, operations would be a team by itself with little or no day-to-day interaction with the development teams. Here, instead, each member of the SRE/DevOps team is part of a “development squad” and joins most of its stand-up meetings, and their work is counted in the time “estimated” to put a feature into production: we contribute to the squad’s velocity rather than being outsiders. This gives the company as a whole faster feedback loops, breaks knowledge silos and gives us a fresh “product” perspective.
- Do standups
We all know the benefits of structuring teams in a cross-functional way. It’s a great idea: it ensures velocity of execution and breaks down rigid structures. However, I believe that this pattern, taken “as is”, is better suited to bigger companies where procedures have already been standardized. In a smaller or younger company, where a clear way of working is still to be set, ensuring a collective behavior becomes of utmost importance.
For this reason, on top of the squad standups, we also do a DevOps daily meeting. Having an internal standup ensures knowledge sharing and essential coordination in the team. This can clearly become time-consuming for everyone, so, to protect ourselves, we use a ten-minute “hourglass” to timebox the meetings. It is a strict deadline: if the meeting runs longer, we leave the room and, if really needed, set up dedicated catch-up sessions later on.
And more… This article continues here!
The inspiration for this article came from a few books that I’d suggest everyone read:
Kim, G., Humble, J., Debois, P. and Willis, J. (n.d.). The DevOps Handbook.
Kim, G., Behr, K. and Spafford, G. (n.d.). The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win.
Beyer, B., Jones, C., Petoff, J. and Murphy, N. (n.d.). Site Reliability Engineering.