Site Reliability Engineer – In Search of a Unicorn

At Curve, we’re rolling on the “Great Fintech Adventure”™ of revolutionizing the way in which you spend and manage your money. At its very core, the company is a blend of Finance and Engineering: two disciplines that come together to deliver to your doorstep the Curve card that you know and love.

Engineering is working hard these days to support and enable the organic growth of the team and, as part of our process, we’re constantly hiring new players who can help us go the extra mile and build amazing stuff!

We have openings for a lot of positions at the moment (btw, why don’t you join us? https://www.imaginecurve.com/hiring) but, among them, the hardest to fill so far has proven to be the mythological figure of the “Site Reliability Engineer”, also known in the wild as the “DevOps Engineer” or “Cloud Engineer”.

So…What are they?

They are a peculiar breed of software developers: usually highly motivated individuals who are not scared by the complexity of code or by the configuration nuances of an operating system. They are, instead, attracted by the blurring line between development and operations, and have strong fundamentals in both worlds. Ideally, they should be equally comfortable debugging the interactions of a Docker container and writing a piece of code that automates a manual task.

Has this not always been the case? What’s so special about this?

This role can exist only in a world that was revolutionized, a few years ago, by cloud computing. A world where new technologies are launched daily and legacy is almost non-existent (and regulations and compliance are still not well defined). Cloud computing enables a company, or even an individual, to quickly rent a shared pool of virtual computing resources and scale them on demand depending on the workload. It means that a company starting today does not need to buy any expensive physical infrastructure upfront; it can instead “rent” resources in a public data center, for a fraction of the price, and scale them elastically according to its needs. This enables business models that were unthinkable only five years ago, and every startup on the planet is currently trying to use this opportunity. But deploying on the cloud and scaling virtual resources is a complicated problem to master; it requires a person with a unique blend of skills.

So…DevOps? SRE? Cloud Engineer? Greengrocers? 🙂

In theory, DevOps is a larger movement that encompasses both a culture and a role: it strengthens communication between the development and operations teams and tries to automate the delivery process so that it is as fast as possible. Site Reliability Engineering is instead a specialization of DevOps that was defined in a famous book written by Google [2] and can be summarized as “what happens when you ask a software engineer to design an operations function.” It focuses on designing and coding production systems that respect their SLAs, while sharing the ideas and techniques of the DevOps movement. Truth be told, in a startup such as Curve, the difference between these roles is so blurry as to be almost non-existent; nonetheless, we believe it is important to start defining the right culture and practices from the beginning. SRE felt like the obvious choice in an industry where the reliability of the product is of core importance and there are heavy compliance regulations.

Nice, but what exactly is the SRE team doing at Curve and why is it actually SRE?

At Curve, we are a very small team, but we are involved in the design and the scalability of every feature developed. We are not working hidden in the background: we are doing distributed systems engineering every day. We are doing operations, making sure our containers run on updated machines and operating systems. We are doing development, writing functions that work with our firewalls, or that provide insights into and monitor the usage of the card and the reliability of the system and, if needed, notify our super valuable Business Operations team. We are doing Site Reliability Engineering by looking at and defining the SLOs of our systems with the developers who code them, and we’re managing them together. But we’re also cautious about security and compliance, ensuring that all the compliance and regulatory requirements are indeed taken into account. We are alive, growing and kicking!
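
To make the idea of an SLO a little more concrete, here is a minimal Python sketch of checking an availability objective over a window of requests. It is only an illustration: the 99.9% target, the request counts and the function names are all made up for the example and do not describe Curve's actual systems or tooling.

```python
def availability(successful: int, total: int) -> float:
    """Return the observed fraction of successful requests in a window."""
    if total < 0 or not 0 <= successful <= total:
        raise ValueError("expected 0 <= successful <= total")
    if total == 0:
        # No traffic in the window: treat the objective as trivially satisfied.
        return 1.0
    return successful / total


def slo_met(successful: int, total: int, target: float) -> bool:
    """Return True when the observed availability reaches the target."""
    return availability(successful, total) >= target


if __name__ == "__main__":
    TARGET = 0.999  # hypothetical objective, not a real Curve SLO

    # Hypothetical request counts for two time windows.
    windows = [
        ("quiet week", 999_500, 1_000_000),
        ("incident week", 998_000, 1_000_000),
    ]
    for label, ok, total in windows:
        status = "met" if slo_met(ok, total, TARGET) else "missed"
        print(f"{label}: availability={availability(ok, total):.4f}, SLO {status}")
```

The point of agreeing on a number like this with the developers is that "is the system reliable enough?" becomes a question both sides can answer from the same data.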

 

Ok, great, so why is it hard to find someone?

This is, in fact, a surprisingly complicated question but, in my view, it happens for many reasons: some of them are organic to software engineering, others are due to the market and to education.

  1. Blurring the lines – Historically, the worlds of development and operations have always been separated “by a fence”: people with different skillsets and mindsets were managing different portions of the same product. They have always been focused on conflicting goals: creating features in the shortest possible time vs. keeping the system as stable as possible. A change in this way of working is far from easy to understand, even for experienced individuals.
  2. Breadth over depth – In a world where “breadth over depth” is key, it becomes very hard to find professionals who have enough depth, rather than being simply unfocused and jumping from one thing to the next.
  3. Mindset – A large number of people who are very experienced in operations and are now adapting to cloud environments adopt these new technologies “as tools” without realizing the implications that they bring. They are changing their skillset instead of changing their mindset: for example, learning how to use Terraform without understanding the potential of Infrastructure as Code, and how it can help developers create their features faster instead of only being used to manage the machines. They still think that their role is to “operationalize the product” instead of being involved in designing it.
  4. Taking the plunge – Experienced “IT pros”, traditionally used to “isolated” systems management, are professionally scared of learning how to code and how to work with developers, Agile and the rest of the company. (If I had a penny for all the people who said: “I do operations and use Python, but I’m not a developer”...)
  5. Universities – The shift towards cloud computing has been massive but has occurred over a short number of years, and universities are struggling to prepare experts in the field. There are only a handful of universities in the UK offering a dedicated cloud computing module (among them, City is doing a good job at it [4]). As a result, most experts in the field are self-taught. Junior DevOps engineers are having a hard time deciding which path to follow to become a recognized expert. There are only a handful of certifications, coming from different vendors, and none of them is actually trying to teach or verify anything cross-platform.
  6. Money – Money – Money – A shortage of professionals in this field has increased the competition, and thus the expense necessary for a company to acquire skilled workers, in a market where startups are no match for the bigger players.

So, what are you looking for, in the end?

The SRE team is already growing, but we’re always looking for someone who is ready to analyze, plan and maintain production systems as they scale in capacity and complexity. Someone who will refuse to do routine administration BUT will engineer an automated solution! Someone who will help the developers in defining the scalability requirements for a feature. Are you interested? Does it ring a bell? Come and join us: https://www.imaginecurve.com/hiring/sre

REFERENCES:

  1. https://www.usenix.org/publications/login/june15/hiring-site-reliability-engineers
  2. https://landing.google.com/sre/interview/ben-treynor.html
  3. https://www.docker.com/what-docker
  4. http://www.city.ac.uk/courses/postgraduate/software-engineering
  5. https://www.terraform.io/
  6. https://itrevolution.com/book/the-phoenix-project/