Keeping It Together: The SRE Behind the Scenes

[00:00:02] Speaker A: Hey, it's Agla from Precision Sourcing, and welcome to Her Heads in the Cloud, the podcast where we dive into everything happening in the world of DevOps, cloud and site reliability engineering. With over six years in the industry, I've built a successful DevOps practice from scratch, helping companies connect with the top-tier talent they need to build diverse, high-performing teams. Each episode I'll be bringing you insights from some of the brightest minds in tech: engineers, leaders and industry experts. Whether you're here to learn practical tips or hear inspiring stories, you're in the right place. Let's get into it.
[00:00:40] Speaker B: Welcome to another episode of Her Heads in the Cloud. I'm pretty excited about this one, as we've got Iman joining us from Atlassian. Today we'll be covering quite a few topics around the site reliability space, but before diving in, I'll let you introduce yourself, Iman.
[00:01:01] Speaker C: Hello everyone. First of all, thanks so much, Agla, for having me. I'm super excited to be here today. I work at Atlassian as a senior engineering manager, and I lead a couple of teams. One team looks after the database powering one of our flagship products, Jira, and the other looks at how we can uplift reliability across Atlassian. It's quite a small team, but impactful.
[00:01:33] Speaker B: Awesome, thank you. Iman, I know you've touched on it a little, but could you walk us through, on a slightly deeper level, your journey into the site reliability space, and what drew you to that space initially?
[00:01:48] Speaker C: Yeah, definitely.
I would say I officially became a site reliability engineer around late 2017 to 2018. When I was a postgraduate student I studied distributed software engineering, and I remember, probably a few months before I was ready to submit my thesis, my PhD supervisor said, "Actually, in order to make your thesis relevant, we need to talk about cloud computing."
[00:02:22] Speaker B: Right.
[00:02:23] Speaker C: And that was it. There was one page where I talked about cloud computing, and from then on it was just one opportunity after another leading me down this path. I also worked as a backend engineer, as well as a data engineer at some point.
[00:02:41] Speaker B: Oh wow.
[00:02:42] Speaker C: But that was quite a short stint; I should probably not even mention it. I think the main thing is when I joined Culture. I'm not sure if you know it; Culture is one of the Australian unicorns, quite a successful startup. I joined there as an infrastructure engineer, but at some point there was a change of direction. The director of engineering at the time, who was leading our space, said, "Actually, we want to move into this space. We are not doing just infrastructure engineering; it's beyond that. We are engineers, operations is our domain, and actually we are site reliability engineers." That's how I finally, officially became an SRE.
[00:03:28] Speaker B: Yes, a beautiful transition. And having covered quite a lot throughout your career, even a touch of data, some backend development, infrastructure and now site reliability, what would you say has been your favorite path?
[00:03:46] Speaker C: I try not to be biased, but I still love site reliability engineering. That's my pick. It's because I love solving bigger problems, in the sense that here is a system.
It's quite big. It is structured in this way. How can we provide a good experience for our customers? But at the same time we need to be mindful of the cost to the business and what that looks like, because there is always a balance, right?
[00:04:21] Speaker B: Yeah.
[00:04:22] Speaker C: If you say, "I'm going to make sure that my service is uninterrupted all the time," at what cost, though? Those are the kinds of questions you need to answer, and I love those conversations. That's why I love it. And in SRE, one of the routine tasks we do is responding to incidents. Off the back of an incident, once you mitigate it, there is the relief: I have done something. I love that part as well. So I would still choose SRE, definitely.
[00:04:56] Speaker B: Yeah, no, absolutely. And how would you define the role of a site reliability engineer? And how have you seen that definition evolve over time as well?
[00:05:09] Speaker C: When I look at SRE, an SRE is like any other software engineer. The main difference is your domain. When you are a software engineer working on, for example, the Jira product, you are looking at how we make the board appear or how issues are created. For us, the domain is the operation itself. When our customers try to access their data from the database, can they access it? Are they able to access it all the time? Do we know if they are struggling to access our service? All those things. Previously there was a lot of manual work. You set up your infrastructure in such a way that there was no software.
There was limited software engineering practice applied. But essentially it's all about software engineering. It's all about managing your system through code, so that we know what's happening in our system. How do we get alerted? How do you set up your alerts? It's all about writing the code, making sure it's deployed safely and has arrived in production, and that we are aware of that. So that's the main thing for me when I look at SRE: you are like any other engineer, except your domain is operations.
[00:06:40] Speaker B: Yeah, no, absolutely. Coming from that software background definitely allows you to thrive a little bit more.
[00:06:48] Speaker C: Exactly. And it's actually quite common in SRE to see two types of engineers. Some come from a software engineering background, where they have done backend development, or it could be frontend as well. They would say, "Look, I'm quite interested. I want to see how deployment works. I want to understand what our customers' experience is," and they are drawn into that operational element. On the flip side, you have systems engineers whose bread and butter is how the system is set up, and that draws them into the domain. That means they will upskill themselves in software engineering skills, so both groups meet in the middle to make a site reliability engineering practice.
[00:07:41] Speaker B: And I'm sure you've encountered quite a few of these during your time, but how do you handle high-stakes incidents or outages, and what have those moments taught you about leadership under pressure?
[00:07:56] Speaker C: Yes, I have had a few incidents where I personally caused outages, and others where I participated in resolving an existing outage that had some other cause. Now, there will be pressure, right?
Especially when there is an incident in your environment, and especially if you are the responding engineer, there is always going to be an element of pressure. But for me personally, what happened has happened. What's the next step we need to take to minimize the impact? That is one of the moments where I would never panic. When I say I would never panic, I'm only panicking inside, right? But I try to be calm and say, "Okay, here is the situation. Do we have all the people, the tools and the systems that allow us to overcome this?" That's all I think about. How do I make this situation a business-hours problem? It's actively mitigating, restoring service by any means necessary. That's always the default thinking. And even when I am helping with resolving incidents, I'm trying to make sure the responding engineers are not worried about what the next thing is. Just focus on one step at a time, get the service restored, and we'll talk about what's next afterwards. That's how I see it. However, incidents happen, and they're expected to happen at some point. What I also think about is the case where you feel quite a lot of pressure outside incidents. You are working on a big project where things might not go as planned, where things might fall apart. It is really important to understand the situation and be able to guide your team along the path, to say where we are heading, and, when you are under a lot of pressure, even to question: are we doing the right thing now? Should we even do this? It's about understanding.
I probably use these words a lot, but it's looking at what the root cause is. Why are we feeling this pressure? What is missing? What can we do to make things normal again? Is it because we have a lot of work, or because there is a deadline that's unrealistic? Is there a way we can get out of this? That's always in my head, and I keep asking questions. I ask the team: I'm sure there is a solution, there's something we can do about that. What is the thing we are doing? Most of the time people are quite creative. I've found them saying, "Okay, this is actually the key element." And when someone says we want this done by a certain date, you ask why. Why is this date important? Can we do it? And if we need to do it, what are the elements that come to mind? That's what I apply in general for any kind of leadership, whether it's incident response, getting the service back up and running for our customers, or running big projects where sometimes there is a deadline that seems quite unmovable.
[00:11:40] Speaker B: Yeah, for sure. And what advice would you give to a business to implement a no-blame culture? Because if an incident happens, and I'm sure it happens to everyone at some point in their career, it can be easy, maybe for more junior teammates, to be upset at someone because that incident has happened. How do you implement a culture where they're not to blame, where it's a mistake and we need to move on from it?
[00:12:17] Speaker C: In my opinion, unless someone did something malicious to cause an incident, it will never be someone's fault. For me, it is a failing in the process that led to it.
So say this big incident happened, right? We mitigated it. The first thing is, please do not blame people while you are mitigating an incident. It's not helping anyone, and it's not helping the situation.
[00:12:47] Speaker B: Let's fix that.
[00:12:48] Speaker C: But after you've mitigated that incident, the next thing is a post-incident review. What happened? What was the impact, and what were the different processes involved? One way you can identify the gap is to ask the five whys. Why was there an outage? Because one of our databases fell over. Why did it fall over? CPU was at 100%. Why was it at 100%? There was a specific request coming from a specific agent, and it overloaded our system. Okay, why was this agent able to make all these requests? Do we have rate limiting in place? That kind of thing gives you the path to understanding the failing or the gap in your process, the gap in your system. It's not a gap in people. Maybe, when you make changes to production, you don't have a thorough review process. So, as you mentioned, say a junior engineer made a change, it made it to production, and it was buggy. I would ask: do we have continuous integration? Do we have a pull request review process? Do we have automated testing in place? If we do have them, what is the gap? Maybe we need to add tests so that we can catch this class of bugs. It is the process. As long as you take the human element out and focus on the gap you have in your system, that's the best way forward.
Once you understand and close the gap, next time you will not have similar incidents with the same root cause. So for me that is the key element: focusing on the gap that you have in your processes or in your system.
[00:14:50] Speaker B: That's a fantastic way of looking at it and analyzing it, and doing the work behind the scenes, I suppose, to eliminate the problems. So thank you for sharing, Iman. And that brings us to our next question: what does a successful career path look like in site reliability engineering?
[00:15:15] Speaker C: That depends on what you want to achieve, because there are different paths you can take. What success looks like is personal. If you say, "I want to be a principal site reliability engineer," or whatever the highest role is at that company for an individual contributor, you can do that. You can start as a grad or intern, then mid-level, then senior, whatever the role progression is. But on the technical side, what is really important is that you are learning. You understand the different types of systems and how they interact. What does it mean if you add a specific component to your system? You should always be thinking about not only providing continuous service for your customers, but also: if you get extra demand, are you able to scale? How quickly are you able to scale? If an incident happens, do you need human intervention all the time?
Are you able to think it through to avoid those manual touches as much as possible? To be able to say, "When I make a decision, I not only look at reliability and scalability, I'm also going to bring the cost element into it." Because one thing you could say is, "My system is up all the time because I'm running 15 copies of my system."
[00:17:02] Speaker B: Right?
[00:17:03] Speaker C: Not the best approach, is it? Let's say you are on the cloud, using Amazon, for example. If you have 15 read replicas responding, is that proportional to the requests you are getting? Your system is probably up almost 100% of the time, but do you really need all of those? Or is it okay to have just two replicas, which decreases the cost dramatically, but then you might have some outages, maybe one minute a month, or 10 seconds every day? Is that acceptable? These are the different things you need to be able to look at. The more successful you are, the more you're able to see that it's not a single dimension; you have to look at multiple dimensions. The other thing to look at is the human element, which I mentioned. How do you remove the human touch? How do you stop the 1am calls? If you get them, how do you reduce them? You can't stop them entirely, but how do you reduce them? You need to start thinking about those things to be successful. This is not only about the individual contributor. It is also quite important for someone who decides, "I actually want to be on a management track." It's important to understand those elements, and to bring in the processes and systems that work well for the people on your team as well as for the business.
[00:18:28] Speaker B: And I'm curious to hear as well, Iman: who do you think site reliability is for, and do you think it's for every company?
[00:18:38] Speaker C: I would say a qualified yes. The reason I say yes is that there are different degrees to which you can apply site reliability engineering. You might have a startup that's trying to find product fit. For them, more than anything else, cost is very important. You can't just throw resources at it. At the same time, they want to provide continuous service for their customers. They also need to be okay with the fact that, as long as the system is secure, they don't have to worry if the system falls over from time to time when there is a surge in traffic. They might be okay with that. You should be able to make an informed decision as you go along. Or you might have a big company, and for a big company you might think cost is not relevant. It's actually even more relevant, because the bigger you are, the bigger the cost associated with your infrastructure. The amount of infrastructure you've put in to make your system reliable will be much, much higher. So it will still be relevant. But you need to see what context you're playing in and what your customers expect. I think that's one of the key elements in site reliability. If you are a heart patient and you rely on a pacemaker, you want that to be 100% reliable, right? But if you are looking at systems where, let's say, you are doing a Google search, right?
If you are a search site, you should be there almost 100% of the time. But if I'm not able to search on Google, it's not the end of the world. That's okay. I'll go back, and hopefully in the next minute it'll work, and that's fine. So that's why you need to look at the context you are in. But the reason site reliability engineering is important at any level is the question: am I providing a service that satisfies my customers' expectations? What will be different is that when you are a new startup trying to find product fit, maybe customers are a little more forgiving. They know this is a new thing. Your system might be out once every couple of weeks, or once a day, and they understand that. But when your context is different, you just need to make sure that, irrespective of the stage you are in, you are satisfying your customers' expectations. And if you do that, you are already doing site reliability engineering.
[00:21:23] Speaker B: Yeah, 100%. And finally, Iman, how do you see the future of site reliability engineering evolving, especially with the huge buzzword AI, and platform engineering in the mix as well?
[00:21:41] Speaker C: I think that is an interesting one, because AI is everywhere and it's here to stay, which is great. Generally, when I look at AIs, at a minimum they are really good at taking away things that are repetitive, things that are done all the time. One example I can think of in SRE is incident response. That's the thing every SRE will do. I have actually seen some prototypes of this. As soon as an alert comes in that will page an incident responder, they look at it.
They will also look at the historical data and what the alert is, and then they will say, "Okay, based on what I know about the alert, here is the mitigation, if you execute these one, two, three steps." In SRE that could mean scaling up your infrastructure, or rebooting your instance, whatever it is. They will say, "I have seen all the runbooks you have used in the past. I have seen this pattern of incident. Here is how you mitigate it." What I found interesting is that in one trial I saw, the AI, let's call it the bot, took under three minutes to figure out the mitigation strategy for that incident. Compare that with, let's say, a seasoned SRE. They get the call, they wake up, they open the laptop, they look at the log files and the different data we have, and then they work out, "Okay, now I understand what the issue is. Here is the runbook to fix it." On average, an SRE will take 32 minutes. We're talking about three minutes versus 32 minutes, which is quite a huge gap. Now, obviously we need to train this bot more so that the accuracy is higher. At the moment, when the bot says this is the thing you need to be doing, you need to make sure that it is actually consistent and that you agree with it. But I think it will cut incident investigation and response time significantly. What's even more exciting, and what I would love to see, is that at some point the bot not only identifies the mitigation but says, "You know what? This is the mitigation. I'm confident, and I'm going to run that myself. You don't have to call anyone else." It's done. That's the next step. And I personally believe we will get there very soon. I don't think it's far out.
The first step is making sure that the accuracy of our AI bots is high enough to be acceptable, so that we can give them permission to execute those mitigation actions. And I'm sure a lot of SREs will be excited about this. No one would be called at 1am. I have been called at 1am multiple times, and I'm sure they would much rather go to bed at that time than respond to an incident. That's what I see. This is where I would say AI will replace those repetitive tasks for us, so that we can focus on project work, which at least for now cannot be done by AIs.
[00:25:21] Speaker B: Yeah. Oh, it would be such a game changer if it could do the on-call and the after-hours work, because it's not super fun for anyone, to be honest.
[00:25:31] Speaker C: Yeah. And the other thing I have seen, and this is a prototype, it's not there yet, is that once the incident is closed, the AI will generate a post-incident review. It will say, here is the narrative, here are the timelines. That is something SREs have to do themselves today, but the AI will generate it. Maybe the AI is generating a bit too much text. I wouldn't say it's great, and there is definitely room for improvement in that aspect. So the AI would be doing not only the incident analysis, identifying the mitigation and closing the incident, but also writing the review. That would be a dream come true, wouldn't it?
[00:26:15] Speaker B: Oh, absolutely. We'll have to see what happens over time, but I guess that brings us to a halt there. It's been absolutely lovely having you on the show, Iman. So thank you so much for sharing all of your insights, and thank you everyone else for listening.
[00:26:34] Speaker C: Thank you. Thanks for having me, Agla.
