You Can't Automate Expectations

About the episode

Establishing consistent automation habits helps keep those skills sharp and gets the systems set up promptly. But getting to that point takes time. And even when automating processes becomes second nature, you can still overlook potential pitfalls.

Joshua Bradley of Cox Edge describes what it’s like managing the expectations teams and stakeholders may have about automating infrastructure. The systems may be more complex. Timelines may be longer. And even when you leave detailed instructions, users may still make mistakes. It just means you need to keep adjusting until you get it right.

About the guests

Joshua Bradley

Director Of Technology And Security Operations

Transcript

00:01 — Jamie Parker
Automation can make things as easy as clicking a button, but behind that simplicity, there's usually a mountain of work, and if you don't get it right, that mountain could be a volcano waiting to erupt. You want to get it right; you're not looking to make life harder down the line. So you want to be picky about the tooling and take the time to learn to do things the right way, but then you actually have to do the thing on a regular basis. This episode, we hear the story of an engineer at Cox Edge and the work he and his team have done to make automation almost second nature. It requires taking a hard look at what their actual processes are, blocking out the steps, scripting them, accounting for unusual cases, and most of all, accounting for human error because nothing is ever truly completely automatic.

01:01 — Josh Bradley
Automation is wonderful because it will update thousands of things at a time. Automation is also very scary because it will update thousands of things at a time.

01:10 — Jamie Parker
Josh Bradley is the Director of Technology and Security Operations for Cox Edge. He's been at Cox for 25 years, and in that time has helped get their automation program up and running. People know what's possible with automation, and so when it's absent from processes, the ensuing delays can seem interminable.

01:29 — Josh Bradley
The developers were always thinking we were taking a long time because it would take 2 or 3 weeks before we could even start our work. So what we were really trying to do is, since we didn't have control over that realm, is to take control over our realm and really automate and streamline anything that we could for our side.

01:43 — Jamie Parker
Josh was setting up the infrastructure for developers to be able to do their own work. It was complicated and depended on the cooperation of other teams, which slowed things down. Over time, automation helped with that, but it takes a lot of effort to get to that consistency. To meet compliance requirements and to improve security, it requires being thorough, which takes time. Unfortunately, not everyone knows how complicated setting up an automation framework can be.

02:12 — Josh Bradley
Generally, someone's introduction to automation is, "Okay. You're going to automate this thing? Well, why has it taken several weeks longer than when you didn't automate it? I thought automation was faster." So it's really just making sure that your customer understands that end goal of yes, this will take longer upfront, but I promise you down the line you'll save hours and days on things. Yes, the first time is going to be a little longer and painful, but as you get that automation practice down and you get the flow of knowing what needs to be automated and in what order, then all that stuff gets streamlined.

02:42 — Jamie Parker
As with many skills, there's a learning curve to adopting automation. It can take a while to get the hang of it. Hopefully you spend less and less time learning the tool on each subsequent project, but at first you won't save any time when automating processes. That time saved comes later, and in the meantime, there's a heap of preparation to do.

03:02 — Josh Bradley
Yeah. Automation takes longer the first time because you're not only doing the manual config, you're then having to turn around and automate that. So there was some initial hesitancy around, "Wait a second, you said you were going to save us time, but now you're adding time to be able to do that." But what really came into play is when we were able to do the additional environments where they said, "Okay. Now I need a production environment." Well, 2 hours later, they had that environment.

03:25 — Jamie Parker
Setting up an environment goes from taking weeks to taking a couple of hours. It's difficult to argue against that kind of time-saving work. It's just that the road to get there may have been a little longer and a little bumpier than expected.

03:40 — Josh Bradley
That's really when the value came into play and really where it became a requirement because people no longer wanted to wait for those additional environments. They wanted that push-button experience. So we got ourselves into that situation by providing that service that people wanted to use. But again, automation is one of those things that most everybody gets value out of. So after you get over that initial standup hurdle, it's pretty easy to sell.

04:05 — Jamie Parker
Getting over that initial standup hurdle is the trick, as is getting everyone to understand that the standup is complex, unforgiving, and potentially treacherous. Okay—yes that was a little dramatic, but a little caution will help in the long run when automating updates for thousands of components at once. There's a lot to account for, even for temporary requests—especially for temporary requests.

04:33 — Josh Bradley
What's really valuable is we can turn those into additional environments. So for example, if a QA [quality assurance] team comes and says, "Hey, I need a low-test environment. I only need it for two weeks." Two hours later, we've got them an environment stood up. They can use it for as long as they need it, and then when they're done with it, we just reclaim the resources and add them back to the pool.
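
The stand-up-then-reclaim cycle Josh describes can be pictured as a simple lease on a shared pool of machines. The Python below is only a sketch of that idea: the `NodePool` class, its method names, and the VM names are hypothetical and are not taken from Cox Edge's tooling.

```python
from typing import Dict, List


class NodePool:
    """Tracks free nodes and which environment has borrowed which nodes."""

    def __init__(self, nodes: List[str]) -> None:
        self.free: List[str] = list(nodes)
        self.leased: Dict[str, List[str]] = {}

    def stand_up(self, env: str, count: int) -> List[str]:
        """Lease `count` free nodes to a new environment."""
        if env in self.leased:
            raise ValueError(f"environment {env!r} already exists")
        if count < 0 or count > len(self.free):
            raise ValueError("invalid node count for the current pool")
        self.leased[env] = [self.free.pop() for _ in range(count)]
        return self.leased[env]

    def tear_down(self, env: str) -> None:
        """Reclaim an environment's nodes and return them to the pool."""
        self.free.extend(self.leased.pop(env))


if __name__ == "__main__":
    pool = NodePool(["vm-1", "vm-2", "vm-3", "vm-4"])  # hypothetical VMs
    pool.stand_up("qa-temp", 2)
    print("free while QA runs:", len(pool.free))
    pool.tear_down("qa-temp")
    print("free after reclaim:", len(pool.free))
```

Keeping the lease record in one place is what makes the teardown cheap: nobody has to investigate which machines belong to which short-lived environment.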

04:51 — Jamie Parker
Standing down environments and reclaiming those resources can be easily overlooked. How many times has the operations team had to investigate which of the cost-draining environments are still in use and which are sitting idle and running up unnecessary bills? Making it easy to turn those off is an easy win, and it's one of many. After years of experience, Josh's team has a simple rule to rack up those easy wins.

05:15 — Josh Bradley
Our rule is, if we have to do it more than 3 times, it gets automated. Once or twice—not worth the lift really, but 3 times, that's when we get the value out of it. We do the manual once approach just to help with complexity. Automation adds time and complexity on things. So we do the manual approach first. Get it all working, prove it out. We don't really want to waste time automating something that won't be part of the final product.

05:39 — Jamie Parker
Talk about having automation become part of your work culture. They not only identify when a process is going to be repeated, but also quantify whether that repeated process is worth the effort of automation. It takes experience to make that determination. Josh shared what a typical process ripe for automation looks like.

05:57 — Josh Bradley
Days to a week, just every little component, and you get that and make sure that gets working, and then you move on to the next component and make sure that gets working and then move on.

06:05 — Jamie Parker
And the difficulty of the task depends on the makeup of the system.

06:09 — Josh Bradley
The more components and the more wiring of those components, the more complex it is just because, again, you have to get all those components working first before you can start the automation.

06:19 — Jamie Parker
That may seem like a straightforward idea, but the relationship between complexity and components is not necessarily linear, and getting all of those components working with each other can be a challenge on its own. That's especially true of distributed systems. These days cloud-native environments often run on Kubernetes. Josh and his team set up an automation process they could reuse for a multitude of cloud-native projects. With the tech world moving so much to the cloud, that reusability is a big time saver. And if you're not familiar with containerization, container orchestration, and microservices in general, here's a quick peek at the components involved.

06:56 — Josh Bradley
All the complexity of standing up in Kubernetes, we didn't really have to tackle that as part of that automation. We did do that as part of a previous automation. So we worked really hard. As you know, Kubernetes is a very complex environment and really when people say Kubernetes and there's like 12 different technologies underneath that to make up the Dockers and Grafana and all those bits that are needed to power a Kubernetes environment.

07:18 — Jamie Parker
Knowing all the different components needed is no small feat, let alone which options to choose, how they interact with each other, what configurations are needed for them to play nice, making sure the security policies are set up right, and on and on. You can see how automation would come in handy, but also what could go wrong.

07:36 — Josh Bradley
So those were taking us days and days and days to do. So knowing that, hey, we want a repeatable process, we want our dev environment to look just like our QA environment, to look just like our production environment. We automated Kubernetes to the extreme, and we really had it down to 2 hours, and that was literally from the time that we would get the VM to production level monitoring. Like integrated with the NOC, full monitoring, we could see exactly what that cluster was doing, but we had that down to 2 hours, and most of that really was the patching of the operating system.

08:05 — Jamie Parker
Cloud-native environments are particularly tricky to set up, especially when specialized hardware is needed for its particular set of skills.

08:13 — Josh Bradley
A good example is our Cox Media arm had to do some video transcoding and they had a very specialized processor to meet these exact streaming demands. So those boxes were like $30,000 a piece, so pretty expensive to have a large SDLC environment. So what we did is we worked with them and my team set up some automation, and again, it was Kubernetes focused. So we turned all the expensive nodes into worker nodes, and then when we needed to do some development testing, we actually had automation that did a role swap.

08:42 — Jamie Parker
Changing the role of that hardware was costly and time consuming, which meant development for that part of the business was limited too. With Kubernetes and an automation framework, Josh and his team were able to make those transitions faster and more reliable.

08:56 — Josh Bradley
So literally the customer could go in and click a button and put in a number. They had like 30-something boxes so they could say, "I want 6 of these to be in my QA environment and these 6 to be in my development environment, the rest of them to be in prod." And they would enter those numbers and click "go," and literally the pipeline would go execute. It would take the worker nodes, put them in the right environment, make sure they were in the right network, had the right user access, and then really complete that deployment, integrated with monitoring and everything.

09:22 — Jamie Parker
With a couple of clicks, that web of complex Kubernetes components is rearranged to meet the demands of the development team without interrupting uptime. And those threats to that uptime could be efficiently dealt with too.

09:36 — Josh Bradley
And then when the user needed to—all of a sudden they had a big production demand where they needed to get all those workloads back. They didn't need to talk to us at all. They literally just went into the pipeline, put in the number, hit "go," and then all of a sudden they had the production environment size to what they needed.
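
The role swap described above boils down to an allocation step: the user enters a node count for each non-production environment, and whatever is left stays in production. The following Python is a minimal sketch under that assumption. The function name `plan_role_swap`, the node names, and the environment labels are hypothetical, and the real pipeline also handled networking, user access, and monitoring, which this sketch leaves out.

```python
from typing import Dict, List


def plan_role_swap(nodes: List[str], requested: Dict[str, int]) -> Dict[str, List[str]]:
    """Assign worker nodes to environments; leftover nodes stay in prod.

    `requested` maps a non-production environment name to a node count.
    All names here are illustrative assumptions.
    """
    if "prod" in requested:
        raise ValueError("prod receives the remainder and cannot be requested")
    if any(count < 0 for count in requested.values()):
        raise ValueError("node counts must be non-negative")
    if sum(requested.values()) > len(nodes):
        raise ValueError("requested more nodes than are available")

    plan: Dict[str, List[str]] = {}
    cursor = 0
    for env, count in requested.items():
        plan[env] = nodes[cursor:cursor + count]
        cursor += count
    # Everything not explicitly requested remains in production.
    plan["prod"] = nodes[cursor:]
    return plan


if __name__ == "__main__":
    workers = [f"node-{i:02d}" for i in range(1, 31)]  # 30 hypothetical boxes
    plan = plan_role_swap(workers, {"qa": 6, "dev": 6})
    for env, assigned in plan.items():
        print(env, len(assigned))
```

In this sketch, handing every box back to production is the same call with an empty request, `plan_role_swap(workers, {})`, which mirrors the "put in the number, hit go" experience from the user's side.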

09:52 — Jamie Parker
From production to development and back again, a successful end result to automation, which took time and deliberation to get right. But no automation system is completely infallible. Even when set up correctly, there's always the chance for user error. When we come back—examples of when things can go wrong.

10:21 — Jamie Parker
You can spend hours writing the docs, you can spend more setting up the scripts, but you can't always predict how people will use them, if at all. Josh learned that lesson the hard way.

10:35 — Josh Bradley
So we had one service and it was multiple different technologies. It was very complex. So the doc was admittedly very long just due to the complexity of the service. So I put it together, very detailed, and I handed it over to the operations team.

10:49 — Jamie Parker
People complain about the lack of documentation or about its quality or lack of detail. Josh put in the work to give the people what they want and need.

10:59 — Josh Bradley
Well, a week or so goes by, and so one of the Ops teams stops by my desk and says, "Hey, listen, can you spare a day or so to come help us do this install?" So I was like, "Well, I'm happy to help and I can share the install doc that should walk you through everything. If there's any issues, let me know." And they said, "No, no, no. I've got the install doc, but that thing's way too long. I'm not going to sit down there and read that doc and follow that." So it just put me in a spot of, we do all this work to get this in a spot to where it's great for them, but then all of a sudden we don't want to turn it into a lot more work for them to do.

11:29 — Jamie Parker
Apparently it was too much to do. They didn't have the time to work through the doc and set the system up themselves. So instead of saving time, Josh ended up having to help them through it anyway, and that led to a pivotal decision.

11:45 — Josh Bradley
That led us into the first automation. So we had to run with that service as it is, but learned a lesson there. So it moved into the next service deployment. So at that point, I knew I was going to automate that thing to make it easier for me and the operations team in the long run. So I wrote everything in VI at the time, we didn't have the nice Ansible tools like we have now. So I wrote it all, had scripts out there, and I had an initial configuration doc that they needed to follow to set up the initial environments. Put in the config files. Production has different IP addresses than non-prod, so there's some things that resource would have to update.

12:19 — Jamie Parker
That was the beginning of his automation journey, but he wasn't done learning about the pitfalls of user error. Some time later, the operations team was trying to run through the process Josh had set up for them. He thought he had left them everything they needed to succeed.

12:35 — Josh Bradley
Of course, I was out on PTO [paid time off] and the production team got the VMs to their install, and I got a frantic call, "Help, help. We got a deployment, your automation's not working. It's failing all around." So I always bring my work computer with me of course.

12:48 — Jamie Parker
Of course.

12:49 — Josh Bradley
I jumped online to help them. So we went through and I opened up the config files and they were all blank. So immediately I was like, "Well, I think I figured out what's going on." I was like, "Well, you missed doing the config files." And they were adamant, "No, we followed that doc exactly. We updated every config file. I put it in there. There's something wrong with the automation."

13:07 — Jamie Parker
Hear what he said? They followed the doc exactly. Let's find out where the stumbling block was.

13:14 — Josh Bradley
So I was like, "Well, let's go through the doc and fill out one of these config files." So the resource went through, they added the IPs, added everything exactly as was supposed to. Said, "All right. Looks good, let's move on to the next file." So they closed the first file and the window pops up and it says, "Would you like to save what you're doing?" They clicked no and shut the file down. I was like, "Well, wait a second. Hold on. You didn't save that file?" And they go, "Well, your doc didn't say to save the file."

13:41 — Jamie Parker
I couldn't believe it when I heard it and I'm sure Josh couldn't either. Maybe you can if you've been in this industry for long enough. Josh realized he had a little more work to do because you cannot account for everything.

13:55 — Josh Bradley
Automation is great, but you've really got to be careful with the instructions that you provide to both the executor and the automation itself, and to be very particular.

14:04 — Jamie Parker
Amen to that. There are so many things we take for granted that we overlook in the instructions. It might not be as obvious as hitting "save" on a file, but a lot of things that are obvious to you may not be for the end user of the process you put together. So when we say the process needs to be deliberate and detailed, this is a big part of that.

14:26 — Josh Bradley
It made me really take a look at the automation that we were doing, and yes, we thought this was very user-friendly, but it wasn't as user friendly as I thought it was. That's what we really took it to—it was bare minimum user input for anything, it was only the exact required bits, and then the scripts took care of everything else, including saving the files.
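
One way to picture "bare minimum user input" is a script that asks only for the environment name, fills in the values that differ between environments, and writes and saves the config file itself. This is a sketch, not the team's actual tooling: the JSON format, the `write_config` function, the setting names, and the IP addresses (taken from ranges reserved for documentation) are all assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

# Hypothetical per-environment values; a real setup would supply its own.
ENV_SETTINGS: Dict[str, Dict[str, str]] = {
    "prod": {"service_ip": "192.0.2.10"},
    "nonprod": {"service_ip": "198.51.100.10"},
}


def write_config(env: str, target_dir: Path) -> Path:
    """Build the config for `env`, save it to disk, and verify the save."""
    if env not in ENV_SETTINGS:
        raise ValueError(f"unknown environment: {env!r}")

    settings = {"environment": env, **ENV_SETTINGS[env]}
    path = target_dir / f"{env}.json"
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")

    # Read the file back so a blank or missing file fails here,
    # not in the middle of a deployment.
    saved = json.loads(path.read_text(encoding="utf-8"))
    if saved != settings:
        raise RuntimeError(f"config at {path} does not match what was written")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = write_config("prod", Path(tmp))
        print(config_path.read_text(encoding="utf-8"))
```

Because the script performs the write and then checks it, there is no "save" step left for a person to skip.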

14:46 — Jamie Parker
The more sophisticated our automation tools become and the scripts we write with them, the less we need to account for user error. But this is a reminder that there is always something that the user can miss, misunderstand, or straight up get wrong. That's when one process in particular can come in handy—and hopefully you've got that process automated as well.

15:07 — Josh Bradley
That's where we really tout rollbacks. Working on something at 3:00 AM is very different than working on something at 3:00 PM. So we really try and get the production teams, "Hey, if you do run into an issue during this deployment, roll it back and we'll all get together the next day. Get multiple people in the room, not just whoever happened to be on call and be able to troubleshoot that." Of course, if it's a production service that has to absolutely be deployed, then of course we're happy to get on and troubleshoot, but we really still need somebody that knows the automation side to be able to help troubleshoot those. But again, once you've really QA-ed your automation, it's generally some environment difference from a manual update or the code issues that sometimes happen.
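
The "roll it back and regroup the next day" policy can be sketched as a deployment wrapper that restores the previous state as soon as any step fails. Everything named here (`deploy_with_rollback`, `DeploymentFailed`, the step functions, the version strings) is hypothetical; it illustrates the pattern rather than any specific tool.

```python
from typing import Callable, List


class DeploymentFailed(Exception):
    """Raised after a failed deployment has been rolled back."""


def deploy_with_rollback(
    steps: List[Callable[[], None]],
    rollback: Callable[[], None],
) -> None:
    """Run each step in order; on any failure, roll back and re-raise."""
    for index, step in enumerate(steps, start=1):
        try:
            step()
        except Exception as error:
            # Restore the last known-good state before anyone troubleshoots.
            rollback()
            raise DeploymentFailed(f"step {index} failed: {error}") from error


if __name__ == "__main__":
    state = {"version": "1.0"}

    def upgrade() -> None:
        state["version"] = "2.0"

    def health_check() -> None:
        raise RuntimeError("service did not come up")

    def restore() -> None:
        state["version"] = "1.0"

    try:
        deploy_with_rollback([upgrade, health_check], restore)
    except DeploymentFailed as failure:
        print(failure)
    print("running version:", state["version"])
```

The wrapper still raises after rolling back, so the failure is recorded for the daytime troubleshooting session instead of being silently swallowed at 3:00 AM.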

15:50 — Jamie Parker
Automation is great. Automation is complex. And automation can absolutely go wrong even when it's gone right many times before. Some change somewhere can have an unintended effect. Have that rollback option set up to be rock solid just in case. Josh and his team are by now experts in automation and they're able to share that expertise more widely.

16:16 — Josh Bradley
We collaborate a bunch, but we're not like evangelists that go out. But my previous team, we were Kubernetes focused, and so we were very, very automation-heavy on that. So we did set a lot of the standards for some other teams, and as part of that, they were seeing some of the cool things that we were doing. So we got some adoptions through that way, but it's really just collaboration across the company that really got us to where we are.

16:39 — Jamie Parker
We just spent an episode talking about the importance of getting automation processes set up correctly to minimize the chances of human error. Josh and his team refine their automation processes, but they're one small team in a large organization. They can't take on the whole company's automation projects.

16:58 — Jamie Parker
Next time on Code Comments, we hear from a small team of internal consultants at Ulta Beauty and how they help the rest of their company help themselves. You can read more at redhat.com/codecommentspodcast or visit redhat.com to find out more about our automation solutions. Many thanks to Josh Bradley for being our guest, and thank you for joining us.

17:27 — Jamie Parker
This episode was produced by Johan Philippine, Kim Huang, Caroline Creaghead, and Brent Simoneaux. Our audio engineer is Elisabeth Hart. The audio team includes Leigh Day, Stephanie Wonderlick, Mike Esser, Nick Burns, Aaron Williamson, Karen King, Jared Oates, Rachel Ertel, Carrie da Silva, Mira Cyril, Ocean Matthews, Paige Stroud, Alex Traboulsi, and Victoria Lawton. I'm Jamie Parker, and this has been Code Comments, an original podcast from Red Hat.

